
Page 1: The Numerical Algorithms Group

Over 30 years of mathematical excellence

The Numerical Algorithms Group
Combining mathematics and technology for enhanced performance

Unlocking the Power of OpenMP
Dr Stef Salvini, NAG Ltd
[email protected]

Page 2: The Numerical Algorithms Group

Stef Salvini
Over 30 years of mathematical excellence
EWOMP 2003, 22-26 September 2003

Acknowledgements

Lawrence Mulholland
Edward Smyth
Themos Tsikas
Anne E. Trefethen
Jeremy Du Croz
Robert Tong
And too many others to mention here!

Page 3: The Numerical Algorithms Group


Contents

Opening Remarks
SMP Systems, Parallelism and OpenMP
Tuning OpenMP:
  General Points
  NAG-Specific Case: Needs, Solutions and Case Studies
OpenMP and Hybrid Parallelism
Some Other Considerations and Conclusions

Page 4: The Numerical Algorithms Group


Contents

Opening Remarks
SMP Systems, Parallelism and OpenMP
Tuning OpenMP:
  General Points
  NAG-Specific Case: Needs, Solutions and Case Studies
OpenMP in Hybrid Parallelism
Some Other Considerations and Conclusions

Page 5: The Numerical Algorithms Group


NAG and HPC

NAG Products
  NAG Parallel Library Release 3 (MPI)
  NAG SMP Library Release 2 (OpenMP)

Collaborations with external agencies
  Vendors (e.g. ACML Library for AMD Opteron)
  Research and academic institutions
  Industrial, commercial and financial concerns

Consultancy activities

Page 6: The Numerical Algorithms Group


Who Can Benefit from Parallelism?

Anybody with large computationally intensive problems

Academic institutions
Industry (aerospace, automotive, etc.)
Financial institutions (forecasts, etc.)

Increasingly, commercial systems
  Databases
  On-line transactions
  Data mining
  Web servers, etc.

They want solutions!

Page 7: The Numerical Algorithms Group


Some thoughts…

Short life cycle: hardware
Long life cycle: scientific software

Is software the real capital investment?

Page 8: The Numerical Algorithms Group


The Changing World of HPC

The "(ir)resistible" rise of PC clusters
  Price/performance (claimed or substantiated?)
  Originally built in-house, now part of the mainstream
  Increasing penetration of the server market

MPI as new legacy
  Is parallelism crystallised into MPI?
  Is anybody interested in other types of parallelism?

Hybrid systems
  The de facto standard for high-end systems
  Do we need multi-level parallelism?

Page 9: The Numerical Algorithms Group


Contents

Opening Remarks
SMP Systems, Parallelism and OpenMP
Tuning OpenMP:
  General Points
  NAG-Specific Case: Needs, Solutions and Case Studies
OpenMP in Hybrid Parallelism
Some Other Considerations and Conclusions

Page 10: The Numerical Algorithms Group


Why SMPs?

Hardware
  Re-usable technology
  Modular technology
  Increasingly partitionable hardware
  Reliable technology (minimum downtime)

Commercial applications
  Databases
  OLTP
  Web servers

Numerical and scientific applications
  Tremendous potential

Page 11: The Numerical Algorithms Group


SMP Model

[Diagram: four CPUs, each with its own cache, connected through an interconnect subsystem to memory.]

Single memory visible to all processors
Memory can be physically partitioned (NUMA systems)

Page 12: The Numerical Algorithms Group


SMP Parallelism in a Nutshell

Multi-threaded Parallelism (Parallelism-on-demand)

[Diagram: serial execution, then a parallel region (threads spawned at entry, destroyed at exit), then serial execution again.]

Multi-threading:
  Parallel execution
  Serial interfaces
  Details of parallelism are hidden outside the parallel region

Parallelism carried out in distinct parallel regions

Page 13: The Numerical Algorithms Group


The Data View

[Diagram: the memory hierarchy, from registers through primary cache, secondary cache and local memory to global memory; speed of data access falls, and size of available data space grows, moving away from the registers.]

Data must migrate through the different levels of memory in a very coordinated fashion.

Page 14: The Numerical Algorithms Group


Memory and Data Transfer

Memory structure
  Multiple levels (increasing in number)
  Caches: invaluable, essential and difficult

Some difficulties with data access
  Single-processor effects
    Cache misses and thrashing
    TLB misses
  Multi-processor effects
    False sharing
    Required synchronisations
  NUMA systems
    Data allocation and distribution
    Page misses and migration

Page 15: The Numerical Algorithms Group


SMP Parallelism: Dynamic View of Data

[Diagram: the same data flowing through computation stages 1, 2 and 3, each stage divided among processors 1-4.]

Page 16: The Numerical Algorithms Group


From SMP to OpenMP

OpenMP embodies SMP mechanisms
Concise notation
  Not all parallel structures represented
Simple
  To implement: hence wide acceptance by vendors
  To understand (?)
Compiler directives
  Compile cleanly on serial systems
However:
  "Local" references only
  Some system calls

Page 17: The Numerical Algorithms Group


Contents

Opening Remarks
SMP Systems, Parallelism and OpenMP
Tuning OpenMP:
  General Points
  NAG-Specific Case: Needs, Solutions and Case Studies
OpenMP in Hybrid Parallelism
Some Other Considerations and Conclusions

Page 18: The Numerical Algorithms Group


Ensuring Efficiency on Modern Systems

Algorithms must take into account
  Parallelism
  Data access (multi-level memory layout)

Algorithms must have
  Dynamic load balancing
  Some strategy for serial or quasi-serial bottlenecks (beating Amdahl's law?)
  Parametrisation for easy configuration and porting

Should algorithms also take into account
  Multi-level memory (e.g. NUMA, clusters, etc.)?
  Contingent (history-of-the-computation) data layout?

Page 19: The Numerical Algorithms Group


Levels of Parallelism

Coarse grain
  Application driven
  More potential for parallelism
  Closer to optimal performance
  Fewer overheads
  Design complexity
  Implementation costs
  Maintenance costs
  Removed from "serial" world
  "Non-local" data access problem
  Non-modular design
  Expandable to non-SMP systems

Fine grain
  Close to "elementary" algorithms
  Modular design
  Direct path from "serial" world
  Top-down refinement
  Higher overheads
  Less potential for parallelism
  Serial or quasi-serial bottlenecks

Page 20: The Numerical Algorithms Group


Level of Parallelism

Coarse/fine trade-off dictated by
  Nature of application
  Technological feasibility
  Availability of components
  Expertise and experience
  Time scale and resources for development
  Deadlines for results

Page 21: The Numerical Algorithms Group


The “reductionist approach”

Postulate: Parallelise the basic computational kernels

BLAS (Basic Linear Algebra Subroutines)
  Level-3 (matrix-matrix product)
  Level-2 (matrix-vector product)
  Level-1 (vector operations)
Basic FFTs

Theorem: Anything built on them will have adequate parallelism

Proof by oral tradition

Page 22: The Numerical Algorithms Group


Level-3 BLAS

N^3 operations
N^2 data references
Good data re-use (cache friendly)
Good parallelisability

Page 23: The Numerical Algorithms Group


Level-2 and Level-1 BLAS

N^2 and N operations
N^2 and N data references
Poor data re-use
Dubious parallelisability
Problems if sequences of BLAS are applied to the same data space

Page 24: The Numerical Algorithms Group


DSYMV (Symmetric Matrix-Vector Product)

Requires synchronisations, etc.
Many vendors do not parallelise it!

Page 25: The Numerical Algorithms Group


Contents

Opening Remarks
SMP Systems, Parallelism and OpenMP
Tuning OpenMP:
  General Points
  NAG-Specific Case: Needs, Solutions and Case Studies
OpenMP in Hybrid Parallelism
Some Other Considerations and Conclusions

Page 26: The Numerical Algorithms Group


NAG SMP Library

Our needs
  Fill the gap between serial and parallel codes
  Allow maximum reuse of our products (NAG Library)
  Best performance achievable (parallelism must make a difference!)

Our approach
  Build the library on our (serial) NAG Library
  Keep identical serial interfaces
  Keep identical functionality and numerics
  Hide all details of parallelism
  Use parametrisable algorithms for easy porting
  That dictates our level of granularity!

Page 27: The Numerical Algorithms Group


NAG SMP Library Release 2

All the NAG Library Mark 19
  Over 1200 user-callable components

Parallelised numerical routines in
  Dense linear algebra
  Sparse linear algebra
  FFTs
  Random-number generation
  All other routines dependent on the above

Page 28: The Numerical Algorithms Group


New at NAG SMP Library Release 3 (Soon)

Sparse technology
  Direct methods
  Preconditioners for iterative solvers
  Enhanced iterative solvers
  Band solvers
  Eigensolution of large sparse matrices

Extended linear algebra coverage
Multi-dimensional quadrature

Page 29: The Numerical Algorithms Group


LU Factorisation: A “Serial” Algorithm

Page 30: The Numerical Algorithms Group


LU Factorisation: LAPACK Style

[Diagram: the matrix partitioned into the already-factorised part (L and U factors) and the active submatrix. Each step:
  Factorise the pivot block (Level-2 BLAS)
  Permute the rows (Level-1 BLAS)
  Solve the triangular system (Level-3 BLAS)
  Update the trailing submatrix (Level-3 BLAS)]

Page 31: The Numerical Algorithms Group


About serial or quasi-serial bottlenecks

That is what we do:
  Identify serial bottlenecks
  Identify memory-access bottlenecks
  Remove the above, or ...
  "Hide" them using a "look-ahead" strategy
  "Locally asynchronous" algorithms?

Page 32: The Numerical Algorithms Group


About serial or quasi-serial bottlenecks

Live with them ...
  The reductionist approach
Further parallelisation of the bottlenecks
  Perhaps nested parallelism?
Use knowledge about the algorithms
  Predecessor/successor relationships
  Task queues
Use tools
  Instrument the code to generate a stack of tasks
  Profile and analyse previous runs
"Hide" them using a "look-ahead" strategy
  "Locally asynchronous" algorithms

Page 33: The Numerical Algorithms Group


Look-ahead (Beating Amdahl’s Law?)

[Diagram: a workload made of a serial phase between two parallelisable phases, shown three ways: serial execution; parallel execution on processors 1-3, where the serial phase leaves two processors idle; and parallel execution with look-ahead, where the serial phase is overlapped with parallelisable work.]

Page 34: The Numerical Algorithms Group


LU Factorisation: LAPACK Style

[Diagram: the matrix partitioned into the already-factorised part (L and U factors) and the active submatrix. Each step:
  Factorise the pivot block (Level-2 BLAS)
  Permute the rows (Level-1 BLAS)
  Solve the triangular system (Level-3 BLAS)
  Update the trailing submatrix (Level-3 BLAS)]

Page 35: The Numerical Algorithms Group


LU Factorisation: SMP Style (1)

[Diagram: the active submatrix split into column blocks 1, 2, 3. On ALL processors: permute the rows, solve the triangular system, update. The next pivot block is factorised on processor 1 as soon as the update of block 1 is done.]

Page 36: The Numerical Algorithms Group


LU Factorisation: SMP Style (2)

[Diagram: the column blocks assigned across the processors.]

Apply all the permutations from the right in parallel.

Page 37: The Numerical Algorithms Group


LU Factorisation

[Chart: performance (Mflops) vs problem size (N = 100 to 8000) on 1 to 48 processors, Sun F15K, 1050 MHz: NAG SMP Release 2 vs Sun Perflib.]

Page 38: The Numerical Algorithms Group


QR Factorisation

[Chart: performance (Mflops) vs problem size (N = 100 to 8000) on 1 to 20 processors, Sun E6800, 900 MHz: NAG SMP Release 2 vs Sun Perflib vs LAPACK.]

Page 39: The Numerical Algorithms Group


NEC SX-4, LU Factorisation

[Chart: performance (Mflops) vs problem size (n = 1000 to 4000) on 1 to 14 processors: NAG SMP Library vs LAPACK.]

Page 40: The Numerical Algorithms Group


LU Factorisation

[Chart: execution time (secs) vs number of processors (1 to 48) and problem size (N = 100 to 8000), Sun F15K, 1050 MHz, Sun Perflib.]

Page 41: The Numerical Algorithms Group


Other cases

Tridiagonalisation
  Half the operations through DSYMV
  Half the operations through rank-k symmetric update (Level-3 BLAS)
  The gateway to the symmetric eigenproblem

Full SVD (QR algorithm) of a bidiagonal matrix
  Computational kernel: plane rotations (Level-1 BLAS DROT)
  Statistics, LLS problems, rank-deficiency, etc.

Page 42: The Numerical Algorithms Group


Tridiagonalisation (Upper variant)

[Chart: performance (Mflops) vs problem size (N = 100 to 4000) on 1 to 48 processors, Sun F15K, 1050 MHz: NAG SMP Release 2 vs Sun Perflib.]

Page 43: The Numerical Algorithms Group


Full SVD of Bidiagonal Matrix (QR Algorithm)

[Chart: performance (Mflops) vs problem size (N = 100 to 4000) on 1 to 48 processors, Sun F15K, 1050 MHz: NAG SMP Release 2 vs Sun Perflib.]

Page 44: The Numerical Algorithms Group


Full SVD of Bidiagonal Matrix (QR Algor.)

[Chart: execution time (secs) vs number of processors (1 to 48) and problem size (N = 100 to 4000), Sun F15K, 1050 MHz: NAG SMP Release 2 vs Sun Perflib.]

Page 45: The Numerical Algorithms Group


Contents

Opening Remarks
SMP Systems, Parallelism and OpenMP
Tuning OpenMP:
  General Points
  NAG-Specific Case: Needs, Solutions and Case Studies
OpenMP and Hybrid Parallelism
Some Other Considerations and Conclusions

Page 46: The Numerical Algorithms Group


Clusters of SMPs

[Diagram: SMP nodes linked by an interconnection sub-system.]

Future
  High-end, medium-end, low-end
Technology
  Re-usable
  Upgradeable
  Linux boxes?
Hybrid (mixed) model?
  NAG currently actively involved

Page 47: The Numerical Algorithms Group


Some Considerations

Enormous increase in CPU performance
Less marked improvements to memory subsystems
Using modular components
  Relatively higher latency than at present

Page 48: The Numerical Algorithms Group


Hybrid Model Paradigm

Currently: all processors the same (MPI, etc.)

"Flattening mountains" ...

Page 49: The Numerical Algorithms Group


Hybrid Parallelism: Why?

High-latency systems
Increased levels of memory
Part-serialisation of message passing
Increased number of processors competing for communication

[Diagram: SMP nodes connected through an interconnection sub-system.]

Page 50: The Numerical Algorithms Group


Mixed Mode Parallelism: A Model’s Goals

Maximise code re-use
  E.g., retain the message-passing main code architecture
  Use existing SMP techniques and technology

Allow some form of top-down refinement
  Identify bottlenecks in isolation from the rest of the code and improve their efficiency

Exploit a problem's different levels of granularity
  Coarse granularity: mapped onto message passing
  Fine granularity: mapped onto SMP

"Hide" communication costs (look-ahead again)
Reduce load imbalance
Perhaps best with problems consisting of loosely coupled components

Page 51: The Numerical Algorithms Group


SMP on Clusters

OS-level software DSMs
Compiler-level
  "Architecture-aware" OpenMP
  Explicit page allocation, etc.
  Retracing HPF?
Also, very much the topic of tomorrow's panel

Page 52: The Numerical Algorithms Group


Contents

Opening Remarks
SMP Systems, Parallelism and OpenMP
Tuning OpenMP:
  General Points
  NAG-Specific Case: Needs, Solutions and Case Studies
OpenMP in Hybrid Parallelism
Some Other Considerations and Conclusions

Page 53: The Numerical Algorithms Group


Performance of Real Applications

Memory bound
  MPI "faster" than OpenMP
    Data "segregation"
    Better access to memory
    Limited cross-processor memory effects

We need
  Block algorithms
  Better data locality
  User-specified prefetching

Page 54: The Numerical Algorithms Group


Example: Very Sparse Large Problems

Random matrix entries, almost diagonally dominant

Diagonal preconditioning
  Virtually removes the effects of preconditioning from the performance analysis

4 case studies
  Neither matrix nor vectors fit in secondary cache
    N = 1000000, NNZ = 10000000, random pattern
    N = 1000000, NNZ = 10000000, random pattern within a narrow band (bandwidth = 200)
  Matrix does not fit in secondary cache
    N = 153600, NNZ = 15000000, random pattern
    N = 153600, NNZ = 15000000, random pattern within a narrow band (bandwidth = 2000)

Page 55: The Numerical Algorithms Group


Some Scalability Results

[Chart: speed-up vs number of processors (1 to 4) for:
  TFQMR, random, n = 1000000
  TFQMR, narrow band, n = 1000000
  Bi-CGSTAB(4), random, n = 153600
  Bi-CGSTAB(4), narrow band, n = 153600]

Page 56: The Numerical Algorithms Group


Performance of Matrix-Vector Product

[Chart: relative CPU times (arbitrary units) of the matrix-vector product for Cases 1 to 4.]

Page 57: The Numerical Algorithms Group


Some Food for Thought

Algorithm producers
  New algorithms required
    Block algorithms
    Latency-tolerant
    Parallel-adaptive
    Dynamically load-balanceable

Other aspects
  Expertise
    Dearth of SMP and mixed-mode parallel expertise
    Increasing need

Page 58: The Numerical Algorithms Group


What do we need in OpenMP?

More flexible synchronisations
  "Level crossings" (some wait, one releases)?
  Partial barriers, relationships of precedence?
More flexible work-sharing mechanisms
Data allocation/distribution
  Avoid HPF constructs
  Is good page migration sufficient on NUMA?
User-specified prefetching
  Essential for performance
  Portable API (or part of OpenMP)
  Compiler writers rather against it
Some message-passing mechanism?

Page 59: The Numerical Algorithms Group


Multi-Level Parallelism: Do We Need it?

Multi-level applications
  Loosely coupled components
  SMP nested parallelism?

Heterogeneous applications
  Very difficult to map onto OpenMP, currently
  SMP nested parallelism?

Hardware requirements (hybrid systems)
  Clusters of SMPs

Page 60: The Numerical Algorithms Group


Summary

Tuning of OpenMP
  Difficult but feasible
  Considerable gains
  Gateway between the serial and parallel worlds

Current algorithms may need revision
  Good prefetching essential in the future

Challenges ahead
  Developing "look-ahead" strategies for numerical algorithms
  Mapping existing numerical algorithms to future architectures (clusters of SMPs)
  Developing new "multi-level" algorithms

Page 61: The Numerical Algorithms Group


Thank you for your attention

Stef Salvini
[email protected]