Over 30 years of mathematical excellence
The Numerical Algorithms Group: Combining mathematics and technology for enhanced performance
Unlocking the Power of OpenMP
Dr Stef Salvini
NAG Ltd
stef.salvini@nag.co.uk
EWOMP 2003 - 22-26 September 2003
Acknowledgements
Lawrence Mulholland, Edward Smyth, Themos Tsikas, Anne E. Trefethen, Jeremy Du Croz, Robert Tong, and too many others to mention here!
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP and Hybrid Parallelism
- Some Other Considerations and Conclusions
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP in Hybrid Parallelism
- Some Other Considerations and Conclusions
NAG and HPC
- NAG Products: Parallel Library Release 3 (MPI); NAG SMP Library Release 2 (OpenMP)
- Collaborations with external agencies: vendors (e.g. ACML Library for AMD Opteron); research and academic institutions; industrial, commercial and financial concerns
- Consultancy activities
Who Can Benefit from Parallelism?
Anybody with large computationally intensive problems:
- Academic institutions
- Industry (aerospace, car, etc.)
- Financial institutions (forecasts, etc.)
- Increasingly, commercial systems: databases, on-line transactions, data mining, web servers, etc.
They want solutions!
Some thoughts…
Short life cycle: hardware
Long life cycle: scientific software
Is software the real capital investment?
The Changing World of HPC
- The "(ir)resistible" rise of PC clusters: price/performance (claims or substantial?); originally in-house built, now part of the mainstream; increasing penetration of the server market
- MPI as the new legacy: is parallelism crystallised into MPI? Is anybody interested in other types of parallelism?
- Hybrid systems: the de facto standard for high-end systems. Do we need multi-level parallelism?
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP in Hybrid Parallelism
- Some Other Considerations and Conclusions
Why SMPs?
- Hardware: re-usable technology; modular technology; increasingly partitionable hardware; reliable technology (minimum downtime)
- Commercial applications: databases, OLTP, web servers
- Numerical and scientific applications: tremendous potential
SMP Model
[Diagram: four CPUs, each with a private cache, attached through an interconnect subsystem to a single memory]
Single memory visible to all processors
Memory can be physically partitioned (NUMA systems)
SMP Parallelism in a Nutshell
Multi-threaded parallelism (parallelism-on-demand):
- Serial execution; threads are spawned at the start of a parallel region and destroyed at its end; serial execution resumes
- Multi-threading: parallel execution behind serial interfaces; details of parallelism are hidden outside the parallel region
- Parallelism carried out in distinct parallel regions
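A minimal C sketch of the fork-join pattern above (illustrative only, not NAG library code): threads exist only inside the parallel region, behind an ordinary serial interface.

```c
/* Fork-join in a nutshell: serial execution, a parallel region in
   which threads are spawned and destroyed, then serial execution
   again. The caller sees a plain serial interface. */
double sum_squares(const double *x, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* spawn threads */
    for (int i = 0; i < n; i++)
        sum += x[i] * x[i];
    return sum;                                 /* threads joined */
}
```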
The Data View
[Diagram: memory hierarchy, from registers through primary cache, secondary cache and local memory to global memory; speed of data access decreases, and size of available data space increases, moving away from the registers]
Data must migrate through the different levels of memory in a very coordinated fashion
Memory and Data Transfer
- Memory structure: multiple levels (increasing); caches are invaluable, essential and difficult
- Some difficulties with data access:
- Single-processor effects: cache misses and thrashing; TLB misses
- Multi-processor effects: false sharing; required synchronisations
- NUMA systems: data allocation and distribution; page misses and migration
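False sharing, one of the multi-processor effects above, can be avoided by padding per-thread data onto separate cache lines. A hedged C sketch (the 64-byte line size and the counting task are assumptions for illustration):

```c
#ifdef _OPENMP
#include <omp.h>
#endif

#define NTHREADS 4
#define CACHE_LINE 64   /* assumed cache-line size */

/* Padding keeps each thread's counter on its own cache line, so one
   thread's updates do not invalidate the others' cached copies. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

long count_even(const int *a, int n) {
    struct padded_counter c[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) c[t].value = 0;

    #pragma omp parallel for num_threads(NTHREADS)
    for (int i = 0; i < n; i++) {
        int t = 0;
    #ifdef _OPENMP
        t = omp_get_thread_num();
    #endif
        if (a[i] % 2 == 0)
            c[t].value++;            /* no false sharing */
    }

    long total = 0;
    for (int t = 0; t < NTHREADS; t++) total += c[t].value;
    return total;
}
```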
SMP Parallelism: Dynamic View of Data
[Diagram: the same data partitioned differently across processors 1-4 at computation stages 1, 2 and 3]
From SMP to OpenMP
- OpenMP embodies SMP mechanisms: concise notation, though not all parallel structures are represented
- Simple to implement (hence wide acceptance by vendors) and to understand (?)
- Compiler directives compile cleanly on serial systems
- However: "local" references only; some system calls
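The "compile cleanly on serial systems" point in practice: a serial compiler simply ignores the pragmas, and only the runtime calls need guarding with the standard _OPENMP macro (a minimal sketch):

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* With a serial compiler the directives compile cleanly (they are
   just pragmas); only runtime calls need the _OPENMP guard. */
int max_threads(void) {
#ifdef _OPENMP
    return omp_get_max_threads();   /* OpenMP build */
#else
    return 1;                       /* serial build: one thread */
#endif
}
```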
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP in Hybrid Parallelism
- Some Other Considerations and Conclusions
Ensuring Efficiency on Modern Systems
- Algorithms must take into account parallelism and data access (multi-level memory layout)
- Algorithms must have: dynamic load balancing; some strategy for serial or quasi-serial bottlenecks (beating Amdahl's law?); parametrisation for easy configuration and porting
- Should algorithms also take into account: multi-level memory (e.g. NUMA, clusters, etc.)? Contingent (history-of-the-computation) data layout?
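Dynamic load balancing, as called for above, is available directly in OpenMP via the schedule clause. A toy sketch (the chunk size of 4 and the simulated workload are arbitrary illustrative choices):

```c
/* Iterations with very different costs are handed out in small
   chunks as threads become free (schedule(dynamic)), instead of a
   fixed static split that would leave some threads idle. */
double total_work(const int *cost, int n) {
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < n; i++) {
        double w = 0.0;
        for (int k = 0; k < cost[i]; k++)   /* simulated uneven work */
            w += 1.0;
        total += w;
    }
    return total;
}
```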
Levels of Parallelism
Coarse grain:
- Application driven; more potential for parallelism; closer to optimal performance; less overhead
- But: design complexity; implementation costs; maintenance costs; removed from the "serial" world; "non-local" data access problem; non-modular design
- Expandable to non-SMP systems
Fine grain:
- Close to "elementary" algorithms; modular design; direct path from the "serial" world; top-down refinement
- But: higher overheads; less potential for parallelism; serial or quasi-serial bottlenecks
Level of Parallelism
Coarse/fine trade-off dictated by: nature of the application; technological feasibility; availability of components; expertise and experience; time scale and resources for development; deadlines for results
The “reductionist approach”
Postulate: parallelise the basic computational kernels
- BLAS (Basic Linear Algebra Subroutines): Level-3 (matrix-matrix product); Level-2 (matrix-vector product); Level-1 (vector operations)
- Basic FFTs
Theorem: anything built on them will have adequate parallelism
Proof: by oral tradition
Level-3 BLAS
N^3 operations on N^2 data references: good data re-use (cache friendly), good parallelisability
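The N^3-operations-on-N^2-data property can be seen in a naive C matrix product (a sketch, not a tuned BLAS): every element of A and B is reused n times, and the outer loop parallelises without synchronisation.

```c
/* C += A*B, n x n row-major: O(n^3) flops on O(n^2) data, so each
   element is reused n times (cache friendly). Distinct threads write
   distinct rows of C, so the outer loop needs no synchronisation. */
void matmul(int n, const double *A, const double *B, double *C) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = A[i*n + k];     /* reused across the j loop */
            for (int j = 0; j < n; j++)
                C[i*n + j] += aik * B[k*n + j];
        }
}
```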
Level-2 and Level-1 BLAS
N^2 and N operations on N^2 and N data references: poor data re-use, dubious parallelisability. Problems if sequences of BLAS are applied to the same data space.
DSYMV (Symm. Matrix Vector Product)
Requires synchronisations, etc. Many vendors do not parallelise it!
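Why DSYMV needs synchronisation: with only one triangle stored, each a_ij contributes to two entries of y, so naive parallel loops race. One common remedy (a sketch, not the NAG implementation) is per-thread partial result vectors combined at the end:

```c
#include <stdlib.h>
#include <string.h>

/* y = A*x with A symmetric, lower triangle stored (row-major n x n).
   Each a_ij (i > j) contributes to both y_i and y_j, so threads
   would race on y; here every thread accumulates into a private copy
   of y, and the copies are summed afterwards (a reduction). */
void dsymv_lower(int n, const double *A, const double *x, double *y) {
    memset(y, 0, n * sizeof(double));
    #pragma omp parallel
    {
        double *yp = calloc(n, sizeof(double));  /* private partial y */
        #pragma omp for nowait
        for (int i = 0; i < n; i++) {
            yp[i] += A[i*n + i] * x[i];
            for (int j = 0; j < i; j++) {
                yp[i] += A[i*n + j] * x[j];
                yp[j] += A[i*n + j] * x[i];  /* the symmetric "echo" */
            }
        }
        #pragma omp critical             /* combine partial results */
        for (int i = 0; i < n; i++) y[i] += yp[i];
        free(yp);
    }
}
```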
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP in Hybrid Parallelism
- Some Other Considerations and Conclusions
NAG SMP Library
Our needs:
- Fill the gap between serial and parallel codes
- Allow maximum reuse of our products (NAG Library)
- Best performance achievable (parallelism must make a difference!)
Our approach:
- Build the library on our (serial) NAG Library
- Keep identical serial interfaces
- Keep identical functionality and numerics
- Hide all details of parallelism
- Use parametrisable algorithms for easy porting
That dictates our level of granularity!
NAG SMP Library Release 2
- All the NAG Library Mark 19: over 1200 user-callable components
- Parallelised numerical routines in: dense linear algebra; sparse linear algebra; FFTs; random-number generation; all other routines dependent on the above
New at NAG SMP Library Release 3 (Soon)
- Sparse technology: direct methods; preconditioners for iterative solvers; enhanced iterative solvers; band solvers; eigensolution of large sparse matrices
- Extended linear algebra coverage
- Multi-dimensional quadrature
LU Factorisation: A “Serial” Algorithm
LU Factorisation: LAPACK Style
[Diagram: already factorised L and U factors, the pivot block, and the active (trailing) submatrix]
- Factorise the pivot block (Level-2 BLAS)
- Permute the rows (Level-1 BLAS)
- Solve the triangular system (Level-3 BLAS)
- Update the trailing submatrix (Level-3 BLAS)
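The four steps above, in a compact C sketch of right-looking blocked LU (row pivoting omitted for brevity, so this is illustrative only, not the LAPACK routine):

```c
/* Right-looking blocked LU: factorise a pivot panel, solve a
   triangular system for the U block, then update the trailing
   submatrix. Pivoting is omitted for brevity. A is n x n row-major;
   unit-lower L and U overwrite A. */
void lu_blocked(int n, int nb, double *A) {
    for (int k = 0; k < n; k += nb) {
        int kb = (k + nb < n) ? nb : n - k;
        /* factorise the pivot block, column by column (Level-2 BLAS) */
        for (int j = k; j < k + kb; j++)
            for (int i = j + 1; i < n; i++) {
                A[i*n + j] /= A[j*n + j];
                for (int c = j + 1; c < k + kb; c++)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
            }
        /* solve the triangular system for the U block (Level-3 BLAS) */
        for (int c = k + kb; c < n; c++)
            for (int i = k + 1; i < k + kb; i++)
                for (int j = k; j < i; j++)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
        /* update the trailing submatrix (Level-3 BLAS): almost all
           the flops live here, and rows update independently */
        #pragma omp parallel for
        for (int i = k + kb; i < n; i++)
            for (int j = k; j < k + kb; j++)
                for (int c = k + kb; c < n; c++)
                    A[i*n + c] -= A[i*n + j] * A[j*n + c];
    }
}
```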
About serial or quasi-serial bottlenecks
That is what we do:
- Identify serial bottlenecks
- Identify memory access bottlenecks
- Remove the above, or ...
- "Hide" them using a "look-ahead" strategy
- "Locally asynchronous" algorithms?
About serial or quasi-serial bottlenecks: live with them ... (the reductionist approach)
- Further parallelisation of the bottlenecks: perhaps nested parallelism?
- Use knowledge about the algorithms: predecessor/successor relationships; task queues
- Use tools: instrument the code to generate a stack of tasks; profile and analyse previous runs
- "Hide" them using a "look-ahead" strategy: "locally asynchronous" algorithms
Look-ahead (Beating Amdahl’s Law?)
[Diagram: a serial phase flanked by two parallelisable phases, executed three ways: serial execution; parallel execution on processors 1-3, which idle during the serial phase; and parallel execution with look-ahead, where the serial phase is overlapped with parallelisable work]
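A miniature of the look-ahead idea using OpenMP sections: while the bulk "update" of step k runs, another thread already performs the small serial "factorisation" of step k+1. The two work routines here are hypothetical stand-ins, not a real factorisation.

```c
/* Look-ahead in miniature: overlap the serial part of step k+1 with
   the parallelisable part of step k, hiding the serial bottleneck.
   factorise_panel and update_trailing are toy stand-ins. */
double factorise_panel(int k) { return (double)k; }   /* serial part  */
double update_trailing(int k) { return 10.0 * k; }    /* bulk of work */

double lu_with_lookahead(int nsteps) {
    double sum = 0.0;
    double next_panel = factorise_panel(0);
    for (int k = 0; k < nsteps; k++) {
        double panel = next_panel;
        #pragma omp parallel sections reduction(+:sum)
        {
            #pragma omp section
            sum += update_trailing(k) + panel;        /* current step */
            #pragma omp section
            if (k + 1 < nsteps)
                next_panel = factorise_panel(k + 1);  /* look ahead   */
        }
    }
    return sum;
}
```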
LU Factorisation: LAPACK Style
[Diagram: already factorised L and U factors, the pivot block, and the active (trailing) submatrix]
- Factorise the pivot block (Level-2 BLAS)
- Permute the rows (Level-1 BLAS)
- Solve the triangular system (Level-3 BLAS)
- Update the trailing submatrix (Level-3 BLAS)
LU Factorisation: SMP Style (1)
[Diagram: active submatrix split into panels 1, 2 and 3; already factorised part and U factor]
- On ALL processors: permute the rows, solve the triangular system, update
- Pivot block: factorise on 1, after the update on 1 is done
LU Factorisation: SMP Style (2)
[Diagram: cyclic assignment of panels to processors]
Apply all the permutations from the right in parallel
LU Factorisation
[Chart: performance (Mflops) vs problem size N = 100 to 8000, on 1 to 48 processors; Sun F15K, 1050 MHz; NAG SMP Release 2 vs Sun Perflib]
QR Factorisation
[Chart: performance (Mflops) vs problem size N = 100 to 8000, on 1 to 20 processors; Sun E6800, 900 MHz; NAG SMP Release 2 vs Sun Perflib and LAPACK]
NEC SX4, LU Factorization
[Chart: performance (Mflops) vs problem size n = 1000 to 4000, on 1 to 14 processors; NAG SMP Library vs LAPACK]
LU Factorisation
[Chart: execution time (secs) vs number of processors (1 to 48), for problem sizes N = 100 to 8000; Sun F15K, 1050 MHz; Sun Perflib]
Other cases
- Tridiagonalisation: half the operations through DSYMV, half through a rank-k symmetric update (Level-3 BLAS); the gateway to the symmetric eigenproblem
- Full SVD (QR algorithm) of a bidiagonal matrix: computational kernel is line rotations (Level-1 BLAS DROT); statistics, LLS problems, rank deficiency, etc.
Tridiagonalisation (Upper variant)
[Chart: performance (Mflops) vs problem size N = 100 to 4000, on 1 to 48 processors; Sun F15K, 1050 MHz; NAG SMP Release 2 vs Sun Perflib]
Full SVD of Bidiagonal Matrix (QR Algor.)
[Chart: performance (Mflops) vs problem size N = 100 to 4000, on 1 to 48 processors; Sun F15K, 1050 MHz; NAG SMP Release 2 vs Sun Perflib]
Full SVD of Bidiagonal Matrix (QR Algor.)
[Chart: execution time (secs) vs number of processors (1 to 48), for problem sizes N = 100 to 4000; Sun F15K, 1050 MHz; NAG SMP Release 2 vs Sun Perflib]
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP and Hybrid Parallelism
- Some Other Considerations and Conclusions
Clusters of SMPs
[Diagram: SMP nodes connected by an interconnection sub-system]
- Future: high-end, medium-end, low-end
- Technology: re-usable, upgradeable; Linux boxes?
- Hybrid (mixed) model? NAG currently actively involved
Some Considerations
- Enormous increase in CPU performance; less marked improvements to memory subsystems
- Using modular components: relatively higher latency than currently
Hybrid Model Paradigm
Currently: all processors the same (MPI, etc.): "flattening mountains" ...
Hybrid Parallelism: Why?
- High-latency systems
- Increased levels of memory
- Part-serialisation of message-passing
- Increased number of processors competing for communication
[Diagram: SMP nodes sharing an interconnection sub-system]
Mixed Mode Parallelism: A Model’s Goals
- Maximise code re-use: e.g., retain the message-passing main code architecture; use existing SMP techniques and technology
- Allow some form of top-down refinement: identify bottlenecks in isolation from the rest of the code and improve their efficiency
- Exploit a problem's different levels of granularity: coarse granularity mapped onto message-passing; fine granularity mapped onto SMP
- "Hide" communication costs (look-ahead again); reduce load imbalance
- Perhaps best with problems consisting of loosely coupled components
SMP on Clusters
- OS-level SSDMS
- Compiler-level, "architecture-aware" OpenMP: explicit page allocation, etc. Retracing HPF?
- Also, very much the topic of tomorrow's panel
Contents
- Opening Remarks
- SMP Systems, Parallelism and OpenMP
- Tuning OpenMP: General Points; NAG Specific Case: Needs, Solutions and Case Studies
- OpenMP in Hybrid Parallelism
- Some Other Considerations and Conclusions
Performance of Real Applications
- Memory bound: MPI "faster" than OpenMP
- Data "segregation": better access to memory; limited cross-processor memory effects
- We need: block algorithms; better data locality; user-specified prefetching
Example: Very Sparse Large Problems
Random matrix entries, almost diagonally dominant. Diagonal preconditioning: virtually removes the effects of preconditioning from the performance analysis.
4 case studies:
- Neither matrix nor vectors fit in secondary cache: N = 1000000, NNZ = 10000000, random pattern; N = 1000000, NNZ = 10000000, random pattern within a narrow band (bandwidth = 200)
- Matrix does not fit in secondary cache: N = 153600, NNZ = 15000000, random pattern; N = 153600, NNZ = 15000000, random pattern within a narrow band (bandwidth = 2000)
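The kernel behind these case studies is the sparse matrix-vector product; a CSR sketch (an assumed storage format, not the NAG routine) shows why it is memory-bound: with roughly ten nonzeros per row, the accesses to x follow the sparsity pattern and get little cache re-use.

```c
/* y = A*x for a sparse matrix in compressed-row (CSR) storage. The
   row loop parallelises cleanly, since each thread writes its own
   rows of y, but the gathers from x are pattern-driven, so the
   kernel is bound by memory access rather than arithmetic. */
void spmv_csr(int n, const int *row_ptr, const int *col_ind,
              const double *val, const double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int p = row_ptr[i]; p < row_ptr[i+1]; p++)
            s += val[p] * x[col_ind[p]];    /* pattern-driven gather */
        y[i] = s;
    }
}
```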
Some Scalability Results
[Chart: speed-up vs number of processors (1 to 4) for:
- TFQMR, random, n = 1000000
- TFQMR, narrow band, n = 1000000
- Bi-CGSTAB(4), random, n = 153600
- Bi-CGSTAB(4), narrow band, n = 153600]
Performance of Matrix-Vector Product
[Chart: relative CPU times (arbitrary units) for Cases 1 to 4]
Some Food for Thought
- Algorithm producers: new algorithms required; block algorithms that are latency-tolerant, parallel-adaptive and dynamically load-balanceable
- Other aspects: expertise; a dearth of SMP and mixed-mode parallel expertise, and an increasing need for it
What do we need in OpenMP?
- More flexible synchronisations: "level crossings" (some wait, one releases)? Partial barriers, relationships of precedence?
- More flexible work-sharing mechanisms
- Data allocation/distribution: avoid HPF constructs; is good page migration sufficient on NUMA?
- User-specified prefetching: essential for performance; a portable API (or part of OpenMP); compiler writers rather against it
- Some message-passing mechanism?
Multi-Level Parallelism: Do We Need it?
- Multi-level applications: loosely coupled components; SMP nested parallelism?
- Heterogeneous applications: very difficult to map onto OpenMP, currently; SMP nested parallelism?
- Hardware requirements (hybrid systems): clusters of SMPs
Summary
- Tuning of OpenMP: difficult but feasible; considerable gains; a gateway between the serial and parallel worlds
- Current algorithms may need revision; good prefetching essential in the future
- Challenges ahead: developing "look-ahead" strategies for numerical algorithms; mapping existing numerical algorithms to future architectures (clusters of SMPs); developing new "multi-level" algorithms
Thank you for your attention
Stef Salvini
stef.salvini@nag.co.uk