The Price Performance of Performance Models
TRANSCRIPT
11/2/21 | Technical University of Darmstadt, Germany | Felix Wolf | 1
Felix Wolf, Technical University of Darmstadt | ModSim 2021
The price performance of performance models
Application System
Photo: Alex Becker / TU Darmstadt
Acknowledgement
TU Darmstadt
• Alexandru Calotoiu
• Alexander Geiß
• Alexander Graf
• Daniel Lorenz
• Benedikt Naumann
• Thorsten Reimann
• Marcus Ritter
• Sergei Shudler
ETH Zurich
• Alexandru Calotoiu
• Marcin Copik
• Tobias Grosser
• Torsten Hoefler
• Nicolas Wicki
LLNL
• David Beckingsale
• Christopher Earl
• Ian Karlin
• Martin Schulz
Scaling your code can harbor performance surprises*…
[Figure: alternating communication and computation phases, whose proportions shift as the code scales]
*Goldsmith et al., 2007
Performance model
[Plot: runtime (s, 0–21) vs. processes (2⁹–2¹³), with the fitted curve 3 · 10⁻⁴ · p² + c]
Formula that expresses a relevant performance metric as a function of one or more execution parameters
Analytical (i.e., manual) creation is challenging for entire programs:
• Identify kernels – incomplete coverage
• Create models – laborious, difficult

t = f(p)
t = 3 · 10⁻⁴ · p² + c
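As a toy illustration (the offset c = 2.0 and the chosen process counts are assumptions, not from the talk), such a model can be used to predict runtime beyond the measured range:

```python
# Toy model from the slide: t(p) = 3e-4 * p^2 + c.
# The offset c = 2.0 is a hypothetical fitted constant, not from the talk.
def runtime_model(p, c=2.0):
    """Predicted runtime in seconds for p processes."""
    return 3e-4 * p**2 + c

# Extrapolate beyond the measured range of process counts.
for p in (2**9, 2**13, 2**15):
    print(f"p = {p:6d}: t = {runtime_model(p):10.1f} s")
```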
Empirical performance modeling
Performance measurements with different execution parameters x1,...,xn
t₁, t₂, t₃, …, tₙ₋₂, tₙ₋₁, tₙ
Machine learning
t = f(x₁, …, xₙ)
Alternative metrics: FLOPs, data volume…
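A minimal sketch of the empirical idea, assuming synthetic measurements and a hand-rolled least-squares fit (all values illustrative):

```python
# Synthetic "measurements": runtimes at several process counts; the true
# behavior (quadratic plus constant) is assumed for illustration only.
ps = [2.0**k for k in range(9, 14)]
ts = [3e-4 * p**2 + 2.0 for p in ps]

# Least-squares fit of t = a * p^2 + c via the 2x2 normal equations.
n   = len(ps)
s2  = sum(p**2 for p in ps)                  # sum of p^2
s4  = sum(p**4 for p in ps)                  # sum of p^4
st  = sum(ts)                                # sum of t
st2 = sum(t * p**2 for p, t in zip(ps, ts))  # sum of t * p^2

det = s4 * n - s2 * s2
a = (st2 * n - st * s2) / det
c = (s4 * st - s2 * st2) / det
print(f"t(p) ~ {a:.1e} * p^2 + {c:.2f}")
```

Because the synthetic data is noise-free, the fit recovers the generating coefficients exactly; real measurements would add noise, which the later slides address.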
Challenges
Applications
System
Run-to-run variation / noise
Cost of the required experiments
How to deal with noisy data
• Introduce a prior into the learning process
  • Assumption about the probability distribution generating the data
• Time ~ effort spent on computation, memory access, communication, and I/O
Typical algorithmic complexities in HPC

Algorithm      Computation    Communication
Samplesort     t(p) ~ p²      t(p) ~ p² · log₂²(p)
Naïve N-body   t(p) ~ p       t(p) ~ p
FFT            t(p) ~ c       t(p) ~ log₂(p)
LU             t(p) ~ c       t(p) ~ c
…              …              …
Performance model normal form (PMNF)
Single parameter [Calotoiu et al., SC13]:

f(x) = Σ_{k=1..n} c_k · x^{i_k} · log₂^{j_k}(x)

Multiple parameters [Calotoiu et al., Cluster’16]:

f(x₁, …, xₘ) = Σ_{k=1..n} c_k · Π_{l=1..m} x_l^{i_kl} · log₂^{j_kl}(x_l)
Workflow: parameter selection → search space configuration → linear regression + cross-validation → quality assurance
Heuristics reduce the search space.
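A sketch of the PMNF search for a single parameter, assuming a small exponent set and plain residual error instead of the cross-validation Extra-P uses (all names and data illustrative):

```python
import math

# Synthetic measurements whose true behavior is t = 2 + 0.5 * p * log2(p)
# (chosen for illustration; real data would come from experiments).
ps = [4.0, 8.0, 16.0, 32.0, 64.0]
ts = [2.0 + 0.5 * p * math.log2(p) for p in ps]

def fit(basis):
    """Least-squares fit of t = c0 + c1 * basis(p); returns (c0, c1, SSE)."""
    xs = [basis(p) for p in ps]
    n, sx, sxx = len(ps), sum(xs), sum(x * x for x in xs)
    st, sxt = sum(ts), sum(x * t for x, t in zip(xs, ts))
    c1 = (n * sxt - sx * st) / (n * sxx - sx * sx)
    c0 = (st - c1 * sx) / n
    sse = sum((c0 + c1 * x - t) ** 2 for x, t in zip(xs, ts))
    return c0, c1, sse

# PMNF-style search space: one term p^i * log2(p)^j per candidate.
exponents = [(i, j) for i in (0, 0.5, 1, 1.5, 2) for j in (0, 1, 2)
             if (i, j) != (0, 0)]
results = {(i, j): fit(lambda p, i=i, j=j: p**i * math.log2(p)**j)
           for i, j in exponents}
(bi, bj), (c0, c1, sse) = min(results.items(), key=lambda kv: kv[1][2])
print(f"best term: p^{bi} * log2(p)^{bj}, t = {c0:.2f} + {c1:.2f} * term")
```

Selecting by raw residual overfits on noisy data, which is exactly why the real pipeline adds cross-validation as quality assurance.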
Extra-P 4.0
Available at: https://github.com/extra-p/extrap
MPI implementations [Shudler et al., IEEE TPDS 2019]
Platform                Juqueen     Juropa      Piz Daint
Allreduce [s] (expectation: O(log p))
  Model                 O(log p)    O(p^0.5)    O(p^0.67 · log p)
  R²                    0.87        0.99        0.99
  Match                 ✔           ~           ✘
Comm_dup [B] (expectation: O(1))
  Model                 2.2e5       256         3770 + 18p
  R²                    1           1           0.99
  Match                 ✔           ✔           ✘
Kripke - example w/ multiple parameters
SweepSolver – main computation kernel
• Expectation: performance depends on problem size, t ~ d · g
• Actual model*: t = 5 + d · g + 0.005 · ∛p · d · g

MPI_Testany – main communication kernel (3D wave-front communication pattern)
• Expectation: performance depends on the cubic root of the process count, t ~ ∛p
• Actual model*: t = 7 + ∛p + 0.005 · ∛p · d · g

Kernels must wait on each other → a smaller compounded effect discovered

*Coefficients have been rounded for convenience
Experiments can be expensive
Need 5^m experiments, m = #parameters

[Plot: problem size per process (10–50) vs. processes (4–64); low-memory and high-jitter regions marked]
Multi-parameter modeling in Extra-P
Generation of candidate models and selection of the best fit: find the best single-parameter models, then combine them in the most plausible way (+, ·, none), e.g.

Time = c1 · n · log(n) · p
FLOPs = c2 · n · log(n)
c1 + c2 · p
c1 + c2 · p²
c1 + c2 · log(p)
c1 + c2 · p · log(p)
c1 + c2 · p² · log(p)
c1 · log(p) + c2 · p
c1 · log(p) + c2 · p · log(p)
c1 · log(p) + c2 · p²
c1 · log(p) + c2 · p² · log(p)
c1 · p + c2 · p · log(p)
c1 · p + c2 · p²
c1 · p + c2 · p² · log(p)
c1 · p · log(p) + c2 · p²
c1 · p · log(p) + c2 · p² · log(p)
c1 · p² + c2 · p² · log(p)
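The combination step can be sketched as follows; the chosen terms, the synthetic data, and the omission of coefficient refitting (Extra-P refits coefficients for every candidate) are simplifications:

```python
import math

# Best single-parameter terms (assumed found in the previous step; illustrative).
f = lambda p: p * math.log2(p)   # term in the process count p
g = lambda n: n                  # term in the problem size n

# Candidate combinations: additive, multiplicative, or one term alone ("none").
hypotheses = {
    "f+g": lambda p, n: f(p) + g(n),
    "f*g": lambda p, n: f(p) * g(n),
    "f":   lambda p, n: f(p),
    "g":   lambda p, n: g(n),
}

# Synthetic measurements whose true behavior is multiplicative.
points = [(p, n) for p in (4, 8, 16) for n in (10, 20, 40)]
truth = {(p, n): f(p) * g(n) for p, n in points}

def sse(h):
    """Sum of squared errors of hypothesis h over all measured points."""
    return sum((h(p, n) - truth[(p, n)]) ** 2 for p, n in points)

best = min(hypotheses, key=lambda name: sse(hypotheses[name]))
print("most plausible combination:", best)
```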
How many data points do we really need?

[Plot: problem size per process (10–50) vs. processes (4–64)]
Learning cost-effective sampling strategies [Ritter et al., IPDPS’20]
Function generator
Noise module
Reinforcement learning agent
Selected parameter values
Synthetic measurement
Extra-P
Evaluation
Feedback
Prediction
Ground truth
Empirical model
Heuristic parameter-value selection strategy
1. Measure the minimum number of points required for modeling
2. Create a model using Extra-P
3. Gather one additional measurement (assumed to be the cheapest)
4. Create a new model
5. If cost < budget, repeat from step 3; otherwise accept the final model
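The loop above can be sketched as follows (function names and the toy cost model are illustrative, not the Extra-P API):

```python
# Budget-driven refinement loop (illustrative sketch).

def cheapest_candidate(measured, candidates, cost):
    """Cheapest not-yet-measured candidate point, or None."""
    remaining = [c for c in candidates if c not in measured]
    return min(remaining, key=cost) if remaining else None

def model_with_budget(candidates, cost, budget, minimal_set, build_model):
    measured = list(minimal_set)            # 1. minimal point set
    spent = sum(cost(c) for c in measured)
    model = build_model(measured)           # 2. initial model
    while True:
        nxt = cheapest_candidate(measured, candidates, cost)
        if nxt is None or spent + cost(nxt) > budget:
            return model, measured, spent   # 5. budget exhausted: final model
        measured.append(nxt)                # 3. cheapest extra measurement
        spent += cost(nxt)
        model = build_model(measured)       # 4. remodel

# Toy usage: points are (processes, problem size); cost grows with both.
candidates = [(p, n) for p in (4, 8, 16, 32) for n in (10, 20)]
cost = lambda pt: pt[0] * pt[1]
model, used, spent = model_with_budget(
    candidates, cost, budget=500,
    minimal_set=[(4, 10), (8, 10), (4, 20)],
    build_model=len)                        # stand-in for the real modeler
print(f"{len(used)} points measured, cost {spent}")
```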
Synthetic data evaluation
[Chart: synthetic functions perturbed with noise levels of ±5%, ±10%, ±15%, ±20%, used for training and evaluation]
Synthetic evaluation results
[Chart: percentage of accurate models* (20–100%) for 1 parameter and 2 parameters at 5% noise; x-axis: measurements used / percentage of cost – 5 / 100%, 9 / 12%, 11 / 13%, 15 / 17%, 25 / 100%; repetitions: 2, 4, 6]
Synthetic evaluation results
[Chart: percentage of accurate models* (20–100%) for 3 parameters at 5% noise; x-axis: measurements used / percentage of cost – 13 / 1.7%, 15 / 1.8%, 25 / 2.2%, 75 / 11%, 125 / 100%; repetitions: 2, 4, 6]
Synthetic evaluation results
[Chart: percentage of accurate models* (20–100%) for 4 parameters at 5% noise; x-axis: measurements used / percentage of cost – 17 / 0.1%, 18 / 0.12%, 25 / 0.17%, 125 / 1.2%, 250 / 4.2%, 625 / 100%; repetitions: 2, 4, 6]
Synthetic evaluation results
[Chart: percentage of accurate models* (20–100%) for 4 parameters at 1% noise; x-axis: measurements used / percentage of cost – 17 / 0.1%, 18 / 0.12%, 25 / 0.17%, 125 / 1.2%, 250 / 4.2%, 625 / 100%; repetitions: 2, 4, 6]
Case studies
Application   #Parameters   Extra points   Cost savings [%]   Prediction error [%]
FASTEST       2             0              70                 2
Kripke        3             3              99                 39
Relearn       2             0              85                 11
Noise-resilient performance modeling [Ritter et al., IPDPS’21]
§ Performance measurements frequently affected by noise
§ Regression struggles with increased amounts of noise – especially w/ more parameters
§ Neural networks are resilient to noise – when noise is part of their training
[Figure: noisy photo (original) vs. denoised reconstruction; analogous sketches of application behavior overlaid with noise for 1 parameter (x) and 3 parameters (x, y, z). Adapted from: https://developer.nvidia.com/blog/ai-can-now-fix-your-grainy-photos-by-only-looking-at-grainy-photos/]
Noise-resilient adaptive modeling
DNNs are often better at guessing models in the presence of noise.

1. Take performance measurements and estimate the noise
2. Noise low? If yes, use the classic (sparse) modeler; if no, use the DNN modeler (with transfer learning)
3. Select the best of the resulting performance models → final model
Adapted. Original © 2021 IEEE
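One way to sketch the adaptive decision, assuming noise is estimated as the coefficient of variation across repetitions and using a hypothetical 5% threshold (both are illustrative choices, not the paper's exact procedure):

```python
import statistics

def noise_level(repetitions):
    """Mean coefficient of variation across measurement points, in percent."""
    cvs = [statistics.pstdev(reps) / statistics.mean(reps)
           for reps in repetitions]
    return 100.0 * sum(cvs) / len(cvs)

def choose_modeler(repetitions, threshold=5.0):
    """Pick the backend: regression for low noise, DNN otherwise."""
    return "regression" if noise_level(repetitions) < threshold else "dnn"

# Two toy data sets: nearly clean vs. strongly perturbed repetitions.
clean = [[10.0, 10.1, 9.9], [20.0, 20.2, 19.8]]
noisy = [[10.0, 14.0, 6.0], [20.0, 28.0, 12.0]]
print(choose_modeler(clean), choose_modeler(noisy))
```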
Noise-resilient performance modeling – synthetic evaluation

[Chart: median relative error (0–30%) at an unseen point (two ticks in each dimension) vs. noise level (20%, 50%, 100%); adaptive vs. regression; 2 parameters]
Adapted. Original © 2021 IEEE
Noise-resilient performance modeling – case studies: results

Median relative error   Kripke (Vulcan)   FASTEST (SuperMUC)   RELeARN
Adaptive                13.45%            16.23%               7.12%
Regression              22.28%            69.79%               7.12%

Noise < 5%: only the regression modeler is used
Adapted. Original © 2021 IEEE
Noise-resilient performance modeling – case studies: noise
Adapted. Original © 2021 IEEE
Optimized measurement point selection [Naumann et al., in preparation]
[Figure: sparse measurement point selection vs. a potentially better selection]
Optimized measurement point selection via Gaussian Process Regression (GPR) – Introduction
From: Leclercq, Florent. (2018). Bayesian optimization for likelihood-free cosmological inference. Physical Review D. 98. 10.1103/PhysRevD.98.063511. Used with permission.
Optimized measurement point selection – Gaussian processes

Gaussian process (GP): a series of normal distributions
GP(t) = N( m(t), c(t, t′) )
m: mean function, c: covariance function
Optimized measurement point selection – Gaussian processes

Idea: use the covariance function
• High variance → high uncertainty → promising measurement candidates
• Measurement costs have to be considered! → cost-benefit calculation
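A self-contained sketch of the idea, with an assumed RBF kernel and cost model (illustrative, not the actual implementation): score each candidate by posterior variance per unit cost and measure the best one next.

```python
import math

def rbf(a, b, length=8.0):
    """Squared-exponential covariance between two 1-D points."""
    return math.exp(-((a - b) ** 2) / (2.0 * length**2))

def solve(A, b):
    """Solve A x = b via Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def posterior_variance(x, measured, jitter=1e-6):
    """GP posterior variance at x given the measured locations."""
    k_star = [rbf(x, m) for m in measured]
    K = [[rbf(a, b) + (jitter if a == b else 0.0) for b in measured]
         for a in measured]
    alpha = solve(K, k_star)                      # K^{-1} k_*
    return rbf(x, x) - sum(k * a for k, a in zip(k_star, alpha))

measured = [4.0, 8.0, 16.0]        # process counts already measured
candidates = [32.0, 64.0]          # possible next experiments
cost = lambda p: p                 # assumed: cost grows linearly with scale

# Cost-benefit: prefer high uncertainty, but discount expensive points.
best = max(candidates, key=lambda x: posterior_variance(x, measured) / cost(x))
print("next measurement at p =", best)
```

Here the far point p = 64 has the highest uncertainty but also the highest cost, so the ratio favors the nearer candidate.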
Results – synthetic evaluation (3 parameters)

[Chart: % of accurate models (relative error < 20% at an unseen point, 2 ticks in each direction), 86–100%, at 5%, 7.5%, and 10% noise; Sparse vs. GPR vs. Hybrid]

[Chart: measurements needed in addition to the minimal measurement set, 0–60, at 5%, 7.5%, and 10% noise; Sparse vs. GPR vs. Hybrid]
Results

[Chart: % of budget used (0–40%), compared to the complete measurement set, at 5%, 7.5%, and 10% noise; Sparse vs. GPR vs. Hybrid]
Parameter selection [Copik et al., PPoPP’21]
• The more parameters, the more experiments
• Modeling parameters without performance impact is harmful
Taint analysis
Program
Input parameters
Which parameter influences which function?
Taint labels
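A toy illustration of the idea behind taint labels in plain Python, not the compiler-based analysis from the paper: each input parameter carries a label, labels flow through arithmetic, and each function records which labels reach it.

```python
# Toy taint propagation (illustrative; the real analysis works on the
# compiled program, not on Python wrappers).

class Tainted:
    """A value that carries a set of taint labels through arithmetic."""
    def __init__(self, value, labels=frozenset()):
        self.value, self.labels = value, frozenset(labels)

    def _lift(self, other):
        return other if isinstance(other, Tainted) else Tainted(other)

    def __add__(self, other):
        o = self._lift(other)
        return Tainted(self.value + o.value, self.labels | o.labels)

    def __mul__(self, other):
        o = self._lift(other)
        return Tainted(self.value * o.value, self.labels | o.labels)

influences = {}   # function name -> labels that reached it

def kernel_setup(p):
    influences["kernel_setup"] = p.labels
    return p + 1

def kernel_compute(n):
    influences["kernel_compute"] = n.labels
    return n * n

# Label each input parameter with its own taint, then run the "program".
kernel_setup(Tainted(64, {"p"}))
kernel_compute(Tainted(1000, {"n"}))
print(influences)
```

The record shows that only `p` influences `kernel_setup` and only `n` influences `kernel_compute`, so each kernel needs experiments varying just its own parameter.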
Selected papers
Foundation (single model parameter): Alexandru Calotoiu, Torsten Hoefler, Marius Poke, Felix Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes. SC13.

MPI case study: Sergei Shudler, Yannick Berens, Alexandru Calotoiu, Torsten Hoefler, Alexandre Strube, Felix Wolf: Engineering Algorithms for Scalability through Continuous Validation of Performance Expectations. IEEE TPDS, 30(8):1768–1785, 2019.

Multiple model parameters: Alexandru Calotoiu, David Beckingsale, Christopher W. Earl, Torsten Hoefler, Ian Karlin, Martin Schulz, Felix Wolf: Fast Multi-Parameter Performance Modeling. IEEE Cluster 2016.

Cost-effective sampling strategies: Marcus Ritter, Alexandru Calotoiu, Sebastian Rinke, Thorsten Reimann, Torsten Hoefler, Felix Wolf: Learning Cost-Effective Sampling Strategies for Empirical Performance Modeling. IPDPS 2020.

Noise resilience: Marcus Ritter, Alexander Geiß, Johannes Wehrstein, Alexandru Calotoiu, Thorsten Reimann, Torsten Hoefler, Felix Wolf: Noise-Resilient Empirical Performance Modeling with Deep Neural Networks. IPDPS 2021.

Taint-based performance modeling: Marcin Copik, Alexandru Calotoiu, Tobias Grosser, Nicolas Wicki, Felix Wolf, Torsten Hoefler: Extracting Clean Performance Models from Tainted Programs. PPoPP 2021.
Thank you!
Q&A