A Reproducible Research Methodology for Designing and Conducting Faithful Simulations of Dynamic Task-based Scientific Applications
Luka Stanisic
Inria, Bordeaux Sud-Ouest, France
MPCDF seminar, Garching
February 24, 2017
Background (2011-2017)
- Bachelor (CS specialty), EE faculty, Belgrade, Serbia
- Research Master (parallelism specialty), Grenoble, France: benchmarking, CPU cache modeling, ARM vs Intel
- PhD (supervisors A. Legrand & J.-F. Méhaut), Grenoble, France: modeling and simulation of dynamic task-based applications; methodology for reproducible research; statistical analysis, trace visualizations
- PostDoc, Bordeaux, France: performance optimization, large scale simulations, modeling complex kernels, simulating openQCD
Parallel Programming Challenges
- Communications and data placement
- Synchronization of the workers
- Computation duration variability → scalability
- Exploiting hybrid machines
- Choosing granularity → portability of code and performance
(Figure: theory vs. practice of parallel execution)
Different Programming Approaches
Traditional, explicit programming models (MPI, CUDA, OpenMP, pthreads, ...):
+ Perfect control → maximal achievable performance
+ Efficient granularity, advanced numerical features
- Monolithic codes, hard to develop and maintain
- Fixed scheduling, sensitive to variability
- Hard and long to optimize → performance portability is hard to achieve
Recent task-based programming models (PaRSEC, OmpSs, Charm++, StarPU, ...):
+ Single, abstract programming model based on a DAG
+ Runtime system responsible for dynamic scheduling
+ Portability of code and performance
- Introduces runtime system overhead
- Developing an omnipotent runtime → new challenges
(Figure: task DAG with POTRF, TRSM and GEMM nodes)
Task-based Programming Paradigm
Tiled Cholesky Algorithm using Sequential Task Flow (STF):

for (j = 0; j < N; j++) {
    POTRF (RW, A[j][j]);
    for (i = j+1; i < N; i++)
        TRSM (RW, A[i][j], R, A[j][j]);
    for (i = j+1; i < N; i++) {
        SYRK (RW, A[i][i], R, A[i][j]);
        for (k = j+1; k < i; k++)
            GEMM (RW, A[i][k], R, A[i][j], R, A[k][j]);
    }
}
wait();

(Figure: the resulting DAG of GEMM, SYRK, TRSM and POTRF tasks)
The Need for Regular Performance Evaluation
Native experiments: complex systems, wide variety of setups; faithful but expensive.
Models, equations, theory (PRAM, BSP, DAG, scheduling bounds): quick trends, but simplistic.
Simulation: running real code with a machine abstraction.
Advantages:
- Reproducible executions (performance, bugs)
- Predictions on unavailable architectures (extrapolation)
- Richer experimental design possible
Difficulties:
- Implementing more than a simple prototype
- Hard to validate its reliability
Research Statement
Is it possible to perform a clean, coherent, reproducible study of HPC applications executed on top of dynamic task-based runtime systems, using simulation?
Outline
1. Simulating Task-based Applications
   - Coupling StarPU Runtime System and SimGrid Simulator
   - Tackling Irregular Numerical Codes
   - Use Cases
   - Summary
2. Methodology for Reproducible Research
3. Conclusion
StarPU and SimGrid
StarPU (Inria Bordeaux):
- Dynamic runtime system for hybrid architectures (CPU, GPU, MPI)
- Opportunistic scheduling of a task graph, guided by performance models
- Features dense, sparse and FMM applications
SimGrid (Inria Grenoble, Lyon, Rennes, ...):
- Scalable simulation framework for distributed systems
- Sound fluid network models accounting for heterogeneity and contention
- Modeling with threads rather than only trace replay → ability to simulate dynamic applications
- Portable, open source and easily extendable
The same approach could be applied to any task-based runtime system.
Devised Workflow: StarPU + SimGrid
1. Calibration: run StarPU natively once to collect a performance profile.
2. Simulation: feed the profile to StarPU on top of SimGrid and quickly simulate many times.
Implementation Principles
Emulation: executing real applications in a synthetic environment.
Simulation: replacing process execution by delays, using performance models.
- StarPU applications and the runtime system are emulated → similar scheduling decisions
- Thread synchronizations, actual computations, memory allocations and data transfers are simulated → need for good computational-kernel and communication models
- The control part of StarPU is modified to inject runtime system, communication and computation delays into SimGrid
Modeling Runtime System
Simulation delays (increasing simulated time):
- Process synchronizations
- Memory allocations on CPU or GPU
- Submission of data transfer requests
Example for CUDA memory allocation in StarPU:

...
#ifdef STARPU_SIMGRID
    MSG_process_sleep((float) dim * alloc_cost_per_byte);
#else
    if (_starpu_can_submit_cuda_task()) {
        cudaError_t cures;
        cures = cudaHostAlloc(A, dim, cudaHostAllocPortable);
        ...
Modeling Communication
- Components of hybrid platforms have differing characteristics
- Correctly modeling their communication is of primary importance
- Built on exhaustively validated existing SimGrid network models
- Flexible flow-based contention model
(Figures: (a) fatpipe (crude) model, all GPUs sharing one link to the CPU; (b) complete graph (pragmatic) model, one link per CPU-GPU pair)
(Figure: (c) treelike (elaborated) model, with GPUs grouped under intermediate nodes)
Modeling Computation
- Actual computation results are irrelevant: only computation time matters
- Task = kernel for the task-based paradigm
- Execution of tasks is replaced by simulation delays
- Average duration for regular kernels
- Additional techniques to optionally account for variability
(Figure: simulated Cholesky DAG of GEMM, SYRK, TRSM and POTRF tasks)
Overview of Simulation Accuracy
(Figure: native vs. SimGrid GFlop/s for Cholesky and LU as a function of matrix dimension (20K-80K) on seven machines: Hannibal (3 QuadroFX5800), Attila (3 TeslaC2050), Mirage (3 TeslaM2070), Conan (3 TeslaM2075), Frogkepler (2 K20), Pilipili2 (2 K40), Idgraf (8 TeslaC2050))
Publications:
[1] L. Stanisic, S. Thibault, A. Legrand, B. Videau, and J.-F. Méhaut. Faithful Performance Prediction of a Dynamic Task-Based Runtime System for Heterogeneous Multi-Core Architectures. Concurrency and Computation: Practice and Experience, page 16, May 2015.
[2] L. Stanisic, S. Thibault, A. Legrand, B. Videau, and J.-F. Méhaut. Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-core Architectures. In Euro-Par 2014 - 20th International Conference on Parallel Processing, LNCS 8632, pages 50-62, Porto, Portugal, Aug. 2014.
Difference Between Regular and Irregular Kernels
- Regular kernels always use the same block size → their duration is (relatively) stable
- Irregular kernel durations depend on their input parameters → more than simple average values is needed
(Figure: native duration histograms for the Do_subtree, Activate, Panel, Update, Assemble and Deactivate kernels)
Modeling Duration of Complex Computation Kernels
- Using statistical analysis and multiple linear regression
- Extended StarPU to automatically support such models
- Kernel duration estimations are useful for both simulation and native executions (scheduling)
(Figure: native vs. SimGrid duration histograms for the Do_subtree, Activate, Panel, Update, Assemble and Deactivate kernels)
Simulating Irregular Numerical Libraries
- Chameleon solver: dense linear algebra library
- qr_mumps solver: MUMPS multi-frontal factorization
- ScalFMM library: simulating N-body interactions using the FMM
- QDWH solver: QR-based Dynamically Weighted Halley (ongoing)
(Figures: native vs. SimGrid durations for qr_mumps on several matrices (tp-6, karted, EternityII_E, degme, hirlam, e18, TF16, Rucci1, sls, TF17) and for ScalFMM from 2M to 64M particles)
Publications:
[3] L. Stanisic, E. Agullo, A. Buttari, A. Guermouche, A. Legrand, F. Lopez, and B. Videau. Fast and Accurate Simulation of Multithreaded Sparse Linear Algebra Solvers. Parallel and Distributed Systems (ICPADS), Dec. 2015.
[4] E. Agullo, B. Bramas, O. Coulaud, L. Stanisic, and S. Thibault. Modeling Irregular Kernels of Task-based Codes: Illustration with the Fast Multipole Method. Submitted to Transactions on Mathematical Software (TOMS). [Research Report] 9036, INRIA Bordeaux, 35 pages, Feb. 2017.
Comparing Different Schedulers
- Differences between scheduler performances are faithfully predicted
- DMDAR and DMDAS are locality-aware schedulers → fewer transfers between GPUs
(Figure: native vs. SimGrid GFlop/s for the DMDA, DMDAR and DMDAS schedulers, matrix dimension 20K-80K)
Studying Memory Consumption
- Minimizing the memory footprint is very important for such applications
- Remember, scheduling is dynamic, so consecutive native experiments produce different outputs
(Figure: allocated memory (GiB) over time (ms) for four consecutive native experiments)
(Figure: allocated memory over time for three native runs and one SimGrid run, shown side by side)
Extrapolating to Larger Machines
- Predicting performance in an idealized context
- Studying the parallelization limits of the problem
(Figure: native vs. SimGrid duration for 4 to 400 threads, with extrapolation beyond the native machine)
Achievements
- Works great for small hybrid setups with dense, sparse and FMM StarPU applications
- Not only a prototype; already used by other researchers
Our solution makes it possible to:
- Debug applications on a commodity laptop in a reproducible way
- Detect problems with real experiments using reliable comparison
- Test different scheduling alternatives
- Evaluate memory footprint
- Quickly and accurately evaluate the impact of various scheduling/application parameters:

qr_mumps   Cores   RAM       Evaluation   Makespan
Native     40      58.0 GiB  157 s        141 s
SimGrid    1       1.5 GiB   57 s         142 s
Challenges of Experimental Studies in HPC
- Large, hybrid, prototype hardware/software (hard to control)
- Costly experiments with numerous parameters
- Non-deterministic executions (overall duration, traces, ...)
- Workflows specific to the studies (hardly applicable in general)
→ difficult to make research results reproducible
Reproducible Research: Trying to Bridge the Gap
(Figure: the reproducible research pipeline, from scientific question through protocol (design of experiments), experiment code (workload injector, VM recipes, ...), measured data, processing code, analytic data, analysis code, computational results, presentation code, and finally figures, tables, numerical summaries and text in the published article, linking author and reader)
Inspired by Roger D. Peng's lecture on reproducible research, May 2014.
Our approach: use a lightweight combination of existing generic tools.
Experiments
- Full control of the design of experiments
- Automate the process
- Gather as much useful metadata as possible for each experiment
(Figure: metadata gathered per experiment: experiment plan, sequence order, repetitions, memory allocation, operating system, element type, allocation technique, scheduling priority, CPU frequency, core pinning, dedication, optimization, loop unrolling, architecture (Intel, ARM), compilation, kernel, time, cycles, size, stride, bandwidth)
Publication:
[5] L. Stanisic, L. M. Schnorr, A. Degomme, F. Heinrich, A. Legrand, and B. Videau. Characterizing the Performance of Modern Architectures Through Opaque Benchmarks: Pitfalls Learned the Hard Way. Submitted to the International Workshop on Reproducibility in Parallel Computing (REPPAR), 2017.
Analysis
- Write papers/reports with completely reproducible analysis
- Rely on literate programming tools (IPython/Jupyter, Org-mode)
- Modular scripting approach (shell + R)
(Figures: Gantt-chart visualizations of dense Cholesky traces on 25 CPUs and 3 CUDA devices, showing dgemm/dpotrf/dsyrk/dtrsm tasks, idle-time percentages per resource, critical paths, per-iteration task counts, and CPE/ABE bound annotations)
Publication:
[6] V. G. Pinto, L. Stanisic, A. Legrand, L. M. Schnorr, and S. Thibault. Analyzing Dynamic Task-Based Applications on Hybrid Platforms: An Agile Scripting Approach. 3rd Workshop on Visual Performance Analysis (VPA), Nov. 2016, Salt Lake City, United States.
Workflow for the Whole Research Process
- Documentation and experimentation journal (laboratory notebook)
- Unique Git branching system for better project history (e.g., src, data, xp/foo1, xp/foo2 and art/art1 branches)
Publications:
[7] L. Stanisic, A. Legrand, and V. Danjean. An Effective Git And Org-Mode Based Workflow For Reproducible Research. ACM SIGOPS Operating Systems Review, 49:61-70, 2015. Special Topic: Repeatability and Sharing of Experimental Artifacts.
[8] L. Stanisic and A. Legrand. Effective Reproducible Research with Org-Mode and Git. In 1st International Workshop on Reproducibility in Parallel Computing, Porto, Portugal, Aug. 2014.
Achievements
Design:
- Original approach based on well-known tools
- Helps fill the author/reader gap in our context
- Applicable and extendable to other research fields
Application:
- Used this approach for many studies, presentations and papers
- Efficiently handled ≈10,000 experiments (40 GiB) and ≈2,000 commits
Evangelism:
- Our closest colleagues are successfully adopting this approach
- Presented our methods on numerous occasions (RR webinar, conferences, workshops, ANR project meetings, ...)
Experience
- Modeling, simulation and performance evaluation
- Methodology for reproducible research
- Statistical analysis, visualizations
- Code and performance debugging and optimization
- Working with large, hybrid, prototype hardware and software
- Contributions to many large code projects: StarPU (C), SimGrid (C/C++), qr_mumps (C/Fortran), ScalFMM (C++), Chameleon (C/Fortran)
Summary (timeline 2013-2019)
- Benchmarks, basic modeling
- Regular algorithms → dynamic task-based HPC applications
- Research methodology
- Numerical (irregular) libraries
- Performance optimization, large scale executions
- Real-life applications, collaboration with other domain experts
Thank you!
http://mescal.imag.fr/membres/luka.stanisic/
Ongoing Research: Multiple Nodes
- StarPU-MPI + SimGrid for large scale distributed-memory studies
- Requires combining two modules of the SimGrid framework → technically challenging; internals need to be rewritten
- Large number of resources, kernels and communications running in parallel → need to optimize simulator performance
- Multiple network models (PCI bus and Ethernet/InfiniBand) → contention harder to model