
High Performance Computing for Science and Engineering II

22.03.2021 - Lecture: High Throughput Computing

Lecturer: Dr. Sergio Martin

A Recap of Last Week: Introduction to UQ and Optimization Software - overview of Korali's features and installation tutorial

- Optimization with Korali: used CMA-ES to optimize a simple function
- Probability sampling: used MCMC to sample the shape of an unnormalized distribution
- Bayesian inference: used TMCMC to infer the best-fitting parameters of a model given reference data

Please download the rest of the practices (5-11) from the website

Today's Lecture

- Sample Distribution Strategies: Load Imbalance; Divide & Conquer vs. Producer/Consumer
- Korali Tutorial (Part II): Concurrent & Distributed Parallelism; Fault Tolerance & Multi-Experiment Support
- MPI and Sample Distribution: One-Sided Communication; Example (Genome Assembly)

Sample Distribution Strategies

Sample Distribution

[Diagram: Samples 0-7 waiting to be assigned to Node 0, Cores 0-3]

How do we distribute samples to cores?

Divide-And-Conquer Strategy

Regular communication:
- Happens at the beginning of each generation
- Message sizes are well known
- Can use separate messages or a broadcast

Only applicable when the entire workload is known from the beginning.

[Diagram: Samples 0-7 split evenly across Node 0, Cores 0-3 (two samples per core)]

Distribute samples equally (in number) among cores at the start of every generation
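As an illustration of this strategy (not code from the lecture), a minimal mpi4py sketch could look as follows; evaluate_sample() stands for a user-supplied model evaluation:

from mpi4py import MPI

def run_generation(samples, evaluate_sample):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    # The root splits the fully known workload into equally sized chunks
    chunks = [samples[i::size] for i in range(size)] if rank == 0 else None
    # One regular, well-known-size exchange at the start of the generation
    my_samples = comm.scatter(chunks, root=0)
    my_results = [evaluate_sample(s) for s in my_samples]
    # Collect all results back on the root at the end of the generation
    return comm.gather(my_results, root=0)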

Load Imbalance

Parallel Sampler - Single-Core Model

Total Running Time = Max(Core Time)

Load Imbalance Ratio = (Max(Core Time) - Average(Core Time)) / Max(Core Time)

Load imbalance happens when cores receive uneven workloads; it represents a waste of computational power.

[Timeline diagram: Node 0, Cores 0-3, each statically assigned two of Samples 0-7; cores that finish early sit idle until the slowest core completes]
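A quick sketch (not from the slides) of how these two quantities can be computed from measured per-core times:

def load_imbalance(core_times):
    # The total running time is set by the slowest core
    total_time = max(core_times)
    average = sum(core_times) / len(core_times)
    # Fraction of the used core time that is wasted as idle time
    ratio = (total_time - average) / total_time
    return total_time, ratio

# Example: four cores with uneven workloads (times in seconds)
total, ratio = load_imbalance([4.0, 9.0, 5.0, 6.0])
print(total, ratio)  # 9.0 and ~0.33: about a third of the capacity is wasted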

Producer-Consumer Model

[Diagram: Node 0, Core 0 acts as Producer; Cores 1-3 act as Consumers; Samples 0-7 are handed out one at a time]

Assign workload opportunistically as cores become available.

Asynchronous behavior:
- The producer sends samples to workers as soon as they become available
- Workers report back each finished sample and its result
- The producer keeps a queue of available workers

Does not require knowing the entire workload in advance (see the sketch below).
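For illustration only (this is not the lecture's reference code), a bare-bones producer-consumer loop with mpi4py: rank 0 produces work, all other ranks consume it, and a None message is the stop signal; evaluate_sample() is a user-supplied function:

from mpi4py import MPI

def producer(comm, samples):
    status = MPI.Status()
    results = {}
    next_idx, active = 0, 0
    # Seed every consumer with one sample (or a stop signal if there is no work left)
    for worker in range(1, comm.Get_size()):
        if next_idx < len(samples):
            comm.send((next_idx, samples[next_idx]), dest=worker)
            next_idx += 1
            active += 1
        else:
            comm.send(None, dest=worker)
    # Opportunistic assignment: whoever reports a result gets the next sample
    while active > 0:
        idx, result = comm.recv(source=MPI.ANY_SOURCE, status=status)
        results[idx] = result
        if next_idx < len(samples):
            comm.send((next_idx, samples[next_idx]), dest=status.Get_source())
            next_idx += 1
        else:
            comm.send(None, dest=status.Get_source())
            active -= 1
    return results

def consumer(comm, evaluate_sample):
    while True:
        task = comm.recv(source=0)
        if task is None:  # stop signal
            break
        idx, sample = task
        comm.send((idx, evaluate_sample(sample)), dest=0)

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    print(producer(comm, samples=list(range(8))))
else:
    consumer(comm, evaluate_sample=lambda s: s * s)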

Load Imbalance

Parallel Sampler - Single-Core Model

Total Running Time ≈ Mean(Core Time) as the number of samples and cores → infinity

Lost Performance = Producer Cores / Total Cores

[Timeline diagram: the producer core dispatches Samples 0-7 to the consumer cores as they free up, leaving little idle time]

Pop Quiz: Why do we need to sacrifice one worker node?

Pop Quiz: What's the impact on large multi-core systems (Euler = 24 cores)?

Generation-Based Methods: CMA-ES, TMCMC

CMA-ES: all samples for the next generation are known at the end of the previous generation.

TMCMC: samples for the current generation are determined in real time, based on the evaluation of previous chain steps.

Let's Discuss

[Diagrams: producer-consumer assignment (Core 0 as Producer, Cores 1-3 as Consumers) and divide-and-conquer assignment of Samples 0-7 across Node 0, Cores 0-3]

Q1: Is the divide-and-conquer strategy good for CMA-ES? What about TMCMC?

Q2: Is the producer/consumer strategy good for CMA-ES? And for TMCMC?

High Throughput Computing with Korali


Study Case: Heating Plate

Given: a square metal plate with 3 sources of heat underneath it, and ~10 temperature measurements at different locations.

Can we infer the (x,y) locations of the 3 heat sources?

Study Case: Configuration

Experiment:
- Problem: Bayesian Inference
- Model: C++ 2D Heat Equation
- Solver: TMCMC

Parameter Space: Heat Source 1 (x,y), Heat Source 2 (x,y), Heat Source 3 (x,y), Sigma (standard deviation of the likelihood)

Objective Function: Likelihood of the Reference Data

[Diagram: prior probability distributions over X and Y for Heat Sources 1-3, plus the likelihood; configure the experiment, then Run]

Practice 6: Running the Study Case

- Step I: Go to the practice6 folder and analyze its contents.
- Step II: Fill in the missing prior information based on the diagram above.
- Step III: Compile and run experiment "practice6".
- Step IV: Gather information about the possible heat source locations.
- Step V: Plot the posterior distributions.

Parallel Execution

Heterogeneous Model Support

Korali exposes multiple "Conduits": ways to run computational models.

+ Sequential (default): good for simple function-based Python/C++ models
+ Concurrent: for legacy code or pre-compiled applications (e.g., LAMMPS, Matlab, Fortran)
+ Distributed: for MPI/UPC++ distributed models (e.g., Mirheo)

Sequential Conduit: Links to the model code and runs the model sequentially via a function call.

e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Sequential"
k.run(e)

Korali Application

def myModel(sample):
  x = sample["Parameters"][0]
  y = sample["Parameters"][1]
  # ... computation ...
  sample["Evaluation"] = result

Computational Model

$ myKoraliApp.py

Running Application
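Putting the two fragments above into one runnable file, a minimal sketch could look as follows. The slides only show the conduit and objective-function settings; the problem type, variable, and solver keys below (and the toy objective) are assumptions for illustration and may need to be adapted to your Korali version:

import korali

def myModel(sample):
  x = sample["Parameters"][0]
  y = sample["Parameters"][1]
  # Toy objective (assumption): paraboloid with its maximum at (1, -1)
  sample["Evaluation"] = -((x - 1.0) ** 2 + (y + 1.0) ** 2)  # key as written on the slide; some problem types use "F(x)"

e = korali.Experiment()
k = korali.Engine()

e["Problem"]["Type"] = "Optimization"            # assumed problem type
e["Problem"]["Objective Function"] = myModel

for i, name in enumerate(["X", "Y"]):            # assumed variable definitions
  e["Variables"][i]["Name"] = name
  e["Variables"][i]["Lower Bound"] = -10.0
  e["Variables"][i]["Upper Bound"] = +10.0

e["Solver"]["Type"] = "Optimizer/CMAES"          # assumed solver settings
e["Solver"]["Population Size"] = 16

k["Conduit"]["Type"] = "Sequential"
k.run(e)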


Concurrent Conduit: Korali creates multiple concurrent workers to process samples in parallel.

e = korali.Experiment()
k = korali.Engine()
k["Conduit"]["Type"] = "Concurrent"
k["Conduit"]["Concurrent Jobs"] = 4
k.run(e)

Korali Application

$ myKoraliApp.py

Running Application

[Diagram: the Korali main process forks Workers 0-3, distributes the samples among them, and joins their results]

Practice 7: Parallelize the Study Case

- Step I: Go to folder practice7 and use the concurrent conduit to parallelize the code from practice 6.
- Step II: Analyze running times by running with different levels of parallelism.
- Step III: Use the top command to observe the CPU usage while you run the example.

Distributed Execution

Distributed Conduit: Can be used to run applications beyond the limits of a single node (needs MPI).

e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Distributed"
k.run(e)

Korali Application

def myModel(sample, MPIComm):
  x = sample["Parameters"][0]
  y = sample["Parameters"][1]
  # ... local computation ...
  sample["Evaluation"] = result

Computational Model

$ mpirun -n 17 myKoraliApp.py

Running Application

[Diagram: mpirun launches 17 ranks; Ranks 0-15 act as worker ranks spread across Nodes 1-4 (four per node), plus one Korali engine rank]

Distributed Conduit: Links to and runs distributed MPI applications through sub-communicator teams.

e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myMPIModel
k["Conduit"]["Type"] = "Distributed"
k["Conduit"]["Ranks Per Sample"] = 4
k.run(e)

Korali Application

def myMPIModel(sample, comm):
  x = sample["Parameters"][0]
  y = sample["Parameters"][1]
  myRank = comm.Get_rank()
  rankCount = comm.Get_size()
  # ... distributed computation across the sample's sub-communicator ...
  sample["Evaluation"] = result

Computational Model

$ mpirun -n 17 myKoraliApp.py

Running Application
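For illustration (an assumed sketch, not the practice's code), a model that splits its work across the ranks of its team; it assumes the conduit passes each call an mpi4py-style communicator for the sample's team, as in the signature above, and localComputation() is a hypothetical helper:

def myMPIModel(sample, comm):
  x = sample["Parameters"][0]
  y = sample["Parameters"][1]
  myRank = comm.Get_rank()       # rank within this sample's team (0..3 for "Ranks Per Sample" = 4)
  rankCount = comm.Get_size()
  # Each team rank computes a partial result over its share of the work (hypothetical helper)
  partial = localComputation(x, y, myRank, rankCount)
  # Sum the partial results across the team; every team rank receives the total
  sample["Evaluation"] = comm.allreduce(partial)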

[Diagram: Ranks 0-15 are grouped into Sub-communicators 0-3 (one team of four ranks per sample), plus the Korali engine rank]

Korali's Scalable Sampler

[Flow diagram: workers cycle from Idle to Busy to Done as the engine starts the experiment, distributes samples, saves results, checks for termination, and runs the next generation]

Practice 8: MPI-Based Distributed Models

- Step 0: Get/install any MPI library (OpenMPI is open source).
- Step I: Use the distributed conduit to parallelize practice7.
- Step II: Go to folder practice8 and have Korali run the MPI-based model there.
- Step III: Fix the number of MPI ranks (to e.g. 8) and analyze execution times when running different levels of (1) sampling parallelism and (2) model parallelism.
- Step IV: Configure Korali to store profiling information and use the profiler tool to see the evolution of the samples (Using Korali > Tools > Korali Profiler).

Running Out-of-the-Box Applications

Many applications are closed-source or too complicated to interface with directly. For these cases we can run them from inside a model and then gather the results.

import os

def myModel(sample):
  x = sample["Parameters"][0]
  y = sample["Parameters"][1]
  # Launch the external application with the sample's parameters
  os.system("./myApp " + str(x) + " " + str(y))
  # Read the objective value back from the application's output file
  result = parseFile("ResultFile.out")
  sample["F(x)"] = result

Computational Model

e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Concurrent"
k["Conduit"]["Concurrent Jobs"] = 4
k.run(e)

Korali Application

$ myKoraliApp.py

Running Application

[Diagram: the model invokes myApp with parameters x and y; myApp writes ResultFile.out, which parseFile() reads back into Korali]
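parseFile() is not shown on the slides; a plausible sketch, assuming the external application writes a single number to its result file:

def parseFile(path):
  # Return the objective value written by the external application
  with open(path) as f:
    return float(f.read().strip())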

Practice 9: Running Out-of-the-Box Applications

- Step I: Go to folder practice9 and examine the model application (what are its inputs and outputs?).
- Step II: Modify the Korali application's objective model to run the application, specifying its inputs and gathering its output.
- Step III: Run the application with different levels of concurrency.

Running Multiple Experiments

Scheduling Multiple Experiments

[Diagram: the engine starts several experiments at once and interleaves their samples, so workers that would otherwise sit idle stay busy]

Effect of Simultaneous Execution

Running experiments sequentially: average efficiency 73.9%
Running experiments simultaneously: average efficiency 97.8%
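A sketch of the difference; the slides only show single-experiment k.run(e) calls, so passing a list of experiments to the engine below is an assumption based on the multi-experiment support described here, and experimentList stands for a list of already configured korali.Experiment objects:

# Sequential: each experiment must finish before the next one starts
for e in experimentList:
  k.run(e)

# Simultaneous (assumed API): hand all experiments to the engine at once,
# so their samples can interleave and keep the workers busy
k.run(experimentList)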


Practice 10: Running Multiple Experiments

- Step I: Go to folder practice10 and examine the Korali application.
- Step II: Run the application in parallel and use the profiler tool to see how the experiments executed.
- Step III: Change the Korali application to run all experiments simultaneously.
- Step IV: Run and profile the application again and compare the results with those of Step II.

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

[Timeline diagram: Slurm Job 1 (4000 nodes) runs Experiments 0 and 1 under the Korali engine through Generations 0-3 until a fatal failure occurs; Slurm Job 2 (4000 nodes) resumes both experiments and completes Generation 4 through the final generation]

Korali can resume any Solver / Problem / Conduit combination.

Practice 11: Running Multiple Experiments

- Step I: Go to folder practice11 and examine the Korali application.
- Step II: Run the application to completion (10 generations), taking note of the final result.
- Step III: Delete the results folder and change the Korali application to run only the first 5 generations (with this we simulate that an error has occurred).
- Step IV: Now change the application again to run the last 5 generations.
- Step V: Compare the results with those of an uninterrupted run (see the sketch below).
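A sketch of how Steps III-IV might be expressed; the "Termination Criteria" / "Max Generations" key and the assumption that Korali picks up the state saved in the results folder when re-run are API assumptions, not taken from the slides:

# Simulated failure: stop after the first 5 of the 10 generations (assumed key)
e["Solver"]["Termination Criteria"]["Max Generations"] = 5
k.run(e)

# Later run: raise the limit back to 10; because every generation was saved to the
# results folder, Korali is expected to resume from generation 5 instead of restarting
e["Solver"]["Termination Criteria"]["Max Generations"] = 10
k.run(e)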

MPI and Sample Distribution: A Discussion

A Review of MPI

MPI: the de facto communication standard for high-performance scientific applications.

Two-Sided Communication: a sender and a receiver process explicitly participate in the exchange of a message.

[Diagram: MPI_Send() places the message into an intermediate buffer; MPI_Recv() retrieves it on the receiving rank]

A message encodes two pieces of information:
1. The actual message payload (data)
2. The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics: the receiver needs to know what to do with the data.

One-Sided Communication: a process can directly access a shared partition in another address space, using MPI_Put() / MPI_Get().

- Allows passing/receiving data without a corresponding send/recv request
- The other end is not notified of the operation (concurrency hazards)
- Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information: data.
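A minimal one-sided exchange with mpi4py (illustrative, not from the lecture): every rank exposes a memory window, and rank 1 writes directly into rank 0's window with Put, without rank 0 posting a matching receive:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank exposes a small window of memory that other ranks may access directly
buf = np.zeros(4, dtype='d')
win = MPI.Win.Create(buf, comm=comm)

win.Fence()                 # open an access epoch (collective synchronization, not a recv)
if rank == 1:
    data = np.arange(4, dtype='d')
    win.Put(data, 0)        # write straight into rank 0's window
win.Fence()                 # close the epoch; rank 0 never called MPI_Recv

if rank == 0:
    print("Rank 0 window now holds:", buf)
win.Free()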

A Good Case for MPI: Iterative Solvers

Structured Grid Stencil Solver (2D grid): iteratively approaches a solution; ranks exchange halo (boundary) cells; regular communication.

Traditional decomposition: 1 process (rank) per core.

[Diagram: a node with Cores 0-3, one rank per core, each owning a block of the 2D grid]

Most HPC applications are programmed under the Bulk-Synchronous Model: they iterate between separate computation and communication phases.

[Core-usage timeline, conventional decomposition (1 rank per core): each rank alternates between a computation phase (useful computation) and a communication phase (network communication cost plus intra-node data motion cost)]
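A compressed sketch of that bulk-synchronous pattern for a 1D-decomposed 2D stencil (illustrative mpi4py/numpy code, not from the lecture); each iteration has a communication phase (halo exchange) followed by a computation phase (stencil update):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Local block of the 2D grid with one halo row above and one below
u = np.zeros((102, 100))
up = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(100):
    # Communication phase: exchange halo (boundary) rows with the neighboring ranks
    comm.Sendrecv(u[1, :], dest=up, recvbuf=u[-1, :], source=down)
    comm.Sendrecv(u[-2, :], dest=down, recvbuf=u[0, :], source=up)
    # Computation phase: Jacobi-style stencil update on the interior points
    u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])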

A NOT so Good Case for MPI: Genome Assembly

[Diagram: original DNA is sequenced into short reads and re-assembled]

- Construct a genome (chromosome) from a pool of short fragments produced by sequencers.
- Analogy: shred many copies of a book and reconstruct the book by examining the pieces.
- Complications: shreds of other books may be intermixed and can also contain errors.
- Chop the reads into fixed-length fragments (k-mers).
- The k-mers form a De Bruijn graph; traverse the graph to construct longer sequences.
- The graph is stored in a distributed hash table (see the sketch below).

Image/slide credit: Scott B. Baden (Berkeley Lab)
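The distributed hash table can be made concrete with a tiny sketch (illustrative, not from the slides): each k-mer is hashed to an owner rank, so inserting an edge discovered on one rank usually means updating another rank's table, at an unpredictable time and toward an unpredictable target:

def kmers(read, k=4):
    # Chop a read into fixed-length fragments (k-mers)
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def owner_rank(kmer, num_ranks):
    # The rank whose local hash table stores this k-mer's entry
    return hash(kmer) % num_ranks

read = "ACTCGATGCTCAATG"
frags = kmers(read)
for a, b in zip(frags, frags[1:]):
    # Edge a -> b of the De Bruijn graph must be inserted into owner_rank(a)'s table
    print(a, "->", b, "owned by rank", owner_rank(a, num_ranks=2))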

A NOT so Good Case for MPI: Genome Assembly

Initial segment of DNA: ACTCGATGCTCAATG

Build k-mer graphs from independent segments, sharing their hash numbers.

[Diagram: Rank 0 and Rank 1 each hold part of the k-mer graph in their local hash tables (e.g. ACTC->CTCG->TCGA and GATG->ATGC on one rank, TGCT->GCTC and TCAA->CAAT->AATG on the other); when a rank detects a coinciding hash, it updates the remote hash table with the new edge and the k-mer chains are aligned]

Completely asynchronous:
- Detection of coincident hashes
- Asynchronous hash updates

Irregular communication:
- K-mer chain size can vary
- Hash entries need to be allocated in real time (cannot pre-allocate)

Difficult to implement with MPI due to this asynchronicity.

Let's Discuss

[Diagram: divide-and-conquer assignment of Samples 0-7 across Node 0, Cores 0-3]

Q1: Is MPI a good model for the divide-and-conquer strategy?

[Diagram: producer-consumer assignment, with Core 0 as Producer and Cores 1-3 as Consumers]

Q2: Is MPI a good model for the producer/consumer strategy?

Asynchronous communication models might be better in these cases (e.g., UPC++).

A Recap of Last WeekIntroduction to UQ and Optimization Software - Overview of Koralis features and installation tutorial

2

Optimization with Korali - Used CMAES to optimize a simple functionProbability Sampling - Used MCMC to sample the shape of a unnormalized distributionBayesian Inference - Used TMCMC to infer the best fitting parameters of a model giving reference data

Please download the rest of the practices (5-11) from the website

Todays Lecture

Sample Distribution Strategies- Load Imbalance - Divide amp Conquer vs ProducerConsumer

Korali Tutorial (Part II)- Concurrent Distributed Parallelism - Fault Tolerance Multi-Experiment Support

MPI and Sample Distribution- One-Sided Communication - Example (Genome Assembly)

Sample DistributionStrategies

Sample Distribution

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

How do we distribute samples to cores

Sample 0

Divide-And-Conquer Strategy

Regular communication- Happens at the beginning of each generation - Message Sizes Well-known- Can use separate messages or a Broadcast

Only applicable when the entire workload is known from the beginning

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Distribute samples equally (in number) among cores at the start of every generation

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time = Max(Core Time)

Load Imbalance Ratio = Max(Core Time) - Average(Core Time)Max(Core Time)

Idle

Idle

Idle

Happens when cores receive uneven workloadsRepresents a waste of computational power

Sample 0

Sample 2

Sample 4

Sample 6

Sample 1

Sample 3

Sample 5

Sample 7

Producer Consumer Model

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Assign workload opportunistically as coreswork become available

Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers

Does not require the entireknowing the workload in advance

Producer

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite

Lost Performance = ProducerCoresTotalCores

Pop Quiz Why do we need to sacrifice one worker node

Sample 6

Sample 7

Sample 5

Sample 3

Sample 4Sample 0

Sample 1

Sample 2

Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)

Generation-Based Methods

All samples for the next generation are knownat the end of the previous generation

CMA-ES TMCMC

Samples for the current generationare determined in real-time based on the

evaluation of previous chain steps

Lets Discuss

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is the Divide and Conquer Strategy

Good for CMA-ES What about TMCMC

Q2 Is the ProducerConsumer Strategy

Good for CMA-ES And for TMCMC

High Throughput Computing with Korali

13

Study Case Heating Plate

Study Case Heating PlateGiven

A square metal plate with 3 sources of heat underneath it

Can we infer the (xy) locations of the 3 heat sources

We have ~10 temperature measurements at different locations

14

Study Case Configuration

Experiment Problem Bayesian Inference

Model C++ 2D Heat Equation

Solver TMCMC

Run

Heat Source 1

Heat Source 2

Heat Source 3

X Y

Likelihood Probability Distributions

15

Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)

Objective Function Likelihood by Reference Data

Practice 6 Running Study Case

Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Todays Lecture

Sample Distribution Strategies- Load Imbalance - Divide amp Conquer vs ProducerConsumer

Korali Tutorial (Part II)- Concurrent Distributed Parallelism - Fault Tolerance Multi-Experiment Support

MPI and Sample Distribution- One-Sided Communication - Example (Genome Assembly)

Sample DistributionStrategies

Sample Distribution

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

How do we distribute samples to cores

Sample 0

Divide-And-Conquer Strategy

Regular communication- Happens at the beginning of each generation - Message Sizes Well-known- Can use separate messages or a Broadcast

Only applicable when the entire workload is known from the beginning

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Distribute samples equally (in number) among cores at the start of every generation

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time = Max(Core Time)

Load Imbalance Ratio = Max(Core Time) - Average(Core Time)Max(Core Time)

Idle

Idle

Idle

Happens when cores receive uneven workloadsRepresents a waste of computational power

Sample 0

Sample 2

Sample 4

Sample 6

Sample 1

Sample 3

Sample 5

Sample 7

Producer Consumer Model

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Assign workload opportunistically as coreswork become available

Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers

Does not require the entireknowing the workload in advance

Producer

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite

Lost Performance = ProducerCoresTotalCores

Pop Quiz Why do we need to sacrifice one worker node

Sample 6

Sample 7

Sample 5

Sample 3

Sample 4Sample 0

Sample 1

Sample 2

Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)

Generation-Based Methods

All samples for the next generation are knownat the end of the previous generation

CMA-ES TMCMC

Samples for the current generationare determined in real-time based on the

evaluation of previous chain steps

Lets Discuss

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is the Divide and Conquer Strategy

Good for CMA-ES What about TMCMC

Q2 Is the ProducerConsumer Strategy

Good for CMA-ES And for TMCMC

High Throughput Computing with Korali

13

Study Case Heating Plate

Study Case Heating PlateGiven

A square metal plate with 3 sources of heat underneath it

Can we infer the (xy) locations of the 3 heat sources

We have ~10 temperature measurements at different locations

14

Study Case Configuration

Experiment Problem Bayesian Inference

Model C++ 2D Heat Equation

Solver TMCMC

Run

Heat Source 1

Heat Source 2

Heat Source 3

X Y

Likelihood Probability Distributions

15

Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)

Objective Function Likelihood by Reference Data

Practice 6 Running Study Case

Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Sample DistributionStrategies

Sample Distribution

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

How do we distribute samples to cores

Sample 0

Divide-And-Conquer Strategy

Regular communication- Happens at the beginning of each generation - Message Sizes Well-known- Can use separate messages or a Broadcast

Only applicable when the entire workload is known from the beginning

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Distribute samples equally (in number) among cores at the start of every generation

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time = Max(Core Time)

Load Imbalance Ratio = Max(Core Time) - Average(Core Time)Max(Core Time)

Idle

Idle

Idle

Happens when cores receive uneven workloadsRepresents a waste of computational power

Sample 0

Sample 2

Sample 4

Sample 6

Sample 1

Sample 3

Sample 5

Sample 7

Producer Consumer Model

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Assign workload opportunistically as coreswork become available

Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers

Does not require the entireknowing the workload in advance

Producer

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite

Lost Performance = ProducerCoresTotalCores

Pop Quiz Why do we need to sacrifice one worker node

Sample 6

Sample 7

Sample 5

Sample 3

Sample 4Sample 0

Sample 1

Sample 2

Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)

Generation-Based Methods

All samples for the next generation are knownat the end of the previous generation

CMA-ES TMCMC

Samples for the current generationare determined in real-time based on the

evaluation of previous chain steps

Lets Discuss

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is the Divide and Conquer Strategy

Good for CMA-ES What about TMCMC

Q2 Is the ProducerConsumer Strategy

Good for CMA-ES And for TMCMC

High Throughput Computing with Korali

13

Study Case Heating Plate

Study Case Heating PlateGiven

A square metal plate with 3 sources of heat underneath it

Can we infer the (xy) locations of the 3 heat sources

We have ~10 temperature measurements at different locations

14

Study Case Configuration

Experiment Problem Bayesian Inference

Model C++ 2D Heat Equation

Solver TMCMC

Run

Heat Source 1

Heat Source 2

Heat Source 3

X Y

Likelihood Probability Distributions

15

Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)

Objective Function Likelihood by Reference Data

Practice 6 Running Study Case

Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Sample Distribution

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

How do we distribute samples to cores

Sample 0

Divide-And-Conquer Strategy

Regular communication- Happens at the beginning of each generation - Message Sizes Well-known- Can use separate messages or a Broadcast

Only applicable when the entire workload is known from the beginning

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Distribute samples equally (in number) among cores at the start of every generation

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time = Max(Core Time)

Load Imbalance Ratio = Max(Core Time) - Average(Core Time)Max(Core Time)

Idle

Idle

Idle

Happens when cores receive uneven workloadsRepresents a waste of computational power

Sample 0

Sample 2

Sample 4

Sample 6

Sample 1

Sample 3

Sample 5

Sample 7

Producer Consumer Model

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Assign work opportunistically as cores become available.

Asynchronous behavior:
- The producer sends samples to workers as soon as the workers become available
- Workers report back each finished sample and its result
- The producer keeps a queue of available workers

Does not require knowing the entire workload in advance.
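
A minimal producer/consumer sketch with mpi4py follows (an illustration of the pattern only, not Korali's actual scheduler; the tags, names, and the evaluate() stand-in are hypothetical). Rank 0 produces samples and refills whichever worker reports back first:

# producer_consumer.py - illustrative sketch, not Korali's implementation
from mpi4py import MPI

WORK, STOP = 1, 2   # message tags

def evaluate(sample):
    return sample * sample   # stand-in for the real model

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:                                   # producer
    samples = list(range(8))                    # assumes at least one sample per worker
    status = MPI.Status()
    for worker in range(1, size):               # prime every worker with one sample
        comm.send(samples.pop(0), dest=worker, tag=WORK)
    results = []
    pending = size - 1
    while pending > 0:
        result = comm.recv(source=MPI.ANY_SOURCE, status=status)
        results.append(result)
        worker = status.Get_source()
        if samples:                             # refill the worker opportunistically
            comm.send(samples.pop(0), dest=worker, tag=WORK)
        else:                                   # no work left: release the worker
            comm.send(None, dest=worker, tag=STOP)
            pending -= 1
    print("results:", results)
else:                                           # consumer
    status = MPI.Status()
    while True:
        sample = comm.recv(source=0, status=status)
        if status.Get_tag() == STOP:
            break
        comm.send(evaluate(sample), dest=0)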

Load Imbalance

[Timeline diagram: Parallel Sampler, single-core model; one producer core and three consumer cores on Node 0]

Total Running Time ≈ Mean(Core Time) as the number of samples and cores → infinity

Lost Performance = Producer Cores / Total Cores

Pop Quiz: Why do we need to sacrifice one worker node?

[Diagram: Samples 0-7 distributed dynamically across the consumer cores]

Pop Quiz: What's the impact on large multi-core systems (Euler = 24 cores)?

Generation-Based Methods

CMA-ES: All samples for the next generation are known at the end of the previous generation.

TMCMC: Samples for the current generation are determined in real time, based on the evaluation of previous chain steps.

Let's Discuss

[Diagrams: the producer/consumer layout (Core 0 as producer, Cores 1-3 as consumers) and the divide-and-conquer layout (Samples 0-7 split evenly across Node 0, Cores 0-3)]

Q1: Is the divide-and-conquer strategy good for CMA-ES? What about TMCMC?

Q2: Is the producer/consumer strategy good for CMA-ES? And for TMCMC?

High Throughput Computing with Korali

Study Case: Heating Plate

Given: a square metal plate with 3 sources of heat underneath it, and ~10 temperature measurements at different locations.

Can we infer the (x,y) locations of the 3 heat sources?

Study Case Configuration

Experiment:
- Problem: Bayesian Inference
- Model: C++ 2D Heat Equation
- Solver: TMCMC
- Run

[Diagram: the plate with Heat Sources 1-3 at their (X,Y) positions; prior and likelihood probability distributions]

Parameter Space: Heat Source 1 (x,y), Heat Source 2 (x,y), Heat Source 3 (x,y), Sigma (standard deviation of the likelihood)

Objective Function: Likelihood of the reference data

Practice 6: Running the Study Case

Step I: Go to the practice6 folder and analyze its contents.
Step II: Fill in the missing prior information based on the diagram below.
Step III: Compile and run experiment "practice6".
Step IV: Gather information about the possible heat source locations.
Step V: Plot the posterior distributions.

Parallel Execution

Heterogeneous Model Support

Korali exposes multiple "Conduits": different ways to run computational models.

+ Sequential (default): good for simple function-based Python/C++ models

+ Concurrent: for legacy code or pre-compiled applications (e.g. LAMMPS, Matlab, Fortran)

+ Distributed: for MPI/UPC++ distributed models (e.g. Mirheo)

Sequential Conduit: Links to the model code and runs the model sequentially via a function call.

Korali Application:
import korali
e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Sequential"
k.run(e)

Computational Model:
def myModel(sample):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    # ... computation ...
    sample["Evaluation"] = result

Running the Application:
$ ./myKoraliApp.py

Concurrent Conduit: Korali creates multiple concurrent workers to process samples in parallel.

Korali Application:
e = korali.Experiment()
k = korali.Engine()
k["Conduit"]["Type"] = "Concurrent"
k["Conduit"]["Concurrent Jobs"] = 4
k.run(e)

Running the Application:
$ ./myKoraliApp.py

[Diagram: the Korali main process forks Workers 0-3, hands samples to them, and joins the results]
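
For intuition, the fork/join pattern in the diagram can be imitated with Python's standard multiprocessing module (a toy sketch only; the Concurrent conduit manages its own forked workers and is not implemented this way). The objective function and sample values are hypothetical.

# fork_join_demo.py - intuition for the fork/join worker pattern
from multiprocessing import Pool

def evaluate(sample):
    x, y = sample
    return -(x * x + y * y)       # stand-in objective function

if __name__ == "__main__":
    samples = [(i * 0.1, i * 0.2) for i in range(8)]
    with Pool(processes=4) as pool:            # analogous to "Concurrent Jobs" = 4
        results = pool.map(evaluate, samples)  # fork the work out, join the results
    print(results)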

Practice 7: Parallelize the Study Case

Step I: Go to folder practice7 and use the concurrent conduit to parallelize the code from practice 6.
Step II: Analyze running times by running with different levels of parallelism.
Step III: Use the top command to observe the CPU usage while you run the example.

Distributed Execution

Distributed Conduit: Can be used to run applications beyond the limits of a single node (needs MPI).

Korali Application:
e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Distributed"
k.run(e)

Computational Model:
def myModel(sample, comm):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    # ... local computation ...
    sample["Evaluation"] = result

Running the Application:
$ mpirun -n 17 ./myKoraliApp.py

[Diagram: Ranks 0-15, spread over Nodes 1-4, act as workers; one additional rank runs the Korali engine]

Distributed Conduit: Links to and runs distributed MPI applications through sub-communicator teams.

Korali Application:
e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myMPIModel
k["Conduit"]["Type"] = "Distributed"
k["Conduit"]["Ranks Per Sample"] = 4
k.run(e)

Computational Model:
def myMPIModel(sample, comm):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    myRank = comm.Get_rank()
    rankCount = comm.Get_size()
    # ... distributed computation over the sample's sub-communicator ...
    sample["Evaluation"] = result

Running the Application:
$ mpirun -n 17 ./myKoraliApp.py

[Diagram: Ranks 0-15 grouped into Subcomms 0-3 (4 ranks per sample); one additional rank runs the Korali engine; teams sit idle until samples arrive]
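
Conceptually, grouping worker ranks into teams of "Ranks Per Sample" corresponds to an MPI communicator split. The mpi4py sketch below is purely illustrative and is not Korali's internal code; names and the team size are assumptions.

# team_split.py - illustrative sketch of sub-communicator teams
from mpi4py import MPI

RANKS_PER_SAMPLE = 4

world = MPI.COMM_WORLD
rank = world.Get_rank()

team_id = rank // RANKS_PER_SAMPLE          # e.g. ranks 0-3 form team 0, ranks 4-7 form team 1, ...
team = world.Split(team_id, rank)           # one sub-communicator per team

print("world rank", rank, "is local rank", team.Get_rank(),
      "of", team.Get_size(), "in team", team_id)

# each team would now evaluate one sample cooperatively,
# e.g. with a domain decomposition over team.Get_size() ranks
team.Free()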

Korali's Scalable Sampler

[Diagram: Start Experiment → the engine sends samples to the worker teams (Idle → Busy → Done) → Save Results, Check For Termination → Run Next Generation]

Practice 8: MPI-Based Distributed Models

Step 0: Get/install any MPI library (OpenMPI is open-source).
Step I: Use the distributed conduit to parallelize practice7.
Step II: Go to folder practice8 and have Korali run the MPI-based model there.
Step III: Fix the number of MPI ranks (to e.g. 8) and analyze execution times by running different levels of 1) sampling parallelism and 2) model parallelism.
Step IV: Configure Korali to store profiling information and use the profiler tool (Korali > Tools > Korali Profiler) to see the evolution of the samples.

Running Out-of-the-Box Applications

Many applications are closed-source or too complicated to interface with directly. For these cases we can run them from inside a model and then gather the results.

Korali Application:
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Concurrent"
k["Conduit"]["Concurrent Jobs"] = 4
k.run(e)

Computational Model (pseudo-code as shown on the slide):
def myModel(sample):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    os.shell.run(myApp + x + y)           # launch the external application
    result = parseFile("ResultFile.out")  # gather its output
    sample["F(x)"] = result

Running the Application:
$ ./myKoraliApp.py

[Diagram: the model invokes ./myApp x y, the application writes ResultFile.out, and parseFile(ResultFile.out) extracts the result]
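
A runnable variant of the pseudo-code above could use Python's subprocess module. This is a sketch under the assumption that the external program ./myApp takes x and y as command-line arguments and writes a single number to ResultFile.out; the actual application used in the practice may differ.

# run_external.py - hedged sketch of wrapping a closed-source application
import subprocess

def parseFile(path):
    # assumption: the application writes a single number to its output file
    with open(path) as f:
        return float(f.read().strip())

def myModel(sample):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    # launch the external application with the sample parameters as arguments
    subprocess.run(["./myApp", str(x), str(y)], check=True)
    sample["F(x)"] = parseFile("ResultFile.out")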

Practice 9: Running Out-of-the-Box Applications

Step I: Go to folder practice9 and examine the model application (what are its inputs and outputs?).
Step II: Modify the Korali application's objective model to run the application, specifying its inputs and gathering its output.
Step III: Run the application with different levels of concurrency.

Running Multiple Experiments

Scheduling Multiple Experiments

[Diagram: two experiments submit samples to the same pool of workers; workers cycle Idle → Busy → Done as samples from either experiment arrive]

Effect of Simultaneous Execution

Running experiments sequentially: average efficiency 73.9%

Running experiments simultaneously: average efficiency 97.8%
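
The gain comes from letting the samples of all experiments share one worker pool, so workers that would sit idle at the end of one experiment's generation pick up another experiment's samples instead. A toy illustration of sharing a pool (not Korali code; experiment names and costs are hypothetical):

# shared_pool_demo.py - toy illustration of sharing one worker pool
import time
from concurrent.futures import ProcessPoolExecutor

def evaluate(args):
    exp_id, sample = args
    time.sleep(0.01 * (sample % 5))      # samples have uneven cost
    return exp_id, sample

if __name__ == "__main__":
    exp0 = [("exp0", s) for s in range(16)]
    exp1 = [("exp1", s) for s in range(16)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        # interleave both experiments' samples on the same worker pool
        results = list(pool.map(evaluate, exp0 + exp1))
    print(len(results), "samples evaluated")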

Practice 10: Running Multiple Experiments

Step I: Go to folder practice10 and examine the Korali application.
Step II: Run the application in parallel and use the profiler tool to see how the experiments executed.
Step III: Change the Korali application to run all experiments simultaneously.
Step IV: Run and profile the application again and compare the results with those of Step II.

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation.

[Timeline diagram: Slurm Job 1 (4000 nodes) runs Experiments 0 and 1, with the Korali engine saving Generations 0-3, until a fatal failure occurs; Slurm Job 2 (4000 nodes) resumes both experiments from the last saved generation and runs Generation 4 to the final result]

Korali can resume any Solver / Problem / Conduit combination.
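
The idea can be pictured with a generic checkpoint loop: save the full state after every generation and, on restart, continue from the last saved state. The sketch below is a toy illustration of that idea only; it is not Korali's state format or resume API (practice11 uses Korali's own mechanism), and the state file name and fields are hypothetical.

# checkpoint_demo.py - toy sketch of generation-level checkpointing
import json, os

STATE_FILE = "state.json"

def run_generation(state):
    state["generation"] += 1
    state["best"] = min(state.get("best", 1e9), 1.0 / state["generation"])
    return state

# resume from the last saved generation if a state file exists
state = {"generation": 0}
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        state = json.load(f)

while state["generation"] < 10:
    state = run_generation(state)
    with open(STATE_FILE, "w") as f:      # save the entire state every generation
        json.dump(state, f)

print("finished at generation", state["generation"], "best =", state["best"])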

Practice 11: Running Multiple Experiments

Step I: Go to folder practice11 and examine the Korali application.
Step II: Run the application to completion (10 generations), taking note of the final result.
Step III: Delete the results folder and change the Korali application to run only the first 5 generations (with this we simulate that an error has occurred).
Step IV: Now change the application again to run the last 5 generations.
Step V: Compare the results with those of an uninterrupted run.

MPI and Sample Distribution: A Discussion

A Review of MPI
MPI is the de facto communication standard for high-performance scientific applications.

Two-sided Communication: a sender and a receiver process explicitly participate in the exchange of a message.
[Diagram: MPI_Send() → intermediate buffer → MPI_Recv()]

A message encodes two pieces of information:
1. The actual message payload (data)
2. The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics: the receiver needs to know what to do with the data.
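
A minimal two-sided exchange, written with mpi4py for illustration (the slide uses the C API names MPI_Send/MPI_Recv; the payload here is hypothetical):

# two_sided.py - minimal two-sided message exchange (run with at least 2 ranks)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    payload = {"sample": 3, "parameters": [0.1, 0.2]}
    comm.send(payload, dest=1, tag=7)       # the sender participates explicitly...
elif rank == 1:
    msg = comm.recv(source=0, tag=7)        # ...and so does the receiver (synchronization)
    # the receiver must know what the data means: the message carries no semantics
    print("rank 1 received", msg)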

One-sided Communication: a process can directly access a shared partition in another process's address space via MPI_Put() / MPI_Get().
- Allows passing/receiving data without a corresponding send/recv request
- The other end is not notified of the operation (concurrency hazards)
- Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information: data.
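
In mpi4py, a one-sided put looks roughly like the sketch below (illustrative only; it assumes mpi4py and NumPy are available and uses fence epochs as the simplest form of synchronization):

# one_sided.py - minimal one-sided put into another rank's window (run with at least 2 ranks)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.zeros(4, dtype='d')                 # memory this rank exposes to the others
win = MPI.Win.Create(local, comm=comm)

win.Fence()
if rank == 0:
    data = np.arange(4, dtype='d')
    win.Put([data, MPI.DOUBLE], 1)             # write into rank 1's window; rank 1 posts no receive
win.Fence()

if rank == 1:
    print("rank 1 window now holds", local)    # the data arrived without a recv call
win.Free()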

A Good Case for MPI: Iterative Solvers

Structured Grid Stencil Solver (2D grid): iteratively approaches a solution; ranks exchange halo (boundary) cells; communication is regular.

Traditional decomposition: 1 process (rank) per core.
[Diagram: one node with Cores 0-3, each rank owning a block of the 2D grid]

Most HPC applications are programmed under the Bulk-Synchronous Model: they iterate between separate computation and communication phases.

[Core usage timeline, conventional decomposition (1 rank per core): useful computation during the computation phase, followed by network communication cost and intra-node data motion cost during the communication phase]
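
A bulk-synchronous stencil iteration with halo exchange, sketched with mpi4py over a 1D strip decomposition (periodic boundaries and made-up coefficients, purely for illustration):

# halo_exchange.py - bulk-synchronous 1D stencil sketch
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

u = np.zeros(12)        # local strip: 10 interior cells plus one halo cell on each side
u[1:-1] = rank          # some initial data

for step in range(5):
    # communication phase: exchange halo (boundary) cells with the neighbors
    u[0]  = comm.sendrecv(u[-2], dest=right, source=left)   # right edge goes right, left halo comes from the left
    u[-1] = comm.sendrecv(u[1],  dest=left,  source=right)  # left edge goes left, right halo comes from the right
    # computation phase: 3-point stencil update on the interior cells
    u[1:-1] = 0.5 * u[1:-1] + 0.25 * (u[:-2] + u[2:])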

A NOT so Good Case for MPI: Genome Assembly

Construct a genome (chromosome) from a pool of short fragments produced by sequencers.
Analogy: shred many copies of a book and reconstruct the book by examining the pieces. Complications: shreds of other books may be intermixed, and shreds can also contain errors.
Chop the reads into fixed-length fragments (k-mers). The k-mers form a De Bruijn graph; traverse the graph to construct longer sequences. The graph is stored in a distributed hash table.

[Diagram: original DNA → short reads → re-assembled DNA]

Image/Slide Credit: Scott B. Baden (Berkeley Lab)

A NOT so Good Case for MPI: Genome Assembly

Initial segment of DNA: ACTCGATGCTCAATG

[Diagram: Rank 0 and Rank 1 build k-mer graphs from independent segments, sharing their hash values.
Hash table for Rank 0: TGCT->GCTC, TCAA->CAAT->AATG
Hash table for Rank 1: GATG->ATGC, ACTC->CTCG->TCGA, TGTC->GCTC->CTCA->TCAA
The ranks detect coinciding hashes, detect new edges, and update their hash tables; after aligning k-mers, Rank 1 holds TGTC->GCTC->CTCA->TCAA->CAAT->AATG while Rank 0 keeps TGCT->GCTC]

Completely asynchronous:
- Detection of coincident hashes
- Asynchronous hash updates

Irregular communication:
- K-mer chain sizes can vary
- Hash entries must be allocated in real time (cannot pre-allocate)

Difficult to implement with MPI due to this asynchronicity.
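
For intuition, a distributed hash-table update can be sketched as "hash each k-mer to an owner rank and ship the edge there". Even this toy version (mpi4py, illustrative only; the k-mers are taken from the slide) needs a collective step per round so that every rank learns how much data to expect, which is exactly the mismatch with the irregular, asynchronous updates of a real assembler:

# dht_sketch.py - toy distributed hash-table insert for k-mers (illustrative only)
import zlib
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def owner(kmer):
    return zlib.crc32(kmer.encode()) % size       # deterministic owner rank for this k-mer

# edges (kmer -> next kmer) discovered locally by this rank
my_edges = [("ACTC", "CTCG"), ("GATG", "ATGC")] if rank == 0 else [("TGCT", "GCTC")]

# bucket the updates by owning rank
outgoing = [[] for _ in range(size)]
for kmer, nxt in my_edges:
    outgoing[owner(kmer)].append((kmer, nxt))

# with two-sided MPI every rank must first learn how much it will receive, so even
# this toy needs a global exchange (alltoall) per round; real assemblers generate
# irregular, asynchronous updates that do not fit this bulk-synchronous mold
incoming = comm.alltoall(outgoing)

local_table = {}
for batch in incoming:
    for kmer, nxt in batch:
        local_table[kmer] = nxt

print("rank", rank, "owns", local_table)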

Let's Discuss

[Diagrams: the divide-and-conquer layout (Samples 0-7 split evenly across Node 0, Cores 0-3) and the producer/consumer layout (Core 0 as producer, Cores 1-3 as consumers)]

Q1: Is MPI a good model for the divide-and-conquer strategy?

Q2: Is MPI a good model for the producer/consumer strategy?

Asynchronous communication models might be better in these cases (e.g. UPC++).

Divide-And-Conquer Strategy

Regular communication- Happens at the beginning of each generation - Message Sizes Well-known- Can use separate messages or a Broadcast

Only applicable when the entire workload is known from the beginning

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Distribute samples equally (in number) among cores at the start of every generation

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time = Max(Core Time)

Load Imbalance Ratio = Max(Core Time) - Average(Core Time)Max(Core Time)

Idle

Idle

Idle

Happens when cores receive uneven workloadsRepresents a waste of computational power

Sample 0

Sample 2

Sample 4

Sample 6

Sample 1

Sample 3

Sample 5

Sample 7

Producer Consumer Model

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Assign workload opportunistically as coreswork become available

Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers

Does not require the entireknowing the workload in advance

Producer

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite

Lost Performance = ProducerCoresTotalCores

Pop Quiz Why do we need to sacrifice one worker node

Sample 6

Sample 7

Sample 5

Sample 3

Sample 4Sample 0

Sample 1

Sample 2

Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)

Generation-Based Methods

All samples for the next generation are knownat the end of the previous generation

CMA-ES TMCMC

Samples for the current generationare determined in real-time based on the

evaluation of previous chain steps

Lets Discuss

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is the Divide and Conquer Strategy

Good for CMA-ES What about TMCMC

Q2 Is the ProducerConsumer Strategy

Good for CMA-ES And for TMCMC

High Throughput Computing with Korali

13

Study Case Heating Plate

Study Case Heating PlateGiven

A square metal plate with 3 sources of heat underneath it

Can we infer the (xy) locations of the 3 heat sources

We have ~10 temperature measurements at different locations

14

Study Case Configuration

Experiment Problem Bayesian Inference

Model C++ 2D Heat Equation

Solver TMCMC

Run

Heat Source 1

Heat Source 2

Heat Source 3

X Y

Likelihood Probability Distributions

15

Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)

Objective Function Likelihood by Reference Data

Practice 6 Running Study Case

Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time = Max(Core Time)

Load Imbalance Ratio = Max(Core Time) - Average(Core Time)Max(Core Time)

Idle

Idle

Idle

Happens when cores receive uneven workloadsRepresents a waste of computational power

Sample 0

Sample 2

Sample 4

Sample 6

Sample 1

Sample 3

Sample 5

Sample 7

Producer Consumer Model

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Assign workload opportunistically as coreswork become available

Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers

Does not require the entireknowing the workload in advance

Producer

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite

Lost Performance = ProducerCoresTotalCores

Pop Quiz Why do we need to sacrifice one worker node

Sample 6

Sample 7

Sample 5

Sample 3

Sample 4Sample 0

Sample 1

Sample 2

Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)

Generation-Based Methods

All samples for the next generation are knownat the end of the previous generation

CMA-ES TMCMC

Samples for the current generationare determined in real-time based on the

evaluation of previous chain steps

Lets Discuss

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is the Divide and Conquer Strategy

Good for CMA-ES What about TMCMC

Q2 Is the ProducerConsumer Strategy

Good for CMA-ES And for TMCMC

High Throughput Computing with Korali

13

Study Case Heating Plate

Study Case Heating PlateGiven

A square metal plate with 3 sources of heat underneath it

Can we infer the (xy) locations of the 3 heat sources

We have ~10 temperature measurements at different locations

14

Study Case Configuration

Experiment Problem Bayesian Inference

Model C++ 2D Heat Equation

Solver TMCMC

Run

Heat Source 1

Heat Source 2

Heat Source 3

X Y

Likelihood Probability Distributions

15

Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)

Objective Function Likelihood by Reference Data

Practice 6 Running Study Case

Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Producer Consumer Model

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Assign workload opportunistically as coreswork become available

Asynchronous Behavior- Producer sends samples to workers as soon as they become available- Workers report back finished sample and its result- Producer keeps a queue of available workers

Does not require the entireknowing the workload in advance

Producer

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite

Lost Performance = ProducerCoresTotalCores

Pop Quiz Why do we need to sacrifice one worker node

Sample 6

Sample 7

Sample 5

Sample 3

Sample 4Sample 0

Sample 1

Sample 2

Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)

Generation-Based Methods

All samples for the next generation are knownat the end of the previous generation

CMA-ES TMCMC

Samples for the current generationare determined in real-time based on the

evaluation of previous chain steps

Lets Discuss

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is the Divide and Conquer Strategy

Good for CMA-ES What about TMCMC

Q2 Is the ProducerConsumer Strategy

Good for CMA-ES And for TMCMC

High Throughput Computing with Korali

13

Study Case Heating Plate

Study Case Heating PlateGiven

A square metal plate with 3 sources of heat underneath it

Can we infer the (xy) locations of the 3 heat sources

We have ~10 temperature measurements at different locations

14

Study Case Configuration

Experiment Problem Bayesian Inference

Model C++ 2D Heat Equation

Solver TMCMC

Run

Heat Source 1

Heat Source 2

Heat Source 3

X Y

Likelihood Probability Distributions

15

Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)

Objective Function Likelihood by Reference Data

Practice 6 Running Study Case

Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Producer

Load Imbalance

Node 0

Node 0

Node 0

Node 0

Parallel Sampler - Single-Core Model

Total Running Time asymp Mean(Core Time) as sample size and cores rarr Infinite

Lost Performance = ProducerCoresTotalCores

Pop Quiz Why do we need to sacrifice one worker node

Sample 6

Sample 7

Sample 5

Sample 3

Sample 4Sample 0

Sample 1

Sample 2

Pop Quiz Whats the impact on large multi-core systems (Euler = 24 cores)

Generation-Based Methods

All samples for the next generation are knownat the end of the previous generation

CMA-ES TMCMC

Samples for the current generationare determined in real-time based on the

evaluation of previous chain steps

Lets Discuss

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is the Divide and Conquer Strategy

Good for CMA-ES What about TMCMC

Q2 Is the ProducerConsumer Strategy

Good for CMA-ES And for TMCMC

High Throughput Computing with Korali

13

Study Case Heating Plate

Study Case Heating PlateGiven

A square metal plate with 3 sources of heat underneath it

Can we infer the (xy) locations of the 3 heat sources

We have ~10 temperature measurements at different locations

14

Study Case Configuration

Experiment Problem Bayesian Inference

Model C++ 2D Heat Equation

Solver TMCMC

Run

Heat Source 1

Heat Source 2

Heat Source 3

X Y

Likelihood Probability Distributions

15

Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)

Objective Function Likelihood by Reference Data

Practice 6 Running Study Case

Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10: Running Multiple Experiments

Step I: Go to folder practice10 and examine the Korali application.

33

Step II: Run the application in parallel and use the profiler tool to see how the experiments executed.

Step III: Change the Korali application to run all experiments simultaneously.

Step IV: Run and profile the application again and compare the results with those of Step II.

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

[Diagram: timeline over hours — Slurm Job 1 (4000 nodes) runs Experiments 0 and 1 under the Korali engine through Generations 0-3 until a fatal failure; Slurm Job 2 (4000 nodes) restarts the engine, resumes both experiments at Generation 4, and runs them to their final generation.]

Korali can resume any Solver / Problem / Conduit combination.

35

Practice 11: Resuming Previous Experiments

Step I: Go to folder practice11 and examine the Korali application.

36

Step II: Run the application to completion (10 generations), taking note of the final result.

Step III: Delete the results folder and change the Korali application to run only the first 5 generations (with this we simulate that an error has occurred).

Step IV: Now change the application again to run the last 5 generations (a sketch of this resume workflow follows the steps).

Step V: Compare the results with those of an uninterrupted run.
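A hedged sketch of the practice11 workflow. The loadState() call and the result path follow Korali's checkpoint/resume examples and should be treated as assumptions; the problem, variable, and solver configuration is omitted.

import korali

e = korali.Experiment()
# ... problem, variables and solver configuration as in practice11 ...
e["File Output"]["Path"] = "_korali_result"
e["Solver"]["Termination Criteria"]["Max Generations"] = 5   # first (interrupted) run

# On a rerun, load whatever state was saved before the failure; on the very
# first run there is nothing to load and the experiment starts from scratch.
e.loadState("_korali_result/latest")

k = korali.Engine()
k.run(e)

# To finish the remaining generations, raise the budget (e.g. to 10) and run
# the same script again: Korali continues from the last saved generation.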

MPI and Sample Distribution: A Discussion

Two-sided communication: a sender and a receiver process explicitly participate in the exchange of a message.

[Diagram: MPI_Send() passes the message through an intermediate buffer to MPI_Recv().]

A message encodes two pieces of information:
1. The actual message payload (data)
2. The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics: the receiver needs to know what to do with the data.

MPI: the de facto communication standard for high-performance scientific applications.

A Review of MPI

One-sided communication: a process can directly access a shared memory partition in another rank's address space.

MPI_Put() / MPI_Get()

One-Sided Communication

- Allows passing/receiving data without a corresponding send/recv request
- The other end is not notified of the operation (concurrency hazards)
- Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information: data.
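As an illustration (using mpi4py, not part of the lecture code), the sketch below exposes an array through an RMA window and lets rank 0 write into rank 1's memory without rank 1 posting a receive; the fences only delimit the access epoch.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

table = np.zeros(8, dtype='d')          # memory each rank exposes for RMA
win = MPI.Win.Create(table, comm=comm)

win.Fence()                             # open an access epoch (collective)
if rank == 0:
    update = np.full(8, 42.0)
    win.Put(update, 1)                  # write directly into rank 1's window
win.Fence()                             # close the epoch; the data is now visible

if rank == 1:
    print("rank 1 sees:", table)
win.Free()

Run with, for example, mpirun -n 2 python one_sided.py (the script name is just a placeholder).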

A Good Case for MPI: Iterative Solvers

Traditional decomposition: 1 process (rank) per core.

[Diagram: a node with Cores 0-3, each core running one rank of a structured-grid stencil solver over a 2D grid. The solver iteratively approaches a solution, ranks exchange halo (boundary) cells, and the communication pattern is regular.]

Most HPC applications are programmed under the bulk-synchronous model: they iterate between separate computation and communication phases.

[Diagram: core-usage timeline for the conventional decomposition (1 rank per core), alternating computation and communication phases; the legend distinguishes useful computation, network communication cost, and intra-node data motion cost.]
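A hedged mpi4py sketch of this pattern for a 1-D row decomposition of a 2-D Jacobi stencil: each iteration first runs the communication phase (exchange of halo rows with the neighbouring ranks) and then the computation phase (stencil update of the interior points). Grid sizes and the iteration count are arbitrary.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nx, ny = 64, 64                          # local block size (rows x columns)
u = np.zeros((nx + 2, ny))               # +2 ghost rows for the halo
up   = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for it in range(100):
    # Communication phase: exchange halo (boundary) rows with both neighbours
    comm.Sendrecv(np.ascontiguousarray(u[1]),  dest=up,   sendtag=0,
                  recvbuf=u[nx + 1], source=down, recvtag=0)
    comm.Sendrecv(np.ascontiguousarray(u[nx]), dest=down, sendtag=1,
                  recvbuf=u[0],      source=up,   recvtag=1)
    # Computation phase: Jacobi update of the interior points
    u[1:nx + 1, 1:-1] = 0.25 * (u[0:nx, 1:-1] + u[2:nx + 2, 1:-1]
                                + u[1:nx + 1, :-2] + u[1:nx + 1, 2:])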

A NOT so Good Case for MPI: Genome Assembly

[Diagram: original DNA is sequenced into short fragments and re-assembled.]

- Construct a genome (chromosome) from a pool of short fragments produced by sequencers.
- Analogy: shred many copies of a book and reconstruct the book by examining the pieces.
- Complications: shreds of other books may be intermixed; the shreds can also contain errors.
- Chop the reads into fixed-length fragments (k-mers).
- K-mers form a De Bruijn graph; traverse the graph to construct longer sequences.
- The graph is stored in a distributed hash table.

Image/Slide Credit: Scott B. Baden (Berkeley Lab)

A NOT so Good Case for MPI: Genome Assembly

Initial segment of DNA: ACTCGATGCTCAATG

[Diagram: Rank 0 and Rank 1 build k-mer graphs from independent segments, sharing their hash numbers. Hash table for Rank 1: GATG->ATGC, ACTC->CTCG->TCGA, TGTC->GCTC->CTCA->TCAA; hash table for Rank 0: TGCT->GCTC, TCAA->CAAT->AATG. When a coinciding hash is detected, the new edge is added and the hash table updated; after aligning k-mers, Rank 1 holds TGTC->GCTC->CTCA->TCAA->CAAT->AATG while Rank 0 keeps TGCT->GCTC.]

Completely asynchronous:
- Detection of coincident hashes
- Asynchronous hash updates

Irregular communication:
- K-mer chain size can vary
- Hash entries must be allocated in real time (cannot pre-allocate)

Difficult to implement in MPI due to the problem's asynchronicity.
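A small, purely illustrative sketch (no MPI) of the distributed-hash-table idea: the owner of a k-mer is a pure function of the k-mer itself, so any rank can compute where an update has to go, but when and how many updates arrive is data-dependent — which is exactly the irregular, asynchronous traffic described above.

from collections import defaultdict

def kmers(read, k=4):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def owner(kmer, nranks):
    # Partition function of the distributed hash table (Python's hash() is
    # only stable within one run; a real assembler would use a fixed hash).
    return hash(kmer) % nranks

read = "ACTCGATGCTCAATG"                 # the initial DNA segment from the slide
nranks = 2
outgoing = defaultdict(list)             # per-destination update lists
km = kmers(read)
for a, b in zip(km, km[1:]):             # consecutive k-mers form a De Bruijn edge a -> b
    outgoing[owner(a, nranks)].append((a, b))

for dest, edges in sorted(outgoing.items()):
    print(f"rank {dest} must apply {len(edges)} edge updates")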

Let's Discuss

[Diagram: samples 0-7 divided equally among Cores 0-3 of Node 0 (divide-and-conquer).]

Q1: Is MPI a good model for the divide-and-conquer strategy?

[Diagram: samples 0-7 distributed on demand — Core 0 of Node 0 acts as the producer, Cores 1-3 as consumers.]

Q2: Is MPI a good model for the producer/consumer strategy?

Asynchronous communication models (e.g., UPC++) might be better suited in these cases.
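For reference in the discussion: a producer/consumer scheme can be written in plain MPI, but it takes explicit bookkeeping on the producer side. A hedged mpi4py sketch (assuming at least one consumer and at least as many samples as consumers) looks like this:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

WORK_TAG, STOP_TAG = 1, 2
samples = [{"Parameters": [0.1 * i, 0.2 * i]} for i in range(32)]

if rank == 0:                                         # producer
    status = MPI.Status()
    results, nxt = [], 0
    for worker in range(1, size):                     # prime each consumer with one sample
        comm.send(samples[nxt], dest=worker, tag=WORK_TAG)
        nxt += 1
    while len(results) < len(samples):
        res = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        results.append(res)
        worker = status.Get_source()
        if nxt < len(samples):                        # hand out work as workers report back
            comm.send(samples[nxt], dest=worker, tag=WORK_TAG)
            nxt += 1
        else:
            comm.send(None, dest=worker, tag=STOP_TAG)
    print("producer collected", len(results), "evaluations")
else:                                                 # consumer
    status = MPI.Status()
    while True:
        sample = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == STOP_TAG:
            break
        x, y = sample["Parameters"]
        comm.send({"Evaluation": -(x * x + y * y)}, dest=0, tag=WORK_TAG)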

Generation-Based Methods

CMA-ES: all samples for the next generation are known at the end of the previous generation.

TMCMC: samples for the current generation are determined in real time, based on the evaluation of previous chain steps.

Let's Discuss

[Diagram: the same two layouts as before — samples 0-7 handled by a producer/consumer arrangement (Core 0 producer, Cores 1-3 consumers) and by a divide-and-conquer split across Cores 0-3 of Node 0.]

Q1: Is the divide-and-conquer strategy good for CMA-ES? What about TMCMC?

Q2: Is the producer/consumer strategy good for CMA-ES? And for TMCMC?

High Throughput Computing with Korali

13

Study Case: Heating Plate

Given: a square metal plate with 3 sources of heat underneath it.

Can we infer the (x,y) locations of the 3 heat sources?

We have ~10 temperature measurements at different locations.

14

Study Case: Configuration

Experiment Problem: Bayesian Inference
Model: C++ 2D Heat Equation
Solver: TMCMC

[Diagram: the three heat sources, their X/Y positions, and the likelihood probability distributions produced by the run.]

15

Parameter Space: Heat Source 1 (x,y), Heat Source 2 (x,y), Heat Source 3 (x,y), Sigma (standard deviation of the likelihood)

Objective Function: likelihood given the reference data
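A minimal sketch of this configuration, following the key names used in Korali's Bayesian-inference examples (they may differ from the actual practice6 code); heat2DModel and temperatureMeasurements are placeholders for the C++ 2D heat solver and the ~10 reference temperatures.

import korali

e = korali.Experiment()
e["Problem"]["Type"] = "Bayesian/Reference"
e["Problem"]["Likelihood Model"] = "Normal"
e["Problem"]["Reference Data"] = temperatureMeasurements     # ~10 measured temperatures
e["Problem"]["Computational Model"] = heat2DModel            # must report reference evaluations and sigma

names = ["Src1 X", "Src1 Y", "Src2 X", "Src2 Y", "Src3 X", "Src3 Y", "Sigma"]
for i, name in enumerate(names):
    e["Distributions"][i]["Name"] = "Prior " + name
    e["Distributions"][i]["Type"] = "Univariate/Uniform"
    e["Distributions"][i]["Minimum"] = 0.0                   # prior bounds are illustrative
    e["Distributions"][i]["Maximum"] = 1.0
    e["Variables"][i]["Name"] = name
    e["Variables"][i]["Prior Distribution"] = "Prior " + name

e["Solver"]["Type"] = "Sampler/TMCMC"
e["Solver"]["Population Size"] = 2000

k = korali.Engine()
k.run(e)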

Practice 6: Running the Study Case

Step I: Go to the practice6 folder and analyze its contents.
Step II: Fill in the missing prior information based on the diagram below.
Step III: Compile and run experiment "practice6".
Step IV: Gather information about the possible heat source locations.
Step V: Plot the posterior distributions.

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default): good for simple function-based Python/C++ models

+ Concurrent: for legacy code or pre-compiled applications (e.g., LAMMPS, MATLAB, Fortran)

+ Distributed: for MPI/UPC++ distributed models (e.g., Mirheo)

Korali exposes multiple "Conduits": different ways to run computational models.

18

Sequential Conduit: links to the model code and runs the model sequentially via a function call.

e = korali.Experiment()
k = korali.Engine()
e["Problem"]["Objective Function"] = myModel
k["Conduit"]["Type"] = "Sequential"
k.run(e)

Korali Application

def myModel(sample):
    x = sample["Parameters"][0]
    y = sample["Parameters"][1]
    # ... computation ...
    sample["Evaluation"] = result

Computational Model

$ ./myKoraliApp.py

Running Application

19

Concurrent Conduit: Korali creates multiple concurrent workers to process samples in parallel.

e = korali.Experiment()
k = korali.Engine()
k["Conduit"]["Type"] = "Concurrent"
k["Conduit"]["Concurrent Jobs"] = 4
k.run(e)

Korali Application

$ ./myKoraliApp.py

Running Application

[Diagram: the Korali main process forks Workers 0-3, hands each worker samples to evaluate, and joins them at the end of the generation.]

20

Practice 7: Parallelize the Study Case

Step I: Go to folder practice7 and use the concurrent conduit to parallelize the code from practice 6.

21

Step II: Analyze running times by running different levels of parallelism.

Step III: Use the top command to observe the CPU usage while you run the example.

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Lets Discuss

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is the Divide and Conquer Strategy

Good for CMA-ES What about TMCMC

Q2 Is the ProducerConsumer Strategy

Good for CMA-ES And for TMCMC

High Throughput Computing with Korali

13

Study Case Heating Plate

Study Case Heating PlateGiven

A square metal plate with 3 sources of heat underneath it

Can we infer the (xy) locations of the 3 heat sources

We have ~10 temperature measurements at different locations

14

Study Case Configuration

Experiment Problem Bayesian Inference

Model C++ 2D Heat Equation

Solver TMCMC

Run

Heat Source 1

Heat Source 2

Heat Source 3

X Y

Likelihood Probability Distributions

15

Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)

Objective Function Likelihood by Reference Data

Practice 6 Running Study Case

Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

High Throughput Computing with Korali

13

Study Case Heating Plate

Study Case Heating PlateGiven

A square metal plate with 3 sources of heat underneath it

Can we infer the (xy) locations of the 3 heat sources

We have ~10 temperature measurements at different locations

14

Study Case Configuration

Experiment Problem Bayesian Inference

Model C++ 2D Heat Equation

Solver TMCMC

Run

Heat Source 1

Heat Source 2

Heat Source 3

X Y

Likelihood Probability Distributions

15

Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)

Objective Function Likelihood by Reference Data

Practice 6 Running Study Case

Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

13

Study Case Heating Plate

Study Case Heating PlateGiven

A square metal plate with 3 sources of heat underneath it

Can we infer the (xy) locations of the 3 heat sources

We have ~10 temperature measurements at different locations

14

Study Case Configuration

Experiment Problem Bayesian Inference

Model C++ 2D Heat Equation

Solver TMCMC

Run

Heat Source 1

Heat Source 2

Heat Source 3

X Y

Likelihood Probability Distributions

15

Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)

Objective Function Likelihood by Reference Data

Practice 6 Running Study Case

Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Study Case Heating PlateGiven

A square metal plate with 3 sources of heat underneath it

Can we infer the (xy) locations of the 3 heat sources

We have ~10 temperature measurements at different locations

14

Study Case Configuration

Experiment Problem Bayesian Inference

Model C++ 2D Heat Equation

Solver TMCMC

Run

Heat Source 1

Heat Source 2

Heat Source 3

X Y

Likelihood Probability Distributions

15

Parameter Space Heat Source 1 (xy) Heat Source 2 (xy) Heat Source 3 (xy) Sigma (StdDev from Likelihood)

Objective Function Likelihood by Reference Data

Practice 6 Running Study Case

Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT So Good Case for MPI: Genome Assembly

- Construct a genome (chromosome) from a pool of short fragments produced by sequencers.
- Analogy: shred many copies of a book and reconstruct the book by examining the pieces.
- Complications: shreds of other books may be intermixed and can also contain errors.
- Chop the reads into fixed-length fragments (k-mers).
- K-mers form a De Bruijn graph; traverse the graph to construct longer sequences.
- The graph is stored in a distributed hash table.

Image/Slide Credit: Scott B. Baden (Berkeley Lab)

Example: initial segment of DNA ACTCGATGCTCAATG.
[Diagram: Rank 0 and Rank 1 build k-mer graphs from independent segments and share their hash numbers (e.g. GATG->ATGC and ACTC->CTCG->TCGA on Rank 1, TGCT->GCTC and TCAA->CAAT->AATG on Rank 0). When a coinciding hash is detected, a new edge is added and the remote hash table is updated, aligning the k-mer chains across ranks.]

- Completely asynchronous: detection of coincident hashes, asynchronous hash updates.
- Irregular communication: k-mer chain size can vary, and hash entries must be allocated in real time (they cannot be pre-allocated).
- Difficult to implement in MPI because of this asynchronicity.
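To make the k-mer / De Bruijn-graph step concrete, here is a tiny serial sketch (the reads and k = 4 are illustrative; the distributed hash table discussed above is not shown):

    from collections import defaultdict

    def kmers(read, k=4):
        # Chop a read into its overlapping fixed-length k-mers.
        return [read[i:i + k] for i in range(len(read) - k + 1)]

    def build_de_bruijn(reads, k=4):
        # Each k-mer points to the k-mer obtained by shifting one base (its successor).
        graph = defaultdict(set)
        for read in reads:
            ks = kmers(read, k)
            for a, b in zip(ks, ks[1:]):
                graph[a].add(b)
        return graph

    # Two overlapping fragments of the segment ACTCGATGCTCAATG
    graph = build_de_bruijn(["ACTCGATG", "GATGCTCAATG"], k=4)
    print(dict(graph))   # e.g. 'ACTC' -> {'CTCG'}, 'GATG' -> {'ATGC'}, ...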

Let's Discuss

Q1: Is MPI a good model for the divide-and-conquer strategy?
[Diagram: Samples 0-7 divided equally among Cores 0-3 of Node 0.]

Q2: Is MPI a good model for the Producer/Consumer strategy?
[Diagram: Core 0 of Node 0 acts as the producer, handing Samples 0-7 to Cores 1-3 acting as consumers.]

Asynchronous communication models might be better in these cases (e.g. UPC++).
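For reference, a minimal mpi4py sketch of the producer/consumer pattern from the diagram above (the square-the-sample "evaluation" is a placeholder); even this small example needs explicit tags, polling of message sources, and a termination protocol, which is part of why asynchronous or one-sided models can be a better fit:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    TAG_WORK, TAG_STOP = 1, 2

    if rank == 0:                                   # producer
        samples = list(range(8))
        results = []
        status = MPI.Status()
        active = size - 1
        while active > 0:
            msg = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
            if msg is not None:                     # a finished sample came back
                results.append(msg)
            worker = status.Get_source()
            if samples:
                comm.send(samples.pop(0), dest=worker, tag=TAG_WORK)
            else:
                comm.send(None, dest=worker, tag=TAG_STOP)
                active -= 1
        print("results:", results)
    else:                                           # consumer
        comm.send(None, dest=0, tag=TAG_WORK)       # announce availability
        while True:
            sample = comm.recv(source=0, tag=MPI.ANY_TAG)
            if sample is None:                      # stop signal
                break
            comm.send(sample * sample, dest=0, tag=TAG_WORK)   # evaluate and report back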

Study Case Configuration

- Experiment Problem: Bayesian Inference
- Model: C++ 2D Heat Equation
- Solver: TMCMC
- Parameter Space: Heat Source 1 (x, y), Heat Source 2 (x, y), Heat Source 3 (x, y), Sigma (standard deviation of the likelihood)
- Objective Function: likelihood of the reference data
[Diagram: running the model places three heat sources at positions (x, y) on the 2D domain and yields the likelihood probability distributions.]
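Spelled out (assuming the Gaussian likelihood implied by the Sigma parameter; the symbols below are notation introduced here, not taken from the slides), each sample is scored against the reference data d as

\[
\theta = (x_1, y_1, x_2, y_2, x_3, y_3, \sigma), \qquad
p(d \mid \theta) = \prod_{i=1}^{N} \mathcal{N}\!\left(d_i \;\middle|\; T(s_i; \theta), \sigma^2\right),
\]

where $d_i$ is the reference temperature measured at location $s_i$ and $T$ is the 2D heat-equation model.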

Practice 6: Running the Study Case

Step I: Go to the practice6 folder and analyze its contents.
Step II: Fill in the missing prior information based on the study-case diagram above.
Step III: Compile and run experiment "practice6".
Step IV: Gather information about the possible heat source locations.
Step V: Plot the posterior distributions.

Parallel Execution

Heterogeneous Model Support: Korali exposes multiple "Conduits", i.e. ways to run computational models:
+ Sequential (default): good for simple function-based Python/C++ models.
+ Concurrent: for legacy code or pre-compiled applications (e.g. LAMMPS, Matlab, Fortran).
+ Distributed: for MPI/UPC++ distributed models (e.g. Mirheo).

Sequential Conduit: links to the model code and runs the model sequentially via a function call.

Korali Application:
    e = korali.Experiment()
    k = korali.Engine()
    e["Problem"]["Objective Function"] = myModel
    k["Conduit"]["Type"] = "Sequential"
    k.run(e)

Computational Model:
    def myModel(sample):
        x = sample["Parameters"][0]
        y = sample["Parameters"][1]
        # ... computation ...
        sample["Evaluation"] = result

Running the Application:
    $ ./myKoraliApp.py

Concurrent Conduit: Korali creates multiple concurrent workers to process samples in parallel.

Korali Application:
    e = korali.Experiment()
    k = korali.Engine()
    k["Conduit"]["Type"] = "Concurrent"
    k["Conduit"]["Concurrent Jobs"] = 4
    k.run(e)

Running the Application:
    $ ./myKoraliApp.py

[Diagram: the Korali main process forks Workers 0-3, distributes the samples among them, and joins the results.]
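The fork/join pattern in the diagram is conceptually similar to a process pool distributing samples to workers (a plain-Python analogy, not Korali's actual implementation; the objective function here is a stand-in):

    from multiprocessing import Pool

    def evaluate(sample):
        x, y = sample
        return -(x ** 2 + y ** 2)                   # stand-in objective function

    if __name__ == "__main__":
        samples = [(i * 0.1, i * 0.2) for i in range(8)]
        with Pool(processes=4) as pool:             # fork four workers
            results = pool.map(evaluate, samples)   # distribute samples, join results
        print(results)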

Practice 7: Parallelize the Study Case

Step I: Go to folder practice7 and use the concurrent conduit to parallelize the code from Practice 6.
Step II: Analyze running times by running with different levels of parallelism.
Step III: Use the top command to observe the CPU usage while you run the example.

Distributed Execution

Distributed Conduit: can be used to run applications beyond the limits of a single node (requires MPI).

Korali Application:
    e = korali.Experiment()
    k = korali.Engine()
    e["Problem"]["Objective Function"] = myModel
    k["Conduit"]["Type"] = "Distributed"
    k.run(e)

Computational Model:
    def myModel(sample, MPIComm):
        x = sample["Parameters"][0]
        y = sample["Parameters"][1]
        # ... local computation ...
        sample["Evaluation"] = result

Running the Application:
    $ mpirun -n 17 ./myKoraliApp.py

[Diagram: Ranks 0-15, spread over Nodes 1-4, evaluate samples; one additional rank acts as the Korali engine rank.]

Distributed Conduit: links to and runs distributed MPI applications through sub-communicator teams.

Korali Application:
    e = korali.Experiment()
    k = korali.Engine()
    e["Problem"]["Objective Function"] = myMPIModel
    k["Conduit"]["Type"] = "Distributed"
    k["Conduit"]["Ranks Per Sample"] = 4
    k.run(e)

Computational Model:
    def myModel(sample, MPIComm):
        x = sample["Parameters"][0]
        y = sample["Parameters"][1]
        myRank = MPIComm.Get_rank()
        rankCount = MPIComm.Get_size()
        # ... distributed computation ...
        sample["Evaluation"] = result

Running the Application:
    $ mpirun -n 17 ./myKoraliApp.py

[Diagram: Ranks 0-15 are grouped into Subcomm 0-3, plus one Korali engine rank. With mpirun -n 17 and "Ranks Per Sample" = 4, the 16 worker ranks form four teams that evaluate four samples concurrently, while the remaining rank runs the Korali engine.]
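Conceptually (an mpi4py illustration, not Korali's internal code), the sub-communicator teams correspond to a communicator split of the worker ranks:

    from mpi4py import MPI

    world = MPI.COMM_WORLD
    rank = world.Get_rank()

    ranks_per_sample = 4
    # Group ranks into teams of four; the dedicated engine rank is ignored here.
    color = rank // ranks_per_sample
    team = world.Split(color=color, key=rank)
    print("world rank", rank, "-> subcomm", color,
          "team rank", team.Get_rank(), "of", team.Get_size())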

Korali's Scalable Sampler
[Diagram: the engine starts the experiment and hands samples to idle workers; workers turn busy, report back as done, and the engine saves the results, checks for termination, and runs the next generation, at which point the workers are idle again.]

Practice 8: MPI-Based Distributed Models

Step 0: Get/install any MPI library (OpenMPI is open-source).
Step I: Use the distributed conduit to parallelize practice7.
Step II: Go to folder practice8 and have Korali run the MPI-based model there.
Step III: Fix the number of MPI ranks (to e.g. 8) and analyze execution times by running different levels of 1) sampling parallelism and 2) model parallelism.
Step IV: Configure Korali to store profiling information and use the profiler tool to see the evolution of the samples (Korali > Tools > Korali Profiler).

Running Out-of-the-Box Applications

Many applications are closed-source or too complicated to interface with directly. For these cases, we can run them from inside a model and then gather the results.

Korali Application:
    e["Problem"]["Objective Function"] = myModel
    k["Conduit"]["Type"] = "Concurrent"
    k["Conduit"]["Concurrent Jobs"] = 4
    k.run(e)

Computational Model:
    import os

    def myModel(sample):
        x = sample["Parameters"][0]
        y = sample["Parameters"][1]
        os.system("./myApp " + str(x) + " " + str(y))   # launch the external application
        result = parseFile("ResultFile.out")
        sample["F(x)"] = result

Running the Application:
    $ ./myKoraliApp.py

[Diagram: myApp is invoked with inputs x and y, writes ResultFile.out, and the model reads the result back via parseFile("ResultFile.out").]
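parseFile is not shown on the slides; a minimal sketch of such a helper, under the assumption that the external application writes a single scalar to its output file, could be:

    def parseFile(path):
        # Read back the single scalar result written by the external application.
        with open(path, "r") as f:
            return float(f.read().strip())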

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Practice 6 Running Study Case

Step I Go to the practice6 folder and analyze its contentsStep II Fill in the missing prior information based on the diagram below Step III Compile and run experiment ldquopractice6rdquoStep IV Gather information about the possible heat source locationStep V Plot the posterior distributions

16

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

17

Parallel Execution

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Heterogeneous Model Support

+ Sequential (default) Good for simple function-based PythonC++ models

+ Concurrent For legacy code or pre-compiled applications (eg LAMMPS Matlab Fortran)

+ Distributed For MPIUPC++ distributed models (eg Mirheo)

Korali exposes multiple ldquoConduitsrdquo ways to run computational models

18

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Sequential ConduitLinks to the model code and runs the model sequentially via function call

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Sequentialkrun(e)

Korali Application

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] computation sample[Evaluation] = result

Computational Model

$ myKoraliApppy

Running Application

19

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Concurrent ConduitKorali creates multiple concurrent workers to process samples in parallel

e = koraliExperiment()

k = koraliEngine()

k[Conduitrdquo][Type] = Concurrent

k[Conduitrdquo][Concurrent Jobs] = 4

krun(e)

Korali Application

$ myKoraliApppy

Running Application

Korali Main Process

Worker 0

Worker 1

Worker 2

Worker 3

Fork

Join

Sample Sample SampleSample

SampleSampleSample

Sample

20

Practice 7 Parallelize Study Case

Step I Go to folder practice7 and

Use the concurrent conduit to parallelize the code in practice 6

21

Step II Analyze running times by running different levels of parallelism

Step III Use the top command to observe the CPU usage while you run the example

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution: A Discussion

A Review of MPI

MPI is the de facto communication standard for high-performance scientific applications.

Two-Sided Communication: a sender and a receiver process explicitly participate in the exchange of a message.

[Diagram: MPI_Send() -> intermediate buffer -> MPI_Recv()]

A message encodes two pieces of information:
1. The actual message payload (data)
2. The fact that both ranks reached the exchange point (synchronization)

It does not encode semantics: the receiver needs to know what to do with the data.

One-Sided Communication: a process can directly access a shared partition in another process's address space, using MPI_Put() / MPI_Get().

This allows passing/receiving data without a corresponding send/recv request. The other end is not notified of the operation (concurrency hazards). It is good for cases in which synchronization ordering is not necessary.

It only encodes one piece of information: data.
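A minimal sketch contrasting the two modes, assuming mpi4py and at least two ranks (the window setup and the data are illustrative only):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Two-sided: both ranks participate explicitly (data + synchronization)
    if rank == 0:
        comm.Send(np.array([42], dtype='i'), dest=1)
    elif rank == 1:
        buf = np.empty(1, dtype='i')
        comm.Recv(buf, source=0)

    # One-sided: rank 0 writes directly into rank 1's exposed window (data only)
    win_buf = np.zeros(1, dtype='i')
    win = MPI.Win.Create(win_buf, comm=comm)
    win.Fence()
    if rank == 0:
        win.Put(np.array([42], dtype='i'), target_rank=1)
    win.Fence()   # closes the access epoch; rank 1 is not otherwise notified
    win.Free()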

A Good Case for MPI: Iterative Solvers

Structured grid stencil solver on a 2D grid:
- Traditional decomposition: 1 process (rank) per core [Diagram: a node with Cores 0-3]
- Iteratively approaches a solution
- Ranks exchange halo (boundary) cells
- Regular communication

Most HPC applications are programmed under the bulk-synchronous model: they iterate between separate computation and communication phases.

[Diagram: core-usage timeline for the conventional decomposition (1 rank per core), alternating useful computation with network communication and intra-node data motion costs over time.]
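A minimal sketch of one such bulk-synchronous iteration, assuming mpi4py and a 1D block decomposition with one ghost cell on each side (grid size and update rule are illustrative):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    N_LOCAL = 64                       # interior cells owned by this rank
    u = np.zeros(N_LOCAL + 2)          # plus one ghost cell per side

    left  = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    for it in range(100):
        # Communication phase: exchange halo (boundary) cells with the neighbors
        comm.Sendrecv(u[N_LOCAL:N_LOCAL + 1], dest=right, recvbuf=u[0:1], source=left)
        comm.Sendrecv(u[1:2], dest=left, recvbuf=u[-1:], source=right)
        # Computation phase: Jacobi-style stencil update of the interior cells
        u[1:-1] = 0.5 * (u[:-2] + u[2:])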

A NOT-so-Good Case for MPI: Genome Assembly

[Diagram: original DNA is sequenced into short fragments and re-assembled.]

- Construct a genome (chromosome) from a pool of short fragments produced by sequencers.
- Analogy: shred many copies of a book and reconstruct the book by examining the pieces.
- Complications: shreds of other books may be intermixed and may also contain errors.
- Chop the reads into fixed-length fragments (k-mers).
- The k-mers form a De Bruijn graph; traverse the graph to construct longer sequences.
- The graph is stored in a distributed hash table.

Image and slide credit: Scott B. Baden (Berkeley Lab)

A NOT-so-Good Case for MPI: Genome Assembly

Initial segment of DNA: ACTCGATGCTCAATG

Build k-mer graphs from independent segments, sharing their hash numbers.

[Diagram: Ranks 0 and 1 each hold part of the distributed k-mer hash table, with chains such as ACTC->CTCG->TCGA, GATG->ATGC, TGCT->GCTC, and TCAA->CAAT->AATG. When a coinciding hash is detected on another rank, a new edge is added, the hash table is updated, and the k-mer chains are aligned into a longer sequence.]

Completely asynchronous:
- Detection of coincident hashes
- Asynchronous hash updates

Irregular communication:
- K-mer chain size can vary
- Hash entries must be allocated in real time (cannot be pre-allocated)

This asynchronicity makes the algorithm difficult to implement on top of MPI.
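As a small, purely local illustration of the k-mer step (the distributed hash table is not shown), a read can be chopped into fixed-length k-mers and consecutive k-mers linked into De Bruijn edges:

    def kmer_edges(read, k=4):
        # Chop the read into overlapping k-mers and link consecutive ones,
        # yielding the edges of a (local) De Bruijn-style graph.
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        return list(zip(kmers[:-1], kmers[1:]))

    # Example with the initial DNA segment from the slide
    print(kmer_edges("ACTCGATGCTCAATG"))
    # [('ACTC', 'CTCG'), ('CTCG', 'TCGA'), ('TCGA', 'CGAT'), ...]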

Let's Discuss

[Diagram: divide-and-conquer distribution of Samples 0-7 across Node 0, Cores 0-3.]

Q1: Is MPI a good model for the divide-and-conquer strategy?

[Diagram: producer/consumer distribution, with Node 0 Core 0 as the producer and Cores 1-3 as consumers handling Samples 0-7.]

Q2: Is MPI a good model for the producer/consumer strategy?

Asynchronous communication models might be better in these cases (e.g. UPC++).


Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

22

Distributed Execution

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Distributed ConduitCan be used to run applications beyond the limits of a single node (needs MPI)

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Distributed

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] Local Computation

sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

23

Korali Engine Rank

Node 1

Node 2

Node 3

Node 4

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Distributed ConduitLinks to and runs distributed MPI applications through sub-communicator teams

e = koraliExperiment()k = koraliEngine()e[Problem][Objective Function] = myMPIModelk[Conduitrdquo][Type] = Distributedk[Conduitrdquo][Ranks Per Sample] = 4

krun(e)

Korali Application

def myModel(sample MPIComm) x = sample[Parameters][0] y = sample[Parameters][1] myRank = commGet_rank() rankCount = commGet_size() Distributed Computation sample[Evaluation] = result

Computational Model

$ mpirun -n 17 myKoraliApppy

Running Application

Rank 0

Rank 1

Rank 2

Rank 3

Rank 4

Rank 5

Rank 6

Rank 7

Rank 8

Rank 9

Rank 10

Rank 11

Rank 12

Rank 13

Rank 14

Rank 15

Subcomm 0

Subcomm 1

Subcomm 2

Subcomm 3

24

Korali Engine Rank

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Idle

Idle

Idle

Idle

Koralirsquos Scalable Sampler

Start ExperimentSamples

Busy

Busy

Busy

Busy

Done

Done

Done

Done

Save Results Check For Termination

Run Next Generation

Idle

Idle

Idle

Idle25

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

Practice 8 MPI-Based Distributed Models

Step II Go to folder practice8 and have Korali run the the MPI-based model there

26

Step III Fix MPI Ranks (to eg 8) and analyze execution times by running different levels of

1) Sampling parallelism2) Model Parallelism

Step IV Configure Korali to store Profiling Information and use the profiler tool to see the

evolution of the samplesUsing Korali gt Tools gt Korali Profiler

Step 0 Getinstall any MPI library (openMPI is open-source)

Step I Use the distributed conduit to parallelize practice7

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

27

Running Out-of-the-box applications

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

Two-sided Communication A sender and a receive process explicitly participate in the exchange of a message

MessageMPI_Recv()MPI_Send()

Intermediate Buffer

A message encodes two pieces of information1 The actual message payload (data)2 The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics the receiver needs to know what to do with the data

MPI De facto communication standard for high-performance scientific applications

A Review of MPI

One-sided Communication A process can directly access a shared partition in another address space

MPI_Put()MPI_Get()

One-Sided Communication

Allows passingreceiving data without a corresponding sendrecv requestThe other end is not notified of the operation (concurrency hazards)Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information data

A Good Case for MPI Iterative Solvers

Traditional Decomposition

1 Process (Rank) per Core

Node

Core 0 Core 1

Core 2 Core 3

Iteratively approaches a solution

Ranks Exchange Halo (Boundary) Cells

Structured Grid Stencil Solver

2D Grid

Regular Communication

TimeCore Usage Timeline

Conventional Decomposition (1 Rank Core)

R0

Network

R0

Network

Most HPC applications are programmed under the Bulk-Synchronous Model Iterates among separate computation and communication phases

R0

Useful Computation

Network Communication Cost

Intra-Node Data Motion Cost

Computation Phase

Network

Communication Phase

A NOT so Good Case for MPI Genome Assembly

Original DNA

Re-assembled DNA

Construct a genome (chromosome) from a pool of short fragments produced by sequencersAnalogy shred many copies of a book and reconstruct the book by examining the pieces Complications shreds of other books may be intermixed can also contain errorsChop the reads into fixed-length fragments (k-mers)K-mers form a De Bruijn graph traverse the graph to construct longer sequences Graph is stored in a distributed hash table

Image Credit Slide Credit Scott B Baden (Berkeley Lab)

A NOT so Good Case for MPI Genome Assembly

Initial Segment of DNA ACTCGATGCTCAATG

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA

Hash Table for Rank 1

TGCT-gtGCTC TCAA-gtCAAT-gtAATG

Hash Table for Rank 0

Rank 0 Rank 1

Detect new edgeUpdate Hash Table

Detect coinciding hash

Build k-mer graphs from independent segments sharing their hash numbers

GATG-gtATGC ACTC-gtCTCG-gtTCGA

TGTC-gtGCTC-CTCA-TCAA-gtCAAT-gtAATG

Hash Table for Rank 1

TGCT-gtGCTC

Hash Table for Rank 0

Align K-mers

Completely Asynchronous- Detection of coincident hashes - Asynchronous Hash Updates

Irregular Communication- K-mer chain size can vary- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to its asynchronicity

Lets Discuss

Sample 7

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Node 0 Core 0

Node 0 Core 1

Node 0 Core 2

Node 0 Core 3

Sample 0

Q1 Is MPI a good model for the

divide-and-conquer strategy

Sample 1

Sample 2

Sample 3

Sample 4

Sample 5

Sample 6

Sample 7

Node 0 Core 0

Producer

Node 0 Core 1

Consumer

Node 0 Core 2

Consumer

Node 0 Core 3

Consumer

Sample 0

Q2 Is MPI a good model for the

ProducerConsumer strategy

Asynchronous communication models might be better in these cases (eg UPC++)

def myModel(sample) x = sample[Parameters][0] y = sample[Parameters][1] osshellrun(myApp + x + y) result = parseFile(ResultFileout) sample[F(x)] = result

Computational Model

For these cases we can run them from inside a model and then gather the results

Running Out-of-the-Box ApplicationsMany applications are close-code or too complicated to interface with others

e[Problem][Objective Function] = myModelk[Conduitrdquo][Type] = Concurrentk[Conduitrdquo][Concurrent Jobs] = 4krun(e)

Korali Application

$ myKoraliApppy

Running Application

28

myAppmyApp x y Result

ResultFileout

parseFile(ResultFileout)

Practice 9 Running out-of-the-box applicationsStep I

Go to folder practice9 and examine the model application (what are its inputs and outputs)

29

Step II Modify the Korali applications objective model to run the application specifying its

inputs and gathering its output

Step III Run the application with different levels of concurrency

30

Running Multiple Experiments

Scheduling Multiple Experiments

Samples

SamplesIdle

Done

Busy

Busy

Start Experiments

31

Effect of Simultaneous ExecutionRunning Experiments Sequentially

Average Efficiency 739

Running Experiments Simultaneously

Average Efficiency 978

32

Practice 10 Running Multiple Experiments

Step I Go to folder practice10 and examine the Korali Application

33

Step II Run the application in parallel and use the profiler tool too see how the experiments

executed

Step III Change the Korali application to run all experiments simultaneously

Step IV Run and profile the application again and compare the results with those of Step II

34

Resuming Previous Experiments

Self-Enforced Fault Tolerance

Korali saves the entire state of the experiment(s) at every generation

Gen 1

Gen 1Gen 0

Gen 0 Gen 2

Gen 2

Gen 3

Gen 3

Time (Hours)

Slurm Job 1 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Fatal Failure

Gen 4

Gen 4

Final

Final

Slurm Job 2 (4000 Nodes)

Experiment 0

Experiment 1

Korali Engine

Korali can resume any Solver Problem Conduit combination35

Practice 11 Running Multiple Experiments

Step I Go to folder practice11 and examine the Korali Application

36

Step II Run the application to completion (10 generations) taking note of the final result

Step III Delete the results folder and change the Korali application to run only the first 55

generations (with this we simulate that an error has occurred)

Step IV Now change the application again to run the last 5 generations

Step V Compare the results with that of an uninterrupted run

MPI and Sample Distribution A Discussion

A Review of MPI

MPI: the de facto communication standard for high-performance scientific applications.

Two-Sided Communication: a sender and a receiver process explicitly participate in the exchange of a message.

MPI_Send() -> [Intermediate Buffer] -> MPI_Recv()

A message encodes two pieces of information:
1. The actual message payload (data)
2. The fact that two ranks reached the exchange point (synchronization)

It does not encode semantics: the receiver needs to know what to do with the data.
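For concreteness, here is a two-sided exchange in mpi4py (a Python sketch; the lecture names the C API, but the semantics are identical). Run with, e.g., mpirun -n 2:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

payload = np.zeros(4, dtype="d")

if rank == 0:
    payload[:] = [1.0, 2.0, 3.0, 4.0]
    # Both sides participate: this call pairs with the Recv below.
    comm.Send([payload, MPI.DOUBLE], dest=1, tag=0)
elif rank == 1:
    # Blocks until rank 0's matching send arrives: data + synchronization.
    comm.Recv([payload, MPI.DOUBLE], source=0, tag=0)
    print("Rank 1 received:", payload)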

One-Sided Communication: a process can directly access a shared partition in another process's address space.

MPI_Put() / MPI_Get()

- Allows passing/receiving data without a corresponding send/recv request
- The other end is not notified of the operation (concurrency hazards)
- Good for cases in which synchronization ordering is not necessary

It only encodes one piece of information: data.
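A matching one-sided sketch in mpi4py, using a fence epoch for simplicity (the window handling is an assumption of this sketch, not code from the lecture):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank exposes a window of memory that other ranks may access directly.
local = np.zeros(4, dtype="d")
win = MPI.Win.Create(local, comm=comm)

win.Fence()
if rank == 0:
    data = np.array([1.0, 2.0, 3.0, 4.0])
    # Write straight into rank 1's window; rank 1 issues no matching call.
    win.Put([data, MPI.DOUBLE], 1)
win.Fence()   # closes the access epoch; only now is the data guaranteed visible

if rank == 1:
    print("Rank 1's window now holds:", local)
win.Free()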

A Good Case for MPI: Iterative Solvers

Traditional Decomposition: 1 process (rank) per core.

[Diagram: a node with Cores 0-3, one rank per core, each owning a block of the grid.]

Structured Grid Stencil Solver (2D grid):
- Iteratively approaches a solution
- Ranks exchange halo (boundary) cells

Regular Communication

Core Usage Timeline - Conventional Decomposition (1 Rank / Core)

[Diagram: each rank (R0, ...) alternates along the time axis between a computation phase and a network communication phase; the legend distinguishes useful computation, network communication cost, and intra-node data motion cost.]

Most HPC applications are programmed under the Bulk-Synchronous Model: they iterate between separate computation and communication phases.
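A minimal sketch of such a bulk-synchronous stencil iteration with halo exchange in mpi4py (1-D row decomposition of a 2-D Jacobi-style update; sizes and iteration count are illustrative):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nrows, ncols = 64, 256                    # local block, plus one ghost row per side
u = np.zeros((nrows + 2, ncols))
u_new = np.zeros_like(u)
up   = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for it in range(100):
    # Communication phase: exchange halo (boundary) rows with both neighbors.
    comm.Sendrecv(sendbuf=u[1, :],  dest=up,   recvbuf=u[-1, :], source=down)
    comm.Sendrecv(sendbuf=u[-2, :], dest=down, recvbuf=u[0, :],  source=up)
    # Computation phase: 5-point stencil update on the interior cells.
    u_new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                u[1:-1, :-2] + u[1:-1, 2:])
    u, u_new = u_new, u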

A NOT so Good Case for MPI: Genome Assembly

[Diagram: Original DNA -> pool of short reads -> Re-assembled DNA]

- Construct a genome (chromosome) from a pool of short fragments produced by sequencers
- Analogy: shred many copies of a book and reconstruct the book by examining the pieces
- Complications: shreds of other books may be intermixed and can also contain errors
- Chop the reads into fixed-length fragments (k-mers)
- K-mers form a De Bruijn graph; traverse the graph to construct longer sequences
- The graph is stored in a distributed hash table

Image / Slide Credit: Scott B. Baden (Berkeley Lab)
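To make the data structure concrete, here is a tiny single-process sketch of chopping a read into k-mers and storing the De Bruijn edges in a hash table (pure Python; in the distributed version each key is hashed to an owning rank):

# Build a toy De Bruijn graph from one read (k = 4), stored in a hash table
# that maps each k-mer to the k-mer(s) that follow it.
read = "ACTCGATGCTCAATG"
k = 4

graph = {}                       # hash table: k-mer -> list of next k-mers
for i in range(len(read) - k):
    kmer = read[i:i + k]
    nxt = read[i + 1:i + 1 + k]
    graph.setdefault(kmer, []).append(nxt)

# Walk the graph from a chosen start k-mer to rebuild a longer sequence.
seq = "ACTC"
cur = seq
while cur in graph:
    cur = graph[cur][0]
    seq += cur[-1]
print(seq)   # reconstructs the original segment: ACTCGATGCTCAATG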

A NOT so Good Case for MPI: Genome Assembly

Initial Segment of DNA: ACTCGATGCTCAATG

Build k-mer graphs from independent segments, sharing their hash numbers (detect coinciding hash, detect new edge, update hash table).

Before aligning:
- Hash Table for Rank 0: TGCT->GCTC, TCAA->CAAT->AATG
- Hash Table for Rank 1: GATG->ATGC, ACTC->CTCG->TCGA, TGTC->GCTC->CTCA->TCAA

After aligning k-mers:
- Hash Table for Rank 0: TGCT->GCTC
- Hash Table for Rank 1: GATG->ATGC, ACTC->CTCG->TCGA, TGTC->GCTC->CTCA->TCAA->CAAT->AATG

Completely Asynchronous:
- Detection of coincident hashes
- Asynchronous hash updates

Irregular Communication:
- K-mer chain size can vary
- Need to allocate hash entries in real time (cannot pre-allocate)

Difficult to implement on MPI due to this asynchronicity.
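A sketch of the partitioning idea behind the distributed hash table (pure Python and single-process; owner() and NUM_RANKS are illustrative assumptions, and in the real code insert_edge() would be a remote, asynchronous update on the owning rank):

import zlib

NUM_RANKS = 2

def owner(kmer):
    # Deterministic toy hash: decides which rank stores this entry
    return zlib.crc32(kmer.encode()) % NUM_RANKS

local_tables = [dict() for _ in range(NUM_RANKS)]

def insert_edge(kmer, next_kmer):
    # In the distributed version this is a remote update (one-sided put or an
    # RPC) on the owning rank, issued whenever a rank discovers a new edge;
    # the owner cannot know in advance when, or how many, updates will arrive.
    table = local_tables[owner(kmer)]
    table.setdefault(kmer, []).append(next_kmer)

for kmer, nxt in [("TCAA", "CAAT"), ("CAAT", "AATG"), ("TGCT", "GCTC")]:
    insert_edge(kmer, nxt)

for r, table in enumerate(local_tables):
    print("Rank", r, "table:", table)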

Let's Discuss

[Diagram: samples 0-7 divided evenly among Node 0, Cores 0-3.]

Q1: Is MPI a good model for the divide-and-conquer strategy?

[Diagram: Node 0, Core 0 acts as the Producer; Cores 1-3 act as Consumers processing samples 0-7 as they become available.]

Q2: Is MPI a good model for the Producer/Consumer strategy?

Asynchronous communication models might be better in these cases (e.g., UPC++).
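For the discussion, here is a sketch of the Producer/Consumer strategy forced into two-sided MPI with mpi4py (the tags and the toy evaluate() are illustrative); note how the producer must explicitly poll for results from any worker, which asynchronous models such as UPC++ express more naturally:

from mpi4py import MPI

TAG_WORK, TAG_RESULT, TAG_STOP = 1, 2, 3

def evaluate(sample):
    return sample * sample           # toy objective standing in for the model

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
samples = list(range(8))

if rank == 0:                        # producer: one core sacrificed to coordination
    results, pending = {}, 0
    for worker in range(1, size):    # prime every worker with one sample
        if samples:
            comm.send(samples.pop(0), dest=worker, tag=TAG_WORK)
            pending += 1
    while pending > 0:
        status = MPI.Status()
        res = comm.recv(source=MPI.ANY_SOURCE, tag=TAG_RESULT, status=status)
        pending -= 1
        results[res[0]] = res[1]
        if samples:                  # opportunistically hand out the next sample
            comm.send(samples.pop(0), dest=status.Get_source(), tag=TAG_WORK)
            pending += 1
    for worker in range(1, size):
        comm.send(None, dest=worker, tag=TAG_STOP)
    print("Results:", results)
else:                                # consumers
    while True:
        status = MPI.Status()
        sample = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        comm.send((sample, evaluate(sample)), dest=0, tag=TAG_RESULT)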
