modeling non-determinism in hpc applications · 2020. 9. 23. · non-determinism and correctness in...

88
Modeling Non-Determinism in HPC Applications Dylan Chapp Advisor: Michela Taufer

Upload: others

Post on 12-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Modeling Non-Determinism in HPC Applications

Dylan ChappAdvisor: Michela Taufer

Page 2: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Non-Determinism and Correctness in HPC

• HPC Community Position: “Non-determinism control” and “anomaly detection” identified in DOE report on the 2017 HPC Correctness Summit as key challenges in bug detection and localization [1]

• Our Position: § These challenges go hand-in-hand. § Detecting when and how they act non-deterministically

in unexpected ways is critical!

1. Gopalakrishnan, G., Hovland, P.D., Iancu, C., Krishnamoorthy, S., Laguna, I., Lethin, R.A., Sen, K., Siegel, S.F. and Solar-Lezama, A., 2017. Report of the HPC Correctness Summit, Jan 25--26, 2017, Washington, DC. arXiv preprint arXiv:1705.07478. See section 3.2.4, 3.2.5 1

Page 3: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Non-Determinism and Correctness in HPC

• HPC Community Position: “Non-determinism control” and “anomaly detection” identified in DOE report on the 2017 HPC Correctness Summit as key challenges in bug detection and localization [1]

• Our Position: § These challenges go hand-in-hand. § Detecting when and how applications act non-deterministically

in anomalous ways is critical

1. Gopalakrishnan, G., Hovland, P.D., Iancu, C., Krishnamoorthy, S., Laguna, I., Lethin, R.A., Sen, K., Siegel, S.F. and Solar-Lezama, A., 2017. Report of the HPC Correctness Summit, Jan 25--26, 2017, Washington, DC. arXiv preprint arXiv:1705.07478. See section 3.2.4, 3.2.5 1

Page 4: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Impacts of Non-Determinism on Scientific Outcomes

(a) (b)

Figure 1: We show the result of two Enzo cosmological simulations with identical initial conditions andsimulation properties in the two panels. The galaxy gas density is projected onto the x-y plane. The whitecircles indicate detected galactic halos, representing the size (and mass) of each halo. Note for example thathalo 49 exists in Figure 1a, and is missing in Figure 1b. This is because minute numerical variations havea↵ected the history of the second simulation to the point that the mass clump where halo 49 should be neverbecame gravitationally bound and hence detectable by the rockstar algorithm. Also note the distances andpositions of several of the galaxies are subtly di↵erent in the two panels. Close inspection of some of thehalos will also reveal di↵erent orientations in the two simulations.

(a) Mass distribution of the largest galaxy (b) X position distribution of the largest galaxy

Figure 2: The two panels show the variability of properties of the largest galaxy from 200 Enzo cosmologysimulations. These simulations were performed using the gcc compiler version 6.2.0 with ‘high’ categoryoptimizations (‘-O2’). The mass of this galaxy can vary by as much as 0.5% of the mean value.

3

1. Stodden, V. and Krafczyk, M.S., 2018. Assessing Reproducibility: An Astrophysical Example of Computational Uncertainty in the HPC Context.

Run where galactic halo was detected Run where galactic halo was not detected

2

Page 5: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

a = 109, b = �109, c = 10�9

Summation order 1(a+ b) + c = (109 � 109) + 10�9 = 10�9

Summation order 2a+ (b+ c) = 109 + (�109 + 10�9) = 0

1

Interaction Between Non-Associativity and Non-Determinism

3

Page 6: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

a = 109, b = �109, c = 10�9

Summation order 1(a+ b) + c = (109 � 109) + 10�9 = 10�9

Summation order 2a+ (b+ c) = 109 + (�109 + 10�9) = 0

1

Interaction Between Non-Associativity and Non-DeterminismThread 0

Thread 1

Thread 2

Thread 0

Thread 1

Thread 2

10-9

10-9

109 -109

109 -109

Run 1

Run 2

3

+ +

+ +

Page 7: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Impacts of Non-Determinism on CorrectnessExample 1:• Rounding difference between Xeon CPU and

Xeon Phi caused message count to differ on CPU code vs. accelerator code [1]

• Message count difference induced deadlock

Example 2:• A non-deterministic bug in

Diablo/HYPRE 2.10.1 [2]• Application hung after several hours,

in approximately 1/50 runs• Cost of debugging effort:

• 18 months of scientists’ time• 9560 node-hours

Expecting 62messages

Sends 63 messages DEADLOCK

10k

Nod

e-Ho

urs

Wasted Useful

Equal to wasting 400 nodes for 24 hours

1. Gopalakrishnan, G., Hovland, P.D., Iancu, C., Krishnamoorthy, S., Laguna, I., Lethin, R.A., Sen, K., Siegel, S.F. and Solar-Lezama, A., 2017. Report of the HPC Correctness Summit, Jan 25--26, 2017, Washington, DC. arXiv preprint arXiv:1705.07478.2. Sato, K., Ahn, D.H., Laguna, I., Lee, G.L., Schulz, M. and Chambreau, C.M., 2017, January. Noise injection techniques to expose subtle and unintended message races. In ACM SIGPLAN Notices (Vol. 52, No. 8, pp. 89-101). ACM. 4

Page 8: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Summary of Challenges

• HPC applications involve the interaction between:§ Floating-point non-associativity§ Communication non-determinism

• Impacts can range from program incorrectnessto irreproducibility scientific outputs

• Mitigation strategies exist, but can be costly, limited in scope, and fail to address a key need: linking observable non-determinism to its root causes

5

Page 9: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Summary of Challenges

• HPC applications involve the interaction between:§ Floating-point non-associativity§ Communication non-determinism

• Impacts can range from program incorrectnessto irreproducibility of scientific outputs

• Mitigation strategies exist, but can be costly, limited in scope, and fail to address a key need: linking observable non-determinism to its root causes

5

Page 10: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Summary of Challenges

• HPC applications involve the interaction between:§ Floating-point non-associativity§ Communication non-determinism

• Impacts can range from program incorrectnessto irreproducibility of scientific outputs

• Mitigation strategies exist, but can be costly, limited in scope, and fail to address a key need: linking observable non-determinism to its root causes

5

Page 11: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Summary of Challenges

• HPC applications involve the interaction between:§ Floating-point non-associativity§ Communication non-determinism

• Impacts can range from program incorrectnessto irreproducibility of scientific outputs

• Mitigation strategies exist, but can be costly, limited in scope, and fail to address a key need: Linking observable non-determinism to its root causes

5

Page 12: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Our Approach

• Need to link observations of potentially harmful non-determinism to potential root causes

• Need to distinguish between anomalous non-deterministic application behaviors and expected ones

• Need a metric for execution similarity• Need a model of executions that supports such a metric

6

Page 13: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Our Approach

• Need to link observations of potentially harmful non-determinism to potential root causes

• Need to distinguish between anomalous non-deterministic application behaviors and expected ones

• Need a metric for execution similarity• Need a model of executions that supports such a metric

6

Page 14: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Our Approach

• Need to link observations of potentially harmful non-determinism to potential root causes

• Need to distinguish between anomalous non-deterministic application behaviors and expected ones

• Need a metric for execution similarity• Need a model of executions that supports such a metric

6

Page 15: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Our Approach

• Need to link observations of potentially harmful non-determinism to potential root causes

• Need to distinguish between anomalous non-deterministic application behaviors and expected ones

• Need a metric for execution similarity• Need a model of executions that supports such a metric

6

Page 16: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Workflow for Non-Determinism Characterization

• Phase 1: Build graph-structured models of executions• Phase 2: Quantify cross-execution trends in non-deterministic

communication via graph similarity• Phase 3: Detect periods of anomalous execution dissimilarity

and localize potential root causes

7

Page 17: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Workflow for Non-Determinism Characterization

• Phase 1: Build graph-structured models of executions

8

Page 18: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

From Traces to Event Graphs

• We trace a non-deterministic application multiple times, capturing a record of communication events

1. Rodrigues, A.F., Voskuilen, G.R., Hammond, S.D. and Hemmert, K.S., 2016. Structural Simulation Toolkit (SST) (No. SAND2016-3693PE). Sandia National Lab.(SNL-NM), Albuquerque, NM (United States).2. https://github.com/sstsimulator/sst-dumpi

Tracing Library

9

Page 19: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

From Traces to Event Graphs

• We trace a non-deterministic application multiple times, capturing a record of communication events

• We convert the set of trace files to a graph-structured model of the inter-process communication that occurred during the execution—i.e., the event graph [3]

1. Rodrigues, A.F., Voskuilen, G.R., Hammond, S.D. and Hemmert, K.S., 2016. Structural Simulation Toolkit (SST) (No. SAND2016-3693PE). Sandia National Lab.(SNL-NM), Albuquerque, NM (United States).2. https://github.com/sstsimulator/sst-dumpi3. Kranzlmüller, D., 2000. Event graph analysis for debugging massively parallel programs.

Tracing Library

9

Page 20: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Event Graph Structure• Directed, acyclic graph (DAG) representing a

message-passing program execution [1]

1. Kranzlmüller, D., 2000. Event graph analysis for debugging massively parallel programs. 10

P1

P0

Page 21: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Event Graph Structure

Receive Event

Send Event

Barrier Events

• Directed, acyclic graph (DAG) representing a message-passing program execution [1]

• Vertices represent communication events(e.g., message sends, receives, and barriers)

1. Kranzlmüller, D., 2000. Event graph analysis for debugging massively parallel programs. 10

P1

P0

Page 22: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Event Graph Structure

Receive Event

Send Event

Barrier Events

P1

P0

• Directed, acyclic graph (DAG) representing a message-passing program execution [1]

• Vertices represent communication events(e.g., message sends, receives, and barriers)

• Edges represent “happens-before”[2] relationship between events

1. Kranzlmüller, D., 2000. Event graph analysis for debugging massively parallel programs.2. Lamport, L., 1978. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7), pp.558-565.

10

Page 23: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Event Graph Vertex Labels

Send Event

Receive Event

Event Type: ReceiveProcess ID: 0Wall Time: 3.1745455 (s)

Event Type: SendProcess ID: 1Wall Time: 2.12323 (s)

11

Page 24: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Event Graph Vertex Labels

Send Event

Receive Event

Event Type: ReceiveProcess ID: 0Wall Time: 3.1745455 (s)Call-Stack: main à do_comm à non_det_recv à MPI_Recv

Event Type: SendProcess ID: 1Wall Time: 2.12323 (s)Call-Stack: main à do_comm à non_det_send à MPI_Send

11

Page 25: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Event Graph Vertex Labels

Send Event

Receive Event

Event Type: ReceiveProcess ID: 0Wall Time: 3.1745455 (s)Call-Stack: main à do_comm à non_det_recv à MPI_Recv

Event Type: SendProcess ID: 1Wall Time: 2.12323 (s)Call-Stack: main à do_comm à non_det_send à MPI_Send

Call-stack labels link run-time non-determinism to root causes in source code

11

Page 26: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Summary of Event Graph Model

• Traces of MPI application → DAG• Vertices → communication events• Edges → happens-before orders• Vertex Labels:

§ Event type → What happened?§ Process ID → In which process did it happen?§ Timestamp → When did it happen?§ Call-Stack → Where in the code did it come from?

12

Page 27: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Workflow for Non-Determinism Characterization

• Phase 2: Quantify cross-execution trends in non-deterministic communication via graph similarity

13

Page 28: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

K ( , )

Graph Similarity via Graph Kernels

Similarity Score

14

G G’

Page 29: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

K ( , )

Graph Similarity via Graph Kernels

Similarity Score

14

Intuition: K counts matching substructures• Matches between G and G’ increase score• Differences do not• Graph kernel K induces a metric → the graph kernel distance [1]• Formula: D i j = 𝐾 𝑖 𝑖 + 𝐾 𝑗 𝑗 − 2𝐾 𝑖 𝑗

G G’

Page 30: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

K ( , )

Graph Similarity via Graph Kernels

Similarity Score

14

Intuition: K counts matching substructures• Matches between G and G’ increase score• Differences do not• Graph kernel K induces a metric → the graph kernel distance [1]• Formula: D i j = 𝐾 𝑖 𝑖 + 𝐾 𝑗 𝑗 − 2𝐾 𝑖 𝑗

G G’

Page 31: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

K ( , )

Graph Similarity via Graph Kernels

Similarity Score

Intuition: K counts matching substructures• Matches between G and G’ increase score• Differences do not• Graph kernel K induces a metric → the graph kernel distance [1]• Formula: D i j = 𝐾 𝑖 𝑖 + 𝐾 𝑗 𝑗 − 2𝐾 𝑖 𝑗

141. Phillips, J.M. and Venkatasubramanian, S., 2011. A gentle introduction to the kernel distance. arXiv preprint arXiv:1103.1625.

G G’

Page 32: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

K ( , )

Graph Similarity via Graph Kernels

Similarity Score

Intuition: K counts matching substructures• Matches between G and G’ increase score• Differences do not• Graph kernel K induces a metric → the graph kernel distance [1]

Formula: D(G, G′) = 𝐾 𝐺, 𝐺2 + 𝐾(𝐺, 𝐺′) − 2𝐾(𝐺, 𝐺2)

G G’

141. Phillips, J.M. and Venkatasubramanian, S., 2011. A gentle introduction to the kernel distance. arXiv preprint arXiv:1103.1625.

Page 33: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Quantifying Non-Determinism via Graph Kernel Distance

We evaluate the Weisfeiler-Lehman Subtree Pattern Kernel for quantifying dissimilarity between event graphs:

§ Demonstrated performance on other graph classification tasks [1]§ Scalability compared to other graph kernels§ Incorporation of arbitrary vertex label data

1. Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), pp.2539-2561.2. Yanardag, P. and Vishwanathan, S.V.N., 2015, August. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1365-1374). ACM.3. Ghosh, S., Das, N., Gonçalves, T., Quaresma, P. and Kundu, M., 2018. The journey of graph kernels through two decades. Computer Science Review, 27, pp.88-111.

15

Page 34: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Quantifying Non-Determinism via Graph Kernel Distance

We evaluate the Weisfeiler-Lehman Subtree Pattern Kernel for quantifying dissimilarity between event graphs:

§ Demonstrated performance on other graph classification tasks [1]§ Scalability compared to other graph kernels§ Incorporation of arbitrary vertex label data

1. Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), pp.2539-2561.2. Yanardag, P. and Vishwanathan, S.V.N., 2015, August. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1365-1374). ACM.3. Ghosh, S., Das, N., Gonçalves, T., Quaresma, P. and Kundu, M., 2018. The journey of graph kernels through two decades. Computer Science Review, 27, pp.88-111.

15

Page 35: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

We evaluate the Weisfeiler-Lehman Subtree Pattern Kernel for quantifying dissimilarity between event graphs:

§ Demonstrated performance on other graph classification tasks [1]§ Scalability compared to other graph kernels§ Incorporation of arbitrary vertex label data

1. Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), pp.2539-2561.2. Yanardag, P. and Vishwanathan, S.V.N., 2015, August. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1365-1374). ACM.3. Ghosh, S., Das, N., Gonçalves, T., Quaresma, P. and Kundu, M., 2018. The journey of graph kernels through two decades. Computer Science Review, 27, pp.88-111.

15

Quantifying Non-Determinism via Graph Kernel Distance

Page 36: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

We evaluate the Weisfeiler-Lehman Subtree Pattern Kernel for quantifying dissimilarity between event graphs:

§ Demonstrated performance on other graph classification tasks [1]§ Scalability compared to other graph kernels§ Incorporation of arbitrary vertex label data

1. Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), pp.2539-2561.2. Yanardag, P. and Vishwanathan, S.V.N., 2015, August. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1365-1374). ACM.3. Ghosh, S., Das, N., Gonçalves, T., Quaresma, P. and Kundu, M., 2018. The journey of graph kernels through two decades. Computer Science Review, 27, pp.88-111.

15

Quantifying Non-Determinism via Graph Kernel Distance

Page 37: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Evaluation Methodology

• Construct event graphs for common communication patterns with:§ Controlled degree of non-determinism § Fixed amount of communication volume

• Hypothesis: As greater non-determinism is permitted in the runs, the graph kernel distances between the event graphs representing those runs will increase

17

Page 38: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Evaluation Methodology

• Construct event graphs for common communication patterns with:§ Controlled degree of non-determinism § Fixed amount of communication volume

• Hypothesis: As greater non-determinism is permitted in the runs,the graph kernel distances between the event graphs representing those runs will increase

17

Page 39: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Deterministic

Run 1

Run 2

All receives guaranteed to occur in same order run-to-run

18

Page 40: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Deterministic

Run 1

Run 2

All receives guaranteed to occur in same order run-to-run

Partially Non-DeterministicSome receives guaranteed to

occur in same order

18

Page 41: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Deterministic

Run 1

Run 2

All receives guaranteed to occur in same order run-to-run

These messages are always received in same order

Partially Non-DeterministicSome receives guaranteed to

occur in same order

18

Page 42: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Deterministic

Run 1

Run 2

All receives guaranteed to occur in same order run-to-run

These messages are received in arbitrary order

Partially Non-DeterministicSome receives guaranteed to

occur in same order

18

Page 43: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Deterministic Fully Non-Deterministic

Run 1

Run 2

All receives guaranteed to occur in same order run-to-run

All receives occur in arbitrary order run-to-run

Partially Non-DeterministicSome receives guaranteed to

occur in same order

18

Page 44: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Deterministic Fully Non-Deterministic

Run 1

Run 2

Partially Non-Deterministic

vs

Kern

el D

istan

ceHi

gher

==

Less

Sim

ilar 1.00

0.00

0.50

0.75

0.25

Comparing Runs via Kernel Distance

19

Page 45: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Deterministic Fully Non-Deterministic

Run 1

Run 2

Partially Non-Deterministic

vs

Kern

el D

istan

ceHi

gher

==

Less

Sim

ilar 1.00

0.00

0.50

0.75

0.25

Comparing Runs via Kernel DistanceHeight of bar == graph kernel distance

19

Page 46: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Deterministic Fully Non-Deterministic

Run 1

Run 2

Partially Non-Deterministic

vs

Kern

el D

istan

ceHi

gher

==

Less

Sim

ilar 1.00

0.00

0.50

0.75

0.25

Comparing Runs via Kernel Distance

19

Height of bar == graph kernel distance

Page 47: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Deterministic Fully Non-Deterministic

Run 1

Run 2

Partially Non-Deterministic

vs

Kern

el D

istan

ceHi

gher

==

Less

Sim

ilar 1.00

0.00

0.50

0.75

0.25

Comparing Runs via Kernel Distance

19

Height of bar == graph kernel distance

Page 48: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Trends Across Multiple Runs

Run 1

Run 2

Run N

… … …

• Previous example only measured kernel distance between two executions• But we want a statistical picture of the trend in kernel distance over time across many executions

= kernel distance between graphs i and j

= kernel distance from graph to itself, always 0

= redundant, due to symmetry

Kernel matrixw/r/t chosen graph kernel K

20

Page 49: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Trends Across Multiple RunsKe

rnel

Dist

ance

High

er =

= Le

ss S

imila

r

Deterministic Comm. Pattern

Fully Non-DeterministicComm. Pattern

Partially Non-DeterministicComm. Pattern

Kernel distances indexed by pairs of runs

21

Page 50: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Trends Across Multiple RunsKe

rnel

Dist

ance

High

er =

= Le

ss S

imila

r

• Non-deterministic communication patterns have a chance of manifesting the same message order at runtime.

• Motivates looking at distribution of kernel distancesbetween pairs of runs rather than individual distances

Deterministic Comm. Pattern

Partially Non-DeterministicComm. Pattern

Fully Non-DeterministicComm. Pattern

Kernel distances indexed by pairs of runs

22

Page 51: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Trends Across Multiple RunsKe

rnel

Dist

ance

High

er =

= Le

ss S

imila

r

Deterministic Comm. Pattern

Partially Non-DeterministicComm. Pattern

Fully Non-DeterministicComm. Pattern

23

Page 52: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Evaluation Against a Realistic Communication Pattern

• We now evaluate against a communication pattern extracted from the AMG benchmark from the CORAL-2 Benchmark Suite

• Known to exhibit both receiver-side non-determinism and sender-side non-determinism [1]

• For our evaluation, we control proportion of non-deterministic communication volume by splitting ranks into two groups: § One group performs the actual non-deterministic AMG pattern § One group performs a determinized version of the pattern

24

Page 53: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Evaluation Against a Realistic Communication Pattern

• We now evaluate against a communication pattern extracted from the AMG benchmark from the CORAL-2 Benchmark Suite

• Known to exhibit both receiver-side non-determinism and sender-side non-determinism [1]

• For our evaluation, we control proportion of non-deterministic communication volume by splitting ranks into two groups: § One group performs the actual non-deterministic AMG pattern § One group performs a determinized version of the pattern

241. Cappello, F., Guermouche, A. and Snir, M., 2010, August. On communication determinism in parallel HPC applications. In 2010 Proceedings of 19th International Conference on Computer Communications and Networks (pp. 1-8). IEEE.

Page 54: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Evaluation Against a Realistic Communication Pattern

• We now evaluate against a communication pattern extracted from the AMG benchmark from the CORAL-2 Benchmark Suite

• Known to exhibit both receiver-side non-determinism and sender-side non-determinism [1]

• For our evaluation, we control proportion of non-deterministic communication volume by splitting ranks into two groups: § One group performs the actual non-deterministic AMG pattern § One group performs a determinized version of the pattern

241. Cappello, F., Guermouche, A. and Snir, M., 2010, August. On communication determinism in parallel HPC applications. In 2010 Proceedings of 19th International Conference on Computer Communications and Networks (pp. 1-8). IEEE.

Page 55: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Between Runs of Sequoia-AMGKe

rnel

Dist

ance

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

16-process runs 32-process runs

Graph Kernel Used: Weisfeiler-Lehman Subtree Kernel [1]

1. Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), pp.2539-2561.

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

High

er =

= Ru

ns L

ess S

imila

r to

Each

Oth

er

Hypothesis: Average kernel distance increases as fraction of non-deterministic communication increases

25

Page 56: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kern

el D

istan

ce

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

16-process runs 32-process runs

Graph Kernel Used: Weisfeiler-Lehman Subtree Kernel [1]

1. Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), pp.2539-2561.

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

High

er =

= Ru

ns L

ess S

imila

r to

Each

Oth

er

Hypothesis: Average kernel distance increases as fraction of non-deterministic communication increases

25

Kernel Distance Between Runs of Sequoia-AMG

Page 57: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kern

el D

istan

ce

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

16-process runs 32-process runs

Graph Kernel Used: Weisfeiler-Lehman Subtree Kernel [1]

1. Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), pp.2539-2561.

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

High

er =

= Ru

ns L

ess S

imila

r to

Each

Oth

er

Hypothesis: Average kernel distance increases as fraction of non-deterministic communication increases

25

Kernel Distance Between Runs of Sequoia-AMG

Page 58: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kern

el D

istan

ce

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

16-process runs 32-process runs

Graph Kernel Used: Weisfeiler-Lehman Subtree Kernel [1]

1. Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), pp.2539-2561.

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

High

er =

= Ru

ns L

ess S

imila

r to

Each

Oth

er

Hypothesis: Average kernel distance increases as fraction of non-deterministic communication increases

25

Kernel Distance Between Runs of Sequoia-AMG

Page 59: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kern

el D

istan

ce

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

16-process runs

Graph Kernel Used: Weisfeiler-Lehman Subtree Kernel [1]

1. Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), pp.2539-2561.

High

er =

= Ru

ns L

ess S

imila

r to

Each

Oth

er

Hypothesis: Average kernel distance increases as fraction of non-deterministic communication increases

26

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

64-process runs

Kernel Distance Between Runs of Sequoia-AMG

Page 60: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Lessons Learned

• Graph kernel distance is a useful proxy for degree of non-determinism in a communication pattern

• This holds for both:§ Simple communication patterns with only receiver-side

non-determinism (i.e., reduce example)§ More complex patterns with mixed receiver/sender-side

non-determinism (i.e., AMG probe example)

27

Page 61: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Lessons Learned

• Graph kernel distance is a useful proxy for degree of non-determinism in a communication pattern

• This holds for both:§ Simple communication patterns with only receiver-side

non-determinism (e.g., naïve reduce pattern)§ More complex patterns with mixed receiver/sender-side

non-determinism (e.g., Sequoia-AMG Iprobe pattern)

27

Page 62: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Workflow for Non-Determinism Characterization

• Phase 3: Detect periods of anomalous execution dissimilarity and localize potential root causes

28

Page 63: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Applying our Workflow to miniAMR

• Adaptive mesh refinement (AMR) code from the MatevoBenchmark Suite [1]

• We target miniAMR based on the following criteria:§ MPI application exhibiting communication non-determinism§ Root cause of non-determinism known a priori

1. Heroux, M.A., Doerfler, D.W., Crozier, P.S., Willenbring, J.M., Edwards, H.C., Williams, A., Rajan, M., Keiter, E.R., Thornquist, H.K. and Numrich, R.W., 2009. Improving performance via mini-applications. Sandia National Laboratories, Tech. Rep. SAND2009-5574, 3. 29

Page 64: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Applying our Workflow to miniAMR

• Adaptive mesh refinement (AMR) code from the MatevoBenchmark Suite [1]

• We target miniAMR based on the following criteria:§ MPI application exhibiting communication non-determinism§ Root cause of non-determinism known a priori

1. Heroux, M.A., Doerfler, D.W., Crozier, P.S., Willenbring, J.M., Edwards, H.C., Williams, A., Rajan, M., Keiter, E.R., Thornquist, H.K. and Numrich, R.W., 2009. Improving performance via mini-applications. Sandia National Laboratories, Tech. Rep. SAND2009-5574, 3. 29

Page 65: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Event Graph Slicing

. . .

Run 001

Run 002

Run N

Time

Compute graph kernel similarities for this slice

N x N matrix of pairwise kernel distances

vs.Run 1, Slice 1

Run 2, Slice 1

30

Page 66: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Distributions of Kernel Distances

. . .

Run 001

Run 002

Run N

Time Time

Kern

el D

istan

ce

Compute graph kernel similarities for this slice

30

N x N matrix of pairwise kernel distances

Page 67: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Trends Over Time

. . .

Run 001

Run 002

Run N

Time Time

Compute graph kernel similarities for this slice

Kern

el D

istan

ce

30

N x N matrix of pairwise kernel distances

Page 68: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Trends Over Time

. . .

Run 001

Run 002

Run N

Time Time

Compute graph kernel similarities for this slice

Kern

el D

istan

ce

30

N x N matrix of pairwise kernel distances

Page 69: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Trends Over Time

. . .

Run 001

Run 002

Run N

Time Time

Compute graph kernel similarities for this slice

Kern

el D

istan

ce

30

N x N matrix of pairwise kernel distances

barrier

Page 70: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Linking Observed Non-Determinism to Root CausesRun 1

Run 2

Run N

A set of N event graphs modeling N runs of a

non-deterministic application

31

Page 71: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Linking Observed Non-Determinism to Root CausesRun 1

Run 2

Run N

A set of N event graphs modeling N runs of a

non-deterministic application

A time-series of distributions of graph kernel distances between corresponding subgraphs

from pairs of runs

Time

Kern

el D

istan

ce

31

Page 72: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Linking Observed Non-Determinism to Root CausesRun 1

Run 2

Run N

A set of N event graphs modeling N runs of a

non-deterministic application

A time-series of distributions of graph kernel distances between corresponding subgraphs

from pairs of runs

Time

Kern

el D

istan

ce

Largest kernel distance observed here

31

Page 73: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Linking Observed Non-Determinism to Root CausesRun 1

Run 2

Run N

A set of N event graphs modeling N runs of a

non-deterministic application

A time-series of distributions of graph kernel distances between corresponding subgraphs

from pairs of runs

Time

Kern

el D

istan

ce

Largest kernel distance observed here

Extract subgraphs associated with largest kernel distances

31

Page 74: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Linking Observed Non-Determinism to Root CausesRun 1

Run 2

Run N

v7,0 : main à do_comm à det_send à MPI_Sendv6,0 : main à do_comm à det_send à MPI_Send

…v0,0 : main à do_comm à non_det_recv à MPI_Recv

…v0,6 : main à do_comm à non_det_recv à MPI_Recv

Extract call-stacks from constituent vertices

Extract subgraphs associated with largest kernel distances

Time

Kern

el D

istan

ce

Largest kernel distance observed here

31

A set of N event graphs modeling N runs of a

non-deterministic application

Page 75: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Linking Observed Non-Determinism to Root CausesRun 1

Run 2

Run N

v7,0 : main à do_comm à det_send à MPI_Sendv6,0 : main à do_comm à det_send à MPI_Send

…v0,0 : main à do_comm à non_det_recv à MPI_Recv

…v0,6 : main à do_comm à non_det_recv à MPI_Recv

Extract call-stacks from constituent vertices

Extract subgraphs associated with largest kernel distances

Time

Kern

el D

istan

ce

Largest kernel distance observed here

Distribution of call-stacks describes likely locations of non-determinism root causes

31

A set of N event graphs modeling N runs of a

non-deterministic application

Page 76: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Time Series for miniAMR

Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 9

Kernel distance distributionsfrom first 10 slices

32

Page 77: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Time Series for miniAMR

Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 99

Kernel distance distributionsfrom first 100 slices

33

Page 78: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Kernel Distance Time Series for miniAMR

Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 1004

Kernel distance distributionsfrom all 1005 slices

34

Page 79: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Identifying Anomalous Slices

Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 1004

Alternating periods of relatively high median kernel distance between executions

34

Page 80: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Identifying Anomalous Slices

Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 1004

Flag slices as anomalous when median kernel distance increases relative to previous slice

34

Page 81: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Extract call stack labels from subgraphs within the slice:

Linking to Potential Root Causes

Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 1004

Run 1 Slice

Subgraph

Run 200 Slice

Subgraph

… …

35

Page 82: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Extract call stack labels from subgraphs within the slice:

Linking to Potential Root Causes

Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 1004

main, driver, refine, load_balance

Run 1 Slice

Subgraph

Run 200 Slice

Subgraph

… …

Call Stack Count

main, driver, refine, comm_refine

main, driver, refine, load_balance, load_balance_move_blocks

main, driver, refine, redistribute_blocks, move_blocks, comm_parent_proc

… …

54

90

224

95

35

Page 83: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Frequency of call-stacks across all flagged slices

36

Page 84: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Frequency of call-stacks across all flagged slices• Call stacks related to mesh refinement

and load-balancing dominate

36

Page 85: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Conclusion

• Our workflow identifies potential root causes of non-determinism in HPC applications by:§ Building event graph models of executions§ Quantifying trends in non-determinism via graph kernel distance§ Linking runtime non-determinism to root causes

• Demonstrated viability against:§ Isolated communication patterns (Sequoia-AMG)§ Representative mini-apps (miniAMR)

• Future Work:§ Target full-fledged production AMR applications (e.g., Enzo)

37

Page 86: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Conclusion

• Our workflow identifies potential root causes of non-determinism in HPC applications by:§ Building event graph models of executions§ Quantifying trends in non-determinism via graph kernel distance§ Linking runtime non-determinism to root causes

• Demonstrated viability against:§ Isolated communication patterns (CORAL-2 AMG)§ Representative mini-app (miniAMR)

• Future Work:§ Target full-fledged production AMR applications (e.g., Enzo)

37

Page 87: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Conclusion

• Our workflow identifies potential root causes of non-determinism in HPC applications by:§ Building event graph models of executions§ Quantifying trends in non-determinism via graph kernel distance§ Linking runtime non-determinism to root causes

• Demonstrated viability against:§ Isolated communication patterns (CORAL-2 AMG)§ Representative mini-app (miniAMR)

• Future Work:§ Target full-fledged production AMR applications (e.g., Enzo)

37

Page 88: Modeling Non-Determinism in HPC Applications · 2020. 9. 23. · Non-Determinism and Correctness in HPC •HPC Community Position:“Non-determinism control” and “anomaly detection”

Questions?