modeling non-determinism in hpc applications · 2020. 9. 23. · non-determinism and correctness in...

Modeling Non-Determinism in HPC Applications

Dylan ChappAdvisor: Michela Taufer

Non-Determinism and Correctness in HPC

• HPC Community Position: “Non-determinism control” and “anomaly detection” identified in DOE report on the 2017 HPC Correctness Summit as key challenges in bug detection and localization [1]

• Our Position: § These challenges go hand-in-hand. § Detecting when and how they act non-deterministically

in unexpected ways is critical!

1. Gopalakrishnan, G., Hovland, P.D., Iancu, C., Krishnamoorthy, S., Laguna, I., Lethin, R.A., Sen, K., Siegel, S.F. and Solar-Lezama, A., 2017. Report of the HPC Correctness Summit, Jan 25--26, 2017, Washington, DC. arXiv preprint arXiv:1705.07478. See section 3.2.4, 3.2.5 1

Non-Determinism and Correctness in HPC

• HPC Community Position: “Non-determinism control” and “anomaly detection” identified in DOE report on the 2017 HPC Correctness Summit as key challenges in bug detection and localization [1]

• Our Position: § These challenges go hand-in-hand. § Detecting when and how applications act non-deterministically

in anomalous ways is critical

1. Gopalakrishnan, G., Hovland, P.D., Iancu, C., Krishnamoorthy, S., Laguna, I., Lethin, R.A., Sen, K., Siegel, S.F. and Solar-Lezama, A., 2017. Report of the HPC Correctness Summit, Jan 25--26, 2017, Washington, DC. arXiv preprint arXiv:1705.07478. See section 3.2.4, 3.2.5 1

Impacts of Non-Determinism on Scientific Outcomes

(a) (b)

Figure 1: We show the result of two Enzo cosmological simulations with identical initial conditions andsimulation properties in the two panels. The galaxy gas density is projected onto the x-y plane. The whitecircles indicate detected galactic halos, representing the size (and mass) of each halo. Note for example thathalo 49 exists in Figure 1a, and is missing in Figure 1b. This is because minute numerical variations havea↵ected the history of the second simulation to the point that the mass clump where halo 49 should be neverbecame gravitationally bound and hence detectable by the rockstar algorithm. Also note the distances andpositions of several of the galaxies are subtly di↵erent in the two panels. Close inspection of some of thehalos will also reveal di↵erent orientations in the two simulations.

(a) Mass distribution of the largest galaxy (b) X position distribution of the largest galaxy

Figure 2: The two panels show the variability of properties of the largest galaxy from 200 Enzo cosmologysimulations. These simulations were performed using the gcc compiler version 6.2.0 with ‘high’ categoryoptimizations (‘-O2’). The mass of this galaxy can vary by as much as 0.5% of the mean value.

3

1. Stodden, V. and Krafczyk, M.S., 2018. Assessing Reproducibility: An Astrophysical Example of Computational Uncertainty in the HPC Context.

Run where galactic halo was detected Run where galactic halo was not detected

2

a = 109, b = �109, c = 10�9

Summation order 1(a+ b) + c = (109 � 109) + 10�9 = 10�9

Summation order 2a+ (b+ c) = 109 + (�109 + 10�9) = 0

1

Interaction Between Non-Associativity and Non-Determinism

3

a = 109, b = �109, c = 10�9

Summation order 1(a+ b) + c = (109 � 109) + 10�9 = 10�9

Summation order 2a+ (b+ c) = 109 + (�109 + 10�9) = 0

1

Interaction Between Non-Associativity and Non-DeterminismThread 0

Thread 1

Thread 2

Thread 0

Thread 1

Thread 2

10-9

10-9

109 -109

109 -109

Run 1

Run 2

3

+ +

+ +

Impacts of Non-Determinism on CorrectnessExample 1:• Rounding difference between Xeon CPU and

Xeon Phi caused message count to differ on CPU code vs. accelerator code [1]

• Message count difference induced deadlock

Example 2:• A non-deterministic bug in

Diablo/HYPRE 2.10.1 [2]• Application hung after several hours,

in approximately 1/50 runs• Cost of debugging effort:

• 18 months of scientists’ time• 9560 node-hours

Expecting 62messages

Sends 63 messages DEADLOCK

10k

Nod

e-Ho

urs

Wasted Useful

Equal to wasting 400 nodes for 24 hours

1. Gopalakrishnan, G., Hovland, P.D., Iancu, C., Krishnamoorthy, S., Laguna, I., Lethin, R.A., Sen, K., Siegel, S.F. and Solar-Lezama, A., 2017. Report of the HPC Correctness Summit, Jan 25--26, 2017, Washington, DC. arXiv preprint arXiv:1705.07478.2. Sato, K., Ahn, D.H., Laguna, I., Lee, G.L., Schulz, M. and Chambreau, C.M., 2017, January. Noise injection techniques to expose subtle and unintended message races. In ACM SIGPLAN Notices (Vol. 52, No. 8, pp. 89-101). ACM. 4

Summary of Challenges

• HPC applications involve the interaction between:§ Floating-point non-associativity§ Communication non-determinism

• Impacts can range from program incorrectnessto irreproducibility scientific outputs

• Mitigation strategies exist, but can be costly, limited in scope, and fail to address a key need: linking observable non-determinism to its root causes

5



• Impacts can range from program incorrectnessto irreproducibility of scientific outputs

• Mitigation strategies exist, but can be costly, limited in scope, and fail to address a key need: linking observable non-determinism to its root causes

5



• Impacts can range from program incorrectnessto irreproducibility of scientific outputs

• Mitigation strategies exist, but can be costly, limited in scope, and fail to address a key need: Linking observable non-determinism to its root causes

5

Our Approach

• Need to link observations of potentially harmful non-determinism to potential root causes

• Need to distinguish between anomalous non-deterministic application behaviors and expected ones

• Need a metric for execution similarity• Need a model of executions that supports such a metric

6

Workflow for Non-Determinism Characterization

• Phase 1: Build graph-structured models of executions• Phase 2: Quantify cross-execution trends in non-deterministic

communication via graph similarity• Phase 3: Detect periods of anomalous execution dissimilarity

and localize potential root causes

7


• Phase 1: Build graph-structured models of executions

8

From Traces to Event Graphs

• We trace a non-deterministic application multiple times, capturing a record of communication events

1. Rodrigues, A.F., Voskuilen, G.R., Hammond, S.D. and Hemmert, K.S., 2016. Structural Simulation Toolkit (SST) (No. SAND2016-3693PE). Sandia National Lab.(SNL-NM), Albuquerque, NM (United States).2. https://github.com/sstsimulator/sst-dumpi

Tracing Library

9

From Traces to Event Graphs

• We trace a non-deterministic application multiple times, capturing a record of communication events

• We convert the set of trace files to a graph-structured model of the inter-process communication that occurred during the execution—i.e., the event graph [3]

1. Rodrigues, A.F., Voskuilen, G.R., Hammond, S.D. and Hemmert, K.S., 2016. Structural Simulation Toolkit (SST) (No. SAND2016-3693PE). Sandia National Lab.(SNL-NM), Albuquerque, NM (United States).2. https://github.com/sstsimulator/sst-dumpi3. Kranzlmüller, D., 2000. Event graph analysis for debugging massively parallel programs.

Tracing Library

9

Event Graph Structure• Directed, acyclic graph (DAG) representing a

message-passing program execution [1]

1. Kranzlmüller, D., 2000. Event graph analysis for debugging massively parallel programs. 10

P1

P0

Event Graph Structure

Receive Event

Send Event

Barrier Events

• Directed, acyclic graph (DAG) representing a message-passing program execution [1]

• Vertices represent communication events(e.g., message sends, receives, and barriers)

1. Kranzlmüller, D., 2000. Event graph analysis for debugging massively parallel programs. 10

P1

P0

Event Graph Structure

Receive Event

Send Event

Barrier Events

P1

P0

• Directed, acyclic graph (DAG) representing a message-passing program execution [1]

• Vertices represent communication events(e.g., message sends, receives, and barriers)

• Edges represent “happens-before”[2] relationship between events

1. Kranzlmüller, D., 2000. Event graph analysis for debugging massively parallel programs.2. Lamport, L., 1978. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7), pp.558-565.

10

Event Graph Vertex Labels

Send Event

Receive Event

Event Type: ReceiveProcess ID: 0Wall Time: 3.1745455 (s)

Event Type: SendProcess ID: 1Wall Time: 2.12323 (s)

11


Send Event

Receive Event

Event Type: ReceiveProcess ID: 0Wall Time: 3.1745455 (s)Call-Stack: main à do_comm à non_det_recv à MPI_Recv

Event Type: SendProcess ID: 1Wall Time: 2.12323 (s)Call-Stack: main à do_comm à non_det_send à MPI_Send

11


Send Event

Receive Event

Event Type: ReceiveProcess ID: 0Wall Time: 3.1745455 (s)Call-Stack: main à do_comm à non_det_recv à MPI_Recv

Event Type: SendProcess ID: 1Wall Time: 2.12323 (s)Call-Stack: main à do_comm à non_det_send à MPI_Send

Call-stack labels link run-time non-determinism to root causes in source code

11

Summary of Event Graph Model

• Traces of MPI application → DAG• Vertices → communication events• Edges → happens-before orders• Vertex Labels:

§ Event type → What happened?§ Process ID → In which process did it happen?§ Timestamp → When did it happen?§ Call-Stack → Where in the code did it come from?

12


• Phase 2: Quantify cross-execution trends in non-deterministic communication via graph similarity

13

K ( , )

Graph Similarity via Graph Kernels

Similarity Score

14

G G’

K ( , )


Similarity Score

14

Intuition: K counts matching substructures• Matches between G and G’ increase score• Differences do not• Graph kernel K induces a metric → the graph kernel distance [1]• Formula: D i j = 𝐾 𝑖 𝑖 + 𝐾 𝑗 𝑗 − 2𝐾 𝑖 𝑗

G G’

K ( , )


Similarity Score

Intuition: K counts matching substructures• Matches between G and G’ increase score• Differences do not• Graph kernel K induces a metric → the graph kernel distance [1]• Formula: D i j = 𝐾 𝑖 𝑖 + 𝐾 𝑗 𝑗 − 2𝐾 𝑖 𝑗

141. Phillips, J.M. and Venkatasubramanian, S., 2011. A gentle introduction to the kernel distance. arXiv preprint arXiv:1103.1625.

G G’

K ( , )


Similarity Score

Intuition: K counts matching substructures• Matches between G and G’ increase score• Differences do not• Graph kernel K induces a metric → the graph kernel distance [1]

Formula: D(G, G′) = 𝐾 𝐺, 𝐺2 + 𝐾(𝐺, 𝐺′) − 2𝐾(𝐺, 𝐺2)

G G’

141. Phillips, J.M. and Venkatasubramanian, S., 2011. A gentle introduction to the kernel distance. arXiv preprint arXiv:1103.1625.

Quantifying Non-Determinism via Graph Kernel Distance

We evaluate the Weisfeiler-Lehman Subtree Pattern Kernel for quantifying dissimilarity between event graphs:

§ Demonstrated performance on other graph classification tasks [1]§ Scalability compared to other graph kernels§ Incorporation of arbitrary vertex label data

1. Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), pp.2539-2561.2. Yanardag, P. and Vishwanathan, S.V.N., 2015, August. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1365-1374). ACM.3. Ghosh, S., Das, N., Gonçalves, T., Quaresma, P. and Kundu, M., 2018. The journey of graph kernels through two decades. Computer Science Review, 27, pp.88-111.

15

We evaluate the Weisfeiler-Lehman Subtree Pattern Kernel for quantifying dissimilarity between event graphs:

§ Demonstrated performance on other graph classification tasks [1]§ Scalability compared to other graph kernels§ Incorporation of arbitrary vertex label data

1. Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), pp.2539-2561.2. Yanardag, P. and Vishwanathan, S.V.N., 2015, August. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1365-1374). ACM.3. Ghosh, S., Das, N., Gonçalves, T., Quaresma, P. and Kundu, M., 2018. The journey of graph kernels through two decades. Computer Science Review, 27, pp.88-111.

15

Quantifying Non-Determinism via Graph Kernel Distance

Kernel Distance Evaluation Methodology

• Construct event graphs for common communication patterns with:§ Controlled degree of non-determinism § Fixed amount of communication volume

• Hypothesis: As greater non-determinism is permitted in the runs, the graph kernel distances between the event graphs representing those runs will increase

17

Kernel Distance Evaluation Methodology

• Construct event graphs for common communication patterns with:§ Controlled degree of non-determinism § Fixed amount of communication volume

• Hypothesis: As greater non-determinism is permitted in the runs,the graph kernel distances between the event graphs representing those runs will increase

17

Deterministic

Run 1

Run 2

All receives guaranteed to occur in same order run-to-run

18

Deterministic

Run 1

Run 2


Partially Non-DeterministicSome receives guaranteed to

occur in same order

18

Deterministic

Run 1

Run 2


These messages are always received in same order


occur in same order

18

Deterministic

Run 1

Run 2


These messages are received in arbitrary order


occur in same order

18

Deterministic Fully Non-Deterministic

Run 1

Run 2


All receives occur in arbitrary order run-to-run


occur in same order

18


Run 1

Run 2

Partially Non-Deterministic

vs

Kern

el D

istan

ceHi

gher

==

Less

Sim

ilar 1.00

0.00

0.50

0.75

0.25

Comparing Runs via Kernel Distance

19


Run 1

Run 2


vs

Kern

el D

istan

ceHi

gher

==

Less

Sim

ilar 1.00

0.00

0.50

0.75

0.25

Comparing Runs via Kernel DistanceHeight of bar == graph kernel distance

19


Run 1

Run 2


vs

Kern

el D

istan

ceHi

gher

==

Less

Sim

ilar 1.00

0.00

0.50

0.75

0.25

Comparing Runs via Kernel Distance

19

Height of bar == graph kernel distance

Kernel Distance Trends Across Multiple Runs

Run 1

Run 2

Run N

… … …

• Previous example only measured kernel distance between two executions• But we want a statistical picture of the trend in kernel distance over time across many executions

= kernel distance between graphs i and j

= kernel distance from graph to itself, always 0

= redundant, due to symmetry

Kernel matrixw/r/t chosen graph kernel K

20

Kernel Distance Trends Across Multiple RunsKe

rnel

Dist

ance

High

er =

= Le

ss S

imila

r

Deterministic Comm. Pattern

Fully Non-DeterministicComm. Pattern

Partially Non-DeterministicComm. Pattern

Kernel distances indexed by pairs of runs

21


rnel

Dist

ance

High

er =

= Le

ss S

imila

r

• Non-deterministic communication patterns have a chance of manifesting the same message order at runtime.

• Motivates looking at distribution of kernel distancesbetween pairs of runs rather than individual distances




Kernel distances indexed by pairs of runs

22


rnel

Dist

ance

High

er =

= Le

ss S

imila

r




23

Evaluation Against a Realistic Communication Pattern

• We now evaluate against a communication pattern extracted from the AMG benchmark from the CORAL-2 Benchmark Suite

• Known to exhibit both receiver-side non-determinism and sender-side non-determinism [1]

• For our evaluation, we control proportion of non-deterministic communication volume by splitting ranks into two groups: § One group performs the actual non-deterministic AMG pattern § One group performs a determinized version of the pattern

24

Evaluation Against a Realistic Communication Pattern

• We now evaluate against a communication pattern extracted from the AMG benchmark from the CORAL-2 Benchmark Suite

• Known to exhibit both receiver-side non-determinism and sender-side non-determinism [1]

• For our evaluation, we control proportion of non-deterministic communication volume by splitting ranks into two groups: § One group performs the actual non-deterministic AMG pattern § One group performs a determinized version of the pattern

241. Cappello, F., Guermouche, A. and Snir, M., 2010, August. On communication determinism in parallel HPC applications. In 2010 Proceedings of 19th International Conference on Computer Communications and Networks (pp. 1-8). IEEE.

Kernel Distance Between Runs of Sequoia-AMGKe

rnel

Dist

ance

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

16-process runs 32-process runs

Graph Kernel Used: Weisfeiler-Lehman Subtree Kernel [1]

1. Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K. and Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12(Sep), pp.2539-2561.

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

High

er =

= Ru

ns L

ess S

imila

r to

Each

Oth

er

Hypothesis: Average kernel distance increases as fraction of non-deterministic communication increases

25

Kern

el D

istan

ce

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

16-process runs 32-process runs



70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

High

er =

= Ru

ns L

ess S

imila

r to

Each

Oth

er


25

Kernel Distance Between Runs of Sequoia-AMG

Kern

el D

istan

ce

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

16-process runs



High

er =

= Ru

ns L

ess S

imila

r to

Each

Oth

er


26

70% Det.30% ND

50% Det.50% ND

30% Det.70% ND

64-process runs

Kernel Distance Between Runs of Sequoia-AMG

Lessons Learned

• Graph kernel distance is a useful proxy for degree of non-determinism in a communication pattern

• This holds for both:§ Simple communication patterns with only receiver-side

non-determinism (i.e., reduce example)§ More complex patterns with mixed receiver/sender-side

non-determinism (i.e., AMG probe example)

27

Lessons Learned

• Graph kernel distance is a useful proxy for degree of non-determinism in a communication pattern

• This holds for both:§ Simple communication patterns with only receiver-side

non-determinism (e.g., naïve reduce pattern)§ More complex patterns with mixed receiver/sender-side

non-determinism (e.g., Sequoia-AMG Iprobe pattern)

27


• Phase 3: Detect periods of anomalous execution dissimilarity and localize potential root causes

28

Applying our Workflow to miniAMR

• Adaptive mesh refinement (AMR) code from the MatevoBenchmark Suite [1]

• We target miniAMR based on the following criteria:§ MPI application exhibiting communication non-determinism§ Root cause of non-determinism known a priori

1. Heroux, M.A., Doerfler, D.W., Crozier, P.S., Willenbring, J.M., Edwards, H.C., Williams, A., Rajan, M., Keiter, E.R., Thornquist, H.K. and Numrich, R.W., 2009. Improving performance via mini-applications. Sandia National Laboratories, Tech. Rep. SAND2009-5574, 3. 29

Event Graph Slicing

. . .

Run 001

Run 002

Run N

Time

Compute graph kernel similarities for this slice

N x N matrix of pairwise kernel distances

vs.Run 1, Slice 1

Run 2, Slice 1

30

Distributions of Kernel Distances

. . .

Run 001

Run 002

Run N

Time Time

Kern

el D

istan

ce


30


Kernel Distance Trends Over Time

. . .

Run 001

Run 002

Run N

Time Time


Kern

el D

istan

ce

30


Kernel Distance Trends Over Time

. . .

Run 001

Run 002

Run N

Time Time


Kern

el D

istan

ce

30


barrier

Linking Observed Non-Determinism to Root CausesRun 1

Run 2

Run N

…

A set of N event graphs modeling N runs of a

non-deterministic application

31


Run 2

Run N

…



A time-series of distributions of graph kernel distances between corresponding subgraphs

from pairs of runs

Time

Kern

el D

istan

ce

31


Run 2

Run N

…




from pairs of runs

Time

Kern

el D

istan

ce

Largest kernel distance observed here

31


Run 2

Run N

…




from pairs of runs

Time

Kern

el D

istan

ce


Extract subgraphs associated with largest kernel distances

31


Run 2

Run N

…

v7,0 : main à do_comm à det_send à MPI_Sendv6,0 : main à do_comm à det_send à MPI_Send

…v0,0 : main à do_comm à non_det_recv à MPI_Recv


Extract call-stacks from constituent vertices


Time

Kern

el D

istan

ce


31




Run 2

Run N

…

v7,0 : main à do_comm à det_send à MPI_Sendv6,0 : main à do_comm à det_send à MPI_Send



Extract call-stacks from constituent vertices


Time

Kern

el D

istan

ce


Distribution of call-stacks describes likely locations of non-determinism root causes

31



Kernel Distance Time Series for miniAMR

Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 9

Kernel distance distributionsfrom first 10 slices

32


Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 99

Kernel distance distributionsfrom first 100 slices

33


Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 1004

Kernel distance distributionsfrom all 1005 slices

34

Identifying Anomalous Slices

Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 1004

Alternating periods of relatively high median kernel distance between executions

34

Identifying Anomalous Slices

Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 1004

Flag slices as anomalous when median kernel distance increases relative to previous slice

34

Extract call stack labels from subgraphs within the slice:

Linking to Potential Root Causes

Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 1004

Run 1 Slice

Subgraph

Run 200 Slice

Subgraph

… …

35

Extract call stack labels from subgraphs within the slice:

Linking to Potential Root Causes

Slice Index

Kern

el D

istan

ce (H

ighe

r ==

Runs

Les

s Sim

ilar)

0 1004

main, driver, refine, load_balance

Run 1 Slice

Subgraph

Run 200 Slice

Subgraph

… …

Call Stack Count

main, driver, refine, comm_refine

main, driver, refine, load_balance, load_balance_move_blocks

main, driver, refine, redistribute_blocks, move_blocks, comm_parent_proc

… …

54

90

224

95

35

Frequency of call-stacks across all flagged slices

36

Frequency of call-stacks across all flagged slices• Call stacks related to mesh refinement

and load-balancing dominate

36

Conclusion

• Our workflow identifies potential root causes of non-determinism in HPC applications by:§ Building event graph models of executions§ Quantifying trends in non-determinism via graph kernel distance§ Linking runtime non-determinism to root causes

• Demonstrated viability against:§ Isolated communication patterns (Sequoia-AMG)§ Representative mini-apps (miniAMR)

• Future Work:§ Target full-fledged production AMR applications (e.g., Enzo)

37

Conclusion

• Our workflow identifies potential root causes of non-determinism in HPC applications by:§ Building event graph models of executions§ Quantifying trends in non-determinism via graph kernel distance§ Linking runtime non-determinism to root causes

• Demonstrated viability against:§ Isolated communication patterns (CORAL-2 AMG)§ Representative mini-app (miniAMR)

• Future Work:§ Target full-fledged production AMR applications (e.g., Enzo)

37

Questions?

modeling non-determinism in hpc applications · 2020. 9. 23. · non-determinism and correctness in...

Documents