david beckingsale, olga pearce, and todd gamblin...performance variability is data-dependent • the...

1
Apollo: Lightweight Models for Dynamically Tuning Data-Dependent Code Frameworks like RAJA (http://github.com/LLNL/RAJA) allow developers to map loops to different programming models independent of scientific code. David Beckingsale, Olga Pearce, and Todd Gamblin Lawrence Livermore National Laboratory, Livermore, CA, USA RAJA provides a way to map applications kernels to different programming models LLNL-POST-697252 This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Apollo builds decision-models to tune parameters forall< EXECUTION_POLICY >(0, N, [=] (Index_type i){ y[ i ] += a * x[ i ]; } Execution policies let us chose where to execute each kernel, but the programming framework can’t tell us which is fastest! Using supervised learning in an offline training phase, we build a classifier that directly predicts the fastest parameter value for each kernel invocation. The application is run multiple times with various input problems and execution policies to generate a training data set. Apollo’s Python framework process this data and learns a decision tree model. We then generate a C++ decision model that can be evaluated at runtime to dynamically tune the execution policies an application is using. Performance variability is data-dependent The performance of numerical physics kernels in scientific applications depends strongly on input dataset, data storage, access pattern, and work to thread mappings. We studied three applications: LULESH and CleverLeaf are hydrodynamics mini-applications, and ARES is a large-scale production code used for munitions modeling and inertial confinement fusion simulations. We see up to three orders of magnitude in the time taken to execute each kernel, depending on the input data and the way the kernel is executed. Auto-tuning with Apollo frees application developers from manually choosing parameters Existing auto-tuners rely on costly search procedures and over fit for specific inputs, whilst Apollo can learn how to select the fastest parameters based on application features. We are working to build general models that will allow us to predict the fastest parameters across both applications and hardware platforms, allowing us to learn models from a vast body of training data, then apply them to a wide range of application runs. We will also extend our work to support heterogeneous platforms, using Apollo to predict where to run kernels, in addition to other parameters. Figure 1: Performance variability per-kernel in LULESH, CleverLeaf and ARES RAJA Apollo Control Libraries Applica3on RAJA::forall<exec_policy>(IndexSet, [=](int i) { sigxx[i] = sigyy[i] = sigzz[i] = - p(i) - q(i); }); template <typename POLICY, typename LOOP> inline void forall(IndexSet iset, LOOP loop_body); Apollo Dynamically load control library Apollo Recorder Apollo Model (Generated) Chose policy based on runtime information loop=1, num_iterations=400, ... loop=3, num_iterations=125, ... loop=N, num_iterations=376, ... Backends Open MP ® Apollo models are dynamically loadable Apollo loads the compiled C++ decision model at application startup. Execution policies are template parameters that control which programming model backend is selected. Apollo instantiates each policy type to allow dynamic backend selection. The decision model sits between RAJA and the execution policy back ends, dynamically selecting an execution policy type based on the features of the kernel that is about to be executed. Apollo selects the fastest parameter up to 98% of the time Ares Cleverleaf Lulesh Application 0.0 0.2 0.4 0.6 0.8 1.0 Accuracy We build one model per- application, based on training data generated from three input problems and 5 problem sizes. Using 10-fold cross-validation, the mean accuracy of these models is up to 0.98 Figure 3: Predictive accuracy of each Apollo models. Sedov 0 1 2 3 4 5 6 Speedup LULESH Sod Triple Point Sedov Input Deck CleverLeaf Sedov Jet Hotspot Ares Policy Default Dynamically tuning policies at runtime provides speedups of up to 4.8x Figure 5: Online speedups for multiple input problems in each application. These speedups are problem-dependent, since the fastest policy choices dependent on the kernel invocation parameters. The data sizes encountered in the Triple Point problem in CleverLeaf benefit most, with a speedup of 4.8x. 16 32 64 128 256 Processing Cores 0 100 200 300 400 500 600 700 800 900 Runtime (s) Policy Default MPI ranks make independent tuning decisions Figure 6: Parallel runtimes for the Hotspot problem in ARES. We see speedups on up to 256 processor cores. Applica'on Training Data Apollo Model Execu'on Apollo Model Generator Processing training data generated by the Apollo Recorder control library Build decision tree model that can predict the best parameter value for a given sample. Generate tuning control library from decision tree. if feature < value parameter = p if feature < value if feature < value

Upload: others

Post on 10-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: David Beckingsale, Olga Pearce, and Todd Gamblin...Performance variability is data-dependent • The performance of numerical physics kernels in scientific applications depends strongly

Apollo: Lightweight Models for Dynamically TuningData-Dependent Code

• Frameworks like RAJA (http://github.com/LLNL/RAJA) allowdevelopers to map loops to different programming modelsindependent of scientific code.

David Beckingsale, Olga Pearce, and Todd GamblinLawrence Livermore National Laboratory, Livermore, CA, USA

RAJA provides a way to map applications kernels to different

programming models

LLNL-POST-697252This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Apollo builds decision-models to tune parameters

forall< EXECUTION_POLICY >(0, N, [=] (Index_type i){y[ i ] += a * x[ i ];

}

• Execution policies let us chose where to execute each kernel, butthe programming framework can’t tell us which is fastest!

• Using supervised learning in an offline training phase, webuild a classifier that directly predicts the fastestparameter value for each kernel invocation.

• The application is run multiple times with various inputproblems and execution policies to generate a trainingdata set.

• Apollo’s Python framework process this data and learnsa decision tree model.

• We then generate a C++ decision model that can beevaluated at runtime to dynamically tune theexecution policies an application is using.

Performance variability is data-dependent• The performance of numerical physics kernels in scientific

applications depends strongly on input dataset, data storage,access pattern, and work to thread mappings.

• We studied three applications: LULESH and CleverLeaf arehydrodynamics mini-applications, and ARES is a large-scaleproduction code used for munitions modeling and inertialconfinement fusion simulations.

• We see up to three orders of magnitude in the time taken toexecute each kernel, depending on the input data and the waythe kernel is executed.

Auto-tuning with Apollo frees application developers from manually choosing parameters

• Existing auto-tuners rely on costly search procedures and over fit for specific inputs, whilstApollo can learn how to select the fastest parameters based on application features.

• We are working to build general models that will allow us to predict the fastest parametersacross both applications and hardware platforms, allowing us to learn models from a vastbody of training data, then apply them to a wide range of application runs.

• We will also extend our work to support heterogeneous platforms, using Apollo to predictwhere to run kernels, in addition to other parameters.

Figure 1: Performance variability per-kernel in LULESH, CleverLeaf and ARES

RAJA

ApolloControlLibraries

Applica3onRAJA::forall<exec_policy>(IndexSet, [=](int i) { sigxx[i] = sigyy[i] = sigzz[i] = - p(i) - q(i);});

template <typename POLICY, typename LOOP>inline voidforall(IndexSet iset, LOOP loop_body);

Apollo• Dynamically load control library

ApolloRecorder ApolloModel(Generated)• Chose policy based on

runtime informationloop=1,num_iterations=400,...loop=3,num_iterations=125,...loop=N,num_iterations=376,...

Backends

OpenMP®

Apollo models are dynamically loadable• Apollo loads the compiled C++ decision model at application startup.

• Execution policies are template parameters that control which programmingmodel backend is selected. Apollo instantiates each policy type to allowdynamic backend selection.

• The decision model sits between RAJA and the execution policy back ends,dynamically selecting an execution policy type based on the features of thekernel that is about to be executed.

Apollo selects the fastest parameter up to 98% of the time

Ares

Cleverlea

f

Lules

h

Application

0.0

0.2

0.4

0.6

0.8

1.0

Acc

urac

y

• We build one model per-application, based on trainingdata generated from three inputproblems and 5 problem sizes.

• Using 10-fold cross-validation,the mean accuracy of thesemodels is up to 0.98

Figure 3: Predictive accuracy of each Apollo models.

Sedo

v0

1

2

3

4

5

6

Spee

dup

LULESH

Sod

Tripl

ePo

int

Sedo

v

Input Deck

CleverLeaf

Sedo

vJet

Hotspot

Ares

Policy Default

Dynamically tuning policies at runtime provides speedups of up to 4.8x

Figure 5: Online speedups for multiple input problems in each application.

• These speedups are problem-dependent, since the fastest policychoices dependent on the kernel invocation parameters.

• The data sizes encountered in the Triple Point problem in CleverLeafbenefit most, with a speedup of 4.8x.

16 32 64 128

256

Processing Cores

0100200300400500600700800900

Run

tim

e(s

)

Policy Default

MPI ranks make independent tuning decisions

Figure 6: Parallel runtimes for the Hotspot problem in ARES. We see speedups on up to 256 processor cores.

Applica'on

FeatureVectorinCSVfilesFeatureVectorinCSVfilesFeatureVectorinCSVfilesFeatureVectorinCSVfilesFeatureVectorinCSVfilesFeatureVectorinCSVfilesTrainingData

ApolloModel

Execu'on ApolloModelGenerator• Processing training data

generated by the Apollo Recorder control library

• Build decision tree model that can predict the best parameter value for a given sample.

• Generate tuning control library from decision tree.

iffeature<value

parameter=p

iffeature<valueiffeature<value