
Reusing ML tools and approaches for HEP - CERN record 2105484, LHCb-TALK-2015..., 2015-11-24

Reusing ML tools and approaches for HEP

Andrey Ustyuzhanin (1,2,3,4), on behalf of the YSDA LHCb team

(1) YSDA, (2) NRC "Kurchatov Institute", (3) HSE, (4) YDF


Yandex


Yandex School of Data Analysis

› non-commercial private university, https://yandexdataschool.com
› 450+ students graduated since 2007
› strong educational focus on Data & Computer Science
› 25% of our students have a physics background
› associate member of LHCb since Dec 2014
› interested in improving physics results with Machine Learning
› organizes a bi-yearly international Machine Learning conference, https://yandexdataschool.com/conference/


Overview

› Motivation
› [Possible] way to go
  › Technological part
  › Sociological part
  › “yak shaving”
› HEP Challenges
› Some Results
› Conclusion


Case for ML for HEP

Higgs ML Challenge (http://bit.ly/1HysVUN, p28)


Space is really big. You just won’t believe how vastly hugely mind-bogglingly big it is. I mean you may think it’s a long way down the road to the chemist, but that’s just peanuts to space.

Douglas Adams, Hitchhiker’s Guide to the Galaxy


ML out there

› Generic ML approaches
  › Classification
  › Clustering
  › Regression
  › Dimensionality Reduction
  › Feature Engineering
  › Model Selection (AutoML)
› Languages
  › R, Matlab, Python, Lua, Julia, …
› Tools
  › https://github.com/josephmisiti/awesome-machine-learning
  › https://mloss.org/about/
  › https://www.kaggle.com/wiki/Software
  › …


Popular ML problems in HEP

› Classification
  › Binary, Multi-class
› Model Selection
› Regression
› Pattern recognition
› …
› http://bit.ly/1M0orCQ


Ingredients

› Shared datasets
› Technology
  › data
  › data format
  › algorithms accessibility
› People you can learn from
  › #DSLHC reached its full capacity 3 weeks before the start!
  › Yandex School of Data Analysis joined LHCb in 2014
  › HEPML workshop in 2014
  › ALEPH workshop (http://yandexdataschool.github.io/aleph2015/)
› A bit of “yak shaving”
  › decomposition of problems into ‘independent’ pieces
  › algorithm execution transparency
  › computationally rich communication platform
  › consistency checks


Technology part

Reproducible Experiment Platform (REP)
machine learning toolbox for humans

› unified classifier wrappers for
  › Sklearn, XGBoost, uBoost, TMVA, ANN, Theanets, PyBrain, …
› pluggable quality metrics
› support for interactive plots
› parallelized grid search strategies
› parallel training of classifiers
› classification/regression reports
› meta-algorithms (“REP Lego”)

https://github.com/yandex/rep/


Scenarios

Installation

› manually (pip install …, many dependencies!)
› virtual env
› docker

Deployment ‘modes’

› laptop
› remote server
› remote cluster
› cloud platform (docker-machine)


Accessing ROOT files (root_numpy)

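import numpy
import root_numpy
from rootpy.io import root_open

# read selected branches (expressions are allowed) from a ROOT tree into a numpy array
data = root_numpy.root2array("toy_datasets/random.root",
                             branches=['X1', 'X2', 'sin(X1)*exp(X2)'],
                             treename='tree', selection='X1 > 0')

# write the array back to a new ROOT file
root_numpy.array2root(data, filename='output.root',
                      treename='tree', mode='recreate')

# append a new column to an existing tree
with root_open('test.root', mode='a') as myfile:
    c = numpy.array(mycolumn, dtype=[('new', 'f8')])
    root_numpy.array2tree(c, tree=myfile.tree)
    myfile.write()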


Training a classifier (SKlearn GBDT)

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in 2015-era scikit-learn
from sklearn.ensemble import GradientBoostingClassifier
from rep.estimators import SklearnClassifier

train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, train_size=0.5)

# use the first 26 columns as training features
variables = list(data.columns[:26])
sk = SklearnClassifier(GradientBoostingClassifier(), features=variables)

sk.fit(train_data, train_labels)


Training a classifier (TMVA)

from rep.estimators import TMVAClassifier

train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, train_size=0.5)

variables = list(data.columns[:26])
tmva = TMVAClassifier(method='kBDT', NTrees=50,
                      Shrinkage=0.05, features=variables)

tmva.fit(train_data, train_labels)

https://github.com/yandex/rep/blob/develop/howto/01-howto-Classifiers.ipynb


Making predictions

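from sklearn.metrics import roc_auc_score

prob = tmva.predict_proba(test_data)
# column 1 holds the probability of the signal class
print('ROC AUC: {}'.format(roc_auc_score(test_labels, prob[:, 1])))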


REP “Lego”

from sklearn.ensemble import AdaBoostClassifier

# Construct AdaBoost with TMVA as base estimator
base_tmva = TMVAClassifier(method='kBDT', NTrees=15, Shrinkage=0.05)

ada_tmva = SklearnClassifier(
    AdaBoostClassifier(base_estimator=base_tmva, n_estimators=5),
    features=variables)

ada_tmva.fit(train_data, train_labels)


Meta-ML, Factories

factory = ClassifiersFactory()
factory.add_classifier('tmva', tmva)
factory.add_classifier('ada', ada)
factory['xgb'] = XGBoostClassifier(features=variables)

# training
factory.fit(train_data, train_labels,
            features=variables, parallel_profile='IPC')

# predict
factory.predict_proba(some_data, parallel_profile='IPC')

https://github.com/yandex/rep/blob/develop/howto/02-howto-Factory.ipynb


Visual Comparison

from rep.report.metrics import RocAuc

report = factory.test_on(test_data, test_labels)
learning_curve = report.learning_curve(RocAuc(),
                                       metric_label='ROC AUC', steps=1)

learning_curve.plot()


More links

› Neural Networks: https://github.com/yandex/rep/blob/develop/howto/06-howto-neural-nets.ipynb

› add new estimator: https://github.com/yandex/rep/wiki/Contributing-new-estimator

› add new metric: https://github.com/yandex/rep/wiki/Contributing-new-metrics

› and other stuff…

Installation: http://bit.ly/rep_install_mac


“Sociological” part

Cooperation + Competition = Coopetition


[Old] cooperation model


Coopetition model


Education & communication

› ALEPH 2015, http://yandexdataschool.github.io/aleph2015/
› Heavy Flavour Data Mining, Feb 2016, Zurich, https://indico.cern.ch/event/433556/
› DS @ LHC 2016
› Connecting the dots workshop, http://bit.ly/connecting_dots_2015
› ML HEP summer school, http://www.hse.ru/mlhep2015/
› Competitions
  › Kaggle
  › OpenML
  › CodaLab
  › …


“Yak shaving” part

Any apparently useless activity which, by allowing you to overcome intermediate difficulties, allows you to solve a larger problem. https://en.wiktionary.org/wiki/Talk:yak_shaving

› How to share data?
› How to share knowledge?
› How to reproduce results?
› How to share the code?
› How to share the environment?
› What are communication practices?
› How to make sure the whole analysis stays in place with every commit?


Common practices

› dropbox
› use github
› continuous integration (travis)
› wiki
› ticketing
› agile


One more step to lower the borders

› Algorithm containerization
› Cross-language experience exchange
› Analysis preservation
› Docker-like technology, containerization


Everware

› Get familiar with Docker
› Try it: https://everware.rep.school.yandex.net
› Examples
  › https://github.com/everware/everware-dimuon-example
  › https://github.com/arogozhnikov/hep_ml
  › https://github.com/yandexdataschool/manchester-cp-asymmetry-tutorial
  › https://github.com/yandexdataschool/flavours-of-physics-start
› Discussion about integration is ongoing with
  › CERN open data portal
  › http://openml.org

https://github.com/everware/everware


Selected HEP Challenges

› LHCb Topological Triggers
› LHCb Flavour Tagging
› Rare decay search (τ → 3μ)
› COMET track reconstruction
› Data Placement Strategy
› Online Data Processing Anomaly Detection


LHCb Topological Triggers, Data

› Sample: one proton-proton bunch collision (40 MHz)

› An event consists of secondary vertices (SVRs), where particles are produced

› Features: reconstructed physical characteristics of the SVR and its decay products (momentum, mass, angles, impact parameter)


ML problem statement

› Output rate is fixed, thus the false positive rate (FPR) for events is fixed

› Goal is to improve efficiency for each type of signal event

› We improve the true positive rate (TPR) at fixed FPR for events
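
The figure of merit above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `tpr_at_fixed_fpr`, the toy scores, and the target FPR of 0.2 are hypothetical and are not taken from the trigger study itself.

```python
def tpr_at_fixed_fpr(signal_scores, background_scores, target_fpr):
    """Return (threshold, tpr, fpr) for the loosest cut with FPR <= target_fpr.

    Events are accepted when their score is strictly above the threshold.
    """
    bg_sorted = sorted(background_scores, reverse=True)
    # number of background events the fixed output rate lets through
    n_allowed = int(target_fpr * len(bg_sorted))
    if n_allowed >= len(bg_sorted):
        threshold = float('-inf')
    else:
        threshold = bg_sorted[n_allowed]
    tpr = sum(s > threshold for s in signal_scores) / len(signal_scores)
    fpr = sum(b > threshold for b in background_scores) / len(background_scores)
    return threshold, tpr, fpr


# toy classifier outputs (hypothetical numbers)
background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
signal = [0.35, 0.65, 0.85, 0.92, 0.97]

threshold, tpr, fpr = tpr_at_fixed_fpr(signal, background, target_fpr=0.2)
print(threshold, tpr, fpr)
```

Comparing two classifiers then amounts to comparing their TPR values at the same `target_fpr`, separately for each signal type.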


Random forest for SVRs selection

› Train random forest (RF) on SVRs
› Select top-1, top-2 SVRs by RF predictions for each signal event
› Train MatrixNet on selected SVRs
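
The selection step in the middle of this list could look roughly like the sketch below. It assumes one RF score per SVR and an event identifier per SVR; the helper `select_top_svrs` and the toy numbers are hypothetical, and the RF and MatrixNet training steps are left out.

```python
from collections import defaultdict


def select_top_svrs(event_ids, rf_scores, k=2):
    """Return the indices of the k highest-scoring SVRs within each event."""
    by_event = defaultdict(list)
    for index, (event, score) in enumerate(zip(event_ids, rf_scores)):
        by_event[event].append((score, index))
    selected = []
    for candidates in by_event.values():
        # best RF prediction first
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        selected.extend(index for _, index in candidates[:k])
    return sorted(selected)


# six SVRs belonging to three events (toy values)
event_ids = [0, 0, 0, 1, 1, 2]
rf_scores = [0.2, 0.9, 0.5, 0.4, 0.7, 0.1]

top1 = select_top_svrs(event_ids, rf_scores, k=1)
top2 = select_top_svrs(event_ids, rf_scores, k=2)
print(top1, top2)
```

The returned indices would then be used to slice the SVR table that the second-stage classifier is trained on.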


Bonsai BDT, online processing

› Feature hashing using bins before training (equivalent to feature binarization in MatrixNet)

› Converting decision trees to an N-dimensional lookup table
› Table size is limited by RAM
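
A minimal sketch of the lookup-table idea, assuming two features with hand-picked bin edges. The function `toy_ensemble` is a hypothetical stand-in for a tree ensemble trained on binned features; none of the names or numbers come from the LHCb implementation.

```python
from bisect import bisect_right
from itertools import product


def to_bins(features, bin_edges):
    """Map each feature value to the index of the bin it falls into."""
    return tuple(bisect_right(edges, value)
                 for value, edges in zip(features, bin_edges))


def build_lookup_table(predict, bin_edges):
    """Evaluate `predict` once for every combination of bin indices."""
    sizes = [len(edges) + 1 for edges in bin_edges]
    return {bins: predict(bins)
            for bins in product(*(range(n) for n in sizes))}


def toy_ensemble(bins):
    """Hypothetical two-'tree' model that only sees bin indices."""
    momentum_bin, ip_bin = bins
    score = 0.5 if momentum_bin >= 2 else -0.5   # 'tree' 1
    score += 0.3 if ip_bin >= 1 else -0.3        # 'tree' 2
    return score


bin_edges = [[1.0, 2.0, 5.0], [0.1, 0.5]]   # 4 x 3 bins
table = build_lookup_table(toy_ensemble, bin_edges)

event = (3.2, 0.05)                          # raw feature values
online_score = table[to_bins(event, bin_edges)]
print(len(table), online_score)
```

The table has one entry per combination of bins, so its size is the product of the per-feature bin counts, which is why the number of bins and features is bounded by available memory.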


Post-pruning

› Train MatrixNet with several thousand trees, then reduce the number of trees to about a hundred by

  › greedy selection of trees from the initial ensemble to minimize a modified loss function (exploss for BG and logloss for SIG)

  › updating leaves
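
The greedy selection step might be sketched as follows. Several things here are assumptions rather than the authors' method: trees are represented only by their per-event outputs, each tree is picked at most once, the loss is averaged over events, and the leaf-update step is omitted.

```python
import math


def modified_loss(scores, labels):
    """Exponential loss on background (label 0), logistic loss on signal (label 1)."""
    total = 0.0
    for score, label in zip(scores, labels):
        if label == 1:
            total += math.log1p(math.exp(-score))
        else:
            total += math.exp(score)
    return total / len(labels)


def greedy_prune(tree_outputs, labels, n_keep):
    """Pick n_keep trees one at a time, each minimizing the loss of the running sum."""
    current = [0.0] * len(labels)
    remaining = list(range(len(tree_outputs)))
    chosen = []
    for _ in range(n_keep):
        def loss_with(tree):
            candidate = [c + t for c, t in zip(current, tree_outputs[tree])]
            return modified_loss(candidate, labels)

        best = min(remaining, key=loss_with)
        current = [c + t for c, t in zip(current, tree_outputs[best])]
        chosen.append(best)
        remaining.remove(best)
    return chosen, modified_loss(current, labels)


labels = [1, 1, 0, 0]
tree_outputs = [                 # one row per tree, one column per event (toy values)
    [0.1, -0.1, 0.1, -0.1],      # uninformative
    [1.0, 1.0, -1.0, -1.0],      # separates signal from background
    [0.5, 0.2, -0.3, 0.1],       # weakly informative
    [-1.0, -1.0, 1.0, 1.0],      # anti-correlated with the labels
]
chosen, final_loss = greedy_prune(tree_outputs, labels, n_keep=2)
print(chosen, final_loss)
```

With real ensembles the inner loop would be vectorized, since it evaluates every remaining tree at every step.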


Results


https://github.com/yandexdataschool/LHCb-topo-trigger


Other Results

› LHCb Triggers: up to 60% relative efficiency increase, https://github.com/yandexdataschool/LHCb-topo-trigger

› Rare decay search (τ → 3μ): 6% significance increase in upper limit estimation

› COMET track reconstruction: ROC AUC improved from 88.3% to 99.9%, https://inclass.kaggle.com/c/comet-track-recognition-mlhep-2015


ML-inspired tools for HEP (HEP_ML)

› UGBoost http://bit.ly/uBoost
› GBReweighting http://bit.ly/GBReweight
› https://github.com/arogozhnikov/hep_ml


Reweighting, motivation

The reweighting use case in particle physics is to modify the output of a Monte Carlo (MC) simulation to reduce its disagreement with real data (RD).

Let’s assign new weights to MC such that the MC and RD distributions coincide.


Traditional approach

$$\text{multiplier}_{\text{bin}} = \frac{w_{\text{target, bin}}}{w_{\text{original, bin}}}$$

Caveat Emptor!

› no good test for N-dimensional case

› becomes unstable as bins get small number of events

› no good reason to compare distributions per se (complex N-D objects)
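
For reference, the histogram-based multiplier can be sketched in one dimension as below. The helper names, the single bin edge, and the toy samples are hypothetical; MC events are assumed to start with unit weights, and empty MC bins are given a multiplier of 0 as an arbitrary convention.

```python
from bisect import bisect_right


def histogram(values, weights, edges):
    """Sum of weights per bin; values beyond the edges land in the outer bins."""
    totals = [0.0] * (len(edges) + 1)
    for value, weight in zip(values, weights):
        totals[bisect_right(edges, value)] += weight
    return totals


def bin_reweight(original, target, edges):
    """New per-event weights for `original` (MC) so that its histogram matches `target` (RD)."""
    w_original = histogram(original, [1.0] * len(original), edges)
    w_target = histogram(target, [1.0] * len(target), edges)
    # multiplier_bin = w_target,bin / w_original,bin
    multipliers = [t / o if o > 0 else 0.0
                   for t, o in zip(w_target, w_original)]
    return [multipliers[bisect_right(edges, value)] for value in original]


mc = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
rd = [0.15, 0.55, 0.65, 0.8, 0.85, 0.95]
edges = [0.5]

new_weights = bin_reweight(mc, rd, edges)
print(histogram(mc, new_weights, edges))   # matches the RD histogram
```

The instability listed in the caveats shows up directly in the division: with many dimensions or fine binning, `w_original` per bin is built from very few events, so the multipliers fluctuate strongly.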


BDT-reweighter intuition

Can my classifier spot any difference between MC and RD?

$$\chi^2 = \sum_{\text{bin}} \frac{(w_{\text{bin, original}} - w_{\text{bin, target}})^2}{w_{\text{bin, original}} + w_{\text{bin, target}}}$$


BDT-reweighter approach

1. build a shallow tree to maximize the symmetrized $\chi^2$

2. compute predictions for leaves:

$$\text{leaf\_pred} = \log \frac{w_{\text{leaf, target}}}{w_{\text{leaf, original}}}$$

3. reweight distributions:

$$w = \begin{cases} w, & \text{if event from target (RD) distribution} \\ w \times e^{\text{pred}}, & \text{if event from original (MC) distribution} \end{cases}$$

4. Repeat
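
The four steps can be mimicked with a toy one-dimensional sketch in plain Python. It is not the hep_ml implementation: the "shallow tree" is reduced to a single split chosen from a fixed grid, and `learning_rate`, `eps`, the number of rounds, and the toy samples are all assumptions made for illustration.

```python
import math
import random
from bisect import bisect_right


def leaf_weights(values, weights, split):
    """Total weight in the left (v < split) and right (v >= split) leaf."""
    left = sum(w for v, w in zip(values, weights) if v < split)
    right = sum(w for v, w in zip(values, weights) if v >= split)
    return left, right


def symmetrized_chi2(original, target):
    """Sum over leaves or bins of (w_orig - w_target)^2 / (w_orig + w_target)."""
    return sum((o - t) ** 2 / (o + t)
               for o, t in zip(original, target) if o + t > 0)


def reweight(mc, rd, splits, n_rounds=30, learning_rate=0.5, eps=1e-3):
    """Return MC weights after n_rounds of single-split reweighting."""
    mc_w = [len(rd) / len(mc)] * len(mc)
    rd_w = [1.0] * len(rd)
    for _ in range(n_rounds):
        # 1. depth-1 'tree': the split with the largest symmetrized chi2
        split = max(splits, key=lambda s: symmetrized_chi2(
            leaf_weights(mc, mc_w, s), leaf_weights(rd, rd_w, s)))
        # 2. leaf predictions: damped, regularized log of the weight ratio
        preds = [learning_rate * math.log((t + eps) / (o + eps))
                 for o, t in zip(leaf_weights(mc, mc_w, split),
                                 leaf_weights(rd, rd_w, split))]
        # 3. only MC weights are multiplied by exp(pred); RD weights stay fixed
        mc_w = [w * math.exp(preds[0] if v < split else preds[1])
                for v, w in zip(mc, mc_w)]
        # 4. repeat
    return mc_w


def binned(values, weights, edges):
    totals = [0.0] * (len(edges) + 1)
    for v, w in zip(values, weights):
        totals[bisect_right(edges, v)] += w
    return totals


random.seed(0)
mc = [random.random() for _ in range(300)]              # toy MC: uniform
rd = [math.sqrt(random.random()) for _ in range(300)]   # toy RD: rising density
splits = [i / 20 for i in range(1, 20)]
edges = [0.25, 0.5, 0.75]
ones = [1.0] * 300

weights = reweight(mc, rd, splits)
chi2_before = symmetrized_chi2(binned(mc, ones, edges), binned(rd, ones, edges))
chi2_after = symmetrized_chi2(binned(mc, weights, edges), binned(rd, ones, edges))
print(chi2_before, chi2_after)
```

A real reweighter grows multi-leaf trees over many features, which is what lets it handle the N-dimensional case where histogram bins run out of events.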


Result illustration

[Figure: distributions before reweighting and after reweighting]


Results

Two major problems addressed by ML: distribution comparison and reweighting

Things to remember:

› check distributions with the classifier used in the analysis

› check the reweighter on a holdout
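
One way to read these two reminders together is sketched below: score held-out MC and RD events with a classifier and compute a weighted ROC AUC, where a value near 0.5 means the classifier can no longer tell the samples apart. Everything in the example is a toy assumption: the "classifier output" is the feature itself, and the MC weights are the exact density ratio, standing in for the output of a reweighter trained on a separate sample.

```python
import math
import random


def weighted_roc_auc(mc_scores, mc_weights, rd_scores, rd_weights):
    """Weighted probability that a random RD event scores above a random MC event."""
    numerator = 0.0
    for rd_score, rd_w in zip(rd_scores, rd_weights):
        for mc_score, mc_w in zip(mc_scores, mc_weights):
            if rd_score > mc_score:
                numerator += rd_w * mc_w
            elif rd_score == mc_score:
                numerator += 0.5 * rd_w * mc_w
    return numerator / (sum(rd_weights) * sum(mc_weights))


random.seed(1)
holdout_mc = [random.random() for _ in range(400)]              # toy MC: uniform
holdout_rd = [math.sqrt(random.random()) for _ in range(400)]   # toy RD: density 2x
unit = [1.0] * 400

# stand-in for reweighter predictions on the holdout (exact ratio, toy data only)
mc_weights = [2.0 * x for x in holdout_mc]

auc_before = weighted_roc_auc(holdout_mc, unit, holdout_rd, unit)
auc_after = weighted_roc_auc(holdout_mc, mc_weights, holdout_rd, unit)
print(auc_before, auc_after)
```

Running the same check on the training sample instead of a holdout would hide overfitting of the reweighter, which is the point of the second reminder.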


Instructions

from hep_ml.reweight import GBReweighter

gb = GBReweighter()
gb.fit(mc_data, real_data, target_weight=real_data_sweights)

gb.predict_weights(mc_other_channel)

› https://arogozhnikov.github.io/2015/10/09/gradient-boosted-reweighter.html


Outro

› [Open] Data
› Jupyter, REP to improve computational communications
› Things to improve team coopetition
  › Continuous Integration
  › Everware
› Communication channels
  › ALEPH Workshop, Dec 2015
  › Heavy Flavour Data Mining Workshop, Feb 2016
  › DS @ LHC 2016
  › MLHEP summer school
  › ML Challenges platforms


Make Science, Not War


Thank you!