j. wilkinson

Brief intro for using ML to study Λc

Posted on 07-Dec-2021

TRANSCRIPT

Page 1:

Brief intro for using ML to study Λc
J. Wilkinson

Page 2: Reconstruction of charmed hadrons in ALICE

● Strategy: full reconstruction of hadronic decays of charmed hadrons
→ Retains full kinematic information of the original particle
● Reconstruction relies on topological + particle identification (PID) selections to reduce combinatorial background
● Example: D0 meson: non-zero lifetime; decay vertex displaced from the interaction point (primary vertex)
→ Decay length, impact parameter, pointing angle (for example) can be used to select candidates
● PID at mid-rapidity using TOF (where available) + TPC, standard method with “nσ” PID
→ Strong separation of pions, kaons and protons in a wide momentum range
● “Standard” approach: apply a series of 1D cuts to topological and PID variables

Page 3: Λc baryon reconstruction in ALICE

● Λc in hadronic decay channels: Λc → pKπ, Λc → pK0S, and semileptonic (Λc → e+νeΛ)
● Λc → pKπ: BR 6.23%; three-body decay via multiple resonant + nonresonant channels
● Λc → pK0S: BR 1.58% (× 69.20% for K0S → π+π–). Reconstructed using the displaced K0S vertex topology
● Typical selections include:
→ PID of the decay daughters
→ Distance of closest approach & impact parameter of the decay daughters
→ Pointing angle of the reconstructed momentum w.r.t. the flight line
→ Decay lengths of the Λc and K0S
→ KF reconstruction

Page 4: Λc baryon reconstruction in ALICE

● Yield extraction via the invariant mass method
● For the Λc: assume a Gaussian function for the signal peak and an exponential for the background
● Fit the background in the sideband regions, extrapolate the function into the signal region, fit the signal peak, then integrate and subtract the background fit to extract the signal
● Statistical significance defined by the counting uncertainty of signal vs. background: S = s/√(s+b)
→ Improving the signal purity (signal/background ratio) reduces uncertainties
→ But overly tight cuts may increase statistical uncertainties
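The sideband extrapolation and significance above can be sketched on a toy mass spectrum. All numbers are illustrative, and a flat background is assumed here for simplicity instead of the exponential fit used in the analysis:

```python
import math
import random

random.seed(1)

# Toy invariant-mass spectrum (GeV/c^2): flat background + Gaussian peak.
M_LC, SIGMA = 2.286, 0.008
masses = [random.uniform(2.20, 2.38) for _ in range(20000)]   # background
masses += [random.gauss(M_LC, SIGMA) for _ in range(600)]     # signal

# Signal region: +-3 sigma around the peak; sidebands on either side.
lo, hi = M_LC - 3 * SIGMA, M_LC + 3 * SIGMA
sidebands = [(2.22, 2.25), (2.32, 2.35)]

n_sig_region = sum(lo < m < hi for m in masses)
n_sideband = sum(a < m < b for m in masses for a, b in sidebands)

# Extrapolate the sideband counts into the signal region and subtract.
sb_width = sum(b - a for a, b in sidebands)
b_est = n_sideband * (hi - lo) / sb_width
s_est = n_sig_region - b_est

significance = s_est / math.sqrt(s_est + b_est)
print(f"S = {s_est:.0f}, B = {b_est:.0f}, significance = {significance:.1f}")
```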

Page 5: “Standard” cut-based selection

- Compare distributions of signal and background in each variable of interest

Page 6: “Standard” cut-based selection

- Compare distributions of signal and background in each variable of interest

- Apply some set of rectangular cuts, usually per pT bin

- Aim: high signal efficiency, high background rejection

- Fine for “simple” problems, but doesn’t fully use the available information
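A minimal sketch of such rectangular cuts on toy candidates; the variable names and cut values here are hypothetical, not the analysis selections:

```python
import random

random.seed(2)

# Toy candidates: (decay_length [cm], cos_pointing_angle, is_signal).
# Signal tends to larger decay length and cos(pointing angle) closer to 1.
def make(n, is_signal):
    out = []
    for _ in range(n):
        if is_signal:
            dl = abs(random.gauss(0.02, 0.01))
            cpa = min(1.0, random.gauss(0.995, 0.005))
        else:
            dl = abs(random.gauss(0.005, 0.005))
            cpa = random.uniform(0.9, 1.0)
        out.append((dl, cpa, is_signal))
    return out

cands = make(5000, True) + make(50000, False)

# Rectangular 1D cuts, as tuned per pT bin (values hypothetical).
CUT_DL, CUT_CPA = 0.01, 0.99
passed = [(dl, cpa, s) for dl, cpa, s in cands if dl > CUT_DL and cpa > CUT_CPA]

eff = sum(s for *_, s in passed) / 5000
rej = 1 - sum(not s for *_, s in passed) / 50000
print(f"signal efficiency = {eff:.2f}, background rejection = {rej:.2f}")
```

Each variable is cut on independently here, which is exactly the limitation the slide points out: correlations between the variables are never used.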

Page 7: Shift toward multivariate techniques

● Historically, the Λc was a very challenging measurement: a rare probe with a high level of combinatorial background
→ Required the development of novel identification techniques in ALICE

● Bayesian Particle Identification [1]:

→ Probabilistic approach to combine signals from TPC and TOF in regions where species overlap; “most likely” species chosen as opposed to inclusive “nσ” cut

→ Prior probabilities for each species defined based on particle abundances in data

→ Increases purity of selected sample

● Toolkit for Multivariate Analysis [2]:

→ Machine learning method for signal classification with “Boosted Decision Trees” (BDT)

→ Trained on kinematic variables and PID response from Monte Carlo candidates

[1] Eur. Phys. J. Plus 131 (2016) no. 5, 168
[2] PoS ACAT 040 (2007), arXiv:physics/0703039

Page 8: Bayesian PID (“Early days” of Λc)

[1] Eur. Phys. J. Plus 131 (2016) no. 5, 168

● Standard PID based on a combination of nσ cuts from TPC and TOF
● Simple acceptance cuts on nσ were not enough to reduce the huge background → try a new approach
● Instead of simply taking the expected resolution, combine the signals into a single probability, based on the known detector response + particle abundances
● Easy to combine probabilities (assuming the detector responses are independent); allows selection of the “most likely” species rather than “accept everything within Nσ”
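The probability combination above can be sketched as follows, assuming Gaussian detector responses and illustrative priors and nσ values (a toy sketch, not the ALICE implementation):

```python
import math

def likelihood(nsigma):
    """Gaussian likelihood of the observed signal under a species hypothesis."""
    return math.exp(-0.5 * nsigma**2) / math.sqrt(2 * math.pi)

# Hypothetical measured nσ values for one track, per species hypothesis.
nsigma = {
    "pion":   {"TPC": 3.1, "TOF": 2.8},
    "kaon":   {"TPC": 0.4, "TOF": 0.9},
    "proton": {"TPC": 2.5, "TOF": 3.5},
}

# Priors from relative particle abundances (illustrative numbers).
prior = {"pion": 0.80, "kaon": 0.12, "proton": 0.08}

# Bayes: P(species | signals) ∝ prior × Π over detectors of the likelihood,
# assuming the TPC and TOF responses are independent.
post = {s: prior[s] * likelihood(v["TPC"]) * likelihood(v["TOF"])
        for s, v in nsigma.items()}
norm = sum(post.values())
post = {s: p / norm for s, p in post.items()}

best = max(post, key=post.get)
print(best, round(post[best], 3))
```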

Page 9: Shift toward multivariate techniques

Instead of tuning single cuts on individual variables, can exploit correlations in feature space to distinguish signal from background

→ multidimensional cut space

How to optimise this?

Page 10: Packages used

● Variable tree, invariant mass fitting, systematics: handled by AliPhysics: https://github.com/alisw/AliPhysics/tree/master/PWGHF/vertexingHF/
● Hipe4ML: Heavy-Ion Physics Environment for Machine Learning, https://github.com/hipe4ml/ -- wrapper / helper functions for scikit-learn
● XGBoost: eXtreme Gradient Boosting library, https://github.com/dmlc/xgboost
● GPU architecture well suited to problems in a complicated parameter space. We can optionally use NVidia CUDA: driver interface for machine learning with NVidia GPUs, https://developer.nvidia.com/cuda-zone
● Necessary packages installed as a Singularity container on Lustre: /lustre/alice/users/jwilkins/cuda/cuda_xgboost_env.sif
● Recipe for Singularity deployment on GitHub: https://github.com/jezwilkinson/singularity-cudaxgboost
● Sample scripts for tree filtering / ML training + application / job submission on Lustre: /lustre/alice/users/jwilkins/hipe4ml_intro

Page 11: Move towards machine learning

● Use of TMVA (Toolkit for Multivariate Analysis) started to become more common
● Boosted Decision Tree method using AdaBoost, implemented in ROOT
● The principle: simplify a multivariate problem down to a single, “probability-like” parameter to cut on
● Each candidate is assigned a weight based on the final node of each of a series of decision trees (summed with some weight)
→ “Boosting”: application of multiple separate trees
→ Controlled by certain “hyperparameters”:
- “max_depth”: the number of layers within each tree
- “n_estimators”: the total number of trees in the boosted forest
- learning rate (eta): the step size between iterations of the training

● Hyperparameters can be optimised with Bayesian iteration
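The boosting principle above can be illustrated with a from-scratch AdaBoost on decision stumps. This is a toy sketch on 1D data, not the TMVA/ROOT implementation; each tree here effectively has a depth of 1:

```python
import math
import random

random.seed(3)

# Toy 1D data: "signal" (+1) sits at higher x than "background" (-1).
data = [(random.gauss(+1.0, 1.0), +1) for _ in range(100)] + \
       [(random.gauss(-1.0, 1.0), -1) for _ in range(100)]

def stump(threshold, sign, x):
    """A one-node decision tree: the weak learner being boosted."""
    return sign if x > threshold else -sign

def train_adaboost(data, n_estimators=20):
    n = len(data)
    w = [1.0 / n] * n                      # per-candidate weights
    ensemble = []                          # (alpha, threshold, sign)
    for _ in range(n_estimators):
        # Pick the stump with the lowest weighted error.
        err, t, sign = min(
            (sum(wi for (x, y), wi in zip(data, w) if stump(th, sg, x) != y),
             th, sg)
            for th, _ in data for sg in (+1, -1))
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this tree
        ensemble.append((alpha, t, sign))
        # "Boosting": re-weight so misclassified candidates count more.
        w = [wi * math.exp(-alpha * y * stump(t, sign, x))
             for (x, y), wi in zip(data, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def score(ensemble, x):
    # The single "probability-like" output, summed over all trees.
    return sum(a * stump(t, s, x) for a, t, s in ensemble)

model = train_adaboost(data)
acc = sum((score(model, x) > 0) == (y > 0) for x, y in data) / len(data)
print(f"training accuracy = {acc:.2f}")
```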

Page 12: Hyperparameter optimisation

● Bayesian method: iterate multiple times, assigning each iteration a probability score to perform better than before, based on the ROC score (or another objective function)
● Perform additional fits in the high-probability region to quickly find the best fit to truth
● Benefit: faster than a full grid search of the hyperparameter space -- but still the slowest part of the training

https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f

Page 13: Pitfall: Overfitting

Page 14: Pitfall: Overfitting

Page 15: Pitfall: Overfitting

“if you overtrain the model to find dogs, then it will find dogs everywhere”

Page 16: Overfitting: What it means in data

● Example on the top right of a (good) invariant-mass fit for yield extraction
● Overfitting: performance of a model on random data that is better than expected
→ “massaging” of a statistical fluctuation into a peak
→ Can also mean the model does not generalise to data outside its training sample
● Signal extraction relies on a fit of the sidebands for the background, extrapolated into the signal region
→ This function should be as smooth as possible
→ We also train our model for background on the sidebands, with no background information in the mass-peak region
→ VERY important to avoid anything that may make the background shape irregular, or deviate from the shape within the signal region
→ Risk case: the model suppresses background only in the mass-peak region. This lowers the “visible” peak height in a way we can’t fit and properly correct for

Page 17: Overfitting: How to control?

● First step: check the correlation matrix of the input variables
● Shows correlation coefficients between every pair of variables
● Various information can be taken from here:
→ A square that is fully red = full correlation -- possibly two variables give the same (or very similar) info, e.g. decay length in XY + ct. Consider removing one from the training to keep the model simpler.
→ A variable pair strongly correlated for signal and strongly anticorrelated for background = likely a strong discriminator (NB: a pair correlated in both is not necessarily weak)
→ Crucial: avoid, or be very careful with, anything correlated with the invariant mass (the selection will likely affect the background shape)

Page 18: Overfitting: How to control?

● First step: check the correlation matrix of the input variables
● Shows correlation coefficients between every pair of variables
● Various information can be taken from here:
→ A square that is fully red = full correlation -- possibly two variables give the same (or very similar) info, e.g. decay length in XY + ct. Consider removing one from the training to keep the model simpler.
→ A variable pair strongly correlated for signal and strongly anticorrelated for background = likely a strong discriminator (NB: a pair correlated in both is not necessarily weak)
→ Crucial: avoid, or be very careful with, anything correlated with the invariant mass (the selection will likely affect the background shape)
● Sensible pre-processing of variables: avoid artifacts in the training data, e.g. -999 being passed for the TOF response
→ Often better to use “cooked” variables like a combined nσ, or “clean” variables like the TPC response
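A toy illustration of both points: one cell of the correlation matrix computed by hand, showing how a -999 sentinel can mask a real correlation until the variable is cleaned. Variable names and numbers are hypothetical:

```python
import math
import random

random.seed(4)

def pearson(xs, ys):
    """Pearson correlation coefficient: one cell of the correlation matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy candidates: a decay length correlated with a TOF-based variable,
# but 10% of tracks have no TOF hit and carry a -999 sentinel.
decay_len, tof_var = [], []
for _ in range(1000):
    d = abs(random.gauss(0.02, 0.01))
    decay_len.append(d)
    if random.random() < 0.1:
        tof_var.append(-999.0)      # the artifact warned about above
    else:
        tof_var.append(50 * (d - 0.02) + random.gauss(0, 0.3))

r_dirty = pearson(decay_len, tof_var)

# Same pair after removing the sentinel: the real correlation reappears.
pairs = [(d, t) for d, t in zip(decay_len, tof_var) if t > -100]
r_clean = pearson([d for d, _ in pairs], [t for _, t in pairs])

print(f"with -999 artifacts: r = {r_dirty:+.2f}")
print(f"after cleaning:      r = {r_clean:+.2f}")
```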

Page 19: Overfitting: How to control?

● Check the relative weight of variables in the model: SHAP value
● “Feature importance” plot from hipe4ml shows the mean impact of each variable
● “Permutation importance”: the model is tested with input values shuffled randomly. An unstable permutation importance is likely a sign of overfitting.
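Permutation importance can be sketched from scratch: shuffle one input column at a time and record the accuracy drop. A trivial threshold "model" stands in for the trained BDT here, and the data are toy:

```python
import random

random.seed(5)

# Toy candidates: feature 0 is discriminating, feature 1 is pure noise.
def make(n):
    out = []
    for _ in range(n):
        is_sig = random.random() < 0.5
        f0 = random.gauss(+1.0 if is_sig else -1.0, 1.0)
        f1 = random.gauss(0.0, 1.0)
        out.append(([f0, f1], is_sig))
    return out

data = make(2000)

# Stand-in "model": a fixed threshold on feature 0 (not a trained BDT).
def predict(features):
    return features[0] > 0.0

def accuracy(sample):
    return sum(predict(f) == y for f, y in sample) / len(sample)

base = accuracy(data)

# Permutation importance: shuffle one feature column at a time; a feature
# the model truly uses shows a large, stable accuracy drop.
importance = {}
for i in (0, 1):
    col = [f[i] for f, _ in data]
    random.shuffle(col)
    shuffled = [([c if j == i else f[j] for j in (0, 1)], y)
                for (f, y), c in zip(data, col)]
    importance[i] = base - accuracy(shuffled)

print(importance)
```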

[Figure: feature importance in the 2-4 GeV/c and 6-8 GeV/c pT bins]

Page 20: Controlling for efficiency and purity

Page 21: Controlling for efficiency and purity

Page 22: Controlling for efficiency and purity

Page 23: Controlling for efficiency and purity

A model trained to find cats and dogs:

- May identify some cats as dogs (dog purity < 1)
- May not identify all dogs as dogs (dog efficiency < 1)

Page 24: Controlling for efficiency and purity

● Receiver Operating Characteristic - Area Under Curve (ROC-AUC):
→ Plots the true positive rate (efficiency) against the false positive rate (fraction of background accepted) for various cut values
→ Maximal area under the curve = best performance of the model
● Diagonal: “random chance”. Your model should never be under this line!
● The top-left corner is the golden area; we need to balance purity against efficiency
-- A 100% pure model is likely low-efficiency: everything you find is a Λc, but with small statistics and large corrections
-- A 100% efficient model likely rejects little background: you get all Λc, but are swamped by background (as good as no cut!)

● Both efficiency and purity have impact on statistical uncertainties
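The ROC curve and its area can be computed from scratch for toy classifier scores (illustrative distributions, not analysis output):

```python
import random

random.seed(6)

# Toy BDT-like scores: signal skewed high, background skewed low.
scores = [(random.gauss(0.7, 0.15), 1) for _ in range(500)] + \
         [(random.gauss(0.3, 0.15), 0) for _ in range(5000)]

def roc_points(scores):
    """Sweep the cut over all score values; one (FPR, TPR) point per cut."""
    n_sig = sum(y for _, y in scores)
    n_bkg = len(scores) - n_sig
    pts = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(scores, reverse=True):   # loosen the cut step by step
        if y:
            tp += 1
        else:
            fp += 1
        pts.append((fp / n_bkg, tp / n_sig))
    return pts

pts = roc_points(scores)

# Area under the curve by the trapezoid rule.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(f"AUC = {auc:.3f}")   # 0.5 = random chance, 1.0 = perfect
```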

Page 25: Defining the “working point” + efficiency

● Efficiency: fraction of “known” candidates (from independent MC) that are correctly identified for a given cut value
● Working point: the cut value that will be used for the analysis
● Bad practice to base this on looking at the real invariant mass (other fields call this “p-hacking”) -- it can amplify statistical fluctuations
● Instead: define a “pseudosignificance” from the expected numbers of total signal (from simulation) and background (from data), multiplied by the respective efficiencies
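A sketch of such a working-point scan, maximising the pseudosignificance εs·S / √(εs·S + εb·B) over candidate cut values. All numbers below are illustrative:

```python
import math

# Expected raw counts before selection (hypothetical numbers):
S_EXP, B_EXP = 1000.0, 200000.0

# Efficiencies vs. BDT cut, e.g. signal from MC and background from
# sidebands in data (values illustrative only).
cuts    = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95]
eff_sig = [0.98, 0.92, 0.80, 0.60, 0.35, 0.20]
eff_bkg = [0.60, 0.25, 0.08, 0.02, 0.004, 0.001]

def pseudo_significance(es, eb):
    s, b = es * S_EXP, eb * B_EXP
    return s / math.sqrt(s + b)

best = max(zip(cuts, eff_sig, eff_bkg),
           key=lambda c: pseudo_significance(c[1], c[2]))
for c, es, eb in zip(cuts, eff_sig, eff_bkg):
    print(f"cut {c:4}: pseudosignificance = {pseudo_significance(es, eb):5.1f}")
print(f"working point: cut > {best[0]}")
```

Note that the tightest cut is not chosen: past the optimum, the loss of signal efficiency outweighs the extra background rejection, as warned on Page 4.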

Page 26: Defining the “working point” + efficiency

● Cross-section measurement: take the fitted raw yield, correct for efficiency and detector acceptance, normalise by BR and luminosity
● Ideally the efficiency would be perfectly defined and the cut value arbitrary
● The Monte Carlo description of the data is good, but never perfect
→ Variation of the threshold gives a systematic uncertainty
● Other sources of systematics: yield extraction (fit function, range, rebinning of the invariant-mass histogram), fraction of candidates from feed-down, ...

Page 27: Conclusion

● After all these corrections, you come to your result!

● ML should be seen as a very powerful tool, but not a panacea

○ Many effects to control for: Overfitting, data artifacts, balance of efficiency and purity

● Used right, we can reduce uncertainties and push measurements to regions that would be impossible otherwise

● Can also start to consider more complex cases: contribution of feed-down vs. prompt Λc

Page 28: [[Backup]]

Page 29: Hardware/running

● All jobs run on the Virgo cluster, instructions from https://hpc.gsi.de/virgo/#gpus

● NVidia GPU: Tesla V100 (32GB VRAM, 5120 CUDA cores, 640 tensor cores) - lxbk[0717-0718]

● CUDA performance for training on XGBoost is specific to server-grade dedicated cards

● Singularity containers for each environment defined based on example docker containers from NVidia [docker://nvidia/cuda:11.0-devel-rc]

● Additional required Python packages installed directly in the container

Page 30: Multi-class classification in XGBoost

● The XGBoost/Hipe4ml framework (outlined in previous meetings) has support for multi-class baked in
● Requirement: a label in the signal trees to flag “Λc from prompt” (‘4’) vs “Λc from beauty” (‘5’), and splitting of the sample, labelling each set in Python
● Caveat: automatic optimisation of the ML hyperparameters (depth of decision trees, number of estimators used in boosting, etc.) is not supported for multi-class. Must be defined by hand.
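A configuration sketch of what such a multi-class setup might look like; the label remapping and hyperparameter values are illustrative, and XGBoost's `multi:softprob` objective yields one probability per class:

```python
# Hypothetical remapping of the slide's labels (4 = prompt, 5 = beauty)
# to contiguous class indices for training.
LABELS = {"bkg": 0, "prompt": 1, "feeddown": 2}

xgb_params = {
    "objective": "multi:softprob",   # one probability per class, summing to 1
    "num_class": len(LABELS),
    # Hyperparameters set by hand, since automatic optimisation is not
    # supported for multi-class (values illustrative):
    "max_depth": 4,
    "n_estimators": 200,
    "learning_rate": 0.1,
}
print(xgb_params)
```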

Page 31: Multi-class classification

● Standard ML case: “binary classification” - “Is the candidate signal-like or background-like?”
○ Strong for selection of just the prompt sample
○ Requires manual feed-down subtraction (usually model-dependent) to measure the prompt contribution
● Possible to generalise to multiple labels and subdivide “signal-like” into “prompt-like” and “non-prompt-like”
○ Outputs a separate probability distribution for each label
○ “Multi-class” = assumption of one candidate → one label. “Multi-label” (many labels per candidate) also exists, not used here

Page 32: Training procedure: Variable distributions

● Data sample: 30-50% centrality Pb-Pb, LHC20k3b (signal) + sidebands from 40% of LHC18qr. Using new pass3 splines
● Similar basis to binary classification: compare variables in the input tree for all classes
● Clear indication that some variables distinguish well between signal/background but not prompt/non-prompt (e.g. mass of the K0S, PID response of the proton)
● Distinction between c and b: impact parameter of the proton, KF χ2 of the candidate refit

Page 33: Training procedure: Correlation matrices

● Correlations between the examined variables: prompt and non-prompt largely similar to each other. Slight differences in (anti-)correlation strength between the two
● Background shows a marked difference compared to both
● Expect signal/background separation to be stronger than prompt/feed-down separation from this sample

Page 34: Training procedure: Feature importance

● Separate importance rankings per candidate class, based on SHAP values (shows the pull on the model weight from high/low values of each variable)
● As usual, proton PID generally strong in all classes
● Impact parameter of the proton and pointing angle of the Λc appear to be strong selectors for non-prompt
● Properties of the reconstructed K0S cascade tend to be less important in distinguishing

[Figure: SHAP feature importances for the bkg, prompt and feed-down classes]

Page 35: Training procedure: Probability distributions

● Unlike binary classification, each candidate is assigned three probabilities (prompt/non-prompt/bkg), Σ = 1
● Non-prompt candidates converge at high feed-down probability, low bkg/prompt probability (as expected)
● Definition of the working point becomes a multi-dimensional problem (example from Ds, bottom left): cut on both the FD and BKG probabilities to maximise significance / reduce contamination

Page 36: ROC curves

● “One-vs-one” (OVO) strategy used for training: the model is split into separate binary classifications for each pair of classes (alternative: “one-vs-rest”, with separation of “feed” vs “signal+prompt”, etc.)
● As with the standard case, ROC curve performance shown for both the testing and training sets. Each class: test and train consistent with one another; the model appears stable
● Separation of feed-down from background is strong; feed-down from prompt is currently acceptable but could be better. Would require further optimisation of the model and variables
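The OVO and one-vs-rest decompositions named above amount to the following pairings (a structural sketch only, using the three classes from this analysis):

```python
from itertools import combinations

classes = ["bkg", "prompt", "feeddown"]

# One-vs-one: train one binary classifier per unordered pair of classes.
ovo_pairs = list(combinations(classes, 2))

# One-vs-rest alternative: one classifier per class against all the others.
ovr_tasks = [(c, [o for o in classes if o != c]) for c in classes]

print(ovo_pairs)
print(ovr_tasks)
```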

Page 37: Application output

● Model applied to a sample from the real-data tree (40% of the total statistics available)
● The vast majority of the data sits at high background probability; very few candidates are found at high feed-down probability
● Still to be examined with better optimisation of the model