TRANSCRIPT
Brief intro for using ML to study Λc
J. Wilkinson
Reconstruction of charmed hadrons in ALICE
● Strategy: full reconstruction of hadronic decays of charmed hadrons
→ Retains full kinematic information of the original particle
● Reconstruction relies on topological + particle identification (PID) selections to reduce combinatorial background
● Example: D0 meson: non-zero lifetime; decay vertex displaced from the interaction point (primary vertex)
→ Decay length, impact parameter, pointing angle (for example) can be used to select candidates
● PID at mid-rapidity using TOF (where available) + TPC, standard method with 'nσ' PID
→ Strong separation of pions, kaons and protons in a wide momentum range
● “Standard” approach: apply a series of 1D cuts to topological and PID variables
Λc baryon reconstruction in ALICE
● Λc in hadronic decay channels: Λc → pKπ, Λc → pK0S, and semileptonic (Λc → e+νeΛ)
● Λc → pKπ: BR 6.23%; three-body decay via multiple resonant + nonresonant channels
● Λc → pK0S: BR 1.58% (× 69.20% for K0S → π+π–). Reconstructed using displaced K0S vertex topology
● Typical selections include:
→ PID of decay daughters
→ Distance of closest approach & impact parameter of decay daughters
→ Pointing angle of reconstructed momentum w.r.t. flight line
→ Decay lengths of Λc and K0S
→ KF reconstruction
Λc baryon reconstruction in ALICE
● Yield extraction via the invariant mass method
● For Λc: assume a Gaussian function for the signal peak and an exponential for the background
● Fit the background in the sideband regions, extrapolate the function to the signal region, fit the signal peak, integrate and subtract the background fit to extract the signal
● Statistical significance defined by the counting uncertainty of signal vs. background: S = s/√(s+b)
→ Improving signal purity (signal/background ratio) will reduce uncertainties
→ But too-tight cuts may increase statistical uncertainties
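The significance figure of merit can be computed directly; a minimal sketch (the function name and the counts are illustrative, not from the analysis code):

```python
from math import sqrt

def significance(s, b):
    """Statistical significance S = s / sqrt(s + b) for s signal
    and b background counts in the peak region."""
    return s / sqrt(s + b)

# A tighter cut trades raw signal for purity; here the tighter
# selection wins despite keeping fewer signal counts
loose = significance(100, 900)   # large background under the peak
tight = significance(60, 100)    # most background rejected
```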
“Standard” cut-based selection
- Compare distributions of signal and background in each variable of interest
- Apply some set of rectangular cuts, usually per pT bin
- Aim: high signal efficiency, high background rejection
- Fine for “simple” problems, but doesn’t fully use the available information
Shift toward multivariate techniques
● Historically, Λc was a very challenging measurement: a rare probe with a high level of combinatorial background
→ Required development of novel identification techniques in ALICE
● Bayesian Particle Identification [1]:
→ Probabilistic approach to combine signals from TPC and TOF in regions where species overlap; the “most likely” species is chosen, as opposed to an inclusive “nσ” cut
→ Prior probabilities for each species defined based on particle abundances in data
→ Increases purity of the selected sample
● Toolkit for Multivariate Analysis [2]:
→ Machine learning method for signal classification with “Boosted Decision Trees” (BDT)
→ Trained on kinematic variables and PID response from Monte Carlo candidates
[1] Eur. Phys. J. Plus 131 (2016) no.5, 168
[2] PoS ACAT 040 (2007), arXiv:physics/0703039
Bayesian PID (“early days” of Λc) [1]
● Standard PID based on a combination of nσ cuts from TPC and TOF
● Simple acceptance cuts on nσ were not enough to reduce the huge background → try a new approach
● Instead of simply taking the expected resolution, combine the signals into a single probability, based on the known detector response + particle abundances
● Easy to combine probabilities (assuming the detector responses are independent); allows selection of the “most likely” species and not just “accept everything within Nσ”
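The combination can be sketched in a few lines (all numbers, species, and function names below are illustrative; the real priors come from measured particle abundances and the responses from detector calibration): per-detector Gaussian likelihoods are multiplied, weighted by the priors, and the posterior picks the most likely species.

```python
from math import exp, sqrt, pi

def gaus(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def bayes_pid(signals, expected, priors):
    """signals:  {detector: measured value}
    expected: {species: {detector: (mean, resolution)}}
    priors:   {species: abundance-based prior}
    Returns the posterior probability per species."""
    like = {}
    for sp, resp in expected.items():
        l = priors[sp]
        for det, x in signals.items():
            mu, sigma = resp[det]
            l *= gaus(x, mu, sigma)   # detectors assumed independent
        like[sp] = l
    norm = sum(like.values())
    return {sp: l / norm for sp, l in like.items()}

# Toy track whose signals sit between the pion and kaon hypotheses
post = bayes_pid(
    signals={"TPC": 52.0, "TOF": 0.4},
    expected={"pion": {"TPC": (50.0, 3.0), "TOF": (0.0, 0.3)},
              "kaon": {"TPC": (60.0, 3.0), "TOF": (1.0, 0.3)}},
    priors={"pion": 0.8, "kaon": 0.2},
)
best = max(post, key=post.get)   # "most likely" species
```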
Shift toward multivariate techniques
Instead of tuning single cuts on individual variables, can exploit correlations in feature space to distinguish signal from background
→ multidimensional cut space
How to optimise this?
Packages used
Variable tree, invariant mass fitting, systematics: handled by AliPhysics: https://github.com/alisw/AliPhysics/tree/master/PWGHF/vertexingHF/
Hipe4ML: Heavy-Ion Physics Environment for Machine Learning, https://github.com/hipe4ml/: wrapper / helper functions for scikit-learn
XGBoost: eXtreme Gradient Boosting library, https://github.com/dmlc/xgboost
GPU architecture well-suited to problems in complicated parameter space. We can optionally use NVidia CUDA: Driver interface for machine learning with NVidia GPU, https://developer.nvidia.com/cuda-zone
Necessary packages installed as Singularity container on Lustre: /lustre/alice/users/jwilkins/cuda/cuda_xgboost_env.sif
Recipe for Singularity deployment on Github: https://github.com/jezwilkinson/singularity-cudaxgboost
Sample scripts for tree filtering / ML training + application / job submission on Lustre: /lustre/alice/users/jwilkins/hipe4ml_intro
Move towards machine learning
● Use of TMVA (Toolkit for Multivariate Analysis) started to become more common
● Boosted Decision Tree method using AdaBoost, implemented in ROOT
● The principle: simplify a multivariate problem down to a single, “probability-like” parameter to cut on
● Each candidate is assigned a weight based on the final node of each of a series of decision trees (summed with some weight)
● “Boosting”: application of multiple separate trees, controlled by certain “hyperparameters”:
- “maximum depth”: the number of layers within each tree
- “n_estimators”: the total number of trees in the boosted forest
- learning rate (eta): the step size between iterations of the training
● Hyperparameters can be optimised with Bayesian iteration
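These three hyperparameters map directly onto the constructor arguments of the usual BDT libraries. A minimal sketch on toy data, using scikit-learn's `GradientBoostingClassifier` as a dependency-light stand-in (XGBoost's `XGBClassifier` accepts the same `max_depth` / `n_estimators` / `learning_rate` names); the data and values are illustrative, not the analysis configuration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
# Toy "signal" and "background" samples in two features
sig = rng.normal(loc=[1.0, 1.0], scale=0.8, size=(500, 2))
bkg = rng.normal(loc=[-1.0, -1.0], scale=0.8, size=(500, 2))
X = np.vstack([sig, bkg])
y = np.array([1] * 500 + [0] * 500)

model = GradientBoostingClassifier(
    max_depth=3,        # layers per tree
    n_estimators=100,   # trees in the boosted forest
    learning_rate=0.1,  # step size between boosting iterations
)
model.fit(X, y)

# Each candidate is reduced to a single probability-like score to cut on
scores = model.predict_proba(X)[:, 1]
```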
Hyperparameter optimisation
● Bayesian method: iterate multiple times, assigning each iteration a probability score to perform better than before, based on the ROC score (or another objective function)
● Perform additional fits in the high-probability region to quickly find the best fit to truth
● Benefit: faster than a full grid search of the hyperparameter space (still the slowest part of the training, however)
https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f
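The loop structure can be sketched as below (toy data; the candidate list is illustrative). A true Bayesian optimiser replaces the fixed candidate list with proposals from a surrogate model fitted to previous scores, but the score-and-keep-the-best skeleton is the same:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=800) > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate hyperparameter sets; a Bayesian optimiser would propose
# these from a surrogate model instead of a fixed list
candidates = [
    {"max_depth": 2, "n_estimators": 50, "learning_rate": 0.3},
    {"max_depth": 3, "n_estimators": 100, "learning_rate": 0.1},
    {"max_depth": 5, "n_estimators": 200, "learning_rate": 0.05},
]
results = []
for params in candidates:
    model = GradientBoostingClassifier(**params).fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    results.append((auc, params))

best_auc, best_params = max(results, key=lambda r: r[0])
```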
Pitfall: Overfitting
“if you overtrain the model to find dogs, then it will find dogs everywhere”
Overfitting: What it means in data
● Example (top right) of a good invariant-mass fit for yield extraction
● Overfitting: performance of a model on random data that is better than expected
→ “massaging” of a statistical fluctuation into a peak
→ Can also mean the model does not generalise to data outside its training sample
● Signal extraction relies on a fit of the sidebands for the background, extrapolated into the signal region
→ This function should be as smooth as possible
→ We also train our model for background on the sidebands; there is no background information in the mass-peak region
→ VERY important to avoid anything that may make the background shape irregular, or deviate from the shape within the signal region
→ Risk case: the model suppresses background only in the mass-peak region. This lowers the “visible” peak height in a way we can’t fit and properly correct for
Overfitting: How to control?
● First step: check the correlation matrix of input variables
● Shows correlation coefficients between every pair of variables
● Various information can be taken from here:
→ A square that is fully red = full correlation: possible that two variables give the same (or very similar) info, e.g. decay length in XY + ct. Can consider removing one from the training to keep the model simpler.
→ A variable pair strongly correlated for signal and strongly anticorrelated for background = likely to be a strong discriminator (NB: a pair correlated in both is not necessarily weak)
→ Crucial: avoid, or be very careful with, anything correlated with the invariant mass (such a selection will likely affect the background shape)
● Sensible pre-processing of variables: avoid artifacts in the training data, e.g. -999 being passed for the TOF response
→ Often better to use “cooked” variables like a combined nσ, or “clean” variables like the TPC response
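Once the candidate variables are in a DataFrame, the matrix check is a one-liner; a minimal sketch with pandas (column names and the toy construction are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
decay_length_xy = rng.exponential(scale=0.05, size=n)
df = pd.DataFrame({
    "decay_length_xy": decay_length_xy,
    # ct is essentially the same quantity up to a kinematic factor:
    # expect a "fully red" square, so one of the two could be dropped
    "ct": decay_length_xy * rng.normal(1.0, 0.05, size=n),
    "cos_pointing_angle": rng.uniform(0.9, 1.0, size=n),
})
corr = df.corr()
# A near-fully-correlated pair is a candidate for removal from training
redundant = corr.loc["decay_length_xy", "ct"] > 0.9
```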
Overfitting: How to control?
● Check the relative weight of variables in the model: SHAP value
● “Feature importance” plot from hipe4ml shows the mean impact of each variable
● “Permutation importance”: the model is tested with input values shifted randomly. A model whose permutation importance is unstable is likely to be overfit.
[Feature-importance plots shown for 2-4 GeV/c and 6-8 GeV/c]
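hipe4ml wraps the SHAP library for the feature-importance plot; permutation importance itself can be sketched directly with scikit-learn (toy data and feature names, purely illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
n = 1000
discriminating = rng.normal(size=n)   # carries the signal/background info
noise = rng.normal(size=n)            # carries none
X = np.column_stack([discriminating, noise])
y = (discriminating + rng.normal(scale=0.5, size=n) > 0).astype(int)

model = GradientBoostingClassifier(max_depth=3, n_estimators=50).fit(X, y)

# Shuffle each feature in turn and measure the drop in score:
# an informative feature causes a large, stable drop; an overfit
# model shows large, unstable importances for uninformative ones
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
```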
Controlling for efficiency and purity
A model trained to find cats and dogs:
- May identify some cats as dogs (dog purity < 1)
- May not identify all dogs as dogs (dog efficiency < 1)
Controlling for efficiency and purity
● Receiver Operating Characteristic - Area Under Curve (ROC-AUC):
→ Plots true positive rate (efficiency) against false positive rate (1 - purity) for various cut values
→ Maximal area under the curve = best performance of the model
● Diagonal: “random chance”. Your model should never be under this line!
● The top-left corner is the golden area; we need to balance between purity and efficiency:
→ A 100% pure model is likely low-efficiency: everything you find is a Λc, but with small statistics and large corrections
→ A 100% efficient model is likely not rejecting much background: you get all Λc, but are swamped by background (as good as no cut!)
● Both efficiency and purity have an impact on the statistical uncertainties
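The curve and its area come straight from scikit-learn once the model scores exist; a sketch on a toy score distribution (the beta-distributed scores stand in for BDT output and are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(3)
# Toy BDT output: signal scores pile up near 1, background near 0
scores = np.concatenate([rng.beta(5, 2, 500), rng.beta(2, 5, 500)])
labels = np.array([1] * 500 + [0] * 500)

# One (fpr, tpr) point per threshold; auc integrates the curve
fpr, tpr, thresholds = roc_curve(labels, scores)
auc = roc_auc_score(labels, scores)
# A useful model sits above the diagonal (auc > 0.5);
# auc = 1 would be the (unreachable) top-left corner
```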
Defining the “working point” + efficiency
● Efficiency: % of “known” candidates (from independent MC) that are correctly identified for a given cut value
● Working point: the cut value that should be used for the analysis
● Bad practice to base this on looking at the real invariant mass (other fields call this “p-hacking”): it can amplify statistical fluctuations
● Instead: define a “pseudosignificance” from an expected number of total signal (from simulation) and background (from data), multiplied by the respective efficiencies
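A minimal working-point scan might look like the following (expected counts and efficiency curves are entirely illustrative): for each threshold, form the expected significance εs·S / √(εs·S + εb·B) and keep the maximum.

```python
from math import sqrt

# Illustrative expected counts before any BDT cut
S_EXP = 500.0      # expected signal (from simulation, scaled to data)
B_EXP = 200000.0   # expected background (from data sidebands)

def pseudosignificance(eff_sig, eff_bkg):
    s = eff_sig * S_EXP
    b = eff_bkg * B_EXP
    return s / sqrt(s + b)

# Toy efficiency curves vs. BDT threshold: tightening the cut
# loses some signal but rejects background much faster
thresholds = [0.1 * i for i in range(1, 10)]
eff_sig = [1.0 - 0.5 * t for t in thresholds]
eff_bkg = [(1.0 - t) ** 6 for t in thresholds]

scan = [(pseudosignificance(es, eb), t)
        for es, eb, t in zip(eff_sig, eff_bkg, thresholds)]
best_sig, working_point = max(scan)
```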
Defining the “working point” + efficiency
● Cross-section measurement: take the fitted raw yield, correct for efficiency and detector acceptance, normalise by BR and luminosity
● Ideally: the efficiency would be perfectly defined and the cut value would be arbitrary
● The Monte Carlo description of the data is good, but never perfect
→ Variation of the threshold gives a systematic uncertainty
● Other sources of systematic uncertainty: yield extraction (fit function, range, rebinning of the invariant-mass histogram), fraction of candidates from feed-down, ..
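Schematically, the correction chain is a chain of divisions (prefactors such as pT bin width and rapidity coverage omitted; every number below is illustrative, not a measured value):

```python
# Illustrative cross-section correction chain
raw_yield = 1200.0        # fitted raw yield from the invariant-mass fit
eff_acc = 0.08            # efficiency x acceptance (from MC, cut-dependent)
branching_ratio = 0.0623  # Lc -> pKpi (value quoted earlier in the talk)
luminosity = 2.0e9        # integrated luminosity, arbitrary units

cross_section = raw_yield / (eff_acc * branching_ratio * luminosity)
```

The efficiency in the denominator is exactly why the working-point choice matters: a mismodelled efficiency biases the result, and varying the threshold probes that bias.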
Conclusion
● After all these corrections, you come to your result!
● ML should be seen as a very powerful tool, but not a panacea
○ Many effects to control for: Overfitting, data artifacts, balance of efficiency and purity
● Used right, we can reduce uncertainties and push measurements to regions that would be impossible otherwise
● Can also start to consider more complex cases: contribution of feed-down vs. prompt Λc
Backup
Hardware/running
● All jobs run on the Virgo cluster, instructions from https://hpc.gsi.de/virgo/#gpus
● NVidia GPU: Tesla V100 (32GB VRAM, 5120 CUDA cores, 640 tensor cores) - lxbk[0717-0718]
● CUDA performance for training on XGBoost is specific to server-grade dedicated cards
● Singularity containers for each environment defined based on example docker containers from NVidia [docker://nvidia/cuda:11.0-devel-rc]
● Additional needed python packages installed in container directly
Multi-class classification in XGBoost
● XGBoost/Hipe4ml framework (outlined in previous meetings) has support for multi-class baked in
● Requirement: a label in the signal trees to flag “Λc from prompt” (‘4’) vs. “Λc from beauty” (‘5’); split the sample and label each set in Python
● Caveat: automatic optimisation of ML hyperparameters (depth of decision trees, number of estimators used in boosting, etc.) is not supported for multi-class. Must be defined by hand.
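The labelling and training step can be sketched as follows (toy three-class data with the 0/4/5 labels from the slide; scikit-learn's `GradientBoostingClassifier`, which handles multi-class natively, stands in for the XGBoost classifier to stay dependency-light):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)
# Toy samples: background (0), prompt Lc (4), Lc from beauty (5)
bkg = rng.normal([-1.0, -1.0], 0.7, size=(300, 2))
prompt = rng.normal([1.0, 0.0], 0.7, size=(300, 2))
feed = rng.normal([1.0, 2.0], 0.7, size=(300, 2))
X = np.vstack([bkg, prompt, feed])
y = np.array([0] * 300 + [4] * 300 + [5] * 300)

model = GradientBoostingClassifier(max_depth=3, n_estimators=50).fit(X, y)
# One probability per class for each candidate, each row summing to 1
proba = model.predict_proba(X)
```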
Multi-class classification
● Standard ML case: “binary classification” - “Is the candidate signal-like or background-like?”
○ Strong for selection of just the prompt sample
○ Requires manual feed-down subtraction (usually model-dependent) to measure the prompt contribution
● Possible to generalise to multiple labels and subdivide “signal-like” into “prompt-like” and “non-prompt-like”
○ Outputs a separate probability distribution for each label
○ “Multi-class” = assumption of one candidate -> one label. “Multi-label” (many labels per candidate) also exists; not used here
Training procedure: Variable distributions
● Data sample: 30-50% centrality Pb-Pb, LHC20k3b (signal) + sidebands from 40% of LHC18qr. Using new pass3 splines
● Similar basis to binary classification: compare variables in the input tree for all classes
● Clear indication that some variables distinguish well between S/B but not prompt/nonprompt (e.g. mass of K0S, PID response of the proton)
● Distinction between c and b: impact parameter of the proton, KF χ2 of the candidate refit
Training procedure: Correlation matrices
● Correlations between examined variables: prompt and non-prompt largely similar to each other; slight differences in (anti-)correlation strength between the two
● Background shows a marked difference compared to both
● Expect signal/background separation to be stronger than prompt/feed-down separation from this sample
Training procedure: Feature importance
● Separate importance rankings per candidate class (bkg / prompt / feed), based on SHAP values (shows the pull on the model weight from high/low values of each variable)
● As usual, proton PID is generally strong in all classes
● Impact parameter of the proton and pointing angle of the Λc appear to be strong selectors for non-prompt
● Properties of the reconstructed K0S cascade tend to be less important in distinguishing
Training procedure: Probability distributions
● Unlike binary classification, each candidate is assigned three probabilities (prompt/nonprompt/bkg), Σ = 1
● Non-prompt candidates converge at high feed-down probability, low bkg/prompt probability (as expected)
● Definition of the working point becomes a multi-dimensional problem (example from Ds, bottom left): cut on both FD and BKG probabilities to maximise significance / reduce contamination
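The two-threshold working point can be scanned the same way as the one-dimensional case; a toy sketch (the candidate list, truth flags, and grid are illustrative): for each (max background probability, min feed-down probability) pair, count the surviving expected signal and background and maximise the significance.

```python
from math import sqrt

# Toy candidates: (prob_bkg, prob_feeddown, is_true_feeddown).
# In the real analysis these come from applying the trained model.
candidates = [
    (0.05, 0.90, True), (0.10, 0.85, True), (0.08, 0.70, True),
    (0.95, 0.02, False), (0.90, 0.05, False), (0.80, 0.10, False),
    (0.40, 0.30, False), (0.20, 0.60, True), (0.85, 0.08, False),
]

def scan_2d(candidates, cuts):
    """Grid-scan both probability cuts, maximising s/sqrt(s+b)."""
    best = (0.0, (None, None))
    for bkg_max in cuts:
        for fd_min in cuts:
            sel = [c for c in candidates if c[0] < bkg_max and c[1] > fd_min]
            if not sel:
                continue
            s = sum(1 for c in sel if c[2])
            b = len(sel) - s
            sig = s / sqrt(s + b)
            if sig > best[0]:
                best = (sig, (bkg_max, fd_min))
    return best

best_sig, (bkg_cut, fd_cut) = scan_2d(candidates, [0.1 * i for i in range(1, 10)])
```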
ROC curves
● “One-Vs-One” (OVO) strategy used for training: the model is split into separate binary classifications for each pair of classes (alternative: “one-vs-rest”, with separation of “feed” vs. “signal+prompt”, etc.)
● As with the standard case, ROC curve performance is shown for both the testing set and the training set. For each class, test and train are consistent with one another; the model appears stable
● Separation of feed-down from background is strong; feed-down from prompt is currently acceptable but could be better. Would require further optimisation of the model and variables
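The OVO ROC comparison between training and testing sets can be sketched with scikit-learn's multi-class AUC support (toy three-class data, same stand-in classifier as above; all values illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 0.8, size=(200, 2))
               for m in ([-1.0, -1.0], [1.0, 0.0], [1.0, 2.0])])
y = np.repeat([0, 1, 2], 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

model = GradientBoostingClassifier(max_depth=3, n_estimators=50).fit(X_tr, y_tr)

# OVO AUC: averaged over the three pairwise binary problems,
# evaluated on train and test to check the model is stable
auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr), multi_class="ovo")
auc_test = roc_auc_score(y_te, model.predict_proba(X_te), multi_class="ovo")
```

A test AUC far below the training AUC would be the overfitting warning sign discussed earlier.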
Application output
● Model applied to a sample from the real-data tree (40% of the total statistics available)
● The vast majority of the data sits at high background probability; very few candidates are found at high feed-down probability
● Still to be examined with better optimisation of the model