TRANSCRIPT
TMVA
Andreas Höcker (CERN)
CERN meeting, Oct 6, 2006
Toolkit for Multivariate Data Analysis
MVA Experience
Many (more and more) HEP data analyses use multivariate techniques
Often analysts use custom tools, without much comparison
MVAs are tedious to implement & debug, therefore there are few comparisons between methods!
most accepted: cuts
also widely understood and accepted: likelihood (probability density estimators – PDE )
often disliked, but more and more used (LEP, BABAR): Artificial Neural Networks
much used in BABAR and Belle: Fisher discriminants
introduced by D0 (for electron id): H-Matrix
used by MiniBooNE and recently by BABAR: Boosted Decision Trees
All interesting methods … but how to dispel the widespread skepticism?
black boxes!
what if the training samples incorrectly describe the data?
how can one evaluate systematics?
What is TMVA?
The Toolkit for Multivariate Analysis (TMVA) provides a ROOT-integrated environment for the parallel processing and evaluation of MVA techniques to discriminate signal from background samples. Note that TMVA does not seek to promote a particular method!
The TMVA analysis provides training, testing and evaluation of the MVAs
Each MVA method provides a ranking of the input variables
MVAs produce weight files that are read by a Reader class for the actual MVA analysis
TMVA presently includes (ranked by complexity):
rectangular cut optimisation
likelihood estimator (PDE approach)
multi-dimensional likelihood estimator (PDE range-search approach)
Fisher discriminant
H-Matrix approach (χ² estimator)
Multilayer Perceptron Artificial Neural Network (three different implementations)
Boosted Decision Trees
RuleFit
upcoming: Support Vector Machines, "Committee" methods
Common preprocessing of input data: decorrelation, principal-component analysis
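As an illustration of this workflow, a minimal sketch of booking and running several of the listed methods through the Factory; the file, tree and variable names and the (empty) option strings are illustrative, and the exact interface differs between TMVA versions:

```cpp
// Sketch: training/testing/evaluating several MVA methods via the Factory.
// File, tree and variable names are illustrative.
#include "TFile.h"
#include "TTree.h"
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void TMVAnalysisSketch()
{
   TFile* input   = TFile::Open("toy_samples.root");   // hypothetical input file
   TTree* sigTree = (TTree*)input->Get("TreeS");
   TTree* bkgTree = (TTree*)input->Get("TreeB");

   TFile* outFile = TFile::Open("TMVA.root", "RECREATE");
   TMVA::Factory factory("TMVAnalysis", outFile, "");

   factory.SetInputTrees(sigTree, bkgTree);   // signal and background samples
   factory.AddVariable("var1", 'F');          // discriminating input variables
   factory.AddVariable("var2", 'F');
   factory.AddVariable("var3", 'F');
   factory.AddVariable("var4", 'F');
   // (depending on the version, a PrepareTrainingAndTestTree call configures the split)

   // book a few of the methods listed above (option strings abbreviated)
   factory.BookMethod(TMVA::Types::kLikelihood, "Likelihood", "");
   factory.BookMethod(TMVA::Types::kFisher,     "Fisher",     "");
   factory.BookMethod(TMVA::Types::kMLP,        "MLP",        "");
   factory.BookMethod(TMVA::Types::kBDT,        "BDT",        "");

   factory.TrainAllMethods();      // training
   factory.TestAllMethods();       // testing
   factory.EvaluateAllMethods();   // evaluation and method comparison
   outFile->Close();
}
```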
TMVA Technicalities
TMVA is a SourceForge (SF) package to accommodate world-wide access; the code can be downloaded as a tar file, or via anonymous or developer CVS access
home page: http://tmva.sourceforge.net/
SF project page: http://sourceforge.net/projects/tmva
view CVS: http://tmva.cvs.sourceforge.net/tmva/TMVA/
mailing lists: http://sourceforge.net/mail/?group_id=152074
ROOT::TMVA class index: http://root.cern.ch/root/htmldoc/TMVA_Index.html
TMVA is written in C++ and is built upon ROOT
Since ROOT 5.11/03, TMVA is integrated in and distributed with ROOT
TMVA is modular
a training, testing and evaluation Factory iterates over all registered methods
each method has specific options that can be conveniently set by the user
ROOT scripts are provided for most important performance assessments
We enthusiastically (!) welcome new users and developers. At present there are 18 SF-registered TMVA developers! (http://sourceforge.net/project/memberlist.php?group_id=152074)
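The weight files mentioned earlier are read back by the Reader class in the user's own event loop. A minimal sketch, with the variable names and weight-file path assumed for illustration:

```cpp
// Sketch: applying a trained MVA with the TMVA Reader.
// The weight-file path and variable names are illustrative.
#include "TMVA/Reader.h"

void ApplyMVASketch()
{
   TMVA::Reader reader;
   Float_t var1, var2, var3, var4;
   reader.AddVariable("var1", &var1);   // must match the training variables
   reader.AddVariable("var2", &var2);
   reader.AddVariable("var3", &var3);
   reader.AddVariable("var4", &var4);

   reader.BookMVA("Fisher", "weights/TMVAnalysis_Fisher.weights.txt");

   // in the event loop: fill var1..var4 for each event, then evaluate
   Double_t mvaValue = reader.EvaluateMVA("Fisher");
   (void)mvaValue;   // use the MVA output in the analysis
}
```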
TMVA Methods
Cut Optimisation
Simplest method: cut in rectangular volume using Nvar input variables
\[
x_{\mathrm{cut}}(i_{\mathrm{event}}) \in \{0,1\}: \quad x_{v,\min} \le x_v(i_{\mathrm{event}}) \le x_{v,\max} \quad \text{for all } v \in \{\mathrm{variables}\}
\]
Usually the training files in TMVA do not contain realistic signal and background abundances, so one cannot optimise for best significance
instead: scan in signal efficiency [0,1] and maximise background rejection
Technical problem: how to perform maximisation
Minuit fit (SIMPLEX) found to be not reliable enough
robust, but suboptimal: random sampling
newly available techniques: Genetic Algorithm and Simulated Annealing (SA not yet optimal)
Huge speed improvement by sorting training events in Nvar-dim. Binary Trees
for 4 variables: 41 times faster than simple volume cut
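A hedged sketch of how such a cut optimisation could be booked, as a fragment that slots into the Factory job shown earlier; the option names follow TMVA's cut-method conventions but may differ between versions:

```cpp
// Sketch (fragment for the Factory job above): book rectangular-cut
// optimisation with a Genetic Algorithm fitter, scanning the signal
// efficiency as described above; option string is illustrative.
factory.BookMethod(TMVA::Types::kCuts, "CutsGA",
                   "FitMethod=GA:EffMethod=EffSel");
```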
Projected Likelihood Estimator (PDE Approach)
Combine the probability density distributions of the input variables into a likelihood estimator
\[
x_{\mathcal{L}}(i_{\mathrm{event}}) = \frac{\prod_{v\,\in\,\{\mathrm{variables}\}} p_v^{\mathrm{signal}}\big(x_v(i_{\mathrm{event}})\big)}{\sum_{\mathrm{species}\,s}\ \prod_{v\,\in\,\{\mathrm{variables}\}} p_v^{s}\big(x_v(i_{\mathrm{event}})\big)}
\]
where \(x_{\mathcal{L}}\) is the likelihood ratio for event \(i_{\mathrm{event}}\), the products run over the discriminating variables \(v\), the \(p_v\) are the reference PDFs, and the sum runs over the species \(s\) (signal and background types)
Assumes uncorrelated input variables
if true: optimal MVA approach, since it contains all the information
if not true: performance reduction; the reason for the development of other methods!
Technical problem: how to implement the reference PDFs. Three ways:
function fitting: difficult to automate
nonparametric fitting (splines, kernel estimators): easy to automate, but can create artefacts; currently in TMVA: Splines 0-5, Gaussian KDEs
counting: automatic and unbiased, but suboptimal
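A minimal sketch of evaluating the likelihood ratio above for the two-species case, with the reference PDFs represented as binned ROOT histograms (the binning and normalisation are illustrative):

```cpp
// Sketch: projected likelihood ratio x_L = prodS / (prodS + prodB), with
// per-variable reference PDFs stored as (normalised) ROOT histograms.
#include "TH1.h"
#include <vector>

double LikelihoodRatio(const std::vector<TH1*>& pdfS,   // signal PDFs, one per variable
                       const std::vector<TH1*>& pdfB,   // background PDFs
                       const std::vector<double>& x)    // the event's variable values
{
   double prodS = 1.0, prodB = 1.0;
   for (size_t v = 0; v < x.size(); ++v) {
      prodS *= pdfS[v]->GetBinContent(pdfS[v]->FindBin(x[v]));
      prodB *= pdfB[v]->GetBinContent(pdfB[v]->FindBin(x[v]));
   }
   const double sum = prodS + prodB;
   return (sum > 0) ? prodS / sum : 0.5;   // likelihood ratio in [0,1]
}
```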
Preprocessing the Input Variables: Decorrelation
Linear correlations can be removed by diagonalising the input variable space
Various ways to choose the diagonalisation (also implemented: principal component analysis)
Determine the square root C' of the correlation matrix C, i.e., C = C'C'; compute C' by diagonalising C:
\[
D = S^{T} C\, S \;\Rightarrow\; C' = S\,\sqrt{D}\,S^{T}
\]
Transformation from the original (x) into the decorrelated variable space (x') by: \(x' = C'^{-1} x\)
Note that this "decorrelation" is only complete if the input variables are Gaussian and the correlations are linear only
In practice: the gain from decorrelation is often rather modest, or even harmful!
Commonly realised for all methods in TMVA (centrally in DataSet class)
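A minimal sketch of this transformation with ROOT's matrix classes, not TMVA's internal code; it assumes the symmetric (correlation or covariance) matrix has already been estimated from the training sample:

```cpp
// Sketch: decorrelation via the square root of the symmetric matrix C,
// C' = S sqrt(D) S^T, then x' = C'^{-1} x, as in the formulas above.
#include "TMatrixD.h"
#include "TMatrixDSym.h"
#include "TMatrixDSymEigen.h"
#include "TVectorD.h"
#include "TMath.h"

TVectorD Decorrelate(const TMatrixDSym& C, const TVectorD& x)
{
   const Int_t n = C.GetNrows();
   TMatrixDSymEigen eigen(C);               // diagonalisation: D = S^T C S
   TMatrixD S = eigen.GetEigenVectors();    // columns: eigenvectors of C
   TVectorD d = eigen.GetEigenValues();     // diagonal of D

   TMatrixD sqrtD(n, n);                    // sqrt(D) as a diagonal matrix
   for (Int_t i = 0; i < n; ++i) sqrtD(i, i) = TMath::Sqrt(d(i));

   TMatrixD Cprime(S, TMatrixD::kMult,
                   TMatrixD(sqrtD, TMatrixD::kMultTranspose, S)); // C' = S sqrt(D) S^T
   Cprime.Invert();                         // C'^{-1}
   return Cprime * x;                       // decorrelated variables x'
}
```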
Multidimensional Likelihood Estimator
Carli-Koblitz, NIM A501, 576 (2003)
Generalisation of 1D PDE approach to Nvar dimensions
Optimal method – in theory – since full information is used
Practical challenges: parameterisation of multi-dimensional phase space needs huge training samples
implementation of Nvar-dim. reference PDF with kernel estimates or counting
for kernel estimates: difficult to control fidelity of parameterisation
TMVA implementation follows the Range-Search method: count the number of signal and background events in the "vicinity" of a data event
“vicinity” defined by fixed or adaptive Nvar-dim. volume size
volumes can be rectangular or elliptical
use multi-D kernels (Gaussians, triangular, …) to weight events within volume
speed up range search by sorting training events in Binary Trees
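For illustration, a brute-force sketch of the range-search idea, counting signal and background training events in a fixed rectangular box around the test point; TMVA instead sorts the events into binary trees for speed, and the box size here is illustrative:

```cpp
// Sketch: PDE range search by brute-force counting in a fixed rectangular
// volume (TMVA uses binary trees and also supports adaptive/elliptical
// volumes and kernel weighting).
#include <cmath>
#include <vector>

struct Event { std::vector<double> x; };   // one training event

double PdeRangeSearch(const std::vector<Event>& sig,
                      const std::vector<Event>& bkg,
                      const std::vector<double>& point, double halfWidth)
{
   auto countInBox = [&](const std::vector<Event>& sample) {
      int n = 0;
      for (const Event& e : sample) {
         bool inside = true;
         for (size_t v = 0; v < point.size(); ++v)
            if (std::abs(e.x[v] - point[v]) > halfWidth) { inside = false; break; }
         if (inside) ++n;
      }
      return n;
   };
   const int nS = countInBox(sig), nB = countInBox(bkg);
   return (nS + nB > 0) ? double(nS) / (nS + nB) : 0.5;   // MVA output in [0,1]
}
```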
Fisher Discriminant (and H-Matrix)
Well-known, simple and elegant MVA method: event selection is performed in a transformed variable space with zero linear correlations, by distinguishing the mean values of the signal and background distributions
Instead of equations, words:
optimal for linearly correlated Gaussians with equal RMS values and different means
no separation if equal means and different RMS (shapes)
An axis is determined in the (correlated) hyperspace of the input variables such that, when projecting the output classes (signal and background) upon this axis, they are pushed as far as possible away from each other, while events of the same class are confined to a close vicinity. The linearity property of this method is reflected in the metric with which "far apart" and "close vicinity" are determined: the covariance matrix of the discriminant variable space.
Computation of Fisher MVA couldn’t be simpler:
\[
x_{\mathrm{Fisher}}(i_{\mathrm{event}}) = \sum_{v\,\in\,\{\mathrm{variables}\}} F_v\, x_v(i_{\mathrm{event}})
\]
where the \(F_v\) are the "Fisher coefficients"
H-Matrix estimator: correlated χ² – poor man's variation of the Fisher discriminant
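A minimal sketch of how Fisher coefficients can be computed in the standard textbook form F = W⁻¹(μ_S − μ_B); TMVA's exact normalisation and offset conventions may differ:

```cpp
// Sketch: Fisher coefficients F = W^{-1} (mu_S - mu_B), with W the sum of
// the signal and background covariance matrices (within-class scatter).
#include "TMatrixD.h"
#include "TMatrixDSym.h"
#include "TVectorD.h"

TVectorD FisherCoefficients(const TMatrixDSym& covS, const TMatrixDSym& covB,
                            const TVectorD& meanS, const TVectorD& meanB)
{
   TMatrixDSym Wsym(covS);
   Wsym += covB;                  // within-class scatter matrix W
   TMatrixD W(Wsym);              // copy into a general matrix for inversion
   W.Invert();                    // W^{-1}
   return W * (meanS - meanB);    // Fisher coefficients F_v
}
```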
Artificial Neural Network (ANN)
[Diagram: multilayer perceptron with N_var discriminating input variables, 1 input layer, k hidden layers with M_1 … M_k nodes, 1 output layer, 2 output classes (signal and background); weights w_ij connect the layers]
Feed-forward rule, with the "activation" function \(A(x) = \left(1 + e^{-x}\right)^{-1}\):
\[
x_j^{(k)} = A\!\left( w_{0j}^{(k)} + \sum_{i=1}^{M_{k-1}} w_{ij}^{(k)}\, x_i^{(k-1)} \right), \qquad x_i^{(0)} = x_i \ (i = 1 \ldots N_{\mathrm{var}}), \quad x_{1,2}^{(k+1)} = \text{network outputs}
\]
TMVA has three different ANN implementations – all are Multilayer Perceptrons
1. Clermont-Ferrand ANN: used for ALEPH Higgs analysis; translated from FORTRAN
2. TMultiLayerPerceptron interface: ANN implemented in ROOT
3. MLP-ANN (new): similar to TMultiLayerPerceptron but faster and more flexible
recent comparison on tau-ID in ATLAS with Stuttgart NN confirmed excellent performance!
ANNs are non-linear discriminants: Fisher = ANN without hidden layer
ANNs seem to be better adapted to realistic use cases than Fisher and Likelihood
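A minimal sketch of the feed-forward rule above for a single layer; the weight layout (bias weight in the first slot) is an assumption for illustration, not TMVA's internal representation:

```cpp
// Sketch: feed-forward evaluation of one MLP layer, as in the formula above.
#include <cmath>
#include <vector>

// activation A(x) = 1 / (1 + exp(-x))
double Activation(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One layer: out_j = A( w[j][0] + sum_i w[j][i+1] * in_i )
std::vector<double> FeedForwardLayer(const std::vector<double>& in,
                                     const std::vector<std::vector<double>>& w)
{
   std::vector<double> out;
   for (const auto& wj : w) {
      double sum = wj[0];                                 // bias weight w_{0j}
      for (size_t i = 0; i < in.size(); ++i) sum += wj[i + 1] * in[i];
      out.push_back(Activation(sum));
   }
   return out;
}
```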
Decision Trees
Decision Trees: a sequential application of "cuts" which splits the data into nodes; the final nodes (leaves) classify an event as signal or background
[Diagram: decision tree of depth 3. Starting from the root node (N_signal, N_bkg), each node splits the events by a cut of the type Var_i > x_i1 vs. Var_i ≤ x_i1 (then Var_k vs. x_k2, Var_j vs. x_j3, …), ending in leaf nodes classified as signal (N_signal > N_bkg) or background (N_signal < N_bkg)]
Pros:
robust out-of-the-box method
immune against addition of weak variables
Cons:
instability: small changes in training sample can give large changes in tree structure
overtraining: requires huge input statistics
Boosting (bagging, …), pruning:
combine several decision trees into a forest, where each tree is optimised for different event weights; can also give overall weights to trees in forest
pruning: remove statistically insignificant nodes
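For concreteness, a sketch of one boosting step in the generic AdaBoost formulation (a textbook form; the details of TMVA's implementation may differ): after a tree is trained, misclassified events are reweighted and the tree receives a weight in the forest.

```cpp
// Sketch: generic AdaBoost step. After training tree t with weighted
// misclassification rate err_t, misclassified events are boosted and the
// tree gets weight alpha_t in the forest (assumes 0 < err < 1).
#include <cmath>
#include <vector>

struct TrainEvent { std::vector<double> x; int label; double weight; };

double AdaBoostStep(std::vector<TrainEvent>& sample,
                    const std::vector<int>& treeDecision)   // per-event tree output
{
   double err = 0, sumW = 0;
   for (size_t i = 0; i < sample.size(); ++i) {
      sumW += sample[i].weight;
      if (treeDecision[i] != sample[i].label) err += sample[i].weight;
   }
   err /= sumW;                                             // weighted error rate
   const double alpha = std::log((1.0 - err) / err);        // tree weight in forest
   for (size_t i = 0; i < sample.size(); ++i)
      if (treeDecision[i] != sample[i].label)
         sample[i].weight *= std::exp(alpha);               // boost misclassified events
   return alpha;
}
```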
RuleFit
Ref: Friedman-Popescu: http://www-stat.stanford.edu/~jhf/R-RuleFit.html
Model is a linear combination of rules; a rule is a sequence of cuts:
The problem to solve is: 1. create the rule ensemble from a set of decision trees; 2. fit the coefficients by "gradient directed regularization" (Friedman et al.)
Fast, robust and good performance
\[
F(x) = a_0 + \sum_{m=1}^{M} a_m\, r_m(x)
\]
where \(F(x)\) is the score, \(x\) the data, the \(a_m\) the coefficients, and the \(r_m\) the rules, e.g.
\[
r_i(x) = I(x_5 > 3)\, I(x_2 < 2)\,\ldots \qquad \text{where } I(\ldots) = 1 \text{ if the cut is satisfied, 0 otherwise}
\]
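A minimal sketch of evaluating such a score, with each rule stored as a conjunction of cuts; this rule representation is an assumption for illustration:

```cpp
// Sketch: RuleFit score F(x) = a0 + sum_m a_m r_m(x), where each rule r_m is
// a product of indicator functions I(...) over cuts on the input variables.
#include <functional>
#include <vector>

struct Rule {
   std::vector<std::function<bool(const std::vector<double>&)>> cuts;
   double coefficient;   // a_m
};

double RuleFitScore(const std::vector<double>& x,
                    const std::vector<Rule>& rules, double a0)
{
   double score = a0;
   for (const Rule& r : rules) {
      bool pass = true;
      for (const auto& cut : r.cuts)
         if (!cut(x)) { pass = false; break; }   // I(...) = 0 if any cut fails
      if (pass) score += r.coefficient;          // r_m(x) = 1: add a_m
   }
   return score;
}
```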
Methods tested:
1. Multi Layer Perceptron (MLP)
2. Fisher
3. RuleFit
Test on 4k simulated SUSY + top events with 4 discriminating variables
TMVA – an example
A Truly Academic Example
Simple toy to illustrate the strength of the decorrelation technique
4 linearly correlated Gaussians, with equal RMS and shifted means between S and B
TMVA output:
--- TMVA_Factory: correlation matrix (signal):
---------------------------------------------------
---           var1     var2     var3     var4
--- var1:   +1.000   +0.336   +0.393   +0.447
--- var2:   +0.336   +1.000   +0.613   +0.668
--- var3:   +0.393   +0.613   +1.000   +0.907
--- var4:   +0.447   +0.668   +0.907   +1.000
--- TMVA_MethodFisher: ranked output (top variable is best ranked)
---------------------------------------------------
--- Variable:   Coefficient:   Discr. power:
---------------------------------------------------
--- var4:         +8.077         0.3888
--- var3:         -3.417         0.2629
--- var2:         -0.982         0.1394
--- var1:         -0.812         0.0391
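Correlated toy variables like these can be produced by transforming independent Gaussians with a square root of the target covariance matrix; a minimal sketch with ROOT classes, where the means and covariance values would be chosen by the user:

```cpp
// Sketch: one toy event with linearly correlated Gaussian variables, built
// by applying a square root L of the target covariance C = L L^T (here from
// a Cholesky decomposition) to independent N(0,1) draws.
#include "TDecompChol.h"
#include "TMatrixD.h"
#include "TMatrixDSym.h"
#include "TRandom3.h"
#include "TVectorD.h"

TVectorD CorrelatedGaussians(const TMatrixDSym& cov, const TVectorD& mean,
                             TRandom3& rng)
{
   TDecompChol chol(cov);
   chol.Decompose();
   TMatrixD L(TMatrixD::kTransposed, chol.GetU());   // GetU(): C = U^T U, so L = U^T

   TVectorD u(mean.GetNrows());
   for (Int_t i = 0; i < u.GetNrows(); ++i) u(i) = rng.Gaus(0, 1);   // independent N(0,1)

   return mean + L * u;   // correlated vector; shift 'mean' for S vs. B
}
```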
Quality Assurance
[Plots: Likelihood reference PDFs, MLP convergence, MLP architecture]
The Output
TMVA output distributions for Fisher, Likelihood, BDT and MLP
Fisher, LikelihoodD, ANNs are optimal
Note: about all realistic use cases are much more difficult than this one
Concluding Remarks
TMVA is still a young project!
TMVA is open source: we want to gather as much development help as possible (from all areas, including real experts from statistics faculties – there is no one so far!) to make it a reference tool in HEP
New MVA methods are important, but the emphasis lies on the consolidation of the currently implemented methods:
Implement fully independent, phase-space-dependent MVAs
Implement use of event weights throughout all methods (missing: Cuts, Fisher, H-Matrix, RuleFit)
Improve kernel estimators
Improve decorrelation techniques
Expect performance gain from combination of MVAs in Committees
Although the number of users is rising, we still need more real-life experience!
Need to write a manual!