TRANSCRIPT
TMVA
Andreas Höcker (CERN)
CERN meeting, Oct 6, 2006
Toolkit for Multivariate Data Analysis
MVA Experience
Many (more and more) HEP data analyses use multivariate techniques
Often analysts use custom tools, without much comparison
MVAs are tedious to implement & debug, therefore there are few comparisons between methods!
most accepted: cuts
also widely understood and accepted: likelihood (probability density estimators – PDE )
often disliked, but more and more used (LEP, BABAR): Artificial Neural Networks
much used in BABAR and Belle: Fisher discriminants
introduced by D0 (for electron id): H-Matrix
used by MiniBooNE and recently by BABAR: Boosted Decision Trees
All interesting methods … but how to dispel the widespread skepticism?
black boxes!
what if the training samples incorrectly describe the data?
how can one evaluate systematics?
What is TMVA?
The Toolkit for Multivariate Analysis (TMVA) provides a ROOT-integrated environment for the parallel processing and evaluation of MVA techniques to discriminate signal from background samples. Note that TMVA does not seek to promote a particular method!
The TMVA analysis provides training, testing and evaluation of the MVAs
Each MVA method provides a ranking of the input variables
MVAs produce weight files that are read by a Reader class for the actual MVA analysis
TMVA presently includes (ranked by complexity):
rectangular cut optimisation
likelihood estimator (PDE approach)
multi-dimensional likelihood estimator (PDE range-search approach)
Fisher discriminant
H-Matrix approach (χ² estimator)
Multilayer Perceptron Artificial Neural Network (three different implementations)
Boosted Decision Trees
RuleFit
upcoming: Support Vector Machines, "Committee" methods
Common preprocessing of input data: decorrelation, principal-component analysis
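As an illustration of this workflow, a minimal sketch of booking and running several of the listed methods through the Factory; the file, tree and variable names and the (empty) option strings are illustrative, and the exact interface differs between TMVA versions:

```cpp
// Sketch: training/testing/evaluating several MVA methods via the Factory.
// File, tree and variable names are illustrative.
#include "TFile.h"
#include "TTree.h"
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void TMVAnalysisSketch()
{
   TFile* input   = TFile::Open("toy_samples.root");   // hypothetical input file
   TTree* sigTree = (TTree*)input->Get("TreeS");
   TTree* bkgTree = (TTree*)input->Get("TreeB");

   TFile* outFile = TFile::Open("TMVA.root", "RECREATE");
   TMVA::Factory factory("TMVAnalysis", outFile, "");

   factory.SetInputTrees(sigTree, bkgTree);   // signal and background samples
   factory.AddVariable("var1", 'F');          // discriminating input variables
   factory.AddVariable("var2", 'F');
   factory.AddVariable("var3", 'F');
   factory.AddVariable("var4", 'F');
   // (depending on the version, a PrepareTrainingAndTestTree call configures the split)

   // book a few of the methods listed above (option strings abbreviated)
   factory.BookMethod(TMVA::Types::kLikelihood, "Likelihood", "");
   factory.BookMethod(TMVA::Types::kFisher,     "Fisher",     "");
   factory.BookMethod(TMVA::Types::kMLP,        "MLP",        "");
   factory.BookMethod(TMVA::Types::kBDT,        "BDT",        "");

   factory.TrainAllMethods();      // training
   factory.TestAllMethods();       // testing
   factory.EvaluateAllMethods();   // evaluation and method comparison
   outFile->Close();
}
```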
TMVA Technicalities
TMVA is a SourceForge (SF) package to accommodate world-wide access; the code can be downloaded as a tar file, or via anonymous or developer CVS access
home page: http://tmva.sourceforge.net/
SF project page: http://sourceforge.net/projects/tmva
view CVS: http://tmva.cvs.sourceforge.net/tmva/TMVA/
mailing lists: http://sourceforge.net/mail/?group_id=152074
ROOT::TMVA class index: http://root.cern.ch/root/htmldoc/TMVA_Index.html
TMVA is written in C++ and is built upon ROOT
Since ROOT 5.11/03, TMVA is integrated in and distributed with ROOT
TMVA is modular
a training, testing and evaluation Factory iterates over all registered methods
each method has specific options that can be conveniently set by the user
ROOT scripts are provided for most important performance assessments
We enthusiastically (!) welcome new users and developers. At present there are 18 SF-registered TMVA developers! (http://sourceforge.net/project/memberlist.php?group_id=152074)
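The weight files mentioned earlier are read back by the Reader class in the user's own event loop. A minimal sketch, with the variable names and weight-file path assumed for illustration:

```cpp
// Sketch: applying a trained MVA with the TMVA Reader.
// The weight-file path and variable names are illustrative.
#include "TMVA/Reader.h"

void ApplyMVASketch()
{
   TMVA::Reader reader;
   Float_t var1, var2, var3, var4;
   reader.AddVariable("var1", &var1);   // must match the training variables
   reader.AddVariable("var2", &var2);
   reader.AddVariable("var3", &var3);
   reader.AddVariable("var4", &var4);

   reader.BookMVA("Fisher", "weights/TMVAnalysis_Fisher.weights.txt");

   // in the event loop: fill var1..var4 for each event, then evaluate
   Double_t mvaValue = reader.EvaluateMVA("Fisher");
   (void)mvaValue;   // use the MVA output in the analysis
}
```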
TMVA Methods
Cut Optimisation
Simplest method: cut in rectangular volume using Nvar input variables
\[
x_{\mathrm{cut}}(i_{\mathrm{event}}) \in \{0,1\}: \quad x_{v,\min} \le x_v(i_{\mathrm{event}}) \le x_{v,\max} \quad \text{for all } v \in \{\mathrm{variables}\}
\]
Usually the training files in TMVA do not contain realistic signal and background abundances, so one cannot optimise for best significance
instead: scan in signal efficiency [0,1] and maximise background rejection
Technical problem: how to perform maximisation
Minuit fit (SIMPLEX) found to be not reliable enough
robust, but suboptimal: random sampling
newly available techniques: Genetic Algorithm and Simulated Annealing (SA not yet optimal)
Huge speed improvement by sorting training events in Nvar-dim. Binary Trees
for 4 variables: 41 times faster than simple volume cut
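A hedged sketch of how such a cut optimisation could be booked, as a fragment that slots into the Factory job shown earlier; the option names follow TMVA's cut-method conventions but may differ between versions:

```cpp
// Sketch (fragment for the Factory job above): book rectangular-cut
// optimisation with a Genetic Algorithm fitter, scanning the signal
// efficiency as described above; option string is illustrative.
factory.BookMethod(TMVA::Types::kCuts, "CutsGA",
                   "FitMethod=GA:EffMethod=EffSel");
```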
Projected Likelihood Estimator (PDE Approach)
Combine the probability density distributions of the input variables into a likelihood estimator
\[
x_{\mathcal{L}}(i_{\mathrm{event}}) = \frac{\prod_{v\,\in\,\{\mathrm{variables}\}} p_v^{\mathrm{signal}}\big(x_v(i_{\mathrm{event}})\big)}{\sum_{\mathrm{species}\,s}\ \prod_{v\,\in\,\{\mathrm{variables}\}} p_v^{s}\big(x_v(i_{\mathrm{event}})\big)}
\]
where \(x_{\mathcal{L}}\) is the likelihood ratio for event \(i_{\mathrm{event}}\), the products run over the discriminating variables \(v\), the \(p_v\) are the reference PDFs, and the sum runs over the species \(s\) (signal and background types)
Assumes uncorrelated input variables
if true: optimal MVA approach, since it contains all the information
if not true: performance reduction; the reason for the development of other methods!
Technical problem: how to implement the reference PDFs. Three ways:
function fitting: difficult to automate
nonparametric fitting (splines, kernel estimators): easy to automate, but can create artefacts; currently in TMVA: Splines 0-5, Gaussian KDEs
counting: automatic and unbiased, but suboptimal
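A minimal sketch of evaluating the likelihood ratio above for the two-species case, with the reference PDFs represented as binned ROOT histograms (the binning and normalisation are illustrative):

```cpp
// Sketch: projected likelihood ratio x_L = prodS / (prodS + prodB), with
// per-variable reference PDFs stored as (normalised) ROOT histograms.
#include "TH1.h"
#include <vector>

double LikelihoodRatio(const std::vector<TH1*>& pdfS,   // signal PDFs, one per variable
                       const std::vector<TH1*>& pdfB,   // background PDFs
                       const std::vector<double>& x)    // the event's variable values
{
   double prodS = 1.0, prodB = 1.0;
   for (size_t v = 0; v < x.size(); ++v) {
      prodS *= pdfS[v]->GetBinContent(pdfS[v]->FindBin(x[v]));
      prodB *= pdfB[v]->GetBinContent(pdfB[v]->FindBin(x[v]));
   }
   const double sum = prodS + prodB;
   return (sum > 0) ? prodS / sum : 0.5;   // likelihood ratio in [0,1]
}
```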
Preprocessing the Input Variables: Decorrelation
Linear correlations can be removed by diagonalising the input variable space
Various ways to choose the diagonalisation (also implemented: principal component analysis)
Determine the square root C' of the correlation matrix C, i.e., C = C'C'; compute C' by diagonalising C:
\[
D = S^{T} C\, S \;\Rightarrow\; C' = S\,\sqrt{D}\,S^{T}
\]
Transformation from the original (x) into the decorrelated variable space (x') by: \(x' = C'^{-1} x\)
Note that this "decorrelation" is only complete if the input variables are Gaussian and the correlations are linear only
In practice: the gain from decorrelation is often rather modest, or even harmful!
Commonly realised for all methods in TMVA (centrally in DataSet class)
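A minimal sketch of this transformation with ROOT's matrix classes, not TMVA's internal code; it assumes the symmetric (correlation or covariance) matrix has already been estimated from the training sample:

```cpp
// Sketch: decorrelation via the square root of the symmetric matrix C,
// C' = S sqrt(D) S^T, then x' = C'^{-1} x, as in the formulas above.
#include "TMatrixD.h"
#include "TMatrixDSym.h"
#include "TMatrixDSymEigen.h"
#include "TVectorD.h"
#include "TMath.h"

TVectorD Decorrelate(const TMatrixDSym& C, const TVectorD& x)
{
   const Int_t n = C.GetNrows();
   TMatrixDSymEigen eigen(C);               // diagonalisation: D = S^T C S
   TMatrixD S = eigen.GetEigenVectors();    // columns: eigenvectors of C
   TVectorD d = eigen.GetEigenValues();     // diagonal of D

   TMatrixD sqrtD(n, n);                    // sqrt(D) as a diagonal matrix
   for (Int_t i = 0; i < n; ++i) sqrtD(i, i) = TMath::Sqrt(d(i));

   TMatrixD Cprime(S, TMatrixD::kMult,
                   TMatrixD(sqrtD, TMatrixD::kMultTranspose, S)); // C' = S sqrt(D) S^T
   Cprime.Invert();                         // C'^{-1}
   return Cprime * x;                       // decorrelated variables x'
}
```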
Multidimensional Likelihood Estimator
Carli-Koblitz, NIM A501, 576 (2003)
Generalisation of 1D PDE approach to Nvar dimensions
Optimal method – in theory – since full information is used
Practical challenges: parameterisation of multi-dimensional phase space needs huge training samples
implementation of Nvar-dim. reference PDF with kernel estimates or counting
for kernel estimates: difficult to control fidelity of parameterisation
TMVA implementation follows the Range-Search method: count the number of signal and background events in the "vicinity" of a data event
“vicinity” defined by fixed or adaptive Nvar-dim. volume size
volumes can be rectangular or elliptical
use multi-D kernels (Gaussians, triangular, …) to weight events within volume
speed up range search by sorting training events in Binary Trees
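For illustration, a brute-force sketch of the range-search idea, counting signal and background training events in a fixed rectangular box around the test point; TMVA instead sorts the events into binary trees for speed, and the box size here is illustrative:

```cpp
// Sketch: PDE range search by brute-force counting in a fixed rectangular
// volume (TMVA uses binary trees and also supports adaptive/elliptical
// volumes and kernel weighting).
#include <cmath>
#include <vector>

struct Event { std::vector<double> x; };   // one training event

double PdeRangeSearch(const std::vector<Event>& sig,
                      const std::vector<Event>& bkg,
                      const std::vector<double>& point, double halfWidth)
{
   auto countInBox = [&](const std::vector<Event>& sample) {
      int n = 0;
      for (const Event& e : sample) {
         bool inside = true;
         for (size_t v = 0; v < point.size(); ++v)
            if (std::abs(e.x[v] - point[v]) > halfWidth) { inside = false; break; }
         if (inside) ++n;
      }
      return n;
   };
   const int nS = countInBox(sig), nB = countInBox(bkg);
   return (nS + nB > 0) ? double(nS) / (nS + nB) : 0.5;   // MVA output in [0,1]
}
```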
Fisher Discriminant (and H-Matrix)
Well-known, simple and elegant MVA method: event selection is performed in a transformed variable space with zero linear correlations, by distinguishing the mean values of the signal and background distributions
Instead of equations, words:
optimal for linearly correlated Gaussians with equal RMS values and different means
no separation if equal means and different RMS (shapes)
An axis is determined in the (correlated) hyperspace of the input variables such that, when projecting the output classes (signal and background) upon this axis, they are pushed as far as possible away from each other, while events of the same class are confined to a close vicinity. The linearity property of this method is reflected in the metric with which "far apart" and "close vicinity" are determined: the covariance matrix of the discriminant variable space.
Computation of Fisher MVA couldn’t be simpler:
\[
x_{\mathrm{Fisher}}(i_{\mathrm{event}}) = \sum_{v\,\in\,\{\mathrm{variables}\}} F_v\, x_v(i_{\mathrm{event}})
\]
where the \(F_v\) are the "Fisher coefficients"
H-Matrix estimator: correlated χ² – poor man's variation of the Fisher discriminant
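A minimal sketch of how Fisher coefficients can be computed in the standard textbook form F = W⁻¹(μ_S − μ_B); TMVA's exact normalisation and offset conventions may differ:

```cpp
// Sketch: Fisher coefficients F = W^{-1} (mu_S - mu_B), with W the sum of
// the signal and background covariance matrices (within-class scatter).
#include "TMatrixD.h"
#include "TMatrixDSym.h"
#include "TVectorD.h"

TVectorD FisherCoefficients(const TMatrixDSym& covS, const TMatrixDSym& covB,
                            const TVectorD& meanS, const TVectorD& meanB)
{
   TMatrixDSym Wsym(covS);
   Wsym += covB;                  // within-class scatter matrix W
   TMatrixD W(Wsym);              // copy into a general matrix for inversion
   W.Invert();                    // W^{-1}
   return W * (meanS - meanB);    // Fisher coefficients F_v
}
```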
Artificial Neural Network (ANN)
[Diagram: multilayer perceptron with N_var discriminating input variables, 1 input layer, k hidden layers with M_1 … M_k nodes, 1 output layer, 2 output classes (signal and background); weights w_ij connect the layers]
Feed-forward rule, with the "activation" function \(A(x) = \left(1 + e^{-x}\right)^{-1}\):
\[
x_j^{(k)} = A\!\left( w_{0j}^{(k)} + \sum_{i=1}^{M_{k-1}} w_{ij}^{(k)}\, x_i^{(k-1)} \right), \qquad x_i^{(0)} = x_i \ (i = 1 \ldots N_{\mathrm{var}}), \quad x_{1,2}^{(k+1)} = \text{network outputs}
\]
TMVA has three different ANN implementations – all are Multilayer Perceptrons
1. Clermont-Ferrand ANN: used for ALEPH Higgs analysis; translated from FORTRAN
2. TMultiLayerPerceptron interface: ANN implemented in ROOT
3. MLP-ANN (new): similar to TMultiLayerPerceptron but faster and more flexible
recent comparison on tau-ID in ATLAS with Stuttgart NN confirmed excellent performance!
ANNs are non-linear discriminants: Fisher = ANN without hidden layer
ANNs seem to be better adapted to realistic use cases than Fisher and Likelihood
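A minimal sketch of the feed-forward rule above for a single layer; the weight layout (bias weight in the first slot) is an assumption for illustration, not TMVA's internal representation:

```cpp
// Sketch: feed-forward evaluation of one MLP layer, as in the formula above.
#include <cmath>
#include <vector>

// activation A(x) = 1 / (1 + exp(-x))
double Activation(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One layer: out_j = A( w[j][0] + sum_i w[j][i+1] * in_i )
std::vector<double> FeedForwardLayer(const std::vector<double>& in,
                                     const std::vector<std::vector<double>>& w)
{
   std::vector<double> out;
   for (const auto& wj : w) {
      double sum = wj[0];                                 // bias weight w_{0j}
      for (size_t i = 0; i < in.size(); ++i) sum += wj[i + 1] * in[i];
      out.push_back(Activation(sum));
   }
   return out;
}
```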
Decision Trees
Decision Trees: a sequential application of "cuts" which splits the data into nodes; the final nodes (leaves) classify an event as signal or background
[Diagram: decision tree of depth 3. Starting from the root node (N_signal, N_bkg), each node splits the events by a cut of the type Var_i > x_i1 vs. Var_i ≤ x_i1 (then Var_k vs. x_k2, Var_j vs. x_j3, …), ending in leaf nodes classified as signal (N_signal > N_bkg) or background (N_signal < N_bkg)]
Pros:
robust out-of-the-box method
immune against addition of weak variables
Cons:
instability: small changes in training sample can give large changes in tree structure
overtraining: requires huge input statistics
Boosting (bagging, …), pruning:
combine several decision trees into a forest, where each tree is optimised for different event weights; can also give overall weights to trees in forest
pruning: remove statistically insignificant nodes
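For concreteness, a sketch of one boosting step in the generic AdaBoost formulation (a textbook form; the details of TMVA's implementation may differ): after a tree is trained, misclassified events are reweighted and the tree receives a weight in the forest.

```cpp
// Sketch: generic AdaBoost step. After training tree t with weighted
// misclassification rate err_t, misclassified events are boosted and the
// tree gets weight alpha_t in the forest (assumes 0 < err < 1).
#include <cmath>
#include <vector>

struct TrainEvent { std::vector<double> x; int label; double weight; };

double AdaBoostStep(std::vector<TrainEvent>& sample,
                    const std::vector<int>& treeDecision)   // per-event tree output
{
   double err = 0, sumW = 0;
   for (size_t i = 0; i < sample.size(); ++i) {
      sumW += sample[i].weight;
      if (treeDecision[i] != sample[i].label) err += sample[i].weight;
   }
   err /= sumW;                                             // weighted error rate
   const double alpha = std::log((1.0 - err) / err);        // tree weight in forest
   for (size_t i = 0; i < sample.size(); ++i)
      if (treeDecision[i] != sample[i].label)
         sample[i].weight *= std::exp(alpha);               // boost misclassified events
   return alpha;
}
```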
RuleFit
Ref: Friedman-Popescu: http://www-stat.stanford.edu/~jhf/R-RuleFit.html
Model is a linear combination of rules; a rule is a sequence of cuts:
The problem to solve is: 1. create the rule ensemble from a set of decision trees; 2. fit the coefficients by "gradient directed regularization" (Friedman et al.)
Fast, robust and good performance
\[
F(x) = a_0 + \sum_{m=1}^{M} a_m\, r_m(x)
\]
where \(F(x)\) is the score, \(x\) the data, the \(a_m\) the coefficients, and the \(r_m\) the rules, e.g.
\[
r_i(x) = I(x_5 > 3)\, I(x_2 < 2)\,\ldots \qquad \text{where } I(\ldots) = 1 \text{ if the cut is satisfied, 0 otherwise}
\]
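A minimal sketch of evaluating such a score, with each rule stored as a conjunction of cuts; this rule representation is an assumption for illustration:

```cpp
// Sketch: RuleFit score F(x) = a0 + sum_m a_m r_m(x), where each rule r_m is
// a product of indicator functions I(...) over cuts on the input variables.
#include <functional>
#include <vector>

struct Rule {
   std::vector<std::function<bool(const std::vector<double>&)>> cuts;
   double coefficient;   // a_m
};

double RuleFitScore(const std::vector<double>& x,
                    const std::vector<Rule>& rules, double a0)
{
   double score = a0;
   for (const Rule& r : rules) {
      bool pass = true;
      for (const auto& cut : r.cuts)
         if (!cut(x)) { pass = false; break; }   // I(...) = 0 if any cut fails
      if (pass) score += r.coefficient;          // r_m(x) = 1: add a_m
   }
   return score;
}
```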
Methods tested:
1. Multi Layer Perceptron (MLP)
2. Fisher
3. RuleFit
Test on 4k simulated SUSY + top events with 4 discriminating variables
TMVA – an example
A Truly Academic Example
Simple toy to illustrate the strength of the decorrelation technique
4 linearly correlated Gaussians, with equal RMS and shifted means between S and B
TMVA output:
--- TMVA_Factory: correlation matrix (signal):
---------------------------------------------------
---           var1     var2     var3     var4
--- var1:   +1.000   +0.336   +0.393   +0.447
--- var2:   +0.336   +1.000   +0.613   +0.668
--- var3:   +0.393   +0.613   +1.000   +0.907
--- var4:   +0.447   +0.668   +0.907   +1.000
--- TMVA_MethodFisher: ranked output (top variable is best ranked)
---------------------------------------------------
--- Variable:   Coefficient:   Discr. power:
---------------------------------------------------
--- var4:         +8.077         0.3888
--- var3:         -3.417         0.2629
--- var2:         -0.982         0.1394
--- var1:         -0.812         0.0391
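Correlated toy variables like these can be produced by transforming independent Gaussians with a square root of the target covariance matrix; a minimal sketch with ROOT classes, where the means and covariance values would be chosen by the user:

```cpp
// Sketch: one toy event with linearly correlated Gaussian variables, built
// by applying a square root L of the target covariance C = L L^T (here from
// a Cholesky decomposition) to independent N(0,1) draws.
#include "TDecompChol.h"
#include "TMatrixD.h"
#include "TMatrixDSym.h"
#include "TRandom3.h"
#include "TVectorD.h"

TVectorD CorrelatedGaussians(const TMatrixDSym& cov, const TVectorD& mean,
                             TRandom3& rng)
{
   TDecompChol chol(cov);
   chol.Decompose();
   TMatrixD L(TMatrixD::kTransposed, chol.GetU());   // GetU(): C = U^T U, so L = U^T

   TVectorD u(mean.GetNrows());
   for (Int_t i = 0; i < u.GetNrows(); ++i) u(i) = rng.Gaus(0, 1);   // independent N(0,1)

   return mean + L * u;   // correlated vector; shift 'mean' for S vs. B
}
```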
Quality Assurance
[Plots: Likelihood reference PDFs, MLP convergence, MLP architecture]
The Output
TMVA output distributions for Fisher, Likelihood, BDT and MLP
Fisher, LikelihoodD, ANNs are optimal
Note: about all realistic use cases are much more difficult than this one
Concluding Remarks
TMVA is still a young project!
TMVA is open source: we want to gather as much development help as possible (from all areas, including real experts from statistics faculties – there is no one so far!) to make it a reference tool in HEP
New MVA methods are important, but the emphasis lies on the consolidation of the currently implemented methods:
Implement fully independent, phase-space-dependent MVAs
Implement use of event weights throughout all methods (missing: Cuts, Fisher, H-Matrix, RuleFit)
Improve kernel estimators
Improve decorrelation techniques
Expect performance gain from combination of MVAs in Committees
Although the number of users is rising, we still need more real-life experience!
Need to write a manual!