Introduction to Statistics and Machine Learning
Helge Voss, GSI Power Week, Dec 5-9 2011

How do we understand and interpret our measurements? And how do we get the data for our measurements?
Classifier Training and Loss-Function
kNN and likelihood methods: estimate the PDF in D dimensions and in 1 dimension, respectively.
Alternative: provide a set of "basis" functions (or a model) whose parameters are adjusted to give the optimally separating hyperplane (surface).
Loss function: penalizes prediction errors on the training data. Adjust the parameters w such that the loss is minimal:
squared error loss (regression): $L(\mathbf{w}) = \sum_i \big(y(\mathbf{x}_i) - \hat{y}_i\big)^2$
misclassification error (classification)
where the target $\hat{y}_i$ is: for regression the functional value of the training event, for classification $\hat{y} = 1$ for signal and $\hat{y} = 0$ (or $-1$) for background.
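As a rough illustration (not from the slides; the arrays below are made-up toy values), the two loss types can be computed like this in Python:

import numpy as np

# toy predictions y(x_i) and targets y_hat_i (classification: 1 = signal, 0 = background)
y_pred = np.array([0.9, 0.2, 0.7, 0.4])
y_true = np.array([1.0, 0.0, 1.0, 0.0])

squared_error  = np.sum((y_pred - y_true) ** 2)             # squared error loss (regression)
misclass_error = np.mean((y_pred > 0.5) != (y_true > 0.5))  # misclassification error (classification)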
Linear Discriminant
$y(\mathbf{x} = \{x_1, \dots, x_D\}) = \sum_{i=0}^{M} w_i\, h_i(\mathbf{x})$
Non-parametric methods like "k-Nearest-Neighbour" suffer from a lack of training data ("curse of dimensionality") and from slow response time: the whole training data set has to be evaluated for each classification.
Instead, use a parametric model y(x) and fit it to the training data, e.g. any linear function of the input variables, giving rise to linear decision boundaries:
$y(\mathbf{x} = \{x_1, \dots, x_D\}) = w_0 + \sum_{i=1}^{D} w_i x_i$
[Figure: linear decision boundary separating the hypotheses H0 and H1 in the (x1, x2) plane]
How do we determine the “weights” w that do “best”??
Linear Discriminant: Fisher's Linear Discriminant
$y(\mathbf{x} = \{x_1, \dots, x_D\}) = y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i x_i$
determine the “weights” w that do “best”
Maximise the "separation" between S and B: minimise the overlap of the distributions $y_S$ and $y_B$, i.e. maximise the distance between the mean values of the two classes and minimise the variance within each class.
[Figure: distributions of $y_S$ and $y_B$]
maximise $J(\mathbf{w}) = \dfrac{\big(E(y_S) - E(y_B)\big)^2}{\sigma_{y_S}^2 + \sigma_{y_B}^2}$
$J(\mathbf{w}) = \dfrac{\mathbf{w}^T B\,\mathbf{w}}{\mathbf{w}^T W\,\mathbf{w}} = \dfrac{\text{"in between" variance}}{\text{"within" variance}}$
note: these quantities can be calculated from the training data
$\nabla_{\mathbf{w}} J(\mathbf{w}) = 0 \;\Rightarrow\; \mathbf{w} \propto W^{-1}(\bar{\mathbf{x}}_S - \bar{\mathbf{x}}_B)$: the Fisher coefficients
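As a hedged illustration (not the slides' code): the Fisher coefficients can be computed directly from the training samples, here with NumPy, taking the "within" matrix W as the sum of the two class covariance matrices.

import numpy as np

def fisher_coefficients(X_sig, X_bkg):
    """Fisher weights w ~ W^{-1} (mean_S - mean_B); X_sig, X_bkg are (n_events, D) arrays."""
    mean_s, mean_b = X_sig.mean(axis=0), X_bkg.mean(axis=0)
    W = np.cov(X_sig, rowvar=False) + np.cov(X_bkg, rowvar=False)   # "within" class matrix
    return np.linalg.solve(W, mean_s - mean_b)

# the discriminant value is then y(x) = w . x (up to an irrelevant offset and overall scale)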
Linear Discriminant and non-linear correlations
Assume the following non-linearly correlated data: the linear discriminant obviously doesn't do a very good job here.
Of course, these variables can easily be de-correlated; on the de-correlated data the linear discriminant works perfectly.
Here: $\text{var0}' = \sqrt{\text{var0}^2 + \text{var1}^2}$, $\text{var1}' = \arctan(\text{var0}/\text{var1})$
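A minimal sketch of this decorrelation in Python, assuming the radius/angle transformation above (the function name is made up):

import numpy as np

def decorrelate(var0, var1):
    """Map circularly correlated (var0, var1) to approximately uncorrelated (radius, angle)."""
    radius = np.sqrt(var0 ** 2 + var1 ** 2)
    angle  = np.arctan2(var0, var1)      # arctan(var0 / var1), quadrant-safe
    return radius, angle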
Linear Discriminant with Quadratic input:
A simple extension to a "quadratic" decision boundary: add var0 * var0, var1 * var1 and var0 * var1 as input variables.
While var0 and var1 alone give linear decision boundaries in (var0, var1), the quadratic inputs give quadratic decision boundaries in (var0, var1).
[Figure: performance of the Fisher discriminant, the Fisher discriminant with decorrelated variables, and the Fisher discriminant with quadratic input]
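A hedged sketch of this idea in Python, using scikit-learn's LinearDiscriminantAnalysis as a stand-in for the Fisher discriminant (equivalent for two classes); X_train, y_train, X_test are assumed placeholders:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def add_quadratic_inputs(X):
    """X has columns (var0, var1); append var0*var0, var1*var1 and var0*var1."""
    v0, v1 = X[:, 0], X[:, 1]
    return np.column_stack([v0, v1, v0 * v0, v1 * v1, v0 * v1])

# fisher = LinearDiscriminantAnalysis().fit(add_quadratic_inputs(X_train), y_train)
# scores = fisher.decision_function(add_quadratic_inputs(X_test))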
Of course, if one "finds out" or "knows" the correlations, they are best treated explicitly: either by explicit decorrelation, or e.g. by:
Function discriminant analysis (FDA)
Fit any user-defined function of the input variables, requiring that signal events return 1 and background events 0.
Parameter fitting: genetic algorithm, MINUIT, Monte Carlo, and combinations thereof. Easy reproduction of the Fisher result, but non-linearities can be added. A very transparent discriminator.
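A hedged sketch of the FDA idea (not the TMVA implementation): fit the parameters of a user-defined function, here a quadratic in (var0, var1), by least squares so that signal events return 1 and background events 0. X_train and y_train are assumed placeholders.

import numpy as np
from scipy.optimize import minimize

def user_function(p, X):
    """User-defined discriminant, quadratic in (var0, var1); p are the fitted parameters."""
    v0, v1 = X[:, 0], X[:, 1]
    return p[0] + p[1] * v0 + p[2] * v1 + p[3] * v0 * v0 + p[4] * v1 * v1 + p[5] * v0 * v1

def fda_loss(p, X, target):
    """Squared error: target = 1 for signal events, 0 for background events."""
    return np.sum((user_function(p, X) - target) ** 2)

# p_best = minimize(fda_loss, x0=np.zeros(6), args=(X_train, y_train)).x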
Neural Networks
Naturally, if we want to go to "arbitrary" non-linear decision boundaries, y(x) needs to be constructed in "any" non-linear fashion.
Think of the $h_i(\mathbf{x})$ as a set of "basis" functions. If h(x) is sufficiently general (i.e. non-linear), a linear combination of "enough" basis functions should allow us to describe any possible discriminating function y(x).
Imagine you chose to do the following:
choose as basis functions $h_i(\mathbf{x}) = A\!\left(w_{i0} + \sum_{j=1}^{D} w_{ij}\, x_j\right)$, i.e. a non-linear function A of a linear combination of the inputs
K. Weierstrass' theorem proves just that previous statement.
And there is your neural network. Now we "only" need to find the appropriate "weights" w.
$y(\mathbf{x}) = \sum_{i=1}^{M} w_{0i}\, A\!\left(w_{i0} + \sum_{j=1}^{D} w_{ij}\, x_j\right)$, with $A(x) = \dfrac{1}{1 + e^{-x}}$: the sigmoid function
y(x) is a linear combination of non-linear functions of linear combinations of the input data.
In other words: $y(\mathbf{x}) = \sum_{i=1}^{M} w_i\, h_i(\mathbf{x})$, where each basis function $h_i(\mathbf{x})$ is built from the linear combination $w_{i0} + \sum_{j=1}^{D} w_{ij}\, x_j$ of the inputs.
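A minimal sketch of this forward pass in Python (an illustration, not TMVA code), assuming the hidden-layer weights are stored as a matrix W_hidden of shape (M, D+1) with the bias $w_{i0}$ in column 0, and the output weights as a vector w_out of length M:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_output(x, W_hidden, w_out):
    """y(x) = sum_i w_out[i] * A(w_i0 + sum_j w_ij * x_j) for a single event x of length D."""
    hidden = sigmoid(W_hidden[:, 0] + W_hidden[:, 1:] @ x)   # M hidden-node activations
    return w_out @ hidden                                    # linear combination at the output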
Neural Networks: the Multilayer Perceptron (MLP)
But before talking about the weights, let’s try to “interpret” the formula as a Neural Network:
The nodes in the hidden layer represent the "activation functions", whose arguments are linear combinations of the input variables: a non-linear response to the input.
The output is a linear combination of the outputs of the activation functions at the internal nodes.
It is straightforward to extend this to "several" hidden layers.
Each layer takes its input from the preceding nodes only: a feed-forward network (no backward loops).
[Figure: network diagram with an input layer, a hidden layer and an output layer; the D discriminating input variables plus one offset node feed the M hidden nodes via weights $w_{ij}$, and the hidden-node outputs feed the output node via weights $w_{0i}$]
Output: $y(\mathbf{x}) = \sum_{i=1}^{M} w_{0i}\, A\!\left(w_{i0} + \sum_{j=1}^{D} w_{ij}\, x_j\right)$
"Activation" function A: e.g. the sigmoid $A(x) = \dfrac{1}{1 + e^{-x}}$, or tanh, or ...
The nodes correspond to neurons and the links (weights) to synapses: a neural network tries to simulate the reaction of a brain to a certain stimulus (the input data).
Neural Network Training
Idea: using the "training events", adjust the weights such that y(x) → 0 for background events and y(x) → 1 for signal events. How do we adjust the weights? Minimize a loss function:
$L(\mathbf{w}) = \sum_{i}^{\text{events}} \big(y(\mathbf{x}_i) - \hat{y}(C_i)\big)^2$, where $\hat{y}(C) = 1$ for $C = \text{signal}$ and $\hat{y}(C) = 0$ for $C = \text{background}$
i.e. use the usual "sum of squares" or the misclassification error.
y(x) is a very "wiggly" function with many local minima, so one global overall fit is not efficient/reliable. Instead: back-propagation (learn from experience, gradually adjust the response) and online learning (learn event by event, continuously, not just once in a while).
(The loss compares the predicted event type with the true event type.)
Neural Network Training: back-propagation and online learning
Start with random weights and adjust them in each step by steepest descent of the loss function L:
$L(\mathbf{w}) = \sum_i \big(y(\mathbf{x}_i) - \hat{y}(C_i)\big)^2$, weight update $\mathbf{w}^{\,n+1} = \mathbf{w}^{\,n} - \eta\, \nabla_{\mathbf{w}} L(\mathbf{w})$, with learning rate $\eta$.
For online learning the training events must be mixed randomly, otherwise the weights are first steered in a (wrong) direction from which it is hard to get out again!
For the weights connected to the output node, with $y(\mathbf{x}) = \sum_{i=1}^{M} w_{0i}\, A\!\left(w_{i0} + \sum_{j=1}^{D} w_{ij}\, x_j\right)$:
$\dfrac{\partial L}{\partial w_{0i}} = \big(y(\mathbf{x}) - \hat{y}(C)\big)\, A\!\left(w_{i0} + \sum_{j=1}^{D} w_{ij}\, x_j\right)$
For the weights not connected to the output node the formula is a bit more complicated.
Note: all these gradients are easily calculated from the training events.
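A minimal sketch of one online-learning step for the single-hidden-layer network sketched earlier (constant factors absorbed into the learning rate; an illustration following the slide's formulas, not the TMVA code). The weight layout W_hidden of shape (M, D+1) and w_out of length M is the same assumption as before.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def online_update(x, target, W_hidden, w_out, eta=0.01):
    """One steepest-descent step on L = (y(x) - target)^2 for a single training event x."""
    a = W_hidden[:, 0] + W_hidden[:, 1:] @ x        # linear combinations at the hidden nodes
    h = sigmoid(a)                                  # hidden-node outputs A(a)
    delta = (w_out @ h) - target                    # y(x) - y_hat(C)
    grad_out = delta * h                            # dL/dw_0i: weights connected to the output node
    grad_hid = delta * w_out * h * (1.0 - h)        # back-propagated via A'(a) = A(a)(1 - A(a))
    w_out           -= eta * grad_out
    W_hidden[:, 0]  -= eta * grad_hid               # bias weights w_i0
    W_hidden[:, 1:] -= eta * np.outer(grad_hid, x)  # weights w_ij from the inputs
    return W_hidden, w_out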
training is repeated n-times over the whole training data sample. how often ??
Watching the Training Progress: for an MLP one can, for example, plot the network architecture (the weights) after each training epoch.
Overtraining
[Figure: signal (S) and background (B) events in the (x1, x2) plane with two possible decision boundaries: a smooth one and a very "wiggly" one that follows individual training events]
Training runs n times over all training data: how often? It seems intuitive that the smoother boundary will give better results on another, statistically independent data set than the wiggly one. So, e.g., stop the training before you start learning the statistical fluctuations in the data, and verify the performance on an independent "test" sample.
[Figure: classification error versus training cycles, shown for the training sample and for the test sample]
Possible overtraining is a concern for every "tunable parameter" a of a classifier: smoothing parameter, number of nodes, ...
Cross Validation
Classifiers have tuning parameters "a" that choose and control their performance: number of training cycles, number of nodes, number of layers, regularisation parameter (neural net), smoothing parameter h (kernel density estimator), ...
The more flexible (the more parameters) the classifier, the more prone it is to overtraining; and the more training data, the better the training result. So how should the data set be divided into "training", "test" and "validation" samples?
Cross validation: divide the data sample into, say, 5 sub-sets. [Diagram: five copies of the data split into Train/Train/Train/Train/Test blocks, with the Test block in a different position in each copy]
Train 5 classifiers $y_i(\mathbf{x}, a)$, $i = 1, \dots, 5$, where classifier $y_i(\mathbf{x}, a)$ is trained without the i-th sub-sample.
Calculate the test error: $CV(a) = \dfrac{1}{N_{\text{events}}} \sum_{k}^{\text{events}} L\big(y_i(\mathbf{x}_k, a)\big)$, with L the loss function and $y_i$ the classifier that was not trained on event k.
Choose the tuning parameter a for which CV(a) is minimal and train the final classifier using all the data. Too bad it is still NOT implemented in TMVA.
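Since the slides note this is not (yet) in TMVA, here is a generic sketch of the procedure in Python; the classifier is passed in as a train_and_predict callable and all names are placeholders:

import numpy as np

def cross_validation_error(X, y, train_and_predict, a, n_folds=5):
    """CV(a): mean loss over all events, each predicted by the classifier trained without its fold."""
    folds = np.array_split(np.random.permutation(len(X)), n_folds)
    losses = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[test_idx], a)
        losses.append((y_pred - y[test_idx]) ** 2)   # squared-error loss as an example
    return np.mean(np.concatenate(losses))

# scan a over candidate values, pick the one with the smallest CV(a), then retrain on all data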
What is the best Network Architecture?
Theoretically a single hidden layer is enough for any problem, provided one allows for a sufficient number of nodes (K. Weierstrass theorem).
"Relatively little is known concerning advantages and disadvantages of using a single hidden layer with many nodes over many hidden layers with fewer nodes. The mathematics and approximation theory of the MLP model with more than one hidden layer is not very well understood ... nonetheless there seems to be reason to conjecture that the two hidden layer model may be significantly more promising than the single hidden layer model."
(Glen Cowan) A. Pinkus, "Approximation theory of the MLP model in neural networks", Acta Numerica (1999), pp. 143-195.
Typically in high-energy physics the non-linearities are reasonably simple, so one layer with a larger number of nodes is probably enough; still, it is worth trying more layers (with fewer nodes in each layer).
Support Vector Machines
If neural networks are complicated by finding the proper optimal "weights" for best separation power, due to the "wiggly" functional behaviour of the piecewise-defined separating hyperplane;
if kNN (the multidimensional likelihood) suffers from the disadvantage that calculating the MVA output for each test event requires evaluating ALL training events;
if boosted decision trees are in theory always weaker than a perfect neural network;
then try to get the best of all worlds ...
Support vector machine: there are methods to create linear decision boundaries using only measures of distances (inner/scalar products). This leads to a quadratic optimisation problem. The decision boundary in the end is defined only by the training events that are closest to the boundary. And we have seen that variable transformations, i.e. moving into a higher-dimensional space (e.g. using var1*var1 in the Fisher discriminant), can allow non-linear problems to be separated with linear decision boundaries.
Support Vector Machines
Separable data: [Figure: signal and background events in the (x1, x2) plane with the optimal separating hyperplane, the margin and the support vectors indicated]
Find the hyperplane that best separates signal from background: a linear decision boundary (the optimal hyperplane).
Best separation: maximum distance (margin) between the closest events (the support vectors) and the hyperplane.
Non-separable data:
The solution of largest margin depends only on the inner products of the support vectors (distances): a quadratic minimisation problem.
If the data are non-separable, add a misclassification cost parameter $C \cdot \sum_i \xi_i$ (slack variables $\xi_i$) to the minimisation function.
Support Vector Machines
Non-linear cases: transform the variables into a higher-dimensional feature space where again a linear boundary (hyperplane) can separate the data.
[Figure: data that are non-separable in the original (x1, x2) space become separable after the transformation Φ(x1, x2) into the higher-dimensional feature space]
Support Vector Machines
Non-linear cases: the explicit transformation Φ does not need to be specified; only the "scalar product" (inner product) is needed, x·x → Φ(x)·Φ(x).
Certain kernel functions can be interpreted as scalar products between transformed vectors in the higher-dimensional feature space, e.g. Gaussian, polynomial, sigmoid kernels.
Choose a kernel and fit the hyperplane using the linear techniques developed above.
The kernel size parameter typically needs careful tuning! (Overtraining!)
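As a hedged sketch in Python, using scikit-learn's SVC rather than the TMVA SVM (the concepts map directly: C is the misclassification cost, gamma sets the Gaussian kernel size via gamma ~ 1/(2 sigma^2)); X_train, y_train, X_test are assumed placeholders:

from sklearn.svm import SVC

# Gaussian (RBF) kernel SVM: K(x, x') = exp(-gamma * |x - x'|^2)
svm = SVC(kernel="rbf", C=1.0, gamma=0.5)
# svm.fit(X_train, y_train)                 # y = 1 for signal, 0 for background
# y_score = svm.decision_function(X_test)   # signed distance to the separating hyperplane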
Support Vector Machines
How does this "kernel" business work? A kernel function is a scalar product in "some transformed" variable space; it defines "distances" in that variable space.
Standard scalar product x·x': large if x and x' point in the same "direction", zero if they are orthogonal (i.e. point along different axis dimensions).
Gaussian kernel: (essentially) zero if the points are "far apart" in the original data space, large only if they are in each other's "vicinity".
If the kernel size is small compared to the distance between the training data points, each data point is "lifted" into its "own" dimension: full separation of "any" event configuration, with decision boundaries along the coordinate axes. Well, that would of course be overtraining.
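A small numerical illustration of that limit (a sketch, not from the slides): for a Gaussian kernel the kernel matrix of the training points approaches the identity matrix when sigma is much smaller than the typical distance between points, i.e. every event effectively gets its "own" dimension.

import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K[i, j] = exp(-|x_i - x_j|^2 / (2 sigma^2)) for all pairs of points in X (n_events, D)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

X = np.random.rand(5, 2)                      # 5 toy events in 2 dimensions
print(gaussian_kernel_matrix(X, sigma=1.0))   # broad kernel: sizeable off-diagonal entries
print(gaussian_kernel_matrix(X, sigma=1e-3))  # tiny kernel: ~identity matrix, the overtraining regime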
Support Vector Machines
SVM, the kernel size parameter. Example: Gaussian kernels.
Kernel size (σ of the Gaussian) chosen too large: not enough "flexibility" in the underlying transformation.
Kernel size (σ of the Gaussian) chosen properly for the given problem.
(Colour code in the plots: red = large signal probability.)