Introduction to Statistics and Machine Learning
Helge Voss, GSI Power Week, Dec 5-9 2011

How do we understand and interpret our measurements? And how do we get the data for our measurements?
Classifier Training and Loss-Function
kNN and likelihood methods: estimate the PDF in D dimensions and in 1 dimension, respectively.
Alternative: provide a set of "basis" functions (or a model) whose parameters are adjusted to give the optimally separating hyperplane (surface).
Loss function: penalizes prediction errors on the training data. Adjust the parameters w such that the loss is minimal:
squared error loss (regression): $L(\mathbf{w}) = \sum_i \big(y(\mathbf{x}_i) - \hat{y}_i\big)^2$
misclassification error (classification)
where the target $\hat{y}_i$ is: for regression the functional value of the training event, for classification $\hat{y} = 1$ for signal and $\hat{y} = 0$ (or $-1$) for background.
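As a rough illustration (not from the slides; the arrays below are made-up toy values), the two loss types can be computed like this in Python:

import numpy as np

# toy predictions y(x_i) and targets y_hat_i (classification: 1 = signal, 0 = background)
y_pred = np.array([0.9, 0.2, 0.7, 0.4])
y_true = np.array([1.0, 0.0, 1.0, 0.0])

squared_error  = np.sum((y_pred - y_true) ** 2)             # squared error loss (regression)
misclass_error = np.mean((y_pred > 0.5) != (y_true > 0.5))  # misclassification error (classification)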
Linear Discriminant
$y(\mathbf{x} = \{x_1, \dots, x_D\}) = \sum_{i=0}^{M} w_i\, h_i(\mathbf{x})$
Non-parametric methods like "k-Nearest-Neighbour" suffer from a lack of training data ("curse of dimensionality") and from slow response time: the whole training data set has to be evaluated for each classification.
Instead, use a parametric model y(x) and fit it to the training data, e.g. any linear function of the input variables, giving rise to linear decision boundaries:
$y(\mathbf{x} = \{x_1, \dots, x_D\}) = w_0 + \sum_{i=1}^{D} w_i x_i$
[Figure: linear decision boundary separating the hypotheses H0 and H1 in the (x1, x2) plane]
How do we determine the “weights” w that do “best”??
Linear Discriminant: Fisher's Linear Discriminant
$y(\mathbf{x} = \{x_1, \dots, x_D\}) = y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i x_i$
determine the “weights” w that do “best”
Maximise the "separation" between S and B: minimise the overlap of the distributions $y_S$ and $y_B$, i.e. maximise the distance between the mean values of the two classes and minimise the variance within each class.
[Figure: distributions of $y_S$ and $y_B$]
maximise $J(\mathbf{w}) = \dfrac{\big(E(y_S) - E(y_B)\big)^2}{\sigma_{y_S}^2 + \sigma_{y_B}^2}$
$J(\mathbf{w}) = \dfrac{\mathbf{w}^T B\,\mathbf{w}}{\mathbf{w}^T W\,\mathbf{w}} = \dfrac{\text{"in between" variance}}{\text{"within" variance}}$
note: these quantities can be calculated from the training data
$\nabla_{\mathbf{w}} J(\mathbf{w}) = 0 \;\Rightarrow\; \mathbf{w} \propto W^{-1}(\bar{\mathbf{x}}_S - \bar{\mathbf{x}}_B)$: the Fisher coefficients
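As a hedged illustration (not the slides' code): the Fisher coefficients can be computed directly from the training samples, here with NumPy, taking the "within" matrix W as the sum of the two class covariance matrices.

import numpy as np

def fisher_coefficients(X_sig, X_bkg):
    """Fisher weights w ~ W^{-1} (mean_S - mean_B); X_sig, X_bkg are (n_events, D) arrays."""
    mean_s, mean_b = X_sig.mean(axis=0), X_bkg.mean(axis=0)
    W = np.cov(X_sig, rowvar=False) + np.cov(X_bkg, rowvar=False)   # "within" class matrix
    return np.linalg.solve(W, mean_s - mean_b)

# the discriminant value is then y(x) = w . x (up to an irrelevant offset and overall scale)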
Linear Discriminant and non-linear correlations
Assume the following non-linearly correlated data: the linear discriminant obviously doesn't do a very good job here.
Of course, these variables can easily be de-correlated; on the de-correlated data the linear discriminant works perfectly.
Here: $\text{var0}' = \sqrt{\text{var0}^2 + \text{var1}^2}$, $\text{var1}' = \arctan(\text{var0}/\text{var1})$
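A minimal sketch of this decorrelation in Python, assuming the radius/angle transformation above (the function name is made up):

import numpy as np

def decorrelate(var0, var1):
    """Map circularly correlated (var0, var1) to approximately uncorrelated (radius, angle)."""
    radius = np.sqrt(var0 ** 2 + var1 ** 2)
    angle  = np.arctan2(var0, var1)      # arctan(var0 / var1), quadrant-safe
    return radius, angle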
Linear Discriminant with Quadratic input:
A simple extension to a "quadratic" decision boundary: add var0 * var0, var1 * var1 and var0 * var1 as input variables.
While var0 and var1 alone give linear decision boundaries in (var0, var1), the quadratic inputs give quadratic decision boundaries in (var0, var1).
[Figure: performance of the Fisher discriminant, the Fisher discriminant with decorrelated variables, and the Fisher discriminant with quadratic input]
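A hedged sketch of this idea in Python, using scikit-learn's LinearDiscriminantAnalysis as a stand-in for the Fisher discriminant (equivalent for two classes); X_train, y_train, X_test are assumed placeholders:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def add_quadratic_inputs(X):
    """X has columns (var0, var1); append var0*var0, var1*var1 and var0*var1."""
    v0, v1 = X[:, 0], X[:, 1]
    return np.column_stack([v0, v1, v0 * v0, v1 * v1, v0 * v1])

# fisher = LinearDiscriminantAnalysis().fit(add_quadratic_inputs(X_train), y_train)
# scores = fisher.decision_function(add_quadratic_inputs(X_test))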
Of course, if one "finds out" or "knows" the correlations, they are best treated explicitly: either by explicit decorrelation, or e.g. by:
Function discriminant analysis (FDA)
Fit any user-defined function of the input variables, requiring that signal events return 1 and background events 0.
Parameter fitting: genetic algorithm, MINUIT, Monte Carlo, and combinations thereof. Easy reproduction of the Fisher result, but non-linearities can be added. A very transparent discriminator.
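A hedged sketch of the FDA idea (not the TMVA implementation): fit the parameters of a user-defined function, here a quadratic in (var0, var1), by least squares so that signal events return 1 and background events 0. X_train and y_train are assumed placeholders.

import numpy as np
from scipy.optimize import minimize

def user_function(p, X):
    """User-defined discriminant, quadratic in (var0, var1); p are the fitted parameters."""
    v0, v1 = X[:, 0], X[:, 1]
    return p[0] + p[1] * v0 + p[2] * v1 + p[3] * v0 * v0 + p[4] * v1 * v1 + p[5] * v0 * v1

def fda_loss(p, X, target):
    """Squared error: target = 1 for signal events, 0 for background events."""
    return np.sum((user_function(p, X) - target) ** 2)

# p_best = minimize(fda_loss, x0=np.zeros(6), args=(X_train, y_train)).x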
Neural Networks
Naturally, if we want to go to "arbitrary" non-linear decision boundaries, y(x) needs to be constructed in "any" non-linear fashion.
Think of the $h_i(\mathbf{x})$ as a set of "basis" functions. If h(x) is sufficiently general (i.e. non-linear), a linear combination of "enough" basis functions should allow us to describe any possible discriminating function y(x).
Imagine you chose to do the following:
choose as basis functions $h_i(\mathbf{x}) = A\!\left(w_{i0} + \sum_{j=1}^{D} w_{ij}\, x_j\right)$, i.e. a non-linear function A of a linear combination of the inputs
K. Weierstrass' theorem proves just that previous statement.
And there is your neural network. Now we "only" need to find the appropriate "weights" w.
$y(\mathbf{x}) = \sum_{i=1}^{M} w_{0i}\, A\!\left(w_{i0} + \sum_{j=1}^{D} w_{ij}\, x_j\right)$, with $A(x) = \dfrac{1}{1 + e^{-x}}$: the sigmoid function
y(x) is a linear combination of non-linear functions of linear combinations of the input data.
In other words: $y(\mathbf{x}) = \sum_{i=1}^{M} w_i\, h_i(\mathbf{x})$, where each basis function $h_i(\mathbf{x})$ is built from the linear combination $w_{i0} + \sum_{j=1}^{D} w_{ij}\, x_j$ of the inputs.
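A minimal sketch of this forward pass in Python (an illustration, not TMVA code), assuming the hidden-layer weights are stored as a matrix W_hidden of shape (M, D+1) with the bias $w_{i0}$ in column 0, and the output weights as a vector w_out of length M:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_output(x, W_hidden, w_out):
    """y(x) = sum_i w_out[i] * A(w_i0 + sum_j w_ij * x_j) for a single event x of length D."""
    hidden = sigmoid(W_hidden[:, 0] + W_hidden[:, 1:] @ x)   # M hidden-node activations
    return w_out @ hidden                                    # linear combination at the output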
Neural Networks: the Multilayer Perceptron (MLP)
But before talking about the weights, let’s try to “interpret” the formula as a Neural Network:
The nodes in the hidden layer represent the "activation functions", whose arguments are linear combinations of the input variables: a non-linear response to the input.
The output is a linear combination of the outputs of the activation functions at the internal nodes.
It is straightforward to extend this to "several" hidden layers.
Each layer takes its input from the preceding nodes only: a feed-forward network (no backward loops).
[Figure: network diagram with an input layer, a hidden layer and an output layer; the D discriminating input variables plus one offset node feed the M hidden nodes via weights $w_{ij}$, and the hidden-node outputs feed the output node via weights $w_{0i}$]
Output: $y(\mathbf{x}) = \sum_{i=1}^{M} w_{0i}\, A\!\left(w_{i0} + \sum_{j=1}^{D} w_{ij}\, x_j\right)$
"Activation" function A: e.g. the sigmoid $A(x) = \dfrac{1}{1 + e^{-x}}$, or tanh, or ...
The nodes correspond to neurons and the links (weights) to synapses: a neural network tries to simulate the reaction of a brain to a certain stimulus (the input data).
Neural Network Training
Idea: using the "training events", adjust the weights such that y(x) → 0 for background events and y(x) → 1 for signal events. How do we adjust the weights? Minimize a loss function:
$L(\mathbf{w}) = \sum_{i}^{\text{events}} \big(y(\mathbf{x}_i) - \hat{y}(C_i)\big)^2$, where $\hat{y}(C) = 1$ for $C = \text{signal}$ and $\hat{y}(C) = 0$ for $C = \text{background}$
i.e. use the usual "sum of squares" or the misclassification error.
y(x) is a very "wiggly" function with many local minima, so one global overall fit is not efficient/reliable. Instead: back-propagation (learn from experience, gradually adjust the response) and online learning (learn event by event, continuously, not just once in a while).
(The loss compares the predicted event type with the true event type.)
Neural Network Training: back-propagation and online learning
Start with random weights and adjust them in each step by steepest descent of the loss function L:
$L(\mathbf{w}) = \sum_i \big(y(\mathbf{x}_i) - \hat{y}(C_i)\big)^2$, weight update $\mathbf{w}^{\,n+1} = \mathbf{w}^{\,n} - \eta\, \nabla_{\mathbf{w}} L(\mathbf{w})$, with learning rate $\eta$.
For online learning the training events must be mixed randomly, otherwise the weights are first steered in a (wrong) direction from which it is hard to get out again!
For the weights connected to the output node, with $y(\mathbf{x}) = \sum_{i=1}^{M} w_{0i}\, A\!\left(w_{i0} + \sum_{j=1}^{D} w_{ij}\, x_j\right)$:
$\dfrac{\partial L}{\partial w_{0i}} = \big(y(\mathbf{x}) - \hat{y}(C)\big)\, A\!\left(w_{i0} + \sum_{j=1}^{D} w_{ij}\, x_j\right)$
For the weights not connected to the output node the formula is a bit more complicated.
Note: all these gradients are easily calculated from the training events.
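A minimal sketch of one online-learning step for the single-hidden-layer network sketched earlier (constant factors absorbed into the learning rate; an illustration following the slide's formulas, not the TMVA code). The weight layout W_hidden of shape (M, D+1) and w_out of length M is the same assumption as before.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def online_update(x, target, W_hidden, w_out, eta=0.01):
    """One steepest-descent step on L = (y(x) - target)^2 for a single training event x."""
    a = W_hidden[:, 0] + W_hidden[:, 1:] @ x        # linear combinations at the hidden nodes
    h = sigmoid(a)                                  # hidden-node outputs A(a)
    delta = (w_out @ h) - target                    # y(x) - y_hat(C)
    grad_out = delta * h                            # dL/dw_0i: weights connected to the output node
    grad_hid = delta * w_out * h * (1.0 - h)        # back-propagated via A'(a) = A(a)(1 - A(a))
    w_out           -= eta * grad_out
    W_hidden[:, 0]  -= eta * grad_hid               # bias weights w_i0
    W_hidden[:, 1:] -= eta * np.outer(grad_hid, x)  # weights w_ij from the inputs
    return W_hidden, w_out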
training is repeated n-times over the whole training data sample. how often ??
Watching the Training Progress: for an MLP one can, for example, plot the network architecture (the weights) after each training epoch.
Overtraining
[Figure: signal (S) and background (B) events in the (x1, x2) plane with two possible decision boundaries: a smooth one and a very "wiggly" one that follows individual training events]
Training runs n times over all training data: how often? It seems intuitive that the smoother boundary will give better results on another, statistically independent data set than the wiggly one. So, e.g., stop the training before you start learning the statistical fluctuations in the data, and verify the performance on an independent "test" sample.
[Figure: classification error versus training cycles, shown for the training sample and for the test sample]
Possible overtraining is a concern for every "tunable parameter" a of a classifier: smoothing parameter, number of nodes, ...
Cross Validation
Classifiers have tuning parameters "a" that choose and control their performance: number of training cycles, number of nodes, number of layers, regularisation parameter (neural net), smoothing parameter h (kernel density estimator), ...
The more flexible (the more parameters) the classifier, the more prone it is to overtraining; and the more training data, the better the training result. So how should the data set be divided into "training", "test" and "validation" samples?
Cross validation: divide the data sample into, say, 5 sub-sets. [Diagram: five copies of the data split into Train/Train/Train/Train/Test blocks, with the Test block in a different position in each copy]
Train 5 classifiers $y_i(\mathbf{x}, a)$, $i = 1, \dots, 5$, where classifier $y_i(\mathbf{x}, a)$ is trained without the i-th sub-sample.
Calculate the test error: $CV(a) = \dfrac{1}{N_{\text{events}}} \sum_{k}^{\text{events}} L\big(y_i(\mathbf{x}_k, a)\big)$, with L the loss function and $y_i$ the classifier that was not trained on event k.
Choose the tuning parameter a for which CV(a) is minimal and train the final classifier using all the data. Too bad it is still NOT implemented in TMVA.
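Since the slides note this is not (yet) in TMVA, here is a generic sketch of the procedure in Python; the classifier is passed in as a train_and_predict callable and all names are placeholders:

import numpy as np

def cross_validation_error(X, y, train_and_predict, a, n_folds=5):
    """CV(a): mean loss over all events, each predicted by the classifier trained without its fold."""
    folds = np.array_split(np.random.permutation(len(X)), n_folds)
    losses = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[test_idx], a)
        losses.append((y_pred - y[test_idx]) ** 2)   # squared-error loss as an example
    return np.mean(np.concatenate(losses))

# scan a over candidate values, pick the one with the smallest CV(a), then retrain on all data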
What is the best Network Architecture?
Theoretically a single hidden layer is enough for any problem, provided one allows for a sufficient number of nodes (K. Weierstrass theorem).
"Relatively little is known concerning advantages and disadvantages of using a single hidden layer with many nodes over many hidden layers with fewer nodes. The mathematics and approximation theory of the MLP model with more than one hidden layer is not very well understood ... nonetheless there seems to be reason to conjecture that the two hidden layer model may be significantly more promising than the single hidden layer model."
(Glen Cowan) A. Pinkus, "Approximation theory of the MLP model in neural networks", Acta Numerica (1999), pp. 143-195.
Typically in high-energy physics the non-linearities are reasonably simple, so one layer with a larger number of nodes is probably enough; still, it is worth trying more layers (with fewer nodes in each layer).
Support Vector Machines
If neural networks are complicated by finding the proper optimal "weights" for best separation power, due to the "wiggly" functional behaviour of the piecewise-defined separating hyperplane;
if kNN (the multidimensional likelihood) suffers from the disadvantage that calculating the MVA output for each test event requires evaluating ALL training events;
if boosted decision trees are in theory always weaker than a perfect neural network;
then try to get the best of all worlds ...
Support vector machine: there are methods to create linear decision boundaries using only measures of distances (inner/scalar products). This leads to a quadratic optimisation problem. The decision boundary in the end is defined only by the training events that are closest to the boundary. And we have seen that variable transformations, i.e. moving into a higher-dimensional space (e.g. using var1*var1 in the Fisher discriminant), can allow non-linear problems to be separated with linear decision boundaries.
Support Vector Machines
Separable data: [Figure: signal and background events in the (x1, x2) plane with the optimal separating hyperplane, the margin and the support vectors indicated]
Find the hyperplane that best separates signal from background: a linear decision boundary (the optimal hyperplane).
Best separation: maximum distance (margin) between the closest events (the support vectors) and the hyperplane.
Non-separable data:
The solution of largest margin depends only on the inner products of the support vectors (distances): a quadratic minimisation problem.
If the data are non-separable, add a misclassification cost parameter $C \cdot \sum_i \xi_i$ (slack variables $\xi_i$) to the minimisation function.
Support Vector Machines
Non-linear cases: transform the variables into a higher-dimensional feature space where again a linear boundary (hyperplane) can separate the data.
[Figure: data that are non-separable in the original (x1, x2) space become separable after the transformation Φ(x1, x2) into the higher-dimensional feature space]
Support Vector Machines
Non-linear cases: the explicit transformation Φ does not need to be specified; only the "scalar product" (inner product) is needed, x·x → Φ(x)·Φ(x).
Certain kernel functions can be interpreted as scalar products between transformed vectors in the higher-dimensional feature space, e.g. Gaussian, polynomial, sigmoid kernels.
Choose a kernel and fit the hyperplane using the linear techniques developed above.
The kernel size parameter typically needs careful tuning! (Overtraining!)
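As a hedged sketch in Python, using scikit-learn's SVC rather than the TMVA SVM (the concepts map directly: C is the misclassification cost, gamma sets the Gaussian kernel size via gamma ~ 1/(2 sigma^2)); X_train, y_train, X_test are assumed placeholders:

from sklearn.svm import SVC

# Gaussian (RBF) kernel SVM: K(x, x') = exp(-gamma * |x - x'|^2)
svm = SVC(kernel="rbf", C=1.0, gamma=0.5)
# svm.fit(X_train, y_train)                 # y = 1 for signal, 0 for background
# y_score = svm.decision_function(X_test)   # signed distance to the separating hyperplane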
Support Vector Machines
How does this "kernel" business work? A kernel function is a scalar product in "some transformed" variable space; it defines "distances" in that variable space.
Standard scalar product x·x': large if x and x' point in the same "direction", zero if they are orthogonal (i.e. point along different axis dimensions).
Gaussian kernel: (essentially) zero if the points are "far apart" in the original data space, large only if they are in each other's "vicinity".
If the kernel size is small compared to the distance between the training data points, each data point is "lifted" into its "own" dimension: full separation of "any" event configuration, with decision boundaries along the coordinate axes. Well, that would of course be overtraining.
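A small numerical illustration of that limit (a sketch, not from the slides): for a Gaussian kernel the kernel matrix of the training points approaches the identity matrix when sigma is much smaller than the typical distance between points, i.e. every event effectively gets its "own" dimension.

import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K[i, j] = exp(-|x_i - x_j|^2 / (2 sigma^2)) for all pairs of points in X (n_events, D)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

X = np.random.rand(5, 2)                      # 5 toy events in 2 dimensions
print(gaussian_kernel_matrix(X, sigma=1.0))   # broad kernel: sizeable off-diagonal entries
print(gaussian_kernel_matrix(X, sigma=1e-3))  # tiny kernel: ~identity matrix, the overtraining regime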
Support Vector Machines
SVM, the kernel size parameter. Example: Gaussian kernels.
Kernel size (σ of the Gaussian) chosen too large: not enough "flexibility" in the underlying transformation.
Kernel size (σ of the Gaussian) chosen properly for the given problem.
(Colour code in the plots: red = large signal probability.)