
  • KNN, Kernels, Margins

    Grzegorz Chrupala and Nicolas Stroppa

    Google / Saarland University

    ML for NLP 2010

  • Outline

    1 K-Nearest neighbors classifier

    2 Kernels

    3 SVM

  • KNN classifier

    K-Nearest neighbors idea

    When classifying a new example, find the k nearest training examples and assign the majority label among them

    Also known as: memory-based learning, instance- or exemplar-based learning, similarity-based methods, case-based reasoning

  • Distance metrics in feature space

    Euclidean distance, or L2 norm, in d-dimensional space:

    D(x, x') = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}

    L1 norm (Manhattan or taxicab distance):

    L_1(x, x') = \sum_{i=1}^{d} |x_i - x'_i|

    L-infinity, or maximum norm:

    L_\infty(x, x') = \max_{i=1}^{d} |x_i - x'_i|

    In general, the Lk norm:

    L_k(x, x') = \left( \sum_{i=1}^{d} |x_i - x'_i|^k \right)^{1/k}

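    A minimal sketch of these distance functions in Python; the vectors and the value of k are made up for illustration:

      def minkowski(x, z, k):
          """L_k norm distance between two equal-length numeric vectors."""
          return sum(abs(xi - zi) ** k for xi, zi in zip(x, z)) ** (1.0 / k)

      def chebyshev(x, z):
          """L-infinity (maximum) norm distance."""
          return max(abs(xi - zi) for xi, zi in zip(x, z))

      x, z = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
      print(minkowski(x, z, 2))  # Euclidean: sqrt(9 + 4 + 0) = 3.605...
      print(minkowski(x, z, 1))  # Manhattan: 3 + 2 + 0 = 5.0
      print(chebyshev(x, z))     # maximum: max(3, 2, 0) = 3.0
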
  • Hamming distance

    Hamming distance used to compare strings of symbolic attributes

    Equivalent to L1 norm for binary strings

    Defines the distance between two instances to be the sum of per-feature distances

    For symbolic features the per-feature distance is 0 for an exact match and 1 for a mismatch:

    \mathrm{Hamming}(x, x') = \sum_{i=1}^{d} \delta(x_i, x'_i)   (1)

    \delta(x_i, x'_i) = \begin{cases} 0 & \text{if } x_i = x'_i \\ 1 & \text{if } x_i \neq x'_i \end{cases}   (2)

  • IB1 algorithm

    For a vector with a mixture of symbolic and numeric values, the above definition of the per-feature distance is used for symbolic features, while for numeric ones we can use the scaled absolute difference

    \delta(x_i, x'_i) = \frac{|x_i - x'_i|}{\max_i - \min_i}   (3)

  • IB1 with feature weighting

    The per-feature distance is multiplied by the weight of the feature for which it is computed:

    D_w(x, x') = \sum_{i=1}^{d} w_i \, \delta(x_i, x'_i)   (4)

    where w_i is the weight of the i-th feature.

    We'll describe two entropy-based methods and a \chi^2-based method to find a good weight vector w.

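    A small sketch of the weighted distance of eq. (4), combining the overlap metric of eq. (2) with the scaled absolute difference of eq. (3); the example vectors, weights, and feature ranges are made up:

      def per_feature_distance(xi, zi, value_range=None):
          # Numeric feature with a known (min, max) range: scaled absolute difference, eq. (3).
          if value_range is not None:
              lo, hi = value_range
              return abs(xi - zi) / (hi - lo) if hi > lo else 0.0
          # Symbolic feature: overlap metric (0 on match, 1 on mismatch), eq. (2).
          return 0.0 if xi == zi else 1.0

      def weighted_distance(x, z, weights, ranges):
          """D_w(x, z) = sum_i w_i * delta(x_i, z_i); ranges[i] is None for symbolic features."""
          return sum(w * per_feature_distance(xi, zi, r)
                     for xi, zi, w, r in zip(x, z, weights, ranges))

      x = ["NN", 3.0, "yes"]
      z = ["VB", 5.0, "yes"]
      print(weighted_distance(x, z, weights=[1.0, 0.5, 2.0],
                              ranges=[None, (0.0, 10.0), None]))  # 1.0 + 0.5*0.2 + 0.0 = 1.1
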
  • Information gain

    A measure of how much knowing the value of a certain feature for an example decreases our uncertainty about its class, i.e. the difference in class entropy with and without information about the feature value:

    w_i = H(Y) - \sum_{v \in V_i} P(v) H(Y|v)   (5)

    where

    w_i is the weight of the i-th feature

    Y is the set of class labels

    V_i is the set of possible values for the i-th feature

    P(v) is the probability of value v

    class entropy H(Y) = -\sum_{y \in Y} P(y) \log_2 P(y)

    H(Y|v) is the conditional class entropy given that the feature value is v

    Numeric values need to be temporarily discretized for this to work

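    A minimal sketch of eq. (5) in Python; the toy feature and labels are invented, and a feature that predicts the class perfectly gets weight H(Y):

      from collections import Counter
      from math import log2

      def entropy(labels):
          n = len(labels)
          return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

      def information_gain(feature_values, labels):
          """w_i = H(Y) - sum over values v of P(v) * H(Y | v)."""
          n = len(labels)
          conditional = 0.0
          for v in set(feature_values):
              subset = [y for fv, y in zip(feature_values, labels) if fv == v]
              conditional += (len(subset) / n) * entropy(subset)
          return entropy(labels) - conditional

      labels  = ["spam", "spam", "ham", "ham"]
      feature = ["caps", "caps", "lower", "lower"]
      print(information_gain(feature, labels))  # 1.0 bit: the feature removes all class uncertainty
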
  • Gain ratio

    IG assigns excessive weight to features with a large number of values. To remedy this bias, information gain can be normalized by the entropy of the feature values, which gives the gain ratio:

    w_i = \frac{H(Y) - \sum_{v \in V_i} P(v) H(Y|v)}{H(V_i)}   (6)

    For a feature with a unique value for each instance in the training set, the entropy of the feature values in the denominator will be maximally high, and will thus give it a low weight.

  • χ²

    The \chi^2 statistic for a problem with k classes and m values for feature F:

    \chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{m} \frac{(E_{ij} - O_{ij})^2}{E_{ij}}   (7)

    where

    O_{ij} is the observed number of instances with the i-th class label and the j-th value of feature F

    E_{ij} is the expected number of such instances in case the null hypothesis is true: E_{ij} = \frac{n_{\cdot j} \, n_{i \cdot}}{n}

    n_{ij} is the frequency count of instances with the i-th class label and the j-th value of feature F

    n_{\cdot j} = \sum_{i=1}^{k} n_{ij}

    n_{i \cdot} = \sum_{j=1}^{m} n_{ij}

    n = \sum_{i=1}^{k} \sum_{j=1}^{m} n_{ij}

  • χ² example

    Consider a spam detection task: your features are words present/absent in email messages

    Compute \chi^2 for each word to use as weightings for a KNN classifier

    The statistic can be computed from a contingency table. E.g. these are (fake) counts of "rock-hard" in 2000 messages:

              rock-hard   ¬rock-hard
    ham            4          996
    spam         100          900

    We need to sum (E_{ij} - O_{ij})^2 / E_{ij} for the four cells in the table:

    \frac{(52-4)^2}{52} + \frac{(948-996)^2}{948} + \frac{(52-100)^2}{52} + \frac{(948-900)^2}{948} = 93.4761

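    A short sketch that reproduces the example above from the raw contingency table; only the table itself comes from the slide:

      def chi_squared(table):
          """Chi-squared statistic for a k x m contingency table of observed counts."""
          n = sum(sum(row) for row in table)
          row_totals = [sum(row) for row in table]
          col_totals = [sum(col) for col in zip(*table)]
          stat = 0.0
          for i, row in enumerate(table):
              for j, observed in enumerate(row):
                  expected = row_totals[i] * col_totals[j] / n
                  stat += (expected - observed) ** 2 / expected
          return stat

      # Rows: ham, spam; columns: rock-hard present, rock-hard absent.
      print(chi_squared([[4, 996], [100, 900]]))  # 93.476...
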
  • Distance-weighted class voting

    So far all the instances in the neighborhood are weighted equally when computing the majority class

    We may want to treat the votes from very close neighbors as more important than votes from more distant ones

    A variety of distance weighting schemes have been proposed to implement this idea; one of them is sketched below

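    A sketch of one such scheme, inverse-distance weighting, where each of the k neighbors votes with weight 1 / (distance + ε); both the weighting function and the toy neighbors are illustrative choices, not the only option:

      from collections import defaultdict

      def weighted_vote(neighbors, eps=1e-9):
          """neighbors: (distance, label) pairs for the k nearest training instances."""
          votes = defaultdict(float)
          for dist, label in neighbors:
              votes[label] += 1.0 / (dist + eps)   # closer neighbors cast larger votes
          return max(votes, key=votes.get)

      # One very close "spam" neighbor outweighs two more distant "ham" neighbors.
      print(weighted_vote([(0.1, "spam"), (0.5, "ham"), (0.6, "ham")]))  # spam
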
  • KNN summary

    Non-parametric: makes no assumptions about the probability distribution the examples come from

    Does not assume the data is linearly separable

    Derives the decision rule directly from the training data

    Lazy learning: during learning little work is done by the algorithm; the training instances are simply stored in memory in some efficient manner. During prediction the test instance is compared to the training instances, the neighborhood is calculated, and the majority label is assigned

    No information is discarded: exceptional and low-frequency training instances are available for prediction

  • Outline

    1 K-Nearest neighbors classifier

    2 Kernels

    3 SVM

  • Perceptron dual formulation

    The weight vector ends up being a linear combination of the training examples:

    w = \sum_{i=1}^{n} \alpha_i y^{(i)} x^{(i)}

    where \alpha_i = 1 if the i-th example was misclassified, and \alpha_i = 0 otherwise.

    The discriminant function then becomes:

    g(x) = \left( \sum_{i=1}^{n} \alpha_i y^{(i)} x^{(i)} \right) \cdot x + b   (8)

         = \sum_{i=1}^{n} \alpha_i y^{(i)} \left( x^{(i)} \cdot x \right) + b   (9)

  • Dual Perceptron training

    DualPerceptron(x^(1:N), y^(1:N), I):

    1: α ← 0
    2: b ← 0
    3: for j = 1...I do
    4:   for k = 1...N do
    5:     if y^(k) ( Σ_{i=1}^{N} α_i y^(i) x^(i) · x^(k) + b ) ≤ 0 then
    6:       α_k ← α_k + 1
    7:       b ← b + y^(k)
    8: return (α, b)

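    A runnable Python sketch of the dual perceptron above; the tiny toy dataset is made up, and the dot product could later be swapped for a kernel:

      def dual_perceptron(X, y, epochs):
          """Dual perceptron: X is a list of feature vectors, y a list of +1/-1 labels."""
          N = len(X)
          alpha = [0] * N
          b = 0.0
          dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
          for _ in range(epochs):
              for k in range(N):
                  score = sum(alpha[i] * y[i] * dot(X[i], X[k]) for i in range(N)) + b
                  if y[k] * score <= 0:      # example k is misclassified: update
                      alpha[k] += 1
                      b += y[k]
          return alpha, b

      X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
      y = [1, 1, -1, -1]
      print(dual_perceptron(X, y, epochs=10))  # per-example mistake counts alpha, and b
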
  • Kernels

    Note that in the dual formulation there is no explicit weight vector: the training algorithm and the classification are expressed in terms of dot products between training examples and the test example

    We can generalize such dual algorithms to use kernel functions

    A kernel function can be thought of as a dot product in some transformed feature space:

    K(x, z) = \phi(x) \cdot \phi(z)

    where the map \phi projects the vectors in the original feature space onto the transformed feature space

    It can also be thought of as a similarity function in the input object space

  • Kernel vs feature map

    Feature map \phi corresponding to K for d = 2 dimensions:

    \phi\!\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) = \begin{pmatrix} x_1 x_1 \\ x_1 x_2 \\ x_2 x_1 \\ x_2 x_2 \end{pmatrix}   (14)

    Computing the feature map \phi explicitly needs O(d^2) time

    Computing K is linear, O(d), in the number of dimensions

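    A quick check in Python, assuming K is the quadratic kernel K(x, z) = (x · z)^2, that the kernel value matches the dot product of the explicit feature map of eq. (14); the vectors are arbitrary:

      def quadratic_kernel(x, z):
          """K(x, z) = (x . z)^2 for two-dimensional inputs."""
          return (x[0] * z[0] + x[1] * z[1]) ** 2

      def feature_map(x):
          """Explicit map of eq. (14): all ordered products x_i * x_j."""
          return [x[0] * x[0], x[0] * x[1], x[1] * x[0], x[1] * x[1]]

      def dot(u, v):
          return sum(ui * vi for ui, vi in zip(u, v))

      x, z = [1.0, 2.0], [3.0, -1.0]
      print(quadratic_kernel(x, z))               # (3 - 2)^2 = 1.0, computed in O(d)
      print(dot(feature_map(x), feature_map(z)))  # same value, via the O(d^2) map
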
  • Why does it matter

    If you think of features as binary indicators, then the quadratic kernel above creates feature conjunctions

    E.g. in NER, if x_1 indicates that the word is capitalized and x_2 indicates that the previous token is a sentence boundary, with the quadratic kernel we efficiently compute the feature that both conditions are the case

    Geometric intuition: mapping points to a higher-dimensional space makes them easier to separate with a linear boundary

  • Separability in 2D and 3D

    Figure: Two-dimensional classification example, non-separable in two dimensions, becomes separable when mapped to three dimensions by (x_1, x_2) \mapsto (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)

  • Outline

    1 K-Nearest neighbors classifier

    2 Kernels

    3 SVM

  • Support Vector Machines

    Margin: a decision boundary which is as far away from the training instances as possible improves the chance that, if the position of the data points is slightly perturbed, the decision boundary will still be correct

    Results from statistical learning theory confirm these intuitions: maintaining large margins leads to small generalization error (Vapnik 1995)

    A perceptron algorithm finds any hyperplane which separates the classes; an SVM finds the one that additionally has the maximum margin

  • Quadratic optimization formulation

    The functional margin can be made larger just by rescaling the weights by a constant

    Hence we can fix the functional margin to be 1 and minimize the norm of the weight vector

    This is equivalent to maximizing the geometric margin

    For linearly separable training instances ((x_1, y_1), \ldots, (x_n, y_n)), find the hyperplane (w, b) that solves the optimization problem:

    \begin{aligned} \underset{w,b}{\text{minimize}} \quad & \frac{1}{2}\|w\|^2 \\ \text{subject to} \quad & y_i (w \cdot x_i + b) \geq 1 \quad \forall i \in 1..n \end{aligned}   (15)

    This hyperplane separates the examples with geometric margin 2 / \|w\|

  • Support vectors

    An SVM finds a separating hyperplane with the largest margin to the nearest instances

    This has the effect that the decision boundary is fully determined by a small subset of the training examples (the nearest ones on both sides)

    Those instances are the support vectors

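    A small sketch using scikit-learn (assuming it is available; the toy points are invented) that fits a linear SVM and inspects the support vectors; a very large C approximates the hard-margin case:

      from sklearn.svm import SVC

      X = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
           [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]]
      y = [1, 1, 1, -1, -1, -1]

      clf = SVC(kernel="linear", C=1e6)  # large C: (almost) hard margin
      clf.fit(X, y)
      print(clf.support_vectors_)        # only the instances nearest the boundary
      print(clf.coef_, clf.intercept_)   # the hyperplane parameters w and b
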
  • Separating hyperplane and support vectors

    [Figure: scatter plot of the two classes (both axes ranging from -4 to 4) with the separating hyperplane and its support vectors]

  • Soft margin

    An SVM with a soft margin works by relaxing the requirement that all data points lie outside the margin

    For each offending instance there is a slack variable \xi_i which measures how much it would have to move to obey the margin constraint:

    \begin{aligned} \underset{w,b}{\text{minimize}} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{subject to} \quad & y_i (w \cdot x_i + b) \geq 1 - \xi_i \quad \forall i \in 1..n, \quad \xi_i \geq 0 \end{aligned}   (16)

    where \xi_i = \max(0,\, 1 - y_i (w \cdot x_i + b))

    The hyper-parameter C trades off minimizing the norm of the weight vector against classifying as many examples as possible correctly.

    As the value of C tends towards infinity, the soft-margin SVM approaches the hard-margin version.

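    A tiny sketch that evaluates the soft-margin objective of eq. (16) for a given (w, b); the weight vector, bias, data points, and C are all made-up values for illustration:

      def soft_margin_objective(w, b, X, y, C):
          """(1/2)||w||^2 + C * sum_i xi_i, with xi_i = max(0, 1 - y_i (w . x_i + b))."""
          dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
          slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
          return 0.5 * dot(w, w) + C * sum(slacks), slacks

      X = [[2.0, 1.0], [0.2, 0.1], [-1.0, -2.0]]
      y = [1, 1, -1]
      print(soft_margin_objective([1.0, 1.0], 0.0, X, y, C=1.0))
      # (1.7, [0.0, 0.7, 0.0]): only the point inside the margin gets a nonzero slack
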
  • Dual form

    The dual formulation is in terms of support vectors, where SV is the set of their indices:

    f(x, \alpha, b) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha_i \, (x_i \cdot x) + b \right)   (17)

    The weights in this decision function are the \alpha_i.

    \begin{aligned} \text{maximize} \quad & W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \, (x_i \cdot x_j) \\ \text{subject to} \quad & \sum_{i=1}^{n} y_i \alpha_i = 0, \quad \alpha_i \geq 0 \quad \forall i \in 1..n \end{aligned}   (18)

    The weights \alpha together with the support vectors determine (w, b):

    w = \sum_{i \in SV} \alpha_i y_i x_i   (19)

    b = y_k - w \cdot x_k \quad \text{for any } k \text{ such that } \alpha_k \neq 0   (20)

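    A sketch of the kernelized decision function of eq. (17); the support set, its alpha values, the bias, and the choice of an RBF kernel are all illustrative assumptions, since any valid kernel can replace the dot product:

      import math

      def rbf_kernel(gamma):
          """Returns K(x, z) = exp(-gamma * ||x - z||^2); one possible kernel choice."""
          return lambda x, z: math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

      def decision(x, support, kernel, b):
          """f(x) = sign( sum over support vectors of y_i * alpha_i * K(x_i, x) + b )."""
          s = sum(alpha * yi * kernel(xi, x) for alpha, yi, xi in support) + b
          return 1 if s >= 0 else -1

      # (alpha_i, y_i, x_i) triples with made-up values.
      support = [(0.8, 1, [1.0, 1.0]), (0.8, -1, [-1.0, -1.0])]
      print(decision([0.9, 1.2], support, rbf_kernel(gamma=0.5), b=0.0))    # 1
      print(decision([-1.1, -0.7], support, rbf_kernel(gamma=0.5), b=0.0))  # -1
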
  • Summary

    With the kernel trick we can use linear models with non-linearly separable data, and use polynomial kernels to create implicit feature conjunctions

    Large-margin methods help us choose a linear separator with good generalization properties

  • Efficient averaged perceptron algorithm

    Perceptron(x^(1:N), y^(1:N), I):

    1:  w ← 0 ; wa ← 0
    2:  b ← 0 ; ba ← 0
    3:  c ← 1
    4:  for i = 1...I do
    5:    for n = 1...N do
    6:      if y^(n) (w · x^(n) + b) ≤ 0 then
    7:        w ← w + y^(n) x^(n) ; b ← b + y^(n)
    8:        wa ← wa + c y^(n) x^(n) ; ba ← ba + c y^(n)
    9:      c ← c + 1
    10: return (w − wa/c, b − ba/c)

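    A runnable Python sketch of the averaged perceptron above; the toy data is made up and the labels are +1/-1:

      def averaged_perceptron(X, y, epochs):
          d = len(X[0])
          w, wa = [0.0] * d, [0.0] * d      # weights and their c-weighted accumulator
          b, ba = 0.0, 0.0
          c = 1                             # example counter, as in the pseudocode
          for _ in range(epochs):
              for xn, yn in zip(X, y):
                  if yn * (sum(wi * xi for wi, xi in zip(w, xn)) + b) <= 0:
                      w = [wi + yn * xi for wi, xi in zip(w, xn)]
                      b += yn
                      wa = [wai + c * yn * xi for wai, xi in zip(wa, xn)]
                      ba += c * yn
                  c += 1
          return [wi - wai / c for wi, wai in zip(w, wa)], b - ba / c

      X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, -1.5]]
      y = [1, 1, -1, -1]
      print(averaged_perceptron(X, y, epochs=5))  # averaged (w, b)
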
  • Problem: Average perceptron

    Weight averaging

    Show that the above algorithm performs weight averaging. Hints:

    In the standard perceptron algorithm, the final weight vector (and bias) is the sum of the updates at each step.

    In the average perceptron, the final weight vector should be the mean of the partial sums of the updates at each step.

  • Solution

    Let's formalize:

    Basic perceptron: the final weights are the sum of the updates at each step:

    w = \sum_{i=1}^{n} f(x^{(i)})   (21)

    Naive weight averaging: the final weights are the mean of the partial sums:

    w = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{i} f(x^{(j)})   (22)

    Efficient weight averaging:

    w = \sum_{i=1}^{n} f(x^{(i)}) - \frac{1}{n} \sum_{i=1}^{n} i \, f(x^{(i)})   (23)

    Show that equations 22 and 23 are equivalent. Note that we can rewrite the sum of partial sums by multiplying the update at each step by the factor indicating in how many of the partial sums it appears:

    w = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{i} f(x^{(j)})   (24)

      = \frac{1}{n} \sum_{i=1}^{n} (n - i) \, f(x^{(i)})   (25)

      = \frac{1}{n} \sum_{i=1}^{n} \left( n f(x^{(i)}) - i \, f(x^{(i)}) \right)   (26)

      = \frac{1}{n} \left( n \sum_{i=1}^{n} f(x^{(i)}) - \sum_{i=1}^{n} i \, f(x^{(i)}) \right)   (27)

      = \sum_{i=1}^{n} f(x^{(i)}) - \frac{1}{n} \sum_{i=1}^{n} i \, f(x^{(i)})   (28)

  • Viterbi algorithm for second-order HMM

    In the lecture we saw the formulation of the Viterbi algorithm for a first-order Hidden Markov Model. In a second-order HMM, transition probabilities depend on the two previous states instead of on just the single previous state; that is, we use the following independence assumption:

    P(y_i \mid x_1, \ldots, x_{i-1}, y_1, \ldots, y_{i-1}) = P(y_i \mid y_{i-2}, y_{i-1})

    For this exercise you should formulate the Viterbi algorithm for decoding a second-order HMM.
