
    Machine learning: lecture 6

    Tommi S. Jaakkola, MIT CSAIL

    [email protected]

    Topics
    • Regularization
      – prior, penalties, MAP estimation
      – the effect of regularization, generalization
      – regularization and discrimination
    • Discriminative classification
      – criterion, margin
      – support vector machine

    Tommi Jaakkola, MIT CSAIL 2

    MAP estimation, regularization
    • Consider again a simple 2-d logistic regression model

        P(y = 1 | x, w) = g( w_0 + w_1 x_1 + w_2 x_2 )

    • Before seeing any data we may prefer some values of the
      parameters over others (e.g., small over large values).

    Tommi Jaakkola, MIT CSAIL 3

    MAP estimation, regularization
    • Consider again a simple 2-d logistic regression model

        P(y = 1 | x, w) = g( w_0 + w_1 x_1 + w_2 x_2 )

    • Before seeing any data we may prefer some values of the
      parameters over others (e.g., small over large values).

    • We can express this preference through a prior distribution
      over the parameters (here omitting w_0)

        p(w_1, w_2; σ^2) = 1/(2πσ^2) exp{ −(1/(2σ^2)) (w_1^2 + w_2^2) }

      where σ^2 determines how tightly around zero we want to
      constrain the values of w_1 and w_2.

    Tommi Jaakkola, MIT CSAIL 4

    MAP estimation, regularization
    • Consider again a simple 2-d logistic regression model

        P(y = 1 | x, w) = g( w_0 + w_1 x_1 + w_2 x_2 )

    • Before seeing any data we may prefer some values of the
      parameters over others (e.g., small over large values).

    • We can express this preference through a prior distribution
      over the parameters (here omitting w_0)

        p(w_1, w_2; σ^2) = 1/(2πσ^2) exp{ −(1/(2σ^2)) (w_1^2 + w_2^2) }

    • To combine the prior with the available data we find the
      MAP (maximum a posteriori) parameter estimates:

        ŵ_MAP = argmax_w [ ∏_{i=1}^n P(y_i | x_i, w) ] p(w_1, w_2; σ^2)

    Tommi Jaakkola, MIT CSAIL 5

    MAP estimation, regularization
    • The estimation criterion is now given by a penalized
      log-likelihood (cf. log-posterior):

        l̃(D; w) = ∑_{i=1}^n log P(y_i | x_i, w) + log p(w_1, w_2; σ^2)
                 = ∑_{i=1}^n log P(y_i | x_i, w) − (1/(2σ^2)) (w_1^2 + w_2^2) + const.

    • We’d like to understand how the solution changes as a
      function of the prior variance σ^2 (or more generally with
      different priors)

    Tommi Jaakkola, MIT CSAIL 6
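
    A minimal Python sketch of the penalized log-likelihood above and the resulting
    MAP estimate (the toy data, helper names, and the choice of scipy's BFGS optimizer
    are assumptions made for illustration, not part of the lecture):

```python
import numpy as np
from scipy.optimize import minimize

def penalized_log_lik(w, X, y, sigma2):
    """Penalized log-likelihood l~(D; w) for 2-d logistic regression.

    w = [w0, w1, w2], labels y are 0/1, and the Gaussian prior
    penalizes only w1 and w2 (w0 is left unregularized)."""
    z = w[0] + X @ w[1:]                                # w0 + w1*x1 + w2*x2
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))      # sum_i log P(y_i | x_i, w)
    log_prior = -np.sum(w[1:] ** 2) / (2.0 * sigma2)    # -(1/(2 sigma^2))(w1^2 + w2^2)
    return log_lik + log_prior

def map_estimate(X, y, sigma2):
    """MAP estimate: maximize the penalized log-likelihood (minimize its negative)."""
    neg = lambda w: -penalized_log_lik(w, X, y, sigma2)
    return minimize(neg, x0=np.zeros(3), method="BFGS").x

# Hypothetical toy data: two overlapping Gaussian clusters with 0/1 labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 2.0, size=(30, 2)),
               rng.normal(+1.0, 2.0, size=(30, 2))])
y = np.concatenate([np.zeros(30), np.ones(30)])

w_map = map_estimate(X, y, sigma2=1.0)
print("w_MAP =", w_map)
```

    Decreasing sigma2 pulls w_1 and w_2 toward zero, which is exactly the effect
    examined on the following slides.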

    The effect of regularization
    • Let’s first understand graphically how the addition of the
      prior changes the solution

        l̃(D; w) = ∑_{i=1}^n log P(y_i | x_i, w)    [log-likelihood]
                   − (1/(2σ^2)) (w_1^2 + w_2^2)     [log-prior]
                   + const.

      [Figure panels: log-prior, log-likelihood]

    Tommi Jaakkola, MIT CSAIL 7

    The effect of regularization
    • Let’s first understand graphically how the addition of the
      prior changes the solution

        l̃(D; w) = ∑_{i=1}^n log P(y_i | x_i, w)    [log-likelihood]
                   − (1/(2σ^2)) (w_1^2 + w_2^2)     [log-prior]
                   + const.

      [Figure panels: log-prior, log-likelihood, log-posterior]

    Tommi Jaakkola, MIT CSAIL 8

    The effect of regularization cont’d

        l̃(D; w) = ∑_{i=1}^n log P(y_i | x_i, w) − (1/(2σ^2)) (w_1^2 + w_2^2) + const.

      [Figure: σ^2 = ∞]

    Tommi Jaakkola, MIT CSAIL 9

    The effect of regularization cont’d

        l̃(D; w) = ∑_{i=1}^n log P(y_i | x_i, w) − (1/(2σ^2)) (w_1^2 + w_2^2) + const.

      [Figure panels: σ^2 = ∞,  σ^2 = 10/n]

    Tommi Jaakkola, MIT CSAIL 10

    The effect of regularization cont’d

        l̃(D; w) = ∑_{i=1}^n log P(y_i | x_i, w) − (1/(2σ^2)) (w_1^2 + w_2^2) + const.

      [Figure panels: σ^2 = ∞,  σ^2 = 10/n,  σ^2 = 1/n]

    Tommi Jaakkola, MIT CSAIL 11

    The effect of regularization cont’d

        l̃(D; w) = ∑_{i=1}^n log P(y_i | x_i, w) − (1/(2σ^2)) w_1^2 + const.

      [Figure: σ^2 = ∞]

    Tommi Jaakkola, MIT CSAIL 12

    The effect of regularization cont’d

        l̃(D; w) = ∑_{i=1}^n log P(y_i | x_i, w) − (1/(2σ^2)) w_1^2 + const.

      [Figure panels: σ^2 = ∞,  σ^2 = 10/n]

    Tommi Jaakkola, MIT CSAIL 13

    The effect of regularization cont’d

        l̃(D; w) = ∑_{i=1}^n log P(y_i | x_i, w) − (1/(2σ^2)) w_1^2 + const.

      [Figure panels: σ^2 = ∞,  σ^2 = 10/n,  σ^2 = 1/n]

    Tommi Jaakkola, MIT CSAIL 14

    The effect of regularization cont’d

        l̃(D; w) = ∑_{i=1}^n log P(y_i | x_i, w) − (1/(2σ^2)) w_1^2 + const.

      [Figure panels: σ^2 = 10/n,  σ^2 = 1/n,  σ^2 = 0.1/n]

    Tommi Jaakkola, MIT CSAIL 15
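
    Continuing the hypothetical sketch from earlier (same X, y, and map_estimate), one
    way to see the shrinkage numerically is to refit the model for the prior variances
    used on these slides and watch ‖(w_1, w_2)‖ decrease; a very large σ^2 stands in
    for σ^2 = ∞:

```python
# Refit for several prior variances; smaller sigma^2 means stronger shrinkage.
n = len(y)
for sigma2 in [1e6, 10.0 / n, 1.0 / n, 0.1 / n]:    # 1e6 approximates sigma^2 = infinity
    w = map_estimate(X, y, sigma2)
    print(f"sigma^2 = {sigma2:10.4g}   ||(w1, w2)|| = {np.linalg.norm(w[1:]):.3f}")
```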

    The effect of regularization: train/test
    • (Scaled) penalized log-likelihood criterion

        l̃(D; w)/n = (1/n) ∑_{i=1}^n log P(y_i | x_i, w) − (1/(2nσ^2)) (w_1^2 + w_2^2) + const.

    Tommi Jaakkola, MIT CSAIL 16

    The effect of regularization: train/test
    • (Scaled) penalized log-likelihood criterion

        l̃(D; w)/n = (1/n) ∑_{i=1}^n log P(y_i | x_i, w) − (c/2) (w_1^2 + w_2^2) + const.

      where c = 1/(nσ^2); increasing c results in stronger regularization.

    Tommi Jaakkola, MIT CSAIL 17

    The effect of regularization: train/test
    • (Scaled) penalized log-likelihood criterion

        l̃(D; w)/n = (1/n) ∑_{i=1}^n log P(y_i | x_i, w) − (c/2) (w_1^2 + w_2^2) + const.

      where c = 1/(nσ^2); increasing c results in stronger regularization.

    • Resulting average log-likelihoods

        training log-lik. = (1/n) ∑_{i=1}^n log P(y_i | x_i, ŵ_MAP)

        test log-lik. = E_{(x,y)∼P} { log P(y | x, ŵ_MAP) }

    Tommi Jaakkola, MIT CSAIL 18
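
    The train/test comparison can be approximated with the same hypothetical sketch:
    sweep c = 1/(nσ^2), refit, and evaluate the average log-likelihood on the training
    set and on a held-out sample standing in for the expectation over P:

```python
def avg_log_lik(w, X, y):
    """Average log-likelihood (1/n) sum_i log P(y_i | x_i, w) for 0/1 labels."""
    z = w[0] + X @ w[1:]
    return np.mean(y * z - np.logaddexp(0.0, z))

# Held-out sample drawn from the same (hypothetical) distribution as the training data.
X_test = np.vstack([rng.normal(-1.0, 2.0, size=(500, 2)),
                    rng.normal(+1.0, 2.0, size=(500, 2))])
y_test = np.concatenate([np.zeros(500), np.ones(500)])

n = len(y)
for c in [0.01, 0.1, 0.3, 1.0]:
    w = map_estimate(X, y, sigma2=1.0 / (n * c))    # c = 1/(n sigma^2)
    print(f"c = {c:4.2f}   train = {avg_log_lik(w, X, y):+.3f}"
          f"   test = {avg_log_lik(w, X_test, y_test):+.3f}")
```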

    The effect of regularization: train/test

        l̃(D; w)/n = (1/n) ∑_{i=1}^n log P(y_i | x_i, w) − (c/2) (w_1^2 + w_2^2) + const.

        training log-lik. = (1/n) ∑_{i=1}^n log P(y_i | x_i, ŵ_MAP)

        test log-lik. = E_{(x,y)∼P} { log P(y | x, ŵ_MAP) }

      [Figure: mean log-likelihood on training and test data as a function of c = 1/(nσ^2)]

    Tommi Jaakkola, MIT CSAIL 19

    Likelihood, regularization, and discrimination
    • Regularization by penalizing ‖w_1‖^2 = w_1^2 + w_2^2 in

        l̃(D; w)/n = (1/n) ∑_{i=1}^n log P(y_i | x_i, w) − (c/2) (w_1^2 + w_2^2) + const.

      does not directly limit the logistic regression model as a
      classifier. For example:

      [Figures: mean log-likelihood and classification accuracy, on training and test
       data, as functions of c = 1/(nσ^2)]

    Tommi Jaakkola, MIT CSAIL 20

    Likelihood, regularization, and discrimination
    • Regularization by penalizing ‖w_1‖^2 = w_1^2 + w_2^2 in

        l̃(D; w)/n = (1/n) ∑_{i=1}^n log P(y_i | x_i, w) − (c/2) (w_1^2 + w_2^2) + const.

      does not directly limit the logistic regression model as a
      classifier.

    • Classification decisions only depend on the sign of the
      discriminant function

        f(x; w) = w_0 + x^T w_1 = (x − x_0)^T w_1

      where w_1 = [w_1, w_2]^T and x_0 is chosen such that
      w_0 = −x_0^T w_1. Limiting ‖w_1‖^2 = w_1^2 + w_2^2 does not reduce
      the possible signs.

    Tommi Jaakkola, MIT CSAIL 21
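
    A quick (hypothetical) numerical illustration of this point: rescaling (w_0, w_1)
    by any positive factor changes ‖w_1‖ but leaves the sign of the discriminant
    function, and hence every classification decision, unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))         # arbitrary inputs
w0, w1 = 0.3, np.array([2.0, -1.0])  # arbitrary parameters

f = w0 + X @ w1                      # f(x; w) = w0 + x^T w1
for scale in [0.01, 1.0, 100.0]:
    f_scaled = scale * w0 + X @ (scale * w1)
    assert np.array_equal(np.sign(f), np.sign(f_scaled))
print("classification decisions are unchanged by rescaling w")
```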

    Topics
    • Regularization
      – prior, penalties, MAP estimation
      – the effect of regularization, generalization
      – regularization and discrimination
    • Discriminative classification
      – criterion, margin
      – support vector machine

    Tommi Jaakkola, MIT CSAIL 22

    Discriminative classification
    • Consider again a binary classification task with y = ±1 labels
      (not 0/1 as before) and linear discriminant functions

        f(x; w) = w_0 + x^T w_1

      parameterized by w_0 and w_1 = [w_1, ..., w_d]^T.

    • The predicted label is simply given by the sign of the
      discriminant function: ŷ = sign(f(x; w))

    • We are only interested in getting the labels correct; no
      probabilities are associated with the predictions

    Tommi Jaakkola, MIT CSAIL 23

    Discriminative classification
    • When the training set {(x_1, y_1), ..., (x_n, y_n)} is linearly
      separable we can find parameters w such that

        y_i [w_0 + x_i^T w_1] > 0,   i = 1, ..., n

      i.e., the sign of the discriminant function agrees with the
      label

      [Figure: linearly separable 2-d training data with a separating boundary]

      (there are many possible solutions)

    Tommi Jaakkola, MIT CSAIL 24
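
    The separability condition is easy to check for any candidate parameters; a small
    hypothetical sketch (the data and both candidate solutions below are made up for
    illustration, and both satisfy the condition, underscoring that many solutions exist):

```python
import numpy as np

def separates(w0, w1, X, y):
    """True if y_i * (w0 + x_i^T w1) > 0 for every training example."""
    return bool(np.all(y * (w0 + X @ w1) > 0))

# Toy separable data with labels in {-1, +1}.
X = np.array([[2.0, 3.0], [3.0, 4.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

print(separates(0.0, np.array([1.0, 1.0]), X, y))   # True
print(separates(0.0, np.array([3.0, 2.0]), X, y))   # also True: many solutions
```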

    Discriminative classification
    • Perhaps we can find a better discriminant boundary by
      requiring that the training examples are separated with a
      fixed “margin”:

        y_i [w_0 + x_i^T w_1] − 1 ≥ 0,   i = 1, ..., n

      [Figure: 2-d training data with a boundary satisfying the margin constraints]

    Tommi Jaakkola, MIT CSAIL 25

    Discriminative classification
    • Perhaps we can find a better discriminant boundary by
      requiring that the training examples are separated with a
      fixed “margin”:

        y_i [w_0 + x_i^T w_1] − 1 ≥ 0,   i = 1, ..., n

      [Two figures: 2-d training data with boundaries satisfying the margin constraints]

      The problem is the same as before. The notion of “margin”
      used here depends on the scale of ‖w_1‖

    Tommi Jaakkola, MIT CSAIL 26

    Margin and regularization
    • We get a more meaningful (geometric) notion of margin by
      regularizing the problem:

        minimize   (1/2) ‖w_1‖^2 = (1/2) ∑_{i=1}^d w_i^2

        subject to   y_i [w_0 + x_i^T w_1] − 1 ≥ 0,   i = 1, ..., n

    • What can we say about the solution?

    Tommi Jaakkola, MIT CSAIL 27
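
    The optimization problem on this slide is a small quadratic program. As an
    illustration (assuming the cvxpy package; the toy data is hypothetical and not
    from the lecture), it can be solved directly:

```python
import cvxpy as cp
import numpy as np

# Toy separable data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

w1 = cp.Variable(2)
w0 = cp.Variable()

# minimize (1/2)||w1||^2  subject to  y_i (w0 + x_i^T w1) - 1 >= 0
constraints = [cp.multiply(y, X @ w1 + w0) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w1)), constraints)
problem.solve()

print("w1 =", w1.value, " w0 =", w0.value)
print("geometric margin = 1/||w1|| =", 1.0 / np.linalg.norm(w1.value))
```

    Only the constraints that end up tight at the optimum determine the solution;
    those examples are the support vectors discussed a few slides later.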

    Margin and regularization
    • One dimensional example: f(x; w) = w_0 + w_1 x

      Relevant constraints:

        (+1)[w_0 + w_1 x_+] − 1 ≥ 0
        (−1)[w_0 + w_1 x_−] − 1 ≥ 0

      Maximum separation would be at the mid point with a margin
      |x_+ − x_−|/2.

      [Figure: 1-d labeled points with the line f(x; w) = w_0 + w_1 x, the closest
       examples x_+ and x_−, and the gap |x_+ − x_−|/2]

    Tommi Jaakkola, MIT CSAIL 28

    Margin and regularization
    • One dimensional example: f(x; w) = w_0 + w_1 x

      Relevant constraints:

        (+1)[w_0 + w_1 x_+] − 1 ≥ 0
        (−1)[w_0 + w_1 x_−] − 1 ≥ 0

      At the mid point the value of the margin is |x_+ − x_−|/2.

      [Figure: the maximum margin solution f(x; w*) = w*_0 + w*_1 x on the same 1-d data]

    • We can find the maximum margin solution by minimizing
      the slope |w_1| while satisfying the classification constraints

    • The resulting margin is directly tied to the minimizing slope
      (slope = 1/margin): |w*_1| = 2/|x_+ − x_−|

    Tommi Jaakkola, MIT CSAIL 29
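
    The last bullet follows from a one-line calculation: at the optimum both
    constraints are tight, and subtracting them gives the minimizing slope.

```latex
% Both constraints hold with equality at the optimum:
\begin{align*}
  w_0^* + w_1^*\,x_+ &= +1 \\
  w_0^* + w_1^*\,x_- &= -1 \\[4pt]
  \text{subtracting:}\quad
  w_1^*\,(x_+ - x_-) &= 2
  \;\;\Longrightarrow\;\;
  |w_1^*| = \frac{2}{|x_+ - x_-|},
  \qquad
  \text{margin} = \frac{1}{|w_1^*|} = \frac{|x_+ - x_-|}{2}.
\end{align*}
```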

    Support vector machine
    • We minimize the regularization penalty

        (1/2) ‖w_1‖^2 = (1/2) ∑_{i=1}^d w_i^2

      subject to the classification constraints

        y_i [w_0 + x_i^T w_1] − 1 ≥ 0   for i = 1, ..., n.

      [Figure: 2-d data with the maximum margin linear boundary and its support vectors]

    • Analogously to the one dimensional case, the “slope” is
      related to the geometric margin: ‖w*_1‖ = 1/margin.

    • The solution is again defined only on the basis of a subset of
      examples or “support vectors”

    Tommi Jaakkola, MIT CSAIL 30
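
    For completeness, a hypothetical sketch of the same problem using scikit-learn's
    linear SVC with a very large C, which approximates the hard-margin solution above;
    the fitted object exposes the support vectors directly (the toy data is made up
    for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard-margin SVM
w1 = clf.coef_.ravel()                        # [w1, w2]
w0 = clf.intercept_[0]

print("support vectors:\n", clf.support_vectors_)
print("||w1|| =", np.linalg.norm(w1), "  geometric margin =", 1.0 / np.linalg.norm(w1))
```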