TRANSCRIPT
Machine learning: lecture 6
Tommi S. Jaakkola, MIT CSAIL
Topics
• Regularization
  – prior, penalties, MAP estimation
  – the effect of regularization, generalization
  – regularization and discrimination
• Discriminative classification
  – criterion, margin
  – support vector machine
Tommi Jaakkola, MIT CSAIL 2
MAP estimation, regularization
• Consider again a simple 2-d logistic regression model
$$ P(y = 1 \mid x, w) = g(w_0 + w_1 x_1 + w_2 x_2) $$
• Before seeing any data we may prefer some values of the parameters over others (e.g., small over large values).
• We can express this preference through a prior distribution over the parameters (here omitting w0)
$$ p(w_1, w_2; \sigma^2) = \frac{1}{2\pi\sigma^2}\, \exp\left\{ -\frac{1}{2\sigma^2}\,(w_1^2 + w_2^2) \right\} $$
where σ² determines how tightly around zero we want to constrain the values of w1 and w2.
• To combine the prior with the available data we find the MAP (maximum a posteriori) parameter estimates:
$$ \hat{w}_{\mathrm{MAP}} = \arg\max_{w}\; \left[ \prod_{i=1}^{n} P(y_i \mid x_i, w) \right] p(w_1, w_2; \sigma^2) $$
Tommi Jaakkola, MIT CSAIL 5
MAP estimation, regularization
• The estimation criterion is now given by a penalized log-likelihood (cf. log-posterior):
$$ \begin{aligned} \tilde{l}(D; w) &= \sum_{i=1}^{n} \log P(y_i \mid x_i, w) + \log p(w_1, w_2; \sigma^2) \\ &= \sum_{i=1}^{n} \log P(y_i \mid x_i, w) - \frac{1}{2\sigma^2}(w_1^2 + w_2^2) + \text{const.} \end{aligned} $$
• We'd like to understand how the solution changes as a function of the prior variance σ² (or, more generally, with different priors).
Tommi Jaakkola, MIT CSAIL 6
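To make the criterion concrete, here is a minimal numerical sketch (not from the lecture): it assumes NumPy, a made-up 2-d data set, and plain gradient ascent as a stand-in optimizer, and computes ŵ_MAP for a few prior variances. The function names, data, and step sizes are all illustrative.

```python
import numpy as np

def penalized_loglik(w, X, y, sigma2):
    """Penalized log-likelihood l~(D; w) with w = [w0, w1, w2], y in {0, 1};
    only w1, w2 are penalized, matching the prior that omits w0."""
    z = w[0] + X @ w[1:]
    loglik = np.sum(y * z - np.logaddexp(0.0, z))       # sum_i log P(y_i | x_i, w)
    return loglik - (w[1] ** 2 + w[2] ** 2) / (2.0 * sigma2)

def map_estimate(X, y, sigma2, lr=0.5, iters=5000):
    """Gradient ascent on the penalized log-likelihood (illustrative optimizer)."""
    w = np.zeros(3)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(w[0] + X @ w[1:])))    # P(y = 1 | x, w)
        err = y - p
        grad = np.array([err.sum(), X[:, 0] @ err, X[:, 1] @ err])
        grad[1:] -= w[1:] / sigma2                       # gradient of the log-prior
        w += lr * grad / len(y)
    return w

# made-up 2-d data: two Gaussian blobs, labels in {0, 1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

for sigma2 in [np.inf, 10.0 / len(y), 1.0 / len(y)]:
    w_hat = map_estimate(X, y, sigma2)
    print(sigma2, np.round(w_hat, 3), round(penalized_loglik(w_hat, X, y, sigma2), 3))
```

Smaller σ² pulls (w1, w2) toward zero, which is the effect the next few slides examine graphically.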
The effect of regularization
• Let's first understand graphically how the addition of the prior changes the solution
$$ \tilde{l}(D; w) = \overbrace{\sum_{i=1}^{n} \log P(y_i \mid x_i, w)}^{\text{log-likelihood}} \;\; \overbrace{-\,\frac{1}{2\sigma^2}(w_1^2 + w_2^2)}^{\text{log-prior}} \; + \text{const.} $$
[Figure: the log-prior, the log-likelihood, and the log-posterior as functions of (w1, w2).]
Tommi Jaakkola, MIT CSAIL 8
The effect of regularization cont’d
$$ \tilde{l}(D; w) = \sum_{i=1}^{n} \log P(y_i \mid x_i, w) - \frac{1}{2\sigma^2}(w_1^2 + w_2^2) + \text{const.} $$
[Figure panels: σ² = ∞, σ² = 10/n, σ² = 1/n.]
Tommi Jaakkola, MIT CSAIL 11
The effect of regularization cont’d
$$ \tilde{l}(D; w) = \sum_{i=1}^{n} \log P(y_i \mid x_i, w) - \frac{1}{2\sigma^2}\, w_1^2 + \text{const.} $$
[Figure panels: σ² = ∞, σ² = 10/n, σ² = 1/n (penalty on w1 only).]
Tommi Jaakkola, MIT CSAIL 14
The effect of regularization cont’d
$$ \tilde{l}(D; w) = \sum_{i=1}^{n} \log P(y_i \mid x_i, w) - \frac{1}{2\sigma^2}\, w_1^2 + \text{const.} $$
[Figure panels: σ² = 10/n, σ² = 1/n, σ² = 0.1/n.]
Tommi Jaakkola, MIT CSAIL 15
The effect of regularization: train/test
• (Scaled) penalized log-likelihood criterion
$$ \tilde{l}(D; w)/n = \frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid x_i, w) - \frac{1}{2n\sigma^2}(w_1^2 + w_2^2) + \text{const.} $$
Tommi Jaakkola, MIT CSAIL 16
The effect of regularization: train/test
• (Scaled) penalized log-likelihood criterion
$$ \tilde{l}(D; w)/n = \frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid x_i, w) - \frac{c}{2}(w_1^2 + w_2^2) + \text{const.} $$
where c = 1/(nσ²); increasing c results in stronger regularization.
• Resulting average log-likelihoods
$$ \text{training log-lik.} = \frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid x_i, \hat{w}_{\mathrm{MAP}}) $$
$$ \text{test log-lik.} = E_{(x,y)\sim P}\left\{ \log P(y \mid x, \hat{w}_{\mathrm{MAP}}) \right\} $$
Tommi Jaakkola, MIT CSAIL 18
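A sketch of how these train/test curves could be traced out (everything here is illustrative and not from the lecture): it assumes SciPy, minimizes the negative of the scaled criterion over a grid of c values, and uses a large made-up held-out sample as a stand-in for the expectation over P.

```python
import numpy as np
from scipy.optimize import minimize

def neg_criterion(w, X, y, c):
    """-(1/n) sum_i log P(y_i|x_i,w) + (c/2)(w1^2 + w2^2), with w = [w0, w1, w2]."""
    z = w[0] + X @ w[1:]
    return -np.mean(y * z - np.logaddexp(0.0, z)) + 0.5 * c * np.sum(w[1:] ** 2)

def mean_loglik(w, X, y):
    z = w[0] + X @ w[1:]
    return np.mean(y * z - np.logaddexp(0.0, z))

# made-up train/test data (labels in {0, 1}); the large test set approximates E_{(x,y)~P}
rng = np.random.default_rng(1)
def sample(m):
    X = np.vstack([rng.normal(-1.0, 1.5, (m, 2)), rng.normal(1.0, 1.5, (m, 2))])
    return X, np.concatenate([np.zeros(m), np.ones(m)])
Xtr, ytr = sample(25)
Xte, yte = sample(1000)

for c in [0.0, 0.01, 0.1, 0.3, 1.0]:
    w_map = minimize(neg_criterion, np.zeros(3), args=(Xtr, ytr, c)).x
    print(f"c={c:4.2f}  train={mean_loglik(w_map, Xtr, ytr):6.3f}  "
          f"test={mean_loglik(w_map, Xte, yte):6.3f}")
```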
The effect of regularization: train/test
$$ \tilde{l}(D; w)/n = \frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid x_i, w) - \frac{c}{2}(w_1^2 + w_2^2) + \text{const.} $$
$$ \text{training log-lik.} = \frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid x_i, \hat{w}_{\mathrm{MAP}}), \qquad \text{test log-lik.} = E_{(x,y)\sim P}\left\{ \log P(y \mid x, \hat{w}_{\mathrm{MAP}}) \right\} $$
[Figure: mean log-likelihood (train and test curves) as a function of c = 1/(nσ²).]
Tommi Jaakkola, MIT CSAIL 19
Likelihood, regularization, and discrimination
• Regularization by penalizing ‖w1‖² = w1² + w2² in
$$ \tilde{l}(D; w)/n = \frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid x_i, w) - \frac{c}{2}(w_1^2 + w_2^2) + \text{const.} $$
does not directly limit the logistic regression model as a classifier. For example:
[Figures: mean log-likelihood vs. c = 1/(nσ²) (train/test) and classification accuracy vs. c = 1/(nσ²) (train/test).]
Tommi Jaakkola, MIT CSAIL 20
Likelihood, regularization, and discrimination
• Regularization by penalizing ‖w1‖² = w1² + w2² in
$$ \tilde{l}(D; w)/n = \frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid x_i, w) - \frac{c}{2}(w_1^2 + w_2^2) + \text{const.} $$
does not directly limit the logistic regression model as a classifier.
• Classification decisions only depend on the sign of the discriminant function
$$ f(x; w) = w_0 + x^T w_1 = (x - x_0)^T w_1 $$
where w1 = [w1, w2]ᵀ and x0 is chosen such that w0 = −x0ᵀw1 (i.e., x0 lies on the decision boundary). Limiting ‖w1‖² = w1² + w2² does not reduce the possible signs.
Tommi Jaakkola, MIT CSAIL 21
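A quick numeric illustration of this scale argument (made-up numbers, not from the lecture): multiplying (w0, w1) by any positive constant changes ‖w1‖ but leaves the sign of f(x; w), and hence every classification decision, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))            # arbitrary input points
w0, w1 = 0.7, np.array([2.0, -3.0])     # arbitrary discriminant parameters

for scale in [1.0, 0.01, 100.0]:        # shrink or inflate ||w1|| at will
    f = scale * w0 + X @ (scale * w1)   # f(x; scaled w) = scale * f(x; w)
    print(scale, np.sign(f))            # the sign pattern is identical for every scale
```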
Topics
• Regularization
  – prior, penalties, MAP estimation
  – the effect of regularization, generalization
  – regularization and discrimination
• Discriminative classification
  – criterion, margin
  – support vector machine
Tommi Jaakkola, MIT CSAIL 22
Discriminative classification
• Consider again a binary classification task with y = ±1 labels (not 0/1 as before) and linear discriminant functions
$$ f(x; w) = w_0 + x^T w_1 $$
parameterized by w0 and w1 = [w1, . . . , wd]ᵀ.
• The predicted label is simply given by the sign of the discriminant function, ŷ = sign(f(x; w)).
• We are only interested in getting the labels correct; no probabilities are associated with the predictions.
Tommi Jaakkola, MIT CSAIL 23
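The prediction rule as a minimal sketch (the weights and points below are placeholders, not from the lecture):

```python
import numpy as np

def predict(X, w0, w1):
    """Labels in {-1, +1} from the sign of the discriminant f(x; w) = w0 + x^T w1."""
    return np.where(X @ w1 + w0 >= 0.0, 1, -1)

w0, w1 = -0.5, np.array([1.0, 2.0])                     # placeholder parameters
X = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.2]])     # placeholder points
print(predict(X, w0, w1))                               # -> [-1  1 -1]
```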
Discriminative classification
• When the training set {(x1, y1), . . . , (xn, yn)} is linearly separable we can find parameters w such that
$$ y_i\,[\,w_0 + x_i^T w_1\,] > 0, \quad i = 1, \ldots, n $$
i.e., the sign of the discriminant function agrees with the label.
[Figure: a linearly separable 2-d training set with a separating decision boundary.]
(there are many possible solutions)
Tommi Jaakkola, MIT CSAIL 24
Discriminative classification
• Perhaps we can find a better discriminant boundary by requiring that the training examples are separated with a fixed "margin":
$$ y_i\,[\,w_0 + x_i^T w_1\,] - 1 \geq 0, \quad i = 1, \ldots, n $$
[Figure: two panels illustrating decision boundaries satisfying the fixed-margin constraints.]
The problem is the same as before. The notion of "margin" used here depends on the scale of ‖w1‖.
Tommi Jaakkola, MIT CSAIL 26
Margin and regularization
• We get a more meaningful (geometric) notion of margin by regularizing the problem:
$$ \text{minimize} \quad \frac{1}{2}\,\|w_1\|^2 = \frac{1}{2}\sum_{i=1}^{d} w_i^2 $$
$$ \text{subject to} \quad y_i\,[\,w_0 + x_i^T w_1\,] - 1 \geq 0, \quad i = 1, \ldots, n $$
• What can we say about the solution?
Tommi Jaakkola, MIT CSAIL 27
Margin and regularization
• One dimensional example: f(x; w) = w0 + w1 x
Relevant constraints:
$$ 1\cdot[\,w_0 + w_1 x_+\,] - 1 \geq 0, \qquad -1\cdot[\,w_0 + w_1 x_-\,] - 1 \geq 0 $$
Maximum separation would be at the mid point, with a margin of |x+ − x−|/2.
[Figure: 1-d example showing f(x; w) = w0 + w1x, the closest points x+ and x−, and the margin |x+ − x−|/2.]
Tommi Jaakkola, MIT CSAIL 28
Margin and regularization
• One dimensional example: f(x; w) = w0 + w1 x
Relevant constraints:
$$ 1\cdot[\,w_0 + w_1 x_+\,] - 1 \geq 0, \qquad -1\cdot[\,w_0 + w_1 x_-\,] - 1 \geq 0 $$
At the mid point the value of the margin is |x+ − x−|/2.
[Figure: the maximum margin solution f(x; w*) = w0* + w1*x for the 1-d example, with the margin |x+ − x−|/2 marked.]
• We can find the maximum margin solution by minimizing the slope |w1| while satisfying the classification constraints.
• The resulting margin is directly tied to the minimizing slope (slope = 1/margin): |w1*| = 2/|x+ − x−|.
Tommi Jaakkola, MIT CSAIL 29
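As a concrete check of these two bullets, with made-up points x+ = 2 and x− = 0 (so |x+ − x−|/2 = 1):
$$ w_0 + 2 w_1 - 1 \ge 0 \quad\text{and}\quad -w_0 - 1 \ge 0 \;\;\Rightarrow\;\; w_0 \le -1, \quad 2 w_1 \ge 1 - w_0 \ge 2 . $$
$$ \text{The smallest feasible slope is } w_1^* = 1 \text{ (with } w_0^* = -1\text{)}, \text{ so } |w_1^*| = 1 = \frac{2}{|x_+ - x_-|}, $$
$$ \text{the boundary sits at } x = -w_0^*/w_1^* = 1 \text{ (the mid point), and the margin is } 1/|w_1^*| = 1 = |x_+ - x_-|/2 . $$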
Support vector machine
• We minimize the regularization penalty
$$ \frac{1}{2}\,\|w_1\|^2 = \frac{1}{2}\sum_{i=1}^{d} w_i^2 $$
subject to the classification constraints
$$ y_i\,[\,w_0 + x_i^T w_1\,] - 1 \geq 0 \quad \text{for } i = 1, \ldots, n. $$
[Figure: 2-d training examples.]
• Analogously to the one dimensional case, the "slope" is related to the geometric margin: ‖w1*‖ = 1/margin.
• The solution is again defined only on the basis of a subset of examples or "support vectors".
Tommi Jaakkola, MIT CSAIL 30
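A small end-to-end sketch of this optimization (assuming the cvxpy package and a made-up separable data set; nothing here comes from the lecture): it solves the quadratic program above, reads off the geometric margin as 1/‖w1*‖, and reports the support vectors as the examples whose margin constraints are active at the solution.

```python
import numpy as np
import cvxpy as cp

# made-up separable data with labels y in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w1 = cp.Variable(2)      # weight vector w1
w0 = cp.Variable()       # offset w0
constraints = [cp.multiply(y, X @ w1 + w0) >= 1]        # y_i [w0 + x_i^T w1] - 1 >= 0
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w1)), constraints)
problem.solve()

margin = 1.0 / np.linalg.norm(w1.value)                 # ||w1*|| = 1 / margin
active = np.isclose(y * (X @ w1.value + w0.value), 1.0, atol=1e-3)
print("w1* =", w1.value, " w0* =", w0.value)
print("geometric margin =", margin)
print("support vectors:\n", X[active])
```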