©Kevin Jamieson 2018
Classification: Logistic Regression
Machine Learning – CSE546 Kevin Jamieson University of Washington
October 16, 2016
THUS FAR, REGRESSION: PREDICT A CONTINUOUS VALUE GIVEN SOME INPUTS
Weather prediction revisited

(Figure: temperature plot)
Reading Your Brain, Simple Example

(Figure: brain-image panels labeled "Animal" and "Person")

Pairwise classification accuracy: 85% [Mitchell et al.]
Binary Classification

■ Learn: $f : X \to Y$, where $X$ are the features and $Y \in \{0, 1\}$ the target class
■ Loss function: $\ell(f(x), y) = \mathbf{1}\{f(x) \neq y\}$
■ Expected loss of $f$:
$$\mathbb{E}_{XY}\big[\mathbf{1}\{f(X) \neq Y\}\big] = \mathbb{E}_X\big[\,\mathbb{E}_{Y|X}[\mathbf{1}\{f(x) \neq Y\} \mid X = x]\,\big]$$
$$\mathbb{E}_{Y|X}[\mathbf{1}\{f(x) \neq Y\} \mid X = x] = \sum_i P(Y = i \mid X = x)\,\mathbf{1}\{f(x) \neq i\} = \sum_{i \neq f(x)} P(Y = i \mid X = x) = 1 - P(Y = f(x) \mid X = x)$$
■ Suppose you know $P(Y|X)$ exactly; how should you classify? Bayes optimal classifier:
$$f(x) = \arg\max_y P(Y = y \mid X = x)$$
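Concretely, the Bayes optimal rule just picks the most probable class under the (assumed known) conditional distribution. A minimal sketch, with a made-up toy table `p_y_given_x` that is not from the slides:

```python
# Toy conditional distribution P(Y = y | X = x) over classes {0, 1}.
# The numbers are made up purely for illustration.
p_y_given_x = {
    "x1": {0: 0.9, 1: 0.1},
    "x2": {0: 0.3, 1: 0.7},
}

def bayes_classifier(x):
    """f(x) = argmax_y P(Y = y | X = x)."""
    probs = p_y_given_x[x]
    return max(probs, key=probs.get)

def expected_loss(x):
    """Expected 0/1 loss at x is 1 - P(Y = f(x) | X = x)."""
    return 1.0 - p_y_given_x[x][bayes_classifier(x)]
```

No classifier can beat this rule in expected 0/1 loss, since any other choice at $x$ incurs loss at least $1 - \max_y P(Y = y \mid X = x)$.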
Link Functions

■ Estimating $P(Y|X)$: why not use standard linear regression?
■ Combining regression and probability? We need a mapping from real values to $[0, 1]$: a link function! That is, a function that maps a real-valued score (e.g., $w^T x$ for $x \in \mathbb{R}^d$) to a value in $[0, 1]$.
Logistic Regression

Logistic function (or sigmoid): $\sigma(z) = \dfrac{1}{1 + \exp(-z)}$

■ Learn $P(Y|X)$ directly: assume a particular functional form for the link function, namely the sigmoid applied to a linear function of the input features.

Features can be discrete or continuous!
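The sigmoid is a one-liner in code. A minimal sketch (the function names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    """P(Y = 1 | x, w) for each row of X under the logistic model."""
    return sigmoid(X @ w)
```

Note the symmetry $\sigma(z) + \sigma(-z) = 1$, which is what makes the two class probabilities sum to one.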
Understanding the sigmoid

(Figure: three plots of $\sigma(w_0 + w_1 x)$ over $x \in [-6, 6]$ with $(w_0, w_1) = (0, -1)$, $(-2, -1)$, and $(0, -0.5)$, showing how $w_0$ shifts the curve and $w_1$ sets its steepness.)
Sigmoid for binary classes

$$P(Y = 0 \mid w, X) = \frac{1}{1 + \exp(w_0 + \sum_k w_k X_k)}$$

$$P(Y = 1 \mid w, X) = 1 - P(Y = 0 \mid w, X) = \frac{\exp(w_0 + \sum_k w_k X_k)}{1 + \exp(w_0 + \sum_k w_k X_k)}$$

$$\frac{P(Y = 1 \mid w, X)}{P(Y = 0 \mid w, X)} = \exp\Big(w_0 + \sum_k w_k X_k\Big)$$

$$\log \frac{P(Y = 1 \mid w, X)}{P(Y = 0 \mid w, X)} = w_0 + \sum_k w_k X_k$$

Linear Decision Rule!
Logistic Regression – a Linear classifier

(Figure: sigmoid curve; the annotations mark the region $w_0 + w^T x > 0$, where the model predicts $Y = 1$, and $w_0 + w^T x < 0$, where it predicts $Y = 0$.)

The decision boundary $w_0 + w^T x = 0$ is a hyperplane: logistic regression is a linear classifier.
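Since the sigmoid is monotone with $\sigma(0) = 1/2$, thresholding $P(Y = 1 \mid x)$ at $1/2$ is the same as checking the sign of the linear score. A quick sketch (the weights used below are assumed toy values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_rule(w0, w, x):
    """Classify 1 iff P(Y = 1 | x) = sigmoid(w0 + w.x) exceeds 1/2."""
    return int(sigmoid(w0 + w @ x) > 0.5)

def linear_rule(w0, w, x):
    """Classify 1 iff the linear score w0 + w.x is positive."""
    return int(w0 + w @ x > 0)
```

The two rules agree on every input, which is exactly the "linear decision rule" observation above.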
Loss function: Conditional Likelihood

■ Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$

$$P(Y = 1 \mid x, w) = \frac{\exp(w^T x)}{1 + \exp(w^T x)} \qquad P(Y = -1 \mid x, w) = \frac{1}{1 + \exp(w^T x)}$$

■ This is equivalent to:
$$P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$$

■ So we can compute the maximum likelihood estimator:
$$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w)$$
Loss function: Conditional Likelihood

■ Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$

$$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log\big(1 + \exp(-y_i\, x_i^T w)\big)$$
Loss function: Conditional Likelihood

$$\widehat{w}_{\mathrm{MLE}} = \arg\min_w \sum_{i=1}^n \log\big(1 + \exp(-y_i\, x_i^T w)\big)$$

Logistic Loss: $\ell_i(w) = \log\big(1 + \exp(-y_i\, x_i^T w)\big)$
Squared error Loss: $\ell_i(w) = (y_i - x_i^T w)^2$ (the MLE for Gaussian noise)
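The logistic loss per example depends on the data only through the margin $y_i\, x_i^T w$, so the two losses are easy to compare numerically as functions of the margin. A small sketch (the margin values below are arbitrary illustrations):

```python
import numpy as np

def logistic_loss(margin):
    """l_i(w) = log(1 + exp(-y_i x_i^T w)) as a function of margin y_i x_i^T w.

    log1p is used for numerical accuracy when exp(-margin) is tiny."""
    return np.log1p(np.exp(-margin))

# Loss shrinks monotonically as the margin grows: confidently correct
# points contribute almost nothing, unlike the squared error (y - x^T w)^2.
margins = np.array([-2.0, 0.0, 2.0, 10.0])
losses = logistic_loss(margins)
```

At zero margin the loss is exactly $\log 2$, and at margin 10 it is already below $10^{-4}$.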
Loss function: Conditional Likelihood

$$J(w) = \sum_{i=1}^n \log\big(1 + \exp(-y_i\, x_i^T w)\big)$$

What does $J(w)$ look like? Is it convex? Recall that $f$ is convex if $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $x, y$ and $\lambda \in [0, 1]$.
Loss function: Conditional Likelihood

$$\widehat{w}_{\mathrm{MLE}} = \arg\min_w \sum_{i=1}^n \log\big(1 + \exp(-y_i\, x_i^T w)\big) = \arg\min_w J(w)$$

Good news: $J(w)$ is a convex function of $w$, so there are no local-optima problems.
Bad news: there is no closed-form solution minimizing $J(w)$.
Good news: convex functions are easy to optimize.
Linear Separability

$$\arg\min_w \sum_{i=1}^n \log\big(1 + \exp(-y_i\, x_i^T w)\big)$$

When is this loss small? Each term is small when the margin $y_i\, x_i^T w$ is large and positive, i.e., when every point is classified correctly with room to spare.
Large parameters → Overfitting

■ If the data is linearly separable, the weights go to infinity
■ In general, large weights lead to overfitting
■ Penalizing high weights can prevent overfitting…
Regularized Conditional Log Likelihood

■ Add a regularization penalty, e.g., L2:

$$\arg\min_{w,b} \sum_{i=1}^n \log\big(1 + \exp(-y_i\, (x_i^T w + b))\big) + \lambda \|w\|_2^2$$

Be sure to not regularize the offset b!
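A sketch of the regularized objective, keeping the offset $b$ out of the penalty as the slide warns (the function name and the choice of $\lambda$ are illustrative):

```python
import numpy as np

def regularized_loss(w, b, X, y, lam):
    """sum_i log(1 + exp(-y_i (x_i^T w + b))) + lam * ||w||_2^2.

    Only w is penalized; the offset b is deliberately NOT regularized."""
    margins = y * (X @ w + b)
    return np.sum(np.log1p(np.exp(-margins))) + lam * np.dot(w, w)
```

Leaving $b$ unpenalized means a shift of the decision threshold can be absorbed for free; regularizing it would make the solution depend on an arbitrary translation of the data.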
Gradient Descent
Machine Learning Problems

■ Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
■ Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.
Machine Learning Problems

■ Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
■ Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

$f$ convex:
$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) \quad \forall x, y,\ \lambda \in [0, 1]$$
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \forall x, y$$
$g$ is a subgradient at $x$ if $f(y) \ge f(x) + g^T (y - x)$ for all $y$.

$f$ $\ell$-strongly convex:
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \tfrac{\ell}{2} \|y - x\|_2^2 \quad \forall x, y$$
or, equivalently for twice-differentiable $f$, $\nabla^2 f(x) \succeq \ell I$ for all $x$.
Machine Learning Problems

■ Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
■ Learning a model's parameters: minimize $\sum_{i=1}^n \ell_i(w)$, where each $\ell_i(w)$ is convex.

Logistic Loss: $\ell_i(w) = \log\big(1 + \exp(-y_i\, x_i^T w)\big)$
Squared error Loss: $\ell_i(w) = (y_i - x_i^T w)^2$
Least squares

■ Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$
Squared error Loss: $\ell_i(w) = (y_i - x_i^T w)^2$

How does software solve $\min_w \frac{1}{2}\|Xw - y\|_2^2$? Setting the gradient to zero gives the normal equations $(X^T X)\, w = X^T y$: a linear system of the form $Aw = b$.
Least squares

How does software solve $\min_w \frac{1}{2}\|Xw - y\|_2^2$? (LAPACK, BLAS, MKL, …)

…it's complicated:
■ Do you need high precision?
■ Is $X$ column/row sparse?
■ Is $\widehat{w}_{LS}$ sparse?
■ Is $X^T X$ "well-conditioned"?
■ Can $X^T X$ fit in cache/memory?
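In practice one usually calls a library least-squares routine rather than forming $X^T X$ by hand. A sketch contrasting the two routes on well-conditioned random data (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

# Library route: a least-squares solver (LAPACK under the hood).
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Normal-equations route: solve (X^T X) w = X^T y directly.
# Fine here, but it squares the condition number of X.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
```

On well-conditioned data the two agree; on ill-conditioned data the normal-equations route loses roughly twice as many digits, which is one of the "it's complicated" considerations above.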
Taylor Series Approximation

■ Taylor series in one dimension:
$$f(x + \delta) = f(x) + f'(x)\,\delta + \tfrac{1}{2} f''(x)\,\delta^2 + \dots$$

■ Gradient descent: initialize $w_0$ (e.g., randomly), then repeatedly step against the derivative, $w_{t+1} = w_t - \eta\, f'(w_t)$.
Taylor Series Approximation

■ Taylor series in $d$ dimensions:
$$f(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2}\, v^T \nabla^2 f(x)\, v + \dots$$

■ Gradient descent: $x_{t+1} = x_t - \eta\, \nabla f(x_t)$
Gradient Descent

$$w_{t+1} = w_t - \eta\, \nabla f(w_t)$$

For $f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$:
$$\nabla f(w) = X^T (Xw - y), \qquad w_{t+1} = w_t - \eta\, X^T (X w_t - y)$$

Since $\nabla f(w^*) = X^T (X w^* - y) = 0$ at the minimizer, subtracting $w^*$ from both sides gives
$$w_{t+1} - w^* = (I - \eta\, X^T X)(w_t - w^*)$$
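The update $w_{t+1} = w_t - \eta\, X^T (X w_t - y)$ takes only a few lines. A sketch on random data; the step size uses the common (but here assumed) choice $\eta = 1/\lambda_{\max}(X^T X)$:

```python
import numpy as np

def gd_least_squares(X, y, eta, iters):
    """Gradient descent on f(w) = 1/2 ||Xw - y||^2 with
    update w <- w - eta * X^T (X w - y)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - eta * X.T @ (X @ w - y)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

# A safe step size: any eta < 2 / lambda_max(X^T X) is stable.
eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()
w_gd = gd_least_squares(X, y, eta, iters=2000)
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With a well-conditioned $X$ the iterate matches the direct least-squares solution closely; the next slide shows what goes wrong when $X^T X$ is badly conditioned.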
Gradient Descent

$$w_{t+1} = w_t - \eta\, \nabla f(w_t), \qquad f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$$

Example: $X = \begin{bmatrix} 10^{-3} & 0 \\ 0 & 1 \end{bmatrix}$, $y = \begin{bmatrix} 10^{-3} \\ 1 \end{bmatrix}$, so $w^* = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$; start at $w_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$.

$$(w_{t+1} - w^*) = (I - \eta\, X^T X)(w_t - w^*) = (I - \eta\, X^T X)^{t+1}(w_0 - w^*)$$

Here $X^T X = \mathrm{diag}(10^{-6}, 1)$, so the error in coordinate $k$ contracts by $|1 - \eta\, d_k|$ per step, with $d_1 = 10^{-6}$ and $d_2 = 1$. Keeping the second coordinate stable requires $0 < \eta < 2$, but then the first coordinate contracts by only about $1 - \eta \cdot 10^{-6}$ per step: ill-conditioning of $X^T X$ makes gradient descent slow.
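Simulating this exact example shows the effect: the well-scaled coordinate converges immediately, while the ill-conditioned one barely moves even after a thousand steps.

```python
import numpy as np

X = np.diag([1e-3, 1.0])
y = np.array([1e-3, 1.0])
w_star = np.array([1.0, 1.0])   # since X w* = y
eta = 1.0                       # stable for the eigenvalue d2 = 1

w = np.zeros(2)
for _ in range(1000):
    w = w - eta * X.T @ (X @ w - y)

# Coordinate 2 (eigenvalue 1) is exact after one step; coordinate 1
# (eigenvalue 1e-6) contracts by only (1 - 1e-6) per step.
```

After 1000 iterations the error in the first coordinate is still about $(1 - 10^{-6})^{1000} \approx 0.999$.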
Taylor Series Approximation

■ Taylor series in one dimension:
$$f(x + \delta) = f(x) + f'(x)\,\delta + \tfrac{1}{2} f''(x)\,\delta^2 + \dots$$

■ Newton's method: set the derivative of the quadratic approximation to zero, $f'(x) + f''(x)\,\delta = 0$, and step to $x_{t+1} = x_t - f'(x_t)/f''(x_t)$.
Taylor Series Approximation

■ Taylor series in $d$ dimensions:
$$f(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2}\, v^T \nabla^2 f(x)\, v + \dots$$

■ Newton's method: $x_{t+1} = x_t + \eta\, v_t$ with $v_t = -\big[\nabla^2 f(x_t)\big]^{-1} \nabla f(x_t)$
Newton's Method

$$w_{t+1} = w_t + \eta\, v_t, \qquad v_t \text{ is the solution to } \nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$$

For $f(w) = \tfrac{1}{2}\|Xw - y\|_2^2$: $\nabla f(w) = X^T (Xw - y)$ and $\nabla^2 f(w) = X^T X$, so with $\eta = 1$

$$w_1 = w_0 - (X^T X)^{-1} X^T (X w_0 - y) = w^*$$

For quadratics, Newton's method converges in one step! (Not a surprise, why? The second-order Taylor expansion of a quadratic is exact.)
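One Newton step with $\eta = 1$ lands exactly on $w^*$ for this quadratic. A short sketch on random data (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)

w0 = np.zeros(3)
grad = X.T @ (X @ w0 - y)         # gradient of 1/2 ||Xw - y||^2 at w0
hess = X.T @ X                    # Hessian (constant for a quadratic)
v = np.linalg.solve(hess, -grad)  # Newton direction
w1 = w0 + v                       # one step with eta = 1

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Note the Newton direction is obtained by solving a linear system, not by forming the inverse, which matters once $d$ is large.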
General case

In general, for Newton's method to achieve $f(w_t) - f(w^*) \le \epsilon$, roughly $t \approx \log\log(1/\epsilon)$ iterations suffice.

So why are ML problems overwhelmingly solved by gradient methods? Hint: each iteration requires solving the $d \times d$ linear system $\nabla^2 f(w_t)\, v_t = -\nabla f(w_t)$, which costs far more than a gradient evaluation in high dimensions.
General Convex case

Goal: $f(w_t) - f(w^*) \le \epsilon$

Newton's method: $t \approx \log(\log(1/\epsilon))$

Gradient descent:
■ $f$ is smooth and strongly convex ($aI \preceq \nabla^2 f(w) \preceq bI$): $t \approx \frac{b}{a}\log(1/\epsilon)$
■ $f$ is smooth ($\nabla^2 f(w) \preceq bI$): $t \approx b/\epsilon$
■ $f$ is potentially non-differentiable ($\|\nabla f(w)\|_2 \le c$): $t \approx c^2/\epsilon^2$

Other methods: BFGS, Heavy-ball, BCD, SVRG, ADAM, Adagrad, … (Nocedal & Wright; Bubeck)
Clean convergence proofs: Bubeck
Revisiting…Logistic Regression
Loss function: Conditional Likelihood

■ Have a bunch of iid data of the form: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$

$$P(Y = y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}$$

$$\widehat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w) = \arg\min_w \sum_{i=1}^n \log\big(1 + \exp(-y_i\, x_i^T w)\big) = \arg\min_w f(w)$$

$$\nabla f(w) = \sum_{i=1}^n \frac{-y_i\, x_i}{1 + \exp(y_i\, x_i^T w)}$$
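Putting the gradient formula to work: a hedged sketch of plain gradient descent on the logistic loss with synthetic data (the step size, iteration count, and data-generating weights are arbitrary choices, not from the slides):

```python
import numpy as np

def grad_logistic(w, X, y):
    """Gradient of f(w) = sum_i log(1 + exp(-y_i x_i^T w)),
    i.e., sum_i -y_i x_i / (1 + exp(y_i x_i^T w))."""
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))
    return -(X.T @ (y * s))

def fit_logistic(X, y, eta=0.01, iters=2000):
    """Plain gradient descent on the logistic loss."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - eta * grad_logistic(w, X, y)
    return w

# Synthetic, nearly separable data from an assumed "true" direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])
y = np.sign(X @ w_true + 0.1 * rng.normal(size=200))

w_hat = fit_logistic(X, y)
accuracy = np.mean(np.sign(X @ w_hat) == y)
```

The small fixed step size keeps the iteration stable (the logistic loss is smooth with Hessian bounded by $\tfrac{1}{4} X^T X$); in practice one would add the L2 penalty from the regularization slide, since near-separable data drives $\|w\|$ toward infinity.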