
Page 1

1

Logistic Regression

Machine Learning – CSEP546 Carlos Guestrin University of Washington

January 27, 2014 ©Carlos Guestrin 2005-2014

Page 2

Reading Your Brain, Simple Example

[Figure: brain activation images for the two stimulus classes, Animal vs. Person]

Pairwise classification accuracy: 85% [Mitchell et al.]

2 ©Carlos Guestrin 2005-2014

Page 3

Classification

- Learn: h: X → Y
  - X – features
  - Y – target classes

- Simplest case: Thresholding

©Carlos Guestrin 2005-2014 3

Page 4

©Carlos Guestrin 2005-2014 4

Linear (Hyperplane) Decision Boundaries

Page 5

Classification

- Learn: h: X → Y
  - X – features
  - Y – target classes

- Thus far: just a decision boundary

- What if you want probability of each class? P(Y|X)

©Carlos Guestrin 2005-2014 5

Page 6

Ad Placement Strategies

- Companies bid on ad prices

- Which ad wins? (many simplifications here)
  - Naively:
  - But:
  - Instead:

©Carlos Guestrin 2005-2014 6

Page 7

Link Functions

- Estimating P(Y|X): why not use standard linear regression?

- Combining regression and probability?
  - Need a mapping from real values to [0,1]
  - A link function!

©Carlos Guestrin 2005-2014 7

Page 8

Logistic Regression

Logistic function (or Sigmoid):

  \sigma(z) = \frac{1}{1 + e^{-z}}

- Learn P(Y|X) directly
  - Assume a particular functional form for the link function
  - Sigmoid applied to a linear function of the input features:

    P(Y = 1 \mid x, w) = \frac{1}{1 + \exp\left(-(w_0 + \sum_i w_i x_i)\right)}

Features can be discrete or continuous! 8 ©Carlos Guestrin 2005-2014
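A minimal Python sketch of this functional form (not from the slides; the function names and toy numbers are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y1_given_x(x, w, w0):
    """P(Y=1 | x, w): sigmoid applied to a linear function of the features."""
    return sigmoid(w0 + np.dot(w, x))

# Example: two features, arbitrary weights
x = np.array([2.0, -1.0])
w = np.array([0.5, 1.5])
print(p_y1_given_x(x, w, w0=-0.2))   # a probability in (0, 1)
```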

Page 9

Understanding the sigmoid

[Figure: the sigmoid plotted over the range -6 to 6 (values from 0 to 1) for three parameter settings: w0=0, w1=-1; w0=-2, w1=-1; w0=0, w1=-0.5]

9 ©Carlos Guestrin 2005-2014

Page 10

Logistic Regression – a Linear classifier

[Figure: a sigmoid curve plotted over the range -6 to 6, values from 0 to 1]

10 ©Carlos Guestrin 2005-2014

Page 11

Very convenient!

The model form implies

  \frac{P(Y=1 \mid x, w)}{P(Y=0 \mid x, w)} = \exp\left(w_0 + \sum_i w_i x_i\right)

which implies

  \ln \frac{P(Y=1 \mid x, w)}{P(Y=0 \mid x, w)} = w_0 + \sum_i w_i x_i

which implies: predict Y = 1 exactly when w_0 + \sum_i w_i x_i > 0, a linear classification rule!

11 ©Carlos Guestrin 2005-2014

Page 12

Loss function: Conditional Likelihood

- Have a bunch of iid data of the form (x^j, y^j), j = 1, …, N

- Discriminative (logistic regression) loss function: Conditional Data Likelihood

  \prod_{j=1}^N P(y^j \mid x^j, w)

12 ©Carlos Guestrin 2005-2014

Page 13

Expressing Conditional Log Likelihood

13 ©Carlos Guestrin 2005-2014

  \ell(w) = \sum_j \left[ y^j \ln P(Y=1 \mid x^j, w) + (1 - y^j) \ln P(Y=0 \mid x^j, w) \right]
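A small Python sketch of this quantity, assuming the sigmoid model above (names and toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_log_likelihood(X, y, w, w0):
    """l(w) = sum_j [ y_j ln P(Y=1|x_j,w) + (1-y_j) ln P(Y=0|x_j,w) ]."""
    p1 = sigmoid(w0 + X @ w)          # P(Y=1 | x_j, w) for every row of X
    return np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))

# Toy data: 4 examples, 2 features, labels in {0, 1}
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 2.0]])
y = np.array([1, 0, 0, 1])
print(conditional_log_likelihood(X, y, w=np.zeros(2), w0=0.0))  # = 4 * ln(0.5)
```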

Page 14

Maximizing Conditional Log Likelihood

Good news: ℓ(w) is a concave function of w, so there are no local-optima problems.

Bad news: there is no closed-form solution to maximize ℓ(w).

Good news: concave functions are easy to optimize.

14 ©Carlos Guestrin 2005-2014

Page 15

Optimizing concave function – Gradient ascent

- The conditional likelihood for logistic regression is concave. Find the optimum with gradient ascent.

- Gradient ascent is the simplest of optimization approaches
  - e.g., conjugate gradient ascent can be much better

Gradient:

  \nabla_w \ell(w) = \left[ \frac{\partial \ell(w)}{\partial w_0}, \ldots, \frac{\partial \ell(w)}{\partial w_k} \right]

Update rule (step size η > 0):

  w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \, \frac{\partial \ell(w^{(t)})}{\partial w_i}

15 ©Carlos Guestrin 2005-2014

Page 16

Coordinate Descent v. Gradient Descent

©Carlos Guestrin 2005-2014 16

Illustration from Wikipedia

Page 17

Maximize Conditional Log Likelihood: Gradient ascent

17 ©Carlos Guestrin 2005-2014

  \frac{\partial \ell(w)}{\partial w_i} = \sum_{j=1}^N x_i^j \left( y^j - P(Y=1 \mid x^j, w) \right)
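A minimal Python sketch of one gradient-ascent step using this gradient (not from the slides; the step size, helper names, and toy data are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent_step(X, y, w, w0, eta=0.1):
    """One gradient-ascent update of (w, w0) on the conditional log likelihood."""
    residual = y - sigmoid(w0 + X @ w)     # y_j - P(Y=1 | x_j, w)
    grad_w = X.T @ residual                # dl/dw_i = sum_j x_ij * residual_j
    grad_w0 = residual.sum()               # treat w0 as a feature that is always 1
    return w + eta * grad_w, w0 + eta * grad_w0

X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 2.0]])
y = np.array([1.0, 0.0, 0.0, 1.0])
w, w0 = np.zeros(2), 0.0
for _ in range(100):                        # iterate until (approximately) converged
    w, w0 = gradient_ascent_step(X, y, w, w0)
print(w, w0)
```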

Page 18

Gradient Descent for LR: Intuition

1. Encode data as numbers
2. Until convergence: for each feature
   a. Compute average gradient over data points
   b. Update parameter

©Carlos Guestrin 2005-2014 18

Gender | Age    | Location | Income | Referrer | New or Returning | Clicked?
-------|--------|----------|--------|----------|------------------|---------
F      | Young  | US       | High   | Google   | New              | N
M      | Middle | US       | Low    | Direct   | New              | N
F      | Old    | BR       | Low    | Google   | Returning        | Y
M      | Young  | BR       | Low    | Bing     | Returning        | N

Encodings: Gender (F=1, M=0); Age (Young=0, Middle=1, Old=2); Location (US=1, Abroad=0); Income (High=1, Low=0); Referrer; New or Returning (New=1, Returning=0); Clicked? (Click=1, NoClick=0)
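A small sketch of step 1, "encode data as numbers", for the table above, using the encodings listed on the slide. The Referrer column is omitted because the slide gives no numeric code for it, and BR is treated as "Abroad"; both are assumptions made only for this example:

```python
gender = {"F": 1, "M": 0}
age = {"Young": 0, "Middle": 1, "Old": 2}
location = {"US": 1, "BR": 0}        # slide: US=1, Abroad=0; BR counts as Abroad here
income = {"High": 1, "Low": 0}
visit = {"New": 1, "Returning": 0}
clicked = {"Y": 1, "N": 0}

# The four rows from the table (Referrer omitted: no numeric encoding is given).
raw = [
    ("F", "Young",  "US", "High", "New",       "N"),
    ("M", "Middle", "US", "Low",  "New",       "N"),
    ("F", "Old",    "BR", "Low",  "Returning", "Y"),
    ("M", "Young",  "BR", "Low",  "Returning", "N"),
]

X = [[gender[g], age[a], location[l], income[i], visit[v]] for g, a, l, i, v, _ in raw]
y = [clicked[c] for *_, c in raw]
print(X)   # [[1, 0, 1, 1, 1], [0, 1, 1, 0, 1], [1, 2, 0, 0, 0], [0, 0, 0, 0, 0]]
print(y)   # [0, 0, 1, 0]
```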

Page 19

Gradient Ascent for LR

Gradient ascent algorithm: iterate until change < ε

  For i = 1, …, k:

    w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \sum_{j=1}^N x_i^j \left( y^j - P(Y=1 \mid x^j, w^{(t)}) \right)

  repeat

19 ©Carlos Guestrin 2005-2014

Page 20

Regularization in linear regression

- Overfitting usually leads to very large parameter choices, e.g.:

    -2.2 + 3.1 X – 0.30 X^2
    -1.1 + 4,700,910.7 X – 8,585,638.4 X^2 + …

- Regularized least-squares (a.k.a. ridge regression), for λ > 0: add an L2 penalty λ‖w‖₂² on the weights to the squared-error objective

20 ©Carlos Guestrin 2005-2014

Page 21

©Carlos Guestrin 2005-2014 21

Linear Separability

Page 22

22

Large parameters → Overfitting

- If data is linearly separable, weights go to infinity
  - In general, leads to overfitting
- Penalizing high weights can prevent overfitting…

©Carlos Guestrin 2005-2014

Page 23

Regularized Conditional Log Likelihood

- Add regularization penalty, e.g., L2:

  \ell(w) = \ln \prod_{j=1}^N P(y^j \mid x^j, w) - \frac{\lambda}{2} \|w\|_2^2

- Practical note about w0:

- Gradient of regularized likelihood:

©Carlos Guestrin 2005-2014 23
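A Python sketch of this regularized objective and its gradient (not from the slides). Leaving w0 unpenalized is a common convention assumed here, since the slide's handwritten note about w0 is not recoverable:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_ll_and_grad(X, y, w, w0, lam):
    """l(w) = sum_j ln P(y_j|x_j,w) - (lam/2)*||w||^2, plus its gradient.

    w0 is left unpenalized -- an assumption, not taken from the slide.
    """
    p1 = sigmoid(w0 + X @ w)
    ll = np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1)) - 0.5 * lam * np.dot(w, w)
    residual = y - p1
    grad_w = X.T @ residual - lam * w     # gradient of the penalized likelihood
    grad_w0 = residual.sum()
    return ll, grad_w, grad_w0

X = np.array([[1.0, 2.0], [0.5, -1.0]])
y = np.array([1.0, 0.0])
print(regularized_ll_and_grad(X, y, np.zeros(2), 0.0, lam=0.1)[0])
```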

Page 24

24

Standard v. Regularized Updates

- Maximum conditional likelihood estimate:

  w^* = \arg\max_w \ln \prod_{j=1}^N P(y^j \mid x^j, w)

- Regularized maximum conditional likelihood estimate:

  w^* = \arg\max_w \left[ \ln \prod_{j=1}^N P(y^j \mid x^j, w) - \frac{\lambda}{2} \sum_{i=1}^k w_i^2 \right]

©Carlos Guestrin 2005-2014

Page 25

Please Stop!! Stopping criterion

- When do we stop doing gradient descent?

- Because ℓ(w) is strongly concave (i.e., because of a technical condition on the regularized objective):

  \ell(w) = \ln \prod_j P(y^j \mid x^j, w) - \lambda \|w\|_2^2

  \ell(w^*) - \ell(w) \le \frac{1}{2\lambda} \|\nabla \ell(w)\|_2^2

- Thus, stop when \|\nabla \ell(w)\|_2^2 is small enough that the bound guarantees we are close to the optimum.

©Carlos Guestrin 2005-2014 25
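A sketch of a stopping rule based on this bound (not from the slides; the step size, tolerance, and toy objective are assumptions):

```python
import numpy as np

def ascend_until_small_gradient(grad_fn, w, lam, eta=0.1, eps=1e-6, max_iters=10_000):
    """Gradient ascent that stops once (1/(2*lam)) * ||grad||^2 <= eps,
    i.e., once the bound l(w*) - l(w) <= (1/(2*lam)) * ||grad l(w)||^2
    guarantees we are within eps of the optimum."""
    for _ in range(max_iters):
        g = grad_fn(w)
        if np.dot(g, g) / (2.0 * lam) <= eps:
            break
        w = w + eta * g
    return w

# Toy strongly concave objective: l(w) = -(lam/2)*||w - 1||^2, gradient = -lam*(w - 1)
lam = 0.5
w_opt = ascend_until_small_gradient(lambda w: -lam * (w - 1.0), np.zeros(3), lam)
print(w_opt)   # close to [1, 1, 1]
```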

Page 26

Digression: Logistic regression for more than 2 classes

- Logistic regression in more general case (C classes), where Y in {0, …, C-1}

26 ©Carlos Guestrin 2005-2014

Page 27

Digression: Logistic regression more generally

- Logistic regression in more general case, where Y in {0, …, C-1}

  For c > 0:

  P(Y = c \mid x, w) = \frac{\exp\left(w_{c0} + \sum_{i=1}^k w_{ci} x_i\right)}{1 + \sum_{c'=1}^{C-1} \exp\left(w_{c'0} + \sum_{i=1}^k w_{c'i} x_i\right)}

  For c = 0 (normalization, so no weights for this class):

  P(Y = 0 \mid x, w) = \frac{1}{1 + \sum_{c'=1}^{C-1} \exp\left(w_{c'0} + \sum_{i=1}^k w_{c'i} x_i\right)}

Learning procedure is basically the same as what we derived!

27 ©Carlos Guestrin 2005-2014
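A Python sketch of these C-class probabilities with class 0 as the reference class (not from the slides; the shapes, names, and toy values are illustrative):

```python
import numpy as np

def multiclass_probs(x, W, b):
    """P(Y=c | x, w) for c = 0..C-1, with class 0 as the reference class.

    W has shape (C-1, k) and b has shape (C-1,): weights for classes 1..C-1;
    class 0 carries no weights (it serves as the normalization).
    """
    scores = np.exp(b + W @ x)                  # exp(w_c0 + sum_i w_ci x_i), c = 1..C-1
    denom = 1.0 + scores.sum()                  # 1 + sum over classes 1..C-1
    return np.concatenate(([1.0 / denom], scores / denom))

x = np.array([0.5, -1.0, 2.0])
W = np.array([[0.1, 0.2, -0.3],                 # class 1
              [-0.5, 0.4, 0.0]])                # class 2
b = np.array([0.0, 0.1])
p = multiclass_probs(x, W, b)
print(p, p.sum())                               # probabilities for classes 0, 1, 2; sums to 1
```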

Page 28

28

Stochastic Gradient Descent

Machine Learning – CSEP546 Carlos Guestrin University of Washington

January 27, 2014 ©Carlos Guestrin 2005-2014

Page 29

The Cost, The Cost!!! Think about the cost…

- What's the cost of a gradient update step for LR???
  Each batch update sums over all N data points for each of the k features, so one step costs O(Nk): expensive when N is large.

©Carlos Guestrin 2005-2014 29

Page 30

Learning Problems as Expectations

- Minimizing loss in training data:
  - Given dataset: sampled iid from some distribution p(x) on features
  - Loss function, e.g., squared error, logistic loss, …
  - We often minimize loss in training data:

    \ell_D(w) = \frac{1}{N} \sum_{j=1}^N \ell(w, x^j)

- However, we should really minimize expected loss on all data:

    \ell(w) = E_x[\ell(w, x)] = \int p(x)\, \ell(w, x)\, dx

- So, we are approximating the integral by the average on the training data

©Carlos Guestrin 2005-2014 30
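A toy numerical illustration of this approximation (everything below is made up for the example: a Gaussian p(x) and a squared loss, chosen only because the expectation has a closed form to compare against):

```python
import numpy as np

# Empirical loss (average over a sample) approximating the expected loss (an integral).
rng = np.random.default_rng(0)

def loss(w, x):
    return (w - x) ** 2              # a toy per-example loss

w = 0.3
x_train = rng.normal(loc=1.0, scale=1.0, size=1_000)   # x ~ p(x) = N(1, 1)
empirical = np.mean(loss(w, x_train))                   # (1/N) sum_j loss(w, x_j)
expected = (w - 1.0) ** 2 + 1.0                         # E_x[(w - x)^2] = (w - mu)^2 + sigma^2
print(empirical, expected)                              # close, and closer as N grows
```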

Page 31

SGD: Stochastic Gradient Ascent (or Descent)

- "True" gradient:

    \nabla \ell(w) = E_x[\nabla \ell(w, x)]

- Sample-based approximation:

    \nabla \ell(w) \approx \frac{1}{N} \sum_{j=1}^N \nabla \ell(w, x^j)

- What if we estimate the gradient with just one sample???
  - Unbiased estimate of gradient
  - Very noisy!
  - Called stochastic gradient ascent (or descent)
    - Among many other names
  - VERY useful in practice!!!

©Carlos Guestrin 2005-2014 31

Page 32

Stochastic Gradient Ascent for Logistic Regression

- Logistic loss as a stochastic function:

    E_x[\ell(w, x)] = E_x\left[ \ln P(y \mid x, w) - \lambda \|w\|_2^2 \right]

- Batch gradient ascent updates:

    w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \left\{ -\lambda w_i^{(t)} + \frac{1}{N} \sum_{j=1}^N x_i^{(j)} \left[ y^{(j)} - P(Y=1 \mid x^{(j)}, w^{(t)}) \right] \right\}

- Stochastic gradient ascent updates:
  - Online setting:

    w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)}) \right] \right\}

©Carlos Guestrin 2005-2014 32
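A Python sketch of the stochastic update above, one data point at a time (not from the slides; the constant step size, λ value, treatment of the intercept, and toy data are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, lam=0.1, eta=0.1, epochs=5, seed=0):
    """Stochastic gradient ascent for L2-regularized logistic regression.

    Per-example update (constant step size assumed):
      w_i <- w_i + eta * ( -lam * w_i + x_i * (y - P(Y=1|x, w)) )
    The intercept w0 is updated without the -lam term (an assumption here).
    """
    rng = np.random.default_rng(seed)
    N, k = X.shape
    w, w0 = np.zeros(k), 0.0
    for _ in range(epochs):
        for j in rng.permutation(N):            # visit points in random order
            residual = y[j] - sigmoid(w0 + X[j] @ w)
            w += eta * (-lam * w + X[j] * residual)
            w0 += eta * residual
    return w, w0

X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 2.0]])
y = np.array([1.0, 0.0, 0.0, 1.0])
print(sgd_logistic(X, y))
```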

Page 33

Stochastic Gradient Descent for LR: Intuition

1. Until convergence: get a data point
   a. Encode data as numbers
   b. For each feature:
      i. Compute gradient for this data point
      ii. Update parameter

©Carlos Guestrin 2005-2014 33

Gender | Age    | Location | Income | Referrer | New or Returning | Clicked?
-------|--------|----------|--------|----------|------------------|---------
F      | Young  | US       | High   | Google   | New              | N
M      | Middle | US       | Low    | Direct   | New              | N
F      | Old    | BR       | Low    | Google   | Returning        | Y
M      | Young  | BR       | Low    | Bing     | Returning        | N

Encodings: Gender (F=1, M=0); Age (Young=0, Middle=1, Old=2); Location (US=1, Abroad=0); Income (High=1, Low=0); Referrer; New or Returning (New=1, Returning=0); Clicked? (Click=1, NoClick=0)

  w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \left\{ -\lambda w_i^{(t)} + x_i^{(t)} \left[ y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)}) \right] \right\}

Page 34

Stochastic Gradient Ascent: general case

- Given a stochastic function of parameters:
  - Want to find the maximum

- Start from w^{(0)}
- Repeat until convergence:
  - Get a sample data point x^t
  - Update parameters:

    w^{(t+1)} \leftarrow w^{(t)} + \eta_t \nabla \ell(w^{(t)}, x^t)

- Works in the online learning setting!
- Complexity of each gradient step is constant in the number of examples!
- In general, the step size changes with iterations

©Carlos Guestrin 2005-2014 34

Page 35

What you should know…

- Classification: predict discrete classes rather than real values
- Logistic regression model: linear model
  - Logistic function maps real values to [0,1]
- Optimize conditional likelihood
- Gradient computation
- Overfitting
- Regularization
- Regularized optimization
- Cost of gradient step is high, use stochastic gradient descent

35 ©Carlos Guestrin 2005-2014

Page 36

36

What’s the Perceptron Optimizing?

Machine Learning – CSEP546 Carlos Guestrin University of Washington

January 27, 2014 ©Carlos Guestrin 2005-2014

Page 37

Remember our friend the Perceptron Algorithm

- At each time step:
  - Observe a data point: (x^{(t)}, y^{(t)}), with y^{(t)} in {-1, +1}
  - Update parameters if we make a mistake (i.e., if y^{(t)} (w^{(t)} \cdot x^{(t)}) \le 0):

    w^{(t+1)} \leftarrow w^{(t)} + y^{(t)} x^{(t)}

©Carlos Guestrin 2005-2014 37
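A minimal Python sketch of this mistake-driven loop (not from the slides; the labels in {-1, +1}, the fixed number of passes, folding the bias in as a constant feature, and the toy data are assumptions):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Mistake-driven Perceptron; labels y in {-1, +1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a constant-1 feature for the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xj, yj in zip(Xb, y):
            if yj * (w @ xj) <= 0:                   # mistake (or on the boundary)
                w += yj * xj                         # Perceptron update
    return w

X = np.array([[2.0, 2.0], [1.0, 0.5], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
print(perceptron(X, y))
```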

Page 38

What is the Perceptron Doing???

- When we discussed logistic regression:
  - Started from maximizing conditional log-likelihood

- When we discussed the Perceptron:
  - Started from description of an algorithm

- What is the Perceptron optimizing????

©Carlos Guestrin 2005-2014 38

Page 39

©Carlos Guestrin 2005-2014 39

Perceptron Prediction: Margin of Confidence

Page 40

Hinge Loss

- Perceptron prediction: \hat{y} = \mathrm{sign}(w \cdot x)

- Makes a mistake when: y (w \cdot x) \le 0

- Hinge loss (same as maximizing the margin used by SVMs):

    \ell(w, (x, y)) = \left( -y (w \cdot x) \right)_+ = \max\left(0, -y (w \cdot x)\right)

©Carlos Guestrin 2005-2014 40

Page 41

Stochastic Gradient Descent for Hinge Loss

- SGD: observe data point x^{(t)}, update each parameter:

    w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_t \frac{\partial \ell(w^{(t)}, x^{(t)})}{\partial w_i}

- How do we compute the gradient for hinge loss?

©Carlos Guestrin 2005-2014 41

Page 42

(Sub)gradient of Hinge

- Hinge loss:  \ell(w, x^{(t)}) = \max\left(0, -y^{(t)} (w \cdot x^{(t)})\right)

- Subgradient of hinge loss:
  - If y^{(t)} (w \cdot x^{(t)}) > 0:  0
  - If y^{(t)} (w \cdot x^{(t)}) < 0:  -y^{(t)} x^{(t)}
  - If y^{(t)} (w \cdot x^{(t)}) = 0:  anything in between (e.g., 0 or -y^{(t)} x^{(t)})
  - In one line:  -\left[ y^{(t)} (w^{(t)} \cdot x^{(t)}) \le 0 \right] y^{(t)} x^{(t)}

- Plugged into the SGD update:

    w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_t \frac{\partial \ell(w^{(t)}, x^{(t)})}{\partial w_i}

©Carlos Guestrin 2005-2014 42
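A small Python sketch of these subgradient cases for the per-example hinge loss max(0, -y (w . x)) (not from the slides; choosing -y*x at the kink is one valid choice, and the toy numbers are illustrative):

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """A subgradient of l(w) = max(0, -y * (w . x)) with respect to w.

    If y*(w.x) > 0: 0 (no mistake, flat region).
    If y*(w.x) < 0: -y*x.
    If y*(w.x) = 0: any point in between is valid; we pick -y*x here.
    """
    margin = y * np.dot(w, x)
    if margin > 0:
        return np.zeros_like(w)
    return -y * x            # the "in one line" case: -[y*(w.x) <= 0] * y * x

w = np.array([0.2, -0.1])
x = np.array([1.0, 3.0])
print(hinge_subgradient(w, x, y=+1))   # margin = -0.1 < 0, so returns -x
```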

Page 43

Stochastic Gradient Descent for Hinge Loss

- SGD: observe data point x^{(t)}, update each parameter:

    w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta_t \frac{\partial \ell(w^{(t)}, x^{(t)})}{\partial w_i}

- How do we compute the gradient for hinge loss?

©Carlos Guestrin 2005-2014 43

Page 44

Perceptron Revisited

- SGD for hinge loss:

    w^{(t+1)} \leftarrow w^{(t)} + \eta_t \left[ y^{(t)} (w^{(t)} \cdot x^{(t)}) \le 0 \right] y^{(t)} x^{(t)}

- Perceptron update:

    w^{(t+1)} \leftarrow w^{(t)} + \left[ y^{(t)} (w^{(t)} \cdot x^{(t)}) \le 0 \right] y^{(t)} x^{(t)}

- Difference? Only the step size η_t: the Perceptron is SGD on the hinge loss with a step size of 1.

©Carlos Guestrin 2005-2014 44

Page 45

What you need to know

- Perceptron is optimizing hinge loss
- Subgradients and hinge loss
- (Sub)gradient descent for hinge objective

©Carlos Guestrin 2005-2014 45

Page 46

46

Support Vector Machines

Machine Learning – CSEP546 Carlos Guestrin University of Washington

January 27, 2014 ©Carlos Guestrin 2005-2014

Page 47

Support Vector Machines

- One of the most effective classifiers to date!
- Popularized kernels

- There is a complicated derivation, but…
- Very simple based on what you’ve learned thus far!

©Carlos Guestrin 2005-2014 47

Page 48

©Carlos Guestrin 2005-2014 48

Linear classifiers – Which line is better?

Page 49

©Carlos Guestrin 2005-2014 49

Pick the one with the largest margin!

    w \cdot x + w_0 = 0

    “confidence” = y^j (w \cdot x^j + w_0)

Page 50

©Carlos Guestrin 2005-2014 50

Maximize the margin

    w \cdot x + w_0 = 0

Page 51

SVMs = Hinge Loss + L2 Regularization

- Maximizing the margin is the same as regularized hinge loss

- But, the SVM “convention” is that the confidence has to be at least 1…

    w \cdot x + w_0 = +1
    w \cdot x + w_0 = -1
    w \cdot x + w_0 = 0

©Carlos Guestrin 2005-2014 51

Page 52

L2 Regularized Hinge Loss

- Final objective, adding regularization:

    \frac{\|w\|_2^2}{2} + C \sum_{j=1}^N \left( 1 - y^j (w \cdot x^j + w_0) \right)_+

- But, again, in SVMs, convention slightly different (but equivalent)

©Carlos Guestrin 2005-2014 52
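A Python sketch that evaluates this objective (not from the slides; labels in {-1, +1}, the value of C, and the toy data are assumptions):

```python
import numpy as np

def svm_objective(X, y, w, w0, C=1.0):
    """||w||^2 / 2 + C * sum_j max(0, 1 - y_j * (w . x_j + w0)), labels y in {-1, +1}."""
    margins = y * (X @ w + w0)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * hinge.sum()

X = np.array([[1.0, 2.0], [-1.0, -1.5], [2.0, 0.5]])
y = np.array([+1, -1, +1])
print(svm_objective(X, y, w=np.array([0.5, 0.5]), w0=0.0))
```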

Page 53

SVMs for the Non-Linearly Separable case: meet my friend the Perceptron…

- The Perceptron was minimizing the hinge loss:

    \sum_{j=1}^N \left( -y^j (w \cdot x^j + w_0) \right)_+

- SVMs minimize the regularized hinge loss!!

    \|w\|_2^2 + C \sum_{j=1}^N \left( 1 - y^j (w \cdot x^j + w_0) \right)_+

©Carlos Guestrin 2005-2014 53

Page 54

Stochastic Gradient Descent for SVMs

- Perceptron minimization:

    \sum_{j=1}^N \left( -y^j (w \cdot x^j + w_0) \right)_+

- SGD for Perceptron:

    w^{(t+1)} \leftarrow w^{(t)} + \left[ y^{(t)} (w^{(t)} \cdot x^{(t)}) \le 0 \right] y^{(t)} x^{(t)}

- SVMs minimization:

    \|w\|_2^2 + C \sum_{j=1}^N \left( 1 - y^j (w \cdot x^j + w_0) \right)_+

- SGD for SVMs: see the sketch below

©Carlos Guestrin 2005-2014 54
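The slide leaves the SGD-for-SVMs update blank. One possible per-example update, sketched in Python under stated assumptions (splitting the regularizer's gradient evenly across the N examples and using a constant step size are choices made here, not taken from the slides):

```python
import numpy as np

def sgd_svm(X, y, C=1.0, eta=0.1, epochs=10, seed=0):
    """One possible SGD on ||w||^2 + C * sum_j max(0, 1 - y_j*(w.x_j + w0)), y in {-1, +1}.

    Per-example sketch: the regularizer's gradient (2w) is scaled by 1/N so that one pass
    through the data matches the full objective; this and the constant step size are
    assumptions, not taken from the slide.
    """
    rng = np.random.default_rng(seed)
    N, k = X.shape
    w, w0 = np.zeros(k), 0.0
    for _ in range(epochs):
        for j in rng.permutation(N):
            grad_w = (2.0 / N) * w                     # per-example share of d||w||^2/dw
            grad_w0 = 0.0
            if y[j] * (X[j] @ w + w0) < 1:             # inside the margin: hinge is active
                grad_w += -C * y[j] * X[j]
                grad_w0 += -C * y[j]
            w -= eta * grad_w                          # descent on the objective
            w0 -= eta * grad_w0
    return w, w0

X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
print(sgd_svm(X, y))
```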

Page 55

©Carlos Guestrin 2005-2014 55

What you need to know

- Maximizing margin
- Derivation of SVM formulation
- Non-linearly separable case
  - Hinge loss
  - A.K.A. adding slack variables
- SVMs = Perceptron + L2 regularization
- Can also use kernels with SVMs
- Can optimize SVMs with SGD
  - Many other approaches possible

Page 56

56

Boosting

Machine Learning – CSEP546 Carlos Guestrin University of Washington

January 27, 2014 ©Carlos Guestrin 2005-2014

Page 57

57

Fighting the bias-variance tradeoff

- Simple (a.k.a. weak) learners are good
  - e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees)
  - Low variance, don’t usually overfit too badly

- Simple (a.k.a. weak) learners are bad
  - High bias, can’t solve hard learning problems

- Can we make weak learners always good???
  - No!!!
  - But often yes…

©Carlos Guestrin 2005-2014

Page 58

The Simplest Weak Learner: Thresholding, a.k.a. Decision Stumps

- Learn: h: X → Y
  - X – features
  - Y – target classes

- Simplest case: Thresholding

©Carlos Guestrin 2005-2014 58

Page 59

59

Voting (Ensemble Methods)

- Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space
- Output class: (Weighted) vote of each classifier
  - Classifiers that are most “sure” will vote with more conviction
  - Classifiers will be most “sure” about a particular part of the space
  - On average, do better than single classifier!

- But how do you ???
  - force classifiers to learn about different parts of the input space?
  - weigh the votes of different classifiers?

©Carlos Guestrin 2005-2014
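A small Python sketch of a weighted vote of weak classifiers (the predictions and weights below are made up for illustration; how to choose the weights well is exactly the question the slide poses):

```python
import numpy as np

def weighted_vote(predictions, alphas):
    """Combine weak classifiers' predictions (each in {-1, +1}) by a weighted vote.

    predictions: array of shape (num_classifiers, num_points); alphas: per-classifier weights.
    """
    return np.sign(alphas @ predictions)

# Three weak classifiers voting on four points; "surer" classifiers get larger weights.
preds = np.array([[+1, +1, -1, -1],
                  [+1, -1, -1, +1],
                  [-1, +1, -1, -1]])
alphas = np.array([0.7, 0.2, 0.4])
print(weighted_vote(preds, alphas))   # [ 1.  1. -1. -1.]
```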