Machine Learning
10-701/15-781, Spring 2008
Logistic Regression
-- generative versus discriminative
classifier
Le Song
Lecture 5, September 4, 2012
Based on slides from Eric Xing, CMU
Reading: Chap. 3.1.3-4 CB
Generative vs. Discriminative
Classifiers
Goal: Wish to learn f: X → Y, e.g., P(Y|X)
Generative classifiers (e.g., Naïve Bayes):
Assume some functional form for P(X|Y), P(Y)
This is a ‘generative’ model of the data!
Estimate parameters of P(X|Y), P(Y) directly from training data
Use Bayes rule to calculate P(Y|X= xi)
Discriminative classifiers:
Directly assume some functional form for P(Y|X)
This is a ‘discriminative’ model of the data!
Estimate parameters of P(Y|X) directly from training data
[Graphical models: generative classifier with arrow Y_i → X_i; discriminative classifier with arrow X_i → Y_i]
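To make the contrast concrete, here is a minimal Python sketch (not part of the original slides) that fits one classifier of each kind on synthetic 2-D data. It assumes NumPy and scikit-learn are available; the data-generating parameters are arbitrary.

import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: models P(X|Y) and P(Y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(Y|X) directly

rng = np.random.default_rng(0)
# Two Gaussian classes in 2-D (illustrative toy data)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

gen = GaussianNB().fit(X, y)              # estimates class priors and per-class feature Gaussians
disc = LogisticRegression().fit(X, y)     # estimates the P(Y|X) parameters directly

x_new = np.array([[1.0, 1.0]])
print("generative     P(Y|x):", gen.predict_proba(x_new))
print("discriminative P(Y|x):", disc.predict_proba(x_new))

Both expose P(Y | X = x) at prediction time; they differ in what is modeled during training.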
Consider a Gaussian
Generative Classifier
learning f: X → Y, where
X is a vector of real-valued features, < X1…Xn >
Y is boolean
What does that imply about the form of P(Y|X)?
The joint probability of a datum and its label is:
p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu_k, \sigma) = \pi_k \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big( -\frac{(x_n - \mu_k)^2}{2\sigma^2} \Big)
Given a datum xn, we predict its label using the conditional probability of the label
given the datum:
p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{\pi_k \exp\!\big( -\frac{(x_n - \mu_k)^2}{2\sigma^2} \big)}{\sum_{k'} \pi_{k'} \exp\!\big( -\frac{(x_n - \mu_{k'})^2}{2\sigma^2} \big)}
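As a quick numeric illustration of the two formulas above (a sketch, not from the lecture; the priors, means, and variance below are made-up values), the posterior follows from the joint by Bayes rule:

import numpy as np

pi = np.array([0.4, 0.6])     # class priors p(y_n^k = 1)
mu = np.array([-1.0, 2.0])    # class-conditional means mu_k
sigma = 1.0                   # shared standard deviation

def class_posterior(x_n):
    # joint: p(x_n, y_n^k = 1) = pi_k * N(x_n | mu_k, sigma^2)
    joint = pi * np.exp(-(x_n - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    # conditional: normalize over classes (Bayes rule)
    return joint / joint.sum()

print(class_posterior(0.3))   # two posterior probabilities summing to 1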
Naïve Bayes Classifier
When X is a multivariate Gaussian vector:
The joint probability of a datum and its label is:
p(x_n, y_n^k = 1 \mid \mu, \Sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu, \Sigma) = \pi_k \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\!\big( -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma^{-1} (x_n - \mu_k) \big)
The naïve Bayes simplification:
p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \prod_j p(x_n^j \mid y_n^k = 1, \mu_{k,j}, \sigma_{k,j}) = \pi_k \prod_j \frac{1}{\sqrt{2\pi\sigma_{k,j}^2}} \exp\!\Big( -\frac{(x_n^j - \mu_{k,j})^2}{2\sigma_{k,j}^2} \Big)
More generally:
p(x_n, y_n \mid \eta, \pi) = p(y_n \mid \pi) \prod_{j=1}^{m} p(x_n^j \mid y_n, \eta)
where p(\cdot \mid \cdot) is an arbitrary conditional (discrete or continuous) 1-D density.
[Graphical model: label Y_i with children X_i^1, X_i^2, ..., X_i^m]
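The naïve Bayes simplification can be evaluated feature by feature; the following sketch (all parameter values are illustrative assumptions) computes p(x_n, y_n^k = 1) as pi_k times a product of 1-D Gaussians, one per feature.

import numpy as np

pi = np.array([0.5, 0.5])                    # p(y_n^k = 1)
mu = np.array([[0.0, 0.0, 0.0],              # mu_{k,j}: one row per class, one column per feature
               [1.0, 2.0, 3.0]])
sigma = np.array([[1.0, 1.0, 1.0],           # sigma_{k,j}
                  [1.0, 0.5, 2.0]])

def nb_joint(x_n):
    # p(x_n, y_n^k = 1) = pi_k * prod_j N(x_n^j | mu_{k,j}, sigma_{k,j}^2)
    per_feature = np.exp(-(x_n - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return pi * per_feature.prod(axis=1)

x_n = np.array([0.5, 1.5, 2.5])
print(nb_joint(x_n))                         # one joint value per class k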
The predictive distribution
Understanding the predictive distribution:
p(y_n^k = 1 \mid x_n, \mu, \sigma, \pi) = \frac{p(y_n^k = 1, x_n \mid \mu, \sigma, \pi)}{p(x_n \mid \mu, \sigma)} = \frac{\pi_k\, N(x_n \mid \mu_k, \sigma_k)}{\sum_{k'} \pi_{k'}\, N(x_n \mid \mu_{k'}, \sigma_{k'})}   (*)
Under the naïve Bayes assumption:
p(y_n^k = 1 \mid x_n, \mu, \sigma, \pi) = \frac{\exp\!\big( -\sum_j \frac{(x_n^j - \mu_{k,j})^2}{2\sigma_{k,j}^2} + \log \pi_k + C_k \big)}{\sum_{k'} \exp\!\big( -\sum_j \frac{(x_n^j - \mu_{k',j})^2}{2\sigma_{k',j}^2} + \log \pi_{k'} + C_{k'} \big)}   (**)
For two classes (i.e., K = 2), and when the two classes have the same variance, (**) turns out to be the logistic function:
p(y_n^1 = 1 \mid x_n) = \frac{1}{1 + \exp\!\big( \sum_j x_n^j \frac{\mu_{2,j} - \mu_{1,j}}{\sigma_j^2} + \sum_j \frac{\mu_{1,j}^2 - \mu_{2,j}^2}{2\sigma_j^2} + \log\frac{\pi_2}{\pi_1} \big)} = \frac{1}{1 + e^{-\theta^T x_n}}
where \theta^T x_n = \theta_0 + \sum_j \theta_j x_n^j, with \theta_j = \frac{\mu_{1,j} - \mu_{2,j}}{\sigma_j^2} and \theta_0 = \log\frac{\pi_1}{\pi_2} - \sum_j \frac{\mu_{1,j}^2 - \mu_{2,j}^2}{2\sigma_j^2}.
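The reduction of (**) to a logistic function can be checked numerically. The sketch below (all parameter values are arbitrary assumptions) computes the two-class Gaussian naïve Bayes posterior directly and via the sigmoid-of-a-linear-function form; the two results agree.

import numpy as np

pi = np.array([0.3, 0.7])
mu = np.array([[0.0, 1.0],       # class 1 means mu_{1,j}
               [2.0, -1.0]])     # class 2 means mu_{2,j}
sigma2 = np.array([1.5, 0.5])    # shared (class-independent) variances sigma_j^2

def gnb_posterior_class1(x):
    # log pi_k - sum_j (x_j - mu_{k,j})^2 / (2 sigma_j^2); the shared Gaussian
    # normalizing constants are identical for both classes and cancel below
    log_joint = np.log(pi) - ((x - mu) ** 2 / (2 * sigma2)).sum(axis=1)
    joint = np.exp(log_joint)
    return joint[0] / joint.sum()

# Equivalent logistic form: p(y^1 = 1 | x) = 1 / (1 + exp(-(theta_0 + theta^T x)))
theta = (mu[0] - mu[1]) / sigma2
theta0 = np.log(pi[0] / pi[1]) - ((mu[0] ** 2 - mu[1] ** 2) / (2 * sigma2)).sum()

x = np.array([0.7, -0.2])
print(gnb_posterior_class1(x))
print(1.0 / (1.0 + np.exp(-(theta0 + theta @ x))))   # same value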
The decision boundary
The predictive distribution:
p(y_n^1 = 1 \mid x_n) = \mu(x_n) = \frac{1}{1 + \exp\!\big( -\theta_0 - \sum_{j=1}^{M} \theta_j x_n^j \big)} = \frac{1}{1 + e^{-\theta^T x_n}}
The Bayes decision rule:
\ln \frac{p(y_n^1 = 1 \mid x_n)}{p(y_n^2 = 1 \mid x_n)} = \ln \frac{\frac{1}{1 + e^{-\theta^T x_n}}}{\frac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}}} = \theta^T x_n
so we predict class 1 when \theta^T x_n > 0; the decision boundary is \theta^T x_n = 0.
For multiple classes (i.e., K > 2), (*) corresponds to a softmax function:
p(y_n^k = 1 \mid x_n) = \frac{e^{\theta_k^T x_n}}{\sum_j e^{\theta_j^T x_n}}
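A small sketch of the K-class softmax predictive distribution (the theta_k vectors below are arbitrary assumptions); subtracting the maximum score before exponentiating is a standard numerical-stability step and does not change the probabilities.

import numpy as np

def softmax_posterior(theta, x_n):
    # p(y_n^k = 1 | x_n) = exp(theta_k^T x_n) / sum_j exp(theta_j^T x_n)
    scores = theta @ x_n
    scores = scores - scores.max()       # stability: shift scores before exponentiating
    p = np.exp(scores)
    return p / p.sum()

theta = np.array([[ 1.0, -0.5],          # one row theta_k per class
                  [ 0.2,  0.8],
                  [-1.0,  0.3]])
x_n = np.array([0.4, 1.2])
print(softmax_posterior(theta, x_n))     # K probabilities summing to 1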
Discussion: Generative and
discriminative classifiers
Generative:
Modeling the joint distribution
of all data
Discriminative:
Modeling only points
at the boundary
How? Regression!
Linear regression
The data: \{(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_N, y_N)\}
Both nodes are observed: X is an input vector
Y is a response vector
(we first consider y as a generic
continuous response vector, then
we consider the special case of
classification where y is a discrete
indicator)
A regression scheme can be
used to model p(y|x) directly,
rather than p(x,y)
[Graphical model: plate over the N observed pairs (X_i, Y_i)]
Logistic regression (sigmoid
classifier)
The conditional distribution is a Bernoulli:
p(y \mid x) = \mu(x)^y \big(1 - \mu(x)\big)^{1-y}
where \mu is a logistic function:
\mu(x) = \frac{1}{1 + e^{-\theta^T x}}
We can use the brute-force gradient method as in LR
But we can also apply generic laws by observing that p(y|x) is
an exponential family function, more specifically, a
generalized linear model (see next lecture!)
Training Logistic Regression:
MCLE
Estimate parameters \theta = \langle \theta_0, \theta_1, \ldots, \theta_n \rangle to maximize the
conditional likelihood of training data
Training data D = \{(x_1, y_1), \ldots, (x_N, y_N)\}
Data likelihood = \prod_n p(x_n, y_n \mid \theta)
Data conditional likelihood = \prod_n p(y_n \mid x_n, \theta)
Expressing Conditional Log
Likelihood
Recall the logistic function:
\mu(x) = \frac{1}{1 + e^{-\theta^T x}}
and conditional likelihood:
l(\theta) = \log \prod_n p(y_n \mid x_n, \theta) = \sum_n \big[ y_n \log \mu(x_n) + (1 - y_n) \log(1 - \mu(x_n)) \big]
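A minimal sketch (with made-up data) of evaluating this conditional log likelihood for a candidate theta; the leading column of ones plays the role of the intercept term theta_0.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def conditional_log_likelihood(theta, X, y):
    # l(theta) = sum_n [ y_n log mu(x_n) + (1 - y_n) log(1 - mu(x_n)) ]
    mu = sigmoid(X @ theta)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])   # first column = intercept feature
y = np.array([1.0, 0.0, 1.0])
print(conditional_log_likelihood(np.array([0.1, 0.8]), X, y))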
Maximizing Conditional Log
Likelihood
The objective: \theta^{MCLE} = \arg\max_\theta l(\theta) = \arg\max_\theta \sum_n \log p(y_n \mid x_n, \theta)
Good news: l(\theta) is a concave function of \theta
Bad news: no closed-form solution to maximize l(\theta)
[Figure: concavity of l(\theta) — the chord value \tfrac{1}{3} l(\theta_1) + \tfrac{2}{3} l(\theta_2) lies below the function value l(\tfrac{1}{3}\theta_1 + \tfrac{2}{3}\theta_2)]
Gradient Ascent
Property of the sigmoid function:
\mu(t) = \frac{1}{1 + e^{-t}}, \qquad \frac{d\mu}{dt} = \mu(1 - \mu)
The gradient:
\frac{\partial l(\theta)}{\partial \theta_j} = \sum_i (y_i - 1)\, x_i^j + \sum_i \big(1 - \mu(\theta^T x_i)\big)\, x_i^j = \sum_i \big(y_i - \mu(\theta^T x_i)\big)\, x_i^j
The gradient ascent algorithm iterates until the change is < \varepsilon: repeat, for all j,
\theta_j \leftarrow \theta_j + \eta \sum_i \big(y_i - \mu(\theta^T x_i)\big)\, x_i^j
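A minimal NumPy sketch of this batch gradient-ascent loop; the step size, stopping threshold, and synthetic data are illustrative assumptions rather than values from the lecture.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, eta=0.1, eps=1e-6, max_iter=10000):
    # X: (N, d) with a leading column of ones for the intercept; y: (N,) in {0, 1}
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (y - sigmoid(X @ theta))     # dl/dtheta_j = sum_i (y_i - mu_i) x_i^j
        theta_new = theta + eta * grad / len(y)   # averaged gradient for a stable step size
        if np.abs(theta_new - theta).max() < eps:
            return theta_new
        theta = theta_new
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_theta)).astype(float)
print(fit_logistic(X, y))                         # roughly recovers true_theta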
How about MAP?
It is very common to use regularized maximum likelihood.
One common approach is to define priors on \theta – Normal distribution, zero mean, identity covariance
Helps avoid very large weights and overfitting
MAP estimate: \theta^{MAP} = \arg\max_\theta \log \big[ p(\theta) \prod_n p(y_n \mid x_n, \theta) \big]
MLE vs MAP
Maximum conditional likelihood estimate: \theta^{MCLE} = \arg\max_\theta \prod_n p(y_n \mid x_n, \theta)
Maximum a posteriori estimate: \theta^{MAP} = \arg\max_\theta p(\theta \mid X, Y) = \arg\max_\theta p(\theta) \prod_n p(y_n \mid x_n, \theta)
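With the zero-mean Gaussian prior above, the MAP gradient is just the MLE gradient plus a -lambda * theta shrinkage term (lambda is set by the prior variance). A small illustrative sketch, with lambda and the toy data chosen arbitrarily:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def map_gradient(theta, X, y, lam=0.1):
    # d/dtheta [ log P(y | X, theta) + log P(theta) ] with P(theta) = N(0, (1/lam) I)
    return X.T @ (y - sigmoid(X @ theta)) - lam * theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = (rng.uniform(size=50) < 0.5).astype(float)
theta = np.array([0.5, -1.0, 2.0])
print(map_gradient(theta, X, y))   # used in place of the MLE gradient in the ascent loop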
Logistic regression: practical
issues
IRLS takes O(Nd³) per iteration, where N = number of training
cases and d = dimension of input x.
Quasi-Newton methods, that approximate the Hessian, work
faster.
Conjugate gradient takes O(Nd) per iteration, and usually
works best in practice.
Stochastic gradient descent can also be used if N is large, cf. the
perceptron rule: \theta \leftarrow \theta + \eta \big(y_n - \mu(\theta^T x_n)\big)\, x_n (one training example at a time)
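A minimal sketch of that one-example-at-a-time update (step size, number of passes, and the synthetic data are illustrative assumptions):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_epoch(theta, X, y, eta=0.05):
    # one pass over the data, updating theta after each example
    for x_n, y_n in zip(X, y):
        theta = theta + eta * (y_n - sigmoid(theta @ x_n)) * x_n
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
true_theta = np.array([0.3, 1.5, -2.0])
y = (rng.uniform(size=500) < sigmoid(X @ true_theta)).astype(float)

theta = np.zeros(3)
for _ in range(20):              # a few passes over the data
    theta = sgd_epoch(theta, X, y)
print(theta)                     # roughly approaches true_theta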
Naïve Bayes vs Logistic
Regression
Consider Y boolean, X continuous, X=<X1 ... Xn>
Number of parameters to estimate:
NB: 2n+1
LR: n+1
Estimation method:
NB parameter estimates are uncoupled (super easy)
LR parameter estimates are coupled (less easy)
Naïve Bayes vs Logistic
Regression
Asymptotic comparison (# training examples → ∞)
when model assumptions correct
NB, LR produce identical classifiers
when model assumptions incorrect
LR is less biased – does not assume conditional independence
therefore expected to outperform NB
Naïve Bayes vs Logistic
Regression
Non-asymptotic analysis (see [Ng & Jordan, 2002] )
convergence rate of parameter estimates – how many training
examples needed to assure good estimates?
NB order log n (where n = # of attributes in X)
LR order n
NB converges more quickly to its (perhaps less helpful)
asymptotic estimates
Rate of convergence: logistic
regression
Let h_{Dis,m} be logistic regression trained on m examples in n
dimensions. Then with high probability:
\varepsilon(h_{Dis,m}) \le \varepsilon(h_{Dis,\infty}) + O\!\Big( \sqrt{\tfrac{n}{m}\log\tfrac{m}{n}} \Big)
Implication: if we want \varepsilon(h_{Dis,m}) \le \varepsilon(h_{Dis,\infty}) + \epsilon_0
for some small constant \epsilon_0, it suffices to pick order n
examples
Converges to its asymptotic classifier in order n examples
The result follows from Vapnik's structural risk bound, plus the fact that the "VC
dimension" of n-dimensional linear separators is n
Rate of convergence: naïve Bayes
parameters
Let any \epsilon_1, \delta > 0, and any l \ge 0 be fixed.
Assume that for some fixed \rho > 0,
we have that \rho \le p(y = 1) \le 1 - \rho.
Let m = O\big( (1/\epsilon_1^2) \log(n/\delta) \big).
Then with probability at least 1 - \delta, after m examples:
1. For discrete inputs, for all i and b, the estimate of p(x_i = b \mid y) is within \epsilon_1 of its true value
2. For continuous inputs, for all i and b, the estimated class-conditional mean \mu_{i \mid b} and variance \sigma_{i \mid b}^2 are within \epsilon_1 of their true values
Some experiments from UCI data
sets
Summary
Logistic regression
– Functional form follows from Naïve Bayes assumptions
For Gaussian Naïve Bayes assuming variance \sigma_{j,k} = \sigma_j (i.e., variance independent of the class k)
For discrete-valued Naïve Bayes too
But training procedure picks parameters without the
conditional independence assumption
– MLE training: pick W to maximize P(Y | X; W)
– MAP training: pick W to maximize P(W | X, Y)
‘regularization’
helps reduce overfitting
Gradient ascent/descent
– General approach when closed-form solutions unavailable
Generative vs. Discriminative classifiers
– Bias vs. variance tradeoff