
Page 1:

CS 782 – Machine Learning, Lecture 4

Linear Models for Classification

Probabilistic generative models

Probabilistic discriminative models

Page 2:

Probabilistic Generative Models

We have shown:

p(C_1 | x) = \sigma(w^T x + w_0)

where

w = \Sigma^{-1}(\mu_1 - \mu_2)

w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln\frac{p(C_1)}{p(C_2)}

Decision boundary:

p(C_1 | x) = p(C_2 | x) = 0.5 \;\Leftrightarrow\; \sigma(w^T x + w_0) = 0.5 \;\Leftrightarrow\; w^T x + w_0 = 0
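Not part of the original slides: a minimal NumPy sketch of this result, with made-up values for the class means, shared covariance, and priors. It builds w and w_0 from \mu_1, \mu_2, \Sigma, p(C_1), p(C_2) and evaluates the posterior through the sigmoid.

```python
import numpy as np

# Illustrative two-class Gaussian generative model with a shared covariance.
mu1 = np.array([1.0, 1.0])          # class C1 mean (assumed values)
mu2 = np.array([-1.0, -0.5])        # class C2 mean (assumed values)
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.5]])      # shared covariance matrix (assumed)
p_c1, p_c2 = 0.6, 0.4               # class priors (assumed)

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)                        # w  = Sigma^{-1}(mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1                 # w0 = -1/2 mu1' S^-1 mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2                #      +1/2 mu2' S^-1 mu2
      + np.log(p_c1 / p_c2))                       #      + ln p(C1)/p(C2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.3, -0.2])
print(sigmoid(w @ x + w0))          # p(C1 | x); the decision boundary is w @ x + w0 = 0
```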

Page 3:

Probabilistic Generative Models

K > 2 classes:

p(C_k | x) = \frac{p(x | C_k)\, p(C_k)}{\sum_j p(x | C_j)\, p(C_j)} = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad a_k = \ln\big[ p(x | C_k)\, p(C_k) \big]

We can show the following result:

p(C_k | x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}

where

w_k = \Sigma^{-1} \mu_k

w_{k0} = -\frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k)
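The same construction for K > 2 classes, again as an illustrative sketch (not from the slides) with made-up means, covariance, and priors: each class contributes a_k = w_k^T x + w_{k0}, and the posteriors are the softmax of the a_k.

```python
import numpy as np

# Illustrative K-class Gaussian generative model with a shared covariance.
mus = np.array([[1.0, 0.0], [0.0, 1.5], [-1.0, -1.0]])   # class means (assumed)
priors = np.array([0.5, 0.3, 0.2])                        # class priors (assumed)
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.1], [0.1, 0.8]]))

# Row k of W is w_k = Sigma^{-1} mu_k (Sigma^{-1} is symmetric, so mus @ Sigma_inv works).
W = mus @ Sigma_inv
# w_{k0} = -1/2 mu_k' Sigma^{-1} mu_k + ln p(C_k)
w0 = -0.5 * np.einsum('ki,ij,kj->k', mus, Sigma_inv, mus) + np.log(priors)

x = np.array([0.2, 0.4])
a = W @ x + w0                         # a_k = w_k' x + w_{k0}
posterior = np.exp(a - a.max())        # softmax, shifted for numerical stability
posterior /= posterior.sum()
print(posterior)                       # p(C_k | x), sums to 1
```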

Page 4:

Maximum likelihood solution

We have a parametric functional form for the class-conditional densities:

p(x | C_k) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)

We can estimate the parameters and the prior class probabilities using maximum likelihood.

Two-class case with shared covariance matrix.

Training data:

\{x_n, t_n\}, \; n = 1, \ldots, N, where t_n = 1 denotes class C_1 and t_n = 0 denotes class C_2.

Priors: p(C_1) = \pi, \quad p(C_2) = 1 - \pi

Page 5:

Maximum likelihood solution

Training data: \{x_n, t_n\}, \; n = 1, \ldots, N, where t_n = 1 denotes class C_1 and t_n = 0 denotes class C_2. Priors: p(C_1) = \pi, \; p(C_2) = 1 - \pi.

For a data point x_n from class C_1 we have t_n = 1, and therefore

p(x_n, C_1) = p(C_1)\, p(x_n | C_1) = \pi\, N(x_n | \mu_1, \Sigma)

For a data point x_n from class C_2 we have t_n = 0, and therefore

p(x_n, C_2) = p(C_2)\, p(x_n | C_2) = (1 - \pi)\, N(x_n | \mu_2, \Sigma)

Page 6:

Maximum likelihood solution

Training data: \{x_n, t_n\}, \; n = 1, \ldots, N, with t_n = 1 for class C_1 and t_n = 0 for class C_2; priors p(C_1) = \pi, p(C_2) = 1 - \pi; joint densities p(x_n, C_1) = \pi\, N(x_n | \mu_1, \Sigma) and p(x_n, C_2) = (1 - \pi)\, N(x_n | \mu_2, \Sigma).

Assuming observations are drawn independently, we can write the likelihood function as follows:

p(\mathbf{t} | \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \big[ \pi\, N(x_n | \mu_1, \Sigma) \big]^{t_n} \big[ (1 - \pi)\, N(x_n | \mu_2, \Sigma) \big]^{1 - t_n}

where \mathbf{t} = (t_1, \ldots, t_N)^T.

Page 7:

Maximum likelihood solution

p(\mathbf{t} | \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \big[ \pi\, N(x_n | \mu_1, \Sigma) \big]^{t_n} \big[ (1 - \pi)\, N(x_n | \mu_2, \Sigma) \big]^{1 - t_n}

We want to find the values of the parameters that maximize the likelihood function, i.e., fit a model that best describes the observed data.

As usual, we consider the log of the likelihood:

\ln p(\mathbf{t} | \pi, \mu_1, \mu_2, \Sigma) = \sum_{n=1}^{N} \Big\{ t_n \ln \pi + t_n \ln N(x_n | \mu_1, \Sigma) + (1 - t_n) \ln(1 - \pi) + (1 - t_n) \ln N(x_n | \mu_2, \Sigma) \Big\}

Page 8:

Maximum likelihood solution

We first maximize the log likelihood with respect to \pi. The terms that depend on \pi are

\sum_{n=1}^{N} \big[ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big]

Setting the derivative with respect to \pi to zero:

\frac{\partial}{\partial \pi} \sum_{n=1}^{N} \big[ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big] = \sum_{n=1}^{N} \left[ \frac{t_n}{\pi} - \frac{1 - t_n}{1 - \pi} \right] = 0

\Rightarrow\; (1 - \pi) \sum_n t_n = \pi \sum_n (1 - t_n) \;\Rightarrow\; \sum_n t_n = \pi N

\Rightarrow\; \pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}

where N_1 is the number of training points in class C_1 and N_2 = N - N_1 is the number in class C_2.

Page 9:

Maximum likelihood solution

Thus, the maximum likelihood estimate of \pi is the fraction of points in class C_1:

\pi_{ML} = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}

The result can be generalized to the multiclass case: the maximum likelihood estimate of p(C_k) is given by the fraction of points in the training set that belong to C_k.

Page 10:

Maximum likelihood solution

We now maximize the log likelihood with respect to . The

terms that depend on are1μ

\sum_{n=1}^{N} t_n \ln N(x_n | \mu_1, \Sigma) = -\frac{1}{2} \sum_{n=1}^{N} t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) + \text{const}

since N(x_n | \mu_1, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) \right).

Setting the derivative with respect to \mu_1 to zero:

\sum_{n=1}^{N} t_n \Sigma^{-1} (x_n - \mu_1) = 0 \;\Rightarrow\; \sum_{n=1}^{N} t_n x_n = \left( \sum_{n=1}^{N} t_n \right) \mu_1 \;\Rightarrow\; \mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n

Page 11:

Maximum likelihood solution

Thus, the maximum likelihood estimate of \mu_1 is the sample mean of all the input vectors x_n assigned to class C_1:

\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n

By maximizing the log likelihood with respect to \mu_2 we obtain a similar result:

\mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, x_n

Page 12:

Maximum likelihood solution

Maximizing the log likelihood with respect to \Sigma, we obtain the maximum likelihood estimate

\Sigma_{ML} = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2

S_1 = \frac{1}{N_1} \sum_{n \in C_1} (x_n - \mu_1)(x_n - \mu_1)^T, \qquad S_2 = \frac{1}{N_2} \sum_{n \in C_2} (x_n - \mu_2)(x_n - \mu_2)^T

Thus, the maximum likelihood estimate of the covariance is given by the weighted average of the sample covariance matrices associated with each of the two classes.

This result extends to K classes.
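Putting the maximum likelihood estimates together, a small NumPy sketch (illustrative only; the synthetic data and variable names are mine, not from the slides): it computes \pi, \mu_1, \mu_2, and the pooled covariance \Sigma_{ML} = (N_1/N) S_1 + (N_2/N) S_2 from a labelled sample.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic labelled data: t_n = 1 for class C1, t_n = 0 for class C2 (assumed setup).
X1 = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(60, 2))
X2 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(40, 2))
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(60), np.zeros(40)])

N, N1, N2 = len(t), t.sum(), (1 - t).sum()
pi_ml = N1 / N                                      # p(C1) = N1 / N
mu1 = (t[:, None] * X).sum(axis=0) / N1             # sample mean of points with t_n = 1
mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2       # sample mean of points with t_n = 0

S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1   # sample covariance of class C1
S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2   # sample covariance of class C2
Sigma_ml = (N1 / N) * S1 + (N2 / N) * S2            # weighted average of the two

print(pi_ml, mu1, mu2)
print(Sigma_ml)
```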

Page 13:

Probabilistic Discriminative Models

Two-class case:

p(C_1 | x) = \sigma(w^T x + w_0)

Multiclass case:

p(C_k | x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}

Discriminative approach: use the functional form of the generalized linear model for the posterior probabilities and determine its parameters directly using maximum likelihood.

Page 14:

Probabilistic Discriminative Models

Advantages:

Fewer parameters to be determined

Improved predictive performance, especially when the

class-conditional density assumptions give a poor

approximation of the true distributions.

Page 15:

Probabilistic Discriminative Models

Two-class case:

p(C_1 | x) = y(x) = \sigma(w^T x + w_0), \qquad p(C_2 | x) = 1 - p(C_1 | x)

In the terminology of statistics, this model is known as logistic regression.

Assuming x is M-dimensional, how many parameters do we need to estimate? M + 1.

Page 16:

Probabilistic Discriminative Models

How many parameters did we estimate to fit Gaussian

class-conditional densities (generative approach)?

Mean vectors: 2M parameters

Shared covariance matrix: M(M + 1)/2 parameters

Prior p(C_1): 1 parameter

Total: 2M + \frac{M(M + 1)}{2} + 1 = \frac{M(M + 5)}{2} + 1, which grows as O(M^2)
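As a worked example (not in the original slides): for M = 2 input dimensions the generative approach needs 2M = 4 mean components, M(M + 1)/2 = 3 covariance entries, and 1 prior, i.e. 8 parameters in total, while logistic regression needs only M + 1 = 3. The gap widens quadratically as M grows.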

Page 17:

Logistic Regression

We use maximum likelihood to determine the parameters

of the logistic regression model.

p(C_1 | x) = y(x) = \sigma(w^T x + w_0)

Training data: \{x_n, t_n\}, \; n = 1, \ldots, N, where t_n = 1 denotes class C_1 and t_n = 0 denotes class C_2.

We want to find the values of w that maximize the posterior probabilities associated with the observed data.

Likelihood function:

L(w) = \prod_{n=1}^{N} P(C_1 | x_n)^{t_n}\, P(C_2 | x_n)^{1 - t_n} = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}, \qquad y_n = y(x_n)

Page 18:

Logistic Regression

p(C_1 | x) = y(x) = \sigma(w^T x + w_0)

L(w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}

We consider the negative logarithm of the likelihood:

E(w) = -\ln L(w) = -\sum_{n=1}^{N} \big[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big]

w^* = \arg\min_{w} E(w)
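A small helper, written as an illustrative sketch (the function name and the eps guard are my own choices): it evaluates E(w) for given predictions y_n and binary targets t_n.

```python
import numpy as np

def cross_entropy_error(y, t, eps=1e-12):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ].

    y: predicted probabilities y_n = sigma(w'x_n + w0), shape (N,)
    t: binary targets t_n in {0, 1}, shape (N,)
    eps guards against log(0) for saturated predictions.
    """
    y = np.clip(y, eps, 1 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

print(cross_entropy_error(np.array([0.9, 0.2, 0.7]), np.array([1, 0, 1])))
```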

Page 19:

Logistic Regression

We compute the derivative of the error function with respect to w (gradient):

E(w) = -\sum_{n=1}^{N} \big[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big], \qquad y_n = \sigma(w^T x_n + w_0)

We need to compute the derivative of the logistic sigmoid function \sigma(a) = \frac{1}{1 + e^{-a}}:

\frac{d\sigma}{da} = \frac{e^{-a}}{(1 + e^{-a})^2} = \frac{1}{1 + e^{-a}} \cdot \frac{e^{-a}}{1 + e^{-a}} = \sigma(a)\,(1 - \sigma(a))
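A quick numerical check of this identity (illustrative, not from the slides), comparing the analytic derivative \sigma(a)(1 - \sigma(a)) against a central finite difference:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-4, 4, 9)
h = 1e-6
analytic = sigmoid(a) * (1 - sigmoid(a))                 # sigma'(a) = sigma(a)(1 - sigma(a))
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)    # central finite difference
print(np.max(np.abs(analytic - numeric)))                # agreement up to O(h^2)
```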

Page 20:

Logistic Regression

p(C_1 | x) = y(x) = \sigma(w^T x + w_0)

Using \nabla_w y_n = y_n (1 - y_n)\, x_n:

\nabla_w E(w) = -\nabla_w \sum_{n=1}^{N} \big[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big]

= -\sum_{n=1}^{N} \left[ \frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n} \right] y_n (1 - y_n)\, x_n

= -\sum_{n=1}^{N} \big[ t_n (1 - y_n) - (1 - t_n)\, y_n \big]\, x_n

= \sum_{n=1}^{N} (y_n - t_n)\, x_n

Page 21:

Logistic Regression

\nabla_w E(w) = \sum_{n=1}^{N} (y_n - t_n)\, x_n

The gradient of E at w gives the direction of the steepest increase of E at w. We need to minimize E. Thus we need to update w so that we move along the opposite direction of the gradient:

w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla_w E(w^{(\tau)})

This technique is called gradient descent.

It can be shown that E is a convex function of w. Thus, it has a unique minimum.

An efficient iterative technique exists to find the optimal w parameters (Newton-Raphson optimization).
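A minimal batch gradient-descent sketch for the two-class model (illustrative only; the synthetic data, learning rate, and iteration count are made-up choices, and the bias w_0 is absorbed by appending a constant 1 to each input):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
# Synthetic two-class data; a column of ones absorbs the bias w0.
X = np.vstack([rng.normal([1.5, 1.0], 1.0, size=(50, 2)),
               rng.normal([-1.0, -0.5], 1.0, size=(50, 2))])
X = np.hstack([X, np.ones((100, 1))])
t = np.concatenate([np.ones(50), np.zeros(50)])

w = np.zeros(3)
eta = 0.001                                  # small fixed learning rate (assumed value)
for _ in range(5000):                        # batch gradient descent
    y = sigmoid(X @ w)                       # y_n = sigma(w' x_n)
    grad = X.T @ (y - t)                     # grad E = sum_n (y_n - t_n) x_n
    w = w - eta * grad                       # move against the gradient

print(w)
print(np.mean((sigmoid(X @ w) > 0.5) == t))  # training accuracy
```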

Page 22:

Batch vs. on-line learning

\nabla_w E(w) = \sum_{n=1}^{N} (y_n - t_n)\, x_n

The computation of the above gradient requires the

processing of the entire training set (batch technique)

If the data set is large, the above technique can be

costly;

For real time applications in which data become

available as continuous streams, we may want to update

the parameters as data points are presented to us (on-line

technique).

w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla_w E(w^{(\tau)})

Page 23:

On-line learning

After the presentation of each data point n, we compute

the contribution of that data point to the gradient

(stochastic gradient):

\nabla_w E_n(w) = (y_n - t_n)\, x_n

The on-line updating rule for the parameters becomes:

w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla_w E_n(w^{(\tau)}) = w^{(\tau)} - \eta\, (y_n - t_n)\, x_n

\eta > 0 is called the learning rate. Its value needs to be chosen carefully to ensure convergence.
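A sketch of the on-line (stochastic) variant, under the same illustrative setup as the batch example above (the function name, learning rate, and epoch count are my own choices): one randomly chosen point per update.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_logistic(X, t, eta=0.01, epochs=20, seed=0):
    """On-line logistic regression: one update per presented data point.

    X is (N, M+1) with a trailing column of ones for the bias; t is (N,) in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(t)):      # present the points in random order
            y_n = sigmoid(w @ X[n])            # y_n = sigma(w' x_n)
            w = w - eta * (y_n - t[n]) * X[n]  # w <- w - eta * grad E_n
    return w
```

Called with the X and t arrays from the batch sketch above, it should produce a comparable weight vector.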

Page 24:

Multiclass Logistic Regression

Multiclass case:

p(C_k | x) = y_k(x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}

We use maximum likelihood to determine the parameters

of the logistic regression model.

Training data: \{x_n, t_n\}, \; n = 1, \ldots, N, where the target vectors t_n = (t_{n1}, \ldots, t_{nK}) use a 1-of-K coding scheme: t_{nk} = 1 if x_n belongs to class C_k and 0 otherwise.

We want to find the values of w_1, \ldots, w_K that maximize the posterior probabilities associated with the observed data.

Likelihood function:

L(w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} P(C_k | x_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}, \qquad y_{nk} = y_k(x_n)

Page 25:

Multiclass Logistic Regression

p(C_k | x) = y_k(x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}

L(w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}

We consider the negative logarithm of the likelihood:

E(w_1, \ldots, w_K) = -\ln L(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}

(w_1^*, \ldots, w_K^*) = \arg\min E(w_1, \ldots, w_K)

Page 26:

Multiclass Logistic Regression

p(C_k | x) = y_k(x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}, \qquad E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}

We compute the gradient of the error function with respect

to one of the parameter vectors:

\nabla_{w_j} E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk}\, \nabla_{w_j} \ln y_{nk}

Page 27:

Multiclass Logistic Regression

p(C_k | x) = y_k(x) = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad a_k = w_k^T x + w_{k0}

Thus, we need to compute the derivatives of the softmax

function:

\nabla_{w_j} E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk}\, \nabla_{w_j} \ln y_{nk}

For j = k:

\frac{\partial y_k}{\partial a_k} = \frac{\partial}{\partial a_k} \frac{e^{a_k}}{\sum_j e^{a_j}} = \frac{e^{a_k}}{\sum_j e^{a_j}} - \frac{(e^{a_k})^2}{\left( \sum_j e^{a_j} \right)^2} = y_k - y_k^2 = y_k (1 - y_k)

Page 28:

Multiclass Logistic Regression

p(C_k | x) = y_k(x) = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad a_k = w_k^T x + w_{k0}

Thus, we need to compute the derivatives of the softmax

function:

\nabla_{w_j} E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk}\, \nabla_{w_j} \ln y_{nk}

For j \neq k:

\frac{\partial y_k}{\partial a_j} = \frac{\partial}{\partial a_j} \frac{e^{a_k}}{\sum_j e^{a_j}} = -\frac{e^{a_k}\, e^{a_j}}{\left( \sum_j e^{a_j} \right)^2} = -y_k\, y_j

Page 29:

Multiclass Logistic Regression

Compact expression:

For j = k: \quad \frac{\partial y_k}{\partial a_k} = y_k (1 - y_k)

For j \neq k: \quad \frac{\partial y_k}{\partial a_j} = -y_k\, y_j

Combining the two cases:

\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)

where the I_{kj} are the elements of the identity matrix.
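A small numerical check of the compact expression (illustrative, not from the slides): it builds the full Jacobian \partial y_k / \partial a_j = y_k (I_{kj} - y_j) and compares it with finite differences of the softmax; the test vector a is arbitrary.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([0.5, -1.0, 2.0, 0.0])
y = softmax(a)
jac = np.diag(y) - np.outer(y, y)          # dy_k/da_j = y_k (I_kj - y_j)

h = 1e-6
num = np.empty((len(a), len(a)))
for j in range(len(a)):                    # finite-difference column j
    d = np.zeros_like(a)
    d[j] = h
    num[:, j] = (softmax(a + d) - softmax(a - d)) / (2 * h)

print(np.max(np.abs(jac - num)))           # should be vanishingly small
```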

Page 30:

Multiclass Logistic Regression

p(C_k | x) = y_k(x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}

Using \frac{\partial y_{nk}}{\partial a_{nj}} = y_{nk} (I_{kj} - y_{nj}) and \sum_{k} t_{nk} = 1:

\nabla_{w_j} E(w_1, \ldots, w_K) = -\nabla_{w_j} \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}

= -\sum_{n=1}^{N} \sum_{k=1}^{K} \frac{t_{nk}}{y_{nk}}\, y_{nk} (I_{kj} - y_{nj})\, x_n

= -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} (I_{kj} - y_{nj})\, x_n

= -\sum_{n=1}^{N} \left( t_{nj} - y_{nj} \sum_{k=1}^{K} t_{nk} \right) x_n

= \sum_{n=1}^{N} (y_{nj} - t_{nj})\, x_n

Page 31:

Multiclass Logistic Regression

\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, x_n

It can be shown that E is a convex function of the parameters. Thus, it has a unique minimum.

For a batch solution, we can use the Newton-Raphson

optimization technique.

On-line solution (stochastic gradient descent):

w_j^{(\tau+1)} = w_j^{(\tau)} - \eta\, \nabla_{w_j} E_n = w_j^{(\tau)} - \eta\, (y_{nj} - t_{nj})\, x_n
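A closing sketch of the multiclass case trained with stochastic gradient descent (illustrative end-to-end example only; the synthetic three-class data, learning rate, and epoch count are made-up choices, and biases are again absorbed by a constant input):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(2)
K, M, N = 3, 2, 150
centers = np.array([[2.0, 0.0], [-2.0, 1.0], [0.0, -2.0]])      # assumed class centers
X = np.vstack([rng.normal(c, 1.0, size=(N // K, M)) for c in centers])
X = np.hstack([X, np.ones((N, 1))])                             # absorb the biases w_{k0}
T = np.zeros((N, K))
T[np.arange(N), np.repeat(np.arange(K), N // K)] = 1.0          # 1-of-K targets t_n

W = np.zeros((K, M + 1))                                        # row j holds w_j
eta = 0.05                                                      # learning rate (assumed)
for _ in range(50):                                             # epochs over the data
    for n in rng.permutation(N):
        y_n = softmax(W @ X[n])                                 # y_nk = p(C_k | x_n)
        W = W - eta * np.outer(y_n - T[n], X[n])                # grad_{w_j} E_n = (y_nj - t_nj) x_n

pred = np.array([np.argmax(softmax(W @ x)) for x in X])
print(np.mean(pred == T.argmax(axis=1)))                        # training accuracy
```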