
Page 1:

CS 782 – Machine Learning, Lecture 4

Linear Models for Classification

Probabilistic generative models

Probabilistic discriminative models

Page 2:

Probabilistic Generative Models

We have shown:

p(C_1 | x) = \sigma(w^T x + w_0)

where

w = \Sigma^{-1}(\mu_1 - \mu_2)

w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln\frac{p(C_1)}{p(C_2)}

Decision boundary:

p(C_1 | x) = p(C_2 | x) = 0.5 \;\Leftrightarrow\; \sigma(w^T x + w_0) = 0.5 \;\Leftrightarrow\; w^T x + w_0 = 0
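Not part of the original slides: a minimal NumPy sketch of this result, with made-up values for the class means, shared covariance, and priors. It builds w and w_0 from \mu_1, \mu_2, \Sigma, p(C_1), p(C_2) and evaluates the posterior through the sigmoid.

```python
import numpy as np

# Illustrative two-class Gaussian generative model with a shared covariance.
mu1 = np.array([1.0, 1.0])          # class C1 mean (assumed values)
mu2 = np.array([-1.0, -0.5])        # class C2 mean (assumed values)
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.5]])      # shared covariance matrix (assumed)
p_c1, p_c2 = 0.6, 0.4               # class priors (assumed)

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)                        # w  = Sigma^{-1}(mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1                 # w0 = -1/2 mu1' S^-1 mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2                #      +1/2 mu2' S^-1 mu2
      + np.log(p_c1 / p_c2))                       #      + ln p(C1)/p(C2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.3, -0.2])
print(sigmoid(w @ x + w0))          # p(C1 | x); the decision boundary is w @ x + w0 = 0
```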

Page 3:

Probabilistic Generative Models

K > 2 classes:

p(C_k | x) = \frac{p(x | C_k)\, p(C_k)}{\sum_j p(x | C_j)\, p(C_j)} = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad a_k = \ln\big[ p(x | C_k)\, p(C_k) \big]

We can show the following result:

p(C_k | x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}

where

w_k = \Sigma^{-1} \mu_k

w_{k0} = -\frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k)
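The same construction for K > 2 classes, again as an illustrative sketch (not from the slides) with made-up means, covariance, and priors: each class contributes a_k = w_k^T x + w_{k0}, and the posteriors are the softmax of the a_k.

```python
import numpy as np

# Illustrative K-class Gaussian generative model with a shared covariance.
mus = np.array([[1.0, 0.0], [0.0, 1.5], [-1.0, -1.0]])   # class means (assumed)
priors = np.array([0.5, 0.3, 0.2])                        # class priors (assumed)
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.1], [0.1, 0.8]]))

# Row k of W is w_k = Sigma^{-1} mu_k (Sigma^{-1} is symmetric, so mus @ Sigma_inv works).
W = mus @ Sigma_inv
# w_{k0} = -1/2 mu_k' Sigma^{-1} mu_k + ln p(C_k)
w0 = -0.5 * np.einsum('ki,ij,kj->k', mus, Sigma_inv, mus) + np.log(priors)

x = np.array([0.2, 0.4])
a = W @ x + w0                         # a_k = w_k' x + w_{k0}
posterior = np.exp(a - a.max())        # softmax, shifted for numerical stability
posterior /= posterior.sum()
print(posterior)                       # p(C_k | x), sums to 1
```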

Page 4:

Maximum likelihood solution

We have a parametric functional form for the class-conditional densities:

p(x | C_k) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)

We can estimate the parameters and the prior class probabilities using maximum likelihood.

Two-class case with shared covariance matrix.

Training data:

\{x_n, t_n\}, \; n = 1, \ldots, N, where t_n = 1 denotes class C_1 and t_n = 0 denotes class C_2.

Priors: p(C_1) = \pi, \quad p(C_2) = 1 - \pi

Page 5:

Maximum likelihood solution

Training data: \{x_n, t_n\}, \; n = 1, \ldots, N, where t_n = 1 denotes class C_1 and t_n = 0 denotes class C_2. Priors: p(C_1) = \pi, \; p(C_2) = 1 - \pi.

For a data point x_n from class C_1 we have t_n = 1, and therefore

p(x_n, C_1) = p(C_1)\, p(x_n | C_1) = \pi\, N(x_n | \mu_1, \Sigma)

For a data point x_n from class C_2 we have t_n = 0, and therefore

p(x_n, C_2) = p(C_2)\, p(x_n | C_2) = (1 - \pi)\, N(x_n | \mu_2, \Sigma)

Page 6:

Maximum likelihood solution

Training data: \{x_n, t_n\}, \; n = 1, \ldots, N, with t_n = 1 for class C_1 and t_n = 0 for class C_2; priors p(C_1) = \pi, p(C_2) = 1 - \pi; joint densities p(x_n, C_1) = \pi\, N(x_n | \mu_1, \Sigma) and p(x_n, C_2) = (1 - \pi)\, N(x_n | \mu_2, \Sigma).

Assuming observations are drawn independently, we can write the likelihood function as follows:

p(\mathbf{t} | \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \big[ \pi\, N(x_n | \mu_1, \Sigma) \big]^{t_n} \big[ (1 - \pi)\, N(x_n | \mu_2, \Sigma) \big]^{1 - t_n}

where \mathbf{t} = (t_1, \ldots, t_N)^T.

Page 7:

Maximum likelihood solution

p(\mathbf{t} | \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \big[ \pi\, N(x_n | \mu_1, \Sigma) \big]^{t_n} \big[ (1 - \pi)\, N(x_n | \mu_2, \Sigma) \big]^{1 - t_n}

We want to find the values of the parameters that maximize the likelihood function, i.e., fit a model that best describes the observed data.

As usual, we consider the log of the likelihood:

\ln p(\mathbf{t} | \pi, \mu_1, \mu_2, \Sigma) = \sum_{n=1}^{N} \Big\{ t_n \ln \pi + t_n \ln N(x_n | \mu_1, \Sigma) + (1 - t_n) \ln(1 - \pi) + (1 - t_n) \ln N(x_n | \mu_2, \Sigma) \Big\}

Page 8:

Maximum likelihood solution

We first maximize the log likelihood with respect to \pi. The terms that depend on \pi are

\sum_{n=1}^{N} \big[ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big]

Setting the derivative with respect to \pi to zero:

\frac{\partial}{\partial \pi} \sum_{n=1}^{N} \big[ t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big] = \sum_{n=1}^{N} \left[ \frac{t_n}{\pi} - \frac{1 - t_n}{1 - \pi} \right] = 0

\Rightarrow\; (1 - \pi) \sum_n t_n = \pi \sum_n (1 - t_n) \;\Rightarrow\; \sum_n t_n = \pi N

\Rightarrow\; \pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}

where N_1 is the number of training points in class C_1 and N_2 = N - N_1 is the number in class C_2.

Page 9:

Maximum likelihood solution

Thus, the maximum likelihood estimate of \pi is the fraction of points in class C_1:

\pi_{ML} = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}

The result can be generalized to the multiclass case: the maximum likelihood estimate of p(C_k) is given by the fraction of points in the training set that belong to C_k.

Page 10:

Maximum likelihood solution

We now maximize the log likelihood with respect to . The

terms that depend on are1μ

\sum_{n=1}^{N} t_n \ln N(x_n | \mu_1, \Sigma) = -\frac{1}{2} \sum_{n=1}^{N} t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) + \text{const}

since N(x_n | \mu_1, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2}(x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) \right).

Setting the derivative with respect to \mu_1 to zero:

\sum_{n=1}^{N} t_n \Sigma^{-1} (x_n - \mu_1) = 0 \;\Rightarrow\; \sum_{n=1}^{N} t_n x_n = \left( \sum_{n=1}^{N} t_n \right) \mu_1 \;\Rightarrow\; \mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n

Page 11:

Maximum likelihood solution

Thus, the maximum likelihood estimate of \mu_1 is the sample mean of all the input vectors x_n assigned to class C_1:

\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n

By maximizing the log likelihood with respect to \mu_2 we obtain a similar result:

\mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, x_n

Page 12:

Maximum likelihood solution

Maximizing the log likelihood with respect to \Sigma, we obtain the maximum likelihood estimate

\Sigma_{ML} = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2

S_1 = \frac{1}{N_1} \sum_{n \in C_1} (x_n - \mu_1)(x_n - \mu_1)^T, \qquad S_2 = \frac{1}{N_2} \sum_{n \in C_2} (x_n - \mu_2)(x_n - \mu_2)^T

Thus, the maximum likelihood estimate of the covariance is given by the weighted average of the sample covariance matrices associated with each of the two classes.

This result extends to K classes.
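Putting the maximum likelihood estimates together, a small NumPy sketch (illustrative only; the synthetic data and variable names are mine, not from the slides): it computes \pi, \mu_1, \mu_2, and the pooled covariance \Sigma_{ML} = (N_1/N) S_1 + (N_2/N) S_2 from a labelled sample.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic labelled data: t_n = 1 for class C1, t_n = 0 for class C2 (assumed setup).
X1 = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(60, 2))
X2 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(40, 2))
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(60), np.zeros(40)])

N, N1, N2 = len(t), t.sum(), (1 - t).sum()
pi_ml = N1 / N                                      # p(C1) = N1 / N
mu1 = (t[:, None] * X).sum(axis=0) / N1             # sample mean of points with t_n = 1
mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2       # sample mean of points with t_n = 0

S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1   # sample covariance of class C1
S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2   # sample covariance of class C2
Sigma_ml = (N1 / N) * S1 + (N2 / N) * S2            # weighted average of the two

print(pi_ml, mu1, mu2)
print(Sigma_ml)
```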

Page 13:

Probabilistic Discriminative Models

Two-class case:

p(C_1 | x) = \sigma(w^T x + w_0)

Multiclass case:

p(C_k | x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}

Discriminative approach: use the functional form of the generalized linear model for the posterior probabilities and determine its parameters directly using maximum likelihood.

Page 14:

Probabilistic Discriminative Models

Advantages:

Fewer parameters to be determined

Improved predictive performance, especially when the

class-conditional density assumptions give a poor

approximation of the true distributions.

Page 15:

Probabilistic Discriminative Models

Two-class case:

p(C_1 | x) = y(x) = \sigma(w^T x + w_0), \qquad p(C_2 | x) = 1 - p(C_1 | x)

In the terminology of statistics, this model is known as logistic regression.

Assuming x is M-dimensional, how many parameters do we need to estimate? M + 1.

Page 16:

Probabilistic Discriminative Models

How many parameters did we estimate to fit Gaussian

class-conditional densities (generative approach)?

Mean vectors: 2M parameters

Shared covariance matrix: M(M + 1)/2 parameters

Prior p(C_1): 1 parameter

Total: 2M + \frac{M(M + 1)}{2} + 1 = \frac{M(M + 5)}{2} + 1, which grows as O(M^2)
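As a worked example (not in the original slides): for M = 2 input dimensions the generative approach needs 2M = 4 mean components, M(M + 1)/2 = 3 covariance entries, and 1 prior, i.e. 8 parameters in total, while logistic regression needs only M + 1 = 3. The gap widens quadratically as M grows.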

Page 17:

Logistic Regression

We use maximum likelihood to determine the parameters

of the logistic regression model.

p(C_1 | x) = y(x) = \sigma(w^T x + w_0)

Training data: \{x_n, t_n\}, \; n = 1, \ldots, N, where t_n = 1 denotes class C_1 and t_n = 0 denotes class C_2.

We want to find the values of w that maximize the posterior probabilities associated with the observed data.

Likelihood function:

L(w) = \prod_{n=1}^{N} P(C_1 | x_n)^{t_n}\, P(C_2 | x_n)^{1 - t_n} = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}, \qquad y_n = y(x_n)

Page 18:

Logistic Regression

p(C_1 | x) = y(x) = \sigma(w^T x + w_0)

L(w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}

We consider the negative logarithm of the likelihood:

E(w) = -\ln L(w) = -\sum_{n=1}^{N} \big[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big]

w^* = \arg\min_{w} E(w)
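A small helper, written as an illustrative sketch (the function name and the eps guard are my own choices): it evaluates E(w) for given predictions y_n and binary targets t_n.

```python
import numpy as np

def cross_entropy_error(y, t, eps=1e-12):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ].

    y: predicted probabilities y_n = sigma(w'x_n + w0), shape (N,)
    t: binary targets t_n in {0, 1}, shape (N,)
    eps guards against log(0) for saturated predictions.
    """
    y = np.clip(y, eps, 1 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

print(cross_entropy_error(np.array([0.9, 0.2, 0.7]), np.array([1, 0, 1])))
```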

Page 19:

Logistic Regression

We compute the derivative of the error function with respect to w (gradient):

E(w) = -\sum_{n=1}^{N} \big[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big], \qquad y_n = \sigma(w^T x_n + w_0)

We need to compute the derivative of the logistic sigmoid function \sigma(a) = \frac{1}{1 + e^{-a}}:

\frac{d\sigma}{da} = \frac{e^{-a}}{(1 + e^{-a})^2} = \frac{1}{1 + e^{-a}} \cdot \frac{e^{-a}}{1 + e^{-a}} = \sigma(a)\,(1 - \sigma(a))
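A quick numerical check of this identity (illustrative, not from the slides), comparing the analytic derivative \sigma(a)(1 - \sigma(a)) against a central finite difference:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-4, 4, 9)
h = 1e-6
analytic = sigmoid(a) * (1 - sigmoid(a))                 # sigma'(a) = sigma(a)(1 - sigma(a))
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)    # central finite difference
print(np.max(np.abs(analytic - numeric)))                # agreement up to O(h^2)
```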

Page 20:

Logistic Regression

p(C_1 | x) = y(x) = \sigma(w^T x + w_0)

Using \nabla_w y_n = y_n (1 - y_n)\, x_n:

\nabla_w E(w) = -\nabla_w \sum_{n=1}^{N} \big[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big]

= -\sum_{n=1}^{N} \left[ \frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n} \right] y_n (1 - y_n)\, x_n

= -\sum_{n=1}^{N} \big[ t_n (1 - y_n) - (1 - t_n)\, y_n \big]\, x_n

= \sum_{n=1}^{N} (y_n - t_n)\, x_n

Page 21:

Logistic Regression

\nabla_w E(w) = \sum_{n=1}^{N} (y_n - t_n)\, x_n

The gradient of E at w gives the direction of the steepest increase of E at w. We need to minimize E. Thus we need to update w so that we move along the opposite direction of the gradient:

w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla_w E(w^{(\tau)})

This technique is called gradient descent.

It can be shown that E is a convex function of w. Thus, it has a unique minimum.

An efficient iterative technique exists to find the optimal w parameters (Newton-Raphson optimization).
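A minimal batch gradient-descent sketch for the two-class model (illustrative only; the synthetic data, learning rate, and iteration count are made-up choices, and the bias w_0 is absorbed by appending a constant 1 to each input):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
# Synthetic two-class data; a column of ones absorbs the bias w0.
X = np.vstack([rng.normal([1.5, 1.0], 1.0, size=(50, 2)),
               rng.normal([-1.0, -0.5], 1.0, size=(50, 2))])
X = np.hstack([X, np.ones((100, 1))])
t = np.concatenate([np.ones(50), np.zeros(50)])

w = np.zeros(3)
eta = 0.001                                  # small fixed learning rate (assumed value)
for _ in range(5000):                        # batch gradient descent
    y = sigmoid(X @ w)                       # y_n = sigma(w' x_n)
    grad = X.T @ (y - t)                     # grad E = sum_n (y_n - t_n) x_n
    w = w - eta * grad                       # move against the gradient

print(w)
print(np.mean((sigmoid(X @ w) > 0.5) == t))  # training accuracy
```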

Page 22:

Batch vs. on-line learning

\nabla_w E(w) = \sum_{n=1}^{N} (y_n - t_n)\, x_n

The computation of the above gradient requires the

processing of the entire training set (batch technique)

If the data set is large, the above technique can be

costly;

For real time applications in which data become

available as continuous streams, we may want to update

the parameters as data points are presented to us (on-line

technique).

w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla_w E(w^{(\tau)})

Page 23:

On-line learning

After the presentation of each data point n, we compute

the contribution of that data point to the gradient

(stochastic gradient):

\nabla_w E_n(w) = (y_n - t_n)\, x_n

The on-line updating rule for the parameters becomes:

w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla_w E_n(w^{(\tau)}) = w^{(\tau)} - \eta\, (y_n - t_n)\, x_n

\eta > 0 is called the learning rate. Its value needs to be chosen carefully to ensure convergence.
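A sketch of the on-line (stochastic) variant, under the same illustrative setup as the batch example above (the function name, learning rate, and epoch count are my own choices): one randomly chosen point per update.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_logistic(X, t, eta=0.01, epochs=20, seed=0):
    """On-line logistic regression: one update per presented data point.

    X is (N, M+1) with a trailing column of ones for the bias; t is (N,) in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(t)):      # present the points in random order
            y_n = sigmoid(w @ X[n])            # y_n = sigma(w' x_n)
            w = w - eta * (y_n - t[n]) * X[n]  # w <- w - eta * grad E_n
    return w
```

Called with the X and t arrays from the batch sketch above, it should produce a comparable weight vector.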

Page 24:

Multiclass Logistic Regression

Multiclass case:

p(C_k | x) = y_k(x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}

We use maximum likelihood to determine the parameters

of the logistic regression model.

Training data: \{x_n, t_n\}, \; n = 1, \ldots, N, where the target vectors t_n = (t_{n1}, \ldots, t_{nK}) use a 1-of-K coding scheme: t_{nk} = 1 if x_n belongs to class C_k and 0 otherwise.

We want to find the values of w_1, \ldots, w_K that maximize the posterior probabilities associated with the observed data.

Likelihood function:

L(w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} P(C_k | x_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}, \qquad y_{nk} = y_k(x_n)

Page 25:

Multiclass Logistic Regression

p(C_k | x) = y_k(x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}

L(w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}

We consider the negative logarithm of the likelihood:

E(w_1, \ldots, w_K) = -\ln L(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}

(w_1^*, \ldots, w_K^*) = \arg\min E(w_1, \ldots, w_K)

Page 26:

Multiclass Logistic Regression

p(C_k | x) = y_k(x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}, \qquad E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}

We compute the gradient of the error function with respect

to one of the parameter vectors:

\nabla_{w_j} E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk}\, \nabla_{w_j} \ln y_{nk}

Page 27:

Multiclass Logistic Regression

p(C_k | x) = y_k(x) = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad a_k = w_k^T x + w_{k0}

Thus, we need to compute the derivatives of the softmax

function:

\nabla_{w_j} E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk}\, \nabla_{w_j} \ln y_{nk}

For j = k:

\frac{\partial y_k}{\partial a_k} = \frac{\partial}{\partial a_k} \frac{e^{a_k}}{\sum_j e^{a_j}} = \frac{e^{a_k}}{\sum_j e^{a_j}} - \frac{(e^{a_k})^2}{\left( \sum_j e^{a_j} \right)^2} = y_k - y_k^2 = y_k (1 - y_k)

Page 28:

Multiclass Logistic Regression

p(C_k | x) = y_k(x) = \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad a_k = w_k^T x + w_{k0}

Thus, we need to compute the derivatives of the softmax

function:

\nabla_{w_j} E(w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk}\, \nabla_{w_j} \ln y_{nk}

For j \neq k:

\frac{\partial y_k}{\partial a_j} = \frac{\partial}{\partial a_j} \frac{e^{a_k}}{\sum_j e^{a_j}} = -\frac{e^{a_k}\, e^{a_j}}{\left( \sum_j e^{a_j} \right)^2} = -y_k\, y_j

Page 29:

Multiclass Logistic Regression

Compact expression:

For j = k: \quad \frac{\partial y_k}{\partial a_k} = y_k (1 - y_k)

For j \neq k: \quad \frac{\partial y_k}{\partial a_j} = -y_k\, y_j

Combining the two cases:

\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)

where the I_{kj} are the elements of the identity matrix.
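A small numerical check of the compact expression (illustrative, not from the slides): it builds the full Jacobian \partial y_k / \partial a_j = y_k (I_{kj} - y_j) and compares it with finite differences of the softmax; the test vector a is arbitrary.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([0.5, -1.0, 2.0, 0.0])
y = softmax(a)
jac = np.diag(y) - np.outer(y, y)          # dy_k/da_j = y_k (I_kj - y_j)

h = 1e-6
num = np.empty((len(a), len(a)))
for j in range(len(a)):                    # finite-difference column j
    d = np.zeros_like(a)
    d[j] = h
    num[:, j] = (softmax(a + d) - softmax(a - d)) / (2 * h)

print(np.max(np.abs(jac - num)))           # should be vanishingly small
```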

Page 30:

Multiclass Logistic Regression

p(C_k | x) = y_k(x) = \frac{e^{w_k^T x + w_{k0}}}{\sum_j e^{w_j^T x + w_{j0}}}

Using \frac{\partial y_{nk}}{\partial a_{nj}} = y_{nk} (I_{kj} - y_{nj}) and \sum_{k} t_{nk} = 1:

\nabla_{w_j} E(w_1, \ldots, w_K) = -\nabla_{w_j} \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}

= -\sum_{n=1}^{N} \sum_{k=1}^{K} \frac{t_{nk}}{y_{nk}}\, y_{nk} (I_{kj} - y_{nj})\, x_n

= -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} (I_{kj} - y_{nj})\, x_n

= -\sum_{n=1}^{N} \left( t_{nj} - y_{nj} \sum_{k=1}^{K} t_{nk} \right) x_n

= \sum_{n=1}^{N} (y_{nj} - t_{nj})\, x_n

Page 31:

Multiclass Logistic Regression

\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, x_n

It can be shown that E is a convex function of the parameters. Thus, it has a unique minimum.

For a batch solution, we can use the Newton-Raphson

optimization technique.

On-line solution (stochastic gradient descent):

w_j^{(\tau+1)} = w_j^{(\tau)} - \eta\, \nabla_{w_j} E_n = w_j^{(\tau)} - \eta\, (y_{nj} - t_{nj})\, x_n
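A closing sketch of the multiclass case trained with stochastic gradient descent (illustrative end-to-end example only; the synthetic three-class data, learning rate, and epoch count are made-up choices, and biases are again absorbed by a constant input):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(2)
K, M, N = 3, 2, 150
centers = np.array([[2.0, 0.0], [-2.0, 1.0], [0.0, -2.0]])      # assumed class centers
X = np.vstack([rng.normal(c, 1.0, size=(N // K, M)) for c in centers])
X = np.hstack([X, np.ones((N, 1))])                             # absorb the biases w_{k0}
T = np.zeros((N, K))
T[np.arange(N), np.repeat(np.arange(K), N // K)] = 1.0          # 1-of-K targets t_n

W = np.zeros((K, M + 1))                                        # row j holds w_j
eta = 0.05                                                      # learning rate (assumed)
for _ in range(50):                                             # epochs over the data
    for n in rng.permutation(N):
        y_n = softmax(W @ X[n])                                 # y_nk = p(C_k | x_n)
        W = W - eta * np.outer(y_n - T[n], X[n])                # grad_{w_j} E_n = (y_nj - t_nj) x_n

pred = np.array([np.argmax(softmax(W @ x)) for x in X])
print(np.mean(pred == T.argmax(axis=1)))                        # training accuracy
```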