ECE 8443 – Pattern Recognition
LECTURE 06: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION
• Objectives: Bias in ML Estimates, Bayesian Estimation, Example
• Resources: D.H.S.: Chapter 3 (Part 2); Wiki: Maximum Likelihood; M.Y.: Maximum Likelihood Tutorial; J.O.S.: Bayesian Parameter Estimation; J.H.: Euro Coin
ECE 8443: Lecture 06, Slide 2
• Consider the case where only the mean, θ = μ, is unknown:
$$\ln p(\mathbf{x}_k|\mu) = -\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)$$
which implies:
$$\nabla_{\mu}\ln p(\mathbf{x}_k|\mu) = \Sigma^{-1}(\mathbf{x}_k-\mu)$$
because:
$$\nabla_{\mu}\left[-\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)\right] = \Sigma^{-1}(\mathbf{x}_k-\mu)$$
• The ML solution must satisfy:
$$\sum_{k=1}^{n}\nabla_{\mu}\ln p(\mathbf{x}_k|\mu) = 0$$
Gaussian Case: Unknown Mean (Review)
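• As a concrete illustration of the single-sample log likelihood and its gradient above, here is a minimal NumPy sketch (the input values are made up for illustration, not taken from the lecture):

```python
import numpy as np

def gaussian_log_likelihood_and_grad(x, mu, sigma):
    """ln p(x|mu) for a d-dimensional Gaussian N(mu, sigma) and its
    gradient with respect to mu, which is sigma^{-1} (x - mu)."""
    d = len(x)
    sigma_inv = np.linalg.inv(sigma)
    diff = x - mu
    log_p = (-0.5 * np.log((2.0 * np.pi) ** d * np.linalg.det(sigma))
             - 0.5 * diff @ sigma_inv @ diff)
    grad_mu = sigma_inv @ diff
    return log_p, grad_mu

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_log_likelihood_and_grad(x, mu, sigma))
```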
ECE 8443: Lecture 06, Slide 3
• Substituting into the expression for the total likelihood:
$$\nabla_{\mu}\,l = \sum_{k=1}^{n}\Sigma^{-1}(\mathbf{x}_k-\hat{\mu}) = 0$$
• Rearranging terms:
$$\sum_{k=1}^{n}\Sigma^{-1}(\mathbf{x}_k-\hat{\mu}) = 0 \;\Rightarrow\; \sum_{k=1}^{n}(\mathbf{x}_k-\hat{\mu}) = 0 \;\Rightarrow\; \sum_{k=1}^{n}\mathbf{x}_k - n\hat{\mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$
• Significance???
Gaussian Case: Unknown Mean (Review)
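• A quick numerical check of this result, as a minimal sketch assuming NumPy (the synthetic data and variable names are illustrative): the ML estimate of the mean is just the sample mean of the training data.

```python
import numpy as np

# Synthetic data: n samples from a 2-D Gaussian with a known true mean.
rng = np.random.default_rng(0)
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mean, true_cov, size=1000)

# ML estimate of the mean: the sample mean (1/n) * sum_k x_k.
mu_hat = X.mean(axis=0)
print("true mean:  ", true_mean)
print("ML estimate:", mu_hat)
```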
ECE 8443: Lecture 06, Slide 4
• Let θ = [θ₁, θ₂]ᵗ = [μ, σ²]ᵗ. The log likelihood of a SINGLE point is:
$$\ln p(x_k|\theta) = -\frac{1}{2}\ln(2\pi\theta_2) - \frac{1}{2\theta_2}(x_k-\theta_1)^2$$
and its gradient is:
$$\nabla_{\theta}\,l = \nabla_{\theta}\ln p(x_k|\theta) = \begin{bmatrix} \dfrac{1}{\theta_2}(x_k-\theta_1) \\[2mm] -\dfrac{1}{2\theta_2} + \dfrac{(x_k-\theta_1)^2}{2\theta_2^2} \end{bmatrix}$$
• The full likelihood leads to:
$$\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2}(x_k-\hat{\theta}_1) = 0 \qquad\text{and}\qquad -\sum_{k=1}^{n}\frac{1}{2\hat{\theta}_2} + \sum_{k=1}^{n}\frac{(x_k-\hat{\theta}_1)^2}{2\hat{\theta}_2^2} = 0$$
Gaussian Case: Unknown Mean and Variance (Review)
ECE 8443: Lecture 06, Slide 5
• This leads to these equations:
$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k \qquad\qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2$$
• In the multivariate case:
$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k \qquad\qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t$$
• The true covariance is the expected value of the matrix $(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t$, which is a familiar result.
Gaussian Case: Unknown Mean and Variance (Review)
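• A minimal sketch of these multivariate ML estimates, assuming NumPy (the variable names and synthetic data are illustrative):

```python
import numpy as np

def ml_gaussian_estimates(X):
    """ML estimates for a multivariate Gaussian: the sample mean and the
    covariance that divides by n (not n - 1); one row of X per sample."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)           # (1/n) * sum_k x_k
    D = X - mu_hat                    # centered samples
    sigma_hat = (D.T @ D) / n         # (1/n) * sum_k (x_k - mu_hat)(x_k - mu_hat)^t
    return mu_hat, sigma_hat

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 3.0], [[1.0, 0.3], [0.3, 2.0]], size=500)
mu_hat, sigma_hat = ml_gaussian_estimates(X)
print(mu_hat)
print(sigma_hat)                      # compare with np.cov(X.T, bias=True)
```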
ECE 8443: Lecture 06, Slide 6
• Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let’s start with a few simple results we will need later.
• Expected value of the ML estimate of the mean:
$$E[\hat{\mu}] = E\left[\frac{1}{n}\sum_{i=1}^{n}x_i\right] = \frac{1}{n}\sum_{i=1}^{n}E[x_i] = \mu$$
• Variance of the ML estimate of the mean:
$$\mathrm{var}[\hat{\mu}] = E[\hat{\mu}^2] - (E[\hat{\mu}])^2 = E\left[\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right)\left(\frac{1}{n}\sum_{j=1}^{n}x_j\right)\right] - \mu^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}E[x_i x_j] - \mu^2$$
Convergence of the Mean (Review)
ECE 8443: Lecture 06, Slide 7
• The expected value of $x_i x_j$ will be $\mu^2$ for $j \neq i$, since the two random variables are independent.
• The expected value of $x_i^2$ will be $\mu^2 + \sigma^2$.
• Hence, in the summation above, we have $n^2 - n$ terms with expected value $\mu^2$ and $n$ terms with expected value $\mu^2 + \sigma^2$.
• Thus,
$$\mathrm{var}[\hat{\mu}] = \frac{1}{n^2}\left[(n^2-n)\mu^2 + n(\mu^2+\sigma^2)\right] - \mu^2 = \frac{\sigma^2}{n}$$
which implies:
$$E[\hat{\mu}^2] = \mathrm{var}[\hat{\mu}] + (E[\hat{\mu}])^2 = \frac{\sigma^2}{n} + \mu^2$$
• We see that the variance of the estimate goes to zero as $n$ goes to infinity, so our estimate converges to the true mean (the error goes to zero).
Variance of the ML Estimate of the Mean (Review)
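• A small Monte Carlo sketch of the σ²/n behavior derived above, assuming NumPy (the constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 5.0, 2.0                     # true mean and standard deviation

for n in (10, 100, 1000):
    # Many independent samples of size n; compute the ML (sample) mean of each.
    means = rng.normal(mu, sigma, size=(10000, n)).mean(axis=1)
    print(n, means.var(), sigma**2 / n)  # empirical var[mu_hat] vs. sigma^2 / n
```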
ECE 8443: Lecture 06, Slide 8
• The basic variance relationship for a single observation:
$$\sigma^2 = E[(x - E[x])^2] = E[x^2] - 2E[x]E[x] + (E[x])^2 = E[x^2] - \mu^2$$
Note that this implies:
$$E[x_i^2] = \sigma^2 + \mu^2 \qquad\Rightarrow\qquad \sum_{i=1}^{n}E[x_i^2] = n(\sigma^2 + \mu^2)$$
• Now we can combine these results. Recall our expression for the ML estimate of the variance:
$$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2\right]$$
• We will need one more result:
Variance Relationships
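• A one-line numerical check of the relationship E[x²] = σ² + μ², assuming NumPy (the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=1_000_000)
print((x ** 2).mean(), sigma**2 + mu**2)   # E[x^2] is approximately sigma^2 + mu^2
```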
ECE 8443: Lecture 06, Slide 9
• Expand the covariance and simplify:
$$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right] = \frac{1}{n}\sum_{i=1}^{n}E[x_i^2 - 2x_i\hat{\mu} + \hat{\mu}^2] = \frac{1}{n}\sum_{i=1}^{n}\left(E[x_i^2] - 2E[x_i\hat{\mu}] + E[\hat{\mu}^2]\right)$$
• One more intermediate term to derive:
$$E[\hat{\mu}x_i] = E\left[\left(\frac{1}{n}\sum_{j=1}^{n}x_j\right)x_i\right] = \frac{1}{n}\sum_{j=1}^{n}E[x_i x_j] = \frac{1}{n}\left((n-1)\mu^2 + (\mu^2+\sigma^2)\right) = \mu^2 + \frac{\sigma^2}{n}$$
Covariance Expansion
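• A quick Monte Carlo check of the intermediate result E[μ̂ xᵢ] = μ² + σ²/n, assuming NumPy (the constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n = 1.0, 2.0, 8
X = rng.normal(mu, sigma, size=(200000, n))
mu_hat = X.mean(axis=1)                                  # ML estimate of the mean per trial
print((mu_hat * X[:, 0]).mean(), mu**2 + sigma**2 / n)   # E[mu_hat * x_i] vs. mu^2 + sigma^2/n
```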
ECE 8443: Lecture 06, Slide 10
• Substitute our previously derived expression for the second term:
$$E[\hat{\sigma}^2] = \frac{1}{n}\sum_{i=1}^{n}\left(E[x_i^2] - 2E[x_i\hat{\mu}] + E[\hat{\mu}^2]\right) = \frac{1}{n}\sum_{i=1}^{n}\left[(\mu^2+\sigma^2) - 2\left(\mu^2+\frac{\sigma^2}{n}\right) + \left(\mu^2+\frac{\sigma^2}{n}\right)\right]$$
$$= \frac{1}{n}\sum_{i=1}^{n}\left(\sigma^2 - \frac{\sigma^2}{n}\right) = \left(1-\frac{1}{n}\right)\sigma^2 = \frac{n-1}{n}\,\sigma^2$$
Biased Variance Estimate
ECE 8443: Lecture 06, Slide 11
• Therefore, the ML estimate is biased:
$$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$
However, the ML estimate converges (and is MSE).
• An unbiased estimator is:
$$C = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i-\hat{\boldsymbol{\mu}})(\mathbf{x}_i-\hat{\boldsymbol{\mu}})^t$$
• These are related by:
$$\hat{\Sigma} = \frac{n-1}{n}\,C$$
which is asymptotically unbiased. See Burl, AJWills and AWM for excellent examples and explanations of the details of this derivation.
Expectation Simplification
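• A small numerical sketch of the bias, assuming NumPy (the constants are illustrative): the n-divisor (ML) estimate averages to ((n-1)/n)σ², while the (n-1)-divisor estimate averages to σ².

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 0.0, 3.0, 5                  # a small n makes the bias easy to see
trials = rng.normal(mu, sigma, size=(100000, n))

biased = trials.var(axis=1, ddof=0)         # divides by n     (the ML estimate)
unbiased = trials.var(axis=1, ddof=1)       # divides by n - 1 (the unbiased estimate)

print(biased.mean(), (n - 1) / n * sigma**2)   # approximately ((n-1)/n) * sigma^2
print(unbiased.mean(), sigma**2)               # approximately sigma^2
```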
ECE 8443: Lecture 06, Slide 12
• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ωi), and the class-conditional densities, p(x|ωi).
• Bayes: treat the parameters as random variables having some known prior distribution. Observations of samples convert this to a posterior.
• Bayesian learning: sharpen the a posteriori density, causing it to peak near the true value.
• Supervised vs. unsupervised: do we know the class assignments of the training data?
• Bayesian estimation and ML estimation produce very similar results in many cases.
• Reduces statistical inference (prior knowledge or beliefs about the world) to probabilities.
Introduction to Bayesian Parameter Estimation
ECE 8443: Lecture 06, Slide 13
• Posterior probabilities, P(ωi|x), are central to Bayesian classification.
• Bayes formula allows us to compute P(ωi|x) from the priors, P(ωi), and the likelihood, p(x|ωi).
• But what if the priors and class-conditional densities are unknown?
• The answer is that we can compute the posterior, P(ωi|x), using all of the information at our disposal (e.g., training data).
• For a training set, D, Bayes formula becomes:
$$P(\omega_i|\mathbf{x},D) = \frac{p(\mathbf{x}|\omega_i,D)\,P(\omega_i|D)}{\sum_{j=1}^{c}p(\mathbf{x}|\omega_j,D)\,P(\omega_j|D)} \qquad\left(\frac{\text{likelihood}\times\text{prior}}{\text{evidence}}\right)$$
• We assume priors are known: P(ωi|D) = P(ωi).
• Also, assume functional independence: the samples in Di have no influence on p(x|ωj, D) if i ≠ j. This gives:
$$P(\omega_i|\mathbf{x},D) = \frac{p(\mathbf{x}|\omega_i,D_i)\,P(\omega_i)}{\sum_{j=1}^{c}p(\mathbf{x}|\omega_j,D_j)\,P(\omega_j)}$$
Class-Conditional Densities
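• A hypothetical sketch of how this posterior computation looks in code, assuming NumPy; the per-class densities and priors below are invented placeholders standing in for densities estimated from D₁ and D₂:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density, used here as a stand-in class-conditional."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def class_posteriors(x, class_densities, priors):
    """P(omega_i | x, D): per-class likelihoods times priors, normalized by the evidence."""
    likelihoods = np.array([p(x) for p in class_densities])
    joint = likelihoods * np.array(priors)
    return joint / joint.sum()

# Two hypothetical classes whose densities would have been estimated from D_1 and D_2.
densities = [lambda x: gaussian_pdf(x, 0.0, 1.0),
             lambda x: gaussian_pdf(x, 3.0, 2.0)]
print(class_posteriors(1.0, densities, priors=[0.6, 0.4]))
```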
ECE 8443: Lecture 06, Slide 14
• Assume the parametric form of the evidence, p(x), is known: p(x|θ).
• Any information we have about θ prior to collecting samples is contained in a known prior density p(θ).
• Observation of samples converts this to a posterior, p(θ|D), which we hope is peaked around the true value of θ.
• Our goal is to estimate a parameter vector:
$$p(\mathbf{x}|D) = \int p(\mathbf{x},\theta|D)\,d\theta$$
• We can write the joint distribution as a product:
$$p(\mathbf{x}|D) = \int p(\mathbf{x}|\theta)\,p(\theta|D)\,d\theta$$
because the samples are drawn independently.
• This equation links the class-conditional density p(x|D) to the posterior, p(θ|D). But numerical solutions are typically required!
The Parameter Distribution
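• A minimal numerical sketch of the integral p(x|D) = ∫ p(x|θ) p(θ|D) dθ, assuming NumPy and a hypothetical Gaussian posterior over θ (all values are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Grid over the unknown parameter theta (here: the unknown mean).
theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]

# Hypothetical posterior p(theta|D), e.g. N(1.0, 0.2) after observing some data.
posterior = gaussian_pdf(theta, 1.0, 0.2)

def p_x_given_D(x, noise_var=1.0):
    # p(x|D) = integral of p(x|theta) * p(theta|D) d(theta), done on the grid.
    return np.sum(gaussian_pdf(x, theta, noise_var) * posterior) * dtheta

# For Gaussians this should match N(x | 1.0, noise_var + 0.2).
print(p_x_given_D(0.5), gaussian_pdf(0.5, 1.0, 1.2))
```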
ECE 8443: Lecture 06, Slide 15
• Case: only the mean is unknown, p(x|μ) ∼ N(μ, σ²).
• Known prior density: p(μ) ∼ N(μ₀, σ₀²).
• Using Bayes formula:
$$p(\mu|D) = \frac{p(D|\mu)\,p(\mu)}{\int p(D|\mu)\,p(\mu)\,d\mu} = \alpha\,p(D|\mu)\,p(\mu) = \alpha\prod_{k=1}^{n}p(x_k|\mu)\,p(\mu)$$
• Rationale: once a value of μ is known, the density for x is completely known. α is a normalization factor that depends on the data, D.
Univariate Gaussian Case
ECE 8443: Lecture 06, Slide 16
• Applying our Gaussian assumptions:
$$p(\mu|D) = \alpha\prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^2\right]\cdot\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right]$$
$$= \alpha'\exp\!\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{x_k-\mu}{\sigma}\right)^2 + \left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right)\right]$$
$$= \alpha''\exp\!\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n}x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$
Univariate Gaussian Case
ECE 8443: Lecture 06, Slide 17
• Now we need to work this into a simpler form:
$$p(\mu|D) = \alpha''\exp\!\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n}x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$
$$= \alpha''\exp\!\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$
where
$$\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n}x_k$$
Univariate Gaussian Case (Cont.)
ECE 8443: Lecture 06, Slide 18
Univariate Gaussian Case (Cont.)
• p(μ|D) is an exponential of a quadratic function, which makes it a normal distribution. Because this is true for any n, it is referred to as a reproducing density.
• p(μ) is referred to as a conjugate prior.
• Write p(μ|D) ∼ N(μₙ, σₙ²):
$$p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right]$$
• Expand the quadratic term:
$$p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu^2}{\sigma_n^2} - 2\frac{\mu_n}{\sigma_n^2}\mu + \frac{\mu_n^2}{\sigma_n^2}\right)\right]$$
• Equate coefficients of our two functions:
$$\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left[\frac{1}{\sigma_n^2}\mu^2 - 2\frac{\mu_n}{\sigma_n^2}\mu\right]\right] = \alpha''\exp\!\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$
ECE 8443: Lecture 06, Slide 19
Univariate Gaussian Case (Cont.)
• Rearrange terms so that the dependencies on μ are clear:
$$\exp\!\left[-\frac{1}{2}\left[\frac{1}{\sigma_n^2}\mu^2 - 2\frac{\mu_n}{\sigma_n^2}\mu\right]\right] \propto \exp\!\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$
• Associate terms related to μ² and μ:
$$\mu^2:\quad \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad\qquad \mu:\quad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}$$
• There is actually a third equation involving the terms not related to μ (the remaining constant factors on the two sides must also be equal), but we can ignore this since it is not a function of μ and is a complicated equation to solve.
ECE 8443: Lecture 06, Slide 20
• Two equations and two unknowns. Solve for μₙ and σₙ². First, solve for σₙ²:
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} = \frac{n\sigma_0^2 + \sigma^2}{\sigma^2\sigma_0^2} \qquad\Rightarrow\qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$
• Next, solve for μₙ:
$$\mu_n = \sigma_n^2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right) = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)$$
• Summarizing:
$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 \qquad\qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$
Univariate Gaussian Case (Cont.)
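• A minimal sketch of these update equations, assuming NumPy (the prior, noise variance, and data are illustrative):

```python
import numpy as np

def posterior_mean_variance(x, mu0, var0, var):
    """Posterior N(mu_n, var_n) for the mean of a Gaussian with known variance
    `var`, given the prior N(mu0, var0) and the observed samples x."""
    n = len(x)
    mu_hat = np.mean(x)                                       # sample mean
    mu_n = (n * var0 * mu_hat + var * mu0) / (n * var0 + var)
    var_n = (var0 * var) / (n * var0 + var)
    return mu_n, var_n

# Example: true mean 2.0, known variance 1.0, broad prior centered at 0.
rng = np.random.default_rng(4)
x = rng.normal(2.0, 1.0, size=50)
print(posterior_mean_variance(x, mu0=0.0, var0=10.0, var=1.0))
```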
ECE 8443: Lecture 06, Slide 21
• μₙ represents our best guess after n samples.
• σₙ² represents our uncertainty about this guess.
• σₙ² approaches σ²/n for large n; each additional observation decreases our uncertainty.
• The posterior, p(|D), becomes more sharply peaked as n grows large. This is known as Bayesian learning.
Bayesian Learning
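• A tiny illustration of this shrinkage, reusing the σₙ² formula from the previous slide (the values are made up):

```python
# Posterior variance shrinks toward sigma^2 / n as more samples arrive (illustrative values).
var0, var = 10.0, 1.0          # prior variance sigma_0^2 and known noise variance sigma^2
for n in (1, 10, 100, 1000):
    var_n = (var0 * var) / (n * var0 + var)
    print(n, var_n, var / n)   # var_n approaches sigma^2 / n for large n
```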
ECE 8443: Lecture 06, Slide 22
• Getting ahead a bit, let’s see how we can put these ideas to work on a simple example due to David MacKay, and explained by Jon Hamaker.
“The Euro Coin”
ECE 8443: Lecture 06, Slide 23
Summary
• Review of maximum likelihood parameter estimation in the Gaussian case, with an emphasis on convergence and bias of the estimates.
• Introduction of Bayesian parameter estimation.
• The role of the class-conditional distribution in a Bayesian estimate.
• Estimation of the posterior and probability density function assuming the only unknown parameter is the mean, and the conditional density of the “features” given the mean, p(x|μ), can be modeled as a Gaussian distribution.