
Page 1: Dept. of Electrical and Computer Engineering Lecture 5

Theory & Applications of Pattern Recognition


Dept. of Electrical and Computer Engineering
0909.402.02 / 0909.504.04
Lecture 5

Parametric Density Estimation

Parameter Estimation
Maximum Likelihood Estimation

Bayesian Estimation

Background: Pattern Classification, Duda, Hart and Stork, Copyright © John Wiley and Sons, 2001; PR logo Copyright © Robi Polikar, 2001

Theory and Applications of Pattern Recognition © 2003, Robi Polikar, Rowan University, Glassboro, NJ

Page 2: Dept. of Electrical and Computer Engineering Lecture 5


Today in PR

Yes, Bayes is good, but we don’t know the distributions… so what good is it?

Parametric Density Estimation – Parameter Estimation

Maximum likelihood estimation (MLE)
• Ex: The Gaussian case (what else, that is the only one we know how to work with!)

Bias & variance dilemma – Part I

Bayesian estimation (BE)
• Ex: (Yes, you guessed it right) … the Gaussian case

Bayesian learning

Page 3: Dept. of Electrical and Computer Engineering Lecture 5


Parameter Estimation

The Bayes classifier cannot be used if we don’t know the probability distributions.
This is typically the rule, not the exception: in most applications of practical interest, we do not know the underlying distributions.

The distributions can be estimated, if there is sufficient data. Sufficient??? Make that “a ton of”, or better yet… “tons of data”.
Estimating the prior distribution is relatively easy; however, estimating the class-conditional distribution is difficult, and it gets giga-difficult as dimensionality increases…

If we know the form of the distribution, say normal (but of course, what else), but not its parameters, say mean and variance, the problem reduces from distribution estimation to parameter estimation.
Nonparametric density estimation: coming soon to this classroom!

Page 4: Dept. of Electrical and Computer Engineering Lecture 5


Parameter Estimation

Two fundamental techniques that often achieve similar results: (1) Maximum Likelihood Estimation (MLE) and (2) Bayesian Parameter Estimation (BE).
In MLE:

We assume that the parameters θ are fixed, but unknown.

The best estimates of these parameters are then those that maximize the probability of obtaining the samples actually observed; that is, the MLE seeks the solution that best explains the observed data: what value of θ would make the observed data X most likely?

In BE:
The parameters are considered random variables with assumed (known) priors, p(θ).
Observation of the data X allows us to convert these priors into posteriors: we estimate p(θ|X), the probability that the parameter takes a given value, given that the specific data X are observed.

MLE: $\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \; p(\mathbf{x} \mid \boldsymbol{\theta})$

BE: estimate the posterior $p(\boldsymbol{\theta} \mid \mathbf{x})$

Page 5: Dept. of Electrical and Computer Engineering Lecture 5


Maximum Likelihood Estimation

We assume that samples are collected randomly from a given form of distribution whose parameters are unknown. The unknown parameters are denoted by the vector θ.
We partition our training data X into c class-specific subsets D1, …, Dc, assuming that the data in Dj have been drawn randomly according to the distribution p(x|ωj).
If, for example, we know each distribution is normal, p(x|ωj) ~ N(µj, Σj), but we do not know the mean and covariance, then our problem is to estimate

To make the dependence of p(x|ωj) on θj more explicit, we can write p(x|ωj, θj). Furthermore, to make our life easier – and why not – we assume that all θj are independent, i.e., knowing one does not tell us anything about the others. This will allow us to work on each θj separately, without worrying about interdependence of the parameters. We may therefore drop the class subscripts j.

$\boldsymbol{\theta}_j = \begin{bmatrix} \boldsymbol{\mu}_j \\ \boldsymbol{\Sigma}_j \end{bmatrix}$

Page 6: Dept. of Electrical and Computer Engineering Lecture 5


MLE Problem

Determine θj for each class j = 1, …, c, using a training dataset Dj which has n samples drawn independently (i.i.d.) according to p(x|θj). Dropping all subscripts j henceforth, the likelihood of θ with respect to the data D is

$p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$

The value θ̂ that maximizes this “likelihood” is our best estimate of the unknown parameter.
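As a concrete illustration, here is a minimal sketch of evaluating this product likelihood for a candidate θ, assuming a one-dimensional Gaussian with known variance and unknown mean (the data values and function names below are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def likelihood(data, theta, sigma=1.0):
    """p(D|theta) = product of p(x_k|theta) over i.i.d. samples,
    here with p(x|theta) assumed Gaussian with known sigma."""
    return np.prod(norm.pdf(data, loc=theta, scale=sigma))

# Example: data drawn around a true mean of roughly 2
data = np.array([1.8, 2.3, 1.9, 2.1, 2.4])
print(likelihood(data, theta=2.0))   # relatively large
print(likelihood(data, theta=0.0))   # much smaller
```

In practice this product underflows quickly as n grows, which is one more reason to work with the log-likelihood introduced shortly.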

Page 7: Dept. of Electrical and Computer Engineering Lecture 5


The MLE

[Figure: a few of the infinitely many possible Gaussian distributions of a given variance and unknown mean θ from which the observed 1-D data points might have been drawn, together with the likelihood p(D|θ) as a function of the mean. If we have lots of training points, this likelihood becomes very narrow. The θ that maximizes this likelihood, the value we are looking for, also maximizes the logarithm of the likelihood l(θ), since log is a monotonically increasing function.]

Page 8: Dept. of Electrical and Computer Engineering Lecture 5


Log-Likelihood

The θ that maximizes the likelihood function also maximizes its logarithm, which is often easier to work with. The resulting function is called the log-likelihood function:

It should be obvious that maximizing a sum of terms is easier than maximizing a product of terms. The θ that maximizes this function can be obtained by setting its gradient with respect to θ to zero, and solving for θ.

$l(\boldsymbol{\theta}) \equiv \ln p(D \mid \boldsymbol{\theta})$

$p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta}) \;\Rightarrow\; l(\boldsymbol{\theta}) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol{\theta})$

$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \; l(\boldsymbol{\theta})$

$\nabla_{\boldsymbol{\theta}}\, l = \sum_{k=1}^{n} \nabla_{\boldsymbol{\theta}} \ln p(\mathbf{x}_k \mid \boldsymbol{\theta}) = \mathbf{0}$
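Numerically, the same idea can be seen in a short sketch: the log turns the product into a sum, and the maximizer can be found either in closed form (next slide) or, as below, by a simple grid search over candidate θ values (the grid and the data are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(data, theta, sigma=1.0):
    # l(theta) = sum_k ln p(x_k | theta); Gaussian with known sigma assumed
    return np.sum(norm.logpdf(data, loc=theta, scale=sigma))

data = np.array([1.8, 2.3, 1.9, 2.1, 2.4])
grid = np.linspace(0, 4, 401)
theta_hat = grid[np.argmax([log_likelihood(data, t) for t in grid])]
print(theta_hat)       # ~2.1, the grid-search MLE
print(np.mean(data))   # the closed-form MLE for this case (next slide)
```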

Page 9: Dept. of Electrical and Computer Engineering Lecture 5


Special Cases - Normal

1. The Gaussian case, unknown µ: we observe n data points, each of dimension d.

2. The Gaussian Case, unknown µ and Σ

I’m gonna do these on the board so that
a. you are less likely to fall asleep if you are writing at the same time
b. I’ll claim I solved examples in class, and get away with that!

$\ln p(\mathbf{x}_k \mid \boldsymbol{\theta}) = -\frac{1}{2}\ln\!\left[(2\pi)^d \,\lvert\boldsymbol{\Sigma}\rvert\right] - \frac{1}{2}(\mathbf{x}_k - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu})$

$\sum_{k=1}^{n} \nabla_{\boldsymbol{\mu}} \ln p(\mathbf{x}_k \mid \boldsymbol{\mu}) = \mathbf{0}$

$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k \qquad\qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^T$
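A minimal numpy sketch of these closed-form estimates (variable names are my own; the code simply applies the two formulas above to a data matrix X of shape n × d):

```python
import numpy as np

def gaussian_mle(X):
    """MLE for a multivariate Gaussian: sample mean and the
    1/n (biased) covariance, matching the formulas above."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    sigma_hat = (centered.T @ centered) / n   # note: 1/n, not 1/(n-1)
    return mu_hat, sigma_hat

# Sanity check on synthetic data
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 3], cov=[[2, 0.5], [0.5, 1]], size=5000)
mu_hat, sigma_hat = gaussian_mle(X)
print(mu_hat)      # close to [0, 3]
print(sigma_hat)   # close to [[2, 0.5], [0.5, 1]]
```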

Page 10: Dept. of Electrical and Computer Engineering Lecture 5


Bias & Variance

How good and reliable are these estimates? We assess the goodness and reliability through:
Bias: How close is the estimate to the true value? (goodness)
Variance: How much would this estimate change had we tried this again with a different dataset also drawn from the same distribution? (reliability)

The variance is the range of values we obtain by estimating θ in different trials. The bias is how far the mean of these different estimates is from the true value.

Ideally, we want both bias and variance of our estimate to be zero (or as small as possible)

And if you guessed we can usually reduce one of them at a time, at the expense of the other, you are RIGHT!
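One way to make these definitions concrete is a small Monte Carlo: draw many datasets from the same distribution, compute the estimate on each, and look at how far the average estimate is from the truth (bias) and how much the estimates spread (variance). A minimal sketch, with all numbers purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu, true_sigma, n, trials = 5.0, 2.0, 20, 10_000

# Estimate the mean on many independent datasets of size n
estimates = np.array([rng.normal(true_mu, true_sigma, n).mean()
                      for _ in range(trials)])

bias = estimates.mean() - true_mu   # how far off we are on average
variance = estimates.var()          # how much the estimate fluctuates
print(bias, variance)               # bias ~ 0, variance ~ sigma^2 / n
```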

Page 11: Dept. of Electrical and Computer Engineering Lecture 5


Bias & Variance of MLE

For the estimates to be unbiased, we want the expected value (average) of all estimates we make to be equal to the true value. For the MLE of the mean:

$E[\hat{\boldsymbol{\mu}}] = E\!\left[\frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k\right] = \frac{1}{n}\sum_{k=1}^{n} E[\mathbf{x}_k] = \boldsymbol{\mu}$

Hence the MLE of the mean is an unbiased estimate of the true mean. For the variance:

$E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$

Hence the MLE of the variance is a biased estimate of the true variance, in part due to using µ̂ instead of the true µ. Note, however, that as n→∞, the MLE of the variance approaches the true variance! This justifies the use of the MLE. An unbiased estimate of the variance (or covariance matrix) is

$\hat{\boldsymbol{\Sigma}} = \frac{1}{n-1}\sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^T$
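The (n−1)/n factor above is easy to verify empirically; here is a short sketch (synthetic data, illustrative parameters) comparing the 1/n MLE of the variance with the 1/(n−1) unbiased estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
true_var, n, trials = 4.0, 10, 100_000

datasets = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
mle_var = datasets.var(axis=1, ddof=0)        # 1/n divisor (MLE, biased)
unbiased_var = datasets.var(axis=1, ddof=1)   # 1/(n-1) divisor (unbiased)

print(mle_var.mean())       # ~ (n-1)/n * 4 = 3.6
print(unbiased_var.mean())  # ~ 4.0
```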

Page 12: Dept. of Electrical and Computer Engineering Lecture 5


MLE: After Thoughts

So there are two different estimates of the covariance matrix; which one is correct? We prefer the unbiased one, but the unbiased estimate, too, is just that: an estimate!
There is no such thing as “more correct”; these are simply two different estimates. As n→∞, the two estimates approach each other (that is good!)

So far we have assumed that we have some knowledge of the underlying distribution, but what if our assumption is wrong?

E.g., we assume that p(x|ωj) ~ N(µj, 10), where in fact the true variance is, say, 4. Would the classifier we come up with using this estimate be the best one, at least among all classifiers derived from N(µj, 10)?

• Unfortunately, the answer is NO!
• Therefore: if the assumed model is wrong, the resulting classifier will be non-optimal.

Page 13: Dept. of Electrical and Computer Engineering Lecture 5


Bayesian Estimation

Unlike the MLE, where we assume that the unknown parameters have point values, BE assumes that the unknown parameters come from a distribution with known priors.
Therefore, BE provides a distribution of values for θ rather than a point value. BE provides more information, but it is often much more difficult to compute.
For most practical applications, if the assumptions are correct, the MLE gives good results.
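To contrast the single number returned by the MLE with the full distribution returned by BE, here is a minimal grid-based sketch of the Bayesian idea: start from an assumed prior p(θ), multiply by the likelihood of the observed data, and normalize to obtain the posterior p(θ|X). The prior and data below are purely illustrative assumptions, not the analytic Gaussian case treated later:

```python
import numpy as np
from scipy.stats import norm

data = np.array([1.8, 2.3, 1.9, 2.1, 2.4])   # observed samples; sigma assumed known = 1
theta_grid = np.linspace(-2, 6, 801)

prior = norm.pdf(theta_grid, loc=0.0, scale=2.0)            # assumed prior p(theta)
log_lik = np.array([norm.logpdf(data, loc=t, scale=1.0).sum()
                    for t in theta_grid])                    # ln p(X|theta)
posterior = prior * np.exp(log_lik - log_lik.max())          # p(theta|X), unnormalized
posterior /= np.trapz(posterior, theta_grid)                 # normalize on the grid

# BE returns this whole curve; its peak typically lies between
# the prior mean (0) and the MLE (the sample mean, 2.1)
print(theta_grid[np.argmax(posterior)])
```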

Page 14: Dept. of Electrical and Computer Engineering Lecture 5


Bayesian Estimation