ECE 8443 – Pattern Recognition
LECTURE 06: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION
• Objectives: Bias in ML Estimates, Bayesian Estimation, Example
• Resources: D.H.S.: Chapter 3 (Part 2); Wiki: Maximum Likelihood; M.Y.: Maximum Likelihood Tutorial; J.O.S.: Bayesian Parameter Estimation; J.H.: Euro Coin
ECE 8443: Lecture 06, Slide 2
• Consider the case where only the mean, θ = μ, is unknown:
$$\ln p(\mathbf{x}_k|\mu) = -\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)$$
which implies:
$$\nabla_{\mu}\ln p(\mathbf{x}_k|\mu) = \Sigma^{-1}(\mathbf{x}_k-\mu)$$
because:
$$\nabla_{\mu}\left[-\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)\right] = \Sigma^{-1}(\mathbf{x}_k-\mu)$$
• The ML solution must satisfy:
$$\sum_{k=1}^{n}\nabla_{\mu}\ln p(\mathbf{x}_k|\mu) = 0$$
Gaussian Case: Unknown Mean (Review)
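• As a concrete illustration of the single-sample log likelihood and its gradient above, here is a minimal NumPy sketch (the input values are made up for illustration, not taken from the lecture):

```python
import numpy as np

def gaussian_log_likelihood_and_grad(x, mu, sigma):
    """ln p(x|mu) for a d-dimensional Gaussian N(mu, sigma) and its
    gradient with respect to mu, which is sigma^{-1} (x - mu)."""
    d = len(x)
    sigma_inv = np.linalg.inv(sigma)
    diff = x - mu
    log_p = (-0.5 * np.log((2.0 * np.pi) ** d * np.linalg.det(sigma))
             - 0.5 * diff @ sigma_inv @ diff)
    grad_mu = sigma_inv @ diff
    return log_p, grad_mu

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_log_likelihood_and_grad(x, mu, sigma))
```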
ECE 8443: Lecture 06, Slide 3
• Substituting into the expression for the total likelihood:
$$\nabla_{\mu}\,l = \sum_{k=1}^{n}\Sigma^{-1}(\mathbf{x}_k-\hat{\mu}) = 0$$
• Rearranging terms:
$$\sum_{k=1}^{n}\Sigma^{-1}(\mathbf{x}_k-\hat{\mu}) = 0 \;\Rightarrow\; \sum_{k=1}^{n}(\mathbf{x}_k-\hat{\mu}) = 0 \;\Rightarrow\; \sum_{k=1}^{n}\mathbf{x}_k - n\hat{\mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$
• Significance???
Gaussian Case: Unknown Mean (Review)
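• A quick numerical check of this result, as a minimal sketch assuming NumPy (the synthetic data and variable names are illustrative): the ML estimate of the mean is just the sample mean of the training data.

```python
import numpy as np

# Synthetic data: n samples from a 2-D Gaussian with a known true mean.
rng = np.random.default_rng(0)
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mean, true_cov, size=1000)

# ML estimate of the mean: the sample mean (1/n) * sum_k x_k.
mu_hat = X.mean(axis=0)
print("true mean:  ", true_mean)
print("ML estimate:", mu_hat)
```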
ECE 8443: Lecture 06, Slide 4
• Let θ = [θ₁, θ₂]ᵗ = [μ, σ²]ᵗ. The log likelihood of a SINGLE point is:
$$\ln p(x_k|\theta) = -\frac{1}{2}\ln(2\pi\theta_2) - \frac{1}{2\theta_2}(x_k-\theta_1)^2$$
and its gradient is:
$$\nabla_{\theta}\,l = \nabla_{\theta}\ln p(x_k|\theta) = \begin{bmatrix} \dfrac{1}{\theta_2}(x_k-\theta_1) \\[2mm] -\dfrac{1}{2\theta_2} + \dfrac{(x_k-\theta_1)^2}{2\theta_2^2} \end{bmatrix}$$
• The full likelihood leads to:
$$\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2}(x_k-\hat{\theta}_1) = 0 \qquad\text{and}\qquad -\sum_{k=1}^{n}\frac{1}{2\hat{\theta}_2} + \sum_{k=1}^{n}\frac{(x_k-\hat{\theta}_1)^2}{2\hat{\theta}_2^2} = 0$$
Gaussian Case: Unknown Mean and Variance (Review)
ECE 8443: Lecture 06, Slide 5
• This leads to these equations:
$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k \qquad\qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2$$
• In the multivariate case:
$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k \qquad\qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t$$
• The true covariance is the expected value of the matrix $(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t$, which is a familiar result.
Gaussian Case: Unknown Mean and Variance (Review)
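• A minimal sketch of these multivariate ML estimates, assuming NumPy (the variable names and synthetic data are illustrative):

```python
import numpy as np

def ml_gaussian_estimates(X):
    """ML estimates for a multivariate Gaussian: the sample mean and the
    covariance that divides by n (not n - 1); one row of X per sample."""
    n = X.shape[0]
    mu_hat = X.mean(axis=0)           # (1/n) * sum_k x_k
    D = X - mu_hat                    # centered samples
    sigma_hat = (D.T @ D) / n         # (1/n) * sum_k (x_k - mu_hat)(x_k - mu_hat)^t
    return mu_hat, sigma_hat

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 3.0], [[1.0, 0.3], [0.3, 2.0]], size=500)
mu_hat, sigma_hat = ml_gaussian_estimates(X)
print(mu_hat)
print(sigma_hat)                      # compare with np.cov(X.T, bias=True)
```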
ECE 8443: Lecture 06, Slide 6
• Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let’s start with a few simple results we will need later.
• Expected value of the ML estimate of the mean:
$$E[\hat{\mu}] = E\left[\frac{1}{n}\sum_{i=1}^{n}x_i\right] = \frac{1}{n}\sum_{i=1}^{n}E[x_i] = \mu$$
• Variance of the ML estimate of the mean:
$$\mathrm{var}[\hat{\mu}] = E[\hat{\mu}^2] - (E[\hat{\mu}])^2 = E\left[\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right)\left(\frac{1}{n}\sum_{j=1}^{n}x_j\right)\right] - \mu^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}E[x_i x_j] - \mu^2$$
Convergence of the Mean (Review)
ECE 8443: Lecture 06, Slide 7
• The expected value of $x_i x_j$ will be $\mu^2$ for $j \neq i$, since the two random variables are independent.
• The expected value of $x_i^2$ will be $\mu^2 + \sigma^2$.
• Hence, in the summation above, we have $n^2 - n$ terms with expected value $\mu^2$ and $n$ terms with expected value $\mu^2 + \sigma^2$.
• Thus,
$$\mathrm{var}[\hat{\mu}] = \frac{1}{n^2}\left[(n^2-n)\mu^2 + n(\mu^2+\sigma^2)\right] - \mu^2 = \frac{\sigma^2}{n}$$
which implies:
$$E[\hat{\mu}^2] = \mathrm{var}[\hat{\mu}] + (E[\hat{\mu}])^2 = \frac{\sigma^2}{n} + \mu^2$$
• We see that the variance of the estimate goes to zero as $n$ goes to infinity, so our estimate converges to the true mean (the error goes to zero).
Variance of the ML Estimate of the Mean (Review)
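• A small Monte Carlo sketch of the σ²/n behavior derived above, assuming NumPy (the constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 5.0, 2.0                     # true mean and standard deviation

for n in (10, 100, 1000):
    # Many independent samples of size n; compute the ML (sample) mean of each.
    means = rng.normal(mu, sigma, size=(10000, n)).mean(axis=1)
    print(n, means.var(), sigma**2 / n)  # empirical var[mu_hat] vs. sigma^2 / n
```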
ECE 8443: Lecture 06, Slide 8
• The basic variance relationship for a single observation:
$$\sigma^2 = E[(x - E[x])^2] = E[x^2] - 2E[x]E[x] + (E[x])^2 = E[x^2] - \mu^2$$
Note that this implies:
$$E[x_i^2] = \sigma^2 + \mu^2 \qquad\Rightarrow\qquad \sum_{i=1}^{n}E[x_i^2] = n(\sigma^2 + \mu^2)$$
• Now we can combine these results. Recall our expression for the ML estimate of the variance:
$$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2\right]$$
• We will need one more result:
Variance Relationships
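• A one-line numerical check of the relationship E[x²] = σ² + μ², assuming NumPy (the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=1_000_000)
print((x ** 2).mean(), sigma**2 + mu**2)   # E[x^2] is approximately sigma^2 + mu^2
```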
ECE 8443: Lecture 06, Slide 9
• Expand the covariance and simplify:
$$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right] = \frac{1}{n}\sum_{i=1}^{n}E[x_i^2 - 2x_i\hat{\mu} + \hat{\mu}^2] = \frac{1}{n}\sum_{i=1}^{n}\left(E[x_i^2] - 2E[x_i\hat{\mu}] + E[\hat{\mu}^2]\right)$$
• One more intermediate term to derive:
$$E[\hat{\mu}x_i] = E\left[\left(\frac{1}{n}\sum_{j=1}^{n}x_j\right)x_i\right] = \frac{1}{n}\sum_{j=1}^{n}E[x_i x_j] = \frac{1}{n}\left((n-1)\mu^2 + (\mu^2+\sigma^2)\right) = \mu^2 + \frac{\sigma^2}{n}$$
Covariance Expansion
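• A quick Monte Carlo check of the intermediate result E[μ̂ xᵢ] = μ² + σ²/n, assuming NumPy (the constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n = 1.0, 2.0, 8
X = rng.normal(mu, sigma, size=(200000, n))
mu_hat = X.mean(axis=1)                                  # ML estimate of the mean per trial
print((mu_hat * X[:, 0]).mean(), mu**2 + sigma**2 / n)   # E[mu_hat * x_i] vs. mu^2 + sigma^2/n
```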
ECE 8443: Lecture 06, Slide 10
• Substitute our previously derived expression for the second term:
$$E[\hat{\sigma}^2] = \frac{1}{n}\sum_{i=1}^{n}\left(E[x_i^2] - 2E[x_i\hat{\mu}] + E[\hat{\mu}^2]\right) = \frac{1}{n}\sum_{i=1}^{n}\left[(\mu^2+\sigma^2) - 2\left(\mu^2+\frac{\sigma^2}{n}\right) + \left(\mu^2+\frac{\sigma^2}{n}\right)\right]$$
$$= \frac{1}{n}\sum_{i=1}^{n}\left(\sigma^2 - \frac{\sigma^2}{n}\right) = \left(1-\frac{1}{n}\right)\sigma^2 = \frac{n-1}{n}\,\sigma^2$$
Biased Variance Estimate
ECE 8443: Lecture 06, Slide 11
• Therefore, the ML estimate is biased:
$$E[\hat{\sigma}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$
However, the ML estimate converges (and is MSE).
• An unbiased estimator is:
$$C = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i-\hat{\boldsymbol{\mu}})(\mathbf{x}_i-\hat{\boldsymbol{\mu}})^t$$
• These are related by:
$$\hat{\Sigma} = \frac{n-1}{n}\,C$$
which is asymptotically unbiased. See Burl, AJWills and AWM for excellent examples and explanations of the details of this derivation.
Expectation Simplification
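• A small numerical sketch of the bias, assuming NumPy (the constants are illustrative): the n-divisor (ML) estimate averages to ((n-1)/n)σ², while the (n-1)-divisor estimate averages to σ².

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 0.0, 3.0, 5                  # a small n makes the bias easy to see
trials = rng.normal(mu, sigma, size=(100000, n))

biased = trials.var(axis=1, ddof=0)         # divides by n     (the ML estimate)
unbiased = trials.var(axis=1, ddof=1)       # divides by n - 1 (the unbiased estimate)

print(biased.mean(), (n - 1) / n * sigma**2)   # approximately ((n-1)/n) * sigma^2
print(unbiased.mean(), sigma**2)               # approximately sigma^2
```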
ECE 8443: Lecture 06, Slide 12
• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ωi), and the class-conditional densities, p(x|ωi).
• Bayes: treat the parameters as random variables having some known prior distribution. Observations of samples convert this to a posterior.
• Bayesian learning: sharpen the a posteriori density, causing it to peak near the true value.
• Supervised vs. unsupervised: do we know the class assignments of the training data?
• Bayesian estimation and ML estimation produce very similar results in many cases.
• Reduces statistical inference (prior knowledge or beliefs about the world) to probabilities.
Introduction to Bayesian Parameter Estimation
ECE 8443: Lecture 06, Slide 13
• Posterior probabilities, P(ωi|x), are central to Bayesian classification.
• Bayes formula allows us to compute P(ωi|x) from the priors, P(ωi), and the likelihood, p(x|ωi).
• But what if the priors and class-conditional densities are unknown?
• The answer is that we can compute the posterior, P(ωi|x), using all of the information at our disposal (e.g., training data).
• For a training set, D, Bayes formula becomes:
$$P(\omega_i|\mathbf{x},D) = \frac{p(\mathbf{x}|\omega_i,D)\,P(\omega_i|D)}{\sum_{j=1}^{c}p(\mathbf{x}|\omega_j,D)\,P(\omega_j|D)} \qquad\left(\frac{\text{likelihood}\times\text{prior}}{\text{evidence}}\right)$$
• We assume priors are known: P(ωi|D) = P(ωi).
• Also, assume functional independence: the samples in Di have no influence on p(x|ωj, D) if i ≠ j. This gives:
$$P(\omega_i|\mathbf{x},D) = \frac{p(\mathbf{x}|\omega_i,D_i)\,P(\omega_i)}{\sum_{j=1}^{c}p(\mathbf{x}|\omega_j,D_j)\,P(\omega_j)}$$
Class-Conditional Densities
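• A hypothetical sketch of how this posterior computation looks in code, assuming NumPy; the per-class densities and priors below are invented placeholders standing in for densities estimated from D₁ and D₂:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density, used here as a stand-in class-conditional."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def class_posteriors(x, class_densities, priors):
    """P(omega_i | x, D): per-class likelihoods times priors, normalized by the evidence."""
    likelihoods = np.array([p(x) for p in class_densities])
    joint = likelihoods * np.array(priors)
    return joint / joint.sum()

# Two hypothetical classes whose densities would have been estimated from D_1 and D_2.
densities = [lambda x: gaussian_pdf(x, 0.0, 1.0),
             lambda x: gaussian_pdf(x, 3.0, 2.0)]
print(class_posteriors(1.0, densities, priors=[0.6, 0.4]))
```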
ECE 8443: Lecture 06, Slide 14
• Assume the parametric form of the evidence, p(x), is known: p(x|θ).
• Any information we have about θ prior to collecting samples is contained in a known prior density p(θ).
• Observation of samples converts this to a posterior, p(θ|D), which we hope is peaked around the true value of θ.
• Our goal is to estimate a parameter vector:
$$p(\mathbf{x}|D) = \int p(\mathbf{x},\theta|D)\,d\theta$$
• We can write the joint distribution as a product:
$$p(\mathbf{x}|D) = \int p(\mathbf{x}|\theta)\,p(\theta|D)\,d\theta$$
because the samples are drawn independently.
• This equation links the class-conditional density p(x|D) to the posterior, p(θ|D). But numerical solutions are typically required!
The Parameter Distribution
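• A minimal numerical sketch of the integral p(x|D) = ∫ p(x|θ) p(θ|D) dθ, assuming NumPy and a hypothetical Gaussian posterior over θ (all values are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Grid over the unknown parameter theta (here: the unknown mean).
theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]

# Hypothetical posterior p(theta|D), e.g. N(1.0, 0.2) after observing some data.
posterior = gaussian_pdf(theta, 1.0, 0.2)

def p_x_given_D(x, noise_var=1.0):
    # p(x|D) = integral of p(x|theta) * p(theta|D) d(theta), done on the grid.
    return np.sum(gaussian_pdf(x, theta, noise_var) * posterior) * dtheta

# For Gaussians this should match N(x | 1.0, noise_var + 0.2).
print(p_x_given_D(0.5), gaussian_pdf(0.5, 1.0, 1.2))
```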
ECE 8443: Lecture 06, Slide 15
• Case: only the mean is unknown, p(x|μ) ∼ N(μ, σ²).
• Known prior density: p(μ) ∼ N(μ₀, σ₀²).
• Using Bayes formula:
$$p(\mu|D) = \frac{p(D|\mu)\,p(\mu)}{\int p(D|\mu)\,p(\mu)\,d\mu} = \alpha\,p(D|\mu)\,p(\mu) = \alpha\prod_{k=1}^{n}p(x_k|\mu)\,p(\mu)$$
• Rationale: once a value of μ is known, the density for x is completely known. α is a normalization factor that depends on the data, D.
Univariate Gaussian Case
ECE 8443: Lecture 06, Slide 16
• Applying our Gaussian assumptions:
$$p(\mu|D) = \alpha\prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^2\right]\cdot\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right]$$
$$= \alpha'\exp\!\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{x_k-\mu}{\sigma}\right)^2 + \left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right)\right]$$
$$= \alpha''\exp\!\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n}x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$
Univariate Gaussian Case
ECE 8443: Lecture 06, Slide 17
• Now we need to work this into a simpler form:
$$p(\mu|D) = \alpha''\exp\!\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n}x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$
$$= \alpha''\exp\!\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$
where
$$\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n}x_k$$
Univariate Gaussian Case (Cont.)
ECE 8443: Lecture 06, Slide 18
Univariate Gaussian Case (Cont.)
• p(μ|D) is an exponential of a quadratic function, which makes it a normal distribution. Because this is true for any n, it is referred to as a reproducing density.
• p(μ) is referred to as a conjugate prior.
• Write p(μ|D) ∼ N(μₙ, σₙ²):
$$p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right]$$
• Expand the quadratic term:
$$p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu^2}{\sigma_n^2} - 2\frac{\mu_n}{\sigma_n^2}\mu + \frac{\mu_n^2}{\sigma_n^2}\right)\right]$$
• Equate coefficients of our two functions:
$$\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left[\frac{1}{\sigma_n^2}\mu^2 - 2\frac{\mu_n}{\sigma_n^2}\mu\right]\right] = \alpha''\exp\!\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$
ECE 8443: Lecture 06, Slide 19
Univariate Gaussian Case (Cont.)
• Rearrange terms so that the dependencies on μ are clear:
$$\exp\!\left[-\frac{1}{2}\left[\frac{1}{\sigma_n^2}\mu^2 - 2\frac{\mu_n}{\sigma_n^2}\mu\right]\right] \propto \exp\!\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$
• Associate terms related to μ² and μ:
$$\mu^2:\quad \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad\qquad \mu:\quad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}$$
• There is actually a third equation involving the terms not related to μ (the remaining constant factors on the two sides must also be equal), but we can ignore this since it is not a function of μ and is a complicated equation to solve.
ECE 8443: Lecture 06, Slide 20
• Two equations and two unknowns. Solve for μₙ and σₙ². First, solve for σₙ²:
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} = \frac{n\sigma_0^2 + \sigma^2}{\sigma^2\sigma_0^2} \qquad\Rightarrow\qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$
• Next, solve for μₙ:
$$\mu_n = \sigma_n^2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right) = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)$$
• Summarizing:
$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0 \qquad\qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$
Univariate Gaussian Case (Cont.)
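• A minimal sketch of these update equations, assuming NumPy (the prior, noise variance, and data are illustrative):

```python
import numpy as np

def posterior_mean_variance(x, mu0, var0, var):
    """Posterior N(mu_n, var_n) for the mean of a Gaussian with known variance
    `var`, given the prior N(mu0, var0) and the observed samples x."""
    n = len(x)
    mu_hat = np.mean(x)                                       # sample mean
    mu_n = (n * var0 * mu_hat + var * mu0) / (n * var0 + var)
    var_n = (var0 * var) / (n * var0 + var)
    return mu_n, var_n

# Example: true mean 2.0, known variance 1.0, broad prior centered at 0.
rng = np.random.default_rng(4)
x = rng.normal(2.0, 1.0, size=50)
print(posterior_mean_variance(x, mu0=0.0, var0=10.0, var=1.0))
```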
ECE 8443: Lecture 06, Slide 21
• μₙ represents our best guess after n samples.
• σₙ² represents our uncertainty about this guess.
• σₙ² approaches σ²/n for large n; each additional observation decreases our uncertainty.
• The posterior, p(|D), becomes more sharply peaked as n grows large. This is known as Bayesian learning.
Bayesian Learning
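• A tiny illustration of this shrinkage, reusing the σₙ² formula from the previous slide (the values are made up):

```python
# Posterior variance shrinks toward sigma^2 / n as more samples arrive (illustrative values).
var0, var = 10.0, 1.0          # prior variance sigma_0^2 and known noise variance sigma^2
for n in (1, 10, 100, 1000):
    var_n = (var0 * var) / (n * var0 + var)
    print(n, var_n, var / n)   # var_n approaches sigma^2 / n for large n
```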
ECE 8443: Lecture 06, Slide 22
• Getting ahead a bit, let’s see how we can put these ideas to work on a simple example due to David MacKay, and explained by Jon Hamaker.
“The Euro Coin”
ECE 8443: Lecture 06, Slide 23
Summary
• Review of maximum likelihood parameter estimation in the Gaussian case, with an emphasis on convergence and bias of the estimates.
• Introduction of Bayesian parameter estimation.
• The role of the class-conditional distribution in a Bayesian estimate.
• Estimation of the posterior and probability density function assuming the only unknown parameter is the mean, and the conditional density of the “features” given the mean, p(x|μ), can be modeled as a Gaussian distribution.