© Vikram Krishnamurthy 2013
Part 3: ML Parameter Estimation
Aim: The key question answered here is: Given a partially
observed stochastic dynamical system, how does one
estimate the parameters of the system?
Joint recursive parameter and state estimation algorithms are also described.
The algorithms make extensive use of filtering.
OUTLINE
• ML Parameter Estimation
– ML criterion
– 2 Simple Examples
– EM Algorithm
– Case Studies
EM algorithm for Errors-in-Variables Estimation
Baum Welch Algorithm for HMMs
• Recursive Parameter Estimation
Example: Blind Deconvolution
[Block diagram: input x_t → FIR channel H(q^{-1}) → s_t; noise v_t is added to give the output y_t]
Assumptions: The FIR channel
H(q^{-1}) = h_0 + h_1 q^{-1} + · · · + h_L q^{-L}
is unknown. The digital input x_t is assumed Markov (possibly with unknown
transition probabilities and state levels).
v_t is white Gaussian noise (possibly with unknown variance).
The EM algorithm for ML estimation can be used to:
(i) compute the parameters (channel, transition probabilities, noise variance, state levels), and
(ii) simultaneously provide optimal state estimates.
The EM algorithm is off-line and operates on a fixed
batch of data. This motivates recursive (online) joint
parameter and state estimation algorithms: in particular,
if the channel characteristics change slowly with time,
a recursive estimator is necessary.
Recursive EM algorithms can also be obtained.
ML Estimation
Given a sequence of measurements Y_N ≜ (y_1, . . . , y_N), define the
likelihood function
L(θ, N) ≜ p(Y_N; θ), θ ∈ Θ
where Θ is the parameter space.
Likelihood function is a measure of the plausibility of the
data under parameter θ. Our aim is to pick θ which
makes data most plausible.
Aim: Compute the maximum likelihood (ML) parameter
estimate
θ_ML(N) ≜ argmax_{θ∈Θ} L(θ, N)
Often it is more convenient to maximize logL(θ,N).
Clearly argmaxθ L(θ,N) = argmaxθ logL(θ,N).
Why ML Estimation? The MLE often has 2 nice properties:
1. Strong Consistency: Let θ∗ be the true parameter. Then
θ_ML(N) → θ∗ w.p.1 as N → ∞
2. Asymptotic Normality: The MLE is asymptotically normally
distributed about the true parameter:
√N (θ_ML(N) − θ∗) → N(0, I_{θ∗}^{-1}) in distribution
where I_{θ∗} is the Fisher Information Matrix.
Perspective – Point Estimation
Let θ∗ denote the true model. MLE is an example of a
“point” estimate. Given data vector Y , we seek an
estimator g(Y ) of the true parameter θ∗ ∈ Θ.
Unbiased: An estimator g(Y ) is unbiased if
EY |θ∗{g(Y )} = θ∗, for any θ∗ ∈ Θ
Cramer-Rao Bound: Any unbiased estimator satisfies
cov(g) = E_{Y|θ∗}{(g(Y) − θ∗)(g(Y) − θ∗)′} ≥ I_{θ∗}^{-1}
Efficient Estimator: g(Y ) is an efficient estimator if it
meets the CR bound.
Remark: Often the MLE is an efficient estimator.
We will discuss numerical algorithms for ML estimation
for two partially observed stochastic dynamical systems:
1. Linear Gaussian State Space Models.
2. Hidden Markov Models.
The consistency and asymptotic normality of the MLE for
Linear Gaussian State Space Models is shown in several
textbooks – e.g. Caines, Hannan & Deistler.
The consistency of the MLE for HMMs was proved in
1992 by Leroux. The proof is considerably more difficult.
Asymptotic normality was only proved very recently.
2 Simple Examples
For partially observed models MLE needs to be
numerically computed (as shown later). For fully observed
models MLE can sometimes be analytically computed.
Here are 2 examples.
1. MLE for Gaussian Linear Model: Suppose
Y = Ψθ + ε, ε ∼ N(0_N, Σ_{N×N})
Then the likelihood function is
p(Y; θ) = (2π)^{-N/2} |Σ|^{-1/2} exp( −(1/2)(Y − Ψθ)′ Σ^{-1} (Y − Ψθ) )
It is more convenient to maximize the log likelihood:
log p(Y; θ) = −(N/2) log 2π − (1/2) log|Σ| − (1/2)(Y − Ψθ)′ Σ^{-1} (Y − Ψθ)
Setting (d/dθ) log p(Y; θ) = 0 yields
θ_ML = (Ψ′ Σ^{-1} Ψ)^{-1} Ψ′ Σ^{-1} Y
which coincides with the (generalized) least squares parameter estimate.
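As a minimal sketch of this formula for a scalar parameter with Σ = σ²I (in which case the ML estimate reduces to ordinary least squares; the function name is illustrative):

```python
def mle_linear_gaussian(psi, y):
    """ML estimate for Y = Psi*theta + eps, scalar theta, Sigma = sigma^2*I.

    theta_ML = (Psi' Sigma^-1 Psi)^-1 Psi' Sigma^-1 Y reduces to
    sum(psi_t * y_t) / sum(psi_t^2) in this special case.
    """
    num = sum(p * obs for p, obs in zip(psi, y))
    den = sum(p * p for p in psi)
    return num / den

psi = [1.0, 2.0, 3.0, 4.0]
y = [2.0 * p for p in psi]          # noise-free data generated with theta = 2
print(mle_linear_gaussian(psi, y))  # -> 2.0
```

With noise-free data the true parameter is recovered exactly; with Gaussian noise the estimate is the least squares fit.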
2. MLE for Markov Chain: Suppose
Y_N = (y_1, . . . , y_N) is a path of an S-state Markov chain. The parameter is the
transition probability matrix θ = (a_ij), i, j ∈ {1, . . . , S}.
Note the parameter constraints:
Σ_{j=1}^S a_ij = 1,  0 ≤ a_ij ≤ 1
The likelihood and log likelihood functions are
p(Y_N; θ) = p(y_N|y_{N−1}; θ) p(y_{N−1}|y_{N−2}; θ) · · · p(y_1|y_0; θ) p(y_0; θ)

log p(Y_N; θ) = Σ_{k=1}^N log p(y_k|y_{k−1}; θ) + log p(y_0; θ)
= Σ_{k=1}^N Σ_{i=1}^S Σ_{j=1}^S δ(y_{k−1} = i, y_k = j) log a_ij + Σ_{i=1}^S δ(y_0 = i) log π_0(i)
= Σ_{i=1}^S Σ_{j=1}^S J_ij(N) log a_ij + Σ_{i=1}^S δ(y_0 = i) log π_0(i)

(using f(y_k, y_{k−1}) = Σ_{i=1}^S Σ_{j=1}^S f(i, j) I(y_k = j, y_{k−1} = i));
J_ij(N) = # jumps from state i to state j from time 1 to N.
Setting (∂/∂a_ij) log p(Y_N; θ) = 0 subject to the constraint yields
a_ij = J_ij(N) / Σ_{j=1}^S J_ij(N) = J_ij(N) / D_i(N) = (# jumps from i to j) / (# visits to i)
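This count-ratio estimate is easy to compute directly. A minimal sketch (function name is illustrative):

```python
from collections import Counter

def transition_mle(path, S):
    """MLE of transition probabilities of a fully observed S-state chain:
    a_ij = J_ij / D_i = (# jumps i -> j) / (# visits to i)."""
    jumps = Counter(zip(path[:-1], path[1:]))   # J_ij
    visits = Counter(path[:-1])                 # D_i
    return [[jumps[(i, j)] / visits[i] if visits[i] else 0.0
             for j in range(S)] for i in range(S)]

path = [0, 0, 1, 0, 1, 1, 0]
A = transition_mle(path, 2)
# state 0 is left 3 times: 0->0 once, 0->1 twice
print(A[0])  # -> [0.333..., 0.666...]
```

By construction each estimated row sums to 1, so no explicit constraint handling is needed.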
Numerical Algorithms for
MLE
For HMMs and Gaussian state space models, the MLE can
be computed numerically. Two algorithms are widely used:
1. Newton-Raphson (NR) Algorithm for HMM
MLE: Given data Y_T = (y_1, . . . , y_T) and an initial parameter
estimate θ^(0) ∈ Θ.
For iterations I = 1, 2, . . ., given the model θ^(I) at iteration I:
• Compute L(θ), ∇_θ L(θ), ∇²_θ L(θ) at θ = θ^(I)
recursively using the optimal filter as follows:
(i) Compute the HMM filter α_k^θ, k = 1, . . . , T as
α_{k+1}(j) = P(x_{k+1} = q_j, Y_{k+1}) = b_j(y_{k+1}) Σ_{i=1}^S α_k(i) a_ij
Likelihood: L(θ) = P(Y_T; θ) = Σ_{i=1}^S α_T^θ(i)
(ii) Compute the derivative ∇_θ L(θ) = Σ_{i=1}^S R_T^θ(i), where the
filter sensitivity R_k^θ(i) = ∇_θ α_k^θ(i), k = 1, . . . , T, satisfies
R_{k+1}^θ(j) = (∇_θ b_j^θ(y_{k+1})) Σ_{i=1}^S a_ij^θ α_k^θ(i)
+ b_j^θ(y_{k+1}) Σ_{i=1}^S (∇_θ a_ij^θ) α_k^θ(i)
+ b_j^θ(y_{k+1}) Σ_{i=1}^S a_ij^θ R_k^θ(i)
• Update the parameter estimate via Newton-Raphson:
θ^(I+1) = θ^(I) − [∇²_θ L(θ)]^{-1} ∇_θ L(θ) |_{θ=θ^(I)}
Often the Hessian ∇²L is approximated (pseudo-Newton).
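The forward filter recursion in step (i) can be sketched as follows (a minimal sketch assuming Gaussian observation densities b_j(y) = N(q_j, σ²); all names are illustrative, and the sensitivity recursion of step (ii) is omitted):

```python
import math

def hmm_filter(y, A, q, sigma, pi0):
    """Unnormalized HMM forward filter: alpha_k(j) = P(x_k = q_j, Y_k).

    b_j(y) is taken to be the Gaussian density N(q_j, sigma^2);
    the likelihood of the data is L(theta) = sum_i alpha_T(i).
    """
    S = len(q)

    def b(j, obs):
        return math.exp(-(obs - q[j]) ** 2 / (2 * sigma ** 2)) \
               / (sigma * math.sqrt(2 * math.pi))

    alpha = [pi0[j] * b(j, y[0]) for j in range(S)]
    for obs in y[1:]:
        # alpha_{k+1}(j) = b_j(y_{k+1}) * sum_i alpha_k(i) a_ij
        alpha = [b(j, obs) * sum(alpha[i] * A[i][j] for i in range(S))
                 for j in range(S)]
    return alpha, sum(alpha)

A = [[0.9, 0.1], [0.2, 0.8]]
alpha, lik = hmm_filter([0.1, 1.2, 0.9], A, q=[0.0, 1.0], sigma=0.5,
                        pi0=[0.5, 0.5])
```

In practice the recursion is rescaled at each step to avoid numerical underflow for long data records.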
2. Expectation Maximization (EM) Algorithm:
• Developed in 1977 by Dempster, Laird and Rubin.
• Widely used over the last 15 years.
• Recent variants based on MCMC yield stochastic EM
algorithms that are globally convergent. Algorithms
such as data augmentation can be viewed as
stochastic versions of EM.
• EM can also be used for MAP state estimation in
Generalized HMMs (Jump Markov Linear Systems)
Aside: Optimal Fixed-Interval Smoother
Given the state space model:
x_{k+1} = f(x_k) + w_k
y_k = h(x_k) + v_k,   Y_k = (y_1, . . . , y_k)
Recall optimal smoother computes E{xk|YT } for T > k.
There are 3 types of smoothing problems:
1. Fixed point smoother: For fixed k0, compute
E{xk0 |Yk} for k = k0, k0 + 1, . . ..
2. Fixed interval smoother: Compute E{xk|YT } for
k = 1, . . . , T (we will use this in the EM algorithm below).
3. Fixed lag smoother: For fixed lag ∆ > 0, compute
E{xk|Yk+∆}.
Derivation of the Fixed Interval Smoother. Recall
α_k(x) = p(x_k = x, Y_k), computed recursively as
α_{k+1}(x) = p(y_{k+1}|x_{k+1} = x) ∫ α_k(z) p(x_{k+1} = x|x_k = z) dz
Let β_k(x) = p(Y_{k+1:T} | x_k = x), where Y_{k+1:T} = (y_{k+1}, . . . , y_T).
Compute β via the backward recursion for k = T−1, T−2, . . . , 0:
β_k(z) = ∫ p(y_{k+1}|x_{k+1} = x) p(x_{k+1} = x|x_k = z) β_{k+1}(x) dx,
initialized by β_T(z) = 1 ∀z.
Then clearly the fixed interval smoothed estimate is
p(x_k = x|Y_T) = γ_k(x) = α_k(x) β_k(x) / ∫ α_k(x) β_k(x) dx
E{x_k|Y_T} = ∫ x γ_k(x) dx
Similarly γ_k(x_k, x_{k+1}) = p(x_k, x_{k+1}|Y_T) is computed as:
γ_k(x_k, x_{k+1}) = α_k(x_k) p(x_{k+1}|x_k) p(y_{k+1}|x_{k+1}) β_{k+1}(x_{k+1}) / ∫∫ [numerator] dx_k dx_{k+1}
Example: HMM Smoothing: For an S-state HMM θ:
α_{k+1}(j) = P(x_{k+1} = q_j, Y_{k+1}) = b_j(y_{k+1}) Σ_{i=1}^S α_k(i) a_ij
β_k(i) = p(Y_{k+1:T} | x_k = q_i) = Σ_{j=1}^S a_ij b_j(y_{k+1}) β_{k+1}(j)
γ_k(i) = P(x_k = q_i | Y_T) = α_k(i) β_k(i) / Σ_{i=1}^S α_k(i) β_k(i)
γ_k(i, j) = P(x_k = q_i, x_{k+1} = q_j | Y_T)
= α_k(i) a_ij b_j(y_{k+1}) β_{k+1}(j) / Σ_{i=1}^S Σ_{j=1}^S α_k(i) a_ij b_j(y_{k+1}) β_{k+1}(j)
The expected duration in state i given data Y_T is
E{D_T(i)|Y_T} = Σ_{k=1}^T γ_k(i)
The expected number of jumps from state i to state j is
E{N_T(i, j)|Y_T} = Σ_{k=1}^T γ_k(i, j)
Also note that γ_k(i) = Σ_{j=1}^S γ_k(i, j). So
Σ_{j=1}^S E{N_T(i, j)|Y_T} = E{D_T(i)|Y_T}
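The forward-backward recursions above can be sketched as follows (a minimal sketch assuming Gaussian observation densities b_i(y) = N(q_i, σ²); each pass is rescaled to avoid underflow, and the scaling constants cancel in γ; all names are illustrative):

```python
import math

def gauss(y, mean, sigma):
    """Gaussian density N(mean, sigma^2) evaluated at y."""
    return math.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) \
           / (sigma * math.sqrt(2 * math.pi))

def forward_backward(y, A, q, sigma, pi0):
    """Smoothed probabilities gamma[k][i] = P(x_k = q_i | Y_T) for an HMM
    with transition matrix A and Gaussian observation densities N(q_i, sigma^2)."""
    S, T = len(q), len(y)
    b = [[gauss(y[k], q[i], sigma) for i in range(S)] for k in range(T)]
    # forward pass: alpha_{k+1}(j) = b_j(y_{k+1}) * sum_i alpha_k(i) a_ij
    a_cur = [pi0[i] * b[0][i] for i in range(S)]
    alpha = [[x / sum(a_cur) for x in a_cur]]
    for k in range(1, T):
        a_cur = [b[k][j] * sum(alpha[k - 1][i] * A[i][j] for i in range(S))
                 for j in range(S)]
        alpha.append([x / sum(a_cur) for x in a_cur])
    # backward pass: beta_k(i) = sum_j a_ij b_j(y_{k+1}) beta_{k+1}(j), beta_T = 1
    beta = [[1.0] * S for _ in range(T)]
    for k in range(T - 2, -1, -1):
        b_cur = [sum(A[i][j] * b[k + 1][j] * beta[k + 1][j] for j in range(S))
                 for i in range(S)]
        beta[k] = [x / sum(b_cur) for x in b_cur]
    # combine: gamma_k(i) proportional to alpha_k(i) * beta_k(i)
    gamma = []
    for k in range(T):
        g = [alpha[k][i] * beta[k][i] for i in range(S)]
        gamma.append([x / sum(g) for x in g])
    return gamma

gamma = forward_backward([0.05, 0.9, 1.1], A=[[0.9, 0.1], [0.2, 0.8]],
                         q=[0.0, 1.0], sigma=0.3, pi0=[0.5, 0.5])
```

Summing the returned γ_k(i) over k gives the expected durations E{D_T(i)|Y_T}; an analogous combination of α, a_ij, b_j and β gives γ_k(i, j) and the expected jump counts.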
Implementation: α_k and β_k are S-dimensional vectors.
Computational cost: O(S²T); memory cost: O(ST).
This is called the forward-backward algorithm in Rabiner's HMM
tutorial paper. For the Kalman (linear Gaussian) case, β_k and γ_k are
Gaussian, and β_k is computed by a backward Kalman filter.
EM Algorithm for HMM maximum likelihood
parameter estimation of transition probability:
Consider HMM with unknown trans probability θ∗ = A.
Choose initial θ(0).
For iterations I = 1, 2, . . .:
Step 1: Use the model θ = θ^(I) to compute α_k^θ(i), β_k^θ(i),
γ_k^θ(i), D̂_T^θ(i) = Σ_{k=1}^T γ_k^θ(i), N̂_T^θ(i, j) = Σ_{k=1}^T γ_k^θ(i, j).
Called the Expectation (E) step.
Step 2: In analogy to the MLE of a fully observed Markov chain, compute the
new model θ^(I+1) as
a_ij^(I+1) = N̂_T^θ(i, j) / D̂_T^θ(i) = E{N_T(i, j)|Y_T, θ} / E{D_T(i)|Y_T, θ}, where θ = θ^(I)
This can be interpreted as maximizing the complete data
likelihood function (see later). Maximization (M) step.
Go to Step 1.
The above update is guaranteed to generate valid transition
probability estimates since Σ_{j=1}^S N̂_T(i, j) = D̂_T(i). Unlike
Newton-Raphson, no explicit constraints are required.
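The M-step update itself is a one-line ratio of the E-step quantities. A minimal sketch (names are illustrative; N_hat and D_hat stand for the smoothed counts N̂_T(i, j) and D̂_T(i)):

```python
def em_transition_update(N_hat, D_hat):
    """M-step: a_ij <- N_hat[i][j] / D_hat[i], where N_hat and D_hat are the
    expected jump counts and state durations from the E-step (smoother).
    Each row is a valid probability vector whenever sum_j N_hat[i][j] = D_hat[i]."""
    return [[N_hat[i][j] / D_hat[i] for j in range(len(N_hat[i]))]
            for i in range(len(D_hat))]

A_new = em_transition_update(N_hat=[[3.0, 1.0], [2.0, 2.0]], D_hat=[4.0, 4.0])
print(A_new)  # -> [[0.75, 0.25], [0.5, 0.5]]
```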
EM Algorithm
Consider partially observed stoch dynamical system
xk+1 = f(xk; θ) + wk
yk = h(xk) + vk
Let XT = (x1, . . . , xT ), YT = (y1, . . . , yT ).
Aim: Given a sequence of observations YT compute MLE.
From an initial parameter estimate θ(0), EM iteratively
generates a sequence of estimates θ(I), I = 1, 2, . . . as
follows:
Each iteration consists of 2 steps:
• E Step: Evaluate the auxiliary (complete-data) likelihood
Q(θ^(I), θ) = E{ln p(X_T, Y_T; θ) | Y_T, θ^(I)}
• M Step: Maximize the auxiliary likelihood, i.e.,
compute
θ^(I+1) = argmax_θ Q(θ^(I), θ)
Remark: Notice that the EM algorithm involves
computing smoothed state densities – these involve
forward and backward state filters. Thus optimal filtering
is an important ingredient of the EM algorithm.
Advantages of EM Algorithm
• Monotone property: L(θ^(I+1)) ≥ L(θ^(I)) (equality
holds at a local maximum).
NR does not have this monotone property.
• In many cases, EM is much simpler to apply than
NR. (e.g. HMMs, Error-in-variables models)
• EM is numerically more robust than NR; inverse of
Hessian is not required in EM.
• Recent variants of the EM speed up convergence –
SAGE, AECM, [MV97]
Dis-advantages of EM Algorithm
• Linear convergence: NR has quadratic convergence
rate
• NR automatically yields an estimate of the variance of the
parameter estimate (via the Hessian); EM does not.
Example 1: EM algorithm for
Linear Gaussian State Space
Model Estimation
Consider a scalar linear Gaussian state space model. (Easily
generalized to multidimensional models.)
State: x_t = a x_{t−1} + w_t
Observations: y_t = x_t + v_t
w_t ∼ N(0, σ_w²), v_t ∼ N(0, σ_v²) are white Gaussian processes.
Aim: Estimate θ = (a, σ_w², σ_v²).
Applications: Speech coding, Econometrics [Gho89],
Multisensor speech enhancement [WOFB94], errors in
variables model [EK99].
EM Algorithm
E Step: The aim is to compute
Q(θ(I), θ) = E{ln p(YT , XT |θ)|YT , θ(I)}
Result: The auxiliary likelihood Q(θ^(I), θ) is:
Q(θ^(I), θ) = −(T/2) ln σ_v² − (1/(2σ_v²)) Σ_{t=1}^T E{(y_t − x_t)² | Y_T, θ^(I)}
− (T/2) ln σ_w² − (1/(2σ_w²)) Σ_{t=1}^T E{(x_t − a x_{t−1})² | Y_T, θ^(I)}
So we need to compute:
E{x_t|Y_T, θ}, E{x_t x_{t−1}|Y_T, θ}, E{x_t²|Y_T, θ}, E{x_{t−1}²|Y_T, θ}
These are obtained via a Kalman smoother.
M Step: Compute θ^(I+1) = argmax_θ Q(θ^(I), θ).
Setting ∂Q/∂θ = 0 yields:
a = Σ_{t=1}^T E{x_t x_{t−1} | Y_T, θ^(I)} / Σ_{t=1}^T E{x_{t−1}² | Y_T, θ^(I)}
σ_v² = (1/T) Σ_{t=1}^T ( y_t² + E{x_t² | Y_T, θ^(I)} − 2 y_t E{x_t | Y_T, θ^(I)} )
σ_w² = (1/T) Σ_{t=1}^T E{(x_t − a x_{t−1})² | Y_T, θ^(I)}
Set θ^(I+1) = (a, σ_v², σ_w²).
Remarks: (i) The update for a is similar to the Yule-Walker
equations (apart from the conditioning on Y_T).
(ii) The estimates σ_v² and σ_w² are non-negative by construction.
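The M-step can be sketched directly in terms of the smoothed moments delivered by the Kalman smoother (a minimal sketch; the E-step smoother itself is omitted, and all names are illustrative):

```python
def em_mstep_lgss(Ex, Exx, Exx_lag, Exx_prev, y):
    """M-step for x_t = a*x_{t-1} + w_t, y_t = x_t + v_t, given per-time
    smoothed moments from the Kalman smoother (E-step), for t = 1..T:
      Ex[t]       = E{x_t       | Y_T}
      Exx[t]      = E{x_t^2     | Y_T}
      Exx_lag[t]  = E{x_t x_{t-1} | Y_T}
      Exx_prev[t] = E{x_{t-1}^2 | Y_T}
    """
    T = len(y)
    a = sum(Exx_lag) / sum(Exx_prev)
    # sigma_v^2 = (1/T) sum_t ( y_t^2 + E{x_t^2} - 2 y_t E{x_t} )
    s2v = sum(y[t] ** 2 + Exx[t] - 2 * y[t] * Ex[t] for t in range(T)) / T
    # sigma_w^2 = (1/T) sum_t E{(x_t - a x_{t-1})^2}, expanded in the moments
    s2w = sum(Exx[t] - 2 * a * Exx_lag[t] + a * a * Exx_prev[t]
              for t in range(T)) / T
    return a, s2v, s2w
```

As a sanity check, for a noise-free deterministic trajectory x_t = 2 x_{t−1} observed exactly (y_t = x_t), the update recovers a = 2 with both variance estimates zero.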
Example 2: EM algorithm for
HMM Estimation
Consider an S-state Markov chain x_k ∈ q = {q_1, . . . , q_S} with
transition probability matrix A = (a_ij), i, j ∈ {1, . . . , S}.
Consider the Markov chain x_k observed in Gaussian noise:
y_k = x_k + v_k, v_k ∼ N(0, σ_v²) white
Aim: Estimate θ = (q, A, σ_v²).
Applications: Neurobiology, Channel Equalization,
Target Tracking, Speech Recognition
EM Algorithm: (called Baum Welch algorithm)
E Step: Compute Q(θ(I), θ) = E{ln p(YT , XT |θ)|YT , θ(I)}
Result: The auxiliary likelihood Q(θ^(I), θ) is:
Q(θ^(I), θ) = −(T/2) ln σ_v² − (1/(2σ_v²)) Σ_{t=1}^T Σ_{i=1}^S (y_t − q_i)² γ_t(i)
+ Σ_{t=1}^T Σ_{i=1}^S Σ_{j=1}^S γ_t(i, j) log a_ij
where γ_t(i) = P(x_t = q_i | Y_T; θ^(I)) and
γ_t(i, j) = P(x_t = q_i, x_{t+1} = q_j | Y_T; θ^(I)) require an HMM
state smoother (forward-backward HMM filter).
M Step: Setting ∂Q/∂θ = 0 yields θ^(I+1):
a_ij = Σ_{t=1}^T γ_t(i, j) / Σ_{t=1}^T γ_t(i)
= E{# jumps from i to j | Y_T, θ^(I)} / E{# visits to i | Y_T, θ^(I)}
q_i = Σ_{t=1}^T γ_t(i) y_t / Σ_{t=1}^T γ_t(i)
σ_v² = (1/T) Σ_{t=1}^T Σ_{i=1}^S γ_t(i)(y_t − q_i)²
Remarks: 1. A nice property of EM is that the estimates
satisfy 0 ≤ a_ij ≤ 1 and Σ_j a_ij = 1 by construction.
Similarly, σ_v² ≥ 0.
2. The above can be generalized to much more general HMMs
– e.g. state-dependent noise, Markov-modulated ARX
time series.
3. The above EM is a smoother-based EM: the statistics
are computed in terms of the smoothed density γ. In the
1990s, filter-based EM algorithms were developed – e.g. by Elliott,
James, Krishnamurthy and LeGland.
4. The EM algorithm can also be formulated for continuous-time HMMs.
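The Baum-Welch M-step updates for the observation parameters q_i and σ_v² can be sketched as follows (a minimal sketch taking the E-step's smoothed probabilities γ_t(i) as given; all names are illustrative):

```python
def baum_welch_mstep(y, gamma):
    """M-step for the state levels q_i and noise variance sigma_v^2:
      q_i      = sum_t gamma[t][i]*y[t] / sum_t gamma[t][i]
      sigma_v^2 = (1/T) sum_t sum_i gamma[t][i]*(y[t] - q_i)^2
    where gamma[t][i] = P(x_t = q_i | Y_T) from the HMM smoother (E-step)."""
    T, S = len(y), len(gamma[0])
    q = [sum(gamma[t][i] * y[t] for t in range(T)) /
         sum(gamma[t][i] for t in range(T)) for i in range(S)]
    sigma2_v = sum(gamma[t][i] * (y[t] - q[i]) ** 2
                   for t in range(T) for i in range(S)) / T
    return q, sigma2_v

# With hard (0/1) smoothed probabilities, q_i reduces to the mean of the
# observations assigned to state i, and sigma_v^2 to the within-state spread.
q, s2 = baum_welch_mstep([1.0, 1.0, 3.0, 3.0],
                         [[1, 0], [1, 0], [0, 1], [0, 1]])
print(q, s2)  # -> [1.0, 3.0] 0.0
```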
Derivation of Q(θ(I), θ) for
HMM
ln p(Y_T, X_T|θ) = ln Π_{t=1}^T p(y_t|x_t) p(x_t|x_{t−1})
= Σ_{t=1}^T ln p(y_t|x_t) + Σ_{t=1}^T ln p(x_t|x_{t−1})
= Σ_{t=1}^T Σ_{i=1}^S δ(x_t = q_i) ln p(y_t|x_t = q_i)
+ Σ_{t=1}^T Σ_i Σ_j δ(x_t = q_i, x_{t+1} = q_j) ln P(x_{t+1} = q_j | x_t = q_i)
= Σ_{i=1}^S Σ_{t=1}^T δ(x_t = q_i) [ ln(1/(√(2π) σ_v)) − (y_t − q_i)²/(2σ_v²) ]
+ Σ_i Σ_j Σ_{t=1}^T δ(x_t = q_i, x_{t+1} = q_j) ln a_ij

Q(θ^(I), θ) = E{ln p(Y_T, X_T|θ) | Y_T, θ^(I)}
= const − (T/2) ln σ_v² − Σ_i Σ_t γ_t^{θ^(I)}(i) (y_t − q_i)²/(2σ_v²)
+ Σ_i Σ_j Σ_t γ_t^{θ^(I)}(i, j) ln a_ij
Proof of EM algorithm
Theorem: Given an observation sequence Y_T, let
Q(θ^(I), θ) = E{ln p(X_T, Y_T|θ) | Y_T, θ^(I)}. Then
θ^(I+1) = argmax_θ Q(θ^(I), θ) ⟹ P(Y_T|θ^(I+1)) ≥ P(Y_T|θ^(I))
To prove the theorem, first consider following lemma.
Lemma: For any θ, the Q function increases more slowly than the log
likelihood as a function of θ. That is:
Q(θ^(I), θ) − Q(θ^(I), θ^(I)) ≤ ln P(Y_T|θ) − ln P(Y_T|θ^(I))
Then clearly choosing θ^(I+1) such that
Q(θ^(I), θ^(I+1)) ≥ Q(θ^(I), θ^(I))
implies that P(Y_T|θ^(I+1)) ≥ P(Y_T|θ^(I)). In particular, the
best choice θ^(I+1) = argmax_θ Q(θ^(I), θ) implies
P(Y_T|θ^(I+1)) ≥ P(Y_T|θ^(I)).
Remark 1: Monotonically increasing likelihoods alone do not
imply that EM converges. For convergence, one requires
continuity of Q, compactness of Θ, etc.; see
Wu (Annals of Statistics, 1983, pp. 95–103). Wu uses
Zangwill's global convergence theorem, a standard
tool in optimization theory, to prove global convergence of
the algorithm.
Remark 2:
Q(θ^(I), θ) − Q(θ^(I), θ^(I)) = E{ ln [P(Y_T, X_T|θ) / P(Y_T, X_T|θ^(I))] | Y_T, θ^(I) }
is the Kullback-Leibler information measure widely used
in information theory. Dempster, Laird and Rubin
invented the EM algorithm in 1977.
Proof of Lemma:
Q(θ^(I), θ) − Q(θ^(I), θ^(I)) = E{ ln [P(Y_T, X_T|θ) / P(Y_T, X_T|θ^(I))] | Y_T, θ^(I) }
≤ ln E{ P(Y_T, X_T|θ) / P(Y_T, X_T|θ^(I)) | Y_T, θ^(I) }   (by Jensen's inequality)
= ln ∫ [P(Y_T, X_T|θ) / P(Y_T, X_T|θ^(I))] P(X_T|Y_T, θ^(I)) dX_T
= ln ∫ [P(Y_T, X_T|θ) / (P(X_T|Y_T, θ^(I)) P(Y_T|θ^(I)))] P(X_T|Y_T, θ^(I)) dX_T
= ln ∫ [P(Y_T, X_T|θ) / P(Y_T|θ^(I))] dX_T = ln [P(Y_T|θ) / P(Y_T|θ^(I))]
Jensen’s inequality:
f(X) convex =⇒ E{f(X)} ≥ f(E{X})
Hence f(X) concave =⇒ E{f(X)} ≤ f(E{X})
Consistency of MLE
Suppose y_1, . . . , y_T is an iid sequence of observations,
θ∗ ∈ Θ is the true parameter, and θ_T is the MLE based on y_1, . . . , y_T.
Aim: Prove that θ_T → θ∗ w.p.1 as T → ∞ (strong
consistency of the MLE). We use Wald's approach.
Assume Θ is compact (e.g. a closed bounded subset of R^S). Then the MLE is
θ_T = argmax_{θ∈Θ} (1/T) log p(y_1, . . . , y_T|θ) = argmax_{θ∈Θ} (1/T) Σ_{k=1}^T log p(y_k|θ)
Assuming E_{θ∗}{|log p(y_k|θ)|} < ∞, then by the SLLN,
lim_{T→∞} (1/T) Σ_{k=1}^T log p(y_k|θ) = E_{θ∗}{log p(y_k|θ)} ≜ K(θ, θ∗)  w.p.1
(the Kullback-Leibler measure). So
(1/T) log p(y_1, . . . , y_T|θ) → K(θ, θ∗)
Lemma: From Jensen's inequality, argmax_θ K(θ, θ∗) = θ∗.
So we would intuitively expect
argmax_θ lim_{T→∞} (1/T) log p(y_1, . . . , y_T|θ) = argmax_θ K(θ, θ∗) w.p.1
– that is, θ_T → θ∗ w.p.1. More rigorously, we need the convergence
(1/T) log p(y_1, . . . , y_T|θ) → K(θ, θ∗) w.p.1 to hold uniformly in θ, i.e.
lim_{T→∞} sup_{θ∈Θ} |(1/T) log p(y_1, . . . , y_T|θ) − K(θ, θ∗)| = 0 w.p.1
Recursive Parameter
Estimation
Aim: Joint parameter and state estimation in real time.
Recall that for the HMM case Q(θ^(I), θ) = Σ_{t=1}^T φ_t:
Q(θ^(I), θ) = −(T/2) ln σ_v² − (1/(2σ_v²)) Σ_{t=1}^T Σ_{i=1}^S (y_t − q_i)² γ_t(i)
+ Σ_{t=1}^T Σ_{i=1}^S Σ_{j=1}^S γ_t(i, j) log a_ij
= Σ_{t=1}^T φ_t   (this is the definition of φ_t)
Recursive Estimation Algorithm:
θ_{t+1} = θ_t + γ_t (∂φ_t/∂θ)|_{θ=θ_t}
where γ_t is the step size.
There are several popular choices of step size:
1. Recursive EM Algorithm: γ_t = (∂²Q/∂θ²)^{-1}
2. Gradient Algorithm: γ_t = 1/t
3. Iterate Averaging (Polyak): γ_t = 1/√t followed by
averaging of the iterates θ_t
4. Tracking Algorithms: constant step size γ_t = γ
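The update θ_{t+1} = θ_t + γ_t ∂φ_t/∂θ can be sketched on a toy problem (a minimal sketch for estimating the mean of N(θ, 1) observations, where the per-sample score is ∂φ_t/∂θ = y_t − θ_t; names are illustrative):

```python
def recursive_mle_mean(ys):
    """Recursive gradient algorithm theta_{t+1} = theta_t + gamma_t * dphi_t/dtheta
    with step size gamma_t = 1/t, for estimating the mean of N(theta, 1)
    observations; here dphi_t/dtheta = y_t - theta_t."""
    theta = 0.0
    for t, y in enumerate(ys, start=1):
        theta += (1.0 / t) * (y - theta)  # gamma_t = 1/t
    return theta

print(recursive_mle_mean([1.0, 2.0, 3.0]))  # -> 2.0
```

With γ_t = 1/t this recursion reduces exactly to the running sample mean; a constant step size γ_t = γ would instead track a slowly time-varying parameter.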
Example: Markov Modulated
Time Series
Markov-Modulated AR process:
z_{k+1} = a(x_k) z_k + b(x_k) w_k
z_k: observations; x_k: S-state unobserved Markov chain.
Arises in econometrics, LPC of speech, fault detection.
1. An algorithm similar to the HMM filter yields
E{x_k|z_1, . . . , z_k}. EM and recursive EM can also be used
for parameter estimation.
2. Image Enhanced Tracking:
zk+1 = a(xk)zk + b(xk)wk
yk = xk + vk
xk is mode of target – e.g. orientation.
yk denotes noisy measurements of mode (e.g. imager).
Aim: Estimate the coordinates of the target z_k given y_1, . . . , y_k.
Again a filter similar to the HMM filter can be derived, and the
parameters can be estimated via the EM algorithm.
3. Markov-Modulated Poisson Process: Here N_t is a
Poisson process whose rate λ(x_k) is Markov-modulated. An
MMPP filter is similar to an HMM filter, and EM can be
used to estimate the parameters.