© Vikram Krishnamurthy 2013
Part 3: ML Parameter Estimation
Aim: The key question answered here is: Given a partially
observed stochastic dynamical system, how does one
estimate the parameters of the system?
Joint recursive parameter and state estimation algorithms are also described.
The algorithms make extensive use of filtering.
OUTLINE
• ML Parameter Estimation
– ML criterion
– 2 Simple Examples
– EM Algorithm
– Case Studies
EM algorithm for Errors-in-Variables Estimation
Baum Welch Algorithm for HMMs
• Recursive Parameter Estimation
Example: Blind Deconvolution
[Block diagram: input x_t → FIR channel H(q^{-1}) → s_t; noise v_t is added to give the output y_t]
Assumptions: The FIR channel
H(q^{-1}) = h_0 + h_1 q^{-1} + · · · + h_L q^{-L}
is unknown. The digital input x_t is assumed Markov (possibly with unknown
transition probabilities and state levels).
v_t is white Gaussian noise (possibly with unknown variance).
The EM algorithm for ML estimation can be used to:
(i) compute the parameters (channel, transition probabilities, noise variance, state levels), and
(ii) simultaneously provide optimal state estimates.
The EM algorithm is off-line and operates on a fixed
batch of data. This motivates recursive (online) joint
parameter and state estimation algorithms: in particular,
if the channel characteristics change slowly with time,
a recursive estimator is necessary.
Recursive EM algorithms can also be obtained.
ML Estimation
Given a sequence of measurements Y_N ≜ (y_1, . . . , y_N), define the
likelihood function
L(θ, N) ≜ p(Y_N; θ), θ ∈ Θ
where Θ is the parameter space.
Likelihood function is a measure of the plausibility of the
data under parameter θ. Our aim is to pick θ which
makes data most plausible.
Aim: Compute the maximum likelihood (ML) parameter
estimate
θ_ML(N) ≜ argmax_{θ∈Θ} L(θ, N)
Often it is more convenient to maximize logL(θ,N).
Clearly argmaxθ L(θ,N) = argmaxθ logL(θ,N).
Why ML Estimation? The MLE often has 2 nice properties:
1. Strong Consistency: Let θ∗ be the true parameter. Then
θ_ML(N) → θ∗ w.p.1 as N → ∞
2. Asymptotic Normality: The MLE is asymptotically normally
distributed about the true parameter:
√N (θ_ML(N) − θ∗) → N(0, I_{θ∗}^{-1}) in distribution
where I_{θ∗} is the Fisher Information Matrix.
Perspective – Point Estimation
Let θ∗ denote the true model. MLE is an example of a
“point” estimate. Given data vector Y , we seek an
estimator g(Y ) of the true parameter θ∗ ∈ Θ.
Unbiased: An estimator g(Y ) is unbiased if
EY |θ∗{g(Y )} = θ∗, for any θ∗ ∈ Θ
Cramer-Rao Bound: Any unbiased estimator satisfies
cov(g) = E_{Y|θ∗}{(g(Y) − θ∗)(g(Y) − θ∗)′} ≥ I_{θ∗}^{-1}
Efficient Estimator: g(Y ) is an efficient estimator if it
meets the CR bound.
Remark: Often the MLE is an efficient estimator.
We will discuss numerical algorithms for ML estimation
for two partially observed stochastic dynamical systems:
1. Linear Gaussian State Space Models.
2. Hidden Markov Models.
The consistency and asymptotic normality of the MLE for
Linear Gaussian State Space Models is shown in several
textbooks – e.g. Caines, Hannan & Deistler.
The consistency of the MLE for HMMs was proved in
1992 by Leroux. The proof is considerably more difficult.
Asymptotic normality was only proved very recently.
2 Simple Examples
For partially observed models MLE needs to be
numerically computed (as shown later). For fully observed
models MLE can sometimes be analytically computed.
Here are 2 examples.
1. MLE for Gaussian Linear Model: Suppose
Y = Ψθ + ε, ε ∼ N(0_N, Σ_{N×N})
Then the likelihood function is
p(Y; θ) = (2π)^{-N/2} |Σ|^{-1/2} exp( −(1/2)(Y − Ψθ)′ Σ^{-1} (Y − Ψθ) )
It is more convenient to maximize the log likelihood:
log p(Y; θ) = −(N/2) log 2π − (1/2) log|Σ| − (1/2)(Y − Ψθ)′ Σ^{-1} (Y − Ψθ)
Setting (d/dθ) log p(Y; θ) = 0 yields
θ_ML = (Ψ′ Σ^{-1} Ψ)^{-1} Ψ′ Σ^{-1} Y
which coincides with the (generalized) least squares parameter estimate.
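As a minimal sketch of this formula for a scalar parameter with Σ = σ²I (in which case the ML estimate reduces to ordinary least squares; the function name is illustrative):

```python
def mle_linear_gaussian(psi, y):
    """ML estimate for Y = Psi*theta + eps, scalar theta, Sigma = sigma^2*I.

    theta_ML = (Psi' Sigma^-1 Psi)^-1 Psi' Sigma^-1 Y reduces to
    sum(psi_t * y_t) / sum(psi_t^2) in this special case.
    """
    num = sum(p * obs for p, obs in zip(psi, y))
    den = sum(p * p for p in psi)
    return num / den

psi = [1.0, 2.0, 3.0, 4.0]
y = [2.0 * p for p in psi]          # noise-free data generated with theta = 2
print(mle_linear_gaussian(psi, y))  # -> 2.0
```

With noise-free data the true parameter is recovered exactly; with Gaussian noise the estimate is the least squares fit.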
2. MLE for Markov Chain: Suppose
Y_N = (y_1, . . . , y_N) is a path of an S-state Markov chain. The parameter is the
transition probability matrix θ = (a_ij), i, j ∈ {1, . . . , S}.
Note the parameter constraints:
Σ_{j=1}^S a_ij = 1,  0 ≤ a_ij ≤ 1
The likelihood and log likelihood functions are
p(Y_N; θ) = p(y_N|y_{N−1}; θ) p(y_{N−1}|y_{N−2}; θ) · · · p(y_1|y_0; θ) p(y_0; θ)

log p(Y_N; θ) = Σ_{k=1}^N log p(y_k|y_{k−1}; θ) + log p(y_0; θ)
= Σ_{k=1}^N Σ_{i=1}^S Σ_{j=1}^S δ(y_{k−1} = i, y_k = j) log a_ij + Σ_{i=1}^S δ(y_0 = i) log π_0(i)
= Σ_{i=1}^S Σ_{j=1}^S J_ij(N) log a_ij + Σ_{i=1}^S δ(y_0 = i) log π_0(i)

(using f(y_k, y_{k−1}) = Σ_{i=1}^S Σ_{j=1}^S f(i, j) I(y_k = j, y_{k−1} = i));
J_ij(N) = # jumps from state i to state j from time 1 to N.
Setting (∂/∂a_ij) log p(Y_N; θ) = 0 subject to the constraint yields
a_ij = J_ij(N) / Σ_{j=1}^S J_ij(N) = J_ij(N) / D_i(N) = (# jumps from i to j) / (# visits to i)
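This count-ratio estimate is easy to compute directly. A minimal sketch (function name is illustrative):

```python
from collections import Counter

def transition_mle(path, S):
    """MLE of transition probabilities of a fully observed S-state chain:
    a_ij = J_ij / D_i = (# jumps i -> j) / (# visits to i)."""
    jumps = Counter(zip(path[:-1], path[1:]))   # J_ij
    visits = Counter(path[:-1])                 # D_i
    return [[jumps[(i, j)] / visits[i] if visits[i] else 0.0
             for j in range(S)] for i in range(S)]

path = [0, 0, 1, 0, 1, 1, 0]
A = transition_mle(path, 2)
# state 0 is left 3 times: 0->0 once, 0->1 twice
print(A[0])  # -> [0.333..., 0.666...]
```

By construction each estimated row sums to 1, so no explicit constraint handling is needed.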
Numerical Algorithms for
MLE
For HMMs and Gaussian state space models, the MLE can
be computed numerically. Two algorithms are widely used:
1. Newton-Raphson (NR) Algorithm for HMM
MLE: Given data Y_T = (y_1, . . . , y_T) and an initial parameter
estimate θ^(0) ∈ Θ.
For iterations I = 1, 2, . . ., given the model θ^(I) at iteration I:
• Compute L(θ), ∇_θ L(θ), ∇²_θ L(θ) at θ = θ^(I)
recursively using the optimal filter as follows:
(i) Compute the HMM filter α_k^θ, k = 1, . . . , T as
α_{k+1}(j) = P(x_{k+1} = q_j, Y_{k+1}) = b_j(y_{k+1}) Σ_{i=1}^S α_k(i) a_ij
Likelihood: L(θ) = P(Y_T; θ) = Σ_{i=1}^S α_T^θ(i)
(ii) Compute the derivative ∇_θ L(θ) = Σ_{i=1}^S R_T^θ(i), where the
filter sensitivity R_k^θ(i) = ∇_θ α_k^θ(i), k = 1, . . . , T, satisfies
R_{k+1}^θ(j) = (∇_θ b_j^θ(y_{k+1})) Σ_{i=1}^S a_ij^θ α_k^θ(i)
+ b_j^θ(y_{k+1}) Σ_{i=1}^S (∇_θ a_ij^θ) α_k^θ(i)
+ b_j^θ(y_{k+1}) Σ_{i=1}^S a_ij^θ R_k^θ(i)
• Update the parameter estimate via Newton-Raphson:
θ^(I+1) = θ^(I) − [∇²_θ L(θ)]^{-1} ∇_θ L(θ) |_{θ=θ^(I)}
Often the Hessian ∇²L is approximated (pseudo-Newton).
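The forward filter recursion in step (i) can be sketched as follows (a minimal sketch assuming Gaussian observation densities b_j(y) = N(q_j, σ²); all names are illustrative, and the sensitivity recursion of step (ii) is omitted):

```python
import math

def hmm_filter(y, A, q, sigma, pi0):
    """Unnormalized HMM forward filter: alpha_k(j) = P(x_k = q_j, Y_k).

    b_j(y) is taken to be the Gaussian density N(q_j, sigma^2);
    the likelihood of the data is L(theta) = sum_i alpha_T(i).
    """
    S = len(q)

    def b(j, obs):
        return math.exp(-(obs - q[j]) ** 2 / (2 * sigma ** 2)) \
               / (sigma * math.sqrt(2 * math.pi))

    alpha = [pi0[j] * b(j, y[0]) for j in range(S)]
    for obs in y[1:]:
        # alpha_{k+1}(j) = b_j(y_{k+1}) * sum_i alpha_k(i) a_ij
        alpha = [b(j, obs) * sum(alpha[i] * A[i][j] for i in range(S))
                 for j in range(S)]
    return alpha, sum(alpha)

A = [[0.9, 0.1], [0.2, 0.8]]
alpha, lik = hmm_filter([0.1, 1.2, 0.9], A, q=[0.0, 1.0], sigma=0.5,
                        pi0=[0.5, 0.5])
```

In practice the recursion is rescaled at each step to avoid numerical underflow for long data records.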
2. Expectation Maximization (EM) Algorithm:
• Developed in 1977 by Dempster, Laird and Rubin.
• Widely used over the last 15 years.
• Recent variants based on MCMC yield stochastic EM
algorithms that are globally convergent. Algorithms
such as data augmentation can be viewed as
stochastic versions of EM.
• EM can also be used for MAP state estimation in
Generalized HMMs (Jump Markov Linear Systems)
Aside: Optimal Fixed-Interval Smoother
Given the state space model:
x_{k+1} = f(x_k) + w_k
y_k = h(x_k) + v_k,   Y_k = (y_1, . . . , y_k)
Recall optimal smoother computes E{xk|YT } for T > k.
There are 3 types of smoothing problems:
1. Fixed point smoother: For fixed k0, compute
E{xk0 |Yk} for k = k0, k0 + 1, . . ..
2. Fixed interval smoother: Compute E{xk|YT } for
k = 1, . . . , T (we will use this in the EM algorithm below).
3. Fixed lag smoother: For fixed lag ∆ > 0, compute
E{xk|Yk+∆}.
Derivation of the Fixed Interval Smoother. Recall
α_k(x) = p(x_k = x, Y_k), computed recursively as
α_{k+1}(x) = p(y_{k+1}|x_{k+1} = x) ∫ α_k(z) p(x_{k+1} = x|x_k = z) dz
Let β_k(x) = p(Y_{k+1:T} | x_k = x), where Y_{k+1:T} = (y_{k+1}, . . . , y_T).
Compute β via the backward recursion for k = T−1, T−2, . . . , 0:
β_k(z) = ∫ p(y_{k+1}|x_{k+1} = x) p(x_{k+1} = x|x_k = z) β_{k+1}(x) dx,
initialized by β_T(z) = 1 ∀z.
Then clearly the fixed interval smoothed estimate is
p(x_k = x|Y_T) = γ_k(x) = α_k(x) β_k(x) / ∫ α_k(x) β_k(x) dx
E{x_k|Y_T} = ∫ x γ_k(x) dx
Similarly γ_k(x_k, x_{k+1}) = p(x_k, x_{k+1}|Y_T) is computed as:
γ_k(x_k, x_{k+1}) = α_k(x_k) p(x_{k+1}|x_k) p(y_{k+1}|x_{k+1}) β_{k+1}(x_{k+1}) / ∫∫ [numerator] dx_k dx_{k+1}
Example: HMM Smoothing: For an S-state HMM θ:
α_{k+1}(j) = P(x_{k+1} = q_j, Y_{k+1}) = b_j(y_{k+1}) Σ_{i=1}^S α_k(i) a_ij
β_k(i) = p(Y_{k+1:T} | x_k = q_i) = Σ_{j=1}^S a_ij b_j(y_{k+1}) β_{k+1}(j)
γ_k(i) = P(x_k = q_i | Y_T) = α_k(i) β_k(i) / Σ_{i=1}^S α_k(i) β_k(i)
γ_k(i, j) = P(x_k = q_i, x_{k+1} = q_j | Y_T)
= α_k(i) a_ij b_j(y_{k+1}) β_{k+1}(j) / Σ_{i=1}^S Σ_{j=1}^S α_k(i) a_ij b_j(y_{k+1}) β_{k+1}(j)
The expected duration in state i given data Y_T is
E{D_T(i)|Y_T} = Σ_{k=1}^T γ_k(i)
The expected number of jumps from state i to state j is
E{N_T(i, j)|Y_T} = Σ_{k=1}^T γ_k(i, j)
Also note that γ_k(i) = Σ_{j=1}^S γ_k(i, j). So
Σ_{j=1}^S E{N_T(i, j)|Y_T} = E{D_T(i)|Y_T}
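The forward-backward recursions above can be sketched as follows (a minimal sketch assuming Gaussian observation densities b_i(y) = N(q_i, σ²); each pass is rescaled to avoid underflow, and the scaling constants cancel in γ; all names are illustrative):

```python
import math

def gauss(y, mean, sigma):
    """Gaussian density N(mean, sigma^2) evaluated at y."""
    return math.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) \
           / (sigma * math.sqrt(2 * math.pi))

def forward_backward(y, A, q, sigma, pi0):
    """Smoothed probabilities gamma[k][i] = P(x_k = q_i | Y_T) for an HMM
    with transition matrix A and Gaussian observation densities N(q_i, sigma^2)."""
    S, T = len(q), len(y)
    b = [[gauss(y[k], q[i], sigma) for i in range(S)] for k in range(T)]
    # forward pass: alpha_{k+1}(j) = b_j(y_{k+1}) * sum_i alpha_k(i) a_ij
    a_cur = [pi0[i] * b[0][i] for i in range(S)]
    alpha = [[x / sum(a_cur) for x in a_cur]]
    for k in range(1, T):
        a_cur = [b[k][j] * sum(alpha[k - 1][i] * A[i][j] for i in range(S))
                 for j in range(S)]
        alpha.append([x / sum(a_cur) for x in a_cur])
    # backward pass: beta_k(i) = sum_j a_ij b_j(y_{k+1}) beta_{k+1}(j), beta_T = 1
    beta = [[1.0] * S for _ in range(T)]
    for k in range(T - 2, -1, -1):
        b_cur = [sum(A[i][j] * b[k + 1][j] * beta[k + 1][j] for j in range(S))
                 for i in range(S)]
        beta[k] = [x / sum(b_cur) for x in b_cur]
    # combine: gamma_k(i) proportional to alpha_k(i) * beta_k(i)
    gamma = []
    for k in range(T):
        g = [alpha[k][i] * beta[k][i] for i in range(S)]
        gamma.append([x / sum(g) for x in g])
    return gamma

gamma = forward_backward([0.05, 0.9, 1.1], A=[[0.9, 0.1], [0.2, 0.8]],
                         q=[0.0, 1.0], sigma=0.3, pi0=[0.5, 0.5])
```

Summing the returned γ_k(i) over k gives the expected durations E{D_T(i)|Y_T}; an analogous combination of α, a_ij, b_j and β gives γ_k(i, j) and the expected jump counts.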
Implementation: α_k and β_k are S-dimensional vectors.
Computational cost: O(S²T); memory cost: O(ST).
This is called the forward-backward algorithm in Rabiner's HMM
tutorial paper. For the Kalman (linear Gaussian) case, β_k and γ_k are
Gaussian, and β_k is computed by a backward Kalman filter.
EM Algorithm for HMM maximum likelihood
parameter estimation of transition probability:
Consider HMM with unknown trans probability θ∗ = A.
Choose initial θ(0).
For iterations I = 1, 2, . . .:
Step 1: Use the model θ = θ^(I) to compute α_k^θ(i), β_k^θ(i),
γ_k^θ(i), D̂_T^θ(i) = Σ_{k=1}^T γ_k^θ(i), N̂_T^θ(i, j) = Σ_{k=1}^T γ_k^θ(i, j).
Called the Expectation (E) step.
Step 2: In analogy to the MLE of a fully observed Markov chain, compute the
new model θ^(I+1) as
a_ij^(I+1) = N̂_T^θ(i, j) / D̂_T^θ(i) = E{N_T(i, j)|Y_T, θ} / E{D_T(i)|Y_T, θ}, where θ = θ^(I)
This can be interpreted as maximizing the complete data
likelihood function (see later). Maximization (M) step.
Go to Step 1.
The above update is guaranteed to generate valid transition
probability estimates since Σ_{j=1}^S N̂_T(i, j) = D̂_T(i). Unlike
Newton-Raphson, no explicit constraints are required.
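The M-step update itself is a one-line ratio of the E-step quantities. A minimal sketch (names are illustrative; N_hat and D_hat stand for the smoothed counts N̂_T(i, j) and D̂_T(i)):

```python
def em_transition_update(N_hat, D_hat):
    """M-step: a_ij <- N_hat[i][j] / D_hat[i], where N_hat and D_hat are the
    expected jump counts and state durations from the E-step (smoother).
    Each row is a valid probability vector whenever sum_j N_hat[i][j] = D_hat[i]."""
    return [[N_hat[i][j] / D_hat[i] for j in range(len(N_hat[i]))]
            for i in range(len(D_hat))]

A_new = em_transition_update(N_hat=[[3.0, 1.0], [2.0, 2.0]], D_hat=[4.0, 4.0])
print(A_new)  # -> [[0.75, 0.25], [0.5, 0.5]]
```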
EM Algorithm
Consider partially observed stoch dynamical system
xk+1 = f(xk; θ) + wk
yk = h(xk) + vk
Let XT = (x1, . . . , xT ), YT = (y1, . . . , yT ).
Aim: Given a sequence of observations YT compute MLE.
From an initial parameter estimate θ(0), EM iteratively
generates a sequence of estimates θ(I), I = 1, 2, . . . as
follows:
Each iteration consists of 2 steps:
• E Step: Evaluate the auxiliary (complete-data) likelihood
Q(θ^(I), θ) = E{ln p(X_T, Y_T; θ) | Y_T, θ^(I)}
• M Step: Maximize the auxiliary likelihood, i.e.,
compute
θ^(I+1) = argmax_θ Q(θ^(I), θ)
Remark: Notice that the EM algorithm involves
computing smoothed state densities – these involve
forward and backward state filters. Thus optimal filtering
is an important ingredient of the EM algorithm.
Advantages of EM Algorithm
• Monotone property: L(θ^(I+1)) ≥ L(θ^(I)) (equality
holds at a local maximum).
NR does not have this monotone property.
• In many cases, EM is much simpler to apply than
NR. (e.g. HMMs, Error-in-variables models)
• EM is numerically more robust than NR; inverse of
Hessian is not required in EM.
• Recent variants of the EM speed up convergence –
SAGE, AECM, [MV97]
Dis-advantages of EM Algorithm
• Linear convergence: NR has quadratic convergence
rate
• NR automatically yields an estimate of the variance of the
parameter estimate (via the Hessian); EM does not.
Example 1: EM algorithm for
Linear Gaussian State Space
Model Estimation
Consider a scalar linear Gaussian state space model. (Easily
generalized to multidimensional models.)
State: x_t = a x_{t−1} + w_t
Observations: y_t = x_t + v_t
w_t ∼ N(0, σ_w²), v_t ∼ N(0, σ_v²) are white Gaussian processes.
Aim: Estimate θ = (a, σ_w², σ_v²).
Applications: Speech coding, Econometrics [Gho89],
Multisensor speech enhancement [WOFB94], errors in
variables model [EK99].
EM Algorithm
E Step: The aim is to compute
Q(θ(I), θ) = E{ln p(YT , XT |θ)|YT , θ(I)}
Result: The auxiliary likelihood Q(θ^(I), θ) is:
Q(θ^(I), θ) = −(T/2) ln σ_v² − (1/(2σ_v²)) Σ_{t=1}^T E{(y_t − x_t)² | Y_T, θ^(I)}
− (T/2) ln σ_w² − (1/(2σ_w²)) Σ_{t=1}^T E{(x_t − a x_{t−1})² | Y_T, θ^(I)}
So we need to compute:
E{x_t|Y_T, θ}, E{x_t x_{t−1}|Y_T, θ}, E{x_t²|Y_T, θ}, E{x_{t−1}²|Y_T, θ}
These are obtained via a Kalman smoother.
M Step: Compute θ^(I+1) = argmax_θ Q(θ^(I), θ).
Setting ∂Q/∂θ = 0 yields:
a = Σ_{t=1}^T E{x_t x_{t−1} | Y_T, θ^(I)} / Σ_{t=1}^T E{x_{t−1}² | Y_T, θ^(I)}
σ_v² = (1/T) Σ_{t=1}^T ( y_t² + E{x_t² | Y_T, θ^(I)} − 2 y_t E{x_t | Y_T, θ^(I)} )
σ_w² = (1/T) Σ_{t=1}^T E{(x_t − a x_{t−1})² | Y_T, θ^(I)}
Set θ^(I+1) = (a, σ_v², σ_w²).
Remarks: (i) The update for a is similar to the Yule-Walker
equations (apart from the conditioning on Y_T).
(ii) The estimates σ_v² and σ_w² are non-negative by construction.
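The M-step can be sketched directly in terms of the smoothed moments delivered by the Kalman smoother (a minimal sketch; the E-step smoother itself is omitted, and all names are illustrative):

```python
def em_mstep_lgss(Ex, Exx, Exx_lag, Exx_prev, y):
    """M-step for x_t = a*x_{t-1} + w_t, y_t = x_t + v_t, given per-time
    smoothed moments from the Kalman smoother (E-step), for t = 1..T:
      Ex[t]       = E{x_t       | Y_T}
      Exx[t]      = E{x_t^2     | Y_T}
      Exx_lag[t]  = E{x_t x_{t-1} | Y_T}
      Exx_prev[t] = E{x_{t-1}^2 | Y_T}
    """
    T = len(y)
    a = sum(Exx_lag) / sum(Exx_prev)
    # sigma_v^2 = (1/T) sum_t ( y_t^2 + E{x_t^2} - 2 y_t E{x_t} )
    s2v = sum(y[t] ** 2 + Exx[t] - 2 * y[t] * Ex[t] for t in range(T)) / T
    # sigma_w^2 = (1/T) sum_t E{(x_t - a x_{t-1})^2}, expanded in the moments
    s2w = sum(Exx[t] - 2 * a * Exx_lag[t] + a * a * Exx_prev[t]
              for t in range(T)) / T
    return a, s2v, s2w
```

As a sanity check, for a noise-free deterministic trajectory x_t = 2 x_{t−1} observed exactly (y_t = x_t), the update recovers a = 2 with both variance estimates zero.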
Example 2: EM algorithm for
HMM Estimation
Consider an S-state Markov chain x_k ∈ q = {q_1, . . . , q_S} with
transition probability matrix A = (a_ij), i, j ∈ {1, . . . , S}.
Consider the Markov chain x_k observed in Gaussian noise:
y_k = x_k + v_k, v_k ∼ N(0, σ_v²) white
Aim: Estimate θ = (q, A, σ_v²).
Applications: Neurobiology, Channel Equalization,
Target Tracking, Speech Recognition
EM Algorithm: (called Baum Welch algorithm)
E Step: Compute Q(θ(I), θ) = E{ln p(YT , XT |θ)|YT , θ(I)}
Result: The auxiliary likelihood Q(θ^(I), θ) is:
Q(θ^(I), θ) = −(T/2) ln σ_v² − (1/(2σ_v²)) Σ_{t=1}^T Σ_{i=1}^S (y_t − q_i)² γ_t(i)
+ Σ_{t=1}^T Σ_{i=1}^S Σ_{j=1}^S γ_t(i, j) log a_ij
where γ_t(i) = P(x_t = q_i | Y_T; θ^(I)) and
γ_t(i, j) = P(x_t = q_i, x_{t+1} = q_j | Y_T; θ^(I)) require an HMM
state smoother (forward-backward HMM filter).
M Step: Setting ∂Q/∂θ = 0 yields θ^(I+1):
a_ij = Σ_{t=1}^T γ_t(i, j) / Σ_{t=1}^T γ_t(i)
= E{# jumps from i to j | Y_T, θ^(I)} / E{# visits to i | Y_T, θ^(I)}
q_i = Σ_{t=1}^T γ_t(i) y_t / Σ_{t=1}^T γ_t(i)
σ_v² = (1/T) Σ_{t=1}^T Σ_{i=1}^S γ_t(i)(y_t − q_i)²
Remarks: 1. A nice property of EM is that the estimates
satisfy 0 ≤ a_ij ≤ 1 and Σ_j a_ij = 1 by construction.
Similarly, σ_v² ≥ 0.
2. The above can be generalized to much more general HMMs
– e.g. state-dependent noise, Markov-modulated ARX
time series.
3. The above EM is a smoother-based EM: the statistics
are computed in terms of the smoothed density γ. In the
1990s, filter-based EM algorithms were developed – e.g. by Elliott,
James, Krishnamurthy and LeGland.
4. The EM algorithm can also be formulated for continuous-time HMMs.
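The Baum-Welch M-step updates for the observation parameters q_i and σ_v² can be sketched as follows (a minimal sketch taking the E-step's smoothed probabilities γ_t(i) as given; all names are illustrative):

```python
def baum_welch_mstep(y, gamma):
    """M-step for the state levels q_i and noise variance sigma_v^2:
      q_i      = sum_t gamma[t][i]*y[t] / sum_t gamma[t][i]
      sigma_v^2 = (1/T) sum_t sum_i gamma[t][i]*(y[t] - q_i)^2
    where gamma[t][i] = P(x_t = q_i | Y_T) from the HMM smoother (E-step)."""
    T, S = len(y), len(gamma[0])
    q = [sum(gamma[t][i] * y[t] for t in range(T)) /
         sum(gamma[t][i] for t in range(T)) for i in range(S)]
    sigma2_v = sum(gamma[t][i] * (y[t] - q[i]) ** 2
                   for t in range(T) for i in range(S)) / T
    return q, sigma2_v

# With hard (0/1) smoothed probabilities, q_i reduces to the mean of the
# observations assigned to state i, and sigma_v^2 to the within-state spread.
q, s2 = baum_welch_mstep([1.0, 1.0, 3.0, 3.0],
                         [[1, 0], [1, 0], [0, 1], [0, 1]])
print(q, s2)  # -> [1.0, 3.0] 0.0
```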
Derivation of Q(θ(I), θ) for
HMM
ln p(Y_T, X_T|θ) = ln Π_{t=1}^T p(y_t|x_t) p(x_t|x_{t−1})
= Σ_{t=1}^T ln p(y_t|x_t) + Σ_{t=1}^T ln p(x_t|x_{t−1})
= Σ_{t=1}^T Σ_{i=1}^S δ(x_t = q_i) ln p(y_t|x_t = q_i)
+ Σ_{t=1}^T Σ_i Σ_j δ(x_t = q_i, x_{t+1} = q_j) ln P(x_{t+1} = q_j | x_t = q_i)
= Σ_{i=1}^S Σ_{t=1}^T δ(x_t = q_i) [ ln(1/(√(2π) σ_v)) − (y_t − q_i)²/(2σ_v²) ]
+ Σ_i Σ_j Σ_{t=1}^T δ(x_t = q_i, x_{t+1} = q_j) ln a_ij

Q(θ^(I), θ) = E{ln p(Y_T, X_T|θ) | Y_T, θ^(I)}
= const − (T/2) ln σ_v² − Σ_i Σ_t γ_t^{θ^(I)}(i) (y_t − q_i)²/(2σ_v²)
+ Σ_i Σ_j Σ_t γ_t^{θ^(I)}(i, j) ln a_ij
Proof of EM algorithm
Theorem: Given an observation sequence Y_T, let
Q(θ^(I), θ) = E{ln p(X_T, Y_T|θ) | Y_T, θ^(I)}. Then
θ^(I+1) = argmax_θ Q(θ^(I), θ) ⟹ P(Y_T|θ^(I+1)) ≥ P(Y_T|θ^(I))
To prove the theorem, first consider following lemma.
Lemma: For any θ, the Q function increases more slowly than the log
likelihood as a function of θ. That is:
Q(θ^(I), θ) − Q(θ^(I), θ^(I)) ≤ ln P(Y_T|θ) − ln P(Y_T|θ^(I))
Then clearly choosing θ^(I+1) such that
Q(θ^(I), θ^(I+1)) ≥ Q(θ^(I), θ^(I))
implies that P(Y_T|θ^(I+1)) ≥ P(Y_T|θ^(I)). In particular, the
best choice θ^(I+1) = argmax_θ Q(θ^(I), θ) implies
P(Y_T|θ^(I+1)) ≥ P(Y_T|θ^(I)).
Remark 1: Monotonically increasing likelihoods alone do not
imply that EM converges. For convergence, one requires
continuity of Q, compactness of Θ, etc.; see
Wu (Annals of Statistics, 1983, pp. 95–103). Wu uses
Zangwill's global convergence theorem, a standard
tool in optimization theory, to prove global convergence of
the algorithm.
Remark 2:
Q(θ^(I), θ) − Q(θ^(I), θ^(I)) = E{ ln [P(Y_T, X_T|θ) / P(Y_T, X_T|θ^(I))] | Y_T, θ^(I) }
is the Kullback-Leibler information measure widely used
in information theory. Dempster, Laird and Rubin
invented the EM algorithm in 1977.
Proof of Lemma:
Q(θ^(I), θ) − Q(θ^(I), θ^(I)) = E{ ln [P(Y_T, X_T|θ) / P(Y_T, X_T|θ^(I))] | Y_T, θ^(I) }
≤ ln E{ P(Y_T, X_T|θ) / P(Y_T, X_T|θ^(I)) | Y_T, θ^(I) }   (by Jensen's inequality)
= ln ∫ [P(Y_T, X_T|θ) / P(Y_T, X_T|θ^(I))] P(X_T|Y_T, θ^(I)) dX_T
= ln ∫ [P(Y_T, X_T|θ) / (P(X_T|Y_T, θ^(I)) P(Y_T|θ^(I)))] P(X_T|Y_T, θ^(I)) dX_T
= ln ∫ [P(Y_T, X_T|θ) / P(Y_T|θ^(I))] dX_T = ln [P(Y_T|θ) / P(Y_T|θ^(I))]
Jensen’s inequality:
f(X) convex =⇒ E{f(X)} ≥ f(E{X})
Hence f(X) concave =⇒ E{f(X)} ≤ f(E{X})
Consistency of MLE
Suppose y_1, . . . , y_T is an iid sequence of observations,
θ∗ ∈ Θ is the true parameter, and θ_T is the MLE based on y_1, . . . , y_T.
Aim: Prove that θ_T → θ∗ w.p.1 as T → ∞ (strong
consistency of the MLE). We use Wald's approach.
Assume Θ is compact (e.g. a closed bounded subset of R^S). Then the MLE is
θ_T = argmax_{θ∈Θ} (1/T) log p(y_1, . . . , y_T|θ) = argmax_{θ∈Θ} (1/T) Σ_{k=1}^T log p(y_k|θ)
Assuming E_{θ∗}{|log p(y_k|θ)|} < ∞, then by the SLLN,
lim_{T→∞} (1/T) Σ_{k=1}^T log p(y_k|θ) = E_{θ∗}{log p(y_k|θ)} ≜ K(θ, θ∗)  w.p.1
(the Kullback-Leibler measure). So
(1/T) log p(y_1, . . . , y_T|θ) → K(θ, θ∗)
Lemma: From Jensen's inequality, argmax_θ K(θ, θ∗) = θ∗.
So we would intuitively expect
argmax_θ lim_{T→∞} (1/T) log p(y_1, . . . , y_T|θ) = argmax_θ K(θ, θ∗) w.p.1
– that is, θ_T → θ∗ w.p.1. More rigorously, we need the convergence
(1/T) log p(y_1, . . . , y_T|θ) → K(θ, θ∗) w.p.1 to hold uniformly in θ, i.e.
lim_{T→∞} sup_{θ∈Θ} |(1/T) log p(y_1, . . . , y_T|θ) − K(θ, θ∗)| = 0 w.p.1
Recursive Parameter
Estimation
Aim: Joint parameter and state estimation in real time.
Recall that for the HMM case Q(θ^(I), θ) = Σ_{t=1}^T φ_t:
Q(θ^(I), θ) = −(T/2) ln σ_v² − (1/(2σ_v²)) Σ_{t=1}^T Σ_{i=1}^S (y_t − q_i)² γ_t(i)
+ Σ_{t=1}^T Σ_{i=1}^S Σ_{j=1}^S γ_t(i, j) log a_ij
= Σ_{t=1}^T φ_t   (this is the definition of φ_t)
Recursive Estimation Algorithm:
θ_{t+1} = θ_t + γ_t (∂φ_t/∂θ)|_{θ=θ_t}
where γ_t is the step size.
There are several popular choices of step size:
1. Recursive EM Algorithm: γ_t = (∂²Q/∂θ²)^{-1}
2. Gradient Algorithm: γ_t = 1/t
3. Iterate Averaging (Polyak): γ_t = 1/√t followed by
averaging of the iterates θ_t
4. Tracking Algorithms: constant step size γ_t = γ
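The update θ_{t+1} = θ_t + γ_t ∂φ_t/∂θ can be sketched on a toy problem (a minimal sketch for estimating the mean of N(θ, 1) observations, where the per-sample score is ∂φ_t/∂θ = y_t − θ_t; names are illustrative):

```python
def recursive_mle_mean(ys):
    """Recursive gradient algorithm theta_{t+1} = theta_t + gamma_t * dphi_t/dtheta
    with step size gamma_t = 1/t, for estimating the mean of N(theta, 1)
    observations; here dphi_t/dtheta = y_t - theta_t."""
    theta = 0.0
    for t, y in enumerate(ys, start=1):
        theta += (1.0 / t) * (y - theta)  # gamma_t = 1/t
    return theta

print(recursive_mle_mean([1.0, 2.0, 3.0]))  # -> 2.0
```

With γ_t = 1/t this recursion reduces exactly to the running sample mean; a constant step size γ_t = γ would instead track a slowly time-varying parameter.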
Example: Markov Modulated
Time Series
Markov-Modulated AR process:
z_{k+1} = a(x_k) z_k + b(x_k) w_k
z_k: observations; x_k: S-state unobserved Markov chain.
Arises in econometrics, LPC of speech, fault detection.
1. An algorithm similar to the HMM filter yields
E{x_k|z_1, . . . , z_k}. EM and recursive EM can also be used
for parameter estimation.
2. Image Enhanced Tracking:
zk+1 = a(xk)zk + b(xk)wk
yk = xk + vk
xk is mode of target – e.g. orientation.
yk denotes noisy measurements of mode (e.g. imager).
Aim: Estimate the coordinates of the target z_k given y_1, . . . , y_k.
Again a filter similar to the HMM filter can be derived, and the
parameters can be estimated via the EM algorithm.
3. Markov-Modulated Poisson Process: Here N_t is a
Poisson process whose rate λ(x_k) is Markov-modulated. An
MMPP filter is similar to an HMM filter, and EM can be
used to estimate the parameters.