
Recursive Least Squares Estimation∗

(Com 477/577 Notes)

Yan-Bin Jia

Dec 9, 2014

1 Estimation of a Constant

We start with estimation of a constant based on several noisy measurements. Suppose we have a resistor but do not know its resistance. So we measure it several times using a cheap (and noisy) multimeter. How do we come up with a good estimate of the resistance based on these noisy measurements?

More formally, suppose $x = (x_1, x_2, \ldots, x_n)^T$ is a constant but unknown vector, and $y = (y_1, y_2, \ldots, y_l)^T$ is an $l$-element noisy measurement vector. Our task is to find the "best" estimate $\tilde{x}$ of $x$. Here we look at perhaps the simplest case, where each $y_i$ is a linear combination of the $x_j$, $1 \le j \le n$, with the addition of some measurement noise $\nu_i$. Thus, we are working with the following linear system,

$$ y = Hx + \nu, $$

where $\nu = (\nu_1, \nu_2, \ldots, \nu_l)^T$ and $H$ is an $l \times n$ matrix; or with all terms listed,

$$ \begin{pmatrix} y_1 \\ \vdots \\ y_l \end{pmatrix} = \begin{pmatrix} H_{11} & \cdots & H_{1n} \\ \vdots & \ddots & \vdots \\ H_{l1} & \cdots & H_{ln} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} \nu_1 \\ \vdots \\ \nu_l \end{pmatrix}. $$

Given an estimate $\tilde{x}$, we consider the difference between the noisy measurements and the projected values $H\tilde{x}$:

$$ \epsilon = y - H\tilde{x}. $$

Under the least squares principle, we will try to find the value of $\tilde{x}$ that minimizes the cost function

$$ \begin{aligned} J(\tilde{x}) &= \epsilon^T \epsilon \\ &= (y - H\tilde{x})^T (y - H\tilde{x}) \\ &= y^T y - \tilde{x}^T H^T y - y^T H \tilde{x} + \tilde{x}^T H^T H \tilde{x}. \end{aligned} $$

The necessary condition for the minimum is the vanishing of the partial derivative of $J$ with respect to $\tilde{x}$, that is,

$$ \frac{\partial J}{\partial \tilde{x}} = -2 y^T H + 2 \tilde{x}^T H^T H = 0. $$

∗The material is adapted from Sections 3.1–3.3 in Dan Simon’s book Optimal State Estimation [1].


We solve the equation, obtaining

$$ \tilde{x} = (H^T H)^{-1} H^T y. \qquad (1) $$

The inverse $(H^T H)^{-1}$ exists if $l \ge n$ and $H$ has full rank $n$; in other words, when the number of measurements is no fewer than the number of unknowns, and $n$ of the measurements are linearly independent.

Example 1. Suppose we are trying to estimate the resistance $x$ of an unmarked resistor based on $l$ noisy measurements using a multimeter. In this case,

$$ y = Hx + \nu, \qquad (2) $$

where

$$ H = (1, \cdots, 1)^T. \qquad (3) $$

Substitution of the above into equation (1) gives us the optimal estimate of $x$ as

$$ \tilde{x} = (H^T H)^{-1} H^T y = \frac{1}{l} H^T y = \frac{y_1 + \cdots + y_l}{l}. $$
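As a quick numerical illustration (my addition, not part of the notes; the resistance value, noise level, and number of measurements below are made up), the following Python sketch evaluates equation (1) on simulated readings and confirms that it reduces to the plain average:

```python
import numpy as np

rng = np.random.default_rng(0)
x_true = 100.0            # hypothetical resistance in ohms (assumed value)
l = 20                    # number of noisy measurements
H = np.ones((l, 1))       # every measurement observes x directly, as in (3)
y = H @ np.array([x_true]) + rng.normal(0.0, 2.0, size=l)   # noisy readings

# Equation (1): x~ = (H^T H)^{-1} H^T y, which here reduces to the sample mean
x_est = np.linalg.solve(H.T @ H, H.T @ y)[0]
print(x_est, y.mean())    # the two values coincide
```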

2 Weighted Least Squares Estimation

So far we have placed equal confidence in all the measurements. Now we look at varying confidence in the measurements. For instance, some of our measurements of an unmarked resistor were taken with an expensive multimeter with low noise, while others were taken with a cheap multimeter by a tired student late at night. Even though the second set of measurements is less reliable, we can still get some information about the resistance from it. We should never throw away measurements, no matter how unreliable they may seem, as this section will show.

We assume that each measurement $y_i$, $1 \le i \le l$, may be taken under a different condition, so that the variance of the measurement noise $\nu_i$ may be distinct too:

$$ E(\nu_i^2) = \sigma_i^2, \qquad 1 \le i \le l. $$

Assume that the noise for each measurement has zero mean and is independent. The covariance matrix for all measurement noise is

$$ R = E(\nu \nu^T) = \begin{pmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_l^2 \end{pmatrix}. $$

Write the difference $y - H\tilde{x}$ as $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_l)^T$. We will minimize the sum of squared differences weighted by the measurement variances:

$$ J(\tilde{x}) = \epsilon^T R^{-1} \epsilon = \frac{\epsilon_1^2}{\sigma_1^2} + \frac{\epsilon_2^2}{\sigma_2^2} + \cdots + \frac{\epsilon_l^2}{\sigma_l^2}. $$


If a measurement $y_i$ is noisy, we care less about the discrepancy between it and the $i$th element of $H\tilde{x}$ because we do not have much confidence in this measurement. The cost function $J$ can be expanded as follows:

$$ \begin{aligned} J(\tilde{x}) &= (y - H\tilde{x})^T R^{-1} (y - H\tilde{x}) \\ &= y^T R^{-1} y - \tilde{x}^T H^T R^{-1} y - y^T R^{-1} H \tilde{x} + \tilde{x}^T H^T R^{-1} H \tilde{x}. \end{aligned} $$

At a minimum, the partial derivative of $J$ must vanish, yielding

$$ \frac{\partial J}{\partial \tilde{x}} = -2 y^T R^{-1} H + 2 \tilde{x}^T H^T R^{-1} H = 0. $$

Immediately, we solve the above equation for the best estimate of x:

$$ \tilde{x} = (H^T R^{-1} H)^{-1} H^T R^{-1} y. \qquad (4) $$

Note that the measurement noise matrix R must be non-singular for a solution to exist. In otherwords, each measurement yi must be corrupted by some noise for the estimation method to work.

Example 2. We get back to the problem in Example 1 of resistance estimation, for which the equations are given in (2) and (3). Suppose each of the $l$ noisy measurements has variance

$$ E(\nu_i^2) = \sigma_i^2. $$

The measurement noise covariance is given as

$$ R = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_l^2). $$

Substituting $H$, $R$, and $y$ into (4), we obtain the estimate

$$ \tilde{x} = \left( (1, \ldots, 1) \begin{pmatrix} 1/\sigma_1^2 & & \\ & \ddots & \\ & & 1/\sigma_l^2 \end{pmatrix} \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \right)^{-1} (1, \ldots, 1) \begin{pmatrix} 1/\sigma_1^2 & & \\ & \ddots & \\ & & 1/\sigma_l^2 \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_l \end{pmatrix} = \left( \sum_{i=1}^{l} \frac{1}{\sigma_i^2} \right)^{-1} \left( \sum_{i=1}^{l} \frac{y_i}{\sigma_i^2} \right). $$
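The closed form above is easy to verify numerically. The sketch below (my addition, with made-up noise standard deviations) evaluates equation (4) directly and compares it against the variance-weighted average:

```python
import numpy as np

rng = np.random.default_rng(1)
x_true = 100.0                                     # assumed true resistance
sigmas = np.array([1.0, 1.0, 5.0, 5.0, 10.0])      # assumed per-measurement noise std devs
y = x_true + rng.normal(0.0, sigmas)               # one reading per instrument

H = np.ones((len(y), 1))
R_inv = np.diag(1.0 / sigmas**2)

# Equation (4): x~ = (H^T R^{-1} H)^{-1} H^T R^{-1} y
x_wls = np.linalg.solve(H.T @ R_inv @ H, H.T @ R_inv @ y)[0]

# Closed form from Example 2: variance-weighted average of the readings
x_avg = np.sum(y / sigmas**2) / np.sum(1.0 / sigmas**2)
print(x_wls, x_avg)       # identical up to round-off
```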

3 Recursive Least Squares Estimation

Equation (4) is adequate when we have made all the measurements. More often, we obtain measurements sequentially and want to update our estimate with each new measurement. In this case, the matrix $H$ needs to be augmented, and we would have to recompute the estimate $\tilde{x}$ according to (4) for every new measurement. This update can become very expensive, and the overall computation can become prohibitive as the number of measurements grows large.

This section shows how to recursively compute the weighted least squares estimate. More specifically, suppose we have an estimate $\tilde{x}_{k-1}$ after $k - 1$ measurements, and obtain a new measurement $y_k$. To be general, every measurement is now an $m$-vector with values yielded by, say, several measuring instruments. How can we update the estimate to $\tilde{x}_k$ without solving equation (4)?


A linear recursive estimator can be written in the following form:

$$ \begin{aligned} y_k &= H_k x + \nu_k, \\ \tilde{x}_k &= \tilde{x}_{k-1} + K_k (y_k - H_k \tilde{x}_{k-1}). \qquad (5) \end{aligned} $$

Here $H_k$ is an $m \times n$ matrix, and $K_k$ is $n \times m$ and referred to as the estimator gain matrix. We refer to $y_k - H_k \tilde{x}_{k-1}$ as the correction term. Namely, the new estimate $\tilde{x}_k$ is obtained from the previous estimate $\tilde{x}_{k-1}$ with a correction via the gain matrix. The measurement noise has zero mean, i.e., $E(\nu_k) = 0$.

The current estimation error is

$$ \begin{aligned} \epsilon_k &= x - \tilde{x}_k \\ &= x - \tilde{x}_{k-1} - K_k (y_k - H_k \tilde{x}_{k-1}) \\ &= \epsilon_{k-1} - K_k (H_k x + \nu_k - H_k \tilde{x}_{k-1}) \\ &= \epsilon_{k-1} - K_k H_k (x - \tilde{x}_{k-1}) - K_k \nu_k \\ &= (I - K_k H_k)\,\epsilon_{k-1} - K_k \nu_k, \qquad (6) \end{aligned} $$

where $I$ is the $n \times n$ identity matrix. The mean of this error is then

$$ E(\epsilon_k) = (I - K_k H_k)\,E(\epsilon_{k-1}) - K_k E(\nu_k). $$

If $E(\nu_k) = 0$ and $E(\epsilon_{k-1}) = 0$, then $E(\epsilon_k) = 0$. So if the measurement noise $\nu_k$ has zero mean for all $k$, and the initial estimate of $x$ is set equal to its expected value, then $E(\tilde{x}_k) = E(x)$ for all $k$. With this property, the estimator (5) is called unbiased. The property holds regardless of the value of the gain matrix $K_k$. It says that on average the estimate $\tilde{x}_k$ will be equal to the true value $x$.
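To make the unbiasedness property concrete, here is a small Monte Carlo sketch (my own illustration; the prior mean, measurement matrix, gain, and noise levels are arbitrary made-up values) showing that the mean estimation error after one update is zero even for a gain that is far from optimal:

```python
import numpy as np

rng = np.random.default_rng(2)
x_mean = np.array([10.0, 5.0])       # assumed prior mean E(x)
H1 = np.array([[1.0, 2.0]])          # arbitrary 1 x 2 measurement matrix
K1 = np.array([[0.3], [0.1]])        # arbitrary 2 x 1 gain, deliberately not optimal

trials = 200_000
errors = np.empty((trials, 2))
for t in range(trials):
    x = x_mean + rng.normal(0.0, 1.0, size=2)      # unknown drawn around its mean
    x_est = x_mean.copy()                          # initial estimate set to E(x)
    y1 = H1 @ x + rng.normal(0.0, 0.5, size=1)     # zero-mean measurement noise
    x_est = x_est + K1 @ (y1 - H1 @ x_est)         # update (5)
    errors[t] = x - x_est

print(errors.mean(axis=0))           # approximately [0, 0]: the estimator is unbiased
```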

The key is to determine the optimal value of the gain matrix $K_k$. The optimality criterion we use is to minimize the aggregated variance of the estimation errors at time $k$:

$$ \begin{aligned} J_k &= E(\|x - \tilde{x}_k\|^2) \\ &= E(\epsilon_k^T \epsilon_k) \\ &= E\big(\mathrm{Tr}(\epsilon_k \epsilon_k^T)\big) \\ &= \mathrm{Tr}(P_k), \qquad (7) \end{aligned} $$

where $\mathrm{Tr}$ is the trace operator¹, and the $n \times n$ matrix $P_k = E(\epsilon_k \epsilon_k^T)$ is the estimation-error covariance. Next, we obtain $P_k$ with a substitution of (6):

$$ \begin{aligned} P_k &= E\Big( \big( (I - K_k H_k)\epsilon_{k-1} - K_k \nu_k \big) \big( (I - K_k H_k)\epsilon_{k-1} - K_k \nu_k \big)^T \Big) \\ &= (I - K_k H_k) E(\epsilon_{k-1}\epsilon_{k-1}^T)(I - K_k H_k)^T - K_k E(\nu_k \epsilon_{k-1}^T)(I - K_k H_k)^T \\ &\quad - (I - K_k H_k) E(\epsilon_{k-1}\nu_k^T) K_k^T + K_k E(\nu_k \nu_k^T) K_k^T. \end{aligned} $$

The estimation error $\epsilon_{k-1}$ at time $k - 1$ is independent of the measurement noise $\nu_k$ at time $k$, which implies that

$$ \begin{aligned} E(\nu_k \epsilon_{k-1}^T) &= E(\nu_k) E(\epsilon_{k-1}^T) = 0, \\ E(\epsilon_{k-1} \nu_k^T) &= E(\epsilon_{k-1}) E(\nu_k^T) = 0. \end{aligned} $$

1The trace of a matrix is the sum of its diagonal elements.


Given the definition of the $m \times m$ matrix $R_k = E(\nu_k \nu_k^T)$ as the covariance of $\nu_k$, the expression for $P_k$ becomes

$$ P_k = (I - K_k H_k) P_{k-1} (I - K_k H_k)^T + K_k R_k K_k^T. \qquad (8) $$

Equation (8) is the recurrence for the covariance of the least squares estimation error. It is consistent with the intuition that as the measurement noise ($R_k$) increases, the uncertainty ($P_k$) increases. Note that $P_k$, as a covariance matrix, is positive definite.
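A Monte Carlo sketch (my addition, with arbitrary made-up matrices for $P_{k-1}$, $R_k$, $H_k$, and a non-optimal gain) that checks the covariance recurrence (8) against sampled errors generated by (6):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 2, 1
P_prev = np.array([[2.0, 0.3], [0.3, 1.0]])    # assumed covariance of eps_{k-1}
R = np.array([[0.5]])                          # assumed covariance of nu_k
H = np.array([[1.0, -1.0]])                    # arbitrary 1 x 2 measurement matrix
K = np.array([[0.4], [0.2]])                   # arbitrary gain (need not be optimal)

trials = 500_000
eps_prev = np.linalg.cholesky(P_prev) @ rng.normal(size=(n, trials))   # cov P_{k-1}
nu = np.sqrt(R[0, 0]) * rng.normal(size=(m, trials))                   # cov R
eps_k = (np.eye(n) - K @ H) @ eps_prev - K @ nu                        # recursion (6)

P_empirical = eps_k @ eps_k.T / trials
P_formula = (np.eye(n) - K @ H) @ P_prev @ (np.eye(n) - K @ H).T + K @ R @ K.T
print(np.round(P_empirical, 2))
print(np.round(P_formula, 2))      # the two agree to Monte Carlo accuracy
```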

What remains is to find the value of the gain matrix $K_k$ that minimizes the cost function given by (7). The mean of the estimation error is already zero regardless of the value of $K_k$; the minimizing value of $K_k$ is thus the one that keeps the estimation error consistently close to zero. We need to differentiate $J_k$ with respect to $K_k$.

The derivative of a function $f$ with respect to a matrix $A = (a_{ij})$ is the matrix $\partial f / \partial A = (\partial f / \partial a_{ij})$.

Theorem 1 Let $X$ be an $r \times s$ matrix, and let $C$ be a matrix that does not depend on $X$, of dimension $r \times s$ in (9) and $s \times s$ in (10). Then the following hold:

$$ \frac{\partial\, \mathrm{Tr}(CX^T)}{\partial X} = C, \qquad (9) $$

$$ \frac{\partial\, \mathrm{Tr}(XCX^T)}{\partial X} = XC + XC^T. \qquad (10) $$

A proof of the theorem is given in Appendix A. In the case that $C$ is symmetric, $\frac{\partial}{\partial X}\mathrm{Tr}(XCX^T) = 2XC$. With these facts in mind, we first substitute (8) into (7) and differentiate the resulting expression with respect to $K_k$:

$$ \begin{aligned} \frac{\partial J_k}{\partial K_k} &= \frac{\partial}{\partial K_k} \mathrm{Tr}\big( P_{k-1} - K_k H_k P_{k-1} - P_{k-1} H_k^T K_k^T + K_k (H_k P_{k-1} H_k^T) K_k^T \big) + \frac{\partial}{\partial K_k} \mathrm{Tr}(K_k R_k K_k^T) \\ &= -2 \frac{\partial}{\partial K_k} \mathrm{Tr}(P_{k-1} H_k^T K_k^T) + 2 K_k (H_k P_{k-1} H_k^T) + 2 K_k R_k \qquad \text{(by (10))} \\ &= -2 P_{k-1} H_k^T + 2 K_k H_k P_{k-1} H_k^T + 2 K_k R_k \qquad \text{(by (9))} \\ &= -2 P_{k-1} H_k^T + 2 K_k (H_k P_{k-1} H_k^T + R_k). \end{aligned} $$

In the second equality above, we also used the fact that $P_{k-1}$ is independent of $K_k$ and that $K_k H_k P_{k-1}$ and $P_{k-1} H_k^T K_k^T$ are transposes of each other (since $P_{k-1}$ is symmetric), so their traces are equal. Setting the partial derivative to zero, we solve for $K_k$:

$$ K_k = P_{k-1} H_k^T (H_k P_{k-1} H_k^T + R_k)^{-1}. \qquad (11) $$

Write $S_k = H_k P_{k-1} H_k^T + R_k$, so

$$ K_k = P_{k-1} H_k^T S_k^{-1}. \qquad (12) $$

Substitute the above expression for $K_k$ into equation (8) for $P_k$. Expansion and a few steps of manipulation lead to the following:

$$ \begin{aligned} P_k &= (I - P_{k-1} H_k^T S_k^{-1} H_k) P_{k-1} (I - P_{k-1} H_k^T S_k^{-1} H_k)^T + P_{k-1} H_k^T S_k^{-1} R_k S_k^{-1} H_k P_{k-1} \\ &= P_{k-1} - P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} - P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} \\ &\quad + P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} + P_{k-1} H_k^T S_k^{-1} R_k S_k^{-1} H_k P_{k-1} \\ &= P_{k-1} - P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} - P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} + P_{k-1} H_k^T S_k^{-1} S_k S_k^{-1} H_k P_{k-1} \\ &\qquad \text{(after merging the last two terms of the previous step into one containing } S_k = H_k P_{k-1} H_k^T + R_k\text{)} \\ &= P_{k-1} - 2 P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} + P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} \\ &= P_{k-1} - P_{k-1} H_k^T S_k^{-1} H_k P_{k-1} \qquad (13) \\ &= P_{k-1} - K_k H_k P_{k-1} \qquad \text{(by (12))} \\ &= (I - K_k H_k) P_{k-1}. \qquad (14) \end{aligned} $$

Note that in the above $P_k$ is symmetric, as a covariance matrix, and so is $S_k$. We take the inverses of both sides of equation (13) and plug in the expression for $S_k$. Expansion and merging of terms yield

$$ P_k^{-1} = P_{k-1}^{-1} + H_k^T R_k^{-1} H_k, \qquad (15) $$

from which we obtain an alternative expression for the covariance matrix:

$$ P_k = \big( P_{k-1}^{-1} + H_k^T R_k^{-1} H_k \big)^{-1}. \qquad (16) $$

This expression is more complicated than (14) since it requires three matrix inversions. Nevertheless, it has computational advantages in certain situations in practice [1, pp. 156–158].

We can also derive an alternative form for the gain $K_k$ as follows. Start by multiplying the right-hand side of (11) on the left by the identity $P_k P_k^{-1}$. Then substitute (15) for the factor $P_k^{-1}$ in the resulting expression. Multiply the factor $P_{k-1} H_k^T$ into the parenthesized factor on its left, and extract $H_k^T R_k^{-1}$ out of the parentheses. The last two parenthesized factors cancel each other, yielding

$$ K_k = P_k H_k^T R_k^{-1}. \qquad (17) $$
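As a numerical cross-check (my addition, using randomly generated positive definite $P_{k-1}$ and $R_k$ and a random $H_k$), the gain expressions (11) and (17) coincide, as do the covariance expressions (14) and (16):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 3, 2
A = rng.normal(size=(n, n))
P_prev = A @ A.T + n * np.eye(n)        # random symmetric positive definite P_{k-1}
B = rng.normal(size=(m, m))
R = B @ B.T + m * np.eye(m)             # random positive definite noise covariance
H = rng.normal(size=(m, n))             # random m x n measurement matrix

S = H @ P_prev @ H.T + R
K_11 = P_prev @ H.T @ np.linalg.inv(S)                                    # gain via (11)
P_14 = (np.eye(n) - K_11 @ H) @ P_prev                                    # covariance via (14)
P_16 = np.linalg.inv(np.linalg.inv(P_prev) + H.T @ np.linalg.inv(R) @ H)  # covariance via (16)
K_17 = P_16 @ H.T @ np.linalg.inv(R)                                      # gain via (17)

print(np.allclose(P_14, P_16), np.allclose(K_11, K_17))                   # True True
```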

4 The Estimation Algorithm

The algorithm for recursive least squares estimation is summarized as follows.

1. Initialize the estimator:

$$ \tilde{x}_0 = E(x), \qquad P_0 = E\big( (x - \tilde{x}_0)(x - \tilde{x}_0)^T \big). $$

In the case of no prior knowledge about $x$, simply let $P_0 = \infty I$. In the case of perfect prior knowledge, let $P_0 = 0$.

2. Iterate the following two steps.

(a) Obtain a new measurement $y_k$, assuming that it is given by the equation

$$ y_k = H_k x + \nu_k, $$

where the noise $\nu_k$ has zero mean and covariance $R_k$. The measurement noise at each time step is independent of that at every other step, so that

$$ E(\nu_i \nu_j^T) = \begin{cases} 0, & \text{if } i \ne j, \\ R_j, & \text{if } i = j. \end{cases} $$

Essentially, we assume white measurement noise.


(b) Update the estimate $\tilde{x}$ and the covariance of the estimation error sequentially according to (11), (5), and (14), which are re-listed below:

$$ \begin{aligned} K_k &= P_{k-1} H_k^T (H_k P_{k-1} H_k^T + R_k)^{-1}, \qquad (18) \\ \tilde{x}_k &= \tilde{x}_{k-1} + K_k (y_k - H_k \tilde{x}_{k-1}), \qquad (19) \\ P_k &= (I - K_k H_k) P_{k-1}, \qquad (20) \end{aligned} $$

or according to (16), (17), and (19):

$$ \begin{aligned} P_k &= \big( P_{k-1}^{-1} + H_k^T R_k^{-1} H_k \big)^{-1}, \\ K_k &= P_k H_k^T R_k^{-1}, \\ \tilde{x}_k &= \tilde{x}_{k-1} + K_k (y_k - H_k \tilde{x}_{k-1}). \end{aligned} $$

Note that (19) and (20) do not depend on each other, so they can be computed in either order within one round of updates.
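The following Python sketch (my rendering of the algorithm, not code from the notes; all variable names are mine) implements one round of the update with equations (18)–(20) and applies it to an assumed scalar problem:

```python
import numpy as np

def rls_update(x_est, P, H, y, R):
    """One recursive least squares step, following equations (18)-(20).

    x_est: current estimate (n,);  P: error covariance (n, n)
    H: measurement matrix (m, n);  y: measurement (m,);  R: noise covariance (m, m)
    """
    S = H @ P @ H.T + R                        # innovation covariance S_k
    K = P @ H.T @ np.linalg.inv(S)             # gain, equation (18)
    x_new = x_est + K @ (y - H @ x_est)        # estimate update, equation (19)
    P_new = (np.eye(len(x_est)) - K @ H) @ P   # covariance update, equation (20)
    return x_new, P_new

# Usage with an assumed scalar problem: estimate a constant from noisy readings.
rng = np.random.default_rng(5)
x_true = np.array([42.0])
x_est, P = np.array([0.0]), np.array([[1e6]])  # large P0 stands in for "no prior knowledge"
for _ in range(50):
    y = x_true + rng.normal(0.0, 1.0, size=1)
    x_est, P = rls_update(x_est, P, np.eye(1), y, np.array([[1.0]]))
print(x_est, P)                                # estimate near 42, covariance small
```

The second listed form, via (16), (17), and (19), could be substituted for the body of the function without changing the result.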

Example 3. We revisit the resistance estimation problem presented in Examples 1 and 2. Now, we want to iteratively improve our estimate of the resistance $x$. At the $k$th sampling, our measurement is

$$ y_k = H_k x + \nu_k = x + \nu_k, \qquad R_k = E(\nu_k^2). $$

Here, the measurement matrix $H_k$ is the scalar 1. Furthermore, we suppose that each measurement has the same variance, so $R_k$ is a constant written as $R$.

Before the first measurement, we have some idea about the resistance $x$; this becomes our initial estimate. Also, we have some uncertainty about this initial estimate, which becomes our initial covariance. Together we have

$$ \tilde{x}_0 = E(x), \qquad P_0 = E\big( (x - \tilde{x}_0)^2 \big). $$

If we have no idea about the resistance, set $P_0 = \infty$. If we are certain about the resistance value, set $P_0 = 0$. (Of course, then there would be no need to take measurements.)

After the first measurement ($k = 1$), we update the estimate and the error covariance according to equations (18)–(20) as follows:

$$ \begin{aligned} K_1 &= \frac{P_0}{P_0 + R}, \\ \tilde{x}_1 &= \tilde{x}_0 + \frac{P_0}{P_0 + R}\,(y_1 - \tilde{x}_0), \\ P_1 &= \left( 1 - \frac{P_0}{P_0 + R} \right) P_0 = \frac{P_0 R}{P_0 + R}. \end{aligned} $$

After the second measurement, the estimates become

$$ \begin{aligned} K_2 &= \frac{P_1}{P_1 + R} = \frac{P_0}{2P_0 + R}, \\ \tilde{x}_2 &= \tilde{x}_1 + \frac{P_1}{P_1 + R}\,(y_2 - \tilde{x}_1) = \frac{P_0 + R}{2P_0 + R}\,\tilde{x}_1 + \frac{P_0}{2P_0 + R}\,y_2, \\ P_2 &= \frac{P_1 R}{P_1 + R} = \frac{P_0 R}{2P_0 + R}. \end{aligned} $$


By induction, we can show that

$$ \begin{aligned} K_k &= \frac{P_0}{k P_0 + R}, \\ \tilde{x}_k &= \frac{(k-1)P_0 + R}{k P_0 + R}\,\tilde{x}_{k-1} + \frac{P_0}{k P_0 + R}\,y_k, \\ P_k &= \frac{P_0 R}{k P_0 + R}. \end{aligned} $$

Note that if $x$ is known perfectly a priori, then $P_0 = 0$, which implies that $K_k = 0$ and $\tilde{x}_k = \tilde{x}_0$ for all $k$. The optimal estimate of $x$ is then independent of any measurements that are obtained. At the opposite end of the spectrum, if $x$ is completely unknown a priori, then $P_0 = \infty$. The above equation for $\tilde{x}_k$ becomes

$$ \tilde{x}_k = \lim_{P_0 \to \infty} \left( \frac{(k-1)P_0 + R}{k P_0 + R}\,\tilde{x}_{k-1} + \frac{P_0}{k P_0 + R}\,y_k \right) = \frac{k-1}{k}\,\tilde{x}_{k-1} + \frac{1}{k}\,y_k = \frac{1}{k}\big( (k-1)\tilde{x}_{k-1} + y_k \big). $$

The right-hand side of the last equation above is just the running average $\bar{y}_k = \frac{1}{k}\sum_{j=1}^{k} y_j$ of the measurements. To see this, we first have

$$ \sum_{j=1}^{k} y_j = \sum_{j=1}^{k-1} y_j + y_k = (k-1)\left( \frac{1}{k-1}\sum_{j=1}^{k-1} y_j \right) + y_k = (k-1)\,\bar{y}_{k-1} + y_k. $$

Since $\tilde{x}_1 = y_1$, the recurrences for $\tilde{x}_k$ and $\bar{y}_k$ are the same. Hence $\tilde{x}_k = \bar{y}_k$ for all $k$.
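A quick numerical confirmation (my addition; the simulated readings use made-up values of the resistance and of $R$) that the scalar recursion with a very large $P_0$, standing in for $P_0 = \infty$, reproduces both the closed forms above and the running average:

```python
import numpy as np

rng = np.random.default_rng(6)
x_true, R = 100.0, 4.0                       # assumed resistance and noise variance
y = x_true + rng.normal(0.0, np.sqrt(R), size=30)

x_est, P0 = 0.0, 1e12                        # huge P0 approximates "no prior knowledge"
P = P0
for k, yk in enumerate(y, start=1):
    K = P / (P + R)                          # scalar form of (18)
    x_est = x_est + K * (yk - x_est)         # scalar form of (19)
    P = (1.0 - K) * P                        # scalar form of (20)
    assert np.isclose(K, P0 / (k * P0 + R))  # closed form for K_k from the example
    assert np.isclose(P, P0 * R / (k * P0 + R))
    assert np.isclose(x_est, y[:k].mean())   # recursive estimate = running average

print(x_est, y.mean())
```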

Example 4. Suppose that a tank contains a concentration $x_1$ of chemical 1 and a concentration $x_2$ of chemical 2. We have an instrument that detects the combined concentration $x_1 + x_2$ of the two chemicals but is not able to tell the values of $x_1$ and $x_2$ individually. Chemical 2 leaks from the tank so that its concentration decreases by 1% from one measurement to the next. The measurement equation is given as

$$ y_k = x_1 + 0.99^{k-1} x_2 + \nu_k, $$

where $H_k = (1,\ 0.99^{k-1})$, and $\nu_k$ is a random variable with zero mean and variance $R = 0.01$.

Let the real values be $x = (x_1, x_2)^T = (10, 5)^T$. Suppose the initial estimates are 8 for $x_1$ and 7 for $x_2$, i.e., $\tilde{x}_0 = (8, 7)^T$, with $P_0$ equal to the identity matrix. We apply the recursive least squares algorithm. Figure 3.1 on p. 92 of [1] shows the evolution of the estimates of $x_1$ and $x_2$, along with that of the variances of the estimation errors. It can be seen that after a couple dozen measurements the estimates get very close to the true values 10 and 5. The variances of the estimation errors asymptotically approach zero, which means that we have increasingly more confidence in the estimates as more measurements are obtained.
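Since the figure itself is not reproduced in these notes, the following sketch (my addition; the random seed and the number of steps are arbitrary) reruns the experiment with the stated values so the convergence can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(7)
x_true = np.array([10.0, 5.0])        # true concentrations
x_est = np.array([8.0, 7.0])          # initial estimates from the example
P = np.eye(2)                         # P0 = identity
R = 0.01                              # measurement noise variance

for k in range(1, 101):
    H = np.array([[1.0, 0.99 ** (k - 1)]])                  # 1 x 2 measurement matrix
    y = H @ x_true + rng.normal(0.0, np.sqrt(R), size=1)    # noisy combined reading
    S = H @ P @ H.T + R                                     # innovation covariance (scalar)
    K = P @ H.T / S                                         # gain, equation (18)
    x_est = x_est + K @ (y - H @ x_est)                     # equation (19)
    P = (np.eye(2) - K @ H) @ P                             # equation (20)

print(x_est)         # close to [10, 5]
print(np.diag(P))    # the error variances have become small
```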


A Proof of Theorem 1

Proof

Denote $C = (c_{ij})$, $X = (x_{ij})$, and $CX^T = (d_{ij})$. The trace of $CX^T$ is

$$ \mathrm{Tr}(CX^T) = \sum_{t=1}^{r} d_{tt} = \sum_{t=1}^{r} \sum_{k=1}^{s} c_{tk} x_{tk}. $$

From the above, we easily obtain its partial derivatives with respect to the entries of $X$:

$$ \frac{\partial}{\partial x_{ij}}\,\mathrm{Tr}(CX^T) = c_{ij}. $$

This establishes (9). To prove (10), we have

$$ \begin{aligned} \frac{\partial}{\partial X}\,\mathrm{Tr}(XCX^T) &= \frac{\partial}{\partial X}\,\mathrm{Tr}(XCY^T)\Big|_{Y=X} + \frac{\partial}{\partial X}\,\mathrm{Tr}(YCX^T)\Big|_{Y=X} \\ &= \frac{\partial}{\partial X}\,\mathrm{Tr}(YC^TX^T)\Big|_{Y=X} + YC\Big|_{Y=X} \qquad \text{(by (9))} \\ &= YC^T\Big|_{Y=X} + XC \\ &= XC^T + XC. \end{aligned} $$
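A numerical spot check of (9) and (10) (my addition), comparing the stated derivatives against central finite differences on random matrices:

```python
import numpy as np

rng = np.random.default_rng(8)
r, s = 3, 4
C9 = rng.normal(size=(r, s))     # C for identity (9): same shape as X
C10 = rng.normal(size=(s, s))    # C for identity (10): s x s so that X C X^T is defined
X = rng.normal(size=(r, s))

def num_grad(f, X, h=1e-6):
    """Central finite-difference gradient of the scalar function f at the matrix X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return G

g9 = num_grad(lambda Z: np.trace(C9 @ Z.T), X)
g10 = num_grad(lambda Z: np.trace(Z @ C10 @ Z.T), X)
print(np.allclose(g9, C9, atol=1e-5))                        # matches (9)
print(np.allclose(g10, X @ C10 + X @ C10.T, atol=1e-5))      # matches (10)
```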


References

[1] D. Simon. Optimal State Estimation. John Wiley & Sons, Inc., Hoboken, New Jersey, 2006.
