Part 5 - Prediction Error Methods

System Identification, Control Engineering B.Sc., 3rd year, Technical University of Cluj-Napoca, Romania. Lecturer: Lucian Buşoniu


Page 1: Part 5 - Prediction Error Methods

System Identification
Control Engineering B.Sc., 3rd year
Technical University of Cluj-Napoca, Romania

Lecturer: Lucian Buşoniu

Page 2: Part 5 - Prediction Error Methods

ARX models Model structures General PEM Implementation & Extensions

Part V

Prediction error methods

Page 3: Part 5 - Prediction Error Methods


Table of contents

1 Identification of ARX models

2 Model structures

3 General prediction error methods

4 Implementation and extensions

We stay in the single-input, single-output case for all but the last section.

Page 4: Part 5 - Prediction Error Methods


Classification

Recall Types of models from Part I:

1 Mental or verbal models
2 Graphs and tables (nonparametric)
3 Mathematical models, with two subtypes:

First-principles, analytical models
Models from system identification

Prediction error methods produce parametric, polynomial models.

Page 5: Part 5 - Prediction Error Methods


Table of contents

1 Identification of ARX models

Analytical development

Theoretical guarantee

Matlab example

2 Model structures

3 General prediction error methods

4 Implementation and extensions

Page 6: Part 5 - Prediction Error Methods


Recall: discrete-time model

Page 7: Part 5 - Prediction Error Methods


ARX model structure

We consider first the ARX model structure, where the output y(k) at the current discrete time step is computed based on previous input and output values:

y(k) + a1y(k − 1) + a2y(k − 2) + . . . + anay(k − na)

= b1u(k − 1) + b2u(k − 2) + . . . + bnbu(k − nb) + e(k)

equivalent to

y(k) = −a1y(k − 1) − a2y(k − 2) − . . . − anay(k − na)

+ b1u(k − 1) + b2u(k − 2) + . . . + bnbu(k − nb) + e(k)

e(k) is the noise at step k.

Model parameters: a1, a2, . . . , ana and b1, b2, . . . , bnb.

Name: AutoRegressive (y(k) a function of previous y values) with eXogenous input (dependence on u)
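As a quick illustration, the ARX difference equation can be simulated directly. A Python sketch (the lectures use Matlab; the coefficients and input below are made up, and e(k) is set to 0):

```python
# Simulate an ARX model: y(k) = -a1*y(k-1) - ... - ana*y(k-na)
#                               + b1*u(k-1) + ... + bnb*u(k-nb) + e(k).
# Coefficients and input are illustrative; e defaults to the zero signal.
def simulate_arx(a, b, u, e=None):
    """a, b: coefficient lists [a1..ana], [b1..bnb]; u: input sequence."""
    N = len(u)
    e = e or [0.0] * N
    y = []
    for k in range(N):
        yk = e[k]
        for i, ai in enumerate(a):          # -a1*y(k-1) - ... - ana*y(k-na)
            if k - 1 - i >= 0:
                yk -= ai * y[k - 1 - i]
        for j, bj in enumerate(b):          # b1*u(k-1) + ... + bnb*u(k-nb)
            if k - 1 - j >= 0:
                yk += bj * u[k - 1 - j]
        y.append(yk)
    return y

# First-order example, impulse input:
y = simulate_arx([0.5], [1.0], u=[1.0, 0.0, 0.0, 0.0])
# y = [0.0, 1.0, -0.5, 0.25]
```

Each output sample is just the weighted sum of past outputs and inputs, which is all the ARX recursion says.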

Page 8: Part 5 - Prediction Error Methods


Polynomial representation

Backward shift operator q−1:

q−1z(k) = z(k − 1)

where z(k) is any discrete-time signal.

Then:

y(k) + a1y(k − 1) + a2y(k − 2) + . . . + anay(k − na)
= (1 + a1q−1 + a2q−2 + . . . + anaq−na) y(k) =: A(q−1)y(k)

and:

b1u(k − 1) + b2u(k − 2) + . . . + bnbu(k − nb)
= (b1q−1 + b2q−2 + . . . + bnbq−nb) u(k) =: B(q−1)u(k)

Page 9: Part 5 - Prediction Error Methods


ARX model in polynomial form

Therefore, the ARX model is written compactly:

A(q−1)y(k) = B(q−1)u(k) + e(k)

The symbolic representation in the figure holds because:

y(k) = (1/A(q−1)) [B(q−1)u(k) + e(k)]

Remark: The ARX model is quite general: it can describe arbitrary linear relationships between inputs and outputs. However, the noise enters the model in a restricted way, and later we introduce models that generalize this.

Page 10: Part 5 - Prediction Error Methods


Linear regression model

Returning to the explicit recursive representation:

y(k) = −a1y(k − 1) − a2y(k − 2) − . . . − anay(k − na)
       + b1u(k − 1) + b2u(k − 2) + . . . + bnbu(k − nb) + e(k)

     = [−y(k − 1) · · · −y(k − na) u(k − 1) · · · u(k − nb)] · [a1 · · · ana b1 · · · bnb]> + e(k)

     =: ϕ>(k)θ + e(k)

So in fact ARX obeys the standard model structure in linear regression!

Regressor vector: ϕ(k) ∈ Rna+nb, containing previous output and input values.
Parameter vector: θ ∈ Rna+nb, containing the polynomial coefficients.

Page 11: Part 5 - Prediction Error Methods


Identification problem

Consider now that we are given a vector of data u(k), y(k),k = 1, . . . , N, and we have to find the model parameters θ.

Then for any k:

y(k) = ϕ>(k)θ + ε(k)

where ε(k) is now interpreted as an equation error.

Objective: minimize the mean squared error:

V(θ) = (1/N) ∑_{k=1}^{N} ε(k)²

Remark: When k ≤ na or k ≤ nb, zero- and negative-time values of u and y are needed to construct ϕ(k). They can be taken as 0 (assuming the system starts from zero initial conditions).
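The remark above can be sketched in code: a small Python helper (hypothetical, not from any toolbox) that builds ϕ(k) and substitutes 0 for zero- and negative-time samples:

```python
# Build the ARX regressor phi(k) = [-y(k-1) ... -y(k-na), u(k-1) ... u(k-nb)],
# taking zero- and negative-time samples as 0 (zero initial conditions).
def regressor(y, u, k, na, nb):
    """y, u: 1-indexed signals stored with index 0 unused; k >= 1."""
    def at(z, i):                      # z(i), or 0 for i <= 0
        return z[i] if i >= 1 else 0.0
    return [-at(y, k - i) for i in range(1, na + 1)] + \
           [at(u, k - j) for j in range(1, nb + 1)]

y = [None, 1.0, 2.0, 3.0]   # y(1), y(2), y(3)
u = [None, 0.5, 0.5, 0.5]
phi = regressor(y, u, 2, na=2, nb=1)
# phi(2) = [-y(1), -y(0)=0, u(1)] = [-1.0, 0.0, 0.5]
```

For k = 2 and na = 2 the regressor needs y(0), which is simply replaced by 0.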

Page 12: Part 5 - Prediction Error Methods


Linear system

Writing y(k) = ϕ>(k)θ for each data point (the errors ε(k) are left implicit):

y(1) = [−y(0) · · · −y(1 − na) u(0) · · · u(1 − nb)] · θ
y(2) = [−y(1) · · · −y(2 − na) u(1) · · · u(2 − nb)] · θ
· · ·
y(N) = [−y(N − 1) · · · −y(N − na) u(N − 1) · · · u(N − nb)] · θ

Matrix form:

Y = Φθ

with Y = [y(1), y(2), . . . , y(N)]> ∈ RN, and Φ ∈ RN×(na+nb) the matrix whose k-th row is the regressor ϕ>(k).

Page 13: Part 5 - Prediction Error Methods


ARX solution

From linear regression, to minimize (1/2) ∑_{k=1}^{N} ε(k)² the parameters are:

θ = (Φ>Φ)−1Φ>Y

Since the new V(θ) = (1/N) ∑_{k=1}^{N} ε(k)² is proportional to the criterion above, the same solution also minimizes V(θ).

However, the form above is impractical in system identification, since the number of data points N can be very large (forming and storing Φ explicitly is expensive). Better form:

Φ>Φ = ∑_{k=1}^{N} ϕ(k)ϕ>(k),   Φ>Y = ∑_{k=1}^{N} ϕ(k)y(k)

⇒ θ = [∑_{k=1}^{N} ϕ(k)ϕ>(k)]−1 [∑_{k=1}^{N} ϕ(k)y(k)]

Page 14: Part 5 - Prediction Error Methods


ARX solution (continued)

Remaining issue: the sum of N terms can grow very large, leading to numerical problems: (matrix of very large numbers)−1 · (vector of very large numbers).

Solution: Normalize the element values by dividing them by N. In the equations, N cancels, so it has no effect on the analytical development, but in practice it keeps the numbers reasonable.

θ = [(1/N) ∑_{k=1}^{N} ϕ(k)ϕ>(k)]−1 [(1/N) ∑_{k=1}^{N} ϕ(k)y(k)]
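Putting the pieces together, a minimal Python sketch of ARX identification with the normalized sums, on noise-free first-order data (all values illustrative); pure stdlib, with the 2 × 2 system solved by Cramer's rule:

```python
import random

# Estimate theta = [a1, b1] of a first-order ARX model from data, using the
# normalized sums (1/N) sum phi*phi^T and (1/N) sum phi*y. Data is noise-free
# here, so the true parameters are recovered exactly up to rounding.
random.seed(0)
a_true, b_true = 0.5, 2.0
N = 200
u = [random.uniform(-1, 1) for _ in range(N)]
y = [0.0] * N
for k in range(1, N):
    y[k] = -a_true * y[k - 1] + b_true * u[k - 1]

# Accumulate the normalized sums over k = 1 .. N-1.
R = [[0.0, 0.0], [0.0, 0.0]]   # (1/N) sum phi(k) phi(k)^T
r = [0.0, 0.0]                 # (1/N) sum phi(k) y(k)
for k in range(1, N):
    phi = [-y[k - 1], u[k - 1]]
    for i in range(2):
        r[i] += phi[i] * y[k] / N
        for j in range(2):
            R[i][j] += phi[i] * phi[j] / N

# Solve the 2x2 system R * theta = r by Cramer's rule.
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
theta = [(r[0] * R[1][1] - r[1] * R[0][1]) / det,
         (R[0][0] * r[1] - R[1][0] * r[0]) / det]
# theta ~ [0.5, 2.0]
```

Dividing each accumulated term by N, as the slide suggests, keeps the entries of R and r at the scale of the signals themselves regardless of N.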

Page 15: Part 5 - Prediction Error Methods


Table of contents

1 Identification of ARX models

Analytical development

Theoretical guarantee

Matlab example

2 Model structures

3 General prediction error methods

4 Implementation and extensions

Page 16: Part 5 - Prediction Error Methods


Main result

Assumptions
1 There exists a true parameter vector θ0 so that:

y(k) = ϕ>(k)θ0 + v(k)

with v(k) a stationary stochastic process independent from u(k).
2 E{ϕ(k)ϕ>(k)} is a nonsingular matrix.
3 E{ϕ(k)v(k)} = 0.

Theorem

ARX identification is consistent: the estimated parameters θ tend to the true parameters θ0 as N → ∞.

Page 17: Part 5 - Prediction Error Methods


Discussion of assumptions

1 Assumption 1 is equivalent to the existence of true polynomials A0(q−1), B0(q−1) so that:

A0(q−1)y(k) = B0(q−1)u(k) + v(k)

2 To motivate Assumption 2, recall

θ = [(1/N) ∑_{k=1}^{N} ϕ(k)ϕ>(k)]−1 [(1/N) ∑_{k=1}^{N} ϕ(k)y(k)]

As N → ∞, (1/N) ∑_{k=1}^{N} ϕ(k)ϕ>(k) → E{ϕ(k)ϕ>(k)}. This expectation is nonsingular if the data is "sufficiently informative" (e.g., u(k) should not be a simple feedback from y(k); see Söderström & Stoica for more discussion).

3 E{ϕ(k)v(k)} = 0 if v(k) is white noise. For more details on Assumption 3 and the role of E{ϕ(k)v(k)} = 0, see Part VI.

Page 18: Part 5 - Prediction Error Methods


Table of contents

1 Identification of ARX models

Analytical development

Theoretical guarantee

Matlab example

2 Model structures

3 General prediction error methods

4 Implementation and extensions

Page 19: Part 5 - Prediction Error Methods


Experimental data

Consider we are given the following separate identification and validation data sets.

plot(id); and plot(val);

Remarks: Identification input: a so-called pseudo-random binary signal, an approximation of (non-zero-mean) white noise. Validation input: a sequence of steps.

Page 20: Part 5 - Prediction Error Methods


Identifying an ARX model

model = arx(id, [na, nb, nk]);

Arguments:

1 Identification data.
2 Array containing the orders of A and B, and the delay nk.

Structure slightly different from theory: includes the explicit minimum delay nk between inputs and outputs, useful to model systems with time delays.

y(k) + a1y(k − 1)+a2y(k − 2) + . . . + anay(k − na)

= b1u(k − nk)+b2u(k − nk − 1) + . . . + bnbu(k − nk − nb + 1) + e(k)

A(q−1)y(k) = B(q−1)u(k − nk) + e(k), where:

A(q−1) = (1 + a1q−1 + a2q−2 + . . . + anaq−na)

B(q−1) = (b1 + b2q−1 + . . . + bnbq−nb+1)

Note: can transform this into the theoretical structure by rewriting the right-hand side using a B(q−1) of order nk + nb − 1, with nk − 1 leading zeros.
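The reindexing in the note amounts to prepending zeros to the coefficient list; a one-line Python sketch (hypothetical helper, illustrative coefficients):

```python
# Rewrite a Matlab-style B polynomial (coefficients b1..bnb acting at delays
# nk, nk+1, ...) into the theoretical form where coefficients act at delays
# 1, 2, ...: simply prepend nk - 1 leading zeros.
def to_theoretical_b(b, nk):
    return [0.0] * (nk - 1) + list(b)

# b1=1.0, b2=0.5 at delay nk=3 become a 4-coefficient B with 2 leading zeros:
print(to_theoretical_b([1.0, 0.5], nk=3))  # [0.0, 0.0, 1.0, 0.5]
```

With nk = 1 the list is unchanged, matching the theoretical structure directly.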

Page 21: Part 5 - Prediction Error Methods


Model validation

Assuming the system is second-order with no time delay, we take na = 2, nb = 1, nk = 1. Validation: compare(model, val);

Results are quite bad.

Page 22: Part 5 - Prediction Error Methods


Structure selection

Better idea: try many different structures and choose the best one.

Na = 1:15; Nb = 1:15; Nk = 1:5;
NN = struc(Na, Nb, Nk);
V = arxstruc(id, val, NN);

struc generates all combinations of orders in Na, Nb, Nk. arxstruc identifies an ARX model for each combination (on the data in the 1st argument), simulates it (on the data in the 2nd argument), and returns all the MSEs on the first row of V (see help arxstruc for the format of V).

Page 23: Part 5 - Prediction Error Methods


Structure selection (continued)

To choose the structure with the smallest MSE:

N = selstruc(V, 0);

For our data, N = [8, 7, 1].

Alternatively, graphical selection: N = selstruc(V, 'plot');

(Later we learn other structure selection criteria than smallest MSE.)

Page 24: Part 5 - Prediction Error Methods


Validation of best ARX model

model = arx(id, N); compare(model, val);

A better fit is obtained.

Page 25: Part 5 - Prediction Error Methods


Table of contents

1 Identification of ARX models

2 Model structures

3 General prediction error methods

4 Implementation and extensions

Page 26: Part 5 - Prediction Error Methods


Motivation

Before deriving the general PEM, we will introduce the general class of models to which it can be applied.

Page 27: Part 5 - Prediction Error Methods


General model structure

y(k) = G(q−1)u(k) + H(q−1)e(k)

where G and H are discrete-time transfer functions – ratios of polynomials. Signal e(k) is zero-mean white noise.

By collecting the factors common to the denominators of G and H into A(q−1), we get the more detailed form:

A(q−1)y(k) = [B(q−1)/F(q−1)] u(k) + [C(q−1)/D(q−1)] e(k)

where A, B, C, D, F are all polynomials, of orders na, nb, nc, nd, nf.

Page 28: Part 5 - Prediction Error Methods


General model structure (continued)

A(q−1)y(k) = [B(q−1)/F(q−1)] u(k) + [C(q−1)/D(q−1)] e(k)

Very general form: all other linear polynomial forms are special cases of it. It is not meant for practical use, but to describe algorithms in a generic way.

Page 29: Part 5 - Prediction Error Methods


ARMAX model structure

Setting F = D = 1 (i.e. orders nf = nd = 0), we get:

A(q−1)y(k) = B(q−1)u(k) + C(q−1)e(k)

Name: AutoRegressive, Moving Average (referring to the noise model), with eXogenous input (dependence on u)

Page 30: Part 5 - Prediction Error Methods


ARMAX: explicit form

A(q−1)y(k) = B(q−1)u(k) + C(q−1)e(k)

A(q−1) = 1 + a1q−1 + . . . + anaq−na

B(q−1) = b1q−1 + . . . + bnbq−nb

C(q−1) = 1 + c1q−1 + . . . + cncq−nc

y(k) + a1y(k − 1) + . . . + anay(k − na)

= b1u(k − 1) + . . . + bnbu(k − nb)

+ e(k) + c1e(k − 1) + . . . + cnce(k − nc)

with parameter vector:

θ = [a1, . . . , ana, b1, . . . , bnb, c1, . . . , cnc ]> ∈ Rna+nb+nc

Compared to ARX, ARMAX can model more intricate relationships between the disturbance and the inputs and outputs.

Page 31: Part 5 - Prediction Error Methods


Special case of ARMAX: ARX

Setting C = 1 in ARMAX (nc = 0), we get:

A(q−1)y(k) = B(q−1)u(k) + e(k)

precisely the ARX model we worked with before.

Page 32: Part 5 - Prediction Error Methods


Special case of ARX: FIR

Further setting A = 1 (na = 0) in ARX, we get:

y(k) = B(q−1)u(k) + e(k) = ∑_{j=1}^{nb} bj u(k − j) + e(k)

     = ∑_{j=0}^{M−1} h(j)u(k − j) + e(k)

the FIR model from correlation analysis!

(where h(0), the impulse response at time 0, is assumed to be 0 – i.e., the system does not respond instantaneously to changes in the input)

Page 33: Part 5 - Prediction Error Methods


Difference between ARX and FIR

ARX: A(q−1)y(k) = B(q−1)u(k) + e(k)

FIR: y(k) = B(q−1)u(k) + e(k)

Since ARX includes recursive relationships between current and previous outputs, it will be sufficient to take orders na and nb (at most) equal to the order of the dynamical system.

FIR needs a sufficiently large order nb (or length M) to model the entire transient regime of the impulse response (in principle, we only recover the correct model as M → ∞).

⇒ more parameters ⇒ more data needed to identify them.
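A small Python sketch of this difference (illustrative coefficients): a single ARX pole produces the infinite impulse response h(j) = b(−a)^(j−1), which a length-M FIR model can only truncate:

```python
# Impulse response of the first-order ARX system y(k) = -a*y(k-1) + b*u(k-1):
# closed form h(j) = b*(-a)^(j-1) for j >= 1. It never becomes exactly zero,
# so a FIR model of length M captures only the first M terms.
a, b = 0.5, 1.0
M = 20
u = [1.0] + [0.0] * (M - 1)          # unit impulse
y = [0.0] * (M + 1)
for k in range(1, M + 1):
    y[k] = -a * y[k - 1] + b * u[k - 1]

h = [b * (-a) ** (j - 1) for j in range(1, M + 1)]   # closed form
# y(j) matches h(j) for j = 1..M; the tail beyond any finite M is nonzero.
```

Two ARX parameters (a, b) describe what a FIR model needs M coefficients to approximate.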

Page 34: Part 5 - Prediction Error Methods


Overall relationship

General Form ⊃ ARMAX ⊃ ARX ⊃ FIR

ARMAX to ARX: less freedom in modeling the disturbance.
ARX to FIR: more parameters required.

Page 35: Part 5 - Prediction Error Methods


Output error

Other model forms are possible that are not special cases of ARMAX, e.g. Output Error (OE):

y(k) = [B(q−1)/F(q−1)] u(k) + e(k)

obtained for na = nc = nd = 0, i.e. A = C = D = 1.

This corresponds to simple, additive measurement noise on the output (the "output error").

Page 36: Part 5 - Prediction Error Methods


Table of contents

1 Identification of ARX models

2 Model structures

3 General prediction error methods

Stepping stones: ARX and ARMAX

General case

Matlab example

Theoretical guarantees

4 Implementation and extensions

Page 37: Part 5 - Prediction Error Methods


ARX reinterpreted as a PEM

1 Compute predictions at each step, ŷ(k) = ϕ>(k)θ, given parameters θ.
2 Compute prediction errors at each step, ε(k) = y(k) − ŷ(k).
3 Find a parameter vector θ minimizing the criterion V(θ) = (1/N) ∑_{k=1}^{N} ε²(k).

Prediction error methods are obtained by extending this procedure to general model structures.

Page 38: Part 5 - Prediction Error Methods


ARX reinterpreted as a PEM (continued)

Remarks:

The ARX predictor ŷ(k) is chosen to achieve the error ε(k) = e(k). We will aim to achieve the same error in the general PEM, intuitively because we cannot do better. (For the mathematically inclined: such a predictor is optimal in the sense of minimizing the variance of ε(k).)

The prediction error is, for ARX, just a rearrangement of the equation y(k) = ϕ>(k)θ + ε(k) = ŷ(k) + ε(k).

The criterion can be more general than the MSE, but we will always use the MSE in this section. For ARX, we already know how to minimize the MSE (from linear regression); for more general models, new methods will be introduced.

Page 39: Part 5 - Prediction Error Methods


Intermediate step: PEM for 1st order ARMAX

To make things easier to follow, we first derive PEM for a 1st order ARMAX.

Looking at slide "ARMAX: explicit form", we write the 1st order ARMAX in a slightly different way:

y(k) = −ay(k − 1) + bu(k − 1) + ce(k − 1) + e(k)

Page 40: Part 5 - Prediction Error Methods


1st order ARMAX: predictor

To achieve error e(k), the predictor must be:

ŷ(k) = −ay(k − 1) + bu(k − 1) + ce(k − 1) (5.1)

This depends on the unknown noise e(k − 1). However, we derive a recursive predictor formula that does not.

ŷ(k − 1) = −ay(k − 2) + bu(k − 2) + ce(k − 2) (5.2)

From Eqn. (5.1) + c · Eqn. (5.2):

ŷ(k) + cŷ(k − 1)
= −ay(k − 1) + bu(k − 1) + ce(k − 1) + c(−ay(k − 2) + bu(k − 2) + ce(k − 2))
= −ay(k − 1) + bu(k − 1) + ce(k − 1) + c(−ay(k − 2) + bu(k − 2) + ce(k − 2) + e(k − 1) − e(k − 1))
= −ay(k − 1) + bu(k − 1) + ce(k − 1) + cy(k − 1) − ce(k − 1)
= (c − a)y(k − 1) + bu(k − 1)

where in the third line we recognized y(k − 1) = −ay(k − 2) + bu(k − 2) + ce(k − 2) + e(k − 1) inside the parenthesis.

Page 41: Part 5 - Prediction Error Methods


1st order ARMAX: predictor (continued)

Final recursion:

ŷ(k) = −cŷ(k − 1) + (c − a)y(k − 1) + bu(k − 1)

Requires initialization at ŷ(0); this initial value is usually taken as 0.

This initialization needs some technical conditions to work properly; we do not go into them here.

Page 42: Part 5 - Prediction Error Methods


1st order ARMAX: prediction error

Similar recursion:

ε(k) = y(k) − ŷ(k)
ε(k − 1) = y(k − 1) − ŷ(k − 1)

⇒ ε(k) + cε(k − 1) = y(k) + cy(k − 1) − (ŷ(k) + cŷ(k − 1))
= y(k) + cy(k − 1) − ((c − a)y(k − 1) + bu(k − 1))
= y(k) + ay(k − 1) − bu(k − 1)

⇒ ε(k) = −cε(k − 1) + y(k) + ay(k − 1) − bu(k − 1)

Requires initialization of ε(0), usually taken as 0.
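Both recursions can be checked numerically: with the true parameters, zero initial conditions, and e(0) = 0, the recursive predictor reproduces ε(k) = e(k) exactly. A Python sketch (parameter values illustrative):

```python
import random

# Check that the 1st order ARMAX predictor/error recursions give eps(k) = e(k).
# Data: y(k) = -a*y(k-1) + b*u(k-1) + c*e(k-1) + e(k), zero initial conditions.
random.seed(1)
a, b, c = 0.7, 1.0, 0.5
N = 50
u = [random.uniform(-1, 1) for _ in range(N)]
e = [0.0] + [random.gauss(0, 0.1) for _ in range(N - 1)]  # e(0) = 0
y = [0.0] * N
for k in range(1, N):
    y[k] = -a * y[k - 1] + b * u[k - 1] + c * e[k - 1] + e[k]

# Predictor recursion: yhat(k) = -c*yhat(k-1) + (c - a)*y(k-1) + b*u(k-1)
yhat = [0.0] * N
eps = [0.0] * N
for k in range(1, N):
    yhat[k] = -c * yhat[k - 1] + (c - a) * y[k - 1] + b * u[k - 1]
    eps[k] = y[k] - yhat[k]

# With the true parameters, the prediction error equals the noise sequence.
max_dev = max(abs(eps[k] - e[k]) for k in range(N))
```

If e(0) were nonzero, the initialization error would instead decay geometrically at rate |c| < 1, which is why the zero initialization is harmless in practice.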

Page 43: Part 5 - Prediction Error Methods


Finding the parameters

Once the errors are available, the parameter vector θ is found by minimizing the criterion V(θ) = (1/N) ∑_{k=1}^{N} ε²(k).

Page 44: Part 5 - Prediction Error Methods


Table of contents

1 Identification of ARX models

2 Model structures

3 General prediction error methods

Stepping stones: ARX and ARMAX

General case

Matlab example

Theoretical guarantees

4 Implementation and extensions

Page 45: Part 5 - Prediction Error Methods


Recall: General model structure

y(k) = G(q−1)u(k) + H(q−1)e(k)

where G and H are discrete-time transfer functions.

Page 46: Part 5 - Prediction Error Methods


Prediction error

We start by deriving e(k).

y(k) = G(q−1)u(k) + H(q−1)e(k)

⇒ e(k) = H−1(q−1)(y(k)−G(q−1)u(k))

where H−1 is the inverse of the fraction of polynomials H.

Recall that the predictor will be derived so that the prediction error ε(k) = y(k) − ŷ(k) = e(k), so the formula above can also be used to compute ε(k).

Page 47: Part 5 - Prediction Error Methods


Predictor

To achieve error e(k), the predictor must be:

ŷ(k) = y(k) − e(k) = Gu(k) + (H − 1)e(k)
     = Gu(k) + (H − 1)H−1(y(k) − Gu(k))
     = Gu(k) + (1 − H−1)(y(k) − Gu(k))
     = Gu(k) + (1 − H−1)y(k) − Gu(k) + H−1Gu(k)
     = (1 − H−1)y(k) + H−1Gu(k)
     =: L1y(k) + L2u(k)

where we dropped the argument q−1 to keep the equations readable.

Remarks:

New notations: L1(q−1) = 1 − H−1(q−1), L2(q−1) = H−1(q−1)G(q−1).
In order to have a causal predictor (which only depends on past values of the output and input), we need G(0) = 0 and H(0) = 1, leading to L1(0) = L2(0) = 0.

Page 48: Part 5 - Prediction Error Methods


1st order ARMAX: finding the parameters

Once the predictors and errors are available, the parameter vector θ is found by minimizing the criterion V(θ) = (1/N) ∑_{k=1}^{N} ε²(k).

Again, linear regression will not work in general. Computational methods to solve the minimization problem will be introduced in the next section.

Page 49: Part 5 - Prediction Error Methods


Specializing the framework to ARX

It is instructive to see how the formulas simplify in the ARX case. Rewriting ARX in the general model template:

y(k) = Gu(k) + He(k) = (B/A)u(k) + (1/A)e(k)

We have:

L1 = 1 − H−1 = 1 − A,  L2 = H−1G = B
ŷ(k) = L1y(k) + L2u(k) = (1 − A)y(k) + Bu(k) = ϕ>(k)θ
ε(k) = H−1(y(k) − Gu(k)) = Ay(k) − Bu(k)

which can easily be seen to be equivalent to the ARX formulation.

Page 50: Part 5 - Prediction Error Methods


Specializing the framework to 1st order ARMAX

Homework! Plug in the polynomials for 1st order ARMAX and verify that you obtain the same result as before.

Page 51: Part 5 - Prediction Error Methods


Table of contents

1 Identification of ARX models

2 Model structures

3 General prediction error methods

Stepping stones: ARX and ARMAX

General case

Matlab example

Theoretical guarantees

4 Implementation and extensions

Page 52: Part 5 - Prediction Error Methods


Experimental data

Consider again the experimental data on which ARX was applied. plot(id); and plot(val);

Recall theoretical ARMAX:

A(q−1)y(k) = B(q−1)u(k) + C(q−1)e(k)

Page 53: Part 5 - Prediction Error Methods


Identifying an ARMAX model

mARMAX = armax(id, [na, nb, nc, nk]);

Arguments:

1 Identification data.
2 Array containing the orders of A, B, C, and the delay nk.

Like for ARX, the structure includes the explicit minimum delay nk between inputs and outputs.

y(k) + a1y(k − 1) + a2y(k − 2) + . . . + anay(k − na)

= b1u(k − nk) + b2u(k − nk − 1) + . . . + bnbu(k − nk − nb + 1)

+ e(k) + c1e(k − 1) + c2e(k − 2) + . . . + cnce(k − nc)

A(q−1)y(k) = B(q−1)u(k − nk) + C(q−1)e(k), where:

A(q−1) = (1 + a1q−1 + a2q−2 + . . . + anaq−na)

B(q−1) = (b1 + b2q−1 + . . . + bnbq−nb+1)

C(q−1) = (1 + c1q−1 + c2q−2 + . . . + cncq−nc)

Remark: As for ARX, can transform into the theoretical structure by using a new B(q−1) of order nk + nb − 1, with nk − 1 leading zeros.

Page 54: Part 5 - Prediction Error Methods


ARMAX model validation

Considering the system is 2nd order with no time delay, take na = 2, nb = 2, nc = 2, nk = 1. Validation: compare(val, mARMAX);

In contrast to ARX, results are good. Flexible noise model pays off.

Page 55: Part 5 - Prediction Error Methods


Identifying an OE model

Recall OE model structure:

y(k) = [B(q−1)/F(q−1)] u(k) + e(k)

Page 56: Part 5 - Prediction Error Methods


Identifying an OE model (continued)

mOE = oe(id, [nb, nf, nk]);

Arguments:

1 Identification data.
2 Array containing the orders of B, F, and the delay nk.

y(k) = [B(q−1)/F(q−1)] u(k − nk) + e(k), where:

B(q−1) = (b1 + b2q−1 + . . . + bnbq−nb+1)
F(q−1) = (1 + f1q−1 + f2q−2 + . . . + fnf q−nf)

Remark: Like before, can transform into the theoretical structure by changing B.

Page 57: Part 5 - Prediction Error Methods


OE model validation

Considering the system is second-order with no time delay, we take nb = 2, nf = 2, nk = 1. Validation: compare(val, mOE);

Results as good as ARMAX. The system seems to obey both model structures.

Page 58: Part 5 - Prediction Error Methods


Table of contents

1 Identification of ARX models

2 Model structures

3 General prediction error methods

Stepping stones: ARX and ARMAX

General case

Matlab example

Theoretical guarantees

4 Implementation and extensions

Page 59: Part 5 - Prediction Error Methods


Preliminaries: Vector derivative and Hessian

Consider any function V(θ), V : Rn → R. Then:

dV/dθ = [∂V/∂θ1, ∂V/∂θ2, . . . , ∂V/∂θn]>

and the Hessian d²V/dθ² is the n × n matrix of second derivatives, with entry (i, j) equal to ∂²V/∂θi∂θj.

Page 60: Part 5 - Prediction Error Methods


Assumptions

Assumptions (simplified)
1 Signals u(k) and y(k) are stationary stochastic processes.
2 The input signal u(k) is "sufficiently informative".
3 Transfer functions G(q−1) and H(q−1) depend smoothly on θ.
4 The Hessian d²V/dθ² is nonsingular.

Page 61: Part 5 - Prediction Error Methods


Discussion of assumptions

Assumption 2 is informal; the formal name is "persistently exciting" input (it comes later).

Recall that G and H have the parameters as coefficients; we denote them by G(q−1; θ) and H(q−1; θ) to make the dependence explicit. Assumption 3 ensures we can differentiate G and H w.r.t. θ. Note that dG/dθ and dH/dθ are vectors of transfer functions!

Recall V(θ) = (1/N) ∑_{k=1}^{N} ε²(k), the MSE. Assumption 4 ensures V is not "flat" around minima, so the minima are uniquely identifiable.

Page 62: Part 5 - Prediction Error Methods


Guarantee

Theorem 1

Define the limit V∞(θ) = limN→∞ V(θ). Given Assumptions 1–4, the identification solution θ = arg minθ V(θ) converges to a minimum point θ∗ of V∞(θ) as N → ∞.

Remark: This is a type of consistency guarantee, in the limit of infinitely many data points.

Page 63: Part 5 - Prediction Error Methods


Further assumptions

Assumptions (simplified)
5 The true system satisfies the chosen model structure. This means there exists at least one θ0 so that, for any input u(k) and the corresponding output y(k) of the true system:

y(k) = G(q−1; θ0)u(k) + H(q−1; θ0)e(k)

with e(k) white noise.
6 The input u(k) is independent from the noise e(k) (the experiment is performed in open loop).

Page 64: Part 5 - Prediction Error Methods


Additional guarantee

Theorem 2

Under Assumptions 1–6, θ converges to a true parameter vector θ0 as N → ∞.

Remark: Also a consistency guarantee. Theorem 1 guaranteed a minimum-error solution, whereas Theorem 2 additionally says this solution corresponds to the true system, if the system satisfies the model structure.

Page 65: Part 5 - Prediction Error Methods


Table of contents

1 Identification of ARX models

2 Model structures

3 General prediction error methods

4 Implementation and extensions

Solving the optimization problem

Multiple inputs and outputs

MIMO ARX: Matlab example

Page 66: Part 5 - Prediction Error Methods


Optimization problem

Objective of the identification procedure: minimize the mean squared error

V(θ) = (1/N) ∑_{k=1}^{N} ε(k)²

where ε(k) are the prediction errors. In the general case:

ε(k) = H−1(q−1)(y(k) − G(q−1)u(k))

Solution: θ = arg minθ V(θ)

So far we took this solution for granted and investigated its properties. While in ARX linear regression could be applied to find θ, in general this does not work. Main implementation question:

How to solve the optimization problem?

Page 67: Part 5 - Prediction Error Methods


Minimization via derivative root

Consider first the scalar case, θ ∈ R.

Idea: at any minimum, the derivative f(θ) = dV/dθ is zero. So, find a root of f(θ).

Remarks:

Care must be taken to find a minimum and not a maximum or inflection point. This can be checked with the second derivative: d²V/dθ² = df/dθ > 0.

We may also find a local minimum which is larger (worse) than the global one.

Page 68: Part 5 - Prediction Error Methods


Newton’s method for root finding

Start from some initial point θ0. At iteration ℓ, the next point θℓ+1 is the intersection between the abscissa and the tangent to f at the current point θℓ. By geometric arguments:

θℓ+1 = θℓ − f(θℓ) / [df(θℓ)/dθ]

Remarks:

Notation df(θℓ)/dθ means the value of the derivative df/dθ at the point θℓ; it is the slope of the tangent.

θℓ+1 is the "best guess" for the root given the current point θℓ.
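A minimal Python sketch of the iteration, on an illustrative scalar example f(θ) = θ² − 2 (not the identification criterion):

```python
# Newton's method for root finding: theta <- theta - f(theta)/f'(theta).
# Illustrative example: f(theta) = theta^2 - 2, whose positive root is sqrt(2).
def newton_root(f, df, theta0, iters=20):
    theta = theta0
    for _ in range(iters):
        theta = theta - f(theta) / df(theta)
    return theta

root = newton_root(lambda t: t * t - 2.0, lambda t: 2.0 * t, theta0=1.0)
# root converges quadratically to sqrt(2) = 1.41421356...
```

Starting near the root, each iteration roughly doubles the number of correct digits.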

Page 69: Part 5 - Prediction Error Methods


Newton’s method for the identification problem

Replace f(θ) by dV/dθ to get back to the optimization problem:

θℓ+1 = θℓ − [dV(θℓ)/dθ] / [d²V(θℓ)/dθ²]

Extend to θ ∈ Rn. Recall that dV/dθ is now an n-vector, and d²V/dθ² an n × n matrix (the Hessian):

θℓ+1 = θℓ − [d²V(θℓ)/dθ²]−1 dV(θℓ)/dθ

Add a step size αℓ > 0:

θℓ+1 = θℓ − αℓ [d²V(θℓ)/dθ²]−1 dV(θℓ)/dθ

Remark: The step size helps with the convergence of the method, e.g. because in identification V is noisy.
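The damped update can be sketched on an illustrative convex scalar criterion V(θ) = (θ − 2)² + 0.5(θ − 2)⁴, minimized at θ = 2 (not the identification MSE):

```python
# Damped Newton minimization: theta <- theta - alpha * V'(theta)/V''(theta).
# Illustrative criterion V(theta) = (theta-2)^2 + 0.5*(theta-2)^4, minimum at
# theta = 2; alpha = 1 recovers the pure Newton step.
def newton_minimize(dV, d2V, theta0, alpha=1.0, iters=30):
    theta = theta0
    for _ in range(iters):
        theta = theta - alpha * dV(theta) / d2V(theta)
    return theta

theta_star = newton_minimize(
    dV=lambda t: 2 * (t - 2) + 2 * (t - 2) ** 3,    # V'(theta)
    d2V=lambda t: 2 + 6 * (t - 2) ** 2,             # V''(theta) > 0 everywhere
    theta0=0.0)
# theta_star converges to 2
```

Because V''(θ) > 0 everywhere here, every step moves toward the minimum; with a noisy V, a smaller α trades speed for robustness.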

Page 70: Part 5 - Prediction Error Methods


Stopping criterion

The algorithm can be stopped:

when the difference between consecutive parameter vectors is small, e.g. max_{i=1..n} |θi,ℓ+1 − θi,ℓ| smaller than some preset threshold,
or
when the number of iterations ℓ exceeds a preset maximum.

Page 71: Part 5 - Prediction Error Methods


Computing the derivatives

V(θ) = (1/N) ∑_{k=1}^{N} ε(k)²

Keeping in mind that ε(k) depends on θ, from matrix calculus:

dV/dθ = (2/N) ∑_{k=1}^{N} ε(k) dε(k)/dθ

d²V/dθ² = (2/N) ∑_{k=1}^{N} [dε(k)/dθ][dε(k)/dθ]> + (2/N) ∑_{k=1}^{N} ε(k) d²ε(k)/dθ²

where dε(k)/dθ is the vector derivative and d²ε(k)/dθ² the Hessian of ε(k); note that [dε(k)/dθ][dε(k)/dθ]> is an n × n matrix.

Page 72: Part 5 - Prediction Error Methods


Gauss-Newton

Ignore the second term in the Hessian of V and just use the first term:

H = (2/N) ∑_{k=1}^{N} [dε(k)/dθ][dε(k)/dθ]>

leading to the Gauss-Newton algorithm:

θℓ+1 = θℓ − αℓ H−1 dV(θℓ)/dθ

Motivation:

The quadratic form of H gives better algorithm behavior (H is positive semidefinite).
Simpler computation (no second derivatives of ε are needed).

Page 73: Part 5 - Prediction Error Methods


Example: 1st order ARMAX

Recall the model and prediction error for 1st order ARMAX:

y(k) = −ay(k − 1) + bu(k − 1) + ce(k − 1) + e(k)
ε(k) = −cε(k − 1) + y(k) + ay(k − 1) − bu(k − 1)

We need dε(k)/dθ = [∂ε(k)/∂a, ∂ε(k)/∂b, ∂ε(k)/∂c]>. Differentiating the second equation:

∂ε(k)/∂a = −c ∂ε(k − 1)/∂a + y(k − 1)
∂ε(k)/∂b = −c ∂ε(k − 1)/∂b − u(k − 1)
∂ε(k)/∂c = −c ∂ε(k − 1)/∂c − ε(k − 1)

So ∂ε(k)/∂a, ∂ε(k)/∂b, ∂ε(k)/∂c are dynamical signals that can be computed using the recursions above, starting e.g. from 0 initial values.
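The gradient recursions can be sanity-checked against finite differences of ε(k); a Python sketch with illustrative data and parameter values:

```python
import random

# Validate the 1st order ARMAX gradient recursions by comparing dEps/dTheta,
# computed recursively, against central finite differences of eps(k).
random.seed(2)
N = 40
u = [random.uniform(-1, 1) for _ in range(N)]
yd = [random.uniform(-1, 1) for _ in range(N)]  # arbitrary "measured" output

def eps_sequence(a, b, c):
    """eps(k) = -c*eps(k-1) + y(k) + a*y(k-1) - b*u(k-1), eps(0) = 0."""
    eps = [0.0] * N
    for k in range(1, N):
        eps[k] = -c * eps[k - 1] + yd[k] + a * yd[k - 1] - b * u[k - 1]
    return eps

a, b, c = 0.4, 1.2, 0.3
eps = eps_sequence(a, b, c)

# Gradient recursions, zero initial values.
da = [0.0] * N
db = [0.0] * N
dc = [0.0] * N
for k in range(1, N):
    da[k] = -c * da[k - 1] + yd[k - 1]
    db[k] = -c * db[k - 1] - u[k - 1]
    dc[k] = -c * dc[k - 1] - eps[k - 1]

# Central finite differences at the last step.
h = 1e-6
fd_a = (eps_sequence(a + h, b, c)[-1] - eps_sequence(a - h, b, c)[-1]) / (2 * h)
fd_b = (eps_sequence(a, b + h, c)[-1] - eps_sequence(a, b - h, c)[-1]) / (2 * h)
fd_c = (eps_sequence(a, b, c + h)[-1] - eps_sequence(a, b, c - h)[-1]) / (2 * h)
dev = max(abs(da[-1] - fd_a), abs(db[-1] - fd_b), abs(dc[-1] - fd_c))
```

The recursive derivatives match the finite differences to numerical precision, confirming the formulas before plugging them into the (Gauss-)Newton update.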

Page 74: Part 5 - Prediction Error Methods


Example: 1st order ARMAX (continued)

Finally, the algorithm is implemented as follows:

Given current value of parameter vector θ`, apply the recursionsabove to find signals dε(k)

dθ , k = 1, . . . , n.

Plug dε(k)dθ into equations for dV

dθ , d2Vdθ2

Apply Newton (or Gauss-Newton) update formula to find θ`+1.
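Putting the steps together, a self-contained Gauss-Newton loop for the 1st-order ARMAX model might look as follows (a sketch only: the function name, step size, and iteration count are illustrative assumptions, and NumPy stands in for Matlab):

```python
import numpy as np

def fit_armax1(u, y, theta0, alpha=1.0, n_iter=20):
    """Gauss-Newton fit of a 1st-order ARMAX model; theta = [a, b, c]."""
    theta = np.asarray(theta0, dtype=float)
    N = len(y)
    for _ in range(n_iter):
        a, b, c = theta
        eps = np.zeros(N)
        J = np.zeros((N, 3))
        # Recursions for eps(k) and deps(k)/dtheta, zero initial values.
        for k in range(1, N):
            eps[k] = -c * eps[k - 1] + y[k] + a * y[k - 1] - b * u[k - 1]
            J[k] = [-c * J[k - 1, 0] + y[k - 1],
                    -c * J[k - 1, 1] - u[k - 1],
                    -c * J[k - 1, 2] - eps[k - 1]]
        grad = (2.0 / N) * (J.T @ eps)
        H = (2.0 / N) * (J.T @ J)
        theta = theta - alpha * np.linalg.solve(H, grad)
    return theta
```

Starting reasonably close to the true parameters helps; in general, PEM optimization can get stuck in local minima.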

Page 75: Part 5 - Prediction Error Methods


Table of contents

1 Identification of ARX models

2 Model structures

3 General prediction error methods

4 Implementation and extensions

Solving the optimization problem

Multiple inputs and outputs

MIMO ARX: Matlab example

Page 76: Part 5 - Prediction Error Methods

MIMO system

So far we considered y(k) ∈ R, u(k) ∈ R, i.e. Single-Input, Single-Output (SISO) systems:

y(k) = G(q−1)u(k) + H(q−1)e(k)

Many systems are Multiple-Input, Multiple-Output (MIMO). E.g., aircraft. Inputs: throttle, aileron, elevator, rudder. Outputs: airspeed, roll, pitch, yaw.

Page 77: Part 5 - Prediction Error Methods

General MIMO model

Consider y(k) ∈ R^ny, u(k) ∈ R^nu. Model:

y(k) = G(q−1)u(k) + H(q−1)e(k)

[y1(k); y2(k); . . . ; yny(k)] = G(q−1) [u1(k); u2(k); . . . ; unu(k)] + H(q−1) [e1(k); e2(k); . . . ; eny(k)]

Different from SISO:

the noise e(k) ∈ R^ny is a random vector of the same size as y, white noise (independent, zero-mean).
G(q−1) is a matrix of size ny × nu.
H(q−1) is a matrix of size ny × ny.

The matrix elements are fractions of polynomials in q−1.

Page 78: Part 5 - Prediction Error Methods

MIMO ARX

A(q−1)y(k) = B(q−1)u(k) + e(k)
A(q−1) = I + A1 q−1 + . . . + Ana q−na
B(q−1) = B1 q−1 + . . . + Bnb q−nb

where I is the ny × ny identity matrix, A1, . . . , Ana ∈ R^{ny×ny}, and B1, . . . , Bnb ∈ R^{ny×nu}.

Put in the general template:

y(k) = A−1(q−1)B(q−1)u(k) + A−1(q−1)e(k)
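To make the time-domain meaning concrete, one simulation step of this model can be sketched as follows (hypothetical helper, NumPy instead of Matlab), using y(k) = −A1 y(k−1) − . . . − Ana y(k−na) + B1 u(k−1) + . . . + Bnb u(k−nb) + e(k):

```python
import numpy as np

def mimo_arx_step(A_mats, B_mats, y_past, u_past, e_k):
    """One step of A(q^-1) y(k) = B(q^-1) u(k) + e(k).

    A_mats = [A1, ..., Ana], B_mats = [B1, ..., Bnb];
    y_past[i] = y(k-1-i), u_past[j] = u(k-1-j), most recent first.
    """
    y_k = e_k.copy()
    for i, Ai in enumerate(A_mats):      # -A1 y(k-1) - ... - Ana y(k-na)
        y_k -= Ai @ y_past[i]
    for j, Bj in enumerate(B_mats):      # +B1 u(k-1) + ... + Bnb u(k-nb)
        y_k += Bj @ u_past[j]
    return y_k
```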

Page 79: Part 5 - Prediction Error Methods

MIMO ARX: concrete example

Take na = 1, nb = 2, ny = 2, nu = 3. Then:

A(q−1)y(k) = B(q−1)u(k) + e(k)

A(q−1) = I + A1 q−1 = I + [a11_1 a12_1; a21_1 a22_1] q−1

B(q−1) = B1 q−1 + B2 q−2 = [b11_1 b12_1 b13_1; b21_1 b22_1 b23_1] q−1 + [b11_2 b12_2 b13_2; b21_2 b22_2 b23_2] q−2

(Here aij_1 denotes entry (i, j) of A1, and bij_m entry (i, j) of Bm; matrix rows are separated by semicolons.)

Page 80: Part 5 - Prediction Error Methods

MIMO ARX: concrete example (continued)

([1 0; 0 1] + [a11_1 a12_1; a21_1 a22_1] q−1) [y1(k); y2(k)]
= ([b11_1 b12_1 b13_1; b21_1 b22_1 b23_1] q−1 + [b11_2 b12_2 b13_2; b21_2 b22_2 b23_2] q−2) [u1(k); u2(k); u3(k)] + [e1(k); e2(k)]

Explicit relationship:

y1(k) + a11_1 y1(k−1) + a12_1 y2(k−1)
= b11_1 u1(k−1) + b12_1 u2(k−1) + b13_1 u3(k−1)
+ b11_2 u1(k−2) + b12_2 u2(k−2) + b13_2 u3(k−2) + e1(k)

y2(k) + a21_1 y1(k−1) + a22_1 y2(k−1)
= b21_1 u1(k−1) + b22_1 u2(k−1) + b23_1 u3(k−1)
+ b21_2 u1(k−2) + b22_2 u2(k−2) + b23_2 u3(k−2) + e2(k)

Page 81: Part 5 - Prediction Error Methods

General error and predictor

The derivation mirrors that in the SISO case:

y(k) = Gu(k) + He(k)
⇒ e(k) = ε(k) = H−1(y(k) − Gu(k))

ŷ(k) = y(k) − ε(k) = Gu(k) + (H − I)e(k)
= Gu(k) + (H − I)H−1(y(k) − Gu(k))
= Gu(k) + (I − H−1)(y(k) − Gu(k))
= Gu(k) + (I − H−1)y(k) − Gu(k) + H−1Gu(k)
= (I − H−1)y(k) + H−1Gu(k)
=: L1 y(k) + L2 u(k)

where technical conditions must be satisfied, e.g. to ensure that H−1 is meaningful.

Page 82: Part 5 - Prediction Error Methods

PEM criterion

Important difference between SISO and MIMO: the identification criterion. The SISO criterion, the MSE:

V(θ) = (1/N) Σ_{k=1}^{N} ε(k)²

is no longer appropriate, since ε(k) is an ny-vector.

Define R(θ) = (1/N) Σ_{k=1}^{N} ε(k)ε⊤(k), an ny × ny matrix. Two options for the new criterion:

V(θ) = trace(R(θ))
V(θ) = det(R(θ))

Assuming ε(k) is zero-mean, R(θ) is an estimate of its covariance.
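Both criteria are straightforward to compute from the stacked error samples; a NumPy sketch (hypothetical helper):

```python
import numpy as np

def mimo_criteria(eps):
    """Compute R(theta) = (1/N) sum_k eps(k) eps(k)^T and the two scalar criteria.

    eps: prediction errors, shape (N, ny), one row per time step.
    """
    N = eps.shape[0]
    R = eps.T @ eps / N            # ny x ny estimated error covariance
    return np.trace(R), np.linalg.det(R)
```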

Page 83: Part 5 - Prediction Error Methods


Table of contents

1 Identification of ARX models

2 Model structures

3 General prediction error methods

4 Implementation and extensions

Solving the optimization problem

Multiple inputs and outputs

MIMO ARX: Matlab example

Page 84: Part 5 - Prediction Error Methods


Experimental data

Consider a continuous stirred-tank reactor:

Image credit: mathworks.com

Input: coolant flow Q

Outputs:

Concentration CA of substance A in the mix
Temperature T of the mix

Page 85: Part 5 - Prediction Error Methods


Experimental data

Left: identification, Right: validation

Page 86: Part 5 - Prediction Error Methods

MIMO ARX in Matlab

A(q−1)y(k) = B(q−1)u(k) + e(k)

A(q−1) = [a11(q−1) a12(q−1) . . . a1ny(q−1); a21(q−1) a22(q−1) . . . a2ny(q−1); . . . ; any1(q−1) any2(q−1) . . . anyny(q−1)]

aij(q−1) = (1 if i = j, 0 otherwise) + aij_1 q−1 + . . . + aij_{naij} q−naij

B(q−1) = [b11(q−1) b12(q−1) . . . b1nu(q−1); b21(q−1) b22(q−1) . . . b2nu(q−1); . . . ; bny1(q−1) bny2(q−1) . . . bnynu(q−1)]

bij(q−1) = bij_1 q−nkij + . . . + bij_{nbij} q−(nkij+nbij−1)

Page 87: Part 5 - Prediction Error Methods

Identifying a MIMO ARX model

m = arx(id, [Na, Nb, Nk]);

Arguments:

1 Identification data.
2 Matrices with the orders of the polynomials in A and B, and the delays nk:

Na = [na11 . . . na1ny; . . . ; nany1 . . . nanyny]
Nb = [nb11 . . . nb1nu; . . . ; nbny1 . . . nbnynu]
Nk = [nk11 . . . nk1nu; . . . ; nkny1 . . . nknynu]

Page 88: Part 5 - Prediction Error Methods

MIMO ARX results

Take na = 2, nb = 2, and nk = 1 everywhere in the matrix elements:

Na = [2 2; 2 2]; Nb = [2; 2]; Nk = [1; 1];
m = arx(id, [Na Nb Nk]);
compare(m, val);

Page 89: Part 5 - Prediction Error Methods

Appendix: Nonlinear ARX

Appendix: Nonlinear ARX (for project)

Page 90: Part 5 - Prediction Error Methods

Nonlinear ARX structure

Recall standard ARX:

y(k) = −a1 y(k−1) − a2 y(k−2) − . . . − ana y(k−na)
+ b1 u(k−1) + b2 u(k−2) + . . . + bnb u(k−nb) + e(k)

Linear dependence on the delayed outputs y(k−1), . . . , y(k−na) and inputs u(k−1), . . . , u(k−nb).

Nonlinear ARX (NARX) generalizes this to any nonlinear dependence:

y(k) = g(y(k−1), y(k−2), . . . , y(k−na), u(k−1), u(k−2), . . . , u(k−nb); θ) + e(k)

Function g is parameterized by θ ∈ R^n, and these parameters can be tuned to fit the identification data and thereby model a particular system.

In our case, g will be a neural network.

Page 91: Part 5 - Prediction Error Methods

Neural-net NARX: training data

y(k) = g(y(k−1), y(k−2), . . . , y(k−na), u(k−1), u(k−2), . . . , u(k−nb); θ) + e(k)

y(k) = g(x(k); θ) + e(k)

with new notation for the network input:
x(k) = [y(k−1), . . . , y(k−na), u(k−1), . . . , u(k−nb)]⊤

Then, the training dataset consists of input-output pairs:

(x(1), y(1)), (x(2), y(2)), . . . , (x(N), y(N))

where N = the number of time steps in the data.

Note: y and u at negative and zero time can be taken 0, assuming the system starts from zero initial conditions.
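A sketch of building this dataset (hypothetical helper; past values before time 0 are zero-padded as noted above):

```python
import numpy as np

def narx_dataset(u, y, na, nb):
    """Build regressor rows x(k) = [y(k-1)..y(k-na), u(k-1)..u(k-nb)] for k = 0..N-1."""
    N = len(y)
    X = np.zeros((N, na + nb))
    for k in range(N):
        for i in range(1, na + 1):                     # delayed outputs
            X[k, i - 1] = y[k - i] if k - i >= 0 else 0.0
        for j in range(1, nb + 1):                     # delayed inputs
            X[k, na + j - 1] = u[k - j] if k - j >= 0 else 0.0
    return X, y
```

The rows of X together with the targets y are exactly the pairs (x(k), y(k)) fed to the network trainer.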

Page 92: Part 5 - Prediction Error Methods

Neural-net NARX: training criterion

ŷ(k) = g(x(k); θ)
ε(k) = y(k) − ŷ(k) = y(k) − g(x(k); θ)

The criterion is V(θ) = (1/N) Σ_{k=1}^{N} ε²(k), the MSE. Training should return a parameter vector θ minimizing V(θ).

So NARX identification is in fact:

A nonlinear prediction error method.
And also a nonlinear regression problem.

Page 93: Part 5 - Prediction Error Methods

Using the NARX model

One-step-ahead prediction: If the true output sequence is known, the network input x(k) is fully available:

x(k) = [y(k−1), . . . , y(k−na), u(k−1), . . . , u(k−nb)]⊤
ŷ(k) = g(x(k); θ)

Example: On day k − 1, predict the weather for day k.

Simulation: If the true outputs are unknown, use the previously predicted outputs to construct an approximation of x(k):

x̂(k) = [ŷ(k−1), . . . , ŷ(k−na), u(k−1), . . . , u(k−nb)]⊤
ŷ(k) = g(x̂(k); θ)

(Predicted outputs at negative and zero time can also be taken 0.)

Example: Simulation of an aircraft’s response to emergency pilot inputs, which may be dangerous to apply to the real system.
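The difference between the two modes is only where the past outputs in x(k) come from. A free-run simulation sketch (hypothetical helper; g is any fitted model, here just a callable):

```python
import numpy as np

def narx_simulate(g, u, na, nb):
    """Free-run simulation: feed previously *predicted* outputs back into x(k).

    Past predictions and inputs before time 0 are taken as 0.
    """
    N = len(u)
    y_hat = np.zeros(N)
    for k in range(N):
        x = np.array(
            [y_hat[k - i] if k - i >= 0 else 0.0 for i in range(1, na + 1)]
            + [u[k - j] if k - j >= 0 else 0.0 for j in range(1, nb + 1)]
        )
        y_hat[k] = g(x)
    return y_hat
```

For one-step-ahead prediction, the measured y(k−i) would simply replace the fed-back ŷ(k−i).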