
Time Series Analysis and System Identification

Course number 157109

Huibert Kwakernaak
Gjerrit Meinsma

Department of Applied Mathematics
University of Twente
Enschede, The Netherlands


Preface

These lecture notes were translated from the 1997 Dutch edition of the lecture notes for the course Tijdreeksenanalyse en Identificatietheorie. The Dutch lecture notes are a completely revised and expanded version of an earlier set of notes (Bagchi and Strijbos, 1988).

For the preparation of part of the Dutch notes the book by Kendall and Ord (1990) was used. This book served as the text for the course for several years. Also, while preparing the Dutch notes ample use was made of the seminal book by Ljung (1987).

The Dutch lecture notes first came out in 1994. Revisions appeared in 1995 and 1997. In the preparation of the English edition the opportunity was taken to introduce a number of minor revisions and improvements in the presentation. For the benefit of speakers of Dutch an English–Dutch glossary has been included.

Prerequisites

The prerequisite knowledge for the course covers important parts of

1. probability theory,

2. mathematical statistics,

3. mathematical system theory, and

4. the theory of stochastic processes,

all at the level of the undergraduate courses about these subjects taught at the Department of Applied Mathematics of the University of Twente.

Matlab

Time series analysis and identification theory are of great applied mathematical interest but also serve a very important practical purpose. There is a generous choice of application software for the numerous algorithms and procedures that have been developed for time series analysis and system identification. In the international systems and control community MATLAB is the standard environment for numerical work. For this reason MATLAB has been selected as the platform for the numerical illustrations and the computer laboratory exercises for this course. The numerical illustrations were developed under MATLAB version 4 but they work well under versions 5, 6 and 7.

Within MATLAB the System Identification Toolbox of Ljung (1991) supplies a number of important and highly useful numerical routines. At suitable places in the notes various routines from the System Identification Toolbox are introduced. A concise description of these routines may be found in one of the appendices.

Exercises and examination problems

The notes include a number of exercises. In addition, most of the problems that were assigned at the written examinations during the period 1993–1995 were added to the notes as exercises. Finally, the 1996 and 1997 examinations were translated and included as appendices together with full solutions.

Huibert Kwakernaak
February 1, 1998

Later revisions

In August 1999 errors were corrected by Rens Strijbos. For the print of 2001 more errors were corrected and several changes were made by Gjerrit Meinsma. The main changes concern the operator q, which now denotes the forward shift operator; the treatment of the FFT and recursive least squares; the use of matrix notation; and the addition of several proofs and exercises. Following the course in 2003 some more errors were corrected. These pertain mainly to the estimation theory of spectral densities. For the 2006 print a beginning was made to rearrange and update the second chapter. In 2008 some small typos were removed.


Contents

1 Introduction 1
1.1 Introduction 1
1.2 Examples 1

2 Stochastic processes 5
2.1 Basic notions 5
2.2 Moving average processes 9
2.3 Convolutions and the shift operator 10
2.4 Auto-regressive processes 11
2.5 ARMA processes 16
2.6 Spectral analysis 17
2.7 Trends and seasonal processes 22
2.8 Prediction of time series 24
2.9 Problems 25

3 Estimators 31
3.1 Introduction 31
3.2 Normally distributed processes 31
3.3 Foundations of time series analysis 33
3.4 Linear estimators 36
3.5 Problems 38

4 Non-parametric time series analysis 41
4.1 Introduction 41
4.2 Tests for stochasticity and trend 41
4.3 Classical time series analysis 41
4.4 Estimation of the mean 43
4.5 Estimation of the covariance function 45
4.6 Estimation of the spectral density 46
4.7 Continuous time processes 53
4.8 Problems 55

5 Estimation of ARMA models 59
5.1 Introduction 59
5.2 Parameter estimation of AR processes 59
5.3 Parameter estimation of MA processes 63
5.4 Parameter estimation of ARMA processes 64
5.5 Non-linear optimization 68
5.6 Order determination 69
5.7 Problems 72

6 System identification 75
6.1 Introduction 75
6.2 Non-parametric system identification 75
6.3 ARX models 79
6.4 ARMAX models 81
6.5 Identification of state models 85
6.6 Further problems in identification theory 87
6.7 Problems 88

A Proofs 91
B The Systems Identification Toolbox 97
C Glossary English–Dutch 99
D Bibliography 101

Index 103


1 Introduction

1.1 Introduction

1.1.1 Time series analysis

Time series analysis deals with the analysis of measured or observed data that evolve with time. Historically the subject has developed along different lines. Besides mathematical statistics, the engineering and physical sciences have also contributed. In the last decades a rewarding interrelationship has developed between the disciplines of time series analysis and system theory. In these lecture notes the system approach to time series analysis is emphasized.

Time series analysis is used as a tool in many scientific fields. A great deal of software is available. In these notes serious attention is devoted to the practical aspects of time series analysis.

Important subjects from time series analysis are:

1. Formulating mathematical models for time series.

2. Describing and analyzing time series.

3. Forecasting time series.

4. Estimating the properties of observed time series.

5. Estimating the parameters of models for observed time series.

The time series that are studied may be of very different natures. They have in common that their time behavior is irregular and only partly predictable. In § 1.2 (p. 1) several examples of time series are described.

1.1.2 System identification

Figure 1.1: System with input signal u and noise v.

Figure 1.1 shows a common paradigm¹ for system identification. A system is subject to an input signal u, which may be accurately observed and recorded. In some situations the input u may be freely chosen within certain rules. The output signal y may also be observed and recorded. The output y is not only a direct result of the input u but also has a component v that consists of "disturbances" and "measurement errors." For simplicity we refer to v as "noise." System identification aims at inferring the dynamical properties of the system and the statistical properties of the noise v from the observed signals u and y.

¹Paradigm: model situation.

System identification applies to many situations. As soon as control, prediction or forecasting problems need to be solved for systems whose dynamics are completely or partially unknown, system identification enters the stage. In § 1.2 (p. 1) an example of a system identification problem is described. The system identification problem has much in common with time series analysis, but naturally the presence of the input signal u introduces new aspects.

1.1.3 Organization of the notes

The notes are organized like this:

• Chapter 2 (p. 5) reviews some notions from the theory of stochastic processes.

• Chapter 3 (p. 31) presents a survey of statistical notions concerning estimators.

• Chapter 4 (p. 41) deals with non-parametric time series analysis. Non-parametric time series analysis mainly concerns the estimation of covariance functions and spectral density functions.

• Chapter 5 (p. 59) is devoted to parametric time series analysis, in particular the estimation of the parameters of ARMA models.

• Chapter 6 (p. 75) offers an introduction to non-parametric and parametric system identification.

1.2 Examples

1.2.1 Time series

We present several concrete examples of time series.

Example 1.2.1 (Monthly flow of the river Tiber in Rome during 1937–1962). The flow and other statistics of the Italian river Tiber with its different branches are recorded at five different locations. The measurements are monitored for unusual water heights, water speeds and the like. The data of this example originate from the Rome observation station. They represent monthly averages of the daily measurements. As a result there are twelve data points per year. The data concern the period from 1937 until 1962, with the exception of the years 1943 until 1946. All together 265 data points are available. Figure 1.2 shows the plot of the data. □


Figure 1.2: Monthly average of the flow of the river Tiber in Rome during 1937–1962 (average flow [m³/s] versus month).

Example 1.2.2 (Immigration into the USA). To a greater or lesser extent each country is faced with immigration. Immigration may lead to great problems. All countries carefully record how many immigrants enter.

In this example we consider the immigration into the USA during the period from 1820 until 1962. Figure 1.3 shows the plot of the available data. □

Figure 1.3: Annual immigration into the USA during 1820–1962 (annual immigration, units unknown, versus year).

Example 1.2.3 (Bituminous coal production of the USA). This example deals with the production of bituminous coal. Bitumen is a collective noun for brown or black mineral substances of a resinous nature and highly inflammable, which are known under different names. Naphtha is the most fluid, petroleum and mineral tar less so, and asphalt is solid.

Monthly records of the production of bituminous coal in the USA are available from January 1952 until December 1959. This amounts to 96 data points. These data might be used to forecast the production in the coming months to meet contractual obligations. The data are plotted in Figure 1.4. □

Figure 1.4: Monthly production of bituminous coal in the USA during 1952–1959 (production [tons] ×10⁴ versus month).

Example 1.2.4 (Freight shipping of the Dutch Railways). The Netherlands, like many other countries, has an extensive railway network. Many places are connected by rail. Many goods are shipped by rail. For the Dutch Railways it is important to know whether the demand for railway shipping keeps growing. For this and other reasons the amount of freight shipped is recorded quarterly. Data are available for the period 1965–1978, and are plotted in Fig. 1.5. □

Figure 1.5: Quarterly freight shipping of the Dutch Railways during 1965–1978 (freight [tons] versus quarter).

Example 1.2.5 (Mortgages in the USA during 1973–1978). Buying real estate normally requires more cash than is available. A bank needs to be found to negotiate a mortgage. Every bank naturally keeps accurate records of the amount of mortgages that are outstanding and of the new mortgages. The data we work with originate from the reports of large commercial US banks to the Federal Reserve System. The unit is billions of dollars! We have 70 monthly records from January 1973 until October 1978. Figure 1.6 shows the plot. □

Figure 1.6: Monthly new mortgages in the USA during 1973–1978 (amount [billion $] versus month).

Example 1.2.6 (Home construction in the USA during 1947–1967). Home construction activities are characteristic for the rest of the economy. The number of building permits that are issued generally decreases before an economic recession sets in. For this reason the level of activity in home construction is not only important for the building industry but also as an indicator for general economic development.

This example concerns the prediction of the index of new homes for which a building permit is issued. The data are quarterly during the period from 1947 until 1967. This involves 84 data points, which are plotted in Figure 1.7. □

Figure 1.7: Index of quarterly building permits in the USA during 1947–1967 (index versus quarter).

1.2.2 System identification

Finally we discuss an example of a system identification problem.

Example 1.2.7 (Identification of a laboratory process). Figure 1.8 shows an example of an input signal u and the corresponding output signal y. The recordings originate from an experimental laboratory process and have been taken from the MATLAB Identification Toolbox (Ljung, 1991). The process consists of a hair dryer, which blows hot air through a tube. The input is the electrical voltage across the heating coil. The output is the output voltage of a thermocouple that has been placed in the outgoing air flow and is a measure for the air temperature.

The input signal is a binary stochastic sequence. It is switched between two fixed values. The sampling interval is 0.8 [s]. The signal switches from one value to the other with a probability of 0.2. Such test signals are often used for system identification. They are easy to generate and have a rich frequency content; a sketch of how such a signal may be generated is given below.
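A minimal MATLAB sketch of generating such a binary test signal (not part of the original notes; the number of samples and the two levels u1 and u2 are arbitrary illustrative choices):

N = 250;            % number of samples (arbitrary choice)
u1 = 3.5; u2 = 6.5; % the two signal levels (arbitrary choice)
p = 0.2;            % switching probability per sampling instant
u = zeros(N,1);
u(1) = u1;
for t = 2:N
    if rand < p
        u(t) = u1 + u2 - u(t-1);  % switch to the other level
    else
        u(t) = u(t-1);            % keep the current level
    end
end
stairs(u);          % plot the binary test signal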

Inspection of the output signal shows that it is not free of noise. The noise is caused by turbulence of the air flow.

Figure 1.8: Measured input and output signals in a laboratory process (input u and output y versus sampling instant t).


2 Stochastic processes

The purpose of time series analysis is to obtain insight into the structure of the phenomenon that generates the time series. Generally the phenomenon is modeled as a stochastic process. The time series then is a realization of the stochastic process.

Depending on whether the observations of the phenomenon are recorded at discrete instants only (usually equidistantly spaced) or continuously, the time series is referred to as a discrete or a continuous time series. In these notes the discussion is mainly limited to discrete time series.

In § 2.1 (p. 5) various notions from the theory of stochastic processes are reviewed. The sections that follow deal with several stochastic processes with a specific structure.

Section 2.2 (p. 9) discusses moving average processes, § 2.3 (p. 10) treats convolutions, § 2.4 (p. 11) is about auto-regressive processes, and § 2.5 (p. 16) is on mixed auto-regressive/moving average processes. After this, in § 2.6 (p. 17) the spectral analysis of stochastic processes is summarized. In § 2.7 (p. 22) it is explained how trends and seasonal processes may be modeled. The chapter concludes in § 2.8 (p. 24) with a treatment of the prediction of time series.

2.1 Basic notions

In this section we summarize several notions from the theory of stochastic processes.

2.1.1 Time series as realizations of stochastic processes

The examples of time series that we presented in the previous chapter are discrete time series, i.e. they are defined at discrete time instances only. We will denote such time series by x_t, with t the time index. Two dominant features present in all these time series are:

• They are irregular: the time series x_t are noisy in that they appear not to be samples of some underlying smooth function.

• Despite the irregularities, the time series possess a definite temporal relatedness or memory: subsequent values x_t and x_{t+1} in a time series are not unrelated.

Irregularity is unavoidable, and temporal relatedness is a must without which, for instance, prediction would not be feasible. Therefore any sensible class of mathematical models for time series should be able to capture these two features. The predominant class of mathematical models with these features is that of the stochastic processes. To introduce stochastic processes we consider 9 fictitious daily temperature profiles x_t, t = 1, 2, ..., shown in Fig. 2.1(a–i). The two features of irregularity and temporal relatedness are clear in these 9 series.

Figure 2.1: Nine time series (a–i) and combined (j).

If we want to consider the 9 temperature profiles as originating from a single process then it makes sense to combine the plots into a single plot as shown in Fig. 2.1(j). Here at any time t we have a cloud of 9 points x_t. We now take the point of view that each of the 9 time series x_t, t = 1, 2, ..., is a realization of a family of stochastic variables X_t, t = 1, 2, .... A realization is a particular outcome, a particular observation of a family of stochastic variables. The stochastic variables X_t have a certain probability distribution associated to them,

F_{X_t}(x, t) := \Pr(X_t \le x).

Its derivative

f_{X_t}(x, t) := \frac{\partial}{\partial x} F_{X_t}(x, t)

is known as the (probability) density function or amplitude distribution, and intuitively we expect, at any given t, many samples x_t where the mass of the density function is high. Figure 2.2(a) tries to convey this point. It depicts the density functions f_{X_t}(x, t) together with the 9 realizations. Most of the samples x_t indeed are located near the peak of the density functions. The density function f_{X_t}(x, t) may depend on time, as is clearly the case here.

2.1.2 The mean value function

Knowledge of the probability distribution F_{X_t}(x, t) is sufficient to determine the mean value function m(t), or simply the mean. It is defined as

m(t) = E X_t = \int_{-\infty}^{\infty} x \, dF_{X_t}(x, t).    (2.1)

E denotes expectation. The mean value function m(t) indicates the average level of the process at time t. For reasons of exposition the mean is depicted by a solid graph in Fig. 2.2(b), even though it is only defined at the integers.


Figure 2.2: The combined time series with the density functions (a) and the mean (b) superimposed.

Figure 2.3: Confidence bounds m(t) ± σ(t) and m(t) ± 2σ(t).

2.1.3 Variance and standard deviation

The variance is defined as

var(X_t) := E[(X_t - m(t))^2] = \int_{-\infty}^{\infty} (x - m(t))^2 \, dF_{X_t}(x, t).

It characterizes the spread of the collection of realizations around its mean. The smaller the variance, the more condensed we expect the cloud of points to be. Clearly, the variance in Fig. 2.2 increases with time. The variance however has a dimension different from that of the samples: if x_t is in meters, for instance, then var(X_t) is in meters squared. Taking a square root resolves this issue. The standard deviation σ is the square root of the variance,

σ(t) = \sqrt{var(X_t)}.

It has the same dimension as x_t and therefore can be plotted together with x_t: Fig. 2.3 shows the realizations x_t together with two confidence bounds around the mean: m(t) ± σ(t) and m(t) ± 2σ(t).

2.1.4 The covariance and correlation function

Figure 2.4 once again combines the 9 time series in a single plot, but now the samples in each of the 9 time series are connected by solid lines.

Figure 2.4: Temporal relatedness.

The purpose is to demonstrate that there is a temporal relatedness or memory in these series. For instance the encircled value of x_2 in the plot appears to influence the subsequent values x_3, x_4, ..., and only slowly does its effect on the time series diminish and does the series return in the direction of its expected value m(t). This is very common in practice. For example, if the mean temperature today is supposed to be 20 degrees but today it happens to be 10 degrees, then tomorrow the temperature most likely is going to be around 10 degrees as well; in a week or so, however, the temperature may easily rise and exceed the claimed average of 20. That is, time series typically display memory, but the memory is limited.

Temporal relatedness is captured by the covariance function R(t, s) defined as

R(t, s) := cov(X_t, X_s) := E[(X_t - m(t))(X_s - m(s))],    (2.2)

with t and s two time indices. Knowledge of the probability distribution F_{X_t}(x, t) is insufficient to determine the covariance function. Indeed this distribution determines for each t the stochastic properties of X_t, but it says nothing about how one X_t is related to the next X_{t+1}. Sufficient is knowledge of the joint probability distribution

F_{X_t, X_s}(x_t, x_s; t, s) := \Pr(X_t \le x_t, X_s \le x_s)


and we have

R(t, s) = \int_{\mathbb{R}^2} (x_1 - m(t))(x_2 - m(s)) \, dF_{X_t, X_s}(x_1, x_2; t, s).

The covariance function may be normalized to the correlation function defined as

ρ(t, s) = \frac{R(t, s)}{\sqrt{R(t, t) R(s, s)}}.    (2.3)

It has the convenient property that

-1 \le ρ(t, s) \le 1.

Both covariance and correlation functions indicate to what extent the values of the process at the time instants t and s are statistically related. If R(t, s) = ρ(t, s) = 0 we say that X_t and X_s are uncorrelated. If ρ(t, s) = ±1 then X_t and X_s are maximally correlated, and it may be shown that that is the case iff X_t is a linear function of X_s (or the other way around), that is, iff X_t = a + b X_s or X_s = a + b X_t.¹ We will interpret the covariance and correlation function in detail for stationary processes, introduced shortly.

Lemma 2.1.1 (Properties of R and ρ).

1. Symmetry: R(t, s) = R(s, t) and ρ(t, s) = ρ(s, t) for all t, s.

2. Positivity: R(t, t) ≥ 0 and ρ(t, t) = 1 for all t.

3. Cauchy-Schwarz inequality: |R(t, s)| ≤ \sqrt{R(t, t) R(s, s)} and |ρ(t, s)| ≤ 1 for all t, s.

4. Nonnegative-definiteness of the covariance matrix: The covariance matrix, also called variance matrix, of a finite number n of stochastic variables X_{t_1}, X_{t_2}, ..., X_{t_n} is the n × n matrix R defined as

R = E \begin{bmatrix} X_{t_1} - m(t_1) \\ \vdots \\ X_{t_n} - m(t_n) \end{bmatrix} \begin{bmatrix} X_{t_1} - m(t_1) & \cdots & X_{t_n} - m(t_n) \end{bmatrix}.    (2.4)

The ij-th entry of R is R(t_i, t_j). The covariance matrix is a symmetric nonnegative definite n × n matrix, that is, for every vector v ∈ \mathbb{R}^n there holds that v^T R v ≥ 0. Likewise the correlation matrix P, which is the matrix with entries ρ(t_i, t_j), is symmetric and nonnegative definite.

Proof. Problem 2.3 (p. 26).

2.1.5 Definition of a stochastic process

More generally, a stochastic process is a family of stochastic variables X_t, t ∈ T. The set T is the time axis. In the discrete-time case T usually is the set of the natural numbers N or that of the integers Z, or subsets thereof. In the continuous-time case T often is the real line R or the set of nonnegative real numbers R_+.

¹This follows from the proof of the Cauchy-Schwarz inequality, see page 91.

The values of a stochastic process at n different time instants t_1, t_2, ..., t_n, all in T, have a joint probability distribution

F_{X_{t_1}, X_{t_2}, ..., X_{t_n}}(x_1, x_2, ..., x_n; t_1, t_2, ..., t_n) := \Pr(X_{t_1} \le x_1, X_{t_2} \le x_2, ..., X_{t_n} \le x_n),    (2.5)

with x_1, x_2, ..., x_n real numbers. The set of all these probability distributions, for all n ∈ N and, for fixed n, for all n-tuples t_1, t_2, ..., t_n, is called the probability law of the process. If this probability law is known then the stochastic structure of the process is almost completely determined². Usually it is too much, and also not necessary, to require that the complete probability law be known, and partial characterizations of the process such as mean and covariance are sufficient.

2.1.6 Stationarity and wide-sense stationarity

A process is called (strictly) stationary if the probability distributions (2.5) are invariant under time shifts, that is, if for every n

\Pr(X_{t_1+τ} \le x_1, X_{t_2+τ} \le x_2, ..., X_{t_n+τ} \le x_n) = \Pr(X_{t_1} \le x_1, X_{t_2} \le x_2, ..., X_{t_n} \le x_n)

for all τ ∈ Z. The statistical properties of a stationary process "do not change with time."

A direct consequence of stationarity is that for stationary stochastic processes the distribution F_{X_t}(x) does not depend on time. As a result also the mean value function

m(t) = \int_{-\infty}^{\infty} x \, dF_{X_t}(x)    (2.6)

does not depend on time and, hence, is a constant m. In addition we have for the covariance function

R(t+τ, s+τ) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x_1 - m)(x_2 - m) \, dF_{X_{t+τ}, X_{s+τ}}(x_1, x_2; t+τ, s+τ)
            = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x_1 - m)(x_2 - m) \, dF_{X_t, X_s}(x_1, x_2; t, s)
            = R(t, s).    (2.7)

This holds in particular for τ = -s, in which case we get R(t-s, 0) = R(t, s). The covariance function R(t, s) of a stationary process therefore only depends on the difference of its arguments t and s. For a stationary process we define

r(τ) = R(t+τ, t) = cov(X_{t+τ}, X_t),    (2.8)

²In the discrete-time case the probability law determines the structure completely. In the continuous-time case more is needed.


and call r, like R, the covariance function of the process. The correlation function is redefined as

ρ(τ) = \frac{r(τ)}{r(0)}.    (2.9)

Stochastic processes that are not necessarily stationary, but have the properties that the mean value function is constant and the covariance function only depends on the difference of its arguments, are called wide-sense stationary. The amplitude distribution of a wide-sense stationary process is not necessarily constant!

Lemma 2.1.2 (Covariance function of a wide-sense stationary process). Suppose that r is the covariance function and ρ the correlation function of a wide-sense stationary process. The following holds.

1. Positivity: r(0) ≥ 0, ρ(0) = 1.

2. Symmetry: r(-τ) = r(τ) and ρ(-τ) = ρ(τ) for all τ ∈ T.

3. Cauchy-Schwarz inequality: |r(τ)| ≤ r(0) and |ρ(τ)| ≤ 1 for all τ ∈ T.

4. Nonnegative-definiteness: The covariance matrix and correlation matrix

\begin{bmatrix} r(0) & r(τ_1) & \cdots & r(τ_{n-1}) \\ r(τ_1) & r(0) & \cdots & r(τ_{n-2}) \\ \vdots & \vdots & & \vdots \\ r(τ_{n-1}) & r(τ_{n-2}) & \cdots & r(0) \end{bmatrix},    (2.10)

\begin{bmatrix} ρ(0) & ρ(τ_1) & \cdots & ρ(τ_{n-1}) \\ ρ(τ_1) & ρ(0) & \cdots & ρ(τ_{n-2}) \\ \vdots & \vdots & & \vdots \\ ρ(τ_{n-1}) & ρ(τ_{n-2}) & \cdots & ρ(0) \end{bmatrix}    (2.11)

are nonnegative definite n × n matrices for any τ_i, i = 1, 2, ..., n-1, and all n ∈ N.

Proof. Problem 2.4 (p. 26).

2.1.7 Temporal relatedness

As claimed, the covariance function r(τ) of a wide-sense stationary process describes the temporal relatedness of the process. An example of a covariance function is

r(τ) = σ_X^2 a^{|τ|},  τ ∈ Z.    (2.12)

The number σ_X = \sqrt{r(0)} = \sqrt{var(X_t)} is the standard deviation of the process. The number a, with |a| < 1, determines how fast the function decreases to zero and, hence, how fast the temporal relatedness is lost. Figure 2.5(a) shows the plot of r for σ_X = 1 and a = 0.9. Figure 2.6(a) shows an example of a realization of the corresponding (stationary) process with mean value m = 0. In Example 2.4.3 (p. 12) we see how such realizations may be generated.

Figure 2.5: (a) Exponential covariance function. (b) Damped harmonic covariance function.

Figure 2.6: (a) Realization of a process with exponential covariance function. (b) Realization of a process with damped harmonic covariance function.


A different example of a covariance function is

r(τ) = σ_X^2 a^{|τ|} \left[ A \cos\!\left(\frac{2πτ}{T}\right) + B \sin\!\left(\frac{2π|τ|}{T}\right) \right],  τ ∈ Z,    (2.13)

with σ_X, a, T, A, and B constants. Again σ_X ≥ 0 is the standard deviation of the process. The number a, with |a| < 1, again determines how fast the function decreases to zero. The number T > 0 represents a periodic characteristic of the time series. Figure 2.5(b) displays the behavior of the plot of r for σ_X = 1, a = 0.9, T = 12, A = 1 and B = 0.18182. Figure 2.6(b) shows a possible realization of the process with mean value m = 0. The irregularly periodic behavior is unmistakable.
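The two covariance functions (2.12) and (2.13) are easily evaluated numerically. A minimal MATLAB sketch (not part of the original notes; the parameter values are the ones quoted above for Fig. 2.5):

tau = -50:50;
sigmaX = 1; a = 0.9;                  % parameters of (2.12)
r_exp = sigmaX^2 * a.^abs(tau);       % exponential covariance function (2.12)
T = 12; A = 1; B = 0.18182;           % additional parameters of (2.13)
r_harm = sigmaX^2 * a.^abs(tau) .* ...
    (A*cos(2*pi*tau/T) + B*sin(2*pi*abs(tau)/T));  % damped harmonic (2.13)
subplot(2,1,1); plot(tau, r_exp);     % compare with Fig. 2.5(a)
subplot(2,1,2); plot(tau, r_harm);    % compare with Fig. 2.5(b)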

2.1.8 White noise

The temporal relatedness is minimal in a wide-sense stationary discrete-time process for which

r(τ) = \begin{cases} σ^2 & \text{for } τ = 0, \\ 0 & \text{for } τ \neq 0, \end{cases} \qquad τ ∈ Z.    (2.14)

This process consists of a sequence of uncorrelated stochastic variables X_t, t ∈ Z. Such a process is sometimes said to be "purely random." In a physical or engineering context it is often called white noise. This name is explained in § 2.6.5 (p. 20). Usually the mean value of white noise is assumed to be 0 and the standard deviation σ is taken to be 1. Sometimes it is assumed that the stochastic variables are not only uncorrelated but even mutually independent. In this case the process is completely characterized by its amplitude distribution F_{X_t}(x) = \Pr(X_t \le x). Often it is assumed that this amplitude distribution is normal.

The white noise process plays an important role in the theory of stochastic processes and time series analysis. In a sense it is the most elementary stochastic process. We shall soon see that other stochastic processes often may be thought to originate from white noise, that is, they are generated by a "driving" white noise process.

Figure 2.7 shows an example of a realization of white noise with mean zero and standard deviation σ = 1. In a suitable computer environment realizations of white noise may easily be generated with a random number generator.

Figure 2.7: Realization of white noise with mean 0 and standard deviation 1.

x= rand(1,99); % 99 samples white, uniformly

% distributed over [0,1]

x=randn(1,99); % 99 samples white, normally

% distributed, mean 0, var 1

plot(x); % plot it

Figure 2.8: Realization of a moving average X_t = ε_t + ε_{t-1} + ··· + ε_{t-10} (top) of white noise ε_t (bottom).

2.2 Moving average processes

A moving average process of order k is a process X_t that may be described by the equation

X_t = b_0 ε_t + b_1 ε_{t-1} + ··· + b_k ε_{t-k},  t ≥ k.    (2.15)

The coefficients b_0, b_1, ..., b_k are real and ε_t, t ∈ Z, is white noise with mean μ and standard deviation σ. The value X_t of the process at time t is the weighted sum of the current value and the k immediately preceding values of the white noise process ε_t. This explains the name. The notation for this process is MA(k), where the acronym MA stands for "moving average." Figure 2.8 illustrates moving averaging.

Without loss of generality it may be assumed that b_0 is scaled so that b_0 = 1.

2.2.1 Mean value function and covariance function

By taking the expectation of both sides of (2.15) it follows for the mean value function m(t) = E X_t that

m(t) = (b_0 + b_1 + ··· + b_k) μ,  t = k, k+1, ....    (2.16)

Obviously m(t) = m is constant.

Define the centered processes X̃_t = X_t - m and ε̃_t = ε_t - μ. By subtracting (2.16) from (2.15) it follows that

X̃_t = b_0 ε̃_t + b_1 ε̃_{t-1} + ··· + b_k ε̃_{t-k},  t = k, k+1, ....    (2.17)

Squaring both sides of this equality yields

X̃_t^2 = \sum_{i=0}^{k} \sum_{j=0}^{k} b_i b_j ε̃_{t-i} ε̃_{t-j}.    (2.18)

Taking the expectation of both sides and using the fact that ε̃_t is white noise we find that

var(X_t) = var(X̃_t) = R(t, t) = σ^2 \sum_{i=0}^{k} b_i^2.    (2.19)


Clearly also the variance var(X_t) does not depend on t. By taking the expectation of both sides of

X̃_t X̃_s = \sum_{i=0}^{k} \sum_{j=0}^{k} b_i b_j ε̃_{t-i} ε̃_{s-j}    (2.20)

it follows for t ≥ s that

R(t, s) = E X̃_t X̃_s = \begin{cases} σ^2 \sum_{i=t-s}^{k} b_i b_{i-t+s} & \text{if } 0 \le t-s \le k, \\ 0 & \text{if } t-s > k. \end{cases}

Inspection shows that R(t, s) only depends on the difference t-s of its arguments. The process X_t hence is wide-sense stationary with covariance function

r(τ) = \begin{cases} σ^2 \sum_{i=|τ|}^{k} b_i b_{i-|τ|} & \text{for } |τ| \le k, \\ 0 & \text{for } |τ| > k, \end{cases} \qquad τ ∈ Z.    (2.21)

We see that the covariance is exactly zero for time shifts |τ| greater than k. This is because the process has "finite memory." This should be clear from Fig. 2.8: in the figure the two time windows for t and s do not overlap, that is, no sample of the white noise is in both of the windows, hence X_t and X_s are uncorrelated in Fig. 2.8.
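Formula (2.21) is easy to evaluate numerically for any given set of coefficients. A minimal MATLAB sketch (not part of the original notes; the coefficient values are arbitrary illustrative choices):

b = [1 -2 3.3];                 % MA(2) coefficients b0, b1, b2 (arbitrary)
sigma = 1;                      % standard deviation of the white noise
k = length(b) - 1;              % order of the MA scheme
tau = -(k+1):(k+1);             % lags, including one lag beyond the memory
r = zeros(size(tau));
for m = 1:length(tau)
    at = abs(tau(m));
    if at <= k                  % formula (2.21): zero for |tau| > k
        r(m) = sigma^2 * sum(b(at+1:k+1) .* b(1:k+1-at));
    end
end
stem(tau, r);                   % covariance is exactly zero for |tau| > k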

2.2.2 The running average process

An example of an MA(k) process is the running average process of order k+1, defined by

X_t = \frac{1}{k+1} \sum_{j=0}^{k} ε_{t-j},  t ≥ k.    (2.22)

The weights of this MA(k) process are b_i = 1/(k+1), i = 0, 1, ..., k. The variance of the process is

σ_X^2 = r(0) = var(X_t) = σ^2 \sum_{i=0}^{k} \frac{1}{(k+1)^2} = \frac{σ^2}{k+1}.

The covariance function of the running average process is

r(τ) = \begin{cases} σ^2 \sum_{i=|τ|}^{k} \frac{1}{(k+1)^2} & \text{for } |τ| \le k, \\ 0 & \text{for } |τ| > k, \end{cases}
     = \begin{cases} σ_X^2 \left(1 - \frac{|τ|}{k+1}\right) & \text{for } |τ| \le k, \\ 0 & \text{for } |τ| > k, \end{cases} \qquad τ ∈ Z.

Figure 2.9 shows the triangular shape of r. Figure 2.10 shows an example of a realization of the process for k = 9 and σ = 1. For the computation of the first 9 values of the process the white noise ε_t for t < 0 has somewhat arbitrarily been taken equal to 0.

Figure 2.9: Covariance function of the running average process of order 9.

Figure 2.10: Realization of the running average process of order 9.

ep=randn(1,200);      % some white noise
B=[1 -2 3.3];         % coefficients of the MA(2) scheme
x=filter(B,1,ep);     % x_t = eps_t - 2*eps_{t-1} + 3.3*eps_{t-2}
plot(x(3:end));       % only plot x_3, x_4, ..., x_200

2.3 Convolutions and the shift operator

That MA processes are wide-sense stationary is not unexpected once we realize that MA processes may be formulated without explicit reference to time: whatever t is, the value of x_t is a sum of preceding white noise samples. It is useful at this point to introduce an abbreviated notation for MA and other processes. Define the forward shift operator q by

(q X)_t = X_{t+1},  t ∈ Z.    (2.23)

Then (q^{-1} X)_t = X_{t-1} and we may write (2.15) as

X_t = (b_0 + b_1 q^{-1} + ··· + b_k q^{-k}) ε_t,

or simply X_t = N(q) ε_t, for N(q) := b_0 + b_1 q^{-1} + ··· + b_k q^{-k}. The operator N(q) that defines the MA process is a polynomial in q^{-1}, and it does not depend on time. Moving averages are convolutions. Indeed, the indices of every term in b_0 ε_t + b_1 ε_{t-1} + ··· + b_k ε_{t-k} add up to t, so an MA process is an example of a process of the form

X_t = \sum_{n=-\infty}^{\infty} h_n ε_{t-n}.    (2.24)


The right-hand side is known as the convolution of h_t and ε_t. Similar to what we showed for MA processes, we have for convolution systems that

E X_t = μ \sum_{n=-\infty}^{\infty} h_n    (2.25)

and

r(τ) = σ^2 \sum_{n=-\infty}^{\infty} h_n h_{n-τ}.    (2.26)

Stochastic processes with infinite memory are not MA(k) processes, as the latter have a finite memory of length k. Convolutions (2.24), however, are MA processes of infinite order and they can in principle exhibit infinite memory. The class of convolutions is in fact large enough to model effectively every wide-sense stationary process. This is an important result:

Lemma 2.3.1 (Innovations representation). Every zero mean wide-sense stationary process X_t with absolutely summable covariance function, \sum_{τ∈Z} |r(τ)| < ∞, is of the form (2.24) with ε_t some zero mean white noise process and h_t some square summable sequence, \sum_{t∈Z} |h_t|^2 < ∞.

Proof. Follows later from spectral properties, see Appendix A.

A system theoretic interpretation is that essentially every wide-sense stationary process can be seen as the output of a system driven by white noise.

2.4 Auto-regressive processes

The previous lemma appears to suggest that when considering wide-sense stationary processes we should try to model them as MA processes, possibly of infinite order (i.e. convolutions). Not necessarily so, as the following example shows.

Example 2.4.1 (Random shock and its simulation). Consider the infinite order MA process

X_t = ε_t + a ε_{t-1} + a^2 ε_{t-2} + a^3 ε_{t-3} + ···.

For practical implementation we would have to truncate this infinite order MA process to a finite one. If for instance a = 0.9 then an MA(50) scheme could be a good approximation; note that 0.9^{50} ≈ 0.0052. However, we can also express X_t recursively as

X_t = a X_{t-1} + ε_t.

This is a description of the process that requires only a single coefficient, and the values of X_t are easily generated recursively. This is an example of a first order auto-regressive process. □
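A minimal MATLAB sketch of this comparison (not part of the original notes): both descriptions are driven by the same white noise, the truncated MA(50) approximation is generated with filter, and the recursive form uses only the single coefficient a.

a = 0.9;
e = randn(300,1);                % driving white noise
h = a.^(0:50);                   % first 51 impulse response coefficients
xMA = filter(h, 1, e);           % truncated MA(50) approximation
xAR = filter(1, [1 -a], e);      % recursion X_t = a*X_{t-1} + e_t
plot([xMA xAR]);                 % the two realizations nearly coincide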

In this section we analyse auto-regressive schemes. An auto-regressive process of order n is a process X_t, t ∈ Z_+, that satisfies a difference equation of the form

X_t = a_1 X_{t-1} + a_2 X_{t-2} + ··· + a_n X_{t-n} + ε_t,  t ≥ n.    (2.27)

The constant coefficients a_1, a_2, ..., a_n are real numbers and ε_t, t ∈ Z, is stationary white noise with mean μ and variance σ^2. We assume that σ > 0 and a_n ≠ 0. The process is called auto-regressive because the value of the process at time t depends, besides on a purely random component, on the n immediate past values of the process itself. We denote such a process as an AR(n) process. Because of its memory, an AR process generally is less irregular than a white noise process.

2.4.1 The Markov scheme

We first analyze the simplest case

X_t = a X_{t-1} + ε_t,  t = 1, 2, ...,    (2.28)

with a real. In the statistical literature this AR(1) scheme is often called a Markov scheme. For a = 1 the process known as the random walk results.

By repeated substitution it follows that

X_t = ε_t + a ε_{t-1} + a^2 ε_{t-2} + ··· + a^{t-1} ε_1 + a^t X_0,    (2.29)

for t = 1, 2, .... By taking the expectation of both sides we find for m(t) = E X_t

m(t) = μ + aμ + a^2 μ + ··· + a^{t-1} μ + a^t m(0) = \begin{cases} μ \frac{1-a^t}{1-a} + a^t m(0) & \text{if } a \neq 1, \\ μ t + m(0) & \text{if } a = 1. \end{cases}    (2.30)

The mean value function is time-dependent, except if a = 0, or (a = 1, μ = 0), or, if a ≠ 1, if

m(0) = \frac{μ}{1-a}.    (2.31)

Then we have m(t) = m(0), t ≥ 0.

We could also have obtained these results by taking the expectation of both sides of (2.28). Then we obtain

m(t) = a m(t-1) + μ,  t = 1, 2, ....    (2.32)

Repeated substitution results in (2.30).

We consider the covariance function of the process produced by the Markov scheme. Define X̃_t = X_t - m(t) and ε̃_t = ε_t - μ. The process X̃_t has mean 0, and ε̃_t is white noise with mean 0 and variance σ^2. X̃_t and ε̃_t are the centered processes corresponding to X_t and ε_t.

We have R̃(t, s) = cov(X̃_t, X̃_s) = cov(X_t, X_s) = R(t, s), so that we may compute the covariance function of X_t as that of X̃_t. Furthermore it follows by subtracting (2.32) from (2.28) that

X̃_t = a X̃_{t-1} + ε̃_t,  t = 1, 2, ....    (2.33)


From

X̃_t = ε̃_t + a ε̃_{t-1} + a^2 ε̃_{t-2} + ··· + a^{t-1} ε̃_1 + a^t X̃_0,    (2.34)

for t = 1, 2, ..., we obtain

R(t, t) = var(X̃_t) = (1 + a^2 + ··· + a^{2t-2}) σ^2 + a^{2t} var(X_0) = \begin{cases} \frac{1-a^{2t}}{1-a^2} σ^2 + a^{2t} var(X_0) & \text{for } a^2 \neq 1, \\ t σ^2 + var(X_0) & \text{for } a^2 = 1. \end{cases}    (2.35)

Furthermore we have for τ ≥ 0

R(t+τ, t) = cov(X_{t+τ}, X_t) = a^τ \begin{cases} \frac{1-a^{2t}}{1-a^2} σ^2 + a^{2t} var(X_0) & \text{for } a^2 \neq 1, \\ t σ^2 + var(X_0) & \text{for } a^2 = 1. \end{cases}    (2.36)

Note that we assume X_0 to be independent of ε_t, t > 0.

Suppose that a^2 ≠ 1. Inspection of (2.35) reveals that R(t, t) depends on the time t, unless

var(X_0) = \frac{σ^2}{1-a^2}.    (2.37)

Because var(X_0) cannot be negative, (2.37) can only hold if |a| < 1. If (2.37) holds then

R(t, t) = \frac{σ^2}{1-a^2},  t ≥ 0.    (2.38)

It follows from (2.36) that if (2.37) holds then

R(t+τ, t) = \frac{σ^2}{1-a^2} a^τ,  τ ≥ 0.    (2.39)

Apparently, if |a| < 1 and the initial conditions (2.31) and (2.37) apply, then the AR(1) process is wide-sense stationary. With the symmetry property of Lemma 2.1.2 (p. 8) it follows from (2.39) that

r(τ) = cov(X_{t+τ}, X_t) = \frac{σ^2}{1-a^2} a^{|τ|},  τ ∈ Z.    (2.40)

Further contemplation of (2.30) and (2.36) reveals that if (2.37) does not hold but |a| < 1, then

m(t) → \frac{μ}{1-a}  and  R(t+τ, t) → \frac{σ^2}{1-a^2} a^{|τ|}  as t → ∞.

We say that the process is then asymptotically wide-sense stationary. More generally:

Definition 2.4.2 (Asymptotically wide-sense stationary AR(n) process). An AR(n) process is said to be asymptotically wide-sense stationary if the limits

m := \lim_{t→∞} m(t),  r(τ) := \lim_{t→∞} R(t+τ, t)  for all τ ∈ Z

exist and are unique (i.e., do not depend on the initial X_0, X_1, ..., X_{n-1}). □

Example 2.4.3 (Simulation of the AR(1) process). We now recognize how a realization of a process with covariance function

r(τ) = σ_X^2 a^{|τ|},    (2.41)

as shown in Fig. 2.5 on page 8, may be generated. We consider the Markov scheme

X_t = a X_{t-1} + ε_t,  t ≥ 1.    (2.42)

To obtain a required value of σ_X for given a we choose, in accordance with (2.37), the variance of the white noise ε_t as σ^2 = (1 - a^2) σ_X^2. If we let μ = 0 then X_t is centered. Next we need to implement the initial conditions

m(0) = E X_0 = \frac{μ}{1-a} = 0,  \qquad var(X_0) = \frac{σ^2}{1-a^2} = σ_X^2.

These conditions imply that X_0 should be randomly drawn according to a probability distribution with mean zero and variance σ_X^2. If X_0 is obtained this way then the rest of the realization is generated with the help of the Markov scheme (2.42). The successive values of ε_t are determined with a random number generator. □
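A minimal MATLAB sketch of this construction (not part of the original notes); the normal distribution used here for X_0 and ε_t is one admissible choice, and the parameter values are arbitrary:

a = 0.9; sigmaX = 1;                 % desired a and stationary standard deviation
sigma = sqrt(1 - a^2)*sigmaX;        % white noise standard deviation, see (2.37)
N = 200;
x = zeros(N,1);
x(1) = sigmaX*randn;                 % X_0 drawn with mean 0 and variance sigmaX^2
for t = 2:N
    x(t) = a*x(t-1) + sigma*randn;   % Markov scheme (2.42) with mu = 0
end
plot(x);                             % immediately wide-sense stationary realization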

2.4.2 Asymptotically wide-sense stationary AR processes

We have seen that under the condition |a| < 1 the Markov scheme produces an asymptotically wide-sense stationary process. A similar result holds for the general AR(n) process

X_t = a_1 X_{t-1} + a_2 X_{t-2} + ··· + a_n X_{t-n} + ε_t,  t ≥ n.    (2.43)

For given X_0, X_1, ..., X_{n-1} the solution for t ≥ n may be determined by successive substitution. We may also view the process X_t as the solution of the linear difference equation

X_t - a_1 X_{t-1} - a_2 X_{t-2} - ··· - a_n X_{t-n} = ε_t,  t ≥ n.    (2.44)

Using the forward shift operator q this becomes

(1 - a_1 q^{-1} - a_2 q^{-2} - ··· - a_n q^{-n}) X_t = ε_t,  t ≥ n,    (2.45)

where the operator in parentheses is denoted D(q), or

D(q) X_t = ε_t,  t ≥ n.    (2.46)

The solution X_t of this linear difference equation is the sum of a particular solution (for instance corresponding to the initial conditions X_0 = X_1 = ··· = X_{n-1} = 0) and a suitable solution of the homogeneous equation

D(q) X_t = 0.    (2.47)

We briefly discuss the solution of the homogeneous equation. Let λ_1 ∈ C be a root of the equation D(λ) = 0. Then

X_t = A λ_1^t,  t ∈ Z,  A ∈ R,    (2.48)


is a solution of (2.47). Indeed we have that

D(q) A λ_1^t = A λ_1^t - a_1 A λ_1^{t-1} - ··· - a_n A λ_1^{t-n} = A λ_1^t D(λ_1) = 0.

The function D(λ) is a polynomial in λ^{-1} of degree n and, hence, there are n zeros λ_i. We conclude that if λ_i, i = 1, 2, ..., n, are the n zeros of D, then every linear combination of the functions λ_i^t, t ∈ Z, is a solution of the homogeneous equation (2.47). More generally, every solution of the homogeneous equation (2.47) is a linear combination of the functions

t^k λ_i^t,  t ∈ Z,    (2.49)

with k = 0, 1, ..., m_i - 1 and i = 1, 2, ..., d. Here λ_1, λ_2, ..., λ_d are the mutually different zeros of D and the integer m_i is the multiplicity of the root λ_i.

If |λ_i| < 1 then the corresponding term (2.49) in the solution of the homogeneous equation (2.47) approaches zero as t → ∞. Conversely, if for one or several of the zeros we have |λ_i| ≥ 1, then the corresponding terms in the solution of the homogeneous equation do not vanish for t → ∞. The result is that in the latter case the solution of the AR equation (2.46) contains non-stochastic terms that do not vanish. The resulting process cannot be asymptotically wide-sense stationary. It may be proved that this condition is also sufficient.

Lemma 2.4.4 (Asymptotically wide-sense stationary AR(n) process). An AR(n) process D(q) X_t = ε_t is asymptotically wide-sense stationary iff all solutions λ ∈ C of D(λ) = 0 have magnitude strictly smaller than 1. □

From a system theoretical point of view this necessary and sufficient condition is equivalent to the requirement that the system described by the difference equation D(q) X_t = ε_t (with input signal ε_t and output signal X_t) be asymptotically stable. For brevity we simply say that the AR(n) scheme is stable.

We have seen that if the initial conditions of the AR(1) process are suitably chosen then the process is immediately wide-sense stationary. This also holds for the AR(n) scheme. In § 2.4.3 (p. 13) we see what these initial conditions are.

For the AR(1) process X_t = a X_{t-1} + ε_t we have D(q) = 1 - a q^{-1}. The polynomial D has a as its only zero. Necessary and sufficient for asymptotic wide-sense stationarity hence is that |a| < 1. This agrees with what we found in § 2.4.1 (p. 11).

2.4.3 The Yule-Walker equations

We consider the problem of how to compute the (asymptotic) covariance function of the AR(n) process

X_t = a_1 X_{t-1} + a_2 X_{t-2} + ··· + a_n X_{t-n} + ε_t,  t ≥ n.    (2.50)

For simplicity we assume that the process has been centered already, so that μ = E ε_t = 0 and m(t) = m(0) = E X_0 = 0.

To derive the covariance function r(τ) we assume for now that τ > 0. Then it follows by multiplying both sides of (2.50) by X_{t-τ},

X_t X_{t-τ} = a_1 X_{t-1} X_{t-τ} + a_2 X_{t-2} X_{t-τ} + ··· + a_n X_{t-n} X_{t-τ} + ε_t X_{t-τ},

and taking expectations that

r(τ) = a_1 r(τ-1) + a_2 r(τ-2) + ··· + a_n r(τ-n)  for all τ > 0.    (2.51)

Here we used the assumption that the X_t are computed forward in time (i.e. that X_t is seen as a function of past X_{t-1}, X_{t-2}, ... and past white noise), so that ε_t and X_{t-τ} are uncorrelated for τ > 0. Division of (2.51) by r(0) yields

ρ(τ) = a_1 ρ(τ-1) + a_2 ρ(τ-2) + ··· + a_n ρ(τ-n),  τ > 0,    (2.52)

with ρ the correlation function of the process. Interestingly, this equation is nothing but D(q) ρ(τ) = 0 (τ ≥ 1), that is, ρ(τ) is a solution of the homogeneous equation. As a result ρ(τ) and r(τ) converge to zero for any stable AR process.

For τ = 1, 2, ..., n the equation (2.52) forms, taking into account that ρ(0) = 1 and using the symmetry property ρ(-τ) = ρ(τ), a set of n linear equations for the first n correlation coefficients ρ(1), ρ(2), ..., ρ(n). These equations are known as the Yule-Walker equations and they are easy to solve. Once the first n correlation coefficients are known, the remaining coefficients ρ(τ) may be obtained for τ > n by recursive application of (2.52).

To determine the covariance function from the correlation function we need to compute the variance var(X_t) = R(t, t) = r(0) of the process. By squaring both sides of (2.50) and taking expectations we obtain the present variance R(t, t) as a sum of past covariances,

R(t, t) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j R(t-i, t-j) + σ^2.    (2.53)

If we suppose again that the process is wide-sense stationary then it follows that

r(0) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j r(j-i) + σ^2.    (2.54)

With the substitution r(j-i) = r(0) ρ(j-i) it follows that

r(0) = r(0) \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j ρ(j-i) + σ^2,    (2.55)

so that

r(0) = \frac{σ^2}{1 - \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j ρ(j-i)}.    (2.56)


We now recognize how the initial conditions X_0, X_1, ..., X_{n-1} of the AR(n) scheme need to be chosen so that the process is immediately wide-sense stationary. The stochastic variables X_0, X_1, ..., X_{n-1} need to be randomly drawn according to an n-dimensional joint probability distribution with all n means 0 and covariances cov(X_i, X_j) = r(i-j), with i and j in {0, 1, ..., n-1}.

For the Markov scheme X_t = a X_{t-1} + ε_t the Yule-Walker equations (2.52) reduce to

ρ(k) = a ρ(k-1),  k ≥ 1.    (2.57)

Since ρ(0) = 1 it immediately follows that ρ(k) = a^{|k|}, k ∈ Z. From (2.56) it is seen that r(0) = σ^2/(1-a^2). These results agree with those of § 2.4.1 (p. 11).
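Numerically, the Yule-Walker equations together with (2.56) give the correlation and covariance function of a stable AR(n) scheme directly. A minimal MATLAB sketch (not part of the original notes; as an example it uses the AR(2) coefficients a_1 = 2a cos(2π/T), a_2 = -a^2 with a = 0.9, T = 12 that are simulated in § 2.4.5 below, and the vector rho stores ρ(0), ρ(1), ... starting at index 1):

acoef = [2*0.9*cos(2*pi/12), -0.9^2];   % AR(2) coefficients a1, a2
sigma = 1;                              % white noise standard deviation
n = numel(acoef);
% Build the Yule-Walker equations (2.52) for tau = 1,...,n using
% rho(0) = 1 and rho(-m) = rho(m):
M = eye(n); c = zeros(n,1);
for k = 1:n
    for i = 1:n
        m = abs(k-i);
        if m == 0
            c(k) = c(k) + acoef(i);       % term a_i*rho(0) goes to the right-hand side
        else
            M(k,m) = M(k,m) - acoef(i);   % term a_i*rho(|k-i|) with unknown rho(m)
        end
    end
end
rho = [1; M\c];                           % rho(0), rho(1), ..., rho(n)
% Variance r(0) from (2.56):
S = 0;
for i = 1:n
    for j = 1:n
        S = S + acoef(i)*acoef(j)*rho(abs(j-i)+1);
    end
end
r0 = sigma^2/(1 - S);
r = r0*rho                                % covariance function r(0), ..., r(n)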

2.4.4 Partial correlations

The result that is described in this subsection plays a role in estimating AR schemes, as explained in § 5.2 (p. 59) and § 5.6 (p. 69).

In the AR(n) scheme the value X_t of the process at time t is a linear regression on the n preceding values X_{t-1}, X_{t-2}, ..., X_{t-n}, with regression coefficients a_1, a_2, ..., a_n.

Conversely we could ask the question whether a given wide-sense stationary process X_t (not necessarily generated by an AR scheme) may be "explained" by an AR(n) scheme. To this end we could determine the coefficients a_1, a_2, ..., a_n that minimize the mean square error

E\left( X_t - \sum_{i=1}^{n} a_i X_{t-i} \right)^2.    (2.58)

Define a_0 = -1. Then we have

E\left( X_t - \sum_{i=1}^{n} a_i X_{t-i} \right)^2 = \sum_{i=0}^{n} \sum_{j=0}^{n} a_i a_j r(i-j),    (2.59)

with r the covariance function of the process. We minimize this quadratic function with respect to the coefficients a_1, a_2, ..., a_n. Partial differentiation with respect to a_k, k = 1, 2, ..., n, yields the necessary conditions

\sum_{i=0}^{n} a_i r(i-k) = 0,  k = 1, 2, ..., n.    (2.60)

Because a_0 = -1 this is equivalent to

r(k) = \sum_{i=1}^{n} a_i r(i-k),  k = 1, 2, ..., n.    (2.61)

Division by r(0) yields

ρ(k) = \sum_{i=1}^{n} a_i ρ(i-k),  k = 1, 2, ..., n,    (2.62)

with ρ the correlation function. These are precisely the Yule-Walker equations, except that now the correlation coefficients are given, and we need to solve for the regression coefficients a_1, a_2, ..., a_n. The Yule-Walker equations have an important symmetry property that is clear from the matrix representation of (2.62),

\begin{bmatrix} ρ(0) & ρ(1) & \cdots & ρ(n-1) \\ ρ(1) & ρ(0) & \cdots & ρ(n-2) \\ \vdots & \vdots & & \vdots \\ ρ(n-1) & ρ(n-2) & \cdots & ρ(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = \begin{bmatrix} ρ(1) \\ ρ(2) \\ \vdots \\ ρ(n) \end{bmatrix}.    (2.63)

The matrix on the left is symmetric and we recognize it as the correlation matrix of Lemma 2.1.2. As a result the matrix is nonnegative definite and under weak assumptions it is invertible, which guarantees that the a_j exist and are uniquely determined by the ρ(k) (see Problem 2.8, p. 26).

If a priori we do not know what the correct order n of the regression scheme is then we could successively solve the set of equations (2.62) for n = 1, 2, 3, .... If the process really satisfies an AR scheme then starting with a certain value of n, say N+1, the regression coefficients a_k, k ≥ N+1, will be identical to zero.

To make the dependence on n explicit we rewrite (2.62) as

ρ(k) = \sum_{i=1}^{n} a_{ni} ρ(i-k),  k = 1, 2, ..., n.    (2.64)

Here the a_{ni}, i = 1, 2, ..., n, denote the regression coefficients that correspond to a regression scheme of order n.

A well-known result from regression analysis is that rather than solving (2.64) for a number of successive values of n, the coefficients a_{ni} may be computed recursively from the equations

a_{n+1,n+1} = \frac{ρ(n+1) - \sum_{j=1}^{n} a_{nj} ρ(n+1-j)}{1 - \sum_{j=1}^{n} a_{nj} ρ(j)},    (2.65)

a_{n+1,j} = a_{nj} - a_{n+1,n+1} a_{n,n+1-j},  j = 1, 2, ..., n.    (2.66)

These equations are known as the Levinson-Durbin algorithm (Durbin, 1960) and they can be derived from the special structure of Eqn. (2.63), see Proof A.1.3 (p. 91). For n given, first (2.65) is used to compute the value of a_{n+1,n+1} from the values of the coefficients that were found earlier. Next, (2.66) yields a_{n+1,1}, a_{n+1,2}, ..., a_{n+1,n}. The recursion equations hold from n = 0, with the convention that summations whose upper limit is less than their lower limit are empty (zero).

The coefficients a_{11}, a_{22}, a_{33}, ... are called the partial correlation coefficients. The value N of n above which the partial correlation coefficients are zero (or, in practice, very small) determines the order of the regression scheme.
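A minimal MATLAB sketch of the Levinson-Durbin recursion (2.65)-(2.66), not part of the original notes. It assumes the correlation coefficients ρ(0), ..., ρ(N) are stored in a vector rho with rho(1) = ρ(0) = 1; the numerical values used here are arbitrary:

rho = [1 0.8 0.5 0.2 0.0];        % example correlation coefficients (arbitrary)
N = numel(rho) - 1;               % highest lag available
a = [];                           % a(j) holds a_{n,j} for the current order n
partial = zeros(1,N);             % partial correlation coefficients a_{nn}
for n = 0:N-1
    num = rho(n+2);               % rho(n+1)
    den = 1;
    for j = 1:n
        num = num - a(j)*rho(n+2-j);   % subtract a_{nj}*rho(n+1-j)
        den = den - a(j)*rho(j+1);     % subtract a_{nj}*rho(j)
    end
    k = num/den;                  % a_{n+1,n+1}, equation (2.65)
    anew = zeros(1,n+1);
    for j = 1:n
        anew(j) = a(j) - k*a(n+1-j);   % equation (2.66)
    end
    anew(n+1) = k;
    a = anew;
    partial(n+1) = k;             % partial correlation coefficient of order n+1
end
partial                           % small values suggest the AR order has been reached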

2.4.5 Simulation with Matlab


In Exercise 2.7 (p. 26) the AR(2) process

X_t = a_1 X_{t-1} + a_2 X_{t-2} + ε_t    (2.67)

will be discussed. The parameters are chosen as a_1 = 2a cos(2π/T) and a_2 = -a^2, with a = 0.9 and T = 12. The process has a damped harmonic covariance function. In Fig. 2.6(b) on p. 8 a realization of the process is shown. This realization has been generated with the help of the following sequence of MATLAB commands:

a = 0.9; T = 12;         % Define the parameters
a1 = 2*a*cos(2*pi/T);
a2 = -a*a;
% Choose the standard deviation sigma of
% the white noise so that the stationary
% var of the process is 1 (see Problem 2.7):
sigma = sqrt((1-a1-a2)*(1+a1-a2)*(1+a2)/(1-a2));
D = [1 -a1 -a2];         % Define the system
N = 1;
randn('seed',0);         % Generate white noise
e = sigma*randn(200,1);
th = idpoly(D,[],N);     % Generate
x = idsim(e,th);         % and
plot(x);                 % plot realization of X_t

The MATLAB-functions idsim and poly2th belong tothe Systems Identification Toolbox. The function idsimis used to simulate a class of linear systems. The struc-ture of the system is defined by the “theta parameter”th. The function poly2th generates the theta parameterfrom the various polynomials that determine the systemstructure. For AR schemes only the polynomial D(q ) =

1− a 1q−1 − · · · − a nq−n is relevant. The coefficients of D

are specified as a row vector according to ascending pow-ers of q−1. The first coefficient of D needs to be 1.

The various MATLAB commands are described in Ap-pendix B.

2.4.6 Converting AR to MA and back

By successive substitution the AR(n) scheme

X_t = a_1 X_{t−1} + a_2 X_{t−2} + ··· + a_n X_{t−n} + ε_t   (2.68)

may be rewritten as

X_t = ε_t + h_1 ε_{t−1} + h_2 ε_{t−2} + ··· .   (2.69)

If the scheme is asymptotically wide-sense stationary, that is, if the system D(q)X_t = ε_t is asymptotically stable, then the coefficients h_1, h_2, h_3, ... converge exponentially to zero. We show this shortly. Hence, the AR scheme may be viewed as an MA(∞) scheme with infinite memory. Every stable AR scheme may be approximated by an MA scheme with sufficiently long memory.

Take by way of example the stable AR(1) scheme

X_t = a X_{t−1} + ε_t,   t ≥ 1,   (2.70)

with |a| < 1. By repeated substitution it follows that

X_t = a(a X_{t−2} + ε_{t−1}) + ε_t = ··· = ε_t + a ε_{t−1} + a^2 ε_{t−2} + a^3 ε_{t−3} + ··· .

This is the random shock model as considered in Example 2.4.1.

Conversely, under certain conditions an MA(k) scheme may be converted into an AR(∞) scheme. Assume that b_0 = 1. We hence consider the MA(k) scheme

X_t = ε_t + b_1 ε_{t−1} + ··· + b_k ε_{t−k}.   (2.71)

This may be rewritten as

ε_t = X_t − b_1 ε_{t−1} − b_2 ε_{t−2} − ··· − b_k ε_{t−k}.   (2.72)

By repeated substitution this leads to an AR(∞) scheme of the form

ε_t = X_t + g_1 X_{t−1} + g_2 X_{t−2} + ··· .   (2.73)

We will show that the condition for convergence of the coefficients g_1, g_2, ... to zero is that all zeros of the polynomial N(λ) = 1 + b_1 λ^{-1} + b_2 λ^{-2} + ··· + b_k λ^{-k} have modulus strictly smaller than 1. If this condition is satisfied then the MA scheme X_t = N(q)ε_t is said to be invertible.

Example 2.4.5 (Inversion of an MA(1) process). Consider the MA(1) process X_t = N(q)ε_t, with

N(q) = 1 + b q^{-1}.   (2.74)

To find the equivalent AR(∞) description we try to eliminate ε_{t−k} by repeated substitution,

X_t − b ε_{t−1} = ε_t,
X_t − b (X_{t−1} − b ε_{t−2}) = ε_t,        (the term in parentheses is ε_{t−1})
X_t − b X_{t−1} + b^2 (X_{t−2} − b ε_{t−3}) = ε_t.   (the term in parentheses is ε_{t−2})

This results in the AR(∞) process D(q)X_t = ε_t where

D(q) = 1 − b q^{-1} + b^2 q^{-2} − b^3 q^{-3} + ··· .

The example may suggest that D(λ) may be obtained by long division of 1/N(λ) in the negative powers of λ. Indeed that is the case for arbitrary invertible N(q), see Problem 2.30 (p. 28) and Problem 2.12 (p. 26). If we are after the coefficients of D(q) then we may compute them by repeated substitution or by long division, whichever happens to be more convenient. The connection allows us to prove what we alluded to before.
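In MATLAB the coefficients of such expansions may be obtained as the impulse response of the corresponding rational scheme, for instance with the standard function filter. The following minimal sketch illustrates this for the MA(1) example above; the values of b and of the truncation length M are chosen for illustration only.

b = 0.5; M = 10;                   % illustrative values
N = [1 b];                         % N(q) = 1 + b q^{-1}
d = filter(1, N, [1 zeros(1,M)]);  % first M+1 coefficients of D(q) = 1/N(q)
% d = [1 -b b^2 -b^3 ...], in agreement with the expansion found above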


Lemma 2.4.6 (Invertible MA processes). The coefficients of D(q) = 1/N(q) converge to zero if and only if all zeros λ_i of N(λ) have modulus smaller than 1.

Proof. Let λ_i denote the zeros of N(λ). Then N(λ) may be factored as N(λ) = ∏_{i=1}^{m} (1 − λ_i λ^{-1}). For simplicity assume that all zeros λ_i have multiplicity 1, so that 1/N(λ) has a partial fraction expansion of the form

1/N(λ) = ∑_{i=1}^{m} c_i · 1/(1 − λ_i λ^{-1}),   c_i ∈ C.

For large enough λ all fractions |λ_i λ^{-1}| are less than 1, in which case we may replace 1/(1 − λ_i λ^{-1}) by its geometric series. This gives

1/N(λ) = ∑_{i=1}^{m} c_i (1 + λ_i λ^{-1} + (λ_i λ^{-1})^2 + ···) = ∑_{j=0}^{∞} ( ∑_{i=1}^{m} c_i λ_i^j ) λ^{-j}.

The coefficients ∑_{i=1}^{m} c_i λ_i^j of λ^{-j} converge to zero as j → ∞ if and only if |λ_i| < 1 for all i.

The case that some zeros λ_i have higher multiplicity may be proved similarly, but the proof is more technical.

The above lemma is a statement about convergence of coefficients. It also implies a type of convergence of the process itself, see Problem 2.15 (p. 27).

In conclusion we point out a further symmetry that AR and MA schemes possess. Stable AR processes have the property that the partial correlation coefficients eventually become exactly zero, while the regular correlation function decreases exponentially. Invertible MA processes have the property that the regular correlation function eventually becomes exactly zero, while the partial correlation coefficients decrease exponentially.

Example 2.4.7 (Matlab simulation). The following sequence of MATLAB commands was used to produce the time series of Fig. 2.10.

% Simulation of an MA(10) process
% Choose sigma such that the stationary
% variance of the process is 1
sigma = 1/sqrt(11);
% Define the system structure
D = 1;
N = ones(1,11);
th = poly2th(D,[],N);
% Generate the white noise
randn('seed',0);
e = sigma*randn(256,1);
% Generate and plot realization of X_t
x = idsim(e,th);
plot(x);

For an explanation see § 2.4.5 (p. 15). �

2.5 ARMA processes

The AR and MA processes that are discussed in the preceding section may be combined into so-called ARMA processes in the following manner. The process X_t, t ∈ T, is a mixed auto-regressive/moving average process of order (n, k) if it satisfies the difference equation

X_t = a_1 X_{t−1} + a_2 X_{t−2} + ··· + a_n X_{t−n} + b_0 ε_t + b_1 ε_{t−1} + ··· + b_k ε_{t−k},   (2.75)

for t ≥ n. The coefficients a_1, a_2, ..., a_n and b_0, b_1, ..., b_k are real numbers, and ε_t, t ∈ Z, is white noise with mean µ and standard deviation σ. We denote this process as an ARMA(n, k) process. Without loss of generality it may be assumed that b_0 = 1.

We may compactly represent (2.75) as

D(q)X_t = N(q)ε_t,   (2.76)

with D and N the polynomials in q^{-1},

D(q) = 1 − a_1 q^{-1} − a_2 q^{-2} − ··· − a_n q^{-n},
N(q) = b_0 + b_1 q^{-1} + b_2 q^{-2} + ··· + b_k q^{-k}.

ARMA processes include AR and MA processes as special cases. They form a flexible class of models that describe many practical phenomena with adequate accuracy.
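A realization of an ARMA scheme may also be generated without the System Identification Toolbox, because (2.76) is just a difference equation. The following is a minimal sketch with illustrative coefficients (they are not taken from any example in these notes).

% Simulate the ARMA(2,1) scheme (1 - a1 q^{-1} - a2 q^{-2}) X_t = (1 + b1 q^{-1}) eps_t
a1 = 1.2; a2 = -0.5; b1 = 0.4; sigma = 1;   % illustrative values (stable scheme)
D = [1 -a1 -a2];
N = [1 b1];
e = sigma*randn(500,1);                     % white noise
x = filter(N, D, e);                        % realization of the ARMA process
plot(x);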

It is not difficult to see that if the AR process D(q)X_t = ε_t is asymptotically wide-sense stationary then so is the ARMA process D(q)X_t = N(q)ε_t. Necessary and sufficient for wide-sense stationarity is that all the zeros of the polynomial D have modulus strictly smaller than 1. System theoretically this means that the ARMA scheme is stable.

If the ARMA scheme is stable then the ARMA model is equivalent to an MA(∞) scheme that is obtained by successive substitution. Conversely, if all the zeros of N(λ) have modulus strictly smaller than 1 then the ARMA model is, by successive substitution, equivalent to an AR(∞) scheme. We then say, as for the MA scheme, that the ARMA scheme is invertible.

For ARMA processes generally neither the correlations nor the partial correlations become exactly zero with increasing time shift.

Direct computation of the covariance function from the coefficients of the ARMA scheme is not simple. When we discuss the spectral analysis of wide-sense stationary processes in § 2.6.5 (p. 20) and § 2.6.6 (p. 20) we shall see how the covariance function may be found.


2.6 Spectral analysis

In many applications, for example radar, sonar and seismology, the important properties of the time series are their frequency properties. As an example, suppose we want to know the speed of a fast moving object. To do that we may transmit a sinusoidal signal cos(ω_0 t) towards this object and estimate the speed of the object on the basis of the reflected signal (the time series). This reflected signal is also sinusoidal but due to the Doppler effect has the different frequency ω_0(1 − 2v/c), where v is the speed of the object and c the known speed of wave propagation. Estimation of the speed v now boils down to estimation of the frequency ω_0(1 − 2v/c) of the reflected signal. Of course measurements of this signal are noisy and it may hence be advantageous to model the signal as a realization of a stochastic process.

In this section we summarize a number of important results concerning the spectral analysis of wide-sense stationary processes. We emphasize the connection with system theory.

In § 2.5 (p. 16) it is shown that by successive substitution a stable ARMA scheme may be represented by an MA(∞) scheme of the form

X_t = h_0 ε_t + h_1 ε_{t−1} + h_2 ε_{t−2} + ··· = ∑_{m=0}^{∞} h_m ε_{t−m},   t ∈ Z.

We recognize this as a convolution sum. We also recognize that, system theoretically, the process X_t is the output signal of a linear time-invariant system with the white noise ε_t as input signal. Figure 2.11(a) shows the corresponding block diagram. In Fig. 2.11(b) we view the ARMA scheme as a special case of a convolution system with input signal u_t and output signal y_t. We may write

y_t = ∑_{m=−∞}^{∞} h_m u_{t−m},   t ∈ Z,   (2.77)

where we set h_m = 0 for m < 0. The function h_m, m ∈ Z, is nothing else but the impulse response of the system. If the input signal is the "impulse"

u_t = 1 for t = 0,  0 for t ≠ 0,   t ∈ Z,   (2.78)

then the corresponding output signal is

y_t = h_t,   t ∈ Z.   (2.79)

2.6.1 Frequency response

We consider the convolution system (2.77). Suppose that the input signal is the complex-harmonic signal

u_t = e^{iωt},   t ∈ Z,   (2.80)

Figure 2.11: (a) Block diagram of the ARMA system. (b) Convolution system with impulse response h.

with i = √−1 and the real number ω the angular frequency. Then the corresponding output signal is

y_t = ∑_{m=−∞}^{∞} h_m e^{iω(t−m)} = ( ∑_{m=−∞}^{∞} h_m e^{−iωm} ) e^{iωt} = h(ω) e^{iωt},   t ∈ Z.

Hence the output signal also is complex-harmonic³ with angular frequency ω and with "complex amplitude" h(ω). The function

h(ω) = ∑_{m=−∞}^{∞} h_m e^{−iωm}   (2.81)

is called the frequency response function of the system. The expression (2.81) defines the function h as the discrete Fourier transform (abbreviated to DFT) of the impulse response h. Because there also exists another discrete Fourier transform (see § 4.6.5, p. 51) the name DCFT, discrete-to-continuous Fourier transform, is more informative (Kwakernaak and Sivan, 1991). Sufficient for existence of the DCFT is that ∑_{m=−∞}^{∞} |h_m| exists.

Lemma 2.6.1 (Properties of the DCFT). Let h be the DCFT of h_m, m ∈ Z. Then:

1. Periodicity: h(ω + 2kπ) = h(ω) for every integer k.

2. Conjugate symmetry: If h_m is a real sequence then h is conjugately symmetric, that is, h(−ω) is the complex conjugate of h(ω) for all ω ∈ R.

Proof. Problem 2.17 (p. 27).

Because of the periodicity property it is sufficient to consider the DCFT on an arbitrary interval of length 2π on the frequency axis. Because of the symmetry property we choose for this the interval [−π, π).

By multiplying both sides of (2.81) by e^{iωt} and next integrating both sides over [−π, π) with respect to ω it easily follows that

h_t = (1/2π) ∫_{−π}^{π} h(ω) e^{iωt} dω,   t ∈ Z.   (2.82)

³All harmonics e^{iωt} are therefore eigenfunctions of the convolution system, and the h(ω) are the corresponding eigenvalues.


This formula serves to reconstruct the original time function h from the DCFT h. The formula (2.82) is called the inverse DCFT.

The frequency response function h characterizes the response of the convolution system (2.77) to many more input signals than complex-harmonic signals alone. Suppose that the input signal u_t possesses the DCFT u(ω). Then with the help of the inverse DCFT the input signal u may be represented as

u_t = (1/2π) ∫_{−π}^{π} u(ω) e^{iωt} dω,   t ∈ Z.   (2.83)

This expression shows that u_t may be considered as a "linear combination" of uncountably many complex-harmonic functions e^{iωt} with frequencies ω ∈ [−π, π). The infinitesimal coefficient of the harmonic component e^{iωt} in the expansion (2.83) is u(ω) dω.

Substitution of the expansion (2.83) into the convolution sum (2.77) yields

y_t = ∑_{m=−∞}^{∞} h_m u_{t−m}
    = ∑_{m=−∞}^{∞} h_m (1/2π) ∫_{−π}^{π} u(ω) e^{iω(t−m)} dω
    = (1/2π) ∫_{−π}^{π} ( ∑_{m=−∞}^{∞} h_m e^{−iωm} ) u(ω) e^{iωt} dω,

in which we recognize the factor in parentheses as h(ω), so that

y_t = (1/2π) ∫_{−π}^{π} h(ω) u(ω) e^{iωt} dω,   t ∈ Z.   (2.84)

The expression (2.84) is precisely the inverse DCFT of the output signal y. Clearly we have for the DCFT y of y

y(ω) = h(ω) u(ω),   ω ∈ [−π, π).   (2.85)

The commutative diagram of Fig. 2.12 illustrates the rela-tions between the input signal u , the output signal y andtheir DCFTs u and y .

2.6.2 Frequency response function of the ARMA scheme

Consider as a concrete system the ARMA scheme. If we denote the input signal as u and the corresponding output signal as y then we have

D(q) y_t = N(q) u_t,   t ∈ Z.   (2.86)

We assume that the scheme is stable, that is, that all zeros λ_i of D(λ) have modulus strictly smaller than 1.

If the input signal is complex-harmonic with u_t = e^{iωt} then it follows from § 2.6.1 (p. 17) that the output signal is

Figure 2.12: Commutative diagram

also complex-harmonic, of the form y_t = a e^{iωt}. In addition, the complex amplitude a is given by a = h(ω), with h the frequency response function of the system. Given u_t = e^{iωt} we therefore look for a solution of the difference equation (2.86) of the form y_t = a e^{iωt}.

It may easily be verified that shifting the complex-harmonic signal e^{iωt} back once comes down to multiplication of the signal by e^{−iω}. Back shifting several times results in multiplication by the corresponding power of e^{−iω}. Application of the compound shift operation N(q) to the complex-harmonic signal e^{iωt} therefore yields the complex-harmonic signal N(e^{iω}) e^{iωt}. Hence, substitution of u_t = e^{iωt} and y_t = a e^{iωt} into (2.86) results in

D(e^{iω}) a e^{iωt} = N(e^{iω}) e^{iωt},   t ∈ Z.   (2.87)

Cancellation of the factor e^{iωt} and solution for a = h(ω) shows that the frequency response function of the ARMA scheme is given by

h(ω) = N(e^{iω}) / D(e^{iω}),   ω ∈ [−π, π).   (2.88)
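Numerically the frequency response (2.88) may be evaluated on a grid of frequencies by computing the two polynomials in e^{−iω}. The following is a minimal sketch with illustrative coefficients, using only standard MATLAB functions:

D = [1 -1.2 0.5]; N = [1 0.4];      % illustrative stable ARMA(2,1) scheme
omega = linspace(0, pi, 512);
z = exp(-1i*omega);                 % e^{-i omega}
h = polyval(fliplr(N), z) ./ polyval(fliplr(D), z);   % h(omega) = N(e^{i omega})/D(e^{i omega})
plot(omega, abs(h));                % magnitude of the frequency response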

2.6.3 Spectral density function

Realizations of wide-sense stationary processes, such as the output process X_t of a stable ARMA scheme, do not have a DCFT because the infinite sum diverges. Direct frequency analysis of stochastic processes therefore is not immediately feasible. It does make sense, however, to study the Fourier transform of the covariance function of the process.

Let X_t, t ∈ Z, be a wide-sense stationary process with covariance function r. Then the DCFT

φ(ω) = ∑_{τ=−∞}^{∞} r(τ) e^{−iωτ},   ω ∈ [−π, π),   (2.89)

of r (if it exists) is called the spectral density function or the power spectral density function of the process. The name will soon become clear.

Lemma 2.6.2 (Properties of the spectral density function). The spectral density function φ has the following properties:

1. Realness: φ(ω) is real for all ω ∈ R.

2. Nonnegativity: φ(ω) ≥ 0 for all ω ∈ R.

3. Symmetry: φ(−ω) = φ(ω) for all ω ∈ R.

4. Power spectral density: var(X_t) = r(0) = (1/2π) ∫_{−π}^{π} φ(ω) dω.

Proof. Problem 2.18 (p. 27).

For non-stochastic time series X_t the power is commonly defined as the average of X_t^2. In a stochastic setting the power of a zero mean process becomes var(X_t) = E X_t^2 = r(0). Property 4 of Lemma 2.6.2 shows that the power can be seen as an integral over frequency of the function φ(ω), hence the name power spectral density.

2.6.4 Filters

The importance of the spectral density function becomes clear when we consider filtered stochastic processes. Suppose that the convolution system

Y_t = ∑_{m=−∞}^{∞} h_m U_{t−m},   t ∈ Z,   (2.90)

has a wide-sense stationary process U_t with covariance function r_u as input signal. We show that the output process Y_t is also wide-sense stationary. To this end we compute the covariance function of the output process.

Without loss of generality we assume that the processes U_t and Y_t are both centered. Then for s and t in Z we have

E Y_t Y_s = E ( ∑_m h_m U_{t−m} ) ( ∑_n h_n U_{s−n} )
          = E ∑_m ∑_n h_m h_n U_{t−m} U_{s−n}
          = ∑_m ∑_n h_m h_n E U_{t−m} U_{s−n}
          = ∑_m ∑_n h_m h_n r_u(t − s + n − m).

All summations are from −∞ to ∞. Inspection shows that the right-hand side only depends on the difference t − s of the arguments t and s. Apparently the process Y_t is wide-sense stationary, with covariance function r_y given by

r_y(τ) = ∑_m ∑_n h_m h_n r_u(τ + n − m),   τ ∈ Z.   (2.91)

Next we determine the spectral density function of Y_t. Let φ_u be the spectral density function of U_t, so that

r_u(τ) = (1/2π) ∫_{−π}^{π} φ_u(ω) e^{iωτ} dω,   τ ∈ Z.   (2.92)

Substitution of (2.92) into the right-hand side of (2.91) yields

r_y(τ) = ∑_m ∑_n h_m h_n (1/2π) ∫_{−π}^{π} φ_u(ω) e^{iω(τ+n−m)} dω   (2.93)
       = (1/2π) ∫_{−π}^{π} ( ∑_m h_m e^{−iωm} ) ( ∑_n h_n e^{iωn} ) φ_u(ω) e^{iωτ} dω
       = (1/2π) ∫_{−π}^{π} |h(ω)|^2 φ_u(ω) e^{iωτ} dω,   τ ∈ Z,   (2.94)

where we recognize the first sum as h(ω) and the second as h(−ω). Here we use the conjugate symmetry of the frequency response function h. Closer inspection shows that the right-hand side of (2.94) is the inverse DCFT of r_y. The DCFT of r_y is the spectral density function φ_y of the output process Y_t. Hence we have for this spectral density function

φ_y(ω) = |h(ω)|^2 φ_u(ω),   ω ∈ [−π, π).   (2.95)

This relation very clearly exhibits the effect of the system on the input process. Suppose that the system is a "band filter" for which

h(ω) = 1 for ω_0 − b/2 ≤ |ω| ≤ ω_0 + b/2,  0 for all other frequencies.

Hence the system only lets through harmonic signals with frequencies in a narrow band of width b centered around the frequencies ±ω_0, with 0 ≤ b ≪ ω_0. Then we have for the variance, i.e. the power, of the output signal

var(Y_t) = r_y(0) = (1/2π) ∫_{−π}^{π} φ_y(ω) dω ≈ (b/2π) φ_u(−ω_0) + (b/2π) φ_u(ω_0) = (b/π) φ_u(ω_0).

We interpret (1/2π) φ_u(ω_0) as the power (density) of U_t at frequency ω_0. Because the power var(Y_t) is nonnegative for all b ≥ 0 this also proves that φ_u(ω_0) can only be nonnegative (Property 2 of Lemma 2.6.2).

In general we say that the system (2.90) is a filter for the process U_t.

2.6.5 Spectral density of white noise and ARMA processes

Because the covariance function of white noise ε_t with standard deviation σ is given by

r_ε(τ) = cov(ε_{t+τ}, ε_t) = σ^2 for τ = 0,  0 for τ ≠ 0,   τ ∈ Z,   (2.96)

the spectral density function of white noise equals

φ_ε(ω) = σ^2,   ω ∈ [−π, π).   (2.97)

Therefore, all frequencies are equally represented inwhite noise. The fact that white light also has this prop-erty explains the name “white noise.”

From § 2.6.2 (p. 18) we know that the stable ARMA scheme D(q)Y_t = N(q)U_t has the frequency response function

h(ω) = N(e^{iω}) / D(e^{iω}),   ω ∈ [−π, π).   (2.98)

We conclude from this that the wide-sense stationary process X_t defined by the stable ARMA scheme D(q)X_t = N(q)ε_t has the spectral density function

φ_X(ω) = |h(ω)|^2 φ_ε(ω) = | N(e^{iω}) / D(e^{iω}) |^2 σ^2.

The covariance function of the ARMA process may now be determined by inverse Fourier transformation of the spectral density function:

r_X(τ) = (1/2π) ∫_{−π}^{π} φ_X(ω) e^{iωτ} dω = (1/2π) ∫_{−π}^{π} | N(e^{iω}) / D(e^{iω}) |^2 σ^2 e^{iωτ} dω.

In the next subsection we show that there is a simpler wayof obtaining the covariance function.

By way of illustration we consider the spectral density function of the AR(2) process

X_t = a_1 X_{t−1} + a_2 X_{t−2} + ε_t.   (2.99)

Because N(q) = 1 and D(q) = 1 − a_1 q^{-1} − a_2 q^{-2} the spectral density function is

φ_X(ω) = σ^2 / |1 − a_1 e^{−iω} − a_2 e^{−2iω}|^2,   ω ∈ R.   (2.100)

This process is also discussed in Problem 2.7 (p. 26). For a_1 = 2a cos(2π/T) and a_2 = −a^2 the covariance function is of the form

r_X(τ) = a^{|τ|} [ A cos(2πτ/T) + B sin(2π|τ|/T) ],   (2.101)

with A and B constants to be determined. An example of a realization of the process with a = 0.9 and T = 12 is given in Fig. 2.6(b). In Fig. 2.13 the plot of the spectral density function of the process is shown in the frequency range from 0 to π. We see that the spectral density has a peak near the angular frequency 2π/T = 0.5236. The peak agrees with the weakly periodic character of the process.

Figure 2.13: Spectral density function of an AR(2) process.

Example 2.6.3 (Matlab computation). The System Identification Toolbox has a facility to compute the spectral density function of a given ARMA process. To obtain the spectral density function of Fig. 2.13 first those MATLAB commands from the script of § 2.4.5 (p. 15) may be executed that define the system structure th of the AR(2) scheme. Next the commands

phi = th2ff(th);
[omega,phi] = getff(phi);

serve to compute the frequency axis omega and the corresponding values of the spectral density phi. With their help the plot of Fig. 2.13 may be prepared. �
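Without the toolbox the spectral density (2.100) may also be evaluated directly from its formula. The following is a minimal sketch that reproduces the shape of Fig. 2.13; the plotting details of the original figure are not reproduced.

a = 0.9; T = 12;
a1 = 2*a*cos(2*pi/T); a2 = -a*a;
sigma2 = (1-a1-a2)*(1+a1-a2)*(1+a2)/(1-a2);  % white-noise variance giving stationary variance 1 (Problem 2.7)
omega = linspace(0, pi, 512);
phi = sigma2 ./ abs(1 - a1*exp(-1i*omega) - a2*exp(-2i*omega)).^2;   % formula (2.100)
plot(omega, phi);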

2.6.6 Two-sided z-transformation and generating functions

By replacing in the definition

x(ω) = ∑_{t=−∞}^{∞} x_t e^{−iωt}   (2.102)

of the DCFT the quantity e^{iω} by the complex variable z we obtain the two-sided z-transform

X(z) = ∑_{t=−∞}^{∞} x_t z^{−t}   (2.103)

of x_t. In mathematical statistics and stochastics the z-transform is known as the generating function.

We define the z-transform for all complex values of z for which the infinite sum converges. For z = e^{iω}, ω ∈ [−π, π), that is, on the unit circle in the complex plane, the z-transform reduces to the DCFT:

X(e^{iω}) = x(ω).   (2.104)

If the z-transform X of x exists on the unit circle then the inversion formula for the z-transform is

x_t = (1/2π) ∫_{−π}^{π} X(e^{iω}) e^{iωt} dω,   t ∈ Z.   (2.105)

Often the computation of the complex integral may be avoided by algebraic manipulations such as those known from applications of the Laplace transformation.


Example 2.6.4 (Covariance function of the ARMA(1,1) process). By way of example we consider the computation of the covariance function of the ARMA(1,1) process defined by the scheme

X_t = a X_{t−1} + ε_t + b ε_{t−1}.   (2.106)

Without loss of generality we choose the coefficient of ε_t equal to 1. The scheme is stable if |a| < 1. Because N(q) = 1 + b q^{-1} and D(q) = 1 − a q^{-1} the spectral density function of the process is

φ_X(ω) = | (1 + b e^{−iω}) / (1 − a e^{−iω}) |^2 σ^2 = (1 + b e^{−iω})/(1 − a e^{−iω}) · (1 + b e^{iω})/(1 − a e^{iω}) · σ^2.

With the substitution e^{iω} = z we see that inverse Fourier transformation of φ_X comes down to determining the inverse z-transform of

(1 + b z^{-1})/(1 − a z^{-1}) · (1 + b z)/(1 − a z) · σ^2.   (2.107)

This is a rational function of z with poles at z = a and z^{-1} = a. Partial fraction expansion yields

(1 + b z^{-1})/(1 − a z^{-1}) · (1 + b z)/(1 − a z) · σ^2 = Bσ^2/(z^{-1} − a) + Cσ^2 + Bσ^2/(z − a),   (2.108)

with

B = (a + b)(1 + ab)/(1 − a^2),   C = (1 + 2ab + b^2)/(1 − a^2).

This is not a usual partial fraction expansion but one that retains the z ↔ z^{-1} symmetry. From |z| = |e^{iω}| = 1 and |a| < 1 it follows that |az| < 1, so that we have the infinite expansion

1/(z^{-1} − a) = z/(1 − a z) = z(1 + a z + a^2 z^2 + ···) = z + a z^2 + a^2 z^3 + ··· .

For reasons of symmetry we have that

1/(z − a) = z^{-1}/(1 − a z^{-1}) = z^{-1} + a z^{-2} + a^2 z^{-3} + ··· .

Combining these two expansions in (2.108) yields

σ^2 (1 + b z^{-1})/(1 − a z^{-1}) · (1 + b z)/(1 − a z) = σ^2 B(··· + a z^2 + z) + σ^2 C + σ^2 B(z^{-1} + a z^{-2} + ···).

By definition this is the z-transform of the covariance function,

··· + r(−2) z^2 + r(−1) z + r(0) + r(1) z^{-1} + r(2) z^{-2} + ··· .   (2.109)

Matching the coefficients shows that

r(τ) = σ^2 B a^{1−τ} for τ < 0,   σ^2 C for τ = 0,   σ^2 B a^{τ−1} for τ > 0,

that is,

r(τ) = (1 + 2ab + b^2)/(1 − a^2) · σ^2 for τ = 0,   (a + b)(1 + ab)/(a(1 − a^2)) · σ^2 a^{|τ|} for τ ≠ 0,   τ ∈ Z.
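The result may be checked numerically by comparing the formula with sample covariances of a long simulated realization. A minimal sketch, with illustrative values of a, b and σ:

a = 0.7; b = 0.3; sigma = 1;                 % illustrative values
e = sigma*randn(100000,1);
x = filter([1 b], [1 -a], e);                % ARMA(1,1) realization
x = x(1001:end);                             % discard the transient
B = (a+b)*(1+a*b)/(1-a^2);
C = (1+2*a*b+b^2)/(1-a^2);
tau = 0:5;
r_theory = sigma^2*[C, B*a.^(tau(2:end)-1)];
r_sample = zeros(size(tau));
for k = tau
    r_sample(k+1) = mean(x(1+k:end).*x(1:end-k));   % sample covariance at lag k
end
[r_theory; r_sample]                         % the two rows should be close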

We conclude with a comment about the invertibility of stable ARMA schemes. The covariance function of the stable ARMA scheme D(q)X_t = N(q)ε_t is the inverse z-transform of

N(z)/D(z) · N(z^{-1})/D(z^{-1}) · σ^2.   (2.110)

The condition for invertibility of the scheme is that all zeros of N(z) have modulus strictly smaller than 1. Consider the numerator N(z)N(z^{-1}) of (2.110). The zeros of this numerator consist of the zeros of N(z) together with the reciprocals of these zeros. Suppose that some zeros of N have modulus greater than 1. Then it is always possible to find a polynomial Ñ (of the same degree as N) whose zeros all have modulus smaller than or equal to 1 such that

Ñ(z^{-1})Ñ(z) = N(z^{-1})N(z).   (2.111)

The ARMA process X̃_t defined by D(q)X̃_t = Ñ(q)ε_t has the same spectral density function as the original process X_t. Hence, it also has the same covariance function and therefore cannot be distinguished from the original process. Without loss of generality we may therefore assume that all zeros of N have modulus less than or equal to 1. If there are no zeros with modulus equal to 1 then the scheme is invertible.

2.6.7 Spectral analysis of continuous-time processes

The exposition of the present section has been limited todiscrete-time processes with time axis Z. For many ap-plications this is the relevant model. There also are ap-plication areas, in particular in physics and electrical en-gineering, where the underlying phenomena have an es-sential continuous-time character. To analyze these phe-nomena it is necessary to use continuous-time stochasticprocesses as models.

We summarize some of the notions that we developed for discrete-time processes for the continuous-time process X_t, t ∈ R. Suppose that the process is wide-sense stationary with covariance function r(τ) = cov(X_{t+τ}, X_t), τ ∈ R. The spectral density function of the process is defined as the Fourier transform⁴

φ(ω) = ∫_{−∞}^{∞} r(τ) e^{−iωτ} dτ,   ω ∈ R.   (2.112)

⁴Kwakernaak and Sivan (1991) refer to this Fourier transform as the CCFT, the continuous-to-continuous Fourier transform.

If the spectral density function φ exists then it is real and nonnegative for all ω, and symmetric in ω. If the spectral density φ is given then the covariance function r may be retrieved by the inverse Fourier transformation

r(τ) = (1/2π) ∫_{−∞}^{∞} φ(ω) e^{iωτ} dω,   τ ∈ R.   (2.113)

This shows that

var(X_t) = r(0) = (1/2π) ∫_{−∞}^{∞} φ(ω) dω.   (2.114)

Consider a continuous-time convolution system with input signal u and output signal y described by

y_t = ∫_{−∞}^{∞} h(τ) u_{t−τ} dτ,   t ∈ R.   (2.115)

The function h is called the impulse response of the system. The impulse response is the response of the system if the input signal is the delta function u_t = δ(t). The frequency response function of the system is the Fourier transform

h(ω) = ∫_{−∞}^{∞} h(τ) e^{−iωτ} dτ,   ω ∈ R,   (2.116)

of the impulse response. If the frequency response h exists then it is conjugate symmetric (provided h is real). The impulse response may be recovered from the frequency response function by inverse Fourier transformation.

If the input signal u of the convolution system (2.115) is a wide-sense stationary stochastic process with covariance function r_u then the output process is also wide-sense stationary, with covariance function

r_y(τ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(t) h(s) r_u(τ + s − t) dt ds,   τ ∈ R.   (2.117)

The spectral density function φ_y of the output process follows from the spectral density function φ_u of the input process by the relation

φ_y(ω) = |h(ω)|^2 φ_u(ω),   ω ∈ R.   (2.118)

2.7 Trends and seasonal processes

In classical time series analysis a time series is decom-posed into three components:

1. A trend, that is, a more or less gradual development.The monthly index of the American home mortgagesof Fig. 1.6 (p. 2) appears to consist mainly of thiscomponent.

2. A seasonal component with a more or less pro-nounced periodic character. The water flow of theriver Tiber of Fig. 1.2 (p. 2) has an obvious seasonalcomponent with a period of 12 months.

3. An incidental component, consisting of irregularfluctuations. The annual immigration into the USAof Fig. 1.3 (p. 2) appears to be primarily of this nature.

We show how ARMA schemes may be used to model thesethree types of phenomena.

2.7.1 Trends and ARIMA models

In the classical time series analysis discussed in § 4.3 (p. 41) trends are often represented by polynomials:

Z_t = a_0 + a_1 t + a_2 t^2 + ··· + a_p t^p,   t ∈ Z.   (2.119)

The degree of the polynomial may be decreased by applying the difference operator. Define the (backward) difference operator ∇ by

∇Z_t = Z_t − Z_{t−1}.   (2.120)

In terms of the backward shift operator q^{-1} we have

∇ = 1 − q^{-1}.   (2.121)

If Z_t is a polynomial in t of degree p as in (2.119) then ∇Z_t is a polynomial of degree p−1. By applying the difference operator p times the degree of the polynomial Z_t is decreased to 0. Because the difference operator reduces constants to 0 we have

∇^{p+1} Z_t = 0,   t ∈ Z.   (2.122)

Conversely we may consider this relation as a difference equation for Z_t. Each solution of this difference equation is a polynomial of degree at most p. We may rewrite the equation (2.122) as

(1 − q^{-1})^{p+1} Z_t = 0.   (2.123)

Obviously, irregularities in the trend may be modeled by adding a stochastic component to this equation. Thus we obtain the AR(p+1) model

(1 − q^{-1})^{p+1} Z_t = ε_t,   t ∈ Z,   (2.124)

with ε_t white noise with mean 0 and standard deviation σ. For p = 0 the model agrees with the "random walk" model

Z_t = Z_{t−1} + ε_t.   (2.125)

The polynomial trend model may easily be embedded in an ARMA model. We then consider schemes of the form

D(q)(1 − q^{-1})^d X_t = N(q)ε_t,   t ∈ Z,   (2.126)

or

D(q)∇^d X_t = N(q)ε_t,   t ∈ Z.   (2.127)

D is a polynomial of degree n and N has degree k. This model is known as an ARIMA(n, d, k) scheme. The "I" in this acronym is the initial letter of "integrated." Integration is to be understood as the inverse operation of taking differences.


Figure 2.14: Realizations of I(1), I(2) and I(3) processes

Figure 2.14 shows realizations of I(1), I(2), and I(3) processes. The three realizations were obtained from a single realization of white noise with variance 1 by applying the I(1) scheme three times in succession. The results were scaled by successively dividing the realizations by 10, 1000, and 100000. The plots show that as d increases the behavior of the realization of the I(d) process becomes, relatively, less irregular.
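Realizations such as those of Fig. 2.14 are easily generated by repeated summation, since integrating once amounts to the recursion Z_t = Z_{t−1} + ε_t. The following is a minimal sketch; the random seed and the plotting layout of the original figure are not reproduced.

e  = randn(200,1);           % one realization of white noise with variance 1
z1 = cumsum(e);              % I(1): (1 - q^{-1}) Z_t = eps_t
z2 = cumsum(z1);             % I(2)
z3 = cumsum(z2);             % I(3)
plot([z1/10 z2/1000 z3/100000]);   % scaled as described above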

2.7.2 Seasonal processes

A time series Z_t is periodic with period P ∈ Z if

Z_t = Z_{t−P},   t ∈ Z,   (2.128)

or

(1 − q^{-P}) Z_t = 0,   t ∈ Z.   (2.129)

The strict periodicity is softened by modifying the model to

(1 − a q^{-P}) Z_t = ε_t,   t ∈ Z,   (2.130)

with ε_t white noise and a a constant such that |a| < 1. This AR(P) process has the stationary covariance function

r(τ) = σ_Z^2 a^{|τ|/P} for τ = 0, ±P, ±2P, ...,   0 for other values of τ,   τ ∈ Z.   (2.131)

Depending on the value of a, the values of the time series Z_t at time instants that are a multiple of the period P away from each other are more or less strongly correlated. For time instants that are not separated by a multiple of P there is no correlation at all.

Figure 2.15: Realization of the process Z_t = 0.9 Z_{t−12} + ε_t, t ∈ Z

Figure 2.15 shows a realization for P = 12 and a = 0.9. It is clearly seen that the behavior within one "period" may be very irregular. The model (2.130) may be refined by considering ARMA schemes of the form

D(q^P) Z_t = N(q^P) ε_t,   t ∈ Z.   (2.132)

These models retain the characteristic that there is no correlation between time instants that are not separated by a multiple of P.
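A realization such as the one in Fig. 2.15 may be generated directly from the difference equation (2.130). A minimal sketch (the seed and plot layout are illustrative):

P = 12; a = 0.9;
e = randn(250,1);
z = filter(1, [1 zeros(1,P-1) -a], e);   % (1 - a q^{-P}) Z_t = eps_t
plot(z);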

We discuss other possibilities to capture weakly periodic phenomena by an ARMA model of the form D(q)Z_t = N(q)ε_t. To this end we consider the homogeneous equation D(q)Z_t = 0. Denote the zeros of D as λ_i, i = 1, 2, .... In § 2.4.2 (p. 12) it is explained that the solution of the homogeneous equation is a linear combination of terms of the form t^k (λ_i)^t, t ∈ Z.

If D has a zero e^{iω_0} on the unit circle (with ω_0 real) then the homogeneous equation has a corresponding purely periodic solution e^{iω_0 t}, t ∈ Z. The corresponding real solution is of the form cos(ω_0 t + ψ), with ψ a constant.

The polynomial D(q) = 1 − q^{-P}, for instance, which characterizes the purely periodic model (2.129), has P zeros e^{ik2π/P}, k = 0, 1, ..., P−1, on the unit circle. The polynomial D(q) = 1 − a q^{-P} of the model (2.130) has the zeros a^{1/P} e^{ik2π/P}, k = 0, 1, ..., P−1. The closer a is to 1 the closer the zeros are to the unit circle, and the more pronounced the periodic character is.

Generally it is true that if D has one or several complex conjugate zero pairs that are close to the unit circle then the realizations of the corresponding ARMA process D(q)X_t = N(q)ε_t show a more or less pronounced periodic component. How pronounced depends on the distance of the zero pair to the unit circle.

An example is the AR(2) process X_t = a_1 X_{t−1} + a_2 X_{t−2} + ε_t of Problem 2.7 (p. 26). For a_1 = 2a cos(2π/T) and a_2 = −a^2, with a and T real, the polynomial

D(q) = 1 − 2a cos(2π/T) q^{-1} + a^2 q^{-2}   (2.133)

has the complex conjugate zero pair

a [cos(2π/T) ± i sin(2π/T)].   (2.134)


Figure 2.6(b) shows a realization of this process for a = 0.9and T = 12. The zeros are close enough to the unit cir-cle for the quasi-periodic character to be clearly recog-nizable.

2.8 Prediction of time series

An important and interesting application of models fortime series is prediction or forecasting. The problem isto predict the future behavior of the phenomenon that isobserved as accurately as possible given the observationsup to the present time. We discuss this problem for pro-cesses that are described by ARMA schemes.

2.8.1 Prediction of ARMA processes

We study the prediction of processes described by the ARMA scheme

D(q)X_t = N(q)ε_t,   t ∈ Z,   (2.135)

with ε_t white noise with mean 0. We assume that the scheme is both stable and invertible. For the time being we suppose that the prediction is based on observations of a realization of the process from the infinite past until time t_0. Later we consider the situation where only finitely many past observations are available.

The solution of the prediction problem consists of twobasic steps.

1. Given the past observations reconstruct the past re-alization of the white noise that generated the obser-vations.

2. Given the past realization of the driving white noisepredict the future behavior of the process.

Eventually the two solution steps are combined into a sin-gle prediction operation.

The reconstruction of the driving white noise is based on inversion of the ARMA scheme. By the invertibility assumption there exists an equivalent AR(∞) scheme of the form

G(q)X_t = ε_t,   t ∈ Z.   (2.136)

G follows by expansion of

G(q) = D(q)/N(q) = g_0 + g_1 q^{-1} + g_2 q^{-2} + ··· .   (2.137)

With the help of (2.136) the past realization of the white noise ε_t, t ≤ t_0, may be reconstructed recursively from the observed realization X_t, t ≤ t_0, according to

ε_t = g_0 X_t + g_1 X_{t−1} + g_2 X_{t−2} + ···,   t = ..., t_0 − 1, t_0.   (2.138)

Similarly the X_t, t ≤ t_0, may be reconstructed from the ε_t, t ≤ t_0, with the help of the MA(∞) scheme X_t = N(q)/D(q) ε_t. In other words: knowing the past of X_t is equivalent to knowing the past of ε_t. We next consider the prediction problem. For that we partially expand N(q)/D(q) as

N(q)/D(q) = h_0 + h_1 q^{-1} + h_2 q^{-2} + ··· + h_{m−1} q^{-m+1} + q^{-m} R_m(q)/D(q),

with R_m(q) causal. Such representations exist for any m ∈ N and they may easily be obtained using long division, see Example 2.8.1. With this partial expansion we may write

X_{t_0+m} = h_0 ε_{t_0+m} + h_1 ε_{t_0+m−1} + ··· + h_{m−1} ε_{t_0+1} + R_m(q)/D(q) ε_{t_0},   (2.139)

where the first group of terms is the prediction error e_{t_0+m|t_0} and the last term is the best prediction X_{t_0+m|t_0}.

The second term, indicated as "best prediction" X_{t_0+m|t_0}, is determined by ε_{t_0}, ε_{t_0−1}, ..., and therefore is known at time t_0. The first term, indicated as "prediction error" e_{t_0+m|t_0}, is a stochastic variable with mean zero that is uncorrelated with X_t, t ≤ t_0. It is in fact independent of X_t, t ≤ t_0, if we assume that the white noise is mutually independent. In that case it follows that X_{t_0+m|t_0} is the conditional expectation of X_{t_0+m} given X_t, t ≤ t_0. Among all functions of X_t, t ≤ t_0, the predictor

X_{t_0+m|t_0} = R_m(q)/D(q) ε_{t_0}   (2.140)

then minimizes the mean square prediction error

E[(X_{t_0+m|t_0} − X_{t_0+m})^2 | X_t, t ≤ t_0].   (2.141)

We thus identify the optimal predictor as

X_{t+m|t} = R_m(q)/D(q) ε_t.   (2.142)

Here we replaced t_0 with t. The X_{t+m|t} is the best predictor of X_{t+m} given past observations up to and including time t, and it is called the m-step predictor. Since ε_t = D(q)/N(q) X_t we have that

X_{t+m|t} = R_m(q)/D(q) · D(q)/N(q) X_t = R_m(q)/N(q) X_t.

To determine X_{t+m|t} we therefore do not need to generate the ε_t first. Clearly we may implement the predictor as the ARMA scheme

N(q)X_{t+m|t} = R_m(q)X_t,   t ∈ Z.   (2.143)

The result shows that for AR schemes D(q)X_t = ε_t the m-step predictor is an MA scheme X_{t+m|t} = R_m(q)X_t.

Up to this point we have assumed that all past values of X_t from time −∞ on are available for prediction. Because of the assumed invertibility of the original ARMA scheme (2.135) the predictor (2.143) is stable. The effect of an incorrect initialization of (2.143) at a time instant in the finite rather than the infinite past hence decreases exponentially. Therefore a predictor of the form (2.143) that has been initialized, for instance, at time 0 with initial conditions 0 asymptotically yields correct results.

From the formula

e_{t+m|t} = h_0 ε_{t+m} + h_1 ε_{t+m−1} + ··· + h_{m−1} ε_{t+1}   (2.144)

for the prediction error it follows that the mean square prediction error equals

E e_{t+m|t}^2 = (h_0^2 + h_1^2 + ··· + h_{m−1}^2) σ^2.   (2.145)
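Given the polynomials D and N the coefficients h_0, ..., h_{m−1} in (2.145) are the first m impulse response coefficients of N(q)/D(q), so the mean square prediction error is easily computed numerically. A minimal sketch with an illustrative scheme:

D = [1 -0.8]; N = [1 0.4]; sigma = 1; m = 3;   % illustrative values
h = filter(N, D, [1 zeros(1,m-1)]);            % h_0, ..., h_{m-1}
mse = sigma^2*sum(h.^2)                        % E e_{t+m|t}^2 according to (2.145)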

Example 2.8.1. By way of example we discuss the AR(1) process described by

X_t = a X_{t−1} + ε_t.   (2.146)

Inspection shows that the best one-step predictor is

X_{t+1|t} = a X_t.   (2.147)

We derive the m-step predictor formally. We have D(q) = 1 − a q^{-1} and N(q) = 1; long division then gives us (see Problem 2.30) that

N(q)/D(q) = 1/(1 − a q^{-1}) = 1 + a q^{-1} + a^2 q^{-2} + ··· + a^{m−1} q^{1−m} + q^{-m} a^m/(1 − a q^{-1}).   (2.148)

For the sake of stability we need to assume that |a| < 1. Apparently we have h_i = a^i, i = 0, 1, ..., and R_m(q) = a^m. Substitution into (2.143) yields the m-step predictor

X_{t+m|t} = a^m X_t.   (2.149)

Prediction over m steps for this process hence amounts to multiplication of the last known observation X_t by a^m. That the last known observation is all that is needed from the past follows because the AR(1) process is a Markov process.

The mean square prediction error according to (2.145) is

E e_{t+m|t}^2 = (1 + a^2 + a^4 + ··· + a^{2(m−1)}) σ^2 = (1 − a^{2m})/(1 − a^2) σ^2 = (1 − a^{2m}) σ_X^2.

Here σ_X^2 is the stationary variance of the process itself. For m → ∞ the mean square prediction error approaches σ_X^2, and the m-step prediction itself approaches 0. The smaller |a| is, the faster the mean square prediction error increases with m. The reason is that if |a| is small the temporal dependence of the process is also small, so that any prediction is poor. �

It is easy to see that the predictor (2.149) in the above example is also optimal if a = 1 (despite the fact that the AR(1) scheme is not stable). The AR(1) process then reduces to the random walk process

X_t = X_{t−1} + ε_t.   (2.150)

The optimal m-step predictor for the random walk is

X_{t+m|t} = X_t.   (2.151)

The best prediction of the random walk hence is to use the last known observation as the prediction. It is well known that the weather forecast "tomorrow the weather will be like today" scores almost as well as meteorological weather forecasts ...

Example 2.8.2 (Matlab computation). The System Identification Toolbox has a routine predict for the calculation of m-step predictions. We apply it to the AR(2) time series that is produced in § 2.4.5 (p. 15). After defining the theta structure th and generating the time series x according to the script of § 2.4.5 (p. 15) we calculate the one-step predictions xhat with

xhat = predict(x,th,1);

In Fig. 2.16 the first 51 values of the observed and the predicted process are plotted. The sample average of the squared prediction error is 0.2998. The predictor works well when the time series varies more or less smoothly but has difficulties coping with fast changes. �

Figure 2.16: Dots: one-step predictions. Line: AR(2) time series

2.9 Problems

2.1 Basic stochastic processes. Let ǫt be white noise withvariance 1 and nonzero mean Eǫt = 1/4. Computevariance and mean of X t = ǫt +ǫt−1+ǫt−2+ǫt−3 andsketch a “reasonable” realization x t for t = 0, 1, . . . , 20and indicate mean and standard deviation of theprocess.

2.2 Basic stochastic processes. Let Y_t be a stationary process with zero mean and variance σ^2 and suppose the Y_t, t ∈ Z, are mutually uncorrelated. Show that all four time series X_t below have the same mean and variance but that their covariance functions R(t+τ, t) generally differ. Sketch two realizations of each of the four time series below.

a) X_t = Y_1,

b) X_t = (−1)^t Y_1,

c) X_t = Y_t,

d) X_t = (Y_t + Y_{t−1})/√2.

2.3 Properties of R, ρ and R. Prove Lemma 2.1.1. (Usethe Cauchy-Schwarz inequality on page 91 and forPart 4 consider var(v1X t1+v2X t2 ) for constant v1, v2.)

2.4 Covariance function of a wide-sense stationary pro-

cess. Prove Lemma 2.1.2.

2.5 Convolutions. Prove equations (2.25) and (2.26).

2.6 Non-centered processes. Assume that the white noise process ε_t has mean µ ≠ 0. Suppose that the AR(n) process X_t defined by (2.27) is asymptotically wide-sense stationary.

a) Prove that the asymptotically wide-sense stationary process has the constant mean

m = µ / (1 − ∑_{i=1}^{n} a_i).   (2.152)

b) Prove that if the process is asymptotically wide-sense stationary then the denominator 1 − ∑_{i=1}^{n} a_i is non-zero. Hint: if 1 − ∑_{i=1}^{n} a_i = 0 then D(λ) has a zero at 1.

c) How must the initial conditions X_0, X_1, ..., X_{n−1} be chosen so that the process is immediately stationary?

d) Let m(t) = E X_t be the mean value function of the process X_t. Which difference equation and initial conditions does the centered process X̃_t = X_t − m(t) satisfy?

2.7 A Yule scheme. The Yule scheme is the AR(2) scheme

X_t = a_1 X_{t−1} + a_2 X_{t−2} + ε_t,   t = 2, 3, ... .   (2.153)

a) Prove that the first two stationary correlation coefficients are given by

ρ(1) = a_1/(1 − a_2),   ρ(2) = a_2 + a_1^2/(1 − a_2).   (2.154)

b) Also show that the stationary variance equals

r(0) = (1 − a_2)σ^2 / [(1 − a_1 − a_2)(1 + a_1 − a_2)(1 + a_2)].   (2.155)

c) Suppose that the polynomial D(q) = 1 − a_1 q^{-1} − a_2 q^{-2} has the complex conjugate zero pair

a e^{±i2π/T},   (2.156)

with a and T real. Show that it follows that a_1 = 2a cos(2π/T) and a_2 = −a^2. Also show that for k = 2, 3, ... the stationary correlation function has the form

ρ(k) = a^k [A cos(2πk/T) + B sin(2πk/T)],   (2.157)

with A and B real constants to be determined.

d) Let a = 0.9 and T = 12. Compute ρ(1) and ρ(2) numerically from the Yule-Walker equations. Use these numbers to determine A and B numerically. Plot ρ.

2.8 Solvability of the Yule-Walker equations⋆. Assume that X_t is wide-sense stationary. Use Lemmas 2.1.1 and 2.1.2 to show that (2.63) has a unique solution if and only if X_t, ..., X_{t+n} are linearly independent (i.e., ∑_{i=1}^{n} c_i X_{t+i} = 0 ⟹ c_i = 0).

2.9 Partial correlation coefficients. Consider the AR(2) scheme

X_t = c_1 X_{t−1} + c_2 X_{t−2} + ε_t,   t = 2, 3, ...,   (2.158)

that is also studied in Problem 2.7.

a) Compute the partial correlation coefficients a_{11} and a_{22}.

b) Derive the limit lim_{c_2→0} a_{11}. Is that limit a surprise?

c) What can you say about the coefficients a_{ni} for n ≥ 2?

2.10 ARMA model. The covariance function of a wide-sense stationary process X_t, t ∈ Z, with mean 0 is given by

r(τ) = 1 for τ = 0,   1/2 for |τ| = 1,   0 for |τ| ≥ 2,   τ ∈ Z.   (2.159)

By which ARMA scheme may this process be described?

2.11 Asymptotic behavior of the partial correlations of the MA process and the correlations of the AR process. Make it plausible that the partial correlation coefficients of an invertible MA process and the correlation coefficients of a stable AR process decrease exponentially.

2.12 Long division. Consider the AR(2) process (1 −a 1q−1−a 2q−2)X t = ǫt and suppose it is stable.


a) Use repeated substitution to determine the firstthree coefficients b0,b1,b2 of the MA(∞) de-scription X t = (b0+b1q−1+b2q−2+ · · · )ǫt of theprocess.

b) Use long division to determine the first three coefficients b_0, b_1, b_2 of the series expansion (b_0 + b_1 λ^{-1} + b_2 λ^{-2} + ···) of 1/(1 − a_1 λ^{-1} − a_2 λ^{-2}) in the negative powers of λ.

2.13 MA(∞) process. Consider the MA(∞) scheme given by

X_t = ε_t + (1/2)ε_{t−3} + (1/4)ε_{t−6} + (1/8)ε_{t−9} + ··· .   (2.160)

The process ε_t, t ∈ Z, is white noise with variance 1.

a) There exists an AR scheme of finite order that generates this process. Determine this scheme.

b) Compute the covariance function r(k) of the process. Plot it.

2.14 Stationary mean. Suppose that the ARMA scheme D(q)X_t = N(q)ε_t is asymptotically wide-sense stationary. Show that the mean value function is asymptotically given by

E X_t = m(t) → N(1)/D(1) µ as t → ∞.   (2.161)

2.15 Convergence of inverted processes⋆. Consider an MA process X_t = N(q)ε_t and let h_k be the coefficients of 1/N(q),

1/N(q) = h_0 + h_1 q^{-1} + h_2 q^{-2} + ··· .

Now define for each n the approximating processes Y_t^n via

(h_0 + h_1 q^{-1} + h_2 q^{-2} + ··· + h_n q^{-n}) Y_t^n = ε_t.   (2.162)

a) Suppose X_t = N(q)ε_t is invertible. Show that the AR process (2.162) is stable for n large enough. (Hint: consider inf_{λ∈C, |λ|≥1} |1/N(λ)|.)

b) Suppose X_t = N(q)ε_t is invertible. Show that lim_{n→∞} E Y_t^n = E X_t and that for each k we have lim_{n→∞} r_{Y^n}(k) = r_X(k).

2.16 Centered ARMA process. Let m(t) = E X_t be the mean value function of the ARMA process (2.75). What difference equation and initial conditions are satisfied by the centered process X̃_t = X_t − m(t)?

2.17 Properties of the DCFT. Prove Lemma 2.6.1.

2.18 Properties of the spectral density function. ProveLemma 2.6.2.

2.19 Show that the spectral density of X t = q K ǫt is inde-pendent of K and explain in words why this is not asurprise.

2.20 Reciprocal zeros. Prove that it is always possible to find a polynomial Ñ (of the same degree as N) whose zeros have modulus smaller than or equal to 1 such that Ñ(z)Ñ(z^{-1}) = N(z)N(z^{-1}). What is the relation between the zeros of Ñ and those of N?

2.21 Let N(q) = (1 + (1/2)q^{-1})(1 + 3q^{-1}) and suppose ε_t is a white noise process with variance σ^2.

a) Show that X_t = N(q)ε_t is not invertible.

b) Determine φ_X(ω).

c) Which invertible MA process Z_t = Ñ(q)ε_t has the same covariance function as X_t?

2.22 The difference operator decreases the degree of a polynomial. Prove the claim in Subsection 2.7.1 that if Z_t is a polynomial in t of degree p as in (2.119) then ∇Z_t is a polynomial of degree p − 1.

2.23 Variance of the I(d) process. Consider the process Z_t, t ≥ 0, that is generated by the I(d) scheme

(1 − q^{-1})^d Z_t = ε_t,   t ∈ Z,   (2.163)

with the initial conditions Z_0 = Z_{−1} = ··· = Z_{1−d} = 0. Here d is a natural number and ε_t is white noise with variance 1. Compute var(Z_t) as a function of t for d = 1, 2, and 3.

2.24 Covariance function of the process (2.130). Prove that the stationary covariance function of the process (2.130) is given by (2.131). What is σ_Z^2?

2.25 Location of the zero pair and periodicity. Suppose that D has a complex conjugate zero pair a e^{±iω_0}, with ω_0 > 0. For stability we need |a| < 1. The zero pair is close to the unit circle if |a| is not much less than 1. Make it plausible that the zero pair is close enough to the unit circle for the periodic character to be pronounced if

−log|a| ≪ ω_0.   (2.164)

2.26 Stochastic harmonic process. Consider the stochastic harmonic process

X_t = A cos(2πt/T + B),   t ∈ Z,   (2.165)

with T a given integer. A and B are independent stochastic variables with E A = 0, var A = σ^2, and B uniformly distributed on [0, 2π].

a) This process is stationary. What is the covariance function r of the process?

b) Find an AR representation of this process. How should the initial conditions be chosen?

2.27 Best linear predictor⋆. On page 24 the "best predictor" (2.139) is argued to minimize the mean square error (2.141) provided that the ε_t are mutually independent. Now suppose that the ε_t are uncorrelated (but maybe not independent). Show that the best predictor of (2.139) is the one that minimizes E(X_{t_0+m|t_0} − X_{t_0+m})^2 with respect to the linear predictors X_{t_0+m|t_0} = ∑_{i=0}^{∞} c_i ε_{t_0−i}, c_i ∈ R.

2.28 Predictor for the MA(1) process. Find the optimal m -step predictor for the MA(1) process. Which condi-tion needs to be satisfied? Discuss and explain thesolution.

2.29 Mean square prediction error. What is the theoreticalmean square prediction error of the one-step predic-tor for the AR(2) process of Example 2.8.2 (p. 25)?

2.30 Division with remainder. Long division is a procedure to expand N(q)/D(q) in negative powers of q. The skeleton is shown in Eqn. (2.166) (page 29). In the course of doing long division you automatically keep track of the remainder, denoted q^{-m}R(q), and we have by construction that

N(q)/D(q) = h_0 + ··· + h_{m−1} q^{1−m} + q^{-m}R(q)/D(q).   (2.167)

Compute the expansion (2.167) for

a) N(q)/D(q) = 1/(1 − a q^{-1}) for arbitrary m;

b) N(q)/D(q) = 1/(1 − a q^{-m});

c) N(q)/D(q) = 1/(1 − q^{-1} + 0.5 q^{-2}) for m = 1, 2, 3.

2.31 Predictor. Consider the scheme

(1 + q^{-1} + (1/4)q^{-2}) X_t = ε_t,

with ε_t a zero mean white noise process with variance σ^2.

a) Is the scheme stable?

b) Is the scheme invertible?

c) Determine the spectral density function of X_t.

d) Determine the 2-step predictor scheme.

2.32 Predictor for a seasonal model. Consider the seasonal model

X_t − a q^{-P} X_t = ε_t,   t = 1, 2, ...,   (2.168)

with the natural number P > 1 the period, a a real constant, and ε_t white noise with variance σ^2. Assume that the model is stable.

a) Determine the one-step optimal predictor for this model. Interpret the result.

b) Determine the (P+1)-step optimal predictor. Interpret the result.

2.33 Prediction of a polynomial time series. A time series is given by the expression

X_t = c_0 + c_1 t + c_2 t^2,   t ∈ Z,   (2.169)

where c_0, c_1 and c_2 are unknown coefficients.

a) Determine an AR scheme that describes the time series.

b) Determine a recursive scheme for the one-step optimal predictor of the time series.

c) How may a k-step predictor be obtained?

Matlab problems

2.34 Predictors. The following MATLAB function determines the m-step predictor.

function Rm = predic(D,N,m)
% PREDIC  Coefficients Rm of the m-step predictor N(q)X_{t+m|t} = Rm(q)X_t,
%         obtained by m steps of long division of N(q) by D(q).
d = length(D);
Rm = [N zeros(1,max(0,d-length(N))) zeros(1,m)];
for k = 1:m
    Rm(1:d) = Rm(1:d) - Rm(1)/D(1)*D;   % subtract the leading term
    Rm(1) = [];                         % shift: one more power of q^{-1}
end

To find the m-step predictor for X_t = (1/2)X_{t−3} + ε_t we type at the MATLAB prompt

D = [1 0 0 -1/2];   % D(q) = 1 - (1/2) q^{-3}
N = 1;              % N(q) = 1
m = 2;              % for example
Rm = predic(D,N,m)

Determine for X_t = (1/2)X_{t−3} + ε_t the m-step predictors for m = 1, 2, 3, 4, 5, 6, 7 and interpret the results.


(2.166) Skeleton of the long division: the divisor D(q) = 1 − a_1 q^{-1} − ··· − a_n q^{-n} is divided into the dividend N(q) = b_0 + b_1 q^{-1} + ··· + b_k q^{-k}, producing the quotient terms h_0 = b_0, h_1 q^{-1}, ..., h_{m−1} q^{1−m}; after m division steps the terms that remain, r_m q^{-m} + ···, constitute the remainder q^{-m}R(q).


3 Estimators

3.1 Introduction

In this short chapter we present a survey of several im-portant statistical notions that are needed for time seriesanalysis. Section 3.2 is devoted to normally distributedprocesses and the multi-dimensional normal distribu-tion. Section 3.3 presents estimators and their proper-ties, the maximum likelihood principle and estimationmethod, and the Cramér-Rao inequality. We mostly dealwith stochastic processes of which at most the first andsecond order moments are known or are to be estimated,or processes that are normally distributed. For such pro-cesses linear estimators are a natural choice. These areconsidered in Section 3.4.

3.2 Normally distributed processes

3.2.1 Normally distributed processes

Up to this point the discussion has remained limited tothe first and second order properties of stochastic pro-cesses, that is, their mean and covariance function. Otherproperties, in particular the probability distribution ofthe process, have not been considered. There has beenno mention, for instance, of the probability distributionof the uncorrelated stochastic variables ǫt , t ∈Z, that de-fine white noise.

For some applications it is not necessary to intro-duce assumptions on the probability distributions be-yond the second-order properties. Sometimes it cannotbe avoided, however, that more is assumed to be known.A common hypothesis, which may be justified for manyapplications, is that the process is normally distributed.

A process Zt , t ∈ T, is normally distributed if all jointprobability distributions of Zt1 , Zt2 , . . . , Ztn

, are multi-dimensional normal distributions. Multi-dimensionalnormal distributions are also referred to as multi-dimensional Gaussian distributions. Several results andformulas from the theory of multi-dimensional normaldistributions are summarized in § 3.2.2 (p. 31) and § 3.2.3(p. 33).

Multi-dimensional normal probability distributions of, say, the n stochastic variables Z_1, Z_2, ..., Z_n, are completely determined if the n expectations

E Z_i,   i = 1, 2, ..., n,

and the n^2 covariances

cov(Z_i, Z_j),   i = 1, 2, ..., n,   j = 1, 2, ..., n,

are known.

3.2.2 Multi-dimensional normal distributions

For completeness and for later use we briefly summarizethe theory and formulas of multi-dimensional normallydistributed stochastic variables.

A (scalar) stochastic variable Z is said to be normally distributed if

1. Z = µ with probability 1, or

2. Z has the probability density function

f_Z(z) = 1/(σ√(2π)) e^{−(z−µ)^2/(2σ^2)}.

Here µ and σ are real constants with σ > 0. In both cases µ is the expectation of Z. In the first case Z has variance 0, in the second it has variance σ^2 > 0. In the first case Z is said to be singularly normally distributed. If µ = 0 and σ = 1 then Z is said to have a standard normal distribution.

From elementary probability theory it is known that if Z_1 and Z_2 are two independent normally distributed stochastic variables then every linear combination a_1 Z_1 + a_2 Z_2 of Z_1 and Z_2, with a_1 and a_2 real constants, is also normally distributed. Likewise, every linear combination a_1 Z_1 + a_2 Z_2 + ··· + a_n Z_n of the n independent normally distributed stochastic variables Z_1, Z_2, ..., Z_n is normally distributed.

Let Z_1, Z_2, ..., Z_n be n mutually independent stochastic variables with standard normal distributions and consider k linear combinations X_1, X_2, ..., X_k of the form

X_i = m_i + ∑_{j=1}^{n} a_{ij} Z_j,   i = 1, 2, ..., k.   (3.1)

Here the m_i and a_{ij} are real constants. Each of the stochastic variables X_i is (scalar) normally distributed. If the k stochastic variables X_1, X_2, ..., X_k may be represented in the form (3.1), with Z_1, Z_2, ..., Z_n independent normally distributed stochastic variables with standard distributions, then X_1, X_2, ..., X_k are said to be jointly normally distributed.

We investigate what the joint probability density function of X_1, X_2, ..., X_k is, if it exists. Define the random vectors

X = [X_1; X_2; ...; X_k],   Z = [Z_1; Z_2; ...; Z_n],

and let A be the k × n matrix with entries a_{ij} and m the k-dimensional column vector with entries m_i. Then we may write (3.1) as

X = AZ + m.   (3.2)


According to elementary probability theory the joint probability density function of the independent stochastic variables Z_1, Z_2, ..., Z_n, each with a standard normal distribution, is

f_Z(z) = ∏_{j=1}^{n} f_{Z_j}(z_j) = ∏_{j=1}^{n} (1/√(2π)) e^{−z_j^2/2} = (1/(√(2π))^n) e^{−(1/2)∑_{j=1}^{n} z_j^2} = (1/(√(2π))^n) e^{−(1/2) z^T z}.   (3.3)

Here z is the column vector with componentsz 1, z 2, . . . , z n . The superscript T indicates the trans-pose. To determine the joint probability density functionf X of X1, X2, . . . , Xn from fZ we use the following theorem.

Theorem 3.2.1 (Probability density under transformation). Let Z be a vector-valued stochastic variable of dimension N with joint probability density function f_Z(z). Furthermore, let g: R^N → R^N be a differentiable bijective map with inverse g^{-1}. Define with the help of this map the stochastic variable

X = g(Z).

Then X has the joint probability density function

f_X(x) = f_Z(g^{-1}(x)) |det J(x)|.   (3.4)

J is the Jacobian matrix of h = g^{-1}. The entry J_{ij} of the N × N matrix J is given by

J_{ij}(x) = ∂h_i/∂x_j (x).   (3.5)

Proof. The probability density f_X is defined by

f_X(x_1, x_2, ..., x_N) = ∂^N F_X(x_1, x_2, ..., x_N) / (∂x_1 ∂x_2 ··· ∂x_N).

F_X is the joint probability distribution F_X(x) = Pr(X ≤ x), with x = (x_1, x_2, ..., x_N). Here an inequality between vectors is taken entry by entry. Because X = g(Z) we have

F_X(x) = Pr(X ≤ x) = Pr(g(Z) ≤ x) = ∫_{g(z) ≤ x} f_Z(z) dz.

By changing the variable of integration to η = g(z) it follows from calculus that

F_X(x) = ∫_{η ≤ x} f_Z(g^{-1}(η)) |det J(η)| dη.

We finally obtain (3.4) by partial differentiation of F_X(x) with respect to the components of x.

To apply Theorem 3.2.1 to (3.2) we need to assume that k = n, so that A is square, and that A is non-singular. Then the map g and the inverse map g^{−1} are defined by g(z) = Az + m and g^{−1}(x) = A^{−1}(x − m). The Jacobian matrix of g^{−1} is J(x) = A^{−1}. By application of (3.4) to (3.3) it now follows that the probability density function of the normally distributed vector-valued stochastic variable X is given by

   f_X(x) = (1/((√(2π))^n |det A|)) e^{−(1/2)(x−m)ᵀ(AAᵀ)^{−1}(x−m)}
          = (1/((√(2π))^n (det AAᵀ)^{1/2})) e^{−(1/2)(x−m)ᵀ(AAᵀ)^{−1}(x−m)}.     (3.6)

It is easy to identify the parameters m and AAᵀ that occur in this probability density function. By taking the expectation of both sides of (3.2) it follows immediately that¹

   m_X := EX = m.

The matrix

   Σ_X := E[(X − EX)(X − EX)ᵀ]

is called the variance matrix or covariance matrix of the stochastic vector X. It follows from (3.2) that

   Σ_X = E[AZZᵀAᵀ] = A E(ZZᵀ) Aᵀ = AAᵀ,

because EZZᵀ = I. With this we may rewrite (3.6) as

   f_X(x) = (1/((√(2π))^n (det Σ_X)^{1/2})) e^{−(1/2)(x−m_X)ᵀ Σ_X^{−1} (x−m_X)}.     (3.7)

This is the general form of the joint normal probability density function. The following facts may be proved:

1. If conversely the stochastic vector X has the probability density (3.7) with Σ_X a symmetric positive definite matrix then X is normally distributed with expectation m_X and variance matrix Σ_X.

2. Let X = AZ + m, with the entries of Z independent with standard normal distributions. Then X is normally distributed with expectation m_X = m and variance matrix Σ_X = AAᵀ.

   a) If A has full row rank then Σ_X is non-singular and X has the probability density function (3.7).

   b) If A does not have full row rank then the variance matrix Σ_X = AAᵀ is singular. Then X has no probability density function and is said to be singularly multi-dimensionally normally distributed.

¹The expectation of a matrix is the matrix of expectations.


Example 3.2.2 (Two dimensions). Suppose that

   [X_1; X_2] = [α 1; 0 2] [Z_1; Z_2]

with Z_1, Z_2 independent zero mean unit variance normally distributed stochastic variables. If α = 0 then X_2 = 2X_1, hence for small values of α we expect that X_2 is “close” to 2X_1. Figure 3.1 depicts the joint probability density² f_X(x_1, x_2) for α = 1/2. Clearly the mass of f_X(x_1, x_2) is centered around the line x_2 = 2x_1. ∎

[Figure 3.1: Joint probability density f_X(x_1, x_2), see Example 3.2.2.]

²For reasons of exposition f_X is scaled with a factor 5 in Fig. 3.1.
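
The construction (3.2) is also how jointly normally distributed samples are generated numerically. The following MATLAB fragment is a minimal sketch — the matrix A, the mean vector m and the sample size are illustrative choices, not part of the example above — that draws samples of X = AZ + m and compares the sample mean and sample covariance matrix with m and AAᵀ.

% Sketch: simulate X = A*Z + m and compare sample moments with m and A*A'
A = [0.5 1; 0 2];              % the matrix of Example 3.2.2 with alpha = 0.5
m = [1; -1];                   % an arbitrary expectation vector
K = 10000;                     % number of independent samples
Z = randn(2,K);                % standard normally distributed samples
X = A*Z + m*ones(1,K);         % jointly normally distributed samples
mean(X,2)                      % should be close to m
cov(X')                        % should be close to Sigma_X = A*A'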

3.2.3 Characteristic function

If Z is an n-dimensional vector-valued stochastic variable then

   ψ_Z(u) := E e^{i uᵀZ},

with u an n-dimensional vector, is called the characteristic function of Z. If Z possesses a probability density f_Z then we have

   ψ_Z(u) = ∫_{R^n} f_Z(z) e^{i uᵀz} dz.     (3.8)

This expression shows that ψ_Z is nothing but a multi-dimensional Fourier transform of f_Z. For the multi-dimensional normal probability density (3.7) we have

   ψ_X(u) = (1/((√(2π))^n (det Σ_X)^{1/2})) ∫_{R^n} e^{i uᵀx − (1/2)(x−m_X)ᵀ Σ_X^{−1} (x−m_X)} dx.     (3.9)

Rewriting the exponent yields

   i uᵀx − (1/2)(x − m_X)ᵀ Σ_X^{−1} (x − m_X)
      = −(1/2)(x − m_X − iΣ_X u)ᵀ Σ_X^{−1} (x − m_X − iΣ_X u) − (1/2) uᵀ Σ_X u + i uᵀ m_X.

Substitution into (3.9), using the fact that the integral over R^n of an n-dimensional probability density function equals 1, yields

   ψ_X(u) = e^{−(1/2) uᵀ Σ_X u + i uᵀ m_X}.     (3.10)

This formula also holds if X is singularly normally distributed.

A useful application of the characteristic function is the computation of moments of stochastic variables (see Problem 3.2, p. 38).

3.3 Foundations of time series analysis

3.3.1 Introduction

In Chapter 2 we reviewed stochastic processes as models for time series. The remainder of these lecture notes deals with the question how, on the basis of observations of the time series, inferences may be made about the properties of the underlying process.

One such problem is to estimate the covariance function based on observations of part of the time series. Other problems arise if for instance the structure of the covariance function is known but the values of certain parameters that occur in the model need to be estimated.

By way of example, consider a time series that is a realization of a wide-sense stationary process X_t, t ∈ Z, with mean EX_t = m and covariance function cov(X_t, X_s) = r(t − s). If part of a realization x_t, t = 0, 1, ..., N − 1, of this process³ is given then an obvious estimate of the mean m is the sample mean

   m̂_N = (1/N) Σ_{t=0}^{N−1} x_t.     (3.11)

The circumflex denotes that we deal with an estimate. The index N indicates that the result depends on the number of observations N.

Naturally we immediately wonder how accurate m̂_N is as an estimate of m. With this question we arrive in the realm of mathematical statistics. There is something special, however. Elementary statistics usually deals with samples of stochastic variables that are mutually independent and identically distributed. The numbers x_t that occur in the sum in (3.11), however, are samples of dependent stochastic variables. This means that well-known results from elementary statistics about the variance of sample means do not apply.

3.3.2 Estimates and estimators

We work with a stochastic process X_t, t ∈ T. In our applications we usually have T = Z. The subset S of T is the set of the time instants at which a realization of the process has been observed. Often S = {0, 1, ..., N − 1}.

³Realizations of a process X_t, t ∈ T, are denoted as x_t, t ∈ T.


Let θ be a parameter of the stochastic model of the process. In the case of a wide-sense stationary process, θ could be the mean m of the process. We consider the question how to obtain an estimate of the parameter θ from the observed realization x_t, t ∈ S. An estimator s is an operation on x_t, t ∈ S, that produces an estimate s(x_t, t ∈ S) for θ. An estimator hence is a map. The image is the estimate.

Before considering how an estimator actually may be found we concentrate on the question how the properties of a given estimator may be characterized. Because for different realizations of the process different outcomes of the estimate result, we may evaluate the properties of the estimator by studying the stochastic variable

   S = s(X_t, t ∈ S).

For the estimator of the mean m of a wide-sense stationary process that we proposed in (3.11) we thus study the stochastic variable

   S_N = (1/N) Σ_{t=0}^{N−1} X_t.     (3.12)

An estimator S = s(X_t, t ∈ S) is called an unbiased estimator for the parameter θ if

   ES = θ.

The estimator (3.12) is an unbiased estimator for the mean of the wide-sense stationary process X_t, t ∈ Z. A biased estimator has a systematic error, called bias. If the bias is known (which often is not the case) then it may be corrected.

It sometimes happens that an estimator is biased for a finite number of observations but becomes unbiased as the number of observations approaches infinity. Denote

   S_N = s(X_t, t ∈ {0, 1, ..., N}).     (3.13)

Then S_N is called an asymptotically unbiased estimator of θ if

   lim_{N→∞} ES_N = θ.

An unbiased or asymptotically unbiased estimator is not necessarily a good estimator. An important quantity that determines the quality of the estimator is the mean square estimation error

   E(S − θ)².

The larger the mean square error is, the less accurate the estimator is. If the estimator is unbiased then the mean square estimation error is precisely the variance of the estimator.

Often we may expect that we can estimate more accurately by accumulating more and more observations. An estimator S_N as in (3.13) is called consistent if for every ε > 0

   lim_{N→∞} Pr(|S_N − θ| > ε) = 0.

In this case the estimator converges in probability to the true value θ.

Theorem 3.3.1 (Sufficient condition for consistency). If S_N is an unbiased or asymptotically unbiased estimator for θ and

   lim_{N→∞} var(S_N) = 0

then S_N is consistent.

Proof. We use the Chebyshev inequality, which states that for every stochastic variable Z and every number ε > 0 there holds

   E(Z²) = ∫_{−∞}^{∞} z² dF_Z ≥ ∫_{z² ≥ ε²} z² dF_Z ≥ ε² ∫_{z² ≥ ε²} dF_Z = ε² Pr(Z² ≥ ε²) = ε² Pr(|Z| ≥ ε).

For Z = S_N − θ this states that

   0 ≤ Pr(|S_N − θ| ≥ ε) ≤ E(S_N − θ)²/ε² = [var(S_N) + (ES_N − θ)²]/ε².

Because asymptotic unbiasedness implies that lim_{N→∞} ES_N = θ and by the assumption that lim_{N→∞} var(S_N) = 0 we immediately have

   lim_{N→∞} Pr(|S_N − θ| ≥ ε) = 0.

This completes the proof.

3.3.3 Maximum likelihood estimators

In the next sections and chapters we introduce estimators for the mean value, covariance function and spectral density function of wide-sense stationary processes and discuss the statistical properties of these estimators. The formulas for these estimators are based on sample averages. There also are other methods to find estimators. A well-known and powerful method is the maximum likelihood principle. We discuss this idea.

Suppose that X_1, X_2, ..., X_N are stochastic variables whose joint probability density function depends on an unknown parameter θ. This may be a vector-valued parameter. We denote the joint probability density function as

   f_{X_1, X_2, ..., X_N}(x_1, x_2, ..., x_N; θ).     (3.14)


Suppose that x_1, x_2, ..., x_N are observed sample values of the stochastic variables. By substitution of these values into (3.14) the expression (3.14) becomes a function of θ. If

   f_{X_1, ..., X_N}(x_1, ..., x_N; θ_1) > f_{X_1, ..., X_N}(x_1, ..., x_N; θ_2)

then we say that θ_1 is a more likely value of θ than θ_2. We therefore select the value of θ that maximizes the likelihood function

   f_{X_1, ..., X_N}(x_1, ..., x_N; θ)

as the estimate θ̂ of the parameter. Often it turns out to be more convenient to work with the log likelihood function

   L(x_1, x_2, ..., x_N; θ) = log f_{X_1, ..., X_N}(x_1, ..., x_N; θ).

The maximum likelihood method provides a straightforward recipe to find estimators. We apply the method in Chapter 5 (p. 59) to the identification of ARMA models. Disadvantages of the method are that analytical formulas need to be available for the probability density function and that the maximization of the likelihood function may be quite cumbersome.

There also are methodological objections to the maximum likelihood principle. These objections appear to be unfounded because in many situations maximum likelihood estimators have quite favorable properties. They often are efficient. This property is defined in the next section.
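
As an illustration of the recipe, the following MATLAB sketch evaluates the log likelihood function of the normal case treated in Example 3.3.3 (independent normal observations with known σ and unknown mean m) on a grid of candidate values and picks the maximizing value. The true mean, σ and the grid are illustrative assumptions; the point is only that the maximizer practically coincides with the sample mean.

% Sketch: maximum likelihood estimate of the mean of i.i.d. normal data
% by direct maximization of the log likelihood over a grid
N = 200; sigma = 2;
x = 3 + sigma*randn(N,1);           % observations with (illustrative) true mean 3
mgrid = linspace(0,6,601);          % candidate values of m
L = zeros(size(mgrid));
for i = 1:length(mgrid)
  L(i) = -N*log(sigma*sqrt(2*pi)) - sum((x-mgrid(i)).^2)/(2*sigma^2);
end
[Lmax,imax] = max(L);
mML = mgrid(imax)                   % practically equal to mean(x)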

3.3.4 The Cramér-Rao inequality

The well known Cramér-Rao inequality provides a lower bound for the variance of an unbiased estimator. At this point it is useful to introduce two shorthands for the first two partial derivatives: from now on we use L_θ and L_θθ to mean

   L_θ(x, θ) = ∂L(x, θ)/∂θ,   L_θθ(x, θ) = ∂²L(x, θ)/∂θ².

Theorem 3.3.2 (Cramér-Rao inequality). Suppose that the probability distribution of the stochastic variables X_1, X_2, ..., X_N depends on a scalar unknown parameter θ. Let S = s(X), with X = (X_1, X_2, ..., X_N), be an unbiased estimator for θ. Denote the log likelihood function of the stochastic variables as L(x, θ), with x = (x_1, x_2, ..., x_N).

1. Then

   var(S) ≥ 1/M(θ),     (3.15)

   where

   M(θ) = E[L_θ(X, θ)]² = −E L_θθ(X, θ).     (3.16)

2. var(S) = 1/M(θ) if and only if

   L_θ(x, θ) = M(θ)[s(x) − θ].     (3.17)

Proof. See Appendix A (p. 91).

Example 3.3.3 (Cramér-Rao inequality). By way of example we consider the case that X_1, X_2, ..., X_N are independent, normally distributed stochastic variables with unknown expectation m and standard deviation σ. The joint probability density function and log likelihood function are

   f(x, m) = (1/(σ√(2π))^N) e^{−(1/(2σ²)) Σ_{i=1}^{N} (x_i − m)²},

   L(x, m) = −N log(σ√(2π)) − (1/(2σ²)) Σ_{i=1}^{N} (x_i − m)².

Differentiating twice with respect to m we find

   L_m(x, m) = (1/σ²) Σ_{i=1}^{N} (x_i − m),   L_mm(x, m) = −N/σ².

According to Cramér-Rao we thus have for every unbiased estimator S of m that

   var(S) ≥ σ²/N.     (3.18)

A well known estimator for m is the sample average

   m̂_N = (1/N) Σ_{i=1}^{N} X_i.

This estimator is unbiased with variance

   var(m̂_N) = E[(1/N) Σ_{i=1}^{N} (X_i − m)]² = σ²/N.

The variance of this estimator equals the lower bound. It is not difficult to check that (3.17) applies. ∎
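
A simple Monte-Carlo experiment makes the bound (3.18) tangible. The sketch below — sample size, σ and the number of repetitions are arbitrary illustrative choices — repeats the experiment many times and compares the observed variance of the sample average with σ²/N.

% Sketch: Monte-Carlo check of the Cramer-Rao bound sigma^2/N
N = 50; sigma = 2; K = 5000;        % K independent repetitions
mhat = zeros(K,1);
for k = 1:K
  mhat(k) = mean(sigma*randn(N,1)); % sample average of N observations with m = 0
end
var(mhat)                           % close to sigma^2/N = 0.08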

Unbiased estimators whose variance equals the lower bound of the Cramér-Rao inequality are said to be efficient. If equality is attained in the limit⁴ N → ∞ then the estimator is called asymptotically efficient.

Maximum likelihood estimators have the following properties.

1. If an efficient estimator S of θ exists then S is a maximum likelihood estimator. The reason is that if S is efficient then by Theorem 3.3.2 (p. 35)

   L_θ(x, θ) = M(θ)[s(x) − θ].

   Now if θ̂_ML is the maximum likelihood estimator of θ, then L_θ(X, θ̂_ML) = 0, so that M(θ̂_ML)[s(x) − θ̂_ML] = 0. Normally M is invertible so that necessarily s(x) = θ̂_ML.

⁴In the sense that the ratio of var(S) and 1/M(θ) approaches 1 as N → ∞.


2. In many situations maximum likelihood estimators are asymptotically efficient.

In contrast to the latter statement is the fact that for small sample sizes the properties of maximum likelihood estimators are reputed to be less favorable than those of other estimators.

The Cramér-Rao inequality may be extended to the case that the parameter θ is vector-valued.

Theorem 3.3.4 (Cramér-Rao inequality for the vector case). Suppose that the probability distribution of the stochastic variables X_1, X_2, ..., X_N depends on an unknown vector-valued parameter

   θ = [θ_1; θ_2; ···; θ_n].

Let

   S = [S_1; S_2; ···; S_n] = s(X) = [s_1(X); s_2(X); ···; s_n(X)],

with X = (X_1, X_2, ..., X_N), be an unbiased estimator of θ. Denote the log likelihood function of the stochastic variables as L(x, θ), with x = (x_1, x_2, ..., x_N). Denote the gradient and the Hessian of L with respect to θ successively as

   L_θ(x, θ) = [∂L(x,θ)/∂θ_1; ∂L(x,θ)/∂θ_2; ···; ∂L(x,θ)/∂θ_n],

   L_θθᵀ(x, θ) = the n × n matrix with (i, j) entry ∂²L(x,θ)/(∂θ_i ∂θ_j).

Furthermore, denote the variance matrix of S as

   var(S) = E[(S − ES)(S − ES)ᵀ],

the n × n matrix with diagonal entries var(S_i) and off-diagonal entries cov(S_i, S_j). Then

   var(S) ≥ M(θ)^{−1},     (3.19)

where

   M(θ) = E[L_θ(X, θ) L_θᵀ(X, θ)] = −E L_θθᵀ(X, θ).     (3.20)

A proof of Theorem 3.3.4 is listed in Appendix A. The expectation of a matrix with stochastic entries is taken entry by entry. The inequality A ≥ B, with A and B symmetric matrices such as in (3.19), by definition means that the matrix A − B is nonnegative definite⁵. From (3.19) it follows that in particular

   var(S_i) ≥ R_{ii},   i = 1, 2, ..., n.

The numbers R_{ii} are the diagonal entries of the matrix R = M(θ)^{−1}. In mathematical statistics the matrix M(θ) is known as Fisher’s information matrix.

3.4 Linear estimators

[Figure 3.2: Linear approximation. The plot shows observed points (t_k, x_k), the approximating straight line, and the errors |ε_k|.]

3.4.1 Linear estimators

One of the classical estimation problems is to approximate a set of observed points (t_k, x_k) ∈ R² by a function linear in t. See Fig. 3.2. In the absence of assumptions on the stochastic properties of X_t the approximating straight line x = a + b t is often taken to be the one that minimizes the sum of squares

   Σ_k (x_k − (a + b t_k))²     (3.21)

with respect to a and b. In this section we review this and other linear estimation problems.

It will be convenient to stack the observations X_1, X_2, ..., X_N in a column vector denoted X,

   X = [X_1; X_2; ...; X_N].

This allows us to write the least squares problem (3.21) in a more compact vector notation as the problem of minimizing ‖ε‖² = εᵀε where ε is the vector of equation errors

⁵A square real matrix M is positive definite if xᵀMx > 0 for every real vector x ≠ 0 (of correct dimension). It is non-negative definite if xᵀMx ≥ 0 for every vector x (of correct dimension).


defined via

   X = W θ + ε,

with

   X = [X_1; X_2; ...; X_N],   W = [1 t_1; 1 t_2; ...; 1 t_N],   θ = [a; b],   ε = [ε_1; ε_2; ...; ε_N].

More generally, we consider in this section the problem of obtaining estimators of θ, when all we are given are observations X and we know that

   X = W θ + ε

with W a given matrix. Depending on what we assume for ε different estimators result. A possibility is to use least squares approximation as indicated above.

Lemma 3.4.1 (Least squares). Suppose W is a full column rank matrix. Then there is a unique θ̂ that minimizes ‖ε‖² and it is given by

   θ̂ = (WᵀW)^{−1} Wᵀ X.

Proof. The quantity to be minimized is ‖ε‖² = εᵀε = (X − Wθ)ᵀ(X − Wθ) and this is quadratic in θ. Quadratic functions that have a minimum are minimal if and only if their gradient is zero. The gradient of (X − Wθ)ᵀ(X − Wθ) with respect to θ is

   (∂/∂θ)[(X − Wθ)ᵀ(X − Wθ)] = −2Wᵀ(X − Wθ) = −2[WᵀX − (WᵀW)θ].

This is zero if and only if θ = (WᵀW)^{−1}WᵀX.
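
For the straight-line problem (3.21) the lemma amounts to the familiar least squares fit. The MATLAB sketch below, with arbitrarily chosen true coefficients and noise level, forms W and evaluates the formula of Lemma 3.4.1; MATLAB's backslash operator gives the same answer.

% Sketch: least squares fit of a straight line x = a + b*t
N = 100;
t = (1:N)';
theta0 = [1; 0.5];                  % illustrative true values of a and b
W = [ones(N,1) t];
X = W*theta0 + 0.3*randn(N,1);      % noisy observations
theta = (W'*W)\(W'*X)               % the formula of Lemma 3.4.1
theta_bs = W\X                      % the same estimate via the backslash operator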

Suppose now that we have some knowledge of the stochastic properties of ε. To begin with, assume that the entries ε_t of ε come from a zero mean white noise process. Then

   var(ε) = E(εεᵀ) = σ² I_N.

Because of the relation X = Wθ + ε it follows that also X is random, and therefore so is any sensible estimator θ̂ = S(X). In the least squares problem the estimator θ̂ turns out to be linear in the observations X. To keep the problems tractable we will now limit the class of estimators to linear estimators, that is, we want the estimator θ̂ to be of the form

   θ̂ = K X, for some matrix K.

Since we know that var(ε) = σ² I_N, the least squares solution that aims to minimize ‖ε‖² is no longer well motivated. What is well motivated is to find an estimate θ̂ of θ that minimizes

   E‖θ − θ̂‖² = Σ_j E(θ_j − θ̂_j)².

Lemma 3.4.2 (Linear unbiased minimum variance estimator). Consider X = Wθ + ε and assume that Eε = 0 and var(ε) = σ² I_N. The following holds.

1. A linear estimator θ̂ = KX of θ is unbiased if and only if KW = I. In particular this shows that linear unbiased estimators exist if and only if W has full column rank;

2. The linear unbiased estimator θ̂ = KX that minimizes E‖θ − θ̂‖² is

   θ̂ = (WᵀW)^{−1}WᵀX     (3.22)

   in which case

   var(θ̂) = σ²(WᵀW)^{−1}

   and

   E‖θ − θ̂‖² = σ² tr(WᵀW)^{−1};

3. For any linear unbiased estimator p̂ of θ there holds that

   var(p̂) ≥ var(θ̂)

   where θ̂ is the estimator defined in (3.22).

A proof is listed on Page 93. Note that this results in the same estimator as the least squares estimator. Condition 3 states that among the unbiased linear estimators, the minimum variance estimator (3.22) is the one that achieves the smallest variance matrix. Here the matrix inequality is in the sense that var(p̂) − var(θ̂) is a nonnegative definite matrix. We may interpret this as a Cramér-Rao type lower bound but then with respect to the linear estimators. It is important to realize that Condition 3 is only about linear estimators p̂. It may be possible that nonlinear estimators exist that outperform the linear ones. However if all we know are the first and second order moments of the underlying stochastic process then linear estimators are a natural choice. Also, as we shall now see, the linear estimators are maximum likelihood estimators in the case that the ε_t are zero mean normally distributed white noise. In that case these estimators are optimal with respect to any unbiased estimator, linear or nonlinear.

Suppose that the ε_t are zero mean jointly normally distributed white noise with variance σ². The vector ε is then a vector-valued Gaussian process, with joint probability density function

   f_ε(ε) = (1/(σ√(2π))^N) e^{−‖ε‖²/(2σ²)}.

From X = Wθ + ε we get that X is normally distributed with mean Wθ and variance σ²I. So

   f_X(X) = (1/(σ√(2π))^N) e^{−‖X − Wθ‖²/(2σ²)}.


Its log likelihood function is

   L = −N log(σ√(2π)) − ‖X − Wθ‖²/(2σ²).

The maximum likelihood estimators of θ and σ follow similarly as in Example 3.3.3 by differentiation of L with respect to θ and σ. Here, though, we need the vector valued derivative with respect to θ, i.e. the gradient,

   L_θ = (1/σ²) Wᵀ(X − Wθ),   L_σ = −N/σ + ‖X − Wθ‖²/σ³.

The gradient L_θ is zero if and only if θ = (WᵀW)^{−1}WᵀX, and then the derivative L_σ is zero if and only if σ² = ‖X − Wθ‖²/N. Again we found the same estimator for θ, and as a result the variance matrix var(θ̂) again equals σ²(WᵀW)^{−1}. The Hessian L_θθᵀ is readily seen to be

   L_θθᵀ = −(1/σ²) WᵀW.

Fisher’s information matrix hence is M(θ) = −E L_θθᵀ = (1/σ²)(WᵀW), but this is precisely the inverse of the variance matrix, so the Cramér-Rao lower bound (3.19) is attained. Therefore:

Lemma 3.4.3 (Maximum likelihood). Suppose ε_t is a zero mean normally distributed white noise process with variance σ². Consider X = Wθ + ε as defined above and assume that W has full column rank. Then the maximum likelihood estimators of θ and σ² are

   θ̂_ML = (WᵀW)^{−1}WᵀX,   σ̂²_ML = (1/N) ‖X − Wθ̂_ML‖²

and the estimator θ̂_ML is efficient with var(θ̂_ML) = σ²(WᵀW)^{−1}. ∎

3.5 Problems

3.1 Stationary normally distributed processes. Suppose that the stochastic process X_t, t ∈ Z, is normally distributed. Prove that the process is wide-sense stationary if and only if it is strictly stationary.

3.2 Fourth-order moment of normally distributed stochastic variables. Let X_1, X_2, X_3 and X_4 be four jointly distributed stochastic variables with characteristic function ψ(u_1, u_2, u_3, u_4).

   a) Show that

      EX_1X_2X_3X_4 = ∂⁴ψ(u_1, u_2, u_3, u_4)/(∂u_1 ∂u_2 ∂u_3 ∂u_4), evaluated at u_1 = u_2 = u_3 = u_4 = 0.

   b) Prove with the help of (3.10) that if X_1, X_2, X_3 and X_4 are zero mean jointly normally distributed random variables then

      EX_1X_2X_3X_4 = EX_1X_2 · EX_3X_4 + EX_1X_3 · EX_2X_4 + EX_1X_4 · EX_2X_3.     (3.23)

3.3 Normally distributed AR process. Let {ε_t}_{t≥0} be a jointly normally distributed white noise process. Consider the AR(n) process {X_t}_{t≥0} defined by D(q)X_t = ε_t, t ∈ N. Under which conditions on X_{−n}, X_{−n+1}, ..., X_{−1} is {X_t}_{t≥0} jointly normally distributed?

3.4 Unbiased estimators and the mean square error. Let S be an unbiased estimator of a parameter θ, and consider other estimators

      S_γ = γS,   γ ∈ R.

   Such an estimator S_γ is unbiased if and only if γ = 1 (or θ = 0). Now it is tempting to think that the γ that minimizes the mean square error

      min_γ E(S_γ − θ)²     (3.24)

   is γ equal to 1 as well. That is generally not the case: show that (3.24) is minimized for

      γ = θ²/(var(S) + θ²) ≤ 1.     (3.25)

3.5 Estimation of random walk parameter.⁶ The process defined by

      ∇X_t = ε_t,   t = 1, 2, 3, ...,     (3.26)

   with X_0 = 0, ∇ the difference operator ∇X_t = X_t − X_{t−1}, and ε_t, t = 1, 2, ..., normally distributed zero mean white noise with variance σ², is known as the random walk process.

   a) Compute the mean value function m(t) = EX_t and the covariance function r(t, s) = cov(X_t, X_s) of the process. Is the process wide-sense stationary?

   b) Determine the joint probability density function of X_1, X_2, ..., X_N, with N > 1 an integer.

   c) Prove that the maximum likelihood estimator σ̂² of the variance σ² of the white noise ε_t, t > 0, based on the observations X_1, X_2, ..., X_N, is given by

      σ̂²_N = (1/N) Σ_{k=1}^{N} (X_k − X_{k−1})².     (3.27)

⁶Examination May 25, 1993.


   d) Prove that this estimator is unbiased.

   e) It may be shown that Eε_t⁴ = 3σ⁴. Prove that the estimator (3.27) is consistent, by showing that var(σ̂²_N) → 0 as N → ∞.

3.6 Suppose X_0, ..., X_{N−1} are mutually independent stochastic variables with the same probability density function

      f(x) = (1/λ) e^{−x/λ} if x > 0,   f(x) = 0 if x ≤ 0,

   for some parameter λ > 0. It may be shown that EX_t = λ and EX_t² = 2λ².

   a) Determine the maximum likelihood estimator λ̂ of λ given X_0, ..., X_{N−1}.

   b) Is this estimator λ̂ biased?

   c) Express the Cramér-Rao lower bound for the variance of this estimator λ̂ in terms of λ and N.

   d) Is this estimator λ̂ efficient?

   e) Is this estimator λ̂ consistent?

3.7 Interpretation of the vector-valued Cramér-Rao lower bound⋆. Show that a vector of estimators θ̂ ∈ R^n of θ ∈ R^n is efficient if and only if for every λ_j ∈ R the estimator Σ_{j=1}^{n} λ_j θ̂_j ∈ R is an efficient estimator of Σ_{j=1}^{n} λ_j θ_j.

3.8 Colored noise. Suppose ε ∈ R^N is a vector of stochastic variables all with zero mean, and that

      Eεεᵀ = PPᵀ

   for some nonsingular matrix P.

   a) Determine E ε̃ε̃ᵀ for ε̃ := P^{−1}ε.

   b) Reformulate Lemma 3.4.2 but now for the case that Eεεᵀ = PPᵀ (instead of Eεεᵀ = σ²I_N as assumed in Lemma 3.4.2).

3.9 Linear unbiased minimum variance estimator. Consider the stochastic process X_t = µ + ε_t with ε_t zero mean white noise with variance σ².

   Given observations x_1, ..., x_N determine the linear unbiased minimum variance estimator of µ, using Lemma 3.4.2.

Matlab problems

10. Explain what the MATLAB code listed below does and adapt it to find a polynomial p(t) of sufficiently high degree such that max_{t∈[0,π/2]} |sin(t) − p(t)| is less than 10^{−7}. Supply plots of the error sin(t) − p(t) and compare the results with the Taylor series expansion of sin(t) around t = 0: sin(t) = t − t³/3! + t⁵/5! − t⁷/7! + ···.

    n=5;
    N=10;
    t=linspace(0,2,N);
    X=exp(t)';
    W=ones(N,n+1);
    for k=1:n
      W(:,k+1)=W(:,k).*t';
    end
    theta=W\X;     % this means theta=(W'*W)\(W'*X)
    plot(t,X,t,W*theta);


4 Non-parametric time series analysis

4.1 Introduction

In this chapter we discuss non-parametric time series analysis. This includes “classical” time series analysis.

Classical time series analysis consists of a number of relatively simple techniques to identify and isolate trends and seasonal effects. These techniques are mainly based on moving averaging methods. The methodology is complemented with a collection of non-parametric statistical tests for stochasticity and trend.

Classical methods are still used extensively in practical time series analysis. There exists a well-known software package, known as Census X-11, which implements these methods. The American Bureau of the Census developed the package. Many national and international institutes that gather and process statistical data use this package.

Under non-parametric time series analysis we also include methods to estimate covariance functions and spectral density functions. Parametric time series analysis, which is the subject of Chapter 5 (p. 59), deals with the estimation of parameters of more or less structured models for time series, in particular, ARMA models.

In § 4.2 (p. 41) some well known non-parametric tests for stochasticity and trend are reviewed. A brief discussion of a number of ideas from classical time series analysis follows in § 4.3 (p. 41).

The first subject from non-parametric time series analysis that we discuss is the estimation of the mean of a wide-sense stationary process in § 4.5 (p. 43). We introduce an obvious estimator and investigate its consistency. Next in § 4.6 (p. 45) we discuss the estimation of the covariance function of a wide-sense stationary process. Section 4.7 (p. 46) considers the estimation of the spectral density function. Again the unbiasedness and consistency of the estimates are investigated. “Windowing” in the time and frequency domain is extensively discussed, and a brief presentation of the fast Fourier transform is included.

In § 4.7 (p. 53) some aspects of non-parametric estimation problems for continuous-time processes are reviewed.

4.2 Tests for stochasticity and trend

This and the next section follow Chapters 2, 3 and 4 of Kendall and Orr (1990).

The first question that may be asked when an observed time series x_t, t = 1, 2, ..., n, is studied is whether it is a purely random sequence, that is, whether it is a realization of white noise. The following well known non-parametric test may provide an indication of the answer to this question. A time instant k is a turning point of the observed sequence if

   x_{k−1} < x_k > x_{k+1}     (4.1)

(the sequence exhibits a peak) or

   x_{k−1} > x_k < x_{k+1}     (4.2)

(the sequence has a local minimum). If the sequence has length n then it has at most n − 2 turning points. Let y be the total number of turning points. The null hypothesis is that the sequence is a realization of a sequence of mutually independent identically distributed stochastic variables (that is, of white noise). Under this hypothesis the expectation and variance of y equal

   Ey = 2(n − 2)/3,   var(y) = (16n − 29)/90.     (4.3)

A proof is listed in Appendix A. With increasing n the distribution of y quickly approaches a normal distribution. With good approximation the statistic

   z = (y − Ey)/√(var(y))     (4.4)

has a standard normal distribution for n sufficiently large. If the observed outcome of z lies outside a suitably chosen confidence interval then the null hypothesis is rejected.

The next test could be whether the sequence contains a trend. A suitable statistic for this is the number of points of increase. A time instant k is a point of increase if x_{k+1} > x_k. If the sequence is a realization of a sequence of mutually independent identically distributed stochastic variables then the number of points of increase z has expectation Ez = (n − 1)/2 and variance var(z) = (n + 1)/12. The outcome of the statistic is an indication of the presence of a trend and its direction.

Null hypotheses that are not rejected are not necessarily true. The use of the tests that have been described therefore is limited. In the literature many other non-parametric tests are known that possibly may be useful.
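
The turning point test is easily carried out numerically. The following MATLAB sketch — applied here to an artificial white noise series, merely as an illustration — counts the turning points and forms the statistic (4.4).

% Sketch: turning point test for randomness
x = randn(200,1);                        % the observed series (here: white noise)
n = length(x);
y = sum(x(2:n-1) > x(1:n-2) & x(2:n-1) > x(3:n)) + ...
    sum(x(2:n-1) < x(1:n-2) & x(2:n-1) < x(3:n));   % number of turning points
Ey   = 2*(n-2)/3;
vary = (16*n-29)/90;
z = (y - Ey)/sqrt(vary)                  % approximately standard normal under H0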

4.3 Classical time series analysis

4.3.1 Introduction

In classical time series analysis a time series x_t has three components:

1. A trend m t .

2. A seasonal component s t .

3. An incidental or random component w t .


These components may be additive but also multiplicative. In the first case we have

   x_t = m_t + s_t + w_t.     (4.5)

In the second case

   x_t = m_t · s_t · w_t.     (4.6)

The multiplicative model may be converted to an additive model by taking the logarithm of the time series. This can only be done if the time series is essentially positive. If a time series exhibits fluctuations that increase with time then this is an indication that a multiplicative model might be appropriate. The cause of the increasing fluctuations then is an increasing trend factor m_t in the multiplicative model (4.6). The immigration into the USA as plotted in Fig. 1.3 (p. 2) shows this pattern.

4.3.2 Trends

In what follows we use the additive model (4.5) and for the time being assume that no seasonal component s_t is present. A well-known method to estimate the trend is to take a moving average of the form

   m̂_t = Σ_{j=−k_1}^{k_2} h_j x_{t−j}.     (4.7)

The circumflex on the m indicates an estimated value. The real numbers h_j are the weights. If k_1 = k_2 = k then the moving average is said to be centered. If in addition h_{−j} = h_j for all j then the moving average is symmetric.

The method that is described in what follows may be used to determine moving averaging schemes in a systematic manner. Suppose — without loss of generality — that we wish to estimate the value of the trend at time 0. Then around this time we could represent the trend as a polynomial, say of degree p, so that

   m_t = a_0 + a_1 t + ··· + a_p t^p.     (4.8)

Next we could determine the coefficients a_0, a_1, ..., a_p by minimizing the sum of squares

   Σ_{t=−n}^{n} (x_t − a_0 − a_1 t − ··· − a_p t^p)².     (4.9)

This makes sense if 2n > p. By way of example choose p = 3 and n = 3. Then

   Σ_{t=−3}^{3} (x_t − a_0 − a_1 t − a_2 t² − a_3 t³)²     (4.10)

needs to be minimized with respect to the coefficients a_0, a_1, a_2 and a_3. Partial differentiation with respect to each of the coefficients and setting the derivatives equal to zero yields the four normal equations

   Σ_{t=−3}^{3} x_t t^j = a_0 Σ_{t=−3}^{3} t^j + a_1 Σ_{t=−3}^{3} t^{j+1} + a_2 Σ_{t=−3}^{3} t^{j+2} + a_3 Σ_{t=−3}^{3} t^{j+3}

for j = 0, 1, 2, 3. The summations over t from −3 to 3 of the odd powers of t cancel, so that the set of equations reduces to

   Σ_{t=−3}^{3} x_t    =   7a_0 +   28a_2,
   Σ_{t=−3}^{3} t x_t  =  28a_1 +  196a_3,
   Σ_{t=−3}^{3} t² x_t =  28a_0 +  196a_2,
   Σ_{t=−3}^{3} t³ x_t = 196a_1 + 1588a_3.     (4.11)

Since according to (4.8) m_0 = a_0 we are only interested in the coefficient a_0. By solving the first and third equations of (4.11) for a_0 it follows that

   m̂_0 = (1/21) (7 Σ_{t=−3}^{3} x_t − Σ_{t=−3}^{3} t² x_t)
        = (1/21) (−2x_{−3} + 3x_{−2} + 6x_{−1} + 7x_0 + 6x_1 + 3x_2 − 2x_3).

The estimate m̂_0 of the trend at time 0 hence follows from a symmetric weighted average of the values of the time series x_t about the time instant 0 with weights

   (1/21) [−2, 3, 6, 7, 6, 3, −2].     (4.12)

These weights define a moving averaging scheme.

In Kendall and Orr (1990) tables may be found that list the weights of moving averaging schemes for different values of the degree of the polynomial p and of the number of terms n in the sum of squares. The schemes are all symmetric. The sum of the weights is always 1, which may easily be checked for the weights of (4.12). The consequence of this is that if x_t is constant then this constant is estimated as the trend — a sensible result.
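
Applying such a scheme is a matter of forming the weighted averages. The MATLAB sketch below — the series is an arbitrary artificial example — estimates the trend with the weights (4.12).

% Sketch: trend estimation with the 7-point weights (4.12)
h = [-2 3 6 7 6 3 -2]/21;            % symmetric weights, sum equal to 1
N = 200; t = (1:N)';
x = 0.01*t.^2 + randn(N,1);          % artificial series: smooth trend plus noise
mhat = NaN*ones(N,1);
for k = 4:N-3
  mhat(k) = h*x(k-3:k+3);            % symmetric weighted average around time k
end
plot(t, x, t, mhat);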

For time instants near the beginning and the end of the time series a scheme of the form (4.7) cannot be used because some of the values of the time series x_t that are needed are missing. This may be remedied by adjusting the number of terms in the sum of squares (4.9) correspondingly. The resulting moving averaging scheme is no longer symmetric.

In the days when no electronic computers were available and the calculations needed to be done with simple tools — recall the photographs of halls filled with mechanical calculators and their operators from the second world war — much effort was spent in searching for efficient schemes. A clever device is the successive application of several moving averaging schemes of short length with simple (integer) weights. Spencer’s formula (from 1904) consists of the successive application of four moving averaging schemes with weights

   [1, 1, 1, 1, 1],
   [1, 1, 1, 1, 1],
   [1, 1, 1, 1, 1, 1, 1],
   (1/350) [−1, 0, 1, 2, 1, 0, −1].     (4.13)


This composite scheme is equivalent to a 21-point scheme that differs little from a least squares scheme. Schemes of this type are still used in actuarial practice.

Which scheme is eventually used depends on the results. It often is found by trial and error.
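
That the four short schemes of (4.13) are indeed equivalent to a single 21-point scheme with weights summing to 1 is easily checked by convolving them, as in the following MATLAB sketch.

% Sketch: compose Spencer's formula (4.13) into one scheme
w1 = [1 1 1 1 1];
w2 = [1 1 1 1 1];
w3 = [1 1 1 1 1 1 1];
w4 = [-1 0 1 2 1 0 -1]/350;
w  = conv(conv(conv(w1,w2),w3),w4);  % the equivalent moving averaging scheme
length(w)                            % 21 points
sum(w)                               % equals 1, so a constant series is reproduced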

4.3.3 Seasonal components

If a seasonal component is present then the period P is usually known. Periods with the values 12 (months) or 4 (quarters) are common.

The seasonal component s_t with period P of the time series

   x_t = m_t + s_t + w_t     (4.14)

may easily be removed on the basis of the assumption that

   Σ_{j=1}^{P} s_{t+j} = 0 for all t.     (4.15)

Application to x_t of a centered P-point moving averaging scheme with weights

   (1/P) [1, 1, ..., 1]     (4.16)

eliminates the seasonal component. The seasonal component is recovered by subtracting the processed time series from the original time series.

If P is even — such as for the examples that were cited — then no centered scheme with the weights (4.16) is feasible. The scheme then is modified to a (P+1)-point centered scheme with weights

   (1/P) [1/2, 1, 1, ..., 1, 1/2].     (4.17)

The approach that has been described assumes that the pattern of the seasonal component does not change with time. The Census X-11 package has modifications and options that permit gradual changes of the seasonal pattern.

Example 4.3.1 (Measurements of water level). Suppose x_t shown in Fig. 4.1(a) represents the water level of the sea measured every 12 minutes. Because of waves, the measurements x_t are very irregular. To get rid of irregularities we average x_t over 11 samples,

   y_t = Σ_{j=−5}^{5} (1/11) x_{t+j}.

This choice of 11 is somewhat arbitrary, and as 11 samples cover about 2 hours, the periodic behavior in x_t is slightly flattened. The result y_t is shown in Fig. 4.1(b). This clearly shows a periodic behavior, and now we also see a definite trend, possibly due to sedimentation. To separate this trend we average y_t over four periods (240 samples),

   m̂_t = Σ_{j=−120}^{120} h_j y_{t+j},

in which

   [h_{−120}, ..., h_{120}] = (1/240) [1/2, 1, 1, ..., 1, 1, 1/2].

See part (c) of Fig. 4.1. Combined, the x_t has now been separated into a trend, a seasonal component and a random component,

   x_t = m̂_t + (y_t − m̂_t) + (x_t − y_t),

with the three terms the trend, the seasonal component and the random component, respectively.

4.4 Estimation of the mean

4.4.1 Estimation of the mean

We consider the estimation of the mean m = EX_t of a wide-sense stationary process X_t, t ∈ Z. As previously established the sample mean

   m̂_N = (1/N) Σ_{t=0}^{N−1} X_t     (4.18)

is an obvious estimator. It is unbiased.

If X_t, t ∈ Z, is normally distributed white noise then X_t, t = 0, 1, ..., N−1, are mutually independent normally distributed with expectation m and standard deviation σ. In Example 3.3.3 (p. 35) we saw that m̂_N then is an efficient estimator of m, with variance σ²/N.

Now consider the general case that X_t, t ∈ Z, is wide-sense stationary with covariance function cov(X_t, X_s) = r(t − s). The variance of the estimator (4.18) then is given by

   var(m̂_N) = (1/N) Σ_{k=−N+1}^{N−1} (1 − |k|/N) r(k).     (4.19)

The proof of (4.19) may be found in Appendix A (p. 93). This implies with Theorem 3.3.1 (p. 34) that the estimator is consistent if

   lim_{N→∞} Σ_{k=−N+1}^{N−1} (1 − |k|/N) r(k) < ∞.     (4.20)

In that case the variance of the estimation error depends like 1/N on the number of observations N. This means that the standard deviation of the estimation error decreases to zero as 1/√N as N → ∞. The error decreases relatively slowly as the number of observations N increases. If the number of observations is taken 100 times larger then the estimation error becomes only 10 times smaller. With the Cramér-Rao lower bound (3.18) in mind, we expect that significantly better estimators cannot be found.

Example 4.4.1 (Estimation of the mean of an AR(1)-process). Consider by way of example the AR(1)-process X_t = a X_{t−1} + ε_t. This process has a stationary covariance function of the form

   r(k) = σ_X² a^{|k|},   k ∈ Z,     (4.21)


[Figure 4.1: Separating noise and seasonal components from x_t. Panel (a): the measured series x_t; panel (b): the 11-point average y_t; panel (c): the trend estimate m̂_t.]

provided |a| < 1. According to (4.19) the variance of the estimator (4.18) equals

   var(m̂_N) = (σ_X²/N) (1 + 2 Σ_{k=1}^{N−1} (1 − k/N) a^k).     (4.22)

Using

   Σ_{k=1}^{N−1} a^k = (a − a^N)/(1 − a)

and

   Σ_{k=1}^{N−1} k a^k = (a − N a^N + (N − 1) a^{N+1})/(1 − a)²

we find that for N large enough

   var(m̂_N) ≈ (σ_X²/N) (1 + a)/(1 − a).     (4.23)

The error in the approximation is of order 1/√N.

For a = 0 the variance (4.23) reduces to σ_X²/N. This is of course precisely the variance that results for white noise. If a increases from 0 to 1 then the error variance also increases. This is caused by the increasing positive correlation between successive values of X_t. The correlation decreases the effective number of observations, see Fig. 4.2.

If on the other hand a decreases from 0 to −1 then the error variance decreases. This somewhat surprising phenomenon is explained by the alternating character of the time series if a is negative. The alternating character decreases the contribution of ε_t to m̂_N, with t fixed. ∎

4.4.2 Ergodicity

If the condition (4.20) is satisfied then it is possible to determine the mean of the process X_t with arbitrary precision from a single realization X_t, t ∈ Z, of the process. A process for which all statistical properties, such as its mean, but also its covariance function, may be determined from a single realization is said to be ergodic.

A characteristic example of a non-ergodic process is the process defined by

   X_t = Y,   t ∈ Z.     (4.24)

[Figure 4.2: Realization x_t and estimate m̂_t with confidence bounds m̂_t ± √((σ_X²/N)(1+a)/(1−a)), for a = 0.2 (left) and a = 0.9 (right).]

Y is a stochastic variable. All realizations of the process are constant. A single realization of the process X_t yields only a single sample of the stochastic variable Y. Hence, determining the stochastic properties of the process X_t, t ∈ Z, from a single realization is out of the question.

Loosely speaking, for ergodicity the process should have a limited temporal relatedness. The assumption that

   lim_{N→∞} (1/N) Σ_{k=0}^{N} |r(k)| = 0     (4.25)

is sometimes referred to as an ergodicity assumption (Kendall and Orr, 1990). When the assumption


holds, then m(t) may be determined with arbitrary precision from a single realization. A slightly stronger assumption that we sometimes require is that the r(k) are absolutely summable,

   Σ_{k=0}^{∞} |r(k)| < ∞.     (4.26)

All wide sense stationary ARMA processes have this property.

4.5 Estimation of the covariance function

4.5.1 An estimator of the covariance function

The covariance function of a wide-sense stationary process is more interesting than its mean. It provides important information about the temporal relatedness of the process. We discuss how the covariance function cov(X_t, X_s) = r(t − s) of a wide-sense stationary process X_t, t ∈ Z, may be estimated.

Suppose that a part x_t, t ∈ {0, 1, ..., N − 1}, of a realization is available. The covariance function cov(X_t, X_s) is defined as r(t − s) = E(X_t − m)(X_s − m), with m = EX_t. For m we may determine an estimate m̂_N as in § 4.4. Then for fixed k we may estimate the covariance function r(k) by taking the sample mean of the product

   (x_{t+k} − m̂_N)(x_t − m̂_N)     (4.27)

over as many values as are available. For k ≥ 0 we may let t run from 0 to N − k − 1 without going outside the set {0, 1, ..., N − 1}. This implies that N − k values of (4.27) are available. Thus we obtain the sample average estimator

   r̂_N(k) = (1/(N − k)) Σ_{t=0}^{N−k−1} (X_{t+k} − m̂_N)(X_t − m̂_N),   k = 0, 1, ..., N − 1.     (4.28)

For k < 0 we obtain by the same argument r̂_N(k) = r̂_N(−k).

4.5.2 Biasedness

We investigate the biasedness of the estimator (4.28). For k ≥ 0 we write

   Σ_{t=0}^{N−k−1} (X_{t+k} − m̂_N)(X_t − m̂_N)
      = Σ_{t=0}^{N−k−1} (X_{t+k} − m − m̂_N + m)(X_t − m − m̂_N + m)
      = Σ_{t=0}^{N−k−1} (X_{t+k} − m)(X_t − m) − Σ_{t=0}^{N−k−1} (X_{t+k} − m)(m̂_N − m)
        − Σ_{t=0}^{N−k−1} (X_t − m)(m̂_N − m) + Σ_{t=0}^{N−k−1} (m̂_N − m)².

For N ≫ k we have with good approximation

   Σ_{t=0}^{N−k−1} X_{t+k} ≈ Σ_{t=0}^{N−k−1} X_t ≈ (N − k) m̂_N.     (4.29)

It follows that

   Σ_{t=0}^{N−k−1} (X_{t+k} − m̂_N)(X_t − m̂_N) ≈ Σ_{t=0}^{N−k−1} (X_{t+k} − m)(X_t − m) − (N − k)(m̂_N − m)².

Dividing by N − k and taking expectations we obtain¹

   E r̂_N(k) ≈ r(k) − var(m̂_N).     (4.30)

The estimator (4.28) of r(k) hence is biased, and the bias is always negative. As we saw in § 4.4 (p. 43), under suitable conditions lim_{N→∞} var(m̂_N) = 0. In this case the estimator (4.28) is asymptotically unbiased.

In practice the estimator (4.28) is usually replaced with an estimator of the same form but with N in the denominator rather than N − k. The reason is that although the revised estimator has a larger bias it generally has a smaller mean square estimation error. Another important reason is discussed in Problem 4.3 (p. 56). From now on we therefore work with the estimator

   r̂_N(k) = (1/N) Σ_{t=0}^{N−|k|−1} (X_{t+|k|} − m̂_N)(X_t − m̂_N),   k = −(N−1), ..., 0, ..., N−1.     (4.31)

This estimator is known as the standard biased covariance estimator. Like (4.28) this estimator is asymptotically unbiased.

¹It may be shown that E r̂_N(k) = r(k) − var(m̂_N) + O(k)/(N(N−k)) if (4.26) holds.

4.5.3 Accuracy

We consider the accuracy of the estimator (4.31). For simplicity we assume that the process X_t has a known mean m, and that m = 0. Hence we use the estimator

   r̂_N(k) = (1/N) Σ_{t=0}^{N−|k|−1} X_{t+|k|} X_t.     (4.32)

This greatly simplifies the algebra, and the conclusions are qualitatively the same as when m is unknown. Asymptotically the formulas that we find also hold for the estimator (4.31). The estimator (4.32) has expectation

   E r̂_N(k) = (1 − |k|/N) r(k),     (4.33)

and hence is asymptotically unbiased. The mean square error is

   E(r̂_N(k) − r(k))² = var(r̂_N(k)) + [r(k) − E r̂_N(k)]² = var(r̂_N(k)) + (k²/N²) r²(k).


When computing the variance of r̂_N(k) expectations of fourth-order moments of the stochastic process X_t occur. To evaluate these it is necessary to consider more than only the first- and second-order properties of the process. We assume that the process is normally distributed. Then it follows that for k ≥ 0

   var(r̂_N(k)) = (1/N) Σ_{i=−N+1+k}^{N−1−k} (1 − (k + |i|)/N) [r²(i) + r(k+i) r(k−i)].     (4.34)

The proof may be found in Appendix A (p. 93). For large N (that is, N ≫ 1) we have for the mean square error

   E(r̂_N(k) − r(k))² ≈ var(r̂_N(k)) ≈ (1/N) Σ_{i=−∞}^{∞} [r²(i) + r(i+k) r(i−k)].

This implies for k = 0 and k ≫ 1

   E(r̂_N(k) − r(k))² ≈ (2/N) Σ_{i=−∞}^{∞} r²(i) for k = 0,
   E(r̂_N(k) − r(k))² ≈ (1/N) Σ_{i=−∞}^{∞} r²(i) for k ≫ 1.     (4.35)

These quantities may be estimated as soon as an estimate r̂_N of r is available.

In Appendix A (p. 93) it is furthermore proved that for k ≥ 0 and k′ ≥ 0

   cov(r̂_N(k), r̂_N(k′)) ≈ (1/N) Σ_{i=−∞}^{∞} [r(i) r(i + k − k′) + r(i + k) r(i − k′)].     (4.36)

For increasing k − k′ the covariance cov(r̂_N(k), r̂_N(k′)) approaches zero. Inspection of (4.36) shows that the interval over which the covariance decreases to zero is of the same order of magnitude as the interval over which the covariance function r decreases to zero. An important implication of this is that the correlation between the estimation errors of r̂_N(k) and r̂_N(k′) for close values of k and k′ is large. Consequently r̂_N may have a deceivingly smooth behavior while the estimation error is considerable.

For reasons of exposition we had to make several restrictive assumptions. Practice seems to confirm, however, that the conclusions qualitatively hold for arbitrary wide sense stationary processes.

4.5.4 Example

We illustrate the results by an example. Figure 2.6(a) (p. 8) shows a realization of length N = 200 of a process with the exponential covariance function

   r(k) = σ_X² a^{|k|},     (4.37)

with a = 0.9 and σ_X = 1. Figure 4.3 shows the behavior of the estimate r̂_200 of the covariance function r based on the 200 available observations. We see that the estimated behavior deviates considerably from the real behavior.

[Figure 4.3: Solid: estimate of an exponential covariance function. Dashed: actual covariance function. Horizontal axis: time shift k.]

We check what the accuracy of the estimate is. Because

   Σ_{i=−∞}^{∞} r²(i) = σ_X⁴ Σ_{i=−∞}^{∞} a^{2|i|} = σ_X⁴ (1 + a²)/(1 − a²)     (4.38)

it follows from (4.35) that for large N

   var(r̂_N(0)) ≈ σ_X⁴ (1 + a²)/(1 − a²) · (2/N).     (4.39)

For large k the variance of r̂_N(k) is half this amount. Numerically we find that the standard deviation of r̂_200(0) is about 0.309. For large k the standard deviation of r̂_200(k) is about 0.218. We see that the errors that are theoretically anticipated are large. The actual estimates lie within the theoretical error margins.

Example 4.5.1 (Estimation of the covariance function in Matlab). The time series of Fig. 2.6(a) (p. 8) and the estimated covariance function of Fig. 4.3 were produced with the following MATLAB script. It assumes that the MATLAB toolbox IDENT is installed.

% Estimation of the covariance function of
% an AR(1) process. Define the parameters
% and the system structure
a = 0.9;
sigma = sqrt(1-a^2);     % then var(X_t) = 1
D = [1 -a];              % X_t = a X_{t-1} + e_t
th = poly2th(D,[]);
randn('seed',1);
e = sigma*randn(256,1);  % white noise
x = idsim(e,th);         % realization of X_t
rhat = covf(x,64);       % estimate covariance over 64 time shifts
plot(rhat);
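
The estimator (4.31) can also be computed directly, without the toolbox. The sketch below assumes that the realization x produced by the script above is available in the workspace.

% Sketch: the standard biased covariance estimator (4.31), computed directly
N = length(x); K = 64;               % estimate over K time shifts
mhat = mean(x);
rhat = zeros(K,1);
for k = 0:K-1
  rhat(k+1) = sum((x(1+k:N)-mhat).*(x(1:N-k)-mhat))/N;
end
plot(0:K-1, rhat);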

4.6 Estimation of the spectral density

4.6.1 Introduction

In this section we look for estimators of the spectral density of a wide-sense stationary process X_t, t ∈ Z. We assume that we know that the process has mean EX_t = 0,


and that the covariance function is cov(X_t, X_s) = r(t − s). We further assume that r is absolutely summable, that is, Σ_{k=−∞}^{∞} |r(k)| < ∞. Then the spectral density

   φ(ω) = Σ_{k=−∞}^{∞} r(k) e^{−ikω},   ω ∈ [−π, π),     (4.40)

exists and may be shown to be continuous in the frequency ω.

An obvious estimator of the spectral density φ is the DCFT of an estimator of the covariance function r̂_N of the process. For r̂_N we choose the estimator (4.32):

   r̂_N(k) = (1/N) Σ_{t=0}^{N−|k|−1} X_{t+|k|} X_t,   k = 0, ±1, ..., ±(N−1).     (4.41)

From this we obtain

   φ̂_N(ω) = Σ_{k=−N+1}^{N−1} r̂_N(k) e^{−ikω}     (4.42)

as an estimator for the spectral density φ.

This estimator is closely connected to a function that is known as the periodogram of the segment X_t, t ∈ {0, 1, ..., N − 1}, of the time series X_t. The periodogram p_N is defined as the function²

   p_N(ω) = (1/N) |Σ_{t=0}^{N−1} X_t e^{−itω}|² = |X_N(ω)|²/N,     (4.43)

where X_N is defined as the DCFT

   X_N(ω) = Σ_{t=0}^{N−1} X_t e^{−itω}.     (4.44)

²Often it is the plot of p_N(ω) as a function of frequency ω that is called the periodogram.

In Appendix A (p. 94) it is shown that the estimator (4.42) of φ and the periodogram p_N of (4.43) are actually one and the same object:

Lemma 4.6.1 (Periodogram).

   φ̂_N(ω) = p_N(ω).     (4.45)
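
In practice the periodogram is computed with the FFT, which evaluates the DCFT (4.44) at the N equidistant frequencies 2πn/N. The following MATLAB sketch assumes that an observed zero-mean segment x of length N is available in the workspace (for instance the realization of Example 4.5.1) and plots (4.43) at these frequencies.

% Sketch: periodogram via the FFT; x is an observed segment of length N
N  = length(x);
XN = fft(x);                          % X_N(omega) at omega = 2*pi*n/N, n = 0,...,N-1
pN = abs(XN).^2/N;                    % the periodogram (4.43) at these frequencies
omega = 2*pi*(0:N-1)'/N;
plot(omega, pN);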

4.6.2 Biasedness and consistency

The estimator (4.41) of the covariance function is asymptotically unbiased. It may be proved that the estimator (4.42) of the spectral density is also asymptotically unbiased.

The next question is whether or not the estimator is also consistent. For the case that the process X_t, t ∈ Z, is normally distributed this may be verified and the answer is negative:

Lemma 4.6.2 (Inconsistency of the periodogram). Assume X_t is a wide sense stationary normally distributed process with zero mean. Then for any ω_1, ω_2 ∈ (−π, π) there holds that

1. lim_{N→∞} var(φ̂_N(ω_1)) = φ²(ω_1) if ω_1 ≠ 0,

2. lim_{N→∞} cov(φ̂_N(ω_1), φ̂_N(ω_2)) = 0 if ω_1 ≠ ±ω_2.

The proof of the lemma is involved. It is summarized in Appendix A. For normally distributed processes consistency is equivalent to convergence to zero of the variance. Hence the above lemma implies that the estimator φ̂_N is not consistent. Also it reveals that for large N the errors in the estimates of the spectral density at close but different frequencies are uncorrelated. A consequence of these two results is that the plot of φ̂_N(ω) often has an erroneous and irregularly fluctuating appearance. This is evident from Figure 4.4. The figure shows two estimates φ̂_N, one for N = 128 and one for N = 1024, together with the exact φ. The plots confirm that the standard deviation of φ̂_N at each frequency is close to φ.

One explanation of Condition 1 of the above lemma is that the estimator (4.42) follows by summing over estimates of the covariance function that each have variance of order 1/N. Because the summation is over 2N − 1 terms the outcome may have very large variance. Based on this argument we expect that if in (4.42) we sum over M ≪ N terms then the variance will be smaller. The variance of a sum of O(M) terms each with variance O(1/N) can at most be O(M²/N). Therefore if M and N jointly approach infinity but such that M²/N → 0 then we expect consistent estimation. This is a rather conservative condition. One of the results of the next section is that in fact it suffices to take M/N → 0.

4.6.3 Windowing

The previous paragraph suggests to consider estimators of the form

   φ̂^w_N(ω) = Σ_{k=−∞}^{∞} w(k) r̂_N(k) e^{−iωk}.     (4.46)

Here w is a symmetric function such that w(k) = 0 for |k| ≥ M. The function w is called a window. The simplest choice is the rectangular window

   w(k) = 1 for |k| ≤ M,   w(k) = 0 for |k| > M,     (4.47)

with M < N. As we shall see there are advantages to choosing other shapes for the window.

A useful result is that φ̂^w_N and the estimator φ̂_N of (4.42) are related by convolution,

   φ̂^w_N(ω) = (1/(2π)) ∫_{−π}^{π} W(ρ) φ̂_N(ω − ρ) dρ,   ω ∈ [−π, π).     (4.48)


[Figure 4.4: Two estimates φ̂_N for the process of Subsection 2.4.5: φ̂_128(ω) (left) and φ̂_1024(ω) (right), each plotted together with the exact spectral density φ(ω).]

W is the DCFT of w and is called the spectral window corresponding to the time window w.

Proof 4.6.3 (Relation between φ̂^w_N and φ̂_N). By inverse Fourier transformation it follows that

   w(k) = (1/(2π)) ∫_{−π}^{π} W(ρ) e^{iρk} dρ.     (4.49)

Substitution of this into (4.46) yields

   φ̂^w_N(ω) = Σ_{k=−∞}^{∞} [(1/(2π)) ∫_{−π}^{π} W(ρ) e^{iρk} dρ] r̂_N(k) e^{−iωk}
            = (1/(2π)) ∫_{−π}^{π} W(ρ) (Σ_{k=−∞}^{∞} r̂_N(k) e^{−i(ω−ρ)k}) dρ
            = (1/(2π)) ∫_{−π}^{π} W(ρ) φ̂_N(ω − ρ) dρ.

The effect of the window w clearly is that for fixed ω, φ̂^w_N(ω) is a weighted average of φ̂_N about the frequency ω. The distribution of the weights is determined by the shape of the spectral window W. It is expected that the fluctuations of φ̂_N may be decreased by a suitable choice of the window. A less desirable side effect is that the spectral density is distorted by the windowing operation. The amount of distortion depends on the shape of the density and that of the spectral window.

Because of the symmetry of w the spectral window W is a real function. We require that

   (1/(2π)) ∫_{−π}^{π} W(ρ) dρ = 1.     (4.50)

This is equivalent to requiring that w(0) = 1. The result of the normalization (4.50) is that if φ̂_N is a constant then φ̂^w_N is precisely this constant. Similarly if φ̂_N varies slowly compared to W then φ̂_N ≈ φ̂^w_N.

In the literature many window pairs w and W are known. We discuss some common ones. The rectangular window w_R is given by (4.47). The corresponding spectral window is

   W_R(ω) = sin((M + 1/2)ω) / sin(ω/2).     (4.51)

Figure 4.5 shows the time and spectral windows. The part of the spectral window between the frequencies −a and a, with a = 2π/(2M+1), is called the main lobe. On either side the side lobes are found.

A disadvantage of the rectangular window is that W_R is negative on certain frequency intervals. Moreover the side lobes are relatively large. The consequence of the negative values is that the estimated spectral density φ̂^w_N may also be negative for certain frequencies. This is a non-interpretable result because the spectral density is essentially non-negative.

Bartlett’s window w_B has the triangular form

   w_B(k) = 1 − |k|/M for |k| ≤ M,   w_B(k) = 0 for |k| > M.     (4.52)

The corresponding spectral window is

   W_B(ω) = (1/M) (sin(Mω/2) / sin(ω/2))².     (4.53)

The windows are plotted in Fig. 4.6. The spectral window W_B is positive everywhere. Note that the half width a of the main lobe equals twice that of the rectangular window. Hence, there is loss of spectral resolution compared to the rectangular window. This is the price for the reduction of the estimation error. The side lobes of Bartlett’s window still are relatively large.

The Hann window, w_H, sometimes wrongly referred to as the Hanning window, is very popular. It is given by

   w_H(k) = (1/2)(1 + cos(πk/M)) for |k| ≤ M,   w_H(k) = 0 for |k| > M.     (4.54)

The corresponding spectral window is

   W_H(ω) = (1/2) W_R(ω) + (1/4) W_R(ω − π/M) + (1/4) W_R(ω + π/M),     (4.55)


[Figure 4.5: The rectangular window w_R and the corresponding spectral window W_R, with a = 2π/(2M + 1).]

[Figure 4.6: Bartlett’s window w_B and the corresponding spectral window W_B, with a = 2π/M.]

[Figure 4.7: Hann’s window w_H and the corresponding spectral window W_H, with a = 2π/M.]


with W_R the spectral window (4.51) that belongs to the rectangular window. The window w_H and the spectral window W_H are plotted in Fig. 4.7. The width of the main lobe is the same as that of Bartlett’s window. The side lobes of Hann’s window are smaller than those of Bartlett’s window, and are only marginally negative.

A variant of Hann’s window is that of Hamming. Hamming’s window is given by

   w_h(k) = 0.54 + 0.46 cos(πk/M) for |k| ≤ M,   w_h(k) = 0 for |k| > M,     (4.56)

and spectral window

   W_h(ω) = 0.54 W_R(ω) + 0.23 W_R(ω − π/M) + 0.23 W_R(ω + π/M).

This window has slightly better properties than Hann’s window. The width of the main lobe is the same as that of Hann’s window.

We conclude by listing formulas for the asymptotic variance and covariance of the estimation error for windowed estimates of normally distributed white noise:

Lemma 4.6.4 (Covariance of windowed periodogram). Assume X_t is a zero mean normally distributed white noise process. Then for large N

  cov(φ̂^w_N(ω₁), φ̂^w_N(ω₂)) ≈ (φ(ω₁)φ(ω₂)/(2πN)) ∫_{−π}^{π} W(ω₁−ρ) [ W(ω₂−ρ) + W(ω₂+ρ) ] dρ.

An immediate special case is the variance,

  var(φ̂^w_N(ω)) ≈ (φ²(ω)/(πN)) ∫_{−π}^{π} W²(ρ) dρ    (4.57)
               = (φ²(ω)/(N/2)) Σ_{k=−∞}^{∞} w²(k).    (4.58)

In the first equality we used the fact that W is symmetric and periodic in ω with period 2π; in the second we used a result known as Parseval's equality³: (1/(2π)) ∫_{−π}^{π} W²(ρ) dρ = Σ_{k=−∞}^{∞} w²(k).

To illustrate these results consider Daniell's window, which in the frequency domain is given by

  W_D(ω) = { M   for |ω| ≤ π/M,
             0   for |ω| > π/M.

The larger M is chosen, the narrower the spectral window is. The corresponding time window is

  w_D(k) = { 1                   for k = 0,
             sin(πk/M)/(πk/M)    for k ≠ 0.

Because the time window w_D is not exactly zero outside a finite interval it can only be implemented approximately. Evaluation of (4.57) yields

  var(φ̂^w_N(ω)) ≈ (2M/N) φ²(ω).    (4.59)

For the popular window of Hann the equality (4.58) evaluates to

  var(φ̂^w_N(ω)) ≈ (3M/(4N)) φ²(ω).

³It follows directly by equating φ̂^w_N(0) in (4.46) with (4.48) for r̂_N(k) := w(k).

We see that in both cases the relative error is

  sqrt(var(φ̂^w_N(ω))) / φ(ω) = O( sqrt(M/N) ).    (4.60)

The rule of thumb is that the relative error of the spectral estimation is of order sqrt(M/N) even in the non-white case. If M and N simultaneously approach ∞ but such that M/N → 0 then consistent estimation results. A typical value for M is M = 4·sqrt(N).

It is generally true that if the width of the spectral window decreases then the resolution increases, but at the same time the variance of the estimation error also increases. For every application the width of the spectral window needs to be suitably chosen to find a useful compromise between these two opposite effects. The spectral window becomes narrower if the time window becomes wider.

Inspection of Lemma 4.6.4 shows that the correlation between the errors in the estimates at two different frequencies decreases to zero if the distance between the frequencies is greater than the width of the spectral window.
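As an illustration, a minimal MATLAB sketch of such a windowed (lag window) estimate: the estimated covariances are multiplied by Hann's window (4.54) and then Fourier transformed. It assumes that N (the number of observations) exists and that the row vector rN holds the estimated covariances r̂_N(0), ..., r̂_N(M); the variable names and the choice of a grid of 2M+1 frequencies are ours.

M    = round(4*sqrt(N));                    % rule of thumb mentioned above
k    = 0:M;
wH   = 0.5*(1 + cos(pi*k/M));               % Hann window for k >= 0
rw   = [rN(1), 2*wH(2:end).*rN(2:M+1)];     % terms for k = 0 and the pairs +-k
om   = (0:2*M)'*2*pi/(2*M+1);               % frequency grid on [0, 2*pi)
phiw = real(exp(-1i*om*k)*rw(:));           % sum_k w(k) rN(|k|) exp(-i*om*k)
plot(om, phiw);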

4.6.4 More about the DCFT

The Fourier transform, in this case the DCFT, plays a central role in the estimation of spectral densities. Before discussing the practical computation of estimates we have another look at the DCFT.

A special circumstance in the estimation of spectral densities is that the time functions whose DCFT is computed always have finite length. This is indeed always the case for practical numerical computations of Fourier transforms.

Consider the DCFT

  x̂(ω) = Σ_{k=0}^{L−1} x_k e^{−iωk},  ω ∈ ℝ,    (4.61)

of a time function x_k, k = 0, 1, ..., L−1. For simplicity and without loss of generality we position the support⁴ of the function in the interval [0, L). The DCFT is periodic in ω with period 2π.

Because the function x_k that is transformed is defined by L numbers we expect that also the DCFT x̂(ω) may be determined by L numbers. Indeed, it turns out to be sufficient to calculate the DCFT at L points, with equal distances 2π/L, on an interval of length 2π, for instance [0, 2π). As shown in Appendix A (p. 95) we may recover the x_k from the L values x̂(2πn/L) via

  x_k = (1/L) Σ_{n=0}^{L−1} x̂(2πn/L) e^{i 2πnk/L},  k = 0, 1, ..., L−1.    (4.62)

Outside the interval [0, 2π) the values of the DCFT x̂(ω) on the grid {2πn/L, n ∈ ℤ} follow by periodic continuation with period 2π. Then between the grid points the DCFT may be retrieved by the interpolation formula, see e.g. Kwakernaak and Sivan (1991),

  x̂(ω) = Σ_{n=−∞}^{∞} x̂(2πn/L) sinc( (ω − 2πn/L) L/2 ),  ω ∈ ℝ.    (4.63)

Here sinc is the function

  sinc(t) = { 1          for t = 0,
              sin(t)/t   for t ≠ 0.    (4.64)

The interpolation formula (4.63) is the converse of the well known sampling theorem from signal processing. The sampling theorem states that a band-limited time signal may be retrieved between sampling instants by interpolation with sinc functions for a suitably chosen sampling interval. Formula (4.63) states that the Fourier transform of a time-limited signal may be retrieved between suitably spaced grid points by interpolation with sinc functions.

Thus, for the computation of the DCFT x̂(ω) it is sufficient to compute

  x̂(2πn/L) = Σ_{k=0}^{L−1} x_k e^{−i 2πnk/L},  n = 0, 1, ..., L−1.    (4.65)

The transformation of the time function x_k, k = 0, 1, ..., L−1, to the frequency function x̂(2πn/L), n = 0, 1, ..., L−1, according to (4.65) is known as the discrete Fourier transformation. To distinguish it from the DCFT, Kwakernaak and Sivan (1991) call the transformation (4.65) the DDFT (discrete-to-discrete Fourier transform). Note the symmetry between the DDFT and its inverse (4.62). The DDFT and its inverse may efficiently be computed with an algorithm that is known as the fast Fourier transform.

⁴The support is the smallest interval [a, b] such that x_k = 0 for all k < a and k > b.
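As a quick numerical check (our own, not part of the original notes): MATLAB's fft and ifft implement exactly the sums (4.65) and (4.62), so transforming and back-transforming a finite sequence recovers it to machine precision.

x  = randn(1, 8);       % any finite sequence, here L = 8
xh = fft(x);            % xh(n+1) = sum_k x(k+1)*exp(-i*2*pi*n*k/L), cf. (4.65)
xr = ifft(xh);          % inverse DDFT, cf. (4.62)
max(abs(x - xr))        % of the order of machine precision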

4.6.5 The fast Fourier transform

The fast Fourier transform (FFT) is an efficient algorithm to compute the DDFT. The DDFT of the time signal x_k, k = 0, 1, ..., L−1, is defined by Eqn. (4.65), which we rewrite in the form

  x̂_n = Σ_{k=0}^{L−1} x_k e^{−i 2πnk/L},  n = 0, 1, ..., L−1.    (4.66)

The inverse DDFT (4.62) now becomes

  x_k = (1/L) Σ_{n=0}^{L−1} x̂_n e^{i 2πnk/L},  k = 0, 1, ..., L−1.    (4.67)

Because of the symmetry of (4.66) and (4.67) the FFT algorithm can be used both for the DDFT and for the inverse DDFT.

For any fixed n the computation of (4.66) requires O(L) operations. As there are L values of n to be considered, a full DDFT (4.66) seems to require O(L²) operations. The FFT however shows that this may be achieved with much less effort.

There are several variants and refinements of the FFT. We consider the FFT algorithm with base 2 of Cooley-Tukey. This is the most efficient and best known algorithm. Like all FFTs it uses factorization of L as a product of integers. For simplicity we assume that L is an integral power of 2, say L = 2^M.

The crux of the FFT is that for all even n = 2m we may rewrite (4.66) as

  x̂_{2m} = Σ_{k=0}^{L/2−1} (x_k + x_{k+L/2}) e^{−i 2πmk/(L/2)},  m = 0, 1, ..., L/2 − 1,

and for all odd indices n = 2m+1,

  x̂_{2m+1} = Σ_{k=0}^{L/2−1} e^{−i 2πk/L} (x_k − x_{k+L/2}) e^{−i 2πmk/(L/2)},  m = 0, 1, ..., L/2 − 1.

Thus the L-point DDFT (4.66) may be reduced to the computation of two L/2-point DDFTs, one of

  y_{even,k} := x_k + x_{k+L/2},  k = 0, 1, ..., L/2 − 1,

and one of

  y_{odd,k} := e^{−i 2πk/L} (x_k − x_{k+L/2}),  k = 0, 1, ..., L/2 − 1.

Forming the y_{even,k} from the x_k takes L/2 (complex) additions. Forming the y_{odd,k} takes, besides additions, also multiplications. Nowadays multiplication requires about as much computing time as addition; in any event, forming y_{even,k} and y_{odd,k} together takes dL units of operation for some constant d > 0. Now let C(L) denote the number of operations needed to compute the L-point DDFT. The above shows that

  C(L) = dL + 2C(L/2).

As C(1) = 0 we get that

  C(L) = dL log₂(L).

The DDFT so computed is known as the fast Fourier transform (FFT). Compare the number of operations dL log₂(L) to the number of operations O(L²) that we would need for direct computation of the DDFT. For large L the computing time that the FFT requires is a small fraction of that for the direct DDFT. In practice L is very large and the savings that the FFT achieves are spectacular.
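The even/odd split above translates directly into a recursive implementation. The following is a minimal MATLAB sketch (saved as ddft2.m); the function name and the recursive organization are our own choices, and in practice one simply calls the built-in fft.

function xhat = ddft2(x)
% Radix-2 Cooley-Tukey DDFT of a row vector x whose length L is a power of 2.
% A sketch of the even/odd split described above, not an optimized routine.
L = length(x);
if L == 1
    xhat = x;
    return
end
xeven = ddft2(x(1:L/2) + x(L/2+1:L));             % DDFT of y_even
xodd  = ddft2(exp(-2i*pi*(0:L/2-1)/L).* ...
              (x(1:L/2) - x(L/2+1:L)));           % DDFT of y_odd
xhat          = zeros(1, L);
xhat(1:2:L-1) = xeven;                            % the even-indexed values
xhat(2:2:L)   = xodd;                             % the odd-indexed values
end

For a quick check, max(abs(ddft2(x) - fft(x))) for a random row vector x of length 2^m should be of the order of machine precision.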

For application of the FFT it is necessary that the length L of the sequence that needs to be transformed be an integral power of 2. If this is not the case then the sequence may be supplemented, for instance with a number of zeros, so that the length satisfies the requirement. This of course reduces the efficiency. In addition this method does not precisely yield the desired result. The DDFT of a sequence of length L is defined on a natural grid of L points. Increasing L leads to a different grid. There are variants of the FFT that permit L to have other values than integral powers of 2. These are less efficient than the Cooley-Tukey algorithm. Burrus and Parks (1985) describe such algorithms.

The FFT algorithm was published in 1965 by the American mathematician J. W. Cooley and the American statistician J. W. Tukey. It was preceded by the work of the German applied mathematician C. Runge in the period 1903–1905, which remained unnoticed for a long time. Only the advent of the electronic computer made the method interesting. The algorithm plays an important role in signal processing and time series analysis; for advanced applications computer chips are used whose sole task is to compute FFTs.

4.6.6 Practical computation

Up to this point we have assumed that to find the spectral density of a time series, the time series first is centered and then its covariance function is estimated according to

  r̂_N(k) = (1/N) Σ_{t=0}^{N−|k|−1} X_{t+k} X_t,  k = −N+1, ..., N−1.    (4.68)

Then the estimate φ̂_N of the spectral density is found by Fourier transformation of r̂_N.

Computation of (4.68) for all k takes O(N²) operations. For small values of N direct computation of (4.68) is feasible. For large values of N it is worthwhile to consider computing r̂_N by inverse Fourier transformation because of the efficiency of the FFT algorithm. The idea is as follows.

Suppose we compute the estimated spectral density function on L equidistantly spaced frequency grid points on [0, 2π),

  φ̂_N(2πn/L) = (1/N) | Σ_{t=0}^{N−1} X_t e^{−i 2πnt/L} |²,  n = 0, 1, ..., L−1.    (4.69)

Here we used (4.45) and (4.42). The computation of the sum in (4.69) is essentially a DDFT, so it is quickly found. It is not precisely in the DDFT form (4.66), but we may make it so if L ≥ N, by defining X_t = 0 for t = N, N+1, ..., L−1. Then the sum is a true DDFT,

  φ̂_N(2πn/L) = (1/N) | Σ_{t=0}^{L−1} X_t e^{−i 2πnt/L} |²,  n = 0, 1, ..., L−1.    (4.70)

By definition the φ̂_N(2πn/L) satisfy

  φ̂_N(2πn/L) = Σ_{k=−N+1}^{N−1} r̂_N(k) e^{−i 2πnk/L},  n = 0, 1, ..., L−1.    (4.71)

This we recognize as a DDFT, but it is slightly different in that here the sum is not from 0 to L−1 as the definition (4.66) of the DDFT assumes. The sum (4.71) depends on 2N−1 values of r̂_N(k), so as before we expect that we may recover these 2N−1 values if we have at least 2N−1 values φ̂_N(2πn/L) available. We have L such values available, so we want to choose L such that

  L ≥ 2N − 1.

Then a variation of the inverse DDFT (4.67) applies that states that (4.71) is invertible with

  r̂_N(k) = (1/L) Σ_{n=0}^{L−1} φ̂_N(2πn/L) e^{i 2πnk/L},  k = −N+1, ..., N−1.    (4.72)

Also this inverse DDFT (4.72) may be efficiently computed using the FFT. The FFT algorithm generates the values of (4.72) for indices k = 0, 1, ..., L−1. Only for k ∈ [0, N−1] do they equal r̂_N(k).

In summary, to estimate the covariance function we apply the following procedure:

1. Choose an L ≥ 2N−1. Supplement the observed time series X_t, t = 0, 1, ..., N−1, with L−N zeros: X_t = 0 for t = N, N+1, ..., L−1.

2. Use the FFT to compute the spectral density function by means of the periodogram defined on L frequency grid points

   φ̂_N(2πn/L) = (1/N) | Σ_{t=0}^{L−1} X_t e^{−i 2πnt/L} |²,  n = 0, 1, ..., L−1.

3. Use the inverse FFT to compute

   r̂_tmp,N(k) = (1/L) Σ_{n=0}^{L−1} φ̂_N(2πn/L) e^{i 2πnk/L},  k = 0, 1, ..., L−1.

   Extract from this result the covariance function,

   r̂_N(k) = r̂_tmp,N(|k|),  k = −N+1, ..., N−1.


The procedure is illustrated in § 4.6.7 (p. 53) with an example. In all, the above procedure finds the covariance function in O(L log₂(L)) operations, which is a considerable improvement compared to the O(L²) operations that direct computation of the covariance function requires.

Once the estimate of the covariance function is known the next step is to apply a suitable time window. Then the estimate of the windowed spectral density follows by application of the FFT to the windowed r̂_N(k). As the time window has width 2M+1 we find with the FFT an estimated spectral density function on a frequency grid with at least 2M+1 grid points. Correspondingly the frequency grid spacing is 2π/(2M+1) or less. This is smaller than the width a = 2π/M of the main lobe of the common windows. Increasing the number of grid points much beyond its natural 2M+1 hence is not very useful.

An alternative procedure to estimate the spectral density is to compute the periodogram first, but to omit the remaining steps. The estimate of the spectral density then follows by direct application of a suitable spectral window to the "raw" periodogram, see Problem 4.7 (p. 56).

Marple (1987) extensively discusses many methods to estimate spectral densities.

4.6.7 Example

By way of illustration we discuss the estimation of the spectral density of the realization of the AR(2) process that is plotted in Fig. 2.6(b) (p. 8). The realization consists of 200 samples. Fig. 4.8(a) shows the periodogram, obtained by application of the FFT to the realization after continuation with 201 zeros, which is sufficient because then L = 401 > 2N−1. In Fig. 4.9(a) the relevant portion of the periodogram φ̂_200(ω) is plotted on the gridded frequency axis [0, π). The part for negative frequencies follows by periodic continuation. The estimated covariance function r̂_tmp,200 was obtained by inverse Fourier transformation of the periodogram, again with the FFT. Figure 4.8(b) shows the result. Figure 4.9(b) shows the estimated covariance function r̂_200(k) for k = 0, 1, ..., 200. Again the part for negative shifts follows by periodic continuation.

Inspection of Fig. 4.9(a) confirms the expected irregular behavior of the periodogram. In Fig. 4.9(b) we see that the estimated covariance function initially has the expected damped harmonic appearance but that for time shifts between about 60 and 110 large errors occur.

By windowing the estimated covariance function and Fourier transformation the spectral density may be estimated. Figure 4.10 shows the results of windowing with Hamming windows of widths 100, 50, and 25. For width 100 the behavior of the estimated spectral density still is rather irregular. For width 25 there is a clear loss of resolution. The best result is found for the width 50.

Example 4.6.5 (Matlab computation). The following script was used to obtain the results of figures 4.8–4.10. First the script of Example 2.4.3 is executed. Then:

N  = length(x);
L  = 4*2^nextpow2(N);   % big enough, L >= 2N-1
xh = fft(x,L);          % pad x with L-N zeros, then fft
ph = abs(xh).^2/N;      % periodogram on L grid points
rt = ifft(ph);          % inverse DDFT
rt = real(rt);          % needed due to rounding errors
rN = rt(1:N);           % estimated covariance function

Direct computation of the covariance function by the command

rN1 = covf(x,200);

from the Systems Identification Toolbox yields exactly the same result.

Windowed estimates of the spectral density as in Fig. 4.10 may be obtained with commands of the form

phid = spa(x,50);
[omega,phi50] = getff(phid);
plot(omega,phi50);

4.7 Continuous time processes

4.7.1 Introduction

In this section we discuss some aspects of the non-parametric statistical analysis of continuous time series. Especially in physics and engineering the phenomena that are studied often depend continuously on time. From a practical point of view signal processing is only feasible by sampling and digital processing. The question is to what extent the properties of the underlying continuous time process may be retrieved this way.

We assume that the underlying phenomenon may be described as a wide-sense stationary continuous time stochastic process

  X_t,  t ∈ ℝ.    (4.73)

By sampling with sampling interval T a discrete time process

  X*_n = X_{nT},  n ∈ ℤ,    (4.74)

is created. It is desired to estimate the statistical properties of the continuous time process X_t, in particular its mean, covariance function and spectral density, from N observations

  X*_n,  n = 0, 1, ..., N−1,    (4.75)

of the sampled process.

If the underlying continuous time process has mean E X_t = m and covariance function cov(X_t, X_s) = r(t−s) then the sampled process has mean

  E X*_n = E X_{nT} = m    (4.76)


Figure 4.8: (a) Periodogram for n = 0, 1, 2, ..., 400. (b) Estimated covariance function for k = 1, 2, ..., 400

Figure 4.9: (a) Periodogram on [0, π]. (b) Estimated covariance function on [0, 1, ..., 199]

Figure 4.10: Dashed: Exact spectral density. Solid: Estimated spectral density with a Hamming window. (a) Width of time window 100. (b) Width 50. (c) Width 25


and covariance function

  cov(X*_k, X*_n) = cov(X_{kT}, X_{nT}) = r((k−n)T).    (4.77)

Estimation of the mean m of the time series X*_n hence yields an estimate of the mean m of the continuous time process X_t. Likewise, estimation of the covariance function of the time series X*_n yields an estimate of

  r(kT),  k = 0, ±1, ..., ±(N−1).    (4.78)

Hence, we may only estimate the covariance function r on the discrete time axis.

4.7.2 Frequency content of the sampled signal

By processing the sampled signal we estimate the sampled covariance function of the continuous time process. The question is what information the Fourier transform of the sampled covariance function contains about the spectral density of the underlying continuous time process.

We pursue this question for a general continuous time function

  x(t),  t ∈ ℝ,    (4.79)

with CCFT

  x̂(ω) = ∫_{−∞}^{∞} x(t) e^{−iωt} dt,  ω ∈ ℝ.    (4.80)

Sampling yields the sequence

  x*_n = x(nT),  n ∈ ℤ.    (4.81)

We define the DCFT of this sequence as

  x̂*(ω) = T Σ_{n=−∞}^{∞} x*_n e^{−iωnT},  ω ∈ ℝ.    (4.82)

This generalization of the DCFT simplifies to the earlier definition if T = 1. The DCFT as given by (4.82) is periodic in ω with period 2π/T. The inverse of this Fourier transformation is

  x*_n = (1/(2π)) ∫_{−π/T}^{π/T} x̂*(ω) e^{iωnT} dω,  n ∈ ℤ.    (4.83)

The relation between the DCFT x̂* of the sampled signal x* and the CCFT x̂ of the underlying continuous time signal x is⁵

  x̂*(ω) = Σ_{k=−∞}^{∞} x̂(ω − 2πk/T),  ω ∈ ℝ.    (4.84)

Figure 4.11 illustrates the effect of sampling.

⁵Required are some technical assumptions discussed in Appendix A (p. 95).

Figure 4.11: Sampling a continuous time signal causes aliasing

The frequency content x̂ of the continuous time process x outside the interval [−π/T, π/T] is "folded back" into this interval. Because of this the frequency content x̂* of the sampled signal differs from that of the continuous time signal. This phenomenon is called aliasing, because frequencies outside the interval are mistaken for frequencies inside the interval.
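A small MATLAB illustration of this effect (our own example, not from the original notes): with a sampling interval of T = 0.1 s the Nyquist frequency corresponds to 5 Hz, and a 9 Hz sine is indistinguishable from a 1 Hz sine at the sampling instants.

T  = 0.1;                  % sampling interval; pi/T corresponds to 5 Hz
t  = (0:50)*T;             % sampling instants
x9 = sin(2*pi*9*t);        % 9 Hz sine, above the Nyquist frequency
x1 = sin(2*pi*1*t);        % 1 Hz sine
max(abs(x9 + x1))          % practically zero: the 9 Hz sine aliases onto -1 Hz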

If the CCFT x̂ of the signal x is zero outside the interval [−π/T, π/T] then no aliasing takes place. The signal x then is said to be band limited. The angular frequency π/T is known as the Nyquist frequency. If the bandwidth of the signal x is less than the Nyquist frequency then the continuous time signal x may be fully recovered from the sampled signal by the interpolation formula

  x(t) = Σ_{n=−∞}^{∞} x(nT) sinc( π(t − nT)/T ),  t ∈ ℝ.    (4.85)

This result is known as Shannon's sampling theorem.
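A brief MATLAB sketch of the interpolation formula (4.85), with our own example signal; sinc is the function (4.64), coded inline to avoid any toolbox dependence. The reconstruction is exact up to the truncation of the infinite sum to finitely many samples.

si = @(u) sin(u)./(u + (u==0)) + (u==0);   % the sinc function (4.64)
T  = 0.5;                                  % sampling interval, Nyquist frequency pi/T
n  = -40:40;                               % sample indices
xn = cos(1.5*n*T);                         % samples of x(t) = cos(1.5 t), bandwidth 1.5 < pi/T
t  = -5:0.01:5;                            % dense time grid
xr = zeros(size(t));
for m = 1:length(n)
    xr = xr + xn(m)*si(pi*(t - n(m)*T)/T); % the sum in (4.85)
end
max(abs(xr - cos(1.5*t)))                  % small; the error is due to truncating the sum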

4.7.3 Consequences for statistical signal processing

From § 4.7.1 (p. 53) we know that by statistical analysis of the sampled process an estimate may be obtained of the sampled covariance function of the continuous time process. Application of the DCFT to the sampled covariance function according to (4.82) only yields a correct result for the spectral density φ of the continuous time process if φ has a bandwidth that is less than the Nyquist frequency π/T. If the process X_t is not limited to this bandwidth then the spectral density is distorted. The more the bandwidth exceeds the Nyquist frequency the more severe is the distortion. In practice aliasing is usually avoided by passing the continuous time process through a filter that cuts off or strongly attenuates the signal at all frequencies above the Nyquist frequency before sampling. This pre-sampling filter or anti-aliasing filter of course affects the high-frequency content of the continuous time signal.

4.8 Problems


4.1 Estimation of the mean of the MA(1) process. Suppose that X_t is a realization of the MA(1) process X_t = ε_t + bε_{t−1}. Write the covariance function of the process as r(k) = σ²_X ρ(k), with ρ the correlation function.

   a) Determine σ²_X and ρ(k).

   b) Verify that the variance of the estimator (4.18) equals

        var(m̂_N) = σ² ( (1 + 2b + b²)/N − 2b/N² ).

      How does the variance behave asymptotically? Explain the dependence of the variance on b.

   c) The Cramér-Rao inequality suggests that variances generally are of order 1/N at best, but for b = −1 the above equation reads var(m̂_N) = −(2b/N²)σ². Is something wrong here?

4.2 Harmonic process. Consider the process defined by

        X_t = A cos(2πt/T + B),  t ∈ ℤ.    (4.86)

   T is a natural number. A and B are independent stochastic variables. A has expectation zero and B is uniformly distributed on [0, 2π].

   a) Prove that the process is stationary.

   b) Is the process ergodic?

4.3 Not all estimated r(k) are covariance functions⋆. A property of covariance functions r(k) is that the covariance matrices are nonnegative definite, see Eqn. (2.10). Estimated covariance functions r̂(k) however need not have this property, and this is a source of problems: it may for example lead to estimates φ̂(ω) of the spectral density that fail to satisfy φ̂(ω) ≥ 0.

   a) Show that the revised estimates (4.31) do satisfy (2.10).

   b) Show by counterexample that the unbiased estimates (4.28) need not satisfy (2.10).

4.4 Estimation of the covariance function and correlation function of white noise. Suppose that X_t is normally distributed white noise with zero mean and standard deviation σ. We consider the estimator (4.32) for the covariance function r(k) of the white noise.

   a) What are the variances of the estimator for k = 0 and for k ≠ 0?

      Hint: if X is normally distributed with zero mean and standard deviation σ, then E(X⁴) = 3σ⁴.

   b) Define

        ρ̂_N(k) = r̂_N(k)/r̂_N(0)    (4.87)

      as an estimator for the correlation function ρ(k) = r(k)/r(0) of the process. Argue that for large N and k ≠ 0 the variance of ρ̂_N(k) approximately equals 1/N.

4.5 Test for white noise.⁶ The correlation function of a centered time series consisting of N data points is estimated as

        ρ̂_N(k) = r̂_N(k)/r̂_N(0).    (4.88)

   Here r̂_N is an estimator for the covariance function. It turns out that:

   • ρ̂_N(1) equals 0.5.

   • For |k| ≥ 2 the values of |ρ̂_N(k)| are less than 0.25.

   How large should N be so that it may be concluded with good confidence that the time series is not a realization of white noise?

Figure 4.12: Experimental correlation function

4.6 Interpretation of experimental results.⁷ A test vehicle with a wheel base of 10 m is pulled along a road with a poor surface with constant speed. On the vehicle a sensor is mounted that records the vertical acceleration of the body of the vehicle. Figure 4.12 shows a plot of the correlation function of the measured signal. The sampling interval is 0.025 s. What is the speed of the vehicle?

4.7 Hann's window in the frequency domain. Let φ̂_N(2πn/M), n ∈ ℤ, be the estimate of the spectral density that is obtained by application of a rectangular window of width M. Show that application of Hann's window yields the estimate

        φ̂_Hann,N(2πn/M) = (1/4) φ̂_N(2π(n−1)/M) + (1/2) φ̂_N(2πn/M) + (1/4) φ̂_N(2π(n+1)/M).

   What is the corresponding formula for Hamming's window?

4.8 Running average process.⁸ Consider the running average process X defined by

        X_t = (1/(k+1)) Σ_{j=0}^{k} ε_{t−j},  t ∈ ℤ,    (4.89)

⁶Examination May 30, 1995.
⁷Examination May 30, 1995.
⁸Examination May 31, 1994.


with k ≥ 0 a given integer. The process ε_t, t ∈ ℤ, is white noise with mean E ε_t = 0 and variance var ε_t = σ².

   a) Compute the covariance function r of the process. Sketch the plot of r.

   b) Compute the spectral density function φ of the process.

   c) Is the MA process invertible?

   d) Suppose that we wish to estimate the covariance function r and the spectral density function φ of the process based on a single realization of the process on {0, 1, ..., N−1}, with N ≫ k.

      i. Discuss how large N should be chosen to achieve a reasonable estimation accuracy. Specifically, suppose that k = 4 and that r(0) should be estimated with an accuracy of about 1%.

      ii. To estimate the spectral density function a time or frequency window with a corresponding window width should be selected. What may be said about the best choice of the width of the window?

4.9 Spectral analysis of a seasonal model.⁹ Consider the seasonal model

        y_t − c q^{−P} y_t = ε_t,  t = 1, 2, ...,    (4.90)

   with the natural number P ≥ 1 the period, c a real constant, and ε_t white noise with variance σ². Assume that the model is stable.

   a) Derive the spectral density function φ(ω). Sketch the plot of φ.

   b) For c = 0.8 and a certain P the spectral density function φ(ω) on [0, π) equals

      [plot of φ(ω) on [0, π)]

      What is the value of P?

   c) Assume that the spectral density function is estimated by Fourier transformation of an estimate of the covariance function of the process. How large should the width M of the time window be in relation to the period P to prevent that the estimated spectral density function is strongly distorted, in the sense that the peaks as visible in the above plot remain visible in the windowed estimate?

   d) Given the M suggested in 4.9c, how large should the number N of observed data points of the time series be so that a reasonable statistical accuracy is obtained? Here the answer may be more qualitative than for 4.9c.

⁹Examination May 25, 1993.

Matlab problems

10. Consider the AR process

        (1 − aq⁻²)X_t = ε_t,  a ∈ ℝ,

    where ε_t is white noise with mean μ and standard deviation σ. Let a = 1/2.

    a) Generate samples x_t, t = 0, 1, ..., N−1, for N = 400 and plot the result.

    b) Plot φ̂_N(ω) and r̂_N(k) using the FFT for suitably chosen L.

    c) Compare φ̂_N(ω) and r̂_N(k) with the exact φ(ω) and r(k).

    d) Compute φ̂^w_N for the Hamming window w(k) with support [−M, M]. Do this for various M and compare the results (given the N of Part 10a) to the exact φ. Which M do you prefer? Discuss the results.

    e) Change a to a = 3/4. Redo 10d. How does this affect the choice of M? Is it advisable to increase N?

Page 64: Time Series Analysis and System Identification

58

Page 65: Time Series Analysis and System Identification

5 Estimation of ARMA models

5.1 Introduction

In Chapter 4 (p. 41) a number of non-parametric estimators are discussed. These estimators may be applied to arbitrary wide-sense stationary time series without further assumptions on their structure. In some applications it is advantageous or even necessary to assume that the time series has a specific structure, which may be characterized by a finite number of parameters.

Because of its flexibility the ARMA scheme is often used as a model for time series. Estimating the properties of the time series then amounts to estimating the coefficients of the scheme. The use of ARMA schemes and the theory of estimating such schemes has been strongly stimulated by the work of Box and Jenkins (1970).

In § 5.2 (p. 59) we discuss the estimation of AR models. Section 5.3 (p. 63) treats the estimation of MA schemes. The combined problem of estimating ARMA models is the subject of § 5.4 (p. 64). In § 5.5 (p. 68) a brief exposition about numerical optimization methods for the solution of maximum likelihood problems and non-linear least squares problems follows. In § 5.6 (p. 69) the problem of the choice of the model order is examined.

The various methods are numerically illustrated by MATLAB examples.

5.2 Parameter estimation of AR processes

5.2.1 Introduction

We assume that the observed time series is a realization of an AR(n) process described by

  X_t = μ + a₁X_{t−1} + a₂X_{t−2} + ··· + a_n X_{t−n} + ε_t,    (5.1)

t = n, n+1, ..., where ε_t is white noise with mean 0 and variance σ².

We consider the problem of how to estimate the parameters a₁, a₂, ..., a_n, μ and σ², given the N observations X₀, X₁, ..., X_{N−1}. We successively discuss methods based on least squares in § 5.2.2 (p. 59) and on likelihood maximization in § 5.2.4 (p. 60).

5.2.2 Least squares estimator

In view of the available observations we attempt to find a best least squares fit of (5.1) for t = n, n+1, ..., N−1. Because the model represents X_t as a linear regression on the n preceding values X_{t−1}, X_{t−2}, ..., X_{t−n} and the constant μ, we look for those values of the parameters μ, a₁, a₂, ..., a_n for which the sum of squares

  Σ_{t=n}^{N−1} ( X_t − μ − a₁X_{t−1} − a₂X_{t−2} − ··· − a_n X_{t−n} )²    (5.2)

is minimal. That is, we seek a model that explains the data with minimal contribution of the noise ε_t. By successive partial differentiation with respect to μ and a₁, a₂, ..., a_n and setting the derivatives equal to zero we may find the estimates μ̂ and â_j. This can indeed be done, but we take a different route, one that turns out to be useful for on-line estimation when successively more data X_N, X_{N+1}, etcetera become available.

We write the AR scheme out in full as

  X = Wθ + ε,    (5.3)

with

  X = col(X_n, X_{n+1}, ..., X_{N−1}),  θ = col(μ, a₁, ..., a_n),  ε = col(ε_n, ε_{n+1}, ..., ε_{N−1}),

and W the (N−n) × (n+1) matrix whose successive rows are

  [ 1   X_{t−1}   X_{t−2}   ···   X_{t−n} ],  t = n, n+1, ..., N−1.

In this matrix notation the sum of squares (5.2) equals εᵀε. Minimizing this with respect to the vector of coefficients θ is a standard projection result treated earlier. The solution is

  θ̂ = (WᵀW)⁻¹WᵀX.    (5.4)

5.2.3 Recursive least squares

In adaptive control and real-time system identification problems, and many other problems, it is often the case that as time progresses more and more data X_t become available. Each time new data come in, we might want to re-compute the least squares estimate via the rule θ̂ = (WᵀW)⁻¹WᵀX. This, however, may become very time consuming since as the amount of data grows the matrix dimensions of W and X grow as well. As in Kalman filtering it turns out to be possible to obtain the new estimate θ̂ simply from the previous estimate, with a modification that is proportional to an appropriate prediction error.

Suppose we have available the N observations X₀, X₁, ..., X_{N−1}. We found for the estimate of θ the expression

  θ̂_{N−1} = (Wᵀ_{N−1}W_{N−1})⁻¹ Wᵀ_{N−1} X.    (5.5)

Here the subscript N−1 has been added to θ̂ and W to express that X_{N−1} is the last observation used for the estimate. Suppose that a next observation X_N becomes available. We may now form the next W-matrix,

  W_N = [ W_{N−1}
          Z_{N−1} ],


where

  Z_{N−1} = [ 1   X_{N−1}   X_{N−2}   ···   X_{N−n} ].

W_N is composed of the previous W_{N−1} with the single row vector Z_{N−1} stacked at the bottom. Consequently

  Wᵀ_N W_N = Wᵀ_{N−1}W_{N−1} + Zᵀ_{N−1}Z_{N−1}.

The matrix WᵀW has been updated with the rank-one matrix Zᵀ_{N−1}Z_{N−1}. It makes sense to figure out how much θ̂ changes now that the next observation X_N is available. Therefore consider θ̂_N − θ̂_{N−1}:

  θ̂_N − θ̂_{N−1}
    = (Wᵀ_N W_N)⁻¹ Wᵀ_N col(X, X_N) − (Wᵀ_{N−1}W_{N−1})⁻¹ Wᵀ_{N−1} X
    = (Wᵀ_N W_N)⁻¹ [ Wᵀ_N col(X, X_N) − (Wᵀ_{N−1}W_{N−1} + Zᵀ_{N−1}Z_{N−1}) (Wᵀ_{N−1}W_{N−1})⁻¹ Wᵀ_{N−1} X ]
    = (Wᵀ_N W_N)⁻¹ [ Wᵀ_{N−1}X + Zᵀ_{N−1}X_N − Wᵀ_{N−1}X − Zᵀ_{N−1}Z_{N−1} θ̂_{N−1} ]
    = (Wᵀ_N W_N)⁻¹ Zᵀ_{N−1} ( X_N − Z_{N−1}θ̂_{N−1} )
    = K_N e_N,    (5.6)

with K_N := (Wᵀ_N W_N)⁻¹ Zᵀ_{N−1} and e_N := X_N − Z_{N−1}θ̂_{N−1}.

This is in an interesting form. The term labeled e_N equals

  X_N − Z_{N−1}θ̂_{N−1} = X_N − ( μ̂ + â₁X_{N−1} + ··· + â_n X_{N−n} ),    (5.7)

which is the prediction error X_N − X̂_{N|N−1} of X_N given the current AR model. The result (5.6) thus states that if the new measurement X_N agrees with its prediction then the estimate of θ remains the same, θ̂_N = θ̂_{N−1}. If the prediction is off then the estimate is updated proportionally to the prediction error, with scaling factor K_N. K_N is known as the vector gain and is closely related to the Kalman gain.

For efficient implementation of the least squares algorithm we rewrite Eqn. (5.6). Define

  P_N = (Wᵀ_N W_N)⁻¹.    (5.8)

As it stands, this inverse would have to be calculated in (5.6) every time that new data become available. It may be verified, however, that updating Wᵀ_{N−1}W_{N−1} with the rank-one matrix Zᵀ_{N−1}Z_{N−1} is equivalent to (another) rank-one update of its inverse P_{N−1}, without having to invert a matrix,

  P_N = P_{N−1} − P_{N−1}Zᵀ_{N−1} (1 + Z_{N−1}P_{N−1}Zᵀ_{N−1})⁻¹ Z_{N−1}P_{N−1}.    (5.9)

(See Problem 5.3, p. 72.) With the update of P_N in this form the vector gain K_N defined in (5.6) may be expressed in terms of the past data,

  K_N = P_N Zᵀ_{N−1}
      = [ P_{N−1} − P_{N−1}Zᵀ_{N−1} (1 + Z_{N−1}P_{N−1}Zᵀ_{N−1})⁻¹ Z_{N−1}P_{N−1} ] Zᵀ_{N−1}
      = P_{N−1}Zᵀ_{N−1} (1 + Z_{N−1}P_{N−1}Zᵀ_{N−1})⁻¹.

Note that (5.9) can be written as P_N = P_{N−1} − K_N Z_{N−1}P_{N−1}.

In summary, for efficient, recursive estimation of θ we apply the following procedure (a MATLAB sketch of the recursion is given below):

1. Initialize N, P_N and θ̂_N;

2. Increase N;

3. Form the (n+1)-vector Z_{N−1} = [ 1  X_{N−1}  ···  X_{N−n} ] of a one and the last n observations;

4. Form the vector gain K_N = P_{N−1}Zᵀ_{N−1} (1 + Z_{N−1}P_{N−1}Zᵀ_{N−1})⁻¹;

5. Update P_N = P_{N−1} − K_N Z_{N−1}P_{N−1} and θ̂_N = θ̂_{N−1} + K_N (X_N − Z_{N−1}θ̂_{N−1});

6. Return to Step 2.
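A minimal MATLAB sketch of this recursion for the simplest case n = 0, X_t = μ + ε_t, which is the setting of Example 5.2.1 below; the variable names and the initialization are our own choices.

mu = 1;  x = mu + randn(1,100);          % generate the observations
P = 1e6;  theta = 0;                     % step 1: "large" P, arbitrary theta
thetaN = zeros(size(x));
for N = 1:length(x)                      % steps 2-6
    Z = 1;                               % regression row (only the constant)
    K = P*Z'/(1 + Z*P*Z');               % step 4: vector gain
    P = P - K*Z*P;                       % step 5: update of P
    theta = theta + K*(x(N) - Z*theta);  % update proportional to the prediction error
    thetaN(N) = theta;
end
plot(thetaN)                             % the estimate converges to mu = 1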

Example 5.2.1 (Recursive least squares). Figure 5.1(a) shows a realization of the process X_t = μ + ε_t with μ = 1, and ε_t normally distributed zero-mean white noise with variance 1. Part (b) of the figure shows, as a function of N, the estimate θ̂_N of μ on the basis of the first N observations. It seems to converge to μ = 1. The vector gain K_N shown in Part (c) clearly converges to zero. This is because after many observations the new information that X_N provides pales in comparison with what the large number of observations X₀, ..., X_{N−1} has already provided. This also indicates a possible problem: if the mean μ varies (slowly) with time then the long memory of the estimate makes adaptation of θ̂_N to μ very slow. In such situations new observations X_N should be weighted more than observations in the long past X_M, M ≪ N. There are ways of coping with this problem, but we shall not go into that here.

5.2.4 Maximum likelihood estimators

We now approach the problem of estimating the parameters of the AR(n) scheme according to the maximum likelihood method. To this end we need to determine the joint probability density of the observations X₀, X₁, ..., X_{N−1}. This is possible if

1. the joint probability density function of the initial conditions X₀, X₁, ..., X_{n−1} is known, and

2. the probability density of the white noise ε_t is known.

It is plausible and often justifiable to assume that the white noise ε_t is normally distributed. To avoid any assumptions on the distribution of the initial conditions we choose a likelihood function that is not the joint probability density of the observations X₀, X₁, ..., X_{N−1}, but the conditional probability density

  f_{X_n, X_{n+1}, ..., X_{N−1} | X₀, X₁, ..., X_{n−1}}(x_n, x_{n+1}, ..., x_{N−1} | x₀, x₁, ..., x_{n−1})    (5.10)

of X_n, X_{n+1}, ..., X_{N−1}, given X₀, X₁, ..., X_{n−1}.


Figure 5.1: RLS estimate θ̂_N and vector gain K_N

To obtain a manageable expression for (5.10) we use the well known formula

  f_{X,Y|Z} = f_{X|Y,Z} f_{Y|Z}    (5.11)

from probability theory, with X, Y and Z (possibly vector-valued) stochastic variables. With this we obtain for (5.10), temporarily omitting the arguments of the probability densities,

  f_{X_n,...,X_{N−1}|X₀,...,X_{n−1}}
    = f_{X_{n+1},...,X_{N−1}|X₀,...,X_n} · f_{X_n|X₀,...,X_{n−1}}
    = f_{X_{n+2},...,X_{N−1}|X₀,...,X_{n+1}} · f_{X_{n+1}|X₀,...,X_n} · f_{X_n|X₀,...,X_{n−1}}
    = ···
    = ∏_{t=n}^{N−1} f_{X_t|X₀,X₁,...,X_{t−1}}.    (5.12)

We next consider the conditional probability density f_{X_t|X₀,X₁,...,X_{t−1}} of X_t given X₀, X₁, ..., X_{t−1}. From

  X_t = μ + a₁X_{t−1} + a₂X_{t−2} + ··· + a_n X_{t−n} + ε_t    (5.13)

we see that for t ≥ n and given X₀, X₁, ..., X_{t−1} the stochastic variable X_t is normally distributed with mean

  E(X_t | X₀, X₁, ..., X_{t−1}) = E(X_t | X_{t−n}, X_{t−n+1}, ..., X_{t−1}) = μ + a₁X_{t−1} + a₂X_{t−2} + ··· + a_n X_{t−n}

and variance σ². Hence, we have

  f_{X_t|X₀,X₁,...,X_{t−1}}(x_t | x₀, x₁, ..., x_{t−1}) = (1/(σ√(2π))) exp( −(1/(2σ²)) ( x_t − μ − Σ_{j=1}^{n} a_j x_{t−j} )² ).

With this it follows from (5.12) that

  f_{X_n,...,X_{N−1}|X₀,...,X_{n−1}}(x_n, ..., x_{N−1} | x₀, ..., x_{n−1})
    = (1/(σ√(2π)))^{N−n} exp( −(1/(2σ²)) Σ_{t=n}^{N−1} ( x_t − μ − Σ_{j=1}^{n} a_j x_{t−j} )² ).

By taking the logarithm we obtain the log likelihood function

  L = −(N−n) log(σ√(2π)) − (1/(2σ²)) Σ_{t=n}^{N−1} ( x_t − μ − Σ_{j=1}^{n} a_j x_{t−j} )².    (5.14)

With the matrix notation of Eqn. (5.3) this simplifies to

  L = −(N−n) log(σ√(2π)) − (1/(2σ²)) ‖X − Wθ‖².    (5.15)

Inspection shows that maximization of L with respect to the parameters θ amounts to minimization of the sum of squares

  ‖X − Wθ‖².    (5.16)

This results in the same estimator θ̂ as the least squares estimator.

There remains the estimation of σ in the maximum likelihood framework. For this we need to maximize L with respect to σ. Partial differentiation of L with respect to σ and setting the derivative equal to zero yields the necessary condition

  −(N−n)/σ + (1/σ³) ‖X − Wθ̂‖² = 0.    (5.17)

Solution for σ² yields

  σ̂² = (1/(N−n)) ‖X − Wθ̂‖².    (5.18)

5.2.5 Accuracy

It may be proved that the least squares estimators for a₁, a₂, ..., a_n, μ, and σ, which are also maximum likelihood estimators, are asymptotically unbiased, consistent and asymptotically efficient (assuming that the process ε_t is normally distributed). By the asymptotic efficiency, for large N the variance of the estimators may be approximated with the help of the Cramér-Rao lower bound. For this lower bound we need the vector-valued version of the Cramér-Rao inequality of Theorem 3.3.4 (p. 36).


By partial differentiation of the log likelihood function (5.15) it follows that

  ∂L/∂θ = (1/σ²) Wᵀ(X − Wθ),
  ∂L/∂σ = −(N−n)/σ + (1/σ³) ‖X − Wθ‖².

Further partial differentiation yields the entries of the Hessian:

  ∂²L/∂θ∂θᵀ = −(1/σ²) WᵀW,
  ∂²L/∂θ∂σ = −(2/σ³) Wᵀ(X − Wθ),
  ∂²L/∂σ∂θᵀ = (∂²L/∂θ∂σ)ᵀ,
  ∂²L/∂σ² = (N−n)/σ² − (3/σ⁴) ‖X − Wθ‖².

After replacing the samples x_t by the stochastic variables X_t and taking expectations we find

  −E ∂²L/∂θ∂θᵀ = ((N−n)/σ²) ×
      [ 1     μ         μ         ···   μ
        μ     r(0)      r(1)      ···   r(n−1)
        μ     r(1)      r(0)      ···   r(n−2)
        ···   ···       ···       ···   ···
        μ     r(n−1)    r(n−2)    ···   r(0)    ],

  −E ∂²L/∂θ∂σ = E (2/σ³) Wᵀε = 0,    (5.19)

  −E ∂²L/∂σ² = 2(N−n)/σ².

In (5.19) we used that the ε_t have zero mean and that all X_{m−k}, k > 0, are uncorrelated with ε_m. For large N the variance matrix S of the estimators [σ̂, μ̂, â₁, â₂, ..., â_n] approximately equals the Cramér-Rao lower bound M⁻¹, with

  M = −E [ ∂²L/∂σ²     ∂²L/∂σ∂θᵀ
           ∂²L/∂θ∂σ    ∂²L/∂θ∂θᵀ ]
    = ((N−n)/σ²) ×
      [ 2     0     0         0         ···   0
        0     1     μ         μ         ···   μ
        0     μ     r(0)      r(1)      ···   r(n−1)
        0     μ     r(1)      r(0)      ···   r(n−2)
        ···   ···   ···       ···       ···   ···
        0     μ     r(n−1)    r(n−2)    ···   r(0)    ].

In practice M is approximated by replacing the covariance function r, the constant μ and the standard deviation σ by their estimates.

5.2.6 Example

The time series plotted in Fig. 2.6(b) (p. 8) is a realization of an AR(2) process. The parameters a₁, a₂ and σ may be estimated with the help of the MATLAB routine ar of the Systems Identification Toolbox. The routine assumes that μ = 0. Besides estimates of the parameters the routine also provides the standard deviations of the estimates. The following results are obtained:

  a₁ = 1.5588,   estimate â₁ = 1.5375,    st.dev. 0.0399,
  a₂ = −0.81,    estimate â₂ = −0.8293,   st.dev. 0.0399,
  σ² = 0.0888,   estimate σ̂² = 0.0885,    no st.dev. given.    (5.20)

The estimates of the parameters are rather accurate, and fall well within the error bounds.

Example 5.2.2 (Matlab session). After executing the script of § 2.4.5 (p. 15) to generate the time series the following MATLAB session provides the results as shown:

>> thd = ar(x,2);
>> present(thd)
This matrix was created by the command AR on 11/18 1993 at 20:40. Loss fcn: 0.0885
Akaike's FPE: 0.0903   Sampling interval 1
The polynomial coefficients and their standard deviations are
A =
    1.0000   -1.5375    0.8293
         0    0.0399    0.0399

The example invokes optimized routines from the SYSTEM IDENTIFICATION TOOLBOX which hide the math that is involved. A plain MATLAB script that does essentially the same is:

a1 = 1.5;
a2 = -0.75;
D = [1 -a1 -a2];        % second order AR scheme
nul = roots(D);         % compute zeros of D
abs(nul)                % scheme is stable if < 1
sig = 1;                % set st.dev of eps_t
ep = sig*randn(1,200);  % normally distributed eps_t
x = filter(1,D,ep);     % generate x_t (row)
% Now use the 200 samples of x_t to
% estimate an AR(2) scheme:
X = x(3:200)';          % set up X and W in X = W*theta + eps
W = [x(2:199)' x(1:198)'];
theta = W\X;            % the estimated a1 and a2
epes = X - W*theta;     % residuals (estimate of eps)
sig2 = mean(epes.^2);   % estimate of var of eps
M = W'*W/sig2;          % estimate of Fisher's information matrix M
Mi = inv(M);            % close to var(theta)
disp('true a1, estimate a1 and its st.dev:');
disp([a1 theta(1) sqrt(Mi(1,1))]);
disp('true a2, estimate a2 and its st.dev:');
disp([a2 theta(2) sqrt(Mi(2,2))]);


5.3 Parameter estimation of MA processes

5.3.1 Introduction

We consider the process X_t that is generated by the MA(k) scheme

  X_t = μ + ε_t + b₁ε_{t−1} + ··· + b_k ε_{t−k},  t ≥ 0,    (5.21)

with ε_t white noise with mean 0 and variance σ². Without loss of generality the coefficient of ε_t on the right-hand side has been chosen equal to 1. We look for estimators of the parameters b₁, b₂, ..., b_k, μ, and σ² based on observations of X₀, X₁, ..., X_{N−1}.

From § 2.2.1 (p. 9) we know that the covariance function of the process is given by

  r(τ) = { σ² Σ_{i=|τ|}^{k} b_i b_{i−|τ|}   for |τ| ≤ k,
           0                                 for |τ| > k,    (5.22)

with b₀ = 1. One possibility to estimate the parameters is to replace r(τ) for τ = 0, 1, ..., k in (5.22) with estimates, for instance those from § 4.5 (p. 45). The k+1 (non-linear) equations that result could then be solved for the k+1 parameters b₁, b₂, ..., b_k and σ². It turns out that this procedure is very inefficient.

We therefore contemplate other estimation methods, in particular (non-linear) least squares and maximum likelihood estimation. The least squares solution at this stage is difficult to motivate, but later, when we consider prediction error methods, the motivation is clear.

5.3.2 Non-linear least squares estimators

The stochastic variables X₀, X₁, ..., X_{N−1} are generated by the independent, normally distributed stochastic variables ε_{−k}, ε_{−k+1}, ..., ε_{N−1}. It follows from (5.21) that

  X = μe + Mε,    (5.23)

with e = col(1, 1, ..., 1),

  X = col(X₀, X₁, ..., X_{N−1}),  ε = col(ε_{−k}, ε_{−k+1}, ..., ε_{N−1}),    (5.24)

and M the N × (N+k) band matrix whose row for time t contains the coefficients b_k, b_{k−1}, ..., b₁, 1 in the columns corresponding to ε_{t−k}, ..., ε_{t−1}, ε_t and zeros elsewhere. M depends on the coefficients b₀ = 1, b₁, ..., b_k.

For given coefficients and given observations X the equation (5.23) cannot be solved uniquely for ε. To resolve this difficulty we replace ε_{−k}, ε_{−k+1}, ..., ε_{−1} by their means 0, and consider

  X = μe + M₊ε₊,

with ε₊ = col(ε₀, ε₁, ..., ε_{N−1}) and M₊ the square N × N lower triangular band matrix with ones on the diagonal and the coefficients b₁, ..., b_k on the first k subdiagonals.

Because det M₊ = 1 the square matrix M₊ is invertible and we have

  ε₊ = M₊⁻¹(X − μe).    (5.25)

With the help of ε₊ we form the sum of squares

  ε₊ᵀε₊ = (X − μe)ᵀ(M₊M₊ᵀ)⁻¹(X − μe).    (5.26)

This sum of squares is a non-linear function of the parameters μ and b₁, b₂, ..., b_k. Minimization generally can only be done numerically. In § 5.5 (p. 68) we briefly review the algorithms that are available for this purpose.
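By way of illustration, a minimal MATLAB sketch of minimizing (5.26) for an MA(1) scheme X_t = μ + ε_t + bε_{t−1}; the parametrization, the use of fminsearch and all variable names are our own choices, and x is assumed to be the row vector of observations.

N    = length(x);                                               % x: observations
Mp   = @(b) toeplitz([1; b; zeros(N-2,1)], [1 zeros(1,N-1)]);   % the matrix M_+
cost = @(p) sum((Mp(p(2))\(x(:) - p(1))).^2);                   % eps_+'*eps_+ for p = [mu; b]
p    = fminsearch(cost, [mean(x); 0]);                          % estimates of mu and b
sig2 = cost(p)/N;                                               % estimate of var eps_t

Because M₊ is banded, the same residuals may also be obtained more economically as filter(1, [1 b], x - mu); the explicit matrix form above is only meant to mirror (5.26).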

Setting several variables equal to zero to force the equation (5.23),

  X − μe = Mε,    (5.27)

to have a unique solution is unsatisfactory. The equation (5.27) has more unknowns in the vector ε than there are equations. One of the many solutions of (5.27) is

  ε = Mᵀ(MMᵀ)⁻¹(X − μe).    (5.28)

This is the least squares solution, that is, the solution for which εᵀε is minimal. A proof is listed in Appendix A (p. 96). For this solution the sum of squares is

  εᵀε = (X − μe)ᵀ(MMᵀ)⁻¹(X − μe).    (5.29)

This solution shows a resemblance to (5.26); M replaces M₊. In place of (5.26) we now minimize (5.29) with respect to the parameters μ and b₁, b₂, ..., b_k to obtain estimates. Also here we need to resort to numerical minimization.

Once the parameters μ and b₁, b₂, ..., b_k have been estimated by minimization of (5.26) or (5.29) the variance σ² may be estimated as the sample average of the squares of the residuals. Minimization of (5.29) further needs a bound on the coefficients b₁, b₂, ..., b_k (compare the remark at the end of § 5.4.2).

5.3.3 Maximum likelihood estimator

Starting point for maximum likelihood estimation is the joint probability density function f_{X₀,X₁,...,X_{N−1}}. Again we need to assume that the white noise ε_t is normally distributed so that also the process X_t is normally distributed. From

  X = μe + Mε    (5.30)

it follows for the mean and the variance matrix of X that

  EX = μe,  Σ_X = σ²MMᵀ.    (5.31)

With the help of (3.7) of § 3.2.2 (p. 31) it follows that the multi-dimensional normal probability density function of X is given by

  f_X(x) = (1/( (σ√(2π))^N sqrt(det MMᵀ) )) exp( −(1/(2σ²)) (x − μe)ᵀ(MMᵀ)⁻¹(x − μe) ),    (5.32)

with x = col(x₀, x₁, ..., x_{N−1}). The log likelihood function thus is

  L = −N log(σ√(2π)) − (1/2) log(det MMᵀ) − (1/(2σ²)) (X − μe)ᵀ(MMᵀ)⁻¹(X − μe).    (5.33)

Here we replaced x with X. The log likelihood function needs to be maximized with respect to the parameters. Again this requires numerical optimization. Inspection of (5.33) shows that if the second term, the one with log(det MMᵀ), were missing then maximization of L with respect to μ and b₁, b₂, ..., b_k would yield the same result as minimization of the sum of squares (5.29).

The least squares and maximum likelihood estimation methods of this section are special cases of corresponding methods that are used for the estimation of ARMA schemes. The properties and practical computation of these estimators are reviewed in § 5.4 (p. 64).

5.4 Parameter estimation of ARMA processes

5.4.1 Introduction

As seen in § 5.2 (p. 59) it is rather easy to estimate the parameters of AR processes. Parameter estimation for MA processes is more involved. The same holds for the ARMA processes that are discussed in the present section. We consider ARMA(n, k) schemes of the form

  X_t = μ + a₁X_{t−1} + a₂X_{t−2} + ··· + a_n X_{t−n} + b₀ε_t + b₁ε_{t−1} + ··· + b_k ε_{t−k},    (5.34)

for t = n, n+1, .... Here ε_t is white noise with mean zero and variance σ². Without loss of generality we may take b₀ = 1 if needed. Often we work with centered processes so that μ = 0.

The parameters μ, a₁, a₂, ..., a_n, b₀, b₁, ..., b_k, and σ² are to be estimated on the basis of observations of X₀, X₁, ..., X_{N−1}. The orders n and k are assumed to be known.

ARMA processes are not necessarily identifiable. This means that given the covariance function of an ARMA process, or, equivalently, its spectral density function, there may be several ARMA models with these characteristics. To see this we write the ARMA scheme (5.34) in the compact form D(q)X_t = N(q)ε_t. Then we know from § 2.6 (p. 17) that the spectral density of the corresponding stationary process, under the condition that the scheme is stable, is given by

  φ(ω) = | N(e^{iω}) / D(e^{iω}) |² σ².    (5.35)

The polynomials N and D may be modified in the following ways without affecting φ:

1. N and D may be simultaneously multiplied by the same polynomial factor, say P, without effect on φ. If all roots of P lie inside the unit circle then stability is not affected. Conversely, any common polynomial factors in N and D may be canceled without changing φ.

   To avoid indeterminacies in the model, which complicate the identification, it is necessary to choose the orders n and k as small as possible. If for instance n and k both are 1 larger than the actual orders then inevitably an undetermined common factor is introduced.

2. For stability all roots of D need to have magnitude smaller than 1. For the behavior of φ, however, it makes no difference if a stable factor 1 − aq⁻¹ of D, with |a| < 1, is replaced with an unstable factor q⁻¹ − a. Such a replacement in N similarly has no influence on φ.

   Uniqueness may be obtained by requiring that all roots of D and N have magnitude smaller than 1.

Again we derive least squares and maximum likelihood estimators. Furthermore we introduce in § 5.4.4 (p. 64) the prediction error method as a new but related way to find estimators.

5.4.2 Least squares estimators

The values of X_n, X_{n+1}, ..., X_{N−1} are generated by the initial conditions X₀, X₁, ..., X_{n−1} and by ε_{n−k}, ε_{n−k+1}, ..., ε_{N−1}. Define the vectors

  X = col(X_n, X_{n+1}, ..., X_{N−1}),  X⁰ = col(X₀, X₁, ..., X_{n−1}),  ε = col(ε_{n−k}, ε_{n−k+1}, ..., ε_{N−1}).


Then we have from (5.34)

  RX = Mε + PX⁰ + eμ,

in which R is the lower triangular Toeplitz matrix with first column col(1, −a₁, −a₂, ..., −a_n, 0, ..., 0), M is the band matrix whose row for time t contains the coefficients b_k, b_{k−1}, ..., b₁, b₀ in the columns corresponding to ε_{t−k}, ..., ε_t, P is the matrix

  P = [ a_n   a_{n−1}   ···   a₁
        0     a_n       ···   a₂
        ···   ···       ···   ···
        0     ···       ···   0  ],

and e = col(1, 1, ..., 1). In compact form,

  RX = Mε + PX⁰ + μe.    (5.36)

R is square and non-singular with dimensions (N−n) × (N−n), M has dimensions (N−n) × (N−n+k), and P has dimensions (N−n) × n.

M generally is not invertible, so that (5.36) is not uniquely solvable for ε. The least squares solution for ε is

  ε = Mᵀ(MMᵀ)⁻¹(RX − PX⁰ − μe).    (5.37)

This yields the sum of squares

  εᵀε = (RX − PX⁰ − μe)ᵀ(MMᵀ)⁻¹(RX − PX⁰ − μe).

Minimization with respect to the parameters μ, a₁, a₂, ..., a_n, b₀, b₁, ..., that occur in the coefficient matrices R, M and P yields the desired estimators. After computing the residuals with the help of (5.37) the variance σ² may be estimated as the sample average of the squares of the residuals. As with the MA scheme, we need to bound the coefficients b₀, ..., b_k, say Σ_{j=0}^{k} b_j² ≤ 1. Without a bound on the b_j the matrix M may grow without bound, rendering εᵀε as close to zero as we want.

5.4.3 Maximum likelihood estimator

After the preparations of § 5.4.2 the determination of the maximum likelihood estimator is not difficult. Again we work with normal probability distributions. For the same reason as for the AR scheme we take as likelihood function the conditional probability density of X given X⁰. Inspection of

  X = R⁻¹(Mε + PX⁰ + μe)    (5.38)

shows that the conditional expectation of X given X⁰ = x⁰ equals R⁻¹(Px⁰ + μe). The conditional variance matrix is σ²R⁻¹MMᵀ(R⁻¹)ᵀ. We use formula (3.7) from § 3.2.2 (p. 31) for the multi-dimensional normal probability density. After some algebra using the fact that det R = 1 it follows that

  f_{X|X⁰}(x|x⁰) = (1/( (σ√(2π))^{N−n} sqrt(det MMᵀ) )) exp( −(1/(2σ²)) (Rx − Px⁰ − μe)ᵀ(MMᵀ)⁻¹(Rx − Px⁰ − μe) ).

The log likelihood function thus is

  L = −(N−n) log(σ√(2π)) − (1/2) log det MMᵀ − (1/(2σ²)) (RX − PX⁰ − μe)ᵀ(MMᵀ)⁻¹(RX − PX⁰ − μe).    (5.39)

Here we replace x with X and x⁰ with X⁰. Again we recognize the sum of squares (RX − PX⁰ − μe)ᵀ(MMᵀ)⁻¹(RX − PX⁰ − μe). With the definitions

  Q := [ −P   R ],  X̃ := col(X⁰, X),    (5.40)

we may represent the sum of squares in the more compact form

  (QX̃ − μe)ᵀ(MMᵀ)⁻¹(QX̃ − μe).    (5.41)

Least squares estimation involves minimization of (5.41), maximum likelihood estimation maximization of (5.39). In this case we may fix σ, which implies a bound on the coefficients b_j.

5.4.4 Prediction error method

Maximum likelihood estimation requires maximization of the log likelihood function L as given by (5.39). As we know from § 3.3 (p. 33) maximum likelihood estimators often have the favorable properties of consistency and asymptotic efficiency. The first two terms of the log likelihood function (5.39) turn out to play a secondary role for these two properties. The least squares estimator also has these properties. The least squares estimator, in turn, is a special case of a more general procedure that has become known as the prediction error method (Ljung, 1987).

To explain this method we consider the ARMA scheme

  D(q)X_t = N(q)ε_t.    (5.42)

For simplicity we take μ = 0. Without loss of generality we furthermore assume that the coefficient b₀ of the polynomial N equals 1. Then the coefficient h₀ in the expansion

  H(q) = N(q)/D(q) = h₀ + h₁q⁻¹ + h₂q⁻² + ···    (5.43)

also equals 1. From § 2.8 (p. 24) it follows that the optimal one-step predictor for the scheme is given by

  X̂_{t+1|t} = ( q[N(q) − D(q)] / N(q) ) X_t.    (5.44)


The one-step prediction error is

  e_{t+1} := X_{t+1} − X̂_{t+1|t} = ε_{t+1}.

Hence we may use the one-step predictor to reconstruct the residuals ε_t according to

  ε_t = e_t = X_t − X̂_{t|t−1},  t = 0, 1, ..., N−1.    (5.45)

In this case it follows with (5.44) that

  e_t = X_t − X̂_{t|t−1} = X_t − ( (N(q) − D(q))/N(q) ) X_t = ( D(q)/N(q) ) X_t.    (5.46)

Hence, the residuals may simply be found by inversion of the ARMA scheme. In other applications of the prediction error method the situation is less elementary.

The simplest form of the application of the prediction error method for estimating model parameters is to minimize the sum of the squares of the prediction errors

  (1/N) Σ_{t=0}^{N−1} e_t²    (5.47)

with respect to the parameters a₁, a₂, ..., a_n and b₁, b₂, ..., b_k. The prediction errors e_t are generated from the observations X_t, t = 0, 1, ..., N−1, with the help of the one-step predictor. If the one-step predictor is based on parameter values that do not agree with the actual values then the residuals e_t that are determined differ from the realization of the white noise that generates the process X_t. The idea of the prediction error method is to choose the parameters in the one-step predictor such that the sum of squares (5.47) obtained from the computed residuals is as small as possible. The idea hence is to consider a model for a time series good if it predicts the time series well. This makes sense. Think of the models used for weather forecasts. If the forecasts are often correct, then we may consider the model satisfactory.

Once the parameters a_i and b_i have been estimated the variance σ² is estimated as the sample average of the squares of the prediction errors.

For the initialization of the predictor at time 0 values of X_t and X̂_{t|t−1} for t < 0 are required. The simplest choice is to take them equal to 0.

The practical computation of the minimum prediction error estimates is done numerically by minimization of the sum of squares (5.47). In § 5.5.2 (p. 69) algorithms for this purpose are reviewed.
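To make this concrete, a minimal MATLAB sketch of the prediction error method for an invertible ARMA(1,1) scheme (1 − aq⁻¹)X_t = (1 + bq⁻¹)ε_t, with the observations in the row vector x. The prediction errors e_t = (D(q)/N(q))X_t of (5.46) are computed with filter, with zero initial conditions; the use of fminsearch and the parametrization th = [a; b] are our own choices.

V    = @(th) mean(filter([1 -th(1)], [1 th(2)], x).^2);  % (1/N)*sum e_t^2, cf. (5.46)-(5.47)
th0  = [0; 0];                                           % crude starting values
th   = fminsearch(V, th0);                               % estimates of a and b
sig2 = V(th);                                            % estimate of var eps_t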

The following points are important.

1. The assumed model structure needs to be correct. We return to this problem in § 5.6 (p. 69).

2. The initial values of the parameters need to be suitably chosen to avoid lengthy iteration or convergence to erroneous values. These starting values sometimes may be found from a priori information. Exploratory pre-processing to obtain tentative estimates may be very effective.

To validate the final estimation result it is recommended to compute the residuals once a model has been obtained. After plotting the residuals they may be checked for bias, trend and outliers. The residuals should be a realization of white noise. To test this the covariance function of the residuals may be estimated. If the estimated covariances are within the appropriate confidence intervals then it may safely be assumed that the estimated model is correct. The confidence intervals follow from the variances of the estimated covariances (see Problem 4.4, p. 56).

Example 5.4.1 (Prediction error method for MA(1) schemes). Suppose we choose to model an observed time series X_0, X_1, ..., X_{N−1} by a zero mean invertible MA(1) process X_t = (1 + b q^{-1}) ǫ_t. Using the prediction error method we want to determine the value of b as the one that minimizes the mean squared prediction error (5.47), where

e_t = ǫ_t = ( 1 / (1 + b q^{-1}) ) X_t.   (5.48)

Now suppose the observations were actually generated by an (unknown) zero mean MA(1) scheme X_t = (1 + c q^{-1}) ǫ_t. The prediction error e_t may be related to the white noise ǫ_t as

(1 + b q^{-1}) e_t = X_t = (1 + c q^{-1}) ǫ_t.   (5.49)

Therefore

e_t = ( (1 + c q^{-1}) / (1 + b q^{-1}) ) ǫ_t
    = ǫ_t + ( (c − b) q^{-1} / (1 + b q^{-1}) ) ǫ_t
    = ǫ_t + (c − b) η_{t−1},   where η_{t−1} = ( 1 / (1 + b q^{-1}) ) ǫ_{t−1}.

Note that ǫ_t and η_{t−1} are uncorrelated, so the expectation of e_t^2 equals

E e_t^2 = E ǫ_t^2 + (c − b)^2 E η_{t−1}^2.   (5.50)

For the sum of the squares of the prediction errors we thus have

E (1/N) Σ_{t=0}^{N−1} e_t^2 = (1/N) Σ_{t=0}^{N−1} [ E ǫ_t^2 + (c − b)^2 E η_{t−1}^2 ].   (5.51)

The minimum of this expression is obtained for b = c.   □
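The conclusion of the example is easily checked numerically. The following is a minimal sketch, assuming a true coefficient c and a grid of trial values of b; it simulates one long realization and evaluates the sample version of (5.47) for each trial value.

N = 10000;  c = 0.5;
eps = randn(N, 1);
x = filter([1 c], 1, eps);               % X_t = (1 + c q^{-1}) eps_t
bgrid = -0.9:0.05:0.9;
V = zeros(size(bgrid));
for i = 1:length(bgrid)
  e = filter(1, [1 bgrid(i)], x);        % prediction errors as in (5.48)
  V(i) = mean(e.^2);                     % sample mean square prediction error
end
% apart from sampling fluctuations, V is smallest for trial values close to c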


5.4.5 Accuracy of the minimum prediction error estimator

Ljung (1987) proves that under reasonable assumptions the estimators that are obtained with the prediction error method are consistent. Following his example (§§ 9.2–9.3) we present a heuristic derivation of the asymptotic properties of the estimator. Important for this derivation is that if the parameters of the predictor are equal to the actual parameters then the prediction errors constitute a sequence of independent stochastic variables. The derivation that follows is also valid for other applications of the prediction error method.

We collect the unknown parameters in a column vector

θ = [ a_1  a_2  ···  a_n  b_1  b_2  ···  b_k ]^T.   (5.52)

To make the dependence on the parameters explicit we write the sum of the squares of the prediction errors as

(1/N) Σ_{t=0}^{N−1} e_t^2(θ).   (5.53)

The minimum prediction error estimator θ̂_N minimizes this sum. Therefore, the gradient of (5.53) with respect to θ at the point θ̂_N equals 0. Evaluation of this gradient and substitution of θ = θ̂_N yields

Σ_{t=0}^{N−1} e_t(θ̂_N) e_t^θ(θ̂_N) = 0_{(n+k)×1}.   (5.54)

Here e_t^θ denotes the gradient of e_t with respect to θ. Suppose that θ_0 is the correct value of the parameter vector. Write θ̂_N = θ_0 + (θ̂_N − θ_0) and suppose that θ̂_N − θ_0 is small. Then it follows by Taylor expansion of (5.54) about the point θ_0 that

0 = Σ_{t=0}^{N−1} e_t(θ̂_N) e_t^θ(θ̂_N)
  = Σ_{t=0}^{N−1} e_t(θ_0 + θ̂_N − θ_0) e_t^θ(θ_0 + θ̂_N − θ_0)
  ≈ Σ_{t=0}^{N−1} e_t(θ_0) e_t^θ(θ_0) + Σ_{t=0}^{N−1} [ e_t^θ(θ_0) e_t^θ(θ_0)^T + e_t(θ_0) e_t^{θθ^T}(θ_0) ] (θ̂_N − θ_0).

Here e_t^{θθ^T} is the Hessian of e_t with respect to θ. We rewrite this expression as

Σ_{t=0}^{N−1} e_t(θ_0) e_t^θ(θ_0) ≈ − ( Σ_{t=0}^{N−1} [ e_t^θ(θ_0) e_t^θ(θ_0)^T + e_t(θ_0) e_t^{θθ^T}(θ_0) ] ) (θ̂_N − θ_0).   (5.55)

We introduce the further approximation

Σ_{t=0}^{N−1} [ e_t^θ(θ_0) e_t^θ(θ_0)^T + e_t(θ_0) e_t^{θθ^T}(θ_0) ] ≈ Σ_{t=0}^{N−1} E e_t^θ(θ_0) e_t^θ(θ_0)^T ≈ N M.   (5.56)

On the left both terms have been approximated by their expectations. The expectation of the second term on the left is zero (see Problem 5.7, p. 73). On the right we have

M := lim_{N→∞} (1/N) Σ_{t=0}^{N−1} E e_t^θ(θ_0) e_t^θ(θ_0)^T.   (5.57)

With this we obtain from (5.55)

θ̂_N − θ_0 ≈ −M^{-1} ( (1/N) Σ_{t=0}^{N−1} e_t(θ_0) e_t^θ(θ_0) ).   (5.58)

Consider the sum

(1/N) Σ_{t=0}^{N−1} e_t(θ_0) e_t^θ(θ_0).   (5.59)

The terms e_t(θ_0) e_t^θ(θ_0) asymptotically have expectation zero and asymptotically are uncorrelated (see Problem 5.7, p. 73). Therefore, the variance of the sum (5.59) approximately equals

var( (1/N) Σ_{t=0}^{N−1} e_t(θ_0) e_t^θ(θ_0) ) ≈ (1/N²) Σ_{t=0}^{N−1} E e_t^2(θ_0) e_t^θ(θ_0) e_t^θ(θ_0)^T
   ≈ (σ²/N²) Σ_{t=0}^{N−1} E e_t^θ(θ_0) e_t^θ(θ_0)^T
   ≈ (σ²/N) M.

With the help of this it follows from (5.58) that asymptotically

var(θ̂_N − θ_0) ≈ M^{-1} var( (1/N) Σ_{t=0}^{N−1} e_t(θ_0) e_t^θ(θ_0) ) M^{-1} ≈ (σ²/N) M^{-1}.

From (5.58) it is seen that the estimation error θ̂_N − θ_0 is the sum of N terms. By the central limit theorem we therefore expect that √N (θ̂_N − θ_0) asymptotically is normally distributed with mean 0 and variance matrix σ² M^{-1}.

The derivation that is presented is heuristic but the results may be proved rigorously.

To compute the asymptotic variance of the estimation error the matrix M of (5.57) is needed. In practice this matrix is estimated as the sample average

M̂ = (1/N) Σ_{t=0}^{N−1} e_t^θ(θ̂_N) e_t^θ(θ̂_N)^T.   (5.60)


The gradients e_t^θ of the residuals that occur in (5.60) are needed for the numerical minimization of the sum of squares (5.53) and, hence, are available. This is explained in § 5.5 (p. 68). The matrix M itself is often also needed in this minimization, depending on the algorithm that is used.

The gradient e_t^θ of the residuals with respect to the parameter θ is thus needed both for the numerical minimization of the sum of squared prediction errors and for assessing the accuracy of the final estimates, as explained in § 5.4.5. See Problem 5.7.
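A minimal sketch of the accuracy computation, assuming a routine pred_err(theta) is available (a hypothetical name) that returns the column vector of prediction errors e_t(θ) for the data at hand; here the gradients are obtained by central finite differences rather than from the schemes of Problem 5.7:

p = length(thetaN);  e0 = pred_err(thetaN);  N = length(e0);
G = zeros(N, p);  d = 1e-5;
for i = 1:p
  dtheta = zeros(p,1);  dtheta(i) = d;
  G(:,i) = (pred_err(thetaN + dtheta) - pred_err(thetaN - dtheta)) / (2*d);  % columns e_t^theta
end
M  = (G'*G)/N;                   % sample average (5.60)
s2 = mean(e0.^2);                % estimate of sigma^2
covtheta = s2 * inv(M) / N;      % asymptotic variance matrix sigma^2 M^{-1}/N
stddev = sqrt(diag(covtheta));   % standard deviations of the estimates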

5.4.6 Example

We apply the minimum prediction error method to the AR(2) time series of Fig. 2.6(b) (p. 8). The parameters a_1 and a_2 may be estimated with the MATLAB routine armax, which implements the minimum prediction error algorithm. The routine assumes that µ = 0. The following result is obtained.

a_1 = 1.5588,   estimate a_1 = 1.5424,   st.dev. 0.0399,
a_2 = −0.81,    estimate a_2 = −0.8362,  st.dev. 0.0399,
σ² = 0.0888,    estimate σ² = 0.0885,    no st.dev. given.

The results are not the same as those of the least squares method of § 5.2.6 (p. 62), but are not significantly different.

Example 5.4.2 (Matlab session). After executing the script of § 2.4.5 (p. 15) to generate the time series, the following MATLAB session yields the results that are shown.

>> thd = armax(x,[2 0]);

...........

>> present(thd)

This matrix was created by the command ARMAX

on 11/22 1993 at 19:45 Loss fcn: 0.0885

Akaike‘s FPE: 0.0903 Sampling interval 1

The polynomial coefficients and their

standard deviations are

A =

1.0000 −1.5424 0.8362

0 0.0399 0.0399

At the position of the dots armax displays several intermediate results of the iterative optimization process.   □

5.5 Non-linear optimization

5.5.1 Non-linear optimization

This section follows pp. 282–284 of Ljung (1987).

The implementation of the various estimation methods that are presented in § 5.3 (p. 63) and § 5.4 (p. 64) relies on the numerical minimization of a non-linear scalar function f(x) with respect to the vector-valued variable x ∈ R^n.

All numerical optimization procedures are iterative. Often, given an intermediate point x^i, the next point x^{i+1} is determined according to

x^{i+1} = x^i + α p^i.   (5.61)

Here p^i ∈ R^n is the search direction and the positive constant α the step size. The search direction is determined on the basis of information about f that is accumulated in the course of the iterative process. Depending on the information that is made available, the methods for determining the search direction may be classified into three groups.

1. Methods that are only based on values of the function f.

2. Methods that use both function values and values of the gradient f_x of f.

3. Methods that use function values, the values of the gradient and of the Hessian f_{xx^T} of f, that is, the matrix of second order partial derivatives of f(x) with respect to the elements of x.

A well known method from group 2 is the steepest descent method, where the search direction is chosen as

p^i = −f_x(x^i).   (5.62)

This method follows from the Taylor expansion

f(x^{i+1}) = f(x^i + α p^i) = f(x^i) + α f_x^T(x^i) p^i + O(α²),

which holds if f is sufficiently smooth. If the last term on the right-hand side is neglected then the greatest change is obtained by choosing the search direction according to (5.62). By varying α a search is performed in the search direction until a sufficiently large decrease of f is obtained. This is called a line search. The steepest descent method typically provides good progress initially. Near the minimum progress is very slow because the gradient f_x becomes smaller and smaller.

A typical representative of group 3 is the Newton-Raphson algorithm. Here the search direction is the "Newton direction"

p^i = −f_{xx^T}^{-1}(x^i) f_x(x^i).   (5.63)

This direction results by including a quadratic term in the Taylor expansion:

f(x^i + α p^i) = f(x^i) + α f_x^T(x^i) p^i + (1/2) α² (p^i)^T f_{xx^T}(x^i) p^i + O(α³).

For α = 1 the right-hand side of this expression (neglecting the last term) is minimal if the search direction p^i is chosen according to (5.63). Also with the Newton algorithm line searching is normally used, with α = 1 as starting value. The Newton algorithm may be unpredictable in the beginning of the iterative process but often converges very fast once the neighborhood of the minimum has been reached.

A disadvantage of the Newton method is that besides the gradient also the Hessian needs to be computed in each iteration point. This implies that formulas for the gradient and the Hessian need to be available and must be coded. This is often problematic, certainly for the Hessian. There exist several algorithms, known as quasi-Newton algorithms, where the Hessian is approximated or is estimated during the iteration on the basis of the values of the gradients that are successively computed, and, hence, no explicit formulas are needed.

Group 1 contains methods where the gradient is estimated by taking finite differences. Other methods in this group use specific search patterns.

For all methods the following holds:

1. The function f needs to be sufficiently smooth for the search to converge.

2. The algorithm may converge to a local minimum rather than to the desired global minimum. The only remedy is to choose the starting point "suitably."

There is an ample choice of standard software for these algorithms. Within MATLAB the Optimization Toolbox provides for this.

5.5.2 Algorithms for non-linear least squares

In the case of non-linear least squares the function f that needs to be minimized has the form

f(x) = (1/2) Σ_{j=1}^{N} e_j^2(x),   (5.64)

where the e_j(x) are more or less complicated functions of x. The structure of f allows the minimization algorithm to be structured correspondingly. The gradient of f is the column vector

f_x(x) = Σ_{j=1}^{N} e_j(x) g_j(x),   (5.65)

where g_j is the gradient of e_j(x) with respect to x. By differentiating once again the Hessian of f follows as

f_{xx^T}(x) = Σ_{j=1}^{N} g_j(x) g_j^T(x) + Σ_{j=1}^{N} e_j(x) h_j(x),   (5.66)

with h_j the Hessian matrix of e_j(x) with respect to x. We see that to apply the Newton algorithm to the least squares problem in principle the Hessian of the residuals e_j is needed. For the algorithm to work well it is important that the Hessian f_{xx^T} be correct in the neighborhood of the minimum. For a well posed least squares problem the residuals e_j near the minimum approximately are realizations of uncorrelated stochastic variables with zero mean. Then the mean of the second term on the right-hand side of (5.66) is (approximately) zero. Therefore near the minimum we may assume f_{xx^T}(x) = Σ_{j=1}^{N} g_j(x) g_j^T(x). This idea leads to a quasi-Newton algorithm where the search direction is given by

p^i = −H^{-1}(x^i) f_x(x^i),   (5.67)

where the gradient f_x is given by (5.65) and the matrix

H(x) = Σ_{j=1}^{N} g_j(x) g_j^T(x)   (5.68)

is an approximation of the Hessian. If the step size α is always chosen as 1 then this is called the Gauss-Newton algorithm. If α is adapted to smaller values then the method is sometimes known as the damped Gauss-Newton algorithm.

The matrix H as given by (5.68) is always non-negative definite. In some applications, such as when the model is over-parameterized or the data contain insufficient information, it may happen that H is singular or almost singular. This causes numerical problems. A well known way to remedy this is to modify H to

H(x) = Σ_{j=1}^{N} g_j(x) g_j^T(x) + δI,   (5.69)

with δ a small positive number that is to be chosen suitably. This is known as the Levenberg-Marquardt algorithm. As argued, H(x) is a good approximation of the Hessian near the minimum. Away from the minimum the approximation error may be considerable, but the fact that H(x) is nonsingular and nonnegative definite guarantees that the p^i of (5.67) is a direction of descent, that is, f(x + αp^i) < f(x) for small enough α > 0.
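A minimal sketch of a damped Gauss-Newton iteration with the modification (5.69), assuming a user-supplied routine res(x) (a hypothetical name) that returns the column vector of residuals e_j(x) and the matrix J whose jth row is g_j(x)^T:

delta = 1e-6;
for iter = 1:50
  [e, J] = res(x);                      % residuals and their gradients at x
  g = J' * e;                           % gradient f_x(x), see (5.65)
  H = J' * J + delta * eye(length(x));  % approximate Hessian (5.69)
  p = -H \ g;                           % search direction (5.67)
  alpha = 1;                            % line search: halve alpha until f decreases
  while sum(res(x + alpha*p).^2) >= sum(e.^2) && alpha > 1e-8
    alpha = alpha/2;
  end
  x = x + alpha * p;
end

The loop is a sketch only; a practical implementation also needs a convergence test and a strategy for adapting δ.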

For the application of these least squares algorithms to the estimation of ARMA schemes, formulas for the gradient of the residuals e_j with respect to the parameters a_1, a_2, ..., a_n and b_0, b_1, ..., b_k need to be available. Such formulas are derived by Ljung (1987); see Problem 5.7 (p. 73).

5.6 Order determination

5.6.1 Introduction

In this section we review the problem of determining the order of the most suitable model. We successively discuss several possibilities.

5.6.2 Order determination with the covariance function

As we saw when discussing AR, MA and ARMA processes in § 2.4 (p. 11), § 2.2 (p. 9), and § 2.5 (p. 16), these processes are characterized by a specific behavior of the covariance function. MA(k) processes have a covariance function that is exactly zero for time shifts greater than k. The covariance function of AR and ARMA processes decreases exponentially. To obtain a first impression of the nature and possible order of the process it is in any case very useful to compute and plot the covariance function.

Because of the estimation inaccuracies it often is not possible to distinguish whether the covariance function decreases exponentially to zero or becomes exactly zero after a finite number of time shifts. If the covariance drops to zero fast then it is obvious to consider an MA scheme. The order then may be estimated as that time shift from which the covariance remains within a confidence interval around zero whose size is to be determined.

If the covariance function decreases slowly then it is better to switch to an AR or ARMA scheme. Such schemes probably describe the process with fewer parameters than an MA scheme.

5.6.3 Order determination with partial correlations

In § 2.4.4 (p. 14) it is shown that the partial correlations of an AR(n) scheme become exactly zero for time shifts greater than n. If the MA scheme has been eliminated as a suitable model then it is useful to estimate and plot the partial correlations.

We briefly review the notion of partial correlations. The Yule-Walker equations of an AR(n) process are

ρ(k) = Σ_{i=1}^{n} a_{ni} ρ(k − i),   k = 1, 2, ....   (5.70)

The a_{ni} are the coefficients of the scheme and ρ is the correlation function. For a given correlation function ρ the coefficients of the scheme may be computed from the set of n linear equations that results by considering the Yule-Walker equations for k = 1, 2, ..., n. Even if the process is not necessarily an AR process, for given n the coefficients a_{n1}, a_{n2}, ..., a_{nn} may still be computed. The last coefficient a_{nn} is the nth partial correlation coefficient of the process. The partial correlation coefficients may be computed recursively with the Levinson-Durbin algorithm (2.65–2.66) (p. 14). For a given realization of length N of a process with unknown correlation function the correlations may be estimated as

ρ̂_N(k) = r̂_N(k) / r̂_N(0),   (5.71)

with r̂_N an estimate of the covariance function such as proposed in § 4.5 (p. 45). By substituting these estimates for ρ in (5.70) estimates of the partial correlation coefficients are obtained. Like for the covariance function it often is difficult to distinguish whether the partial correlations become exactly zero or decrease exponentially to zero. It may be proved that for time shifts greater than n the estimated partial correlations of an AR(n) process are approximately independent with zero mean and variance equal to 1/N. For large N they are moreover with good accuracy normally distributed. With these facts a confidence interval may be determined, for instance the interval (−2√(1/N), 2√(1/N)). With this confidence interval the order of a potential AR scheme may be established. If this order is large then it is better to resort to an ARMA scheme.
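A minimal sketch of this computation, assuming x is a centered realization stored as a column vector and maxlag is the largest order that is tried; only built-in MATLAB operations are used:

N = length(x);
r = zeros(maxlag+1, 1);
for k = 0:maxlag
  r(k+1) = sum(x(1+k:N) .* x(1:N-k)) / N;   % covariance estimate r_N(k)
end
rho = r / r(1);                             % correlation estimate (5.71)
pacf = zeros(maxlag, 1);
for n = 1:maxlag
  R = toeplitz(rho(1:n));                   % matrix with entries rho(|i-j|)
  a = R \ rho(2:n+1);                       % Yule-Walker equations (5.70) for k = 1, ..., n
  pacf(n) = a(n);                           % nth partial correlation a_nn
end
% estimated partial correlations outside (-2/sqrt(N), 2/sqrt(N)) are significantly nonzero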

5.6.4 Analysis of the residuals

It is to be expected that as the model is made more complex (that is, as the order n of the AR part and the order k of the MA part are taken larger) the estimated variance σ̂² of the residuals ǫ_t becomes smaller. Once the correct values of the orders have been reached the mean square error no longer decreases fast as complexity is increased. This provides an indication for the correct values of n and k.

All estimation methods that are discussed yield estimates for the residuals ǫ_t. If the process has been correctly estimated with the right model then the residuals are a realization of white noise. Estimating the correlation function of the residuals and checking whether for nonzero time shifts the correlations are sufficiently small constitutes a very useful test for the reliability of the model.

5.6.5 Information criteria

In the previous subsection it is noted that if the complexity of the model increases the estimated variance of the residuals decreases. If the correct complexity is reached a "knee" occurs in the plot. Sometimes it is not easy to establish where the knee is located. If complexity is further increased the variance of the residuals keeps decreasing, with an accompanying over-parameterization of the model.

The converse holds for the value L_max that the log likelihood function assumes for the estimated model. L_max increases monotonically with the complexity of the model. The quantity −L_max decreases monotonically.

Based on information theoretic arguments Akaike (see for instance Ljung (1987)) proposed Akaike's information criterion

IC_Akaike = −L_max + 2m/N,   (5.72)

with m the total number of parameters in the model. For the ARMA scheme we have m = n + k + 1. The second term in the criterion is a measure for the complexity of the model. With increasing m the first term of the criterion decreases and the second increases. The best model follows by minimization of the information criterion with respect to m.

On the basis of other information theoretic considerations Rissanen (also see Ljung (1987)) derived the minimum length information criterion, also known as Rissanen's information criterion,

IC_Rissanen = −L_max + (m log N)/N.   (5.73)

Rissanen's criterion yields consistent estimators for both the order and the model parameters. Because the second term of IC_Rissanen has more weight than that of IC_Akaike, Rissanen's criterion generally yields models of lower order than that of Akaike.

Akaike and Rissanen's criteria may also be applied when the prediction error method is used. Let V be the minimal mean square prediction error

V = (1/N) Σ_{t=0}^{N−1} e_t^2(θ̂_N).   (5.74)

Then Akaike's information criterion takes the form

(1 + 2m/N) V,   (5.75)

and Rissanen's criterion becomes

(1 + (m log N)/N) V.   (5.76)

A last criterion is the final prediction error (FPE) criterion of Akaike. It is given by

( (1 + m/N) / (1 − m/N) ) V,   (5.77)

and equals the mean variance of the one-step prediction error if the model is applied to a different realization of the time series than the one that is used to estimate the model (Ljung, 1987, § 16.4). For m ≪ N the criterion reduces to Akaike's information criterion.
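A minimal sketch of the model selection computation, assuming that V is a vector with the mean square prediction error of each candidate model, m the corresponding vector of parameter counts, and N the number of observations:

aic = (1 + 2*m/N) .* V;                % prediction error form of Akaike's criterion (5.75)
ric = (1 + m*log(N)/N) .* V;           % Rissanen's criterion (5.76)
fpe = ((1 + m/N) ./ (1 - m/N)) .* V;   % final prediction error criterion (5.77)
[fpemin, best] = min(fpe);             % index of the model with the smallest FPE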

5.6.6 Example

By way of example we consider the time series of Fig. 1.4 (p. 2), which represents the monthly production of bituminous coal in the USA during 1952–1959. There are 96 data points.

We begin by centering the time series (with the MATLAB function detrend) and dividing by 10000 to scale the data. The resulting time series is plotted in Fig. 5.2.

As a first exploratory step we compute and plot the correlation function. For this purpose the MATLAB routine cor is available, which was especially developed for the Time Series Analysis and Identification Theory course. Figure 5.3(a) shows the result. Inspection reveals that the correlation function decays more or less monotonically to zero. An AR scheme of order 1 or 2 may explain this correlation function well.

Next we calculate the partial autocorrelation function. For this the routine pacf has been developed. Figure 5.3(b) shows the result, again with confidence limits. The partial autocorrelations decay rather abruptly.

Figure 5.2: Centered and scaled bituminous coal production (x_t versus t [month])

Figure 5.3: (a) Estimated correlation function. Dashed: confidence limits. (b) Estimated partial correlation function. Dashed: confidence limits.


ARMA scheme   σ²       FPE      a_1        a_2        a_3        b_1
(1, 0)        0.1007   0.1028   0.7288
                                (0.0729)
(2, 0)        0.0907   0.0945   0.5008     0.2689
                                (0.0980)   (0.0954)
(3, 0)        0.0898   0.0956   0.4808     0.2575     0.0664
                                (0.1037)   (0.1073)   (0.1013)
(2, 1)        0.0905   0.0963   0.5395     0.2435                −0.0474
                                (0.3392)   (0.2503)              (0.3573)

Table 5.1: Estimation results for ARMA schemes. In parentheses: standard deviations of the estimates.

This also indicates that an AR scheme of low order explains the process.

We now use the routine armax from the Systems Identification Toolbox to estimate several ARMA schemes by the prediction error method. Table 5.1 summarizes the results. Comparison of the outcomes for the AR(1) and AR(2) schemes shows that the AR(2) scheme is better than the AR(1) scheme, and that the extra estimated coefficient a_2 differs significantly from 0. If we go from the AR(2) to the AR(3) scheme then we note that the extra coefficient a_3 no longer differs significantly from 0 (compared with the standard deviation). Of the three AR schemes the AR(2) scheme has the smallest final prediction error (FPE).

The last row of the table shows that refinement of the AR(2) scheme to an ARMA(2, 1) scheme is not useful. The extra estimated coefficient b_1 is not significantly different from 0. Also all standard deviations are considerably larger than for the AR(2) scheme. This is an indication of over-parameterization. In the case of over-parameterization the matrix M of § 5.4.5 (p. 67) becomes singular or near-singular, resulting in large but meaningless numbers in the estimated variance matrix. Finally, in Table 5.1 the FPE of the ARMA(2, 1) scheme is larger than for the AR(2) scheme. The FPE is minimal for the AR(2) scheme. To validate the estimated AR(2) model we compute the residuals (with the routine resid). Figure 5.4(a) displays the result. Near the time 10 two outliers occur. These may also be found in the original time series. The routine resid also produces the correlation function of the residuals plotted in Fig. 5.4(b). This correlation function gives no cause at all to doubt the whiteness of the residuals.

5.7 Problems

5.1 Recursive least squares for mean estimation. Consider the scheme X_t = µ + ǫ_t, with ǫ_t zero mean white noise. We want to estimate θ = µ. Determine for this scheme the recursive least squares expressions for θ̂_N, P_N and the vector gain K_N. What can you say about lim_{N→∞} P_N?

Figure 5.4: (a) Residuals for the estimated AR(2) model. (b) Correlation function of the residuals for the estimated AR(2) model. Dashed: confidence limits.

5.2 Recursive least squares for zero mean processes. Consider the stable AR(n) scheme D(q)X_t = ǫ_t without the term µ. It is hence assumed that we know that E X_t = E ǫ_t = 0. This simplifies the recursive least squares solution a little. Show that

E W^T W = (N − n) [ r(0)         r(τ_1)       ···  r(τ_{n−1})
                    r(τ_1)       r(0)         ···  r(τ_{n−2})
                      ⋮             ⋮         ⋱       ⋮
                    r(τ_{n−1})   r(τ_{n−2})   ···  r(0)       ]

where W is the W-matrix of (5.3) without the first column (which corresponds to µ).

5.3 Matrix inverse. Verify (5.9).

5.4 Estimation of seasonal model (examination May 25, 1993). Consider the seasonal model

Y_t − c q^{−P} Y_t = ǫ_t,   t = 1, 2, ...,   (5.78)

with the natural number P the period, c a real constant, and ǫ_t zero mean white noise with variance σ².

a) Prove that the model is stable if and only if |c| < 1. Assume that this condition is satisfied in the remainder of the problem.

b) Prove that the mean value function of the stationary process defined by (5.78) is identical to 0. Also prove that the covariance function r(τ) = E Y_t Y_{t+τ} is given by

r(τ) = ( σ²/(1 − c²) ) c^{|τ|/P}   for |τ| = 0, P, 2P, ...,
r(τ) = 0                           for other values of τ.   (5.79)

c) Determine the joint probability density function of Y_P, Y_{P+1}, ..., Y_{N−1}, given Y_0, Y_1, ..., Y_{P−1}, with N > P.

d) Determine maximum likelihood estimators ĉ and σ̂² for the model parameters c and σ² based on the probability density function obtained in Part (c).

e) Instead of the conditional probability density function

f_{Y_P, Y_{P+1}, ..., Y_{N−1} | Y_0, Y_1, ..., Y_{P−1}}   (5.80)

that is chosen in (c), consider the unconditional probability density function

f_{Y_0, Y_1, ..., Y_{N−1}}   (5.81)

based on the assumption that the joint probability density of Y_0, Y_1, ..., Y_{P−1} equals the stationary density. What form does the likelihood function now take? What is the effect on the resulting maximum likelihood estimates for large values of N?

5.5 Variance of estimates. We consider the problem of estimating the coefficients θ = [a_1 a_2 ··· a_n]^T in the AR(n) model (5.1) under the assumption that µ = 0, and that we know that µ = 0. As usual we base our estimation on N observations X_0, X_1, ..., X_{N−1}.

a) Under some assumptions Subsection 5.2.5 explains how to obtain estimates of var(θ̂), where θ̂ is the maximum likelihood estimator of Subsection 5.2.4. Show that for our case this specializes to

var(θ̂) ≈ ( σ²/(N − n) ) [ r_N(0)      r_N(1)      ···  r_N(n−1)
                           r_N(1)      r_N(0)      ···  r_N(n−2)
                             ⋮            ⋮        ⋱       ⋮
                           r_N(n−1)    r_N(n−2)    ···  r_N(0)   ]^{−1}

b) Now consider the AR(1) and AR(2) models

X_t = a_{11} X_{t−1} + ǫ_t,     X_t = a_{12} X_{t−1} + a_{22} X_{t−2} + ǫ′_t.

Show with the above approximation of the variances that

var(â_{11}) ≤ var(â_{12}).

c) Explain in words why the above is intuitive. That is, explain that it is intuitive that the variances increase if there are more degrees of freedom in the model. [It may be useful to consider the special case that the realization on which the estimation is based is constant: X_0 = X_1 = ··· = X_{N−1}.]

5.6 Scaling maximum likelihood estimator. Consider the likelihood function L defined in (5.39). It is a function of (σ, b_0, ..., b_k, a_1, ..., a_n, µ). Show that for any λ ≠ 0,

L(σ, b_0, ..., b_k, a_1, ..., a_n, µ) = L(λσ, (1/λ)b_0, ..., (1/λ)b_k, a_1, ..., a_n, µ).

Explain this result in terms of the ǫ_t.

5.7 Gradients of the residuals. In this problem we demonstrate how the gradient e_t^θ may be computed in practice. We also consider some theoretical properties of this gradient. From (5.46) we know that the residuals e_t may be determined from the observed time series X_t according to the inverse scheme

N(q) e_t = D(q) X_t,   t = 0, 1, ..., N−1.   (5.82)

The recursive computation of the residuals is usually initialized by choosing the missing values of X_t and e_t for t < 0 equal to 0. By assumption the inverse scheme is stable. The parameters that are estimated are the coefficients of the polynomials

N(q) = 1 + b_1 q^{−1} + b_2 q^{−2} + ··· + b_k q^{−k},
D(q) = 1 − a_1 q^{−1} − a_2 q^{−2} − ··· − a_n q^{−n}.

We define e_t^{a_i} as the gradient of e_t with respect to a_i and e_t^{b_i} as the gradient of e_t with respect to b_i.

a) Verify that the gradients e_t^{a_i} may be computed by application of the scheme

N(q) e_t^{a_i} = −q^{−i} X_t,   t = 0, 1, ..., N−1,

to the observed time series. How is the computation initialized?

b) Also verify that the gradients e_t^{b_i} may be computed by application of the scheme

N(q) e_t^{b_i} = −q^{−i} e_t,   t = 0, 1, ..., N−1,

to the residuals. Again, how is the computation initialized?


c) Suppose that the polynomials D and N that are used to generate the residuals e_t and the gradients are precisely equal to the polynomials of the ARMA scheme that generated X_t. Prove that for t → ∞ the gradients e_t^{a_i} and e_t^{b_j} are statistically independent of e_t.

d) Prove that for t → ∞ the stochastic variable e_t(θ_0) e_t^{θθ^T}(θ_0) has expectation zero.

e) Prove that asymptotically for t → ∞ and s → ∞

E e_t(θ_0) e_s(θ_0) e_t^θ(θ_0) e_t^θ(θ_0)^T = 0                                   for s ≠ t,
E e_t(θ_0) e_s(θ_0) e_t^θ(θ_0) e_t^θ(θ_0)^T = σ² E e_t^θ(θ_0) e_t^θ(θ_0)^T        for s = t.

5.8 Minimum prediction error for an MA(1) process (examination May 1994). In this problem we consider the minimum prediction error estimation of the parameter b of the MA(1) process

X_t = ǫ_t + b ǫ_{t−1},   t ∈ Z,   (5.83)

with |b| < 1. We assume the process to be invertible. The process ǫ_t, t ∈ Z, is white noise with mean 0 and variance σ².

a) What is the one-step prediction error e_t = X_t − X̂_{t|t−1} of this process? What is the most convenient choice of X̂_{0|−1}?

b) Given a realization X_t, t = 0, 1, ..., N−1, of the process, how may the one-step prediction errors e_t, t = 0, 1, ..., N−1, be determined?

c) For the implementation of a suitable minimization algorithm (for instance a quasi-Newton algorithm) for the minimum prediction error estimation of b it is necessary to compute, besides the mean square prediction error

V = (1/N) Σ_{t=0}^{N−1} e_t^2,   (5.84)

also the gradient

∂V/∂b   (5.85)

of V with respect to the parameter b. Derive an algorithm for computing this gradient. (Use Problem 5.7.)

5.9 MA(10) process. Application of armax to the MA(10) process of § 2.2.2 (p. 10) leads to incorrect results. Why?


6 System identification

Figure 6.1: System with input signal u and noise v

6.1 Introduction

This chapter is devoted to system identification. Figure 6.1 repeats the paradigm for system identification that is described in § 1.1.2 (p. 1). We study the problem how to reconstruct the dynamical properties of the system and perhaps also the statistical properties of the noise v from the recorded signals u and y.

The paradigm may be detailed in different ways. We give two examples.

1. The system is assumed to be linear and time-invariant, with an impulse response about which nothing is known. The signal v is assumed to be a realization of a wide-sense stationary stochastic process, with unknown covariance function.

The problem is to estimate the unknown impulse response and covariance function from the observations. This is an example of non-parametric system identification.

2. It is assumed that based on known first principles a structured dynamical model may be constructed for the system and the signal v, but that this model contains a number of parameters with unknown values.

The problem is to estimate the values of these parameters from the observations. This is an example of parametric system identification.

In § 6.2 we discuss Case 1. In the remaining sections of this chapter Case 2 is studied.

6.2 Non-parametric system identification

6.2.1 Introduction

In this section we study Case 1 of § 6.1. We assume that the system of Fig. 6.1 is linear, time-invariant and causal. This means that the system is described by an equation z_t = Σ_{m=0}^{∞} h_m u_{t−m} and that the system is fully determined by the function h_m, known as the impulse response. Now

y_t = Σ_{m=0}^{∞} h_m u_{t−m} + v_t,   t ∈ Z,   (6.1)

and we consider the question how the impulse response h_m and the properties of the signal v_t may be estimated from the N observation pairs

(u_t, y_t),   t = 0, 1, ..., N−1,   (6.2)

of the input and output signal.

6.2.2 Impulse and step response analysis

If the noise v_t is absent or very small then the function h may be measured directly by choosing the input signal as the "impulse"

u_t = u_0   for t = 0,
u_t = 0     for t > 0,   (6.3)

with u_0 a constant that is chosen as large as possible without violating the model assumptions. Then we have

y_t = u_0 h_t + v_t ≈ u_0 h_t,   t ≥ 0.

We estimate the impulse response according to ĥ_t = y_t/u_0, t = 0, 1, .... Obviously this experiment is only possible if the input signal u may be chosen freely. Also, the system needs to be "at rest" at the initial time 0.

Another possibility is to choose a step function

u_t = u_0   for t ≥ 0   (6.4)

for the input signal. If the system is initially at rest then the response is

y_t = u_0 Σ_{m=0}^{t} h_m + v_t ≈ u_0 Σ_{m=0}^{t} h_m,   t ≥ 0.

The estimates for the impulse response follow recursively as

ĥ_0 = y_0/u_0,
ĥ_t = (1/u_0)(y_t − y_{t−1}),   t = 1, 2, ....

Both methods to estimate the impulse response usually are only suitable to establish certain rough properties of the system, such as the static gain, the dominant time constants, and the intrinsic delay, if any is present.
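A minimal sketch of the step response computation, assuming y is the measured response (as a column vector) to a step of height u0 applied at time 0 to a system that is initially at rest:

h = zeros(size(y));
h(1)     = y(1)/u0;          % h_0 = y_0/u_0
h(2:end) = diff(y)/u0;       % h_t = (y_t - y_{t-1})/u_0 for t >= 1
% conversely, an estimated impulse response h yields the step response cumsum(h)*u0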


6.2.3 Frequency response analysis

Suppose that the input is chosen as the complex-harmonic signal

u_t = u_0 e^{iω_0 t},   t ≥ 0,   (6.5)

with the real constants u_0 and ω_0 the amplitude and the angular frequency, respectively. We assume that the system is BIBO stable, which means that the output z_t is bounded for any bounded input u_t. A system (6.1) is BIBO stable if and only if Σ_{m=0}^{∞} |h_m| < ∞. By substitution into (6.1) it follows that

y_t = Σ_{m=0}^{t} h_m u_0 e^{iω_0(t−m)} + v_t = ( Σ_{m=0}^{t} h_m e^{−iω_0 m} ) u_0 e^{iω_0 t} + v_t,   t ≥ 0.

For t → ∞ we find

y_t → h(ω_0) u_0 e^{iω_0 t} + v_t.   (6.6)

Here

h(ω_0) = Σ_{m=−∞}^{∞} h_m e^{−iω_0 m},   −π ≤ ω_0 < π,   (6.7)

is the DCFT of the impulse response h and, hence, the frequency response function of the system. By writing

h(ω_0) = |h(ω_0)| e^{iα(ω_0)},   (6.8)

with α(ω_0) the argument of the complex number h(ω_0), it follows that

y_t → |h(ω_0)| u_0 e^{i(ω_0 t + α(ω_0))} + v_t = |h(ω_0)| u_0 [cos(ω_0 t + α(ω_0)) + i sin(ω_0 t + α(ω_0))] + v_t.   (6.9)

This complex output signal is the (asymptotic) response to the complex input signal

u_t = u_0 e^{iω_0 t} = u_0 [cos(ω_0 t) + i sin(ω_0 t)].   (6.10)

By the linearity of the system it follows that the response to the real part

u_t = u_0 cos(ω_0 t)   (6.11)

of the input signal (6.10) asymptotically equals the real part

y_t ≈ |h(ω_0)| u_0 cos(ω_0 t + α(ω_0)) + v_t   (6.12)

of the asymptotic response (6.9).

If the noise v is negligibly small then, by applying the real-harmonic input signal (6.11) (with u_0 as large as possible), waiting until the stationary response has been reached, and measuring the amplitude and phase of the output signal, the magnitude |h(ω_0)| and phase α(ω_0) of the frequency response at the frequency ω_0 may be determined with the help of (6.12). If the noise is not negligibly small then often an accurate estimate may be found by averaging the output over a sufficiently large number of periods. By repeating this measurement for a large number of frequencies an estimate of the behavior of the frequency response h is obtained. By inverse Fourier transformation an estimate of the impulse response h follows.

For slow systems this method is excessively time consuming. For electronic systems that operate at audio, video or other communication frequencies the method is a proven technique for which specialized measurement equipment exists.
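A minimal sketch of the measurement at a single frequency, assuming y holds N samples of the steady-state response to the input u0*cos(w0*t), t = 0, ..., N−1, with N covering an integer number of periods:

t = (0:N-1)';
I = (2/(N*u0)) * sum(y .* cos(w0*t));    % in-phase component, approximately |h(w0)| cos(alpha)
Q = (2/(N*u0)) * sum(y .* sin(w0*t));    % quadrature component, approximately -|h(w0)| sin(alpha)
gain  = sqrt(I^2 + Q^2);                 % estimate of |h(w0)|
phase = atan2(-Q, I);                    % estimate of alpha(w0)

Because the noise v_t averages out in the two correlation sums, lengthening the measurement improves the accuracy, in line with the remark above about averaging over many periods.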

6.2.4 Spectral analysis

We consider the equation

y_t = Σ_{m=0}^{∞} h_m u_{t−m} + v_t,   t ∈ Z,   (6.13)

which describes the relation between the input and output signals of the configuration of Fig. 6.1. Suppose that the input signal u is a realization of a wide-sense stationary stochastic process U_t with mean 0. Also assume that the noise v_t is a realization of a wide-sense stationary stochastic process V_t with mean 0. Then the output signal y_t is a realization of the stochastic process

Y_t = Σ_{m=0}^{∞} h_m U_{t−m} + V_t,   t ∈ Z,   (6.14)

which also has mean 0. Given the processes U_t and Y_t we define the cross covariance function of the two processes as

R_{yu}(t_1, t_2) = E Y_{t_1} U_{t_2},   t_1 ∈ Z, t_2 ∈ Z.   (6.15)

It follows with the help of (6.14) that

R_{yu}(t_1, t_2) = E ( Σ_{m=0}^{∞} h_m U_{t_1−m} + V_{t_1} ) U_{t_2} = Σ_{m=0}^{∞} h_m E U_{t_1−m} U_{t_2} + E V_{t_1} U_{t_2}.

If we assume that the input signal U_t and the noise V_t are uncorrelated processes then it follows that

R_{yu}(t_1, t_2) = Σ_{m=0}^{∞} h_m R_u(t_1 − m, t_2) = Σ_{m=0}^{∞} h_m r_u(t_1 − t_2 − m),

where R_u(t_1, t_2) = r_u(t_1 − t_2) is the covariance function of the wide-sense stationary input process U_t. Inspection of the right-hand side of this expression shows that it only depends on the difference t_1 − t_2 of the two time instants t_1 and t_2. Apparently the cross covariance function is a function

E Y_{t_1} U_{t_2} = R_{yu}(t_1, t_2) = r_{yu}(t_1 − t_2)   (6.16)

of the difference t_1 − t_2 of the arguments. We have

r_{yu}(k) = Σ_{m=0}^{∞} h_m r_u(k − m),   k ∈ Z.   (6.17)

The cross covariance function r_{yu} clearly is the convolution of the impulse response h and the covariance function r_u of the input signal (in this context the covariance function is sometimes called the auto-covariance function). By application of the DCFT it follows that

φ_{yu}(ω) = h(ω) φ_u(ω),   −π ≤ ω < π.   (6.18)

Here h is the frequency response function of the system and φ_u the spectral density function of the input process U_t. The DCFT φ_{yu} of the cross covariance function r_{yu} is called the cross spectral density of the stochastic processes Y_t and U_t.

Note that the relation (6.18) is independent of the statistical properties of the noise V_t (as long as it has mean 0 and is uncorrelated with the input process U_t). By division we find

h(ω) = φ_{yu}(ω) / φ_u(ω),   −π ≤ ω < π.   (6.19)

With the help of this relation the frequency response function h may be estimated in the following manner from an observation series (U_t, Y_t), t = 0, 1, ..., N−1.

1. Estimate the auto-covariance function r_u and the cross covariance function r_{yu} according to

r̂_u(k) = (1/N) Σ_{t=|k|}^{N−1} U_t U_{t−|k|},   (6.20)

r̂_{yu}(k) = (1/N) Σ_{t=k}^{N−1} Y_t U_{t−k}       for k ≥ 0,
r̂_{yu}(k) = (1/N) Σ_{t=0}^{N+k−1} Y_t U_{t−k}     for k < 0,   (6.21)

both for k = 0, ±1, ±2, ..., ±(M−1), with M ≪ N. Apply suitable time windows to r̂_u and r̂_{yu}.

2. Estimate the spectral density function φ_u and the cross spectral density φ_{yu} by Fourier transformation of the windowed estimates of the covariance functions.

3. Estimate the frequency response function from the estimated spectral densities with the help of (6.19).

Like for the estimation of the spectral density function as discussed in § 4.6 (p. 46), the windows need to be chosen so that an acceptable compromise is achieved between the statistical accuracy and resolution. Ljung (1987, § 6.4) discusses this in more detail.

Note that this method provides no, or no useful, estimate of the frequency response function h for those frequencies for which the spectral density function φ_u of the input process is zero or very small. If φ_u(ω) > 0 for all ω then the input signal is called persistently exciting. This property is a necessary condition for being able to identify the system.

The spectral estimation method as described is suitable as an exploratory tool to get a first impression of the dynamical properties of the system. The parametric estimation methods that are the subject of the next sections usually are more attractive if the dynamic model is needed for applications such as forecasting and control system design.
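A minimal sketch of the three-step procedure above, assuming u and y are zero-mean column vectors of length N and using a Hamming lag window of width M constructed directly, so that no toolbox routines are needed:

M = 30;  nfft = 512;
lags = (-(M-1):(M-1))';
ruu = zeros(2*M-1,1);  ryu = zeros(2*M-1,1);
for idx = 1:2*M-1
  k = lags(idx);
  if k >= 0
    ruu(idx) = sum(u(1+k:N).*u(1:N-k))/N;   % (6.20)
    ryu(idx) = sum(y(1+k:N).*u(1:N-k))/N;   % (6.21), k >= 0
  else
    ruu(idx) = sum(u(1:N+k).*u(1-k:N))/N;
    ryu(idx) = sum(y(1:N+k).*u(1-k:N))/N;   % (6.21), k < 0
  end
end
w = 0.54 + 0.46*cos(pi*lags/(M-1));         % Hamming lag window
Phi_u  = fft(w.*ruu, nfft);                 % windowed estimate of phi_u
Phi_yu = fft(w.*ryu, nfft);                 % windowed estimate of phi_yu
H = Phi_yu ./ Phi_u;                        % frequency response estimate (6.19)

Because both lag sequences start at lag −(M−1), their transforms share a common linear phase factor, which cancels in the ratio H; the Toolbox routine spa described in § 6.2.5 performs essentially this computation.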

Spectral estimation may also be used to determine the spectral density of the noise V_t. Write

Y_t = Σ_{m=0}^{∞} h_m U_{t−m} + V_t = Z_t + V_t,   (6.22)

where Z_t denotes the first term on the right-hand side.

If the noise V_t is independent of the input process U_t then also the noise V_t and the process Z_t are independent. We then have that

r_y(k) = r_z(k) + r_v(k),   (6.23)

with r_y, r_z and r_v the covariance functions of Y_t, Z_t and V_t, respectively. By Fourier transformation it follows that

φ_y(ω) = φ_z(ω) + φ_v(ω),   (6.24)

with φ_y the spectral density function of Y_t, φ_z that of Z_t and φ_v that of V_t. With the help of

φ_z(ω) = |h(ω)|² φ_u(ω)   (6.25)

(see § 2.6, p. 17) it follows that the spectral density function of the output process Y_t is given by

φ_y(ω) = |h(ω)|² φ_u(ω) + φ_v(ω).   (6.26)

With the help of (6.19) it follows that

φ_v(ω) = φ_y(ω) − |h(ω)|² φ_u(ω) = φ_y(ω) − |φ_{yu}(ω)|² / φ_u(ω).   (6.27)

For the estimation of the frequency response h it hence is necessary to estimate the spectral density φ_u and the cross spectral density φ_{yu}. If in addition the spectral density function φ_y of the output process is estimated (by first estimating the covariance function of the output process) then (6.27) yields an estimate for φ_v.

The function

K_{yu}(ω) = √( |φ_{yu}(ω)|² / (φ_y(ω) φ_u(ω)) ),   −π ≤ ω < π,   (6.28)

is known as the coherence spectrum of the processes Y_t and U_t. This function may be viewed as a frequency dependent correlation coefficient between the two processes. If the model (6.14) applies then we have

φ_v(ω) = [1 − K²_{yu}(ω)] φ_y(ω).   (6.29)

If K_{yu} equals 1 for certain frequencies then the spectral density of the noise V_t equals 0 for those frequencies.

The computations of this section may advantageously be done in the frequency domain with the use of the FFT. As we know from § 4.7 (p. 42) the spectral density functions φ_u and φ_y of the processes U_t and Y_t may be estimated by spectral windowing of their periodograms

(1/N) |U_N(ω)|²   and   (1/N) |Y_N(ω)|²,   (6.30)

where the DCFTs

U_N(ω) = Σ_{t=0}^{N−1} U_t e^{−iωt}   and   Y_N(ω) = Σ_{t=0}^{N−1} Y_t e^{−iωt}   (6.31)

may be computed by the FFT. Similarly the cross spectral density φ_{yu} of the processes Y_t and U_t may be estimated by spectral windowing of the "cross periodogram"

(1/N) Y_N(ω) U_N^*(ω).   (6.32)

In this formula * denotes the complex conjugate. More details may be found in Chapter 6 of Ljung (1987).

6.2.5 Matlab example

By way of example we consider the laboratory process of § 1.2.7 (p. 3). This example is treated in the demo iddemo1 of the Systems Identification Toolbox. The total length of the observed series is 1000 points. In Fig. 6.2 the first 300 points are plotted. We use these 300 measurement pairs for identification.

With the following MATLAB commands an estimate of the frequency response function of the process may be obtained:

z = [y(1:300) u(1:300)];

z = detrend(z,'constant');

hh = spa(z);

bode(hh);

The first command combines the first 300 points of the measurement series to an input/output pair. In the second command the means are subtracted. The third command yields an estimate of the frequency response function in a special format that is explained in the manual of the Systems Identification Toolbox.

To obtain an estimate of the frequency response function the routine spa first estimates the auto-covariance function of the input signal and the cross covariance function of the input and output signals. Next these functions are windowed with Hamming's window. After Fourier transformation of the windowed covariance functions the estimation of the frequency response function follows. See the manual or the MATLAB help function for a further description and a specification of the optional arguments of the function spa. The command bode produces the plot of Fig. 6.3.

An estimate of the impulse response of the system may be obtained with the command

ir = cra(z);

The function cra estimates the impulse response ir of the system by successively executing the following steps:

1. Estimate an AR(n) scheme for the input process. The default value of n is 10.

2. Filter the input and output process u and y with the inverse of the AR scheme. This amounts to the application of an MA scheme, and serves to "prewhiten" the input signal.

3. Compute the cross covariance function of the filtered input and output signals. Because the input signal now is white the cross covariance function precisely equals the impulse response (within a proportionality constant that equals the standard deviation of the input signal).

This computation entirely takes place in the time domain.

Figure 6.4 shows the graphical output of cra, consisting of a plot of the estimated impulse response along with a 99% confidence region. Inspection shows that the (estimated) impulse response ĥ_m only differs (significantly) from 0 for m ≥ 3. Obviously the system has a time delay (or "dead time") of 3 sampling intervals. This time delay is caused by the transportation time of the air flow through the tube.

Figure 6.2: Measured input and output signals of a laboratory process (u_t and y_t versus sampling instant t)

Figure 6.3: Estimated frequency response function of the laboratory process (amplitude and phase versus ω [rad/sec])

Figure 6.4: Output of cra (impulse response estimate versus lag)

Figure 6.5: Bode plots of two estimates (dashed and solid) of one frequency response function

We may compare the results of the two estimation methods by computing the frequency response from the estimated impulse response ir. This is done like this:

thir = poly2th(1,ir');
hhir = th2ff(thir);

The first command defines the scheme y_t = P(q)u_t where the coefficients of the polynomial P are formed from the estimated impulse response. The second command serves to compute the frequency response function of the corresponding system. The command

bode([hh hhir]);

displays the magnitude and phase plots of the frequency response functions that are found with the two methods in one frame. Figure 6.5 shows the result. The estimated frequency responses differ little.

6.3 ARX models

6.3.1 Introduction

The first parametric system identification method that we consider applies to the case that the system of Fig. 6.1 (p. 75) may be represented as a linear time-invariant system described by the difference equation

y_t − a_1 y_{t−1} − a_2 y_{t−2} − ··· − a_n y_{t−n} = c_0 u_t + c_1 u_{t−1} + ··· + c_m u_{t−m},   t ∈ Z,

where the left-hand side equals D(q) y_t and the right-hand side equals P(q) u_t. Many practical systems may adequately be characterized by difference equations of this form. To account for measurement errors and disturbances we modify the equation to

D(q) y_t − P(q) u_t = w_t,   t ∈ Z,   (6.33)

or

D(q) y_t = P(q) u_t + w_t,   t ∈ Z,   (6.34)

with w_t a noise term. Because of (6.33) this model is sometimes known as the error-in-the-equation model. If the noise w_t is a realization of white noise then (6.34) is called an ARX scheme. As far as the effect of the noise w_t on the output process is concerned, (6.34) is an AR scheme. The character X refers to the presence of the "exogenous" (external) signal u_t.

6.3.2 Least squares estimation

A least squares estimate of the coefficients a_1, a_2, ..., a_n, c_0, c_1, ..., c_m may be obtained by minimization of the sum of squares

(1/2) Σ_t [ D(q) y_t − P(q) u_t ]².   (6.35)

To set up the least squares problem we again represent the equations in matrix form. For ease of exposition we assume that n ≥ m. Then the relevant matrix equation is

Y = F θ + W,   (6.36)

where

Y = [ y_n  y_{n+1}  ···  y_{N−1} ]^T,   W = [ w_n  w_{n+1}  ···  w_{N−1} ]^T,   θ = [ a_1  ···  a_n  c_0  ···  c_m ]^T,

and the row of F that corresponds to time t (t = n, n+1, ..., N−1) is

[ y_{t−1}  y_{t−2}  ···  y_{t−n}  u_t  u_{t−1}  ···  u_{t−m} ].

The solution of the least squares problem θ̂_N = (F^T F)^{−1} F^T Y exists iff F has full column rank. For that u_t obviously has to be at least nonzero. Similarly as in Subsection 5.2.5, Ljung (1987) showed that if the noise w_t is a realization of white noise and the input signal u_t is "sufficiently rich" (see p. 83) then the least squares estimator exists and is consistent and asymptotically efficient.
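A minimal sketch of this least squares fit, assuming u and y are column vectors of length N and that the orders n and m (with n ≥ m) have been chosen:

F = zeros(N-n, n+m+1);
for i = 1:n
  F(:, i) = y(n-i+1 : N-i);            % columns with y_{t-1}, ..., y_{t-n}
end
for j = 0:m
  F(:, n+j+1) = u(n-j+1 : N-j);        % columns with u_t, u_{t-1}, ..., u_{t-m}
end
Yv = y(n+1 : N);                       % left-hand side of (6.36)
theta = F \ Yv;                        % least squares estimate (F'F)^{-1} F' Y

The Toolbox routine arx used in § 6.3.4 is an implementation of this least squares method.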

This does not apply if the noise is not white. We consider this situation. Suppose that the input signal u_t and the noise w_t are realizations of wide-sense zero mean stationary processes U_t and W_t. Then also the output signal y_t is a realization of such a process Y_t. Now multiply (6.36) from the left with (1/N) F^T:

(1/N) F^T Y = (1/N) F^T F θ + (1/N) F^T W,   (6.37)

and realize that the least squares estimate θ̂_N satisfies the similar equation

(1/N) F^T Y = (1/N) F^T F θ̂_N.   (6.38)

Under weak conditions the sample covariance matrices (1/N) F^T Y and (1/N) F^T F converge to well defined limits (in fact to matrices of cross-covariances and covariances, comparable to those in (6.39)). Therefore, if θ̂_N is to be an asymptotically unbiased estimator of θ, then necessarily we need to have that

lim_{N→∞} E (1/N) F^T W = 0.

Now we have that

lim_{N→∞} E (1/N) F^T W = [ r_{yw}(1)  ···  r_{yw}(n)  r_{uw}(0)  ···  r_{uw}(m) ]^T,   (6.39)

since the entries of F^T W are the sums Σ_t Y_{t−i} W_t for i = 1, ..., n and Σ_t U_{t−j} W_t for j = 0, ..., m. Here r_{yw} and r_{uw} denote the cross covariance functions of the pairs (Y_t, W_t) and (U_t, W_t).

If W_t is white noise then W_t and Y_{t−i} are uncorrelated for i > 0, so that r_{yw}(i) = 0 for i > 0. These terms appear in (1/N) E F^T W and hence are all zero; the remaining terms in (1/N) E F^T W are r_{uw}(j), j ≥ 0, and they are zero by assumption. Hence (1/N) E F^T W = 0, which is what we need.

If W_t is not white noise then W_t and Y_{t−i} are correlated for at least one i > 0, so that r_{yw}(i) ≠ 0 for at least one i > 0. Then lim_{N→∞} E (1/N) F^T W is not identically zero, and as a result the estimate θ̂_N is biased, no matter how large the number of observations N is.

6.3.3 Instrumental variable method

In practice the biasedness of the least squares estimator for non-white w_t is a sufficiently serious handicap to justify looking for other estimators θ̂. Ljung (1987) advocates the following method. He proposed to avoid the mismatch between (6.37) and (6.38) by premultiplying the equation Y = Fθ + W not with (1/N) F^T but with another matrix (1/N) X^T:

(1/N) X^T Y = (1/N) X^T F θ + (1/N) X^T W,   (6.40)

where X is a matrix of the same dimensions as F,

X = [ x_n      x_{n−1}  ···  x_{−m}
        ⋮         ⋮             ⋮
      x_{N−1}  x_{N−2}  ···  x_{N−n−m−1} ],

whose entries x_t are a realization of a wide sense stationary process that is uncorrelated with the noise W_t. Such time series x_t are called instrumental variables. Then by assumption we have that E X^T W = 0, and now as estimator is proposed the solution of (6.40) in which X^T W is replaced with zero,

θ̂ = (X^T F)^{−1} X^T Y.   (6.41)

A possible choice for the instrumental variable is x_t = u_t. If the noise w_t is uncorrelated with the input signal then this yields a correct instrument.

It may be proved (see Ljung (1987)) that the instrumental variable method (abbreviated to IV method) yields the most accurate estimate of θ (in some sense) if the equation error W is correlated not with y_t and u_t as is done in (6.39), but with z_t and u_t, where z_t is the output signal of the system without noise. This means that z_t is the solution of the difference equation

D(q) z_t = P(q) u_t.   (6.42)

Because the coefficients of the polynomials D and P are not known this result is not immediately useful, since the instrument z_t cannot be computed. However, the instrument may be approximated by first making a preliminary estimate of the unknown coefficients with the least squares method. These (inaccurate) estimates are then used in (6.42) to compute the instrument z_t approximately. Next the coefficients are estimated according to the instrumental variable method from (6.41) with

X = [ z_{n−1}  ···  z_0      u_n      ···  u_{n−m}
      z_n      ···  z_1      u_{n+1}  ···  u_{n−m+1}
        ⋮              ⋮        ⋮              ⋮
      z_{N−2}  ···  z_{N−n}  u_{N−1}  ···  u_{N−m−1} ].   (6.43)
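A minimal sketch of this two-step IV computation, assuming the matrices F and Yv and the preliminary least squares estimate theta from the sketch in § 6.3.2 are available, with the first n entries of theta estimating the a_i and the remaining m+1 entries the c_j:

a = theta(1:n);  c = theta(n+1:n+m+1);
z = filter(c, [1; -a], u);             % simulated noise-free output, see (6.42)
X = zeros(N-n, n+m+1);
for i = 1:n
  X(:, i) = z(n-i+1 : N-i);            % instruments z_{t-1}, ..., z_{t-n}
end
for j = 0:m
  X(:, n+j+1) = u(n-j+1 : N-j);        % inputs u_t, ..., u_{t-m}
end
theta_iv = (X' * F) \ (X' * Yv);       % IV estimate (6.41)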

6.3.4 Matlab example

Again we use the laboratory process as an illustration. In § 6.2.5 (p. 78) the impulse response was estimated. For estimating the ARX model the Systems Identification Toolbox provides the function arx. This function is an implementation of the least squares method.

This is the record of the MATLAB session (in continuation of that of § 6.2.5) that produces an estimate:

>> th = arx(z,[2 2 3]);

>> present(th)

This matrix was created by the command ARX

on 1/18 1998 at 12:52 Loss fcn: 0.0016853

Akaike‘s FPE: 0.0017309 Sampling interval 1

The polynomial coefficients and their

standard deviations are

B =

0 0 0 0.0666 0.0445

0 0 0 0.0021 0.0033

A =

1.0000 −1.2737 0.3935

0 0.0208 0.0190

>> e=pe(th,z); % (resid(th,z) is buggy)

In the second argument [2 2 3] of arx the first parameter is the degree n = 2 of the denominator polynomial D. The second parameter m = 2 and the third parameter d = 3 determine the structure of the numerator polynomial P in the form

P(q) = q^{−d} (c_0 + c_1 q^{−1} + ··· + c_m q^{−m}).   (6.44)

The integer d is the dead time of the system. In § 6.2.5 we saw that the dead time is 3 for the laboratory process. The degrees n and m were chosen after some experimentation.

The command present displays the estimated coefficients of the polynomials D and P (A and B in the notation of the Toolbox). The coefficients all differ significantly from 0, so that there is no reason to decrease the degrees of D and P.

To validate the result the command resid is used to compute the residuals and their sample correlation function. Figure 6.6 shows the graphical output of resid. The upper plot is the sample correlation function of the residuals, together with its confidence level. The plot justifies the conclusion that the residuals are a realization of white noise. The lower plot shows the sample cross correlation function of the residuals and the input signal u, also with its confidence level. The result shows that the residuals are uncorrelated with the input signal. Hence, the model hypotheses of the ARX model are satisfied. There is no need to resort to an IV estimation method.

It is interesting to compare the estimation result with that of § 6.2.5. From the impulse response that is estimated in § 6.2.5 the step response may easily be computed. With the help of the ARX model that is estimated in the present subsection the step response may also be determined. These commands do the job:

stepr = cumsum(ir);

step = ones(20,1);

mstepr = idsim(step,th);

The variable stepr is the step response that follows from the estimated impulse response and mstepr the step response of the estimated ARX model. In Fig. 6.7 both step responses are plotted. They show good agreement.

6.4 ARMAX models

6.4.1 Introduction

ARX schemes of the form

D(q) y_t = P(q) u_t + w_t,   t ∈ Z,   (6.45)

have limited possibilities for modeling the noise. If the instrumental variable method is used then the noise w_t may be non-white, but the method does not immediately provide information about the statistical properties of the noise w_t.

Figure 6.6: Sample auto correlation function of the residuals and sample cross correlation function with the input signal

Figure 6.7: Solid: step response of the estimated ARX model. Dashed: step response computed from the estimated impulse response

We therefore now consider the ARMAX model, which is given by

D(q) Y_t = P(q) u_t + N(q) ǫ_t,   t ∈ Z.   (6.46)

D and P are the same polynomials as in § 6.3, while N is the polynomial

N(q) = 1 + b_1 q^{−1} + b_2 q^{−2} + ··· + b_k q^{−k}.   (6.47)

The "system noise" ǫ_t is white noise with mean 0 and variance σ². The input signal u_t may be a realization of a stochastic process but this is not necessary. The ARMAX model is a logical extension of the ARX model on the one hand and the ARMA model on the other.

We study the problem how to estimate the coefficients of the polynomials D, N and P from a series of observations of (Y_t, u_t) for t = 0, 1, ..., N−1.

6.4.2 Prediction error method

The prediction error method of § 5.4.4 (p. 65) for estimating ARMA schemes may easily be extended to the estimation of ARMAX schemes. The output signal of the ARMAX scheme (6.46) is given by

Y_t = ( P(q)/D(q) ) u_t + ( N(q)/D(q) ) ǫ_t,   t ∈ Z,   (6.48)

where the second term on the right-hand side is denoted X_t.

According to § 5.4.4 the one-step predictor for the process X_t is equal to

X̂_{t|t−1} = ( [N(q) − D(q)] / N(q) ) X_t.   (6.49)

It follows that the one-step predictor for the process Y_t is given by

Ŷ_{t|t−1} = ( P(q)/D(q) ) u_t + X̂_{t|t−1}   (6.50)
          = ( P(q)/D(q) ) u_t + ( [N(q) − D(q)] / N(q) ) X_t.   (6.51)

Note that for the prediction of the output signal the future behavior of the input signal u_t is assumed to be known.

By substituting

X_t = Y_t − ( P(q)/D(q) ) u_t   (6.52)

into the right-hand side of (6.50) it follows that

Ŷ_{t|t−1} = ( [N(q) − D(q)] / N(q) ) Y_t + ( P(q)/N(q) ) u_t.   (6.53)

82

Page 89: Time Series Analysis and System Identification

The prediction error hence is

et = Yt − Y_{t|t−1}   (6.54)
   = (D(q)/N(q)) Yt − (P(q)/N(q)) ut.

By substitution of Yt from (6.48) it follows that et = ǫt. Inspection of (6.54) shows that the prediction error may be generated by the difference scheme

N(q) et = D(q) Yt − P(q) ut.   (6.55)

Application of the minimum prediction error method implies minimization of the sum of the squares of the prediction errors

Σ_{t=0}^{N−1} e_t²   (6.56)

with respect to the unknown parameters. The numerical implementation follows that for the estimation of ARMA schemes.
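As a minimal sketch (not a Toolbox routine), the prediction errors of a given ARMAX model may be generated directly from the difference scheme (6.55) with the MATLAB function filter. The polynomial coefficient vectors D, P, N below are illustrative example values; y and u stand for the recorded output and input.

% Illustrative ARMAX model (example coefficients, in powers of q^{-1})
D = [1 -0.8];  P = [0 0.5];  N = [1 0.4];
% simulate data from this model so the sketch is self-contained
u = sign(randn(300,1));  e = randn(300,1);
y = filter(P, D, u) + filter(N, D, e);       % output of (6.46)
% prediction errors via N(q) e_t = D(q) Y_t - P(q) u_t, cf. (6.55)
epred = filter(D, N, y) - filter(P, N, u);   % reproduces e up to rounding
V = sum(epred.^2);                           % criterion (6.56) to be minimized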

The estimator that is obtained this way is consistentunder plausible conditions. The most relevant condi-tion is that the experiment is “sufficiently informative.”This in particular means that the input signal be suffi-ciently “rich.” An input signal that is identical to 0, forinstance, provides no information about the coefficientsof the polynomial P and, hence, is not sufficiently rich. Asufficient condition for the experiment to be sufficientlyinformative is that the input signal u t be a realization of awide-sense stationary process Ut whose spectral densityfunction is strictly positive for all frequencies. Such an in-put signal is said to be persistently exciting (see also § 6.2,p. 75).
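One informal way to check this in practice (not discussed further in the notes) is to estimate the spectral density of a candidate input record and inspect whether it stays well away from zero over the frequency band of interest. A sketch using the Toolbox routines spa and ffplot described in Appendix B:

% informal check of persistent excitation for a candidate input record
u = sign(randn(300,1));   % binary noise input, as used in § 6.4.3
gu = spa(u);              % spectral density estimate of the time series u
ffplot(gu)                % inspect: the estimate should not come close to zero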

6.4.3 Matlab example

As an example to illustrate the methods of this section we choose the "Åström system." This is the system described by the ARMAX scheme D(q) Yt = P(q) ut + N(q) ǫt, with

D(q) = 1 − 1.5 q^{-1} + 0.7 q^{-2},
P(q) = q^{-1} (1 + 0.5 q^{-1}),
N(q) = 1 − q^{-1} − 0.2 q^{-2}.

Realizations of the input and output signals may be generated with the following series of commands:

D = [1 -1.5 0.7];

P = [0 1 0.5];

N = [1 -1 -0.2];

th0 = poly2th(D,P,N);

randn('seed',0)

u = sign(randn(300,1));

e = randn(300,1);

y = idsim([u e],th0);

z = [y u];

Figure 6.8: Input and output signals for the Åström system

Figure 6.9: Output of cra (impulse response estimate versus lag)


Figure 6.8 shows the plots of the input and output signals. To obtain a first impression of the system we estimate the impulse response:

ir = cra(z);

Figure 6.9 shows the result. The estimated impulse response suggests a second-order, slightly oscillatory system with a dead time of 1 sampling instant. We initially do not include a dead time in the model, however.

To obtain a better impression of the structure of the system we try to identify the system as an ARX scheme. To avoid problems caused by non-white noise — such as is indeed present in the Åström system — we use an IV method for identification. With the following series of commands we test each time an ARX model D(q) Yt = P(q) ut + wt, where D has degree n and P has degree n − 1, for n = 1, 2, 3, 4 and 5.

th1 = iv4(z,[1 1 0]); present(th1); pause

th2 = iv4(z,[2 2 0]); present(th2); pause

th3 = iv4(z,[3 3 0]); present(th3); pause

th4 = iv4(z,[4 4 0]); present(th4); pause

th5 = iv4(z,[5 5 0]); present(th5); pause

Table 6.1 summarizes the results. Inspection shows that if n increases from 1 to 3 both the sample variance of the residuals and the FPE (see § 5.6, p. 69) decrease.

If n changes from 3 to 4 then both quantities increase again. It is not clear why the sample variance of the residuals increases.

If subsequently n is increased from 4 to 5 the variance and the FPE decrease again to a value that is less than that for n = 3.

We consider the roots of the polynomials P and D as estimated for n = 5:

Roots of P: 88.6472, −0.3589 ± 0.8137i, −0.7826;
Roots of D: −0.4253 ± 0.6716i, 0.7380 ± 0.3670i, −0.2522.

We note this:

1. One of the roots of P is very large. This is caused by the small first coefficient c0 of P. Because this coefficient does not differ significantly from 0 it may be set equal to 0.

2. The root pair −0.3589 ± 0.8137i of P does not lie far in the complex plane from the root pair −0.4253 ± 0.6716i of D. Canceling the polynomial factors that correspond to these root pairs in the numerator and denominator has little effect on the response of the system.

If we cancel the two root pairs then D has degree 3. The remaining roots of D are 0.7380 ± 0.3670i and −0.2522, which exhibit a reasonable resemblance to the roots 0.7429 ± 0.3659i and −0.1094 of the polynomial D that is estimated for n = 3.
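The root computations quoted above are easily reproduced from the estimated models. A sketch, assuming the theta matrices th5 and th3 from the session above are still in the workspace and using the conversion routine th2poly of Appendix B (the first output corresponds to D, the second to P):

[A5, B5] = th2poly(th5);
roots(A5)                 % roots of D for n = 5
roots(B5)                 % roots of P for n = 5
[A3, B3] = th2poly(th3);
roots(A3), roots(B3)      % roots of D and P for n = 3, for comparison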

Structure (degrees of D, P, N; dead time)    σ²       FPE
2 2 1 1                                      0.994    1.027
2 2 2 1                                      0.968    1.008
2 2 3 1                                      0.971    1.0017

Table 6.2: Results of the estimation with armax

After cancellation of the root pair from P the roots 88.6472 and −0.7826 are left. The roots of P for n = 3 are 23.6658 and −0.5196.

These considerations show that the model for n = 5 essentially is the same as that for n = 3.

On the basis of these findings we conjecture that D has degree 3. Inspection shows, however, that for n = 3 the estimated last coefficient a3 of D does not deviate significantly from 0. We therefore set this coefficient equal to 0, which reduces the degree of D to 2. Furthermore we conjecture on the basis of observation 1 that P is of the form P(q) = q^{-1}(c0 + c1 q^{-1}).

Application of the IV method to this structure yields:

>> tht=iv4(z,[2 2 1]);

>> present(tht)

This matrix was created by the command IV4

on 12/17 1993 at 20:5 Loss fcn: 1.027

Akaike‘s FPE: 1.055 Sampling interval 1

The polynomial coefficients and their

standard deviations are

B =

0 1.0687 0.4660

0 0.0584 0.0784

A =

1.0000 −1.4825 0.6854

0 0.0186 0.0163

The sample variance of the residuals and the FPE are less than for n = 3. All coefficients are significantly different from 0.

Testing with resid shows that the residuals for the estimated ARX scheme are not white. We therefore try to estimate an ARMAX scheme. The function armax from the Toolbox is based on the prediction error method. We estimate a model of the form D(q) Yt = P(q) ut + N(q) ǫt with the structure of D and P as just found. For the degree of N we try a number of possibilities:

th2211 = armax(z,[2 2 1 1]);

th2221 = armax(z,[2 2 2 1]);

th2231 = armax(z,[2 2 3 1]);

The entries of the second argument of armax, such as [2 2 1 1], successively are the degrees of D, P and N, and the dead time. The sample variances and FPE values that are found are given in Table 6.2. The smallest value of the FPE is reached for the (2, 2, 2, 1) structure:


Scheme n = 1:  σ² = 9.515,  FPE = 9.643
    a: 0.2114 (14.49)
    c: −0.4114 (0.1753)
Scheme n = 2:  σ² = 1.207,  FPE = 1.24
    a: −1.5450 (0.0143), 0.7331 (0.0141)
    c: −0.0739 (0.0633), 1.3577 (0.0657)
Scheme n = 3:  σ² = 1.069,  FPE = 1.113
    a: −1.3765 (0.2160), 0.5234 (0.3428), 0.0750 (0.1666)
    c: −0.0482 (0.0591), 1.1154 (0.0669), 0.5926 (0.2657)
Scheme n = 4:  σ² = 1.15,  FPE = 1.213
    a: −0.7785 (0.1149), −0.5329 (0.1560), 0.7660 (0.1002), −0.1393 (0.0442)
    c: −0.0112 (0.0624), 1.0389 (0.0747), 1.2351 (0.1467), 0.1405 (0.1949)
Scheme n = 5:  σ² = 0.9795,  FPE = 1.047
    a: −0.3730 (0.1985), −0.1019 (0.2134), −0.3409 (0.2156), 0.3398 (0.1813), 0.1083 (0.1233)
    c: −0.0123 (0.0578), 1.0712 (0.0620), 1.6182 (0.2372), 1.4661 (0.3573), 0.6744 (0.2428)

Table 6.1: Results of the IV estimation of the ARX schemes. The rows labeled a and c list the coefficients a1, a2, ... of D and c0, c1, ... of P. (In parentheses: standard deviations of the estimates.)

>> present(th2221)

This matrix was created by the command ARMAX

on 12/17 1993 at 20:19 Loss fcn: 0.968

Akaike‘s FPE: 1.008 Sampling interval 1

The polynomial coefficients and their

standard deviations are

B =

0 1.0774 0.4471

0 0.0571 0.0775

A =

1.0000 −1.4807 0.6812

0 0.0184 0.0152

C =

1.0000 −1.0225 −0.1712

0 0.0610 0.0602

A, B and C successively are the polynomials D, P and N. Comparison with the actual values of the coefficients of the Åström system reveals that the estimates are reasonably accurate.

A residual test with resid shows that the residuals may be considered to be white.

The observed cancellation of corresponding factors in D and P does not occur in the model that is estimated with the IV method for n = 4. Subsequent application of armax with an assumed degree 4 of D, however, quickly leads to the conclusion that the degree of D is 2.

6.5 Identification of state models

6.5.1 Introduction

In this section we consider the very general problem of estimating parameters in linear state models. Many of the models we studied so far are special cases of this problem. The extension to state models also opens up the possibility to study multivariable estimation problems, that is, estimation problems for systems with several inputs and outputs.

We consider state models of the form

X t+1 = AX t + Bu t +Vt , (6.57)

Yt =C X t +Du t +Wt , t ∈Z. (6.58)

A ∈ R^{n×n}, B ∈ R^{n×k}, C ∈ R^{m×n} and D ∈ R^{m×k} are matrices with suitable dimensions. The n-dimensional process Xt is the state vector. The k-dimensional signal ut is the input signal and the m-dimensional process Yt the output signal. The input signal ut may be a realization of a stochastic process but this is not necessary.

Vt and Wt are vector-valued white noise processes of dimensions n and m, respectively, with zero means. This means that

Vt = [V_{1,t}  V_{2,t}  ···  V_{n,t}]^T,   Wt = [W_{1,t}  W_{2,t}  ···  W_{m,t}]^T,   t ∈ Z,   (6.59)

where the components V_{i,t} and W_{j,t} all are scalar zero mean white noise processes. The component processes may but need not be independent. It is assumed that

E( [V_{t1}; W_{t1}] [V_{t2}^T  W_{t2}^T] ) = R  for t1 = t2,   0  for t1 ≠ t2,   (6.60)

with R an (n + m) × (n + m) symmetric matrix that may be written in the form

R = [ R1      R12
      R12^T   R2 ].   (6.61)

R is called the variance matrix of the vector-valued white noise processes Vt and Wt.

The equations (6.57–6.58) typically originate from known first principles for the system. It is required to estimate one or several unknown parameters in the system equations based on a number of observation pairs (ut, Yt), t = 0, 1, ..., N − 1, of the input and output signals. This problem has the previous estimation problems for AR, ARX and ARMAX models as special cases.


Example 6.5.1 (ARX model as state model). The ARX model

Yt − a1 Y_{t−1} − a2 Y_{t−2} − ··· − an Y_{t−n} = c0 ut + c1 u_{t−1} + ··· + cm u_{t−m} + ǫt

with ǫt the noise term, can be cast as a state model using what is called the observer canonical form,

X_{t+1} = A Xt + B ut,
Yt      = C Xt + D ut + Wt,   (6.62)

where

A = [ 0   ···  ···  0   an
      1   0    ···      a_{n−1}
      0   1             ⋮
      ⋮         ⋱   0   ⋮
      0   ···  0    1   a1 ],      B = [ cn + an c0
                                         c_{n−1} + a_{n−1} c0
                                         ⋮
                                         c1 + a1 c0 ],

C = [ 0  ···  ···  0  1 ],   D = c0,   Wt = ǫt.

Estimation of the coefficients ai and cj can be seen as estimation of the elements of A, B and D. □
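A minimal sketch of how the observer canonical form (6.62) may be built from given ARX coefficients. The vectors a = [a1 ... an] and c = [c0 c1 ... cn] below are illustrative assumptions (with m = n):

a = [1.5 -0.7];                                  % example a1, a2
c = [0 1 0.5];                                   % example c0, c1, c2
n = length(a);
A = [ [zeros(1,n-1); eye(n-1)]  flipud(a(:)) ];  % ones on the subdiagonal, [an ... a1] in the last column
B = flipud(c(2:n+1)') + flipud(a(:))*c(1);       % [cn+an*c0; ...; c1+a1*c0]
C = [zeros(1,n-1) 1];
D = c(1);
% if available, ss2tf(A,B,C,D) should reproduce the ARX transfer function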

6.5.2 Identification with the prediction error method

We apply the prediction error method for the estimation of the parameters θ that occur in the model (6.57–6.58). To make the dependency on θ explicit we write the system matrices as Aθ, Bθ, Cθ and Dθ, and similarly for other data that depend on θ. To solve the prediction error problem we use several well-known results from Kalman filtering theory (see for instance Kwakernaak and Sivan (1972) or Bagchi (1993)). Given the system

X_{t+1} = Aθ Xt + Bθ ut + Vt,   (6.63)
Yt = Cθ Xt + Dθ ut + Wt,   t ∈ Z,   (6.64)

define X^θ_{t|t−1} as the best estimate of Xt based on the observations Y0, Y1, ..., Y_{t−1}. For Y we use a corresponding notation. Note that we consider ut as fixed and known for all t. Then we have

X^θ_{t+1|t} = Aθ X^θ_{t|t−1} + Bθ ut + Kθ(t) [Yt − Cθ X^θ_{t|t−1} − Dθ ut],   (6.65)

for t = 0, 1, .... The best one-step predictor X^θ_{t+1|t} of X_{t+1}, given Y0, Y1, ..., Yt, hence may be computed recursively. The initial condition for the recursion is

X_{0|−1} = E X0.   (6.66)

The sequence of "gain matrices" Kθ(0), Kθ(1), ... is also found recursively from the matrix equations

Kθ(t) = [Aθ Qθ(t) Cθ,T + R12] [R2 + Cθ Qθ(t) Cθ,T]^{−1},   (6.67)
Qθ(t + 1) = [Aθ − Kθ(t) Cθ] Qθ(t) Aθ,T + R1 − Kθ(t) R12^T,   (6.68)

for t = 0, 1, .... The symmetric matrix Qθ(t) equals

Qθ(t) = var(Xt − X^θ_{t|t−1}) = E (Xt − X^θ_{t|t−1})(Xt − X^θ_{t|t−1})^T,

and, hence, is the variance matrix of the one-step prediction of the state. The initial condition for (6.67–6.68) is

Q(0) = var X0 = E (X0 − E X0)(X0 − E X0)^T.   (6.69)

The best one-step prediction Y^θ_{t+1|t} of Y_{t+1} is

Y^θ_{t+1|t} = Cθ X^θ_{t+1|t} + Dθ u_{t+1},   (6.70)

so that the one-step prediction error of Yt equals

e^θ_t = Yt − Y^θ_{t|t−1} = Yt − Cθ X^θ_{t|t−1} − Dθ ut.

We summarize how the one-step prediction errors are determined:

1. Solve the recursive matrix equations (6.67–6.68) with the initial condition (6.69) for Kθ(0), Kθ(1), ..., Kθ(N − 1).

2. Solve the Kalman filter equation (6.65) with the initial condition (6.66) recursively to determine the one-step predictions of the state X^θ_{t|t−1} for t = 0, 1, ..., N − 1.

3. Compute the one-step prediction errors e^θ_t = Yt − Cθ X^θ_{t|t−1} − Dθ ut, t = 0, 1, ..., N − 1.

From Kalman filtering theory it is known that the one-step prediction errors e^θ_t form a vector-valued white noise process.

The minimum prediction error method for the identification of the system (6.57–6.58) involves the minimization of the sum of the squares of the prediction errors

Σ_{t=0}^{N−1} e^{θ,T}_t e^θ_t   (6.71)

with respect to the unknown parameter vector θ that occurs in the model. This is done numerically using the optimization algorithms reviewed in § 5.5 (p. 68). The application of these methods requires that formulas are developed to compute the gradient of the prediction error e^θ_t with respect to θ.
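The three steps above translate almost literally into code. A minimal sketch for one value of θ; the system matrices A, B, C, D, the noise covariances R1, R12, R2, the initial state covariance Q0 and the N × m output data y and N × k input data u are all assumed to be given:

N  = size(y,1);
n  = size(A,1);
xh = zeros(n,1);                           % X_{0|-1} = E X0, here taken zero
Q  = Q0;                                   % Q(0) = var X0
e  = zeros(N, size(C,1));
for t = 1:N                                % MATLAB indexing starts at 1
    K = (A*Q*C' + R12) / (R2 + C*Q*C');    % gain, cf. (6.67)
    innov = y(t,:)' - C*xh - D*u(t,:)';    % one-step prediction error e_t
    e(t,:) = innov';
    xh = A*xh + B*u(t,:)' + K*innov;       % Kalman filter update, cf. (6.65)
    Q  = (A - K*C)*Q*A' + R1 - K*R12';     % Riccati recursion, cf. (6.68)
end
V = sum(sum(e.^2));                        % criterion (6.71), to be minimized over theta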

Under reasonable conditions (such as persistent excitation) consistent estimators are obtained, as for other applications of the prediction error method. The accuracy of the estimates may be analyzed in a way that is similar to that in § 5.4.5 (p. 67).


6.6 Further problems in identification theory

6.6.1 Introduction

In this section we list a number of further current problems and themes in identification theory.

6.6.2 Structure determination

In § 5.6 (p. 69) we discussed the problem of determining the correct order of ARMA schemes for time series. In system identification the same problem arises. The methods that are mentioned in § 5.6 may also be applied to system identification. We already did this in the example of § 6.4.3 (p. 83).

A problem that is closely related to structure determination is the question whether the system that is to be identified does or does not belong to the model set that is considered. Suppose for instance that we attempt to estimate a system as an ARMAX scheme with certain degrees of the various polynomials. It is very well possible that the actual system does not belong to this class of models. The question is how we may ascertain this. For the parametric identification methods we considered, the ultimate test for accepting the model that is found is whether the residuals are white.

A further problem in this context is the identifiability of the system. A system is not identifiable if within the assumed model set there are several systems that explain the observed system behavior. This occurs for instance if we attempt to explain an ARMA(3,2) process D(q) Xt = N(q) ǫt with an ARMA(4,3) scheme. The polynomials D and N then are allowed to have an arbitrary common factor of degree 1.

Lack of identifiability often manifests itself by large values of the estimated estimation errors of the parameters. The reason is that in case of lack of identifiability the matrix of second derivatives of the maximum likelihood function or the mean square prediction error becomes singular. Sometimes this phenomenon only becomes noticeable for long measurement series.

6.6.3 Recursive system identification

The system identification methods that were discussed so far are based on batch processing of a complete observation series. In some applications the dynamic properties of the system or process change with time. Then it may be necessary to keep observing the system, and to use each new observation to update the system model. For this purpose recursive identification methods have been developed. Often these are recursive versions of the batch-oriented algorithms that we discussed.

The best known recursive identification algorithm applies to the ARX model

yt = a1 y_{t−1} + a2 y_{t−2} + ··· + an y_{t−n} + c0 ut + c1 u_{t−1} + ··· + cm u_{t−m} + ǫt.

We rewrite this equation in the form

yt = φ(t) θt + ǫt,   (6.72)

with

φ(t) = [ y_{t−1}  y_{t−2}  ···  y_{t−n}  ut  u_{t−1}  ···  u_{t−m} ],
θt   = [ a1  a2  ···  an  c0  c1  ···  cm ]^T.

If the parameters do not change with time then we have

θt+1 = θt . (6.73)

Equations (6.73) and (6.72) together define a system in state form as in (6.63–6.64):

X t+1 = AX t + Bu t +Vt ,

Yt =C X t +Du t +Wt , t ∈Z.

We have Xt = θt, A = I, B = 0, Vt = 0, C = φ(t), D = 0 and Wt = ǫt. Note that the matrix C now is time-varying, and also note that this state representation of the ARX model has nothing to do with the observer canonical form of the ARX model as explained in Example 6.5.1.

The Kalman filter equations (6.65–6.68) now apply, with a small modification for the time dependence of C. For the best estimate of the parameter vector we obtain the recursive equation

θ_{t+1|t} = θ_{t|t−1} + K(t) [yt − φ(t) θ_{t|t−1}],   t ≥ 0.   (6.74)

The sequence of gain matrices K(0), K(1), ... is recursively determined from the matrix equations

K(t) = Q(t) φ^T(t) [σ² + φ(t) Q(t) φ^T(t)]^{−1},   (6.75)
Q(t + 1) = [I − K(t) φ(t)] Q(t) + R1,   t ≥ 0.   (6.76)

Here we have taken R12 = 0 and R2 = σ², but R1 has been left as it is.

The equations (6.74–6.76) form a recursive algorithm for the identification of the ARX scheme. For R1 = 0 the algorithm actually is a recursive implementation of the least squares algorithm of § 6.3.2 (p. 80), see also Section 5.2.2. If the input signal is persistently exciting then for R1 = 0 the estimates of the parameters converge to the correct values.

If the parameters are not constant but vary (slowly) with time then this may be accounted for by choosing R1 different from 0 (but positive). This corresponds to the model

θ_{t+1} = θt + Vt,   (6.77)

with Vt vector-valued white noise. Each parameter is modeled as a random walk. In the estimation algorithm (6.74–6.76) the gain matrix K(t) now does not approach 0 — as is the case for R1 = 0 — but the algorithm continues to update the estimates.
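A minimal sketch of the recursive algorithm (6.74)–(6.76). The ARX system used to generate the test data is an illustrative assumption; with R1 = 0 the code is the recursive least squares estimator, with R1 > 0 it tracks slowly varying parameters:

% generate test data from an ARX system (a1 = 1.5, a2 = -0.7, c = [0 1 0.5])
n = 2; m = 2; sig2 = 1;
u = sign(randn(500,1));
y = filter([0 1 0.5],[1 -1.5 0.7],u) + filter(1,[1 -1.5 0.7],sqrt(sig2)*randn(500,1));
% recursive identification
np    = n + m + 1;                           % number of parameters
theta = zeros(np,1);                         % initial parameter estimate
Q     = 1e3*eye(np);                         % large initial covariance (vague prior)
R1    = 0;                                   % set > 0, e.g. 1e-4*eye(np), to track drift
for t = max(n,m)+1:length(y)
    phi   = [y(t-1:-1:t-n)' u(t:-1:t-m)'];   % regression row phi(t), cf. (6.72)
    K     = Q*phi' / (sig2 + phi*Q*phi');    % gain, cf. (6.75)
    theta = theta + K*(y(t) - phi*theta);    % update, cf. (6.74)
    Q     = (eye(np) - K*phi)*Q + R1;        % covariance update, cf. (6.76)
end
theta                                        % should be close to [1.5 -0.7 0 1 0.5]'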


Figure 6.10: System in feedback loop

6.6.4 Design of identification experiments

In system identification the accuracy of the estimates that are found depends on the properties of the input signal. The input signal needs to be sufficiently "rich" to acquire the desired information about the system. An important limitation often is that the input signal amplitude cannot be larger than the system can absorb. The problem of how to select the input signal — if this freedom is available — is known as the problem of the design of identification experiments.

6.6.5 Identification in the closed loop

It sometimes happens that the system that needs to be identified is part of a feedback loop as in Fig. 6.10. By the feedback connection the input signal u is no longer uncorrelated with the noise v. This causes special difficulties.

6.6.6 Robustification

Commonly, observed time series and signal records contain flaws, such as missing data points or observations that are completely wrong ("outliers"). To spot and possibly correct such exceptions it is imperative to plot the data and inspect them for missing data points, outliers and other anomalies before further processing.

If minimum prediction error methods are used then the estimates may be "robustified" by weighting outliers less heavily. Similar techniques may be conceived for other identification techniques. Robustification is a subject that is extensively studied in statistics.

6.7 Problems

6.1 Cross periodogram. Prove that the cross periodogram (6.32) is the DCFT of the estimate (6.21) of the cross covariance function of Yt and Ut with M = N. Hint: Compare Proof A.1.10 (p. 94).

6.2 Consider the diagram of Fig. 6.1 (p. 75) and assume that the system is a convolution zt = Σ_m hm u_{t−m}. Why is it not a good idea to try to identify the system with the wide sense stationary input ut generated via the MA(2) scheme

ut = (1 + q^{-2}) ǫt?

6.3 Mean-square error linear filtering⋆. Consider

Yt = Σ_{m=−∞}^{∞} hm U_{t−m} + Vt

and suppose that Yt, Ut and Vt are zero mean wide sense stationary processes. In Subsection 6.2.4 it is shown that H(e^{iω}) = φyu(ω)/φu(ω) provided that Ut and Vt are uncorrelated processes (i.e. E Ut Vn = 0 for all t, n).

In the absence of knowledge of properties of Vt it makes sense to try to "explain" Yt as much as possible by the input Ut. Therefore consider

min_{hm, m∈Z} E( Yt − Σ_{m=−∞}^{∞} hm U_{t−m} )².   (6.78)

This is an example of linear filtering. Show that the DCFT h(ω) of the hm that minimize this expression is determined by the linear equation

h(ω) = φyu(ω)/φu(ω).   (6.79)

Note: Recall Eqn. (6.19). The above equation, when seen as an equation in h(ω), is the frequency domain version of the Wiener-Hopf equation. The hm so obtained need not correspond to a causal system, that is, the inverse DCFT hm of φyu(ω)/φu(ω) may be nonzero for certain m < 0. Minimizing (6.78) with respect to causal systems hm is more involved; it is the famous Wiener filtering problem, related to Kalman filtering. A fairly straightforward way to circumvent the problem of causality is to consider FIR systems, considered in the following problem.

6.4 Causal FIR systems. It is well known that discrete time systems of the form Zt = Σ_{m=−∞}^{∞} hm U_{t−m} are causal if and only if hm = 0 for all m < 0, that is, if

Zt = Σ_{m=0}^{∞} hm U_{t−m}.

We say the system is FIR (finite impulse response) if only a finite number of values hm, m ∈ N, is nonzero. Causal FIR systems thus can be expressed as a finite sum

Zt = Σ_{m=0}^{M−1} hm U_{t−m},   M < ∞.

FIR systems are popular for their simplicity and inherent stability.

a) Show that FIR systems are BIBO-stable.

b) Given zero mean wide sense stationary Yt and Ut, show that hm, m = 0, 1, ..., M − 1, minimizes E(Yt − Σ_{m=0}^{M−1} hm U_{t−m})² if and only if

[ ru(0)      ru(1)      ···  ru(M−1)
  ru(1)      ru(0)      ···  ru(M−2)
  ···        ···        ···  ···
  ru(M−1)    ru(M−2)    ···  ru(0)   ] [ h0; h1; ...; h_{M−1} ] = [ ryu(0); ryu(1); ...; ryu(M−1) ],   (6.80)

where the M × M matrix on the left is denoted Ru.

c) Suppose the hm minimize the above expected squared error. Let Et = Yt − Σ_{m=0}^{M−1} hm U_{t−m}. Show that

σ²_E = σ²_Y − h^T Ru h,   (6.81)

where h is the vector h = [h0 ··· h_{M−1}]^T and Ru is the covariance matrix defined in Eqn. (6.80).

Note the resemblance of (6.80) with the Yule-Walker equations (2.63). The expression (6.81) shows that σE ≤ σY, as is to be expected.

6.5 Estimation of the ARX model.³ Argue that the maximum likelihood method and the prediction error method lead to (approximately) the same estimators for the parameters of the ARX model as the least squares method.

6.6 Schemes of Box-Jenkins and Ljung. The ARMAX scheme

Yt = (P(q)/D(q)) ut + (N(q)/D(q)) ǫt   (6.82)

is a very general model for linear time-invariant systems. Several variants exist.

a) The Box-Jenkins scheme is given by

Yt = (P(q)/D(q)) ut + (N(q)/R(q)) ǫt.   (6.83)

i. Show that by reducing the fractions on the right-hand side to the common denominator D(q)R(q) the scheme may be converted to an ARMAX scheme. Some structural information is lost this way, however.

ii. Determine the one-step predictor and the resulting prediction error for this scheme.

b) Ljung (1987) works with the general model

D(q) Yt = (P(q)/Q(q)) ut + (N(q)/R(q)) ǫt.   (6.84)

i. Show that also this scheme may be converted to an ARMAX scheme.

ii. Determine the one-step predictor and the resulting prediction error for this scheme.

3Examination May 30, 1995.


A Proofs

This appendix contains a number of proofs and additional technical results.

Chapter 2

Lemma A.1.1 (Cauchy-Schwarz). For stochastic variables X, Y with finite second order moments E(X²) < ∞, E(Y²) < ∞, there holds that

1. |E(XY)|² ≤ E(X²) E(Y²);

2. |E(XY)|² = E(X²) E(Y²) if and only if Y = 0 or X = kY for some k ∈ R.

Proof. Let σ²_X := E(X²) and σ²_Y := E(Y²). The result is trivial if X = 0 or Y = 0. Consider X ≠ 0 and Y ≠ 0. Then for any γ ∈ R,

0 ≤ E( X/σX − γ Y/σY )²   (A.1)
  = E( X²/σ²_X − 2γ XY/(σX σY) + γ² Y²/σ²_Y )
  = 1 − 2γ E(XY)/(σX σY) + γ².

As the above is nonnegative it follows that

2γ E(XY)/(σX σY) ≤ 1 + γ².   (A.2)

For γ = ±1 this reads |E(XY)|/(σX σY) ≤ 1. This proves Condition 1.

If equality |E(XY)|² = E(X²) E(Y²) holds then for γ = sgn(E(XY)) the expression (A.1) is zero, so necessarily X/σX − γ Y/σY = 0. Take k = γ σX/σY.

Proof A.1.2 (Lemma 2.3.1). If r(τ) is absolutely summable then its spectral density φ(ω) exists and r(0) = (1/2π) ∫_{−π}^{π} φ(ω) dω < ∞. Therefore ψ(ω) := √φ(ω) is square integrable and hence its inverse Fourier transform — call it ht — is square summable. The spectral density of Y := h ∗ ǫ equals σ²|ψ|² = σ²φ, hence, modulo scaling, Yt has covariance function r(τ).

Proof A.1.3 (Levinson-Durbin algorithm (p. 14)). For any n let θn denote the column vector of coefficients θn = [a_{n1} a_{n2} ··· a_{nn}]^T determined by (2.64). Then θ_{n+1} by definition satisfies

P_{n+1} θ_{n+1} = v_{n+1},   (A.3)

where

P_{n+1} = [ ρ(0)     ρ(1)     ···  ρ(n−1)  ρ(n)
            ρ(1)     ρ(0)     ···  ρ(n−2)  ρ(n−1)
            ⋮        ⋮              ⋮       ⋮
            ρ(n−1)   ρ(n−2)   ···  ρ(0)    ρ(1)
            ρ(n)     ρ(n−1)   ···  ρ(1)    ρ(0)  ],
θ_{n+1} = [ a1  a2  ···  an  a_{n+1} ]^T,   v_{n+1} = [ ρ(1)  ρ(2)  ···  ρ(n)  ρ(n+1) ]^T.

For notational convenience we introduced here besides θ_{n+1} also the short-hands P_{n+1} and v_{n+1} (for any n). We further need the matrix Jn defined as the n × n anti-diagonal matrix

Jn = [ 0  ···  0  1
       0  ···  1  0
       ···  ···  ···  ···
       1  0  0  0 ].   (A.4)

Premultiplication by J reverses the order of the rows, so JM is M but with its rows reversed. Postmultiplication as in MJ reverses the columns of M. Now, an interesting property of Pn is that it is symmetric and constant along diagonals. This fact implies that

Jn Pn = Pn Jn,   (A.5)

that is, reversing the order of the rows is the same as reversing the order of the columns of Pn. You may want to verify this.

Assume we have determined θn, that is, we solved θn from

Pn θn = vn.   (A.6)

Equation (A.3) may be expressed in block-partitioned matrices as

[ Pn       J vn
  vn^T J   ρ(0) ] θ_{n+1} = [ vn
                              ρ(n+1) ].   (A.7)

Since vn = Pn θn and (J Pn) θn = (Pn J) θn we get that

[ Pn       Pn J θn
  vn^T J   ρ(0)    ] θ_{n+1} = [ Pn θn
                                 ρ(n+1) ].   (A.8)

Split θ_{n+1} as θ_{n+1} = [ s;  a_{n+1,n+1} ] with s ∈ R^n, then

[ Pn       Pn J θn
  vn^T J   ρ(0)    ] [ s
                       a_{n+1,n+1} ] = [ Pn θn
                                         ρ(n+1) ].   (A.9)

From the top row-block we can solve for s,

s = θn − a_{n+1,n+1} J θn,   (A.10)

then a_{n+1,n+1} follows from the bottom row (inserting ρ(0) = 1),

vn^T J (θn − a_{n+1,n+1} J θn) + a_{n+1,n+1} = ρ(n+1).   (A.11)

This gives the partial correlation coefficient

a_{n+1,n+1} = ( ρ(n+1) − vn^T J θn ) / ( 1 − vn^T θn )   (A.12)

and then the other coefficients s of θ_{n+1} = [ s;  a_{n+1,n+1} ] follow from (A.10).
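The recursion (A.10)–(A.12) is easily coded. A minimal sketch; the correlation coefficients in rho are illustrative example values, with ρ(0) = 1 implicit:

rho   = [0.8 0.5 0.2];        % rho(1), ..., rho(N)
N     = length(rho);
theta = rho(1);               % theta_1 = a_11 = rho(1)
for n = 1:N-1
    v = rho(1:n)';                                        % v_n
    a = (rho(n+1) - v'*flipud(theta)) / (1 - v'*theta);   % a_{n+1,n+1}, cf. (A.12)
    s = theta - a*flipud(theta);                          % cf. (A.10)
    theta = [s; a];                                       % theta_{n+1}
end
theta                          % AR(N) coefficients a_N1, ..., a_NN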


Chapter 3

Proof A.1.4 (Cramér-Rao inequality, (p. 35)). Denote the joint probability density function of the stochastic variables X1, X2, ..., XN as f(x, θ). By the definition of the probability density function and the assumed unbiasedness of the estimator we have

1 = ∫_{R^N} f(x, θ) dx,   θ = ∫_{R^N} s(x) f(x, θ) dx.

The integrals are multiple integrals with respect to x1, x2, ..., xN. Partial differentiation of the two expressions with respect to θ yields

0 = ∫_{R^N} fθ(x, θ) dx,   1 = ∫_{R^N} s(x) fθ(x, θ) dx,   (A.13)

where the subscript θ denotes partial differentiation with respect to θ. By the substitution

fθ(x, θ) = ( ∂/∂θ log f(x, θ) ) f(x, θ) = Lθ(x, θ) f(x, θ)   (A.14)

it follows that

0 = ∫_{R^N} Lθ(x, θ) f(x, θ) dx,   1 = ∫_{R^N} s(x) Lθ(x, θ) f(x, θ) dx.   (A.15)

These equalities may be rewritten as

0 = E Lθ(X, θ),   1 = E S Lθ(X, θ),   (A.16)

with S = s(X). By subtracting θ times the first equality from the second we find

E[(S − θ) Lθ(X, θ)] = 1.   (A.17)

Then by the Cauchy-Schwarz inequality,

1 ≤ E(S − θ)² E[Lθ(X, θ)]².   (A.18)

This proves (3.15) with M(θ) = E[Lθ(X, θ)]². It remains to prove the other equality of (3.16). Partial differentiation of the first equality of (A.15) yields

0 = ∫_{R^N} Lθθ(x, θ) f(x, θ) dx + ∫_{R^N} Lθ(x, θ) fθ(x, θ) dx.   (A.19)

With (A.14) we find from this that

0 = ∫_{R^N} Lθθ(x, θ) f(x, θ) dx + ∫_{R^N} [Lθ(x, θ)]² f(x, θ) dx,

or

M(θ) = E[Lθ(X, θ)]² = −E Lθθ(X, θ).

This completes the proof of (3.15–3.16). The Cauchy-Schwarz inequality (A.18) is an equality if and only if

Lθ(x, θ) = k(θ) (s(x) − θ)   for all x ∈ R^N, θ ∈ R,   (A.20)

for some k ∈ R (which may depend on θ).

Proof A.1.5 (Cramér-Rao inequality for the vector case, (p. 36)). Denote the joint probability density function of the stochastic variables X1, X2, ..., XN as f(x, θ). By the definition of the probability density function and the assumed unbiasedness of the estimator we have

∫_{R^N} f(x, θ) dx = 1,   ∫_{R^N} s(x) f(x, θ) dx = θ.

The integrals are multiple integrals with respect to x1, x2, ..., xN. Partial differentiation of the two expressions with respect to the (vector) θ yields

∫_{R^N} f_{θ^T}(x, θ) dx = 0,   ∫_{R^N} s(x) f_{θ^T}(x, θ) dx = I,   (A.21)

where the subscript θ^T denotes partial differentiation with respect to the row vector θ^T. By the substitution

f_{θ^T}(x, θ) = f(x, θ) ( ∂/∂θ^T log f(x, θ) ) = f(x, θ) L_θ^T(x, θ)   (A.22)

it follows that

E L_θ^T = ∫_{R^N} f(x, θ) L_{θ^T}(x, θ) dx = 0,   (A.23)
E(S L_θ^T) = ∫_{R^N} s(x) f(x, θ) L_{θ^T}(x, θ) dx = I.

By subtracting θ times the first equality from the second we find

E[(S − θ) L_θ^T] = I.   (A.24)

Note that

E( [S − θ; Lθ] [ (S − θ)^T  L_θ^T ] ) = [ var(S)  I
                                          I        M ].   (A.25)

Here M = E Lθ L_θ^T is Fisher's information matrix. By construction the matrix (A.25) is nonnegative definite,

[ var(S)  I
  I        M ] ≥ 0.   (A.26)

Then so is

[ I  −M^{−1} ] [ var(S)  I;  I  M ] [ I;  −M^{−1} ] = var(S) − M^{−1} ≥ 0.

This is what we set out to prove. It also shows that

var(S) − M^{−1} = [ I  −M^{−1} ] ( E [S − θ; Lθ] [ (S − θ)^T  L_θ^T ] ) [ I;  −M^{−1} ]
                = E( S − θ − M^{−1} Lθ )( S − θ − M^{−1} Lθ )^T.

Therefore var(S) = M^{−1} if and only if S − θ − M^{−1} Lθ = 0, that is, if and only if Lθ = M(S − θ).

It remains to prove the other equality of (3.20). Partial differentiation of the first equality of (A.23) with respect to the column vector θ yields

0 = ∫_{R^N} L_{θθ^T}(x, θ) f(x, θ) dx + ∫_{R^N} fθ(x, θ) L_{θ^T}(x, θ) dx.   (A.27)

With (A.22) we find from this that

0 = ∫_{R^N} L_{θθ^T}(x, θ) f(x, θ) dx + ∫_{R^N} Lθ(x, θ) f(x, θ) L_{θ^T}(x, θ) dx,

which means that

E[Lθ(X, θ) L_{θ^T}(X, θ)] = −E L_{θθ^T}(X, θ).   (A.28)

Proof A.1.6 (Lemma 3.4.2). The expectation of KX is

E(KX) = E(K(Wθ + ǫ)) = KWθ + K(Eǫ) = KWθ,

so the estimator KX is unbiased (for arbitrary θ) if and only if KW = I.

It is interesting that a sum of squares such as E‖θ̂ − θ‖² = Σ_j E(θ̂j − θj)² can also be expressed as the sum of the diagonal elements of the corresponding variance matrix E(θ̂ − θ)(θ̂ − θ)^T. Verify this. The sum of diagonal elements is called the trace and is denoted with tr.

For any unbiased estimator θ̂ = KX of θ the sum of variances can now be expressed as

E‖θ̂ − θ‖² = E‖K(Wθ + ǫ) − θ‖²   (A.29)
           = E‖Kǫ‖²
           = tr E(K ǫ ǫ^T K^T) = σ² tr(K K^T).

Write K as

K = (W^T W)^{−1} W^T + L

with L not yet determined. Now for unbiasedness we need that KW = I,

I = KW = ( (W^T W)^{−1} W^T + L ) W = I + LW.

Hence LW = 0. As a result we have that

K K^T = ( (W^T W)^{−1} W^T + L )( (W^T W)^{−1} W^T + L )^T = (W^T W)^{−1} + L L^T.   (A.30)

It is now direct that tr(K K^T) is minimal iff L = 0, i.e., (A.29) is minimized for Kopt = (W^T W)^{−1} W^T. With this notation we get that

K K^T = Kopt Kopt^T + L L^T ≥ Kopt Kopt^T.

This finally shows that cov(KX) ≥ cov(Kopt X) for any unbiased estimator KX of θ.

Chapter 4

Proof A.1.7 (Eqn. (4.3)). In a sequence of N elements there are N − 2 triples (x_{k−1}, x_k, x_{k+1}). For each triple (a, b, c) the six orderings of its three values are equally likely. Of these six orderings there are four with a turning point, namely those in which the middle element is either the largest or the smallest of the three. So per triple the expected number of turning points is 4/6 = 2/3. There are N − 2 triples, so that Eθ = (2/3)(N − 2).

Proof A.1.8 (Variance of the estimator (4.18) for the mean). For the variance of mN we have

var(mN) = E(mN − m)²
        = E( (1/N) Σ_{t=0}^{N−1} (Xt − m) )²
        = E( (1/N) Σ_{t=0}^{N−1} (Xt − m) )( (1/N) Σ_{s=0}^{N−1} (Xs − m) )
        = (1/N²) Σ_{t=0}^{N−1} Σ_{s=0}^{N−1} E(Xt − m)(Xs − m)
        = (1/N²) Σ_{t=0}^{N−1} Σ_{s=0}^{N−1} r(t − s).

Let k = t − s. For a fixed 0 ≤ k ≤ N − 1 there are N − k pairs (t, s) for which t − s = k,

(t, s) ∈ {(k, 0), (k + 1, 1), ..., (N − 1, N − k − 1)}.   (A.31)

For reasons of symmetry there are N − |k| such pairs if −(N − 1) ≤ k ≤ 0. Therefore

var(mN) = (1/N²) Σ_{k=−N+1}^{N−1} (N − |k|) r(k)
        = (1/N) Σ_{k=−N+1}^{N−1} (1 − |k|/N) r(k).
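As a side remark, this expression is easily checked numerically. A minimal sketch for an AR(1) process; the process parameters and the Monte Carlo setup are illustrative assumptions, and a start-up stretch is discarded so that the realizations are approximately stationary:

a = 0.8; sig2 = 1; N = 200; M = 5000;
m = zeros(M,1);
for i = 1:M
    x = filter(1, [1 -a], sqrt(sig2)*randn(N+100,1));
    x = x(101:end);                     % drop the start-up transient
    m(i) = mean(x);                     % sample mean m_N of one realization
end
k = -(N-1):(N-1);
r = sig2/(1-a^2) * a.^abs(k);           % covariance function of the AR(1) process
fprintf('simulated %g, formula %g\n', var(m), sum((1-abs(k)/N).*r)/N);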

Proof A.1.9 (Covariance of the estimator (4.32) for the covariance function). For k ≥ 0 and k′ ≥ 0 we have

cov(rN(k), rN(k′)) = E( rN(k) − E rN(k) )( rN(k′) − E rN(k′) )
  = E rN(k) rN(k′) − E rN(k) E rN(k′)
  = (1/N²) Σ_{t=0}^{N−1−k} Σ_{s=0}^{N−1−k′} E X_{t+k} Xt X_{s+k′} Xs − (1 − k/N)(1 − k′/N) r(k) r(k′).   (A.32)

With 3.2 (p. 38) it follows that

E X_{t+k} Xt X_{s+k′} Xs = E X_{t+k} Xt E X_{s+k′} Xs + E X_{t+k} X_{s+k′} E Xt Xs + E X_{t+k} Xs E Xt X_{s+k′}
  = r(k) r(k′) + r(t − s + k − k′) r(t − s) + r(t − s + k) r(t − s − k′).

Substitution of this into (A.32) and evaluation yields

cov(rN(k), rN(k′)) = (1/N²) Σ_{t=0}^{N−1−k} Σ_{s=0}^{N−1−k′} [ r(t − s + k − k′) r(t − s) + r(t − s + k) r(t − s − k′) ].

Substituting s = t − i into the second summation and interchanging the order of summation yields

cov(rN(k), rN(k′)) = (1/N) Σ_{i=−N+1+k′}^{N−1−k} w_{N,k,k′}(i) [ r(i + k − k′) r(i) + r(i + k) r(i − k′) ].   (A.33)

Here the function w_{N,k,k′} is defined by

w_{N,k,k′}(i) = (N − k′ − |i|)/N   for i ≤ 0,
              = (N − k − i)/N      for i ≥ k′ − k,
              = (N − k′)/N         for 0 < i < k′ − k.   (A.34)

For simplicity it is assumed here that k′ ≥ k (otherwise interchange the roles of k′ and k). If k = k′, (A.33) reduces to (4.34). For N ≫ max(k, k′) we have w_{N,k,k′}(i) ≈ 1 and (A.33) simplifies to (4.36).

Proof A.1.10 (Relation between the estimator (4.42) and periodogram (4.43)). We write the function pN of (4.43) as

pN(ω) = (1/N) | Σ_{t=0}^{N−1} Xt e^{−itω} |²
      = (1/N) Σ_{t=0}^{N−1} Xt e^{−itω} Σ_{s=0}^{N−1} Xs e^{isω}
      = (1/N) Σ_{t=0}^{N−1} Σ_{s=0}^{N−1} Xt Xs e^{−i(t−s)ω}.

The above double sum equals the sum of all entries in the N × N matrix

(1/N) [ X0X0 e^0         X0X1 e^{iω}     X0X2 e^{i2ω}   ···
        X1X0 e^{−iω}     X1X1 e^0        X1X2 e^{iω}    ···
        X2X0 e^{−i2ω}    X2X1 e^{−iω}    X2X2 e^0       ···
        ···              ···             ···            ··· ].   (A.35)

On each diagonal the exponential term is constant. If we sum the entries of the matrix per diagonal we get

Σ_{k=−N+1}^{N−1} ( (1/N) Σ_{t=0}^{N−|k|−1} X_{t+|k|} Xt ) e^{−iωk} = Σ_{k=−N+1}^{N−1} rN(k) e^{−iωk} = φN(ω).   (A.36)
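This identity is easy to verify numerically. A minimal sketch, with the data and the frequency chosen arbitrarily for illustration:

N = 64; x = randn(N,1);
w = 1.3;                                             % an arbitrary frequency
pN = abs(sum(x .* exp(-1i*w*(0:N-1)')))^2 / N;       % periodogram (4.43)
r  = zeros(1, 2*N-1);                                % biased estimates r_N(k)
for k = -(N-1):(N-1)
    t = (0:N-1-abs(k))';
    r(k+N) = sum(x(t+abs(k)+1) .* x(t+1)) / N;
end
phiN = real(sum(r .* exp(-1i*w*(-(N-1):(N-1)))));    % right-hand side of (A.36)
fprintf('periodogram %g, DCFT of r_N %g\n', pN, phiN);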

Proof A.1.11 (Lemma 4.6.2). For the moment assume that Xt is also white, Xt = ǫt, i.e. that r(k) = 0 for all k ≠ 0. Using the fact that φ̂ǫ(ω) equals the periodogram we get that

φ̂ǫ(ω1) φ̂ǫ(ω2) = (1/N²) Σ_{t=0}^{N−1} Σ_{s=0}^{N−1} Σ_{u=0}^{N−1} Σ_{v=0}^{N−1} (ǫt ǫs ǫu ǫv) e^{−iω1(t−s)} e^{−iω2(u−v)}.   (A.37)

(We skipped the subscript N and added the subscript ǫ to express that this is for white noise only.) In order to compute its expectation we first use (3.23) to find the expectation of ǫt ǫs ǫu ǫv,

E ǫt ǫs ǫu ǫv = E ǫt ǫs E ǫu ǫv + E ǫt ǫu E ǫs ǫv + E ǫt ǫv E ǫs ǫu
             = σ⁴ ( δ_{t−s} δ_{u−v} + δ_{t−u} δ_{s−v} + δ_{t−v} δ_{s−u} ),

where δz denotes the Kronecker delta. Insert this in (A.37) to get

E φ̂ǫ(ω1) φ̂ǫ(ω2)
  = (1/N²) [ Σ_{t=s} Σ_{u=v} σ⁴ + Σ_{t=u} Σ_{s=v} σ⁴ e^{−i(ω1+ω2)(t−s)} + Σ_{t=v} Σ_{s=u} σ⁴ e^{−i(ω1−ω2)(t−s)} ]
  = σ⁴ + (σ⁴/N²) | Σ_{t=0}^{N−1} e^{−i(ω1+ω2)t} |² + (σ⁴/N²) | Σ_{t=0}^{N−1} e^{−i(ω1−ω2)t} |²
  = σ⁴ + (σ⁴/N²) ( sin((ω1+ω2)N/2) / sin((ω1+ω2)/2) )² + (σ⁴/N²) ( sin((ω1−ω2)N/2) / sin((ω1−ω2)/2) )²   (A.38)
  = σ⁴ ( 1 + (1/N) WB(ω1+ω2) + (1/N) WB(ω1−ω2) ).   (A.39)

Indeed, the quotient of the two sinusoids here is Bartlett's spectral window WB for M = N as given by (4.53). Its plot is shown in Fig. 4.6. From this plot it should be intuitively clear (it can also easily be proved) that, still with M = N,

lim_{N→∞} (1/N) WB(ω) = δω,   for all ω ∈ (−2π, 2π).

This together with the fact that φ̂ is asymptotically unbiased gives us

lim_{N→∞} cov(φ̂ǫ(ω1), φ̂ǫ(ω2)) = lim_{N→∞} E( φ̂ǫ(ω1) φ̂ǫ(ω2) ) − φǫ(ω1) φǫ(ω2)
  = σ⁴ ( δ_{ω1+ω2} + δ_{ω1−ω2} )   for all ω1, ω2 ∈ (−π, π),

where φǫ(ω1) φǫ(ω2) = σ⁴. In the limit the covariance hence is zero almost everywhere, but it equals σ⁴ if ω1 = ω2 ≠ 0. For white noise we have σ⁴ = φ²ǫ(ω), so both conditions of Lemma 4.6.2 are verified for the white noise case Xt.

The proof of the non-white case proceeds as follows. Since we assume Xt to be stationary and normally distributed it must be that Xt = Σ_{k=−∞}^{∞} hk ǫ_{t−k} for some square summable sequence hj. (This is a generalization of (3.1).) We know that then φX(ω) = |H(ω)|² φǫ(ω). Now for the estimates it may be shown that

φ̂X(ω) = |H(ω)|² φ̂ǫ(ω) + O(1/√N).

This is enough to prove Lemma 4.6.2. Consider

lim_{N→∞} cov(φ̂X(ω1), φ̂X(ω2)) = E φ̂X(ω1) φ̂X(ω2) − φX(ω1) φX(ω2)
  = E |H(ω1)|² φ̂ǫ(ω1) |H(ω2)|² φ̂ǫ(ω2) − φX(ω1) φX(ω2)
  = |H(ω1)|² |H(ω2)|² σ⁴ ( 1 + δ_{ω1+ω2} + δ_{ω1−ω2} ) − φX(ω1) φX(ω2)
  = φX(ω1) φX(ω2) ( δ_{ω1+ω2} + δ_{ω1−ω2} ),

since |H(ω1)|² |H(ω2)|² σ⁴ = φX(ω1) φX(ω2). Inserting ω1 = ω2 and ω1 ≠ ±ω2 yields the two conditions of Lemma 4.6.2.

Proof A.1.12 (Lemma 4.6.4).

E φ̂^w_N(ω1) φ̂^w_N(ω2) = E [ (1/2π) ∫_{−π}^{π} W(ω1−ρ) φ̂N(ρ) dρ ] [ (1/2π) ∫_{−π}^{π} W(ω2−β) φ̂N(β) dβ ]
  = (1/(2π)²) ∫_{−π}^{π} ∫_{−π}^{π} W(ω1−ρ) W(ω2−β) [ E φ̂N(ρ) φ̂N(β) ] dβ dρ.

According to (A.39) we have that E φ̂N(ρ) φ̂N(β) = σ⁴ ( 1 + (1/N) WB(ρ+β) + (1/N) WB(ρ−β) ) (for N = M). Now, as N → ∞ the function WB(ω) approaches 2π Σ_k δ(ω − 2kπ), with δ(ω) denoting the Dirac delta function. This is because of (4.50) and the fact that WB has period 2π and for any ǫ > 0 the integral ∫_{ǫ≤|ω|≤π} WB(ω) dω → 0 as N → ∞. Then (think about it),

lim_{N→∞} E φ̂^w_N(ω1) φ̂^w_N(ω2)
  = (1/(2π)²) ∫_{−π}^{π} ∫_{−π}^{π} W(ω1−ρ) W(ω2−β) [ σ⁴ ( 1 + (2π/N) δ(ρ+β) + (2π/N) δ(ρ−β) ) ] dβ dρ.

Without the two delta terms the right-hand side actually is φ(ω1) φ(ω2) = σ⁴! So

lim_{N→∞} cov(φ̂X(ω1), φ̂X(ω2)) = lim_{N→∞} E φ̂^w_N(ω1) φ̂^w_N(ω2) − σ⁴
  = (σ⁴/(2πN)) ∫_{−π}^{π} ∫_{−π}^{π} W(ω1−ρ) W(ω2−β) [ δ(ρ+β) + δ(ρ−β) ] dβ dρ
  = (σ⁴/(2πN)) ∫_{−π}^{π} W(ω1−ρ) [ W(ω2−ρ) + W(ω2+ρ) ] dρ.

In the second equality we used the sifting property of delta functions, which says that ∫_{−∞}^{∞} f(t) δ(t−a) dt = f(a) for any function f(t) that is continuous at t = a.

Proof A.1.13 (Inverse DDFT (4.67)). For given xk define x̂n := Σ_{k=0}^{L−1} xk e^{−i(2π/L)nk}. Then

(1/L) Σ_{n=0}^{L−1} x̂n e^{i(2π/L)nk}
  = (1/L) Σ_{n=0}^{L−1} ( Σ_{m=0}^{L−1} xm e^{−i(2π/L)nm} ) e^{i(2π/L)nk}
  = (1/L) Σ_{n=0}^{L−1} Σ_{m=0}^{L−1} xm e^{−i(2π/L)nm} e^{i(2π/L)nk}
  = (1/L) Σ_{m=0}^{L−1} xm Σ_{n=0}^{L−1} e^{−i(2π/L)n(m−k)}
  = (1/L) Σ_{m=0}^{L−1} xm Σ_{n=0}^{L−1} ( e^{−i(2π/L)(m−k)} )^n
  = (1/L) Σ_{m=0}^{L−1} xm · { ( 1 − (e^{−i2π(m−k)/L})^L ) / ( 1 − e^{−i2π(m−k)/L} )  if e^{−i(2π/L)(m−k)} ≠ 1;   L  if e^{−i(2π/L)(m−k)} = 1 }
  = (1/L) Σ_{m=0}^{L−1} xm · { 0  if m ≠ k;   L  if m = k }
  = xk.
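In MATLAB the DDFT and the inversion formula (4.67) correspond to the pair fft and ifft, the latter including the factor 1/L. A minimal sketch:

L = 8; x = randn(L,1);
xhat = fft(x);            % xhat(n+1) = sum_k x(k+1) exp(-i*2*pi*n*k/L)
xrec = ifft(xhat);        % (1/L) sum_n xhat(n+1) exp(+i*2*pi*n*k/L)
max(abs(x - xrec))        % zero up to rounding errors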

Frequency content of sampled signals. We review a proper mathematical foundation for Equality (4.84), which we copy here:

x∗(ω) = Σ_{k=−∞}^{∞} x̂(ω − k/T),   ω ∈ R.   (A.40)

The equality is about the connection of the CCFT x̂ of a continuous time signal x(t) and the adjusted DCFT x∗ (4.83) of the sampled signal x∗_n := x(nT), n ∈ Z, with sampling time T.

There are counterexamples to (A.40) in that functions x(t) exist for which both x̂ and x∗ are well defined, while the right-hand side of (A.40) is not defined. An example is the continuous function x(t) defined per interval as

x(t) = e^{−α|t|} sin( (2(2^{2k} + 1) + 1) t ),   for kπ ≤ |t| ≤ (k + 1)π, k ∈ N.

The value of α must be positive but is otherwise arbitrary. Since |sin| is bounded by one, we see that x(t) decays to zero exponentially fast. This guarantees that x̂ and x∗ exist. The x(t) is unusual in that it oscillates extremely fast as |t| increases. This is the source of the violation of (A.40). We need to exclude arbitrarily fast oscillations.

Let V_{a,b}(x) denote the total variation of a function x(t) on the interval [a, b] ⊂ R,

V_{a,b}(x) = sup_{a=t0<t1<···<tn=b} Σ_{i=1}^{n} |x(ti) − x(t_{i−1})|.   (A.41)

We say that a function x(t) is of uniform bounded variation if sup_{t∈R} V_{t,t+1}(x) < ∞. Uniform bounded variation implies, roughly speaking, that the oscillations of the function do not grow without bound as time increases. It may now be shown that if x(t) is exponentially bounded by some C e^{−α|t|}, α > 0, C > 0, and if x(t) is of uniform bounded variation, then (A.40) is correct provided we take the sampled signal x∗_n to be defined as

x∗_n := ( x(nT−) + x(nT+) ) / 2.   (A.42)

This is well defined because for functions of bounded variation the limits x(t−) and x(t+) exist for every t. If x(t) is continuous then this is the ordinary sampled signal x∗_n = x(nT) and we recover the original (A.40).

Chapter 5

Proof A.1.14 (Eqn. (5.29) is the minimal ǫ^Tǫ). Let y = ǫ − ǫ̂ where ǫ̂ is defined as

ǫ̂ = M^T (M M^T)^{−1} (X − μe).   (A.43)

The ǫ satisfies X − μe = Mǫ if and only if M y = 0 because

X − μe = Mǫ = M(ǫ̂ + y) = M M^T (M M^T)^{−1} (X − μe) + M y = X − μe + M y.

Knowing that M y = 0 gives ǫ̂^T y = 0. Therefore

ǫ^Tǫ = [ǫ̂ + y]^T [ǫ̂ + y] = ǫ̂^Tǫ̂ + ǫ̂^T y + y^Tǫ̂ + y^T y = ǫ̂^Tǫ̂ + y^T y.

It is now immediate that ǫ^Tǫ is minimal if and only if y = 0. In that case we have

ǫ^Tǫ = ǫ̂^Tǫ̂ = (X − μe)^T (M M^T)^{−1} M · M^T (M M^T)^{−1} (X − μe) = (X − μe)^T (M M^T)^{−1} (X − μe).


B The Systems Identification Toolbox

In this appendix we list those routines of the Systems Identification Toolbox of MATLAB that appear in the lecture notes and are useful for the computer laboratory exercises. For more extensive information we refer to the help function of MATLAB and the Toolbox manual. In the list only the simplest way of calling each routine is described.

All estimation routines assume that the time series or output/input pairs are centered. This can be done with the function dtrend.

th = ar(z,n)
Estimate the parameters th of an AR scheme from the observed time series z. The integer n is the degree of the polynomial that characterizes the scheme. The time series z is arranged as a column vector. See page 62.

th = armax(z,nn)
Estimate the parameters th of an ARMA or ARMAX scheme from the time series z or the output/input pair z = [y u]. The row matrix nn contains the degrees of the polynomials that occur in the scheme. The parameters are specified in the special theta format of the Toolbox. Enter the command help theta for a description of this format. See also the function present. See pages 68 and 84.

th = arx(z,nn)
Estimate the parameters th (in theta format) of an AR or ARX scheme from the time series z or the output/input pair z = [y u]. The row matrix nn contains the degrees of the polynomials that characterize the scheme. See page 81.

bode(g)
Show a Bode plot of the frequency function g. Use help idmode/help for more information. See pages 78 and 79.

r = covf(z,M)
Determine an estimate r of the covariance function of the time series z over M points. See page 46.

ir = cra(z)
Produce an estimate ir of the impulse response of a system from the output/input pair z = [y u] using correlation techniques. See pages 78 and 84.

z = detrend(z,'constant')
Center the time series z or the output/input pair z = [y u]. dtrend(z,'linear') removes a linear trend. See page 78.

ffplot(g)
Plot the frequency function g with linear scales.

[omega,ampl,phase] = getff(g)
Determine the frequency axis omega, the amplitude ampl and the phase phase of the frequency function g. See pages 20 and 53.

y = idsim(u,th)
Compute the output signal y of a system with parameters th (in theta format) for the input signal u. See for instance pages 15 and 81.

th = iv4(z,nn)
Estimate the parameters th (in theta format) of an ARX model with output/input pair z = [y u] according to a four stage IV method. The row vector nn contains the degrees of the polynomials that determine the model. See page 84.

th = poly2th(A,B,C,D,F,lam)
Convert the polynomials A, B, C, D, F of a Ljung scheme to the parameter matrix th in theta format. The parameter lam is the variance of the white noise. See for instance page 15.

p = predict(z,th,k)
Determine a k step prediction p of the output signal of the system with parameters th (in theta format) with output/input pair z = [y u]. This function may also be used for the prediction of time series. See page 25.

present(th)
Show the information in the theta matrix th. See page 84.

e = pe(th,x)
Compute the residuals of a system model with parameter matrix th (in theta format) that follow from the time series z or the output/input pair z = [y u]. The command resid(th,x) is supposed to plot and also test the residuals, but it contains a mistake. See page 81.

g = spa(z)
Find an estimate g (in frequency format) of the frequency response function of a system with output/input pair z = [y u] using spectral methods. If z is a time series then g is an estimate of the spectral density function. The routine also supplies the standard deviation of the estimation error. See page 53.

g = th2ff(th)
Compute the frequency response function g (in frequency function format) of the system with theta structure th. See page 20.

[A,B,C,D,F] = th2poly(th)
Convert the theta matrix th to the polynomials A, B, C, D, F of the corresponding Ljung scheme.

Especially for this course the following routines were developed.

r = cor(z,n)
This routine estimates the correlation function of z. The integer n is the number of points over which the function is computed. The output argument r contains the correlation function in an n-dimensional row vector. In addition the function and its confidence intervals are plotted.

p = pacf(z,n)
This routine estimates the partial auto-correlation function of z. The integer n is the number of points over which the function is computed. The output argument p contains the partial correlation function in an n-dimensional row vector. In addition the function and its confidence intervals are plotted.


C Glossary English–Dutch

aliasing vouweffect

amplitude distribution amplitudeverdeling

angular frequency hoekfrequentie

ARMA process ARMA-proces

ARMA scheme ARMA-schema

auto-covariance function autocovariantiefunctie

auto-regressive process autoregressief proces

backward shift operator achterwaartse verschuivingsoperator

band filter bandfilter

band limited in bandbreedte begrensd

bias onzuiverheid

centered process gecentreerd proces

characteristic function karakteristieke functie

coherence spectrum coherentiespectrum

complex conjugate toegevoegd complexe waarde

consistent consistent

consistency consistentie

continuous time continue-tijd

convolution convolutie

correlation function correlatiefunctie

covariance covariantie

covariance function covariantiefunctie

cross covariance function kruiscovariantiefunctie

cross spectral density kruis-spectraledichtheidsfunctie

denumerable aftelbaar

difference operator differentieoperator

disturbance signal stoorsignaal

entry element

efficient efficiënt

ergodic ergodisch

estimation error schattingsfout

estimation schatting

estimator schatter

exogenous exogeen

expectation verwachting, verwachtingswaarde

fast Fourier transform snelle Fouriertransformatie

filter filter

final prediction error uiteindelijke voorspelfout

Fisher’s information matrix informatiematrix van Fisher

forecasting voorspellen

Fourier transform Fouriergetransformeerde

Fourier transformation Fouriertransformatie

frequency frequentie

frequency response frequentieresponsie

frequency response function frequentieresponsiefunctie

gain versterking

gain matrix versterkingsmatrix

generating function genererende functie

gradient gradiënt

Hessian matrix van Hess

identifiability identificeerbaarheid

impulse response impulsresponsie

information criterion informatiecriterium

initial condition beginvoorwaarde

input ingang

input signal ingangssignaal

instrumental variable method instrumentele-variabelemethode

invertible inverteerbaar

joint gezamenlijk

Kalman filter Kalmanfilter

least squares method kleinste-kwadratenmethode

likelihood function aannemelijkheidsfunctie

line search lijnzoeken

long division staartdelen

main lobe hoofdlob

maximum likelihood estimator maximum-aannemelijkheidsschatter

mean value function gemiddelde-waardefunctie

mean gemiddelde

model set modelverzameling

moving average process gewogen-gemiddeldeproces

moving average gewogen gemiddelde

multi-dimensional meerdimensionaal

noise ruis

non-parametric niet-parametrisch

nonnegative definite positief-semidefiniet

normally distributed normaal verdeeld

Nyquist frequency Nyquistfrequentie

order determination ordebepaling

outlier uitschieter

output uitgang

output signal uitgangssignaal

parametric parametrisch

partial correlation coefficients partiële correlatiecoëfficiënten

partial fraction expansion breuksplitsen

periodogram periodogram

persistently exciting persistent exciterend

plot grafiek

prediction error voorspelfout

prediction voorspelling

predictor voorspeller

presampling filter conditioneringsfilter

probability density kansdichtheid

probability law kanswet

random process toevalsproces, stochastisch proces

random variable toevalsvariabele, stochastische variabele

random walk stochastische wandeling

realization realisatie

rectangular rechthoekig

residual residu

resolution oplossend vermogen

running average process lopend-gemiddeldeproces

sample (n.) steekproef

sample (v.) bemonsteren

sample average steekproefgemiddelde

sample mean steekproefgemiddelde

sampling theorem bemonsteringsstelling

script script

seasonal component seizoenscomponent

side lobe zijlob

spectral density spectrale dichtheid

spectral density function spectrale dichtheidsfunctie

stable stabiel

standard deviation standaarddeviatie, spreiding

state toestand

stationary stationair

steepest descent method steilste-hellingmethode

step response stapresponsie

stochastic stochastisch

system identification systeemidentificatie

test toets, test

time dependent tijdsafhankelijk

time invariant tijdsonafhankelijk

time series tijdreeks

time shift tijdverschuiving

transform getransformeerde

transformation transformatie

trend trend

unbiased zuiver

update (v.) bijwerken

variance matrix variantiematrix

white noise witte ruis

wide-sense stationary zwak-stationair

window venster

windowing vensteren

z -transform z -getransformeerde


D Bibliography

A. Bagchi. Optimal Control of Stochastic Systems. Prentice Hall International, Hempstead, U.K., 1993.

A. Bagchi and R. C. W. Strijbos. Tijdreeksenanalyse en identificatietheorie. Lecture notes (in Dutch), Faculty of Mathematical Sciences, University of Twente, 1988.

G. E. P. Box and G. M. Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, San Francisco, 1970.

C. S. Burrus and T. W. Parks. DFT/FFT and Convolution Algorithms. John Wiley, New York, 1985.

J. Durbin. Estimation of parameters in time-series regression models. J. R. Statist. Soc. B., 22:139–153, 1960.

M. Kendall and J. K. Orr. Time Series. Edward Arnold, U.K., third edition, 1990.

H. Kwakernaak and R. Sivan. Linear Optimal Control Systems. Wiley-Interscience, New York, 1972.

H. Kwakernaak and R. Sivan. Modern Signals and Systems. Prentice Hall, Englewood Cliffs, N. J., 1991.

L. Ljung. System Identification: Theory for the User. Prentice Hall, Englewood Cliffs, N. J., 1987.

L. Ljung. User's Guide of the System Identification Toolbox for Use with MATLAB. The MathWorks, Natick, Mass., U.S.A., 1991.

S. L. Marple. Digital Spectral Analysis. Prentice-Hall, Englewood Cliffs, N. J., 1987.


Index

anti-aliasing filter, 55ARIMA scheme, 22ARMAX scheme, 81ARMA scheme, 16

estimation of-, 64frequency response, 18

ARX scheme, 80AR scheme, 11

asymptotically stationary, 12estimation of-, 59stable-, 13

asymptoticallyefficient estimator, 35stable, 13stable ARMA process, 16stationary AR processes, 12unbiased estimator, 34wide-sense stationary process, 12, 13

auto-covariance function, 77auto-regressive process, 11

band filter, 19band limited signal, 55Bartlett’s window, 48bias, 34BIBO stability, 76Box-Jenkins, 59, 89

scheme, 89

CCFT, 21Census X-11, 41, 43centered process, 11centering, 11characteristic function, 33Chebyshev inequality, 34classical time series analysis, 41coherence spectrum, 78consistent estimator, 34continuous time processes, 53convolution, 11

sum, 17system, 17

Cooley-Tukey, 51correlation

function, 7matrix, 7, 8

covariancefunction, 6, 8matrix, 7, 8, 32

Cramér-Rao inequality, 35, 61, 92

vector case, 36, 92cross

covariance function, 76periodogram, 78spectral density, 77

Daniell’s window, 50DCFT, 17, 50

inverse, 18DDFT, 51density function, 5design of experiments, 88DFT, 17difference operator, 22distribution

Gaussian-, 31joint probability-, 7normal-, 31probability-, 5standard normal-, 31

efficient estimator, 35ergodic process, 44error-in-the-equation model, 80estimate, 34estimation error, 34estimator, 34, 35

asymptotically unbiased, 34consistent, 34unbiased, 34

exogenous signal, 80

fast Fourier transform, 51FFT, 51filter, 19final prediction error criterion, 71Fisher’s information matrix, 36forecasting, 24forward shift operator, 10Fourier transform, 21

discrete, 17, 51fourth-order moment, 38FPE criterion, 71frequency

analysis, 76response function, 17, 22

Gauss-Newton algorithm, 69Gaussian distribution, 31generating function, 20


Hamming’s window, 50Hann’s window, 48Hanning, 48

i, 17identifiability, 87identification in closed loop, 88impulse response, 17, 22

analysis, 75incidental component, 22, 41information criterion

of Akaike, 70of Rissanen, 71

instrumental variable, 81instrumental variable method, 80inverse DCFT, 18invertible scheme, 15IV method, 81

Kalman filter, 86

least squaresalgorithm, 69estimator

AR scheme, 59ARMA scheme, 64ARX scheme, 80MA scheme, 63

non-linear, 63, 69recursive, 59

Levenberg-Marquardt algorithm, 69Levinson-Durbin algorithm, 14, 70likelihood function, 35line search, 68Ljung’s scheme, 89

main lobe, 48Markov scheme, 11matrix

correlation-, 7, 8covariance-, 7, 8, 32nonnegative definite-, 7, 36positive definite-, 36variance-, 7, 32

maximum likelihood estimator, 34, 60, 65MA scheme, 9

estimation of-, 63mean, 5

estimation of-, 43value function, 5

model set, 87moving average, 42

process, 9

Newton algorithm, 68non-linear optimization, 68nonnegative definite matrix, 7, 36normally distributed processes, 31

normal distribution, 31standard-, 31

Nyquist frequency, 55

observer canonical form, 86order determination, 69, 70

Parseval, 50partial correlation coefficients, 14, 70periodogram, 47persistently exciting, 77, 83, 86positive definite matrix, 36power, 19power spectral density, 18prediction, 24

of ARMA process, 24prediction error, 24prediction error method, 65, 82, 86

accuracy of-, 67presampling filter, 55probability density under transformation, 32probability distribution, 5probability law, 7process

asymp. wide sense stationary, 12, 13centered, 11moving average, 9running average, 10

quasi-Newton algorithm, 69

recursive identification, 87residuals

analysis of-, 70resolution, 50, 77robustification, 88running average process, 10

sampling, 53sampling theorem, 51, 55seasonal

component, 22, 41process, 23

Shannon, 55shift operator, 10side lobes, 48sinc, 51spectral

analysis, 17, 21, 76density, 18, 21

estimation of-, 46resolution, 48window, 48

Spencer’s formula, 42stability

AR scheme, 13asymptotic, 13

standard biased covariance estimator, 45


standard deviation, 6, 8standard normal distribution, 31state model, 85

identification of-, 85stationary process, 7

wide sense, 8steepest descent method, 68step response analysis, 75stochastic process, 5, 7strictly stationary process, 7structure determination, 87system identification, 75

non-parametric, 75parametric, 75

testfor stochasticity, 41for trend, 41for whiteness, 56, 70

time axis, 7tr, 93trace, 93trend, 22, 41, 42

unbiased estimator, 34uncorrelated, 7

variancematrix, 7, 32

vector gain, 60

white noise, 9, 20wide-sense stationary process, 8window, 47

Bartlett’s, 48Daniell’s, 50Hamming’s, 50Hann, 48Hanning, 48rectangular, 48

windowing, 47

Yule-Walker equations, 13, 70Yule scheme, 26

z -transform, 20
