

Springer Texts in Statistics

Advisors: George Casella, Stephen Fienberg, Ingram Olkin


Springer Texts in Statistics

Alfred: Elements of Statistics for the Life and Social Sciences
Berger: An Introduction to Probability and Stochastic Processes
Bilodeau and Brenner: Theory of Multivariate Statistics
Blom: Probability and Statistics: Theory and Applications
Brockwell and Davis: Introduction to Time Series and Forecasting, Second Edition
Carmona: Statistical Analysis of Financial Data in S-Plus
Chow and Teicher: Probability Theory: Independence, Interchangeability, Martingales, Third Edition
Christensen: Advanced Linear Modeling: Multivariate, Time Series, and Spatial Data—Nonparametric Regression and Response Surface Maximization, Second Edition
Christensen: Log-Linear Models and Logistic Regression, Second Edition
Christensen: Plane Answers to Complex Questions: The Theory of Linear Models, Third Edition
Creighton: A First Course in Probability Models and Statistical Inference
Davis: Statistical Methods for the Analysis of Repeated Measurements
Dean and Voss: Design and Analysis of Experiments
du Toit, Steyn, and Stumpf: Graphical Exploratory Data Analysis
Durrett: Essentials of Stochastic Processes
Edwards: Introduction to Graphical Modelling, Second Edition
Finkelstein and Levin: Statistics for Lawyers
Flury: A First Course in Multivariate Statistics
Gut: Probability: A Graduate Course
Heiberger and Holland: Statistical Analysis and Data Display: An Intermediate Course with Examples in S-PLUS, R, and SAS
Jobson: Applied Multivariate Data Analysis, Volume I: Regression and Experimental Design
Jobson: Applied Multivariate Data Analysis, Volume II: Categorical and Multivariate Methods
Kalbfleisch: Probability and Statistical Inference, Volume I: Probability, Second Edition
Kalbfleisch: Probability and Statistical Inference, Volume II: Statistical Inference, Second Edition
Karr: Probability
Keyfitz: Applied Mathematical Demography, Second Edition
Kiefer: Introduction to Statistical Inference
Kokoska and Nevison: Statistical Tables and Formulae
Kulkarni: Modeling, Analysis, Design, and Control of Stochastic Systems
Lange: Applied Probability
Lange: Optimization
Lehmann: Elements of Large-Sample Theory

(continued after index)


Robert H. Shumway
David S. Stoffer

Time Series Analysis and Its Applications
With R Examples

Second Edition

With 160 Illustrations


Robert H. Shumway
Department of Statistics
University of California, Davis
Davis, CA 95616
USA

David S. Stoffer
Department of Statistics
University of Pittsburgh
Pittsburgh, PA 15260
USA

Editorial Board

George Casella
Department of Statistics
University of Florida
Gainesville, FL 32611-8545
USA

Stephen Fienberg
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213-3890
USA

Ingram Olkin
Department of Statistics
Stanford University
Stanford, CA 94305
USA

Library of Congress Control Number: 2005935284

ISBN-10: 0-387-29317-5
ISBN-13: 978-0387-29317-2

Printed on acid-free paper.

©2006 Springer Science+Business Media, LLC
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America. (MVY)

9 8 7 6 5 4 3 2 1

springer.com


To my wife, Ruth, for her support and joie de vivre, and to the memory of my thesis adviser, Solomon Kullback.

R.H.S.

To my family, who constantly remind me what is important.

D.S.S.


Preface to the Second Edition

The second edition marks a substantial change to the first edition. Perhaps the most significant change is the introduction of examples based on the freeware R package. The package, which runs on most operating systems, can be downloaded from The Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/ or any one of its mirrors. Readers who have experience with the S-PLUS® package will have no problem working with R. For novices, R installs some help manuals, and CRAN supplies links to contributed tutorials such as R for Beginners. In our examples, we assume the reader has downloaded and installed R and has downloaded the necessary data files. The data files can be downloaded from the website for the text, http://www.stat.pitt.edu/stoffer/tsa2/, or any one of its mirrors. We will also provide additional code and other information of interest on the text's website. Most of the material that would be given in an introductory course on time series analysis has associated R code. Although examples are given in R, the material is not R-dependent. In courses we have given using a preliminary version of the new edition of the text, students were allowed to use any package of preference. Although most students used R (or S-PLUS), a number of them completed the course successfully using other programs such as ASTSA, MATLAB®, SAS®, and SPSS®.

Another substantial change from the first edition is that the material has been divided into smaller chapters. The introductory material is now contained in the first two chapters. The first chapter discusses the characteristics of time series, introducing the fundamental concepts of time plot, models for dependent data, auto- and cross-correlation, and their estimation. The second chapter provides a background in regression techniques for time series data. This chapter also includes the related topics of smoothing and exploratory data analysis for preprocessing nonstationary series.

In the first edition, we covered ARIMA and other special topics in the time domain in one chapter. In this edition, univariate ARIMA modeling is presented in its own chapter, Chapter 3. The material on additional time domain topics has been expanded and moved to its own chapter, Chapter 5. The additional topics include long memory models, GARCH processes, threshold models, regression with autocorrelated errors, lagged regression, transfer function modeling, and multivariate ARMAX models. In this edition, we have removed the discussion on reduced rank models and contemporaneous models from the multivariate ARMAX section. The coverage of GARCH models has been considerably expanded in this edition. The coverage of long memory models has been consolidated, presenting time domain and frequency domain approaches in the same section. For this reason, the chapter is presented after the chapter on spectral analysis.

The chapter on spectral analysis and filtering, Chapter 4, has been expanded to include various types of spectral estimators. In particular, kernel based estimators and spectral window estimators have been included in the discussion. The chapter now includes a section on wavelets that was in another chapter in the first edition. The reader will also notice a change in notation from the previous edition.

In the first edition, topics were supplemented by theoretical sections at the end of the chapter. In this edition, we have put the theoretical topics in appendices at the end of the text. In particular, Appendix A can be used to supplement the material in the first chapter; it covers some fundamental topics in large sample theory for dependent data. The material in Appendix B includes theoretical material that expands the presentation of time domain topics, and this appendix may be used to supplement the coverage of the chapter on time series regression and the chapter on ARIMA models. Finally, Appendix C contains a theoretical basis for spectral analysis.

The remaining two chapters on state-space and dynamic linear models, Chapter 6, and on additional statistical methods in the frequency domain, Chapter 7, are comparable to their first edition counterparts. We do mention that the section on multivariate ARMAX, which used to be in the state-space chapter, has been moved to Chapter 5. We have also removed spectral domain canonical correlation analysis and the discussion on wavelets (now in Chapter 4) that were previously in Chapter 7. The material on stochastic volatility models, now in Chapter 6, has been expanded. R programs for some Chapter 6 examples are available on the website for the text; these programs include code for the Kalman filter and smoother, maximum likelihood estimation, the EM algorithm, and fitting stochastic volatility models.

In the previous edition, we set off important definitions by highlighting phrases corresponding to the definition. We believe this practice made it difficult for readers to find important information. In this edition, we have set off definitions as numbered definitions that are presented in italics with the concept being defined in bold letters.

We thank John Kimmel, Executive Editor, Statistics, for his guidance in the preparation and production of this edition of the text. We are particularly grateful to Don Percival and Mike Keim at the University of Washington, for numerous suggestions that led to substantial improvement to the presentation. We also thank the many students and other readers who took the time to mention typographical errors and other corrections to the first edition. In particular, we appreciate the efforts of Jeongeun Kim, Sangdae Han, and Mark Gamalo at the University of Pittsburgh, and Joshua Kerr and Bo Zhou at the University of California, for providing comments on portions of the draft of this edition. Finally, we acknowledge the support of the National Science Foundation.

Robert H. Shumway
Davis, CA

David S. Stoffer
Pittsburgh, PA

August 2005


Preface to the First Edition

The goals of this book are to develop an appreciation for the richness and versatility of modern time series analysis as a tool for analyzing data, and still maintain a commitment to theoretical integrity, as exemplified by the seminal works of Brillinger (1981) and Hannan (1970) and the texts by Brockwell and Davis (1991) and Fuller (1995). The advent of more powerful computing, especially in the last three years, has provided both real data and new software that can take one considerably beyond the fitting of simple time domain models, such as have been elegantly described in the landmark work of Box and Jenkins (see Box et al., 1994). This book is designed to be useful as a text for courses in time series on several different levels and as a reference work for practitioners facing the analysis of time-correlated data in the physical, biological, and social sciences.

We believe the book will be useful as a text at both the undergraduate and graduate levels. An undergraduate course can be accessible to students with a background in regression analysis and might include Sections 1.1-1.8, 2.1-2.9, and 3.1-3.8. Similar courses have been taught at the University of California (Berkeley and Davis) in the past using the earlier book on applied time series analysis by Shumway (1988). Such a course is taken by undergraduate students in mathematics, economics, and statistics and attracts graduate students from the agricultural, biological, and environmental sciences. At the master's degree level, it can be useful to students in mathematics, environmental science, economics, statistics, and engineering by adding Sections 1.9, 2.10-2.14, 3.9, 3.10, and 4.1-4.5 to those proposed above. Finally, a two-semester upper-level graduate course for mathematics, statistics, and engineering graduate students can be crafted by adding selected theoretical sections from the last sections of Chapters 1, 2, and 3 for mathematics and statistics students and some advanced applications from Chapters 4 and 5. For the upper-level graduate course, we should mention that we are striving for a less rigorous level of coverage than that which is attained by Brockwell and Davis (1991), the classic entry at this level.

A useful feature of the presentation is the inclusion of data illustrating the richness of potential applications to medicine and in the biological, physical, and social sciences. We include data analysis in both the text examples and in the problem sets. All data sets are posted on the World Wide Web at the following URLs: http://www.stat.ucdavis.edu/~shumway/tsa.html and http://www.stat.pitt.edu/~stoffer/tsa.html, making them easily accessible to students and general researchers. In addition, an exploratory data analysis program written by McQuarrie and Shumway (1994) can be downloaded (as Freeware) from these websites to provide easy access to all of the techniques required for courses through the master's level.

Advances in modern computing have made multivariate techniques in the time and frequency domain, anticipated by the theoretical developments in Brillinger (1981) and Hannan (1970), routinely accessible using higher level languages, such as MATLAB and S-PLUS. Extremely large data sets driven by periodic phenomena, such as the functional magnetic resonance imaging series or the earthquake and explosion data, can now be handled using extensions to time series of classical methods, like multivariate regression, analysis of variance, principal components, factor analysis, and discriminant or cluster analysis. Chapters 4 and 5 illustrate some of the immense potential that these methods have for analyzing high-dimensional data sets.

The many practical data sets are the results of collaborations with research workers in the medical, physical, and biological sciences. Some deserve special mention as a result of the pervasive use we have made of them in the text. The predominance of applications in seismology and geophysics is joint work of the first author with Dr. Robert R. Blandford of the Center for Monitoring Research and Dr. Zoltan Der of Ensco, Inc. We have also made extensive use of the El Niño and Recruitment series contributed by Dr. Roy Mendelssohn of the National Marine Fisheries Service. In addition, Professor Nancy Day of the University of Pittsburgh provided the data used in Chapter 4 in a longitudinal analysis of the effects of prenatal smoking on growth, as well as some of the categorical sleep-state data posted on the World Wide Web. A large magnetic imaging data set that was developed during joint research on pain perception with Dr. Elizabeth Disbrow of the University of San Francisco Medical Center forms the basis for illustrating a number of multivariate techniques in Chapter 5. We are especially indebted to Professor Allan D.R. McQuarrie of the University of North Dakota, who incorporated subroutines in Shumway (1988) into ASTSA for Windows.

Finally, we are grateful to John Kimmel, Executive Editor, Statistics, for his patience, enthusiasm, and encouragement in guiding the preparation and production of this book. Three anonymous reviewers made numerous helpful comments, and Dr. Rahman Azari and Dr. Mitchell Watnik of the University of California, Davis, Division of Statistics, read portions of the draft. Any remaining errors are solely our responsibility.

Robert H. Shumway
Davis, CA

David S. Stoffer
Pittsburgh, PA

August 1999

Page 10: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Contents

1 Characteristics of Time Series  1
   1.1 Introduction  1
   1.2 The Nature of Time Series Data  4
   1.3 Time Series Statistical Models  11
   1.4 Measures of Dependence: Autocorrelation and Cross-Correlation  18
   1.5 Stationary Time Series  23
   1.6 Estimation of Correlation  29
   1.7 Vector-Valued and Multidimensional Series  34
   Problems  40

2 Time Series Regression and Exploratory Data Analysis  48
   2.1 Introduction  48
   2.2 Classical Regression in the Time Series Context  49
   2.3 Exploratory Data Analysis  57
   2.4 Smoothing in the Time Series Context  71
   Problems  79

3 ARIMA Models  84
   3.1 Introduction  84
   3.2 Autoregressive Moving Average Models  85
   3.3 Difference Equations  98
   3.4 Autocorrelation and Partial Autocorrelation Functions  103
   3.5 Forecasting  110
   3.6 Estimation  122
   3.7 Integrated Models for Nonstationary Data  140
   3.8 Building ARIMA Models  143
   3.9 Multiplicative Seasonal ARIMA Models  154
   Problems  165

4 Spectral Analysis and Filtering  174
   4.1 Introduction  174
   4.2 Cyclical Behavior and Periodicity  176
   4.3 The Spectral Density  181
   4.4 Periodogram and Discrete Fourier Transform  187
   4.5 Nonparametric Spectral Estimation  197
   4.6 Multiple Series and Cross-Spectra  215
   4.7 Linear Filters  220
   4.8 Parametric Spectral Estimation  228
   4.9 Dynamic Fourier Analysis and Wavelets  232
   4.10 Lagged Regression Models  245
   4.11 Signal Extraction and Optimum Filtering  251
   4.12 Spectral Analysis of Multidimensional Series  256
   Problems  258

5 Additional Time Domain Topics  271
   5.1 Introduction  271
   5.2 Long Memory ARMA and Fractional Differencing  271
   5.3 GARCH Models  280
   5.4 Threshold Models  289
   5.5 Regression with Autocorrelated Errors  293
   5.6 Lagged Regression: Transfer Function Modeling  295
   5.7 Multivariate ARMAX Models  302
   Problems  320

6 State-Space Models  324
   6.1 Introduction  324
   6.2 Filtering, Smoothing, and Forecasting  330
   6.3 Maximum Likelihood Estimation  339
   6.4 Missing Data Modifications  348
   6.5 Structural Models: Signal Extraction and Forecasting  352
   6.6 ARMAX Models in State-Space Form  355
   6.7 Bootstrapping State-Space Models  357
   6.8 Dynamic Linear Models with Switching  362
   6.9 Nonlinear and Non-normal State-Space Models Using Monte Carlo Methods  376
   6.10 Stochastic Volatility  388
   6.11 State-Space and ARMAX Models for Longitudinal Data Analysis  394
   Problems  404

7 Statistical Methods in the Frequency Domain  412
   7.1 Introduction  412
   7.2 Spectral Matrices and Likelihood Functions  416
   7.3 Regression for Jointly Stationary Series  417
   7.4 Regression with Deterministic Inputs  426
   7.5 Random Coefficient Regression  434
   7.6 Analysis of Designed Experiments  438
   7.7 Discrimination and Cluster Analysis  449
   7.8 Principal Components and Factor Analysis  464
   7.9 The Spectral Envelope  479
   Problems  495

Appendix A: Large Sample Theory  501
   A.1 Convergence Modes  501
   A.2 Central Limit Theorems  509
   A.3 The Mean and Autocorrelation Functions  513

Appendix B: Time Domain Theory  522
   B.1 Hilbert Spaces and the Projection Theorem  522
   B.2 Causal Conditions for ARMA Models  526
   B.3 Large Sample Distribution of the AR(p) Conditional Least Squares Estimators  528
   B.4 The Wold Decomposition  532

Appendix C: Spectral Domain Theory  534
   C.1 Spectral Representation Theorem  534
   C.2 Large Sample Distribution of the DFT and Smoothed Periodogram  539
   C.3 The Complex Multivariate Normal Distribution  550

References  555

Index  569


Chapter 1

Characteristics of Time Series

1.1 Introduction

The analysis of experimental data that have been observed at different points in time leads to new and unique problems in statistical modeling and inference. The obvious correlation introduced by the sampling of adjacent points in time can severely restrict the applicability of the many conventional statistical methods traditionally dependent on the assumption that these adjacent observations are independent and identically distributed. The systematic approach by which one goes about answering the mathematical and statistical questions posed by these time correlations is commonly referred to as time series analysis.

The impact of time series analysis on scientific applications can be partially documented by producing an abbreviated listing of the diverse fields in which important time series problems may arise. For example, many familiar time series occur in the field of economics, where we are continually exposed to daily stock market quotations or monthly unemployment figures. Social scientists follow population series, such as birthrates or school enrollments. An epidemiologist might be interested in the number of influenza cases observed over some time period. In medicine, blood pressure measurements traced over time could be useful for evaluating drugs used in treating hypertension. Functional magnetic resonance imaging of brain-wave time series patterns might be used to study how the brain reacts to certain stimuli under various experimental conditions.

Many of the most intensive and sophisticated applications of time series methods have been to problems in the physical and environmental sciences. This fact accounts for the basic engineering flavor permeating the language of time series analysis. One of the earliest recorded series is the monthly sunspot numbers studied by Schuster (1906). More modern investigations may center on whether a warming is present in global temperature measurements or whether levels of pollution may influence daily mortality in Los Angeles. The modeling of speech series is an important problem related to the efficient transmission of voice recordings. Common features in a time series characteristic known as the power spectrum are used to help computers recognize and translate speech. Geophysical time series such as those produced by yearly depositions of various kinds can provide long-range proxies for temperature and rainfall. Seismic recordings can aid in mapping fault lines or in distinguishing between earthquakes and nuclear explosions.

The above series are only examples of experimental databases that can be used to illustrate the process by which classical statistical methodology can be applied in the correlated time series framework. In our view, the first step in any time series investigation always involves careful scrutiny of the recorded data plotted over time. This scrutiny often suggests the method of analysis as well as statistics that will be of use in summarizing the information in the data. Before looking more closely at the particular statistical methods, it is appropriate to mention that two separate, but not necessarily mutually exclusive, approaches to time series analysis exist, commonly identified as the time domain approach and the frequency domain approach.

The time domain approach is generally motivated by the presumption that correlation between adjacent points in time is best explained in terms of a dependence of the current value on past values. The time domain approach focuses on modeling some future value of a time series as a parametric function of the current and past values. In this scenario, we begin with linear regressions of the present value of a time series on its own past values and on the past values of other series. This modeling leads one to use the results of the time domain approach as a forecasting tool and is particularly popular with economists for this reason.

One approach, advocated in the landmark work of Box and Jenkins (1970; see also Box et al., 1994), develops a systematic class of models called autoregressive integrated moving average (ARIMA) models to handle time-correlated modeling and forecasting. The approach includes a provision for treating more than one input series through multivariate ARIMA or through transfer function modeling. The defining feature of these models is that they are multiplicative models, meaning that the observed data are assumed to result from products of factors involving differential or difference equation operators responding to a white noise input.

A more recent approach to the same problem uses additive models more familiar to statisticians. In this approach, the observed data are assumed to result from sums of series, each with a specified time series structure; for example, in economics, assume a series is generated as the sum of trend, a seasonal effect, and error. The state-space model that results is then treated by making judicious use of the celebrated Kalman filters and smoothers, developed originally for estimation and control in space applications. Two relatively complete presentations from this point of view are in Harvey (1991) and Kitagawa and Gersch (1996). Time series regression is introduced in Chapter 2, and ARIMA and related time domain models are studied in Chapter 3, with the emphasis on classical, statistical, univariate linear regression. Special topics on time domain analysis are covered in Chapter 5; these topics include modern treatments of, for example, time series with long memory and GARCH models for the analysis of volatility. The state-space model, Kalman filtering and smoothing, and related topics are developed in Chapter 6.

Conversely, the frequency domain approach assumes the primary characteristics of interest in time series analyses relate to periodic or systematic sinusoidal variations found naturally in most data. These periodic variations are often caused by biological, physical, or environmental phenomena of interest. A series of periodic shocks may influence certain areas of the brain; wind may affect vibrations on an airplane wing; sea surface temperatures caused by El Niño oscillations may affect the number of fish in the ocean. The study of periodicity extends to economics and social sciences, where one may be interested in yearly periodicities in such series as monthly unemployment or monthly birth rates.

In spectral analysis, the partition of the various kinds of periodic variation in a time series is accomplished by evaluating separately the variance associated with each periodicity of interest. This variance profile over frequency is called the power spectrum. In our view, no schism divides time domain and frequency domain methodology, although cliques are often formed that advocate primarily one or the other of the approaches to analyzing data. In many cases, the two approaches may produce similar answers for long series, but the comparative performance over short samples is better done in the time domain. In some cases, the frequency domain formulation simply provides a convenient means for carrying out what is conceptually a time domain calculation. Hopefully, this book will demonstrate that the best path to analyzing many data sets is to use the two approaches in a complementary fashion. Expositions emphasizing primarily the frequency domain approach can be found in Bloomfield (1976), Priestley (1981), or Jenkins and Watts (1968). On a more advanced level, Hannan (1970), Brillinger (1981), Brockwell and Davis (1991), and Fuller (1995) are available as theoretical sources. Our coverage of the frequency domain is given in Chapters 4 and 7.
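
As a small concrete illustration of the power spectrum idea (a sketch, not code from the text), the raw periodogram of a simulated noisy sinusoid concentrates the variance of the series near the driving frequency; spectral estimation itself is taken up in Chapter 4.

> set.seed(1)
> t = 1:500
> x = 2*cos(2*pi*t/50) + rnorm(500,0,1)  # one cycle every 50 points, plus white noise
> spec.pgram(x, taper=0, log="no")       # raw periodogram; a peak appears near frequency 1/50 = .02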

The objective of this book is to provide a unified and reasonably complete exposition of statistical methods used in time series analysis, giving serious consideration to both the time and frequency domain approaches. Because a myriad of possible methods for analyzing any particular experimental series can exist, we have integrated real data from a number of subject fields into the exposition and have suggested methods for analyzing these data.



Figure 1.1 Johnson & Johnson quarterly earnings per share, 84 quarters, 1960-I to 1980-IV.

1.2 The Nature of Time Series Data

Some of the problems and questions of interest to the prospective time series analyst can best be exposed by considering real experimental data taken from different subject areas. The following cases illustrate some of the common kinds of experimental time series data as well as some of the statistical questions that might be asked about such data.

Example 1.1 Johnson & Johnson Quarterly Earnings

Figure 1.1 shows quarterly earnings per share for the U.S. company Johnson & Johnson, furnished by Professor Paul Griffin (personal communication) of the Graduate School of Management, University of California, Davis. There are 84 quarters (21 years) measured from the first quarter of 1960 to the last quarter of 1980. Modeling such series begins by observing the primary patterns in the time history. In this case, note the gradually increasing underlying trend and the rather regular variation superimposed on the trend that seems to repeat over quarters. Methods for analyzing data such as these are explored in Chapter 2 (see Problem 2.1) using regression techniques, and in Chapter 6, §6.5, using structural equation modeling.

To plot the data using the R statistical package, suppose you saved the data as jj.dat in the directory mydata. Then use the following steps to read in the data and plot the time series (the > below are prompts, you do not type them):

> jj = scan("/mydata/jj.dat")   # yes forward slash
> jj = ts(jj, start=1960, frequency=4)
> plot(jj, ylab="Quarterly Earnings per Share")

You can replace scan with read.table in this example.

Figure 1.2 Yearly average global temperature deviations (1900–1997) in degrees centigrade.

Example 1.2 Global Warming

Consider a global temperature series record, discussed in Jones (1994) and Parker et al. (1994, 1995). The data in Figure 1.2 are a combination of land-air average temperature anomalies (from the 1961-1990 average), measured in degrees centigrade, for the years 1900-1997. We note an apparent upward trend in the series that has been used as an argument for the global warming hypothesis. Note also the leveling off at about 1935 and then another rather sharp upward trend at about 1970. The question of interest for global warming proponents and opponents is whether the overall trend is natural or whether it is caused by some human-induced interference. Problem 2.8 examines 634 years of glacial sediment data that might be taken as a long-term temperature proxy. Such percentage changes in temperature do not seem to be unusual over a time period of 100 years. Again, the question of trend is of more interest than particular periodicities.
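
The text gives no R code for this example; a minimal sketch in the style of Example 1.1, assuming the yearly deviations have been saved one value per line in a file named globtemp.dat (the file name and location are assumptions, not given in the text), would be:

> gtemp = scan("/mydata/globtemp.dat")        # file name and directory assumed
> gtemp = ts(gtemp, start=1900, frequency=1)  # one value per year, 1900-1997
> plot(gtemp, ylab="Global Temperature Deviations")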



Figure 1.3 Speech recording of the syllable aaa···hhh sampled at 10,000 points per second with n = 1020 points.

Example 1.3 Speech Data

More involved questions develop in applications to the physical sciences. Figure 1.3 shows a small .1 second (1000 point) sample of recorded speech for the phrase aaa···hhh, and we note the repetitive nature of the signal and the rather regular periodicities. One current problem of great interest is computer recognition of speech, which would require converting this particular signal into the recorded phrase aaa···hhh. Spectral analysis can be used in this context to produce a signature of this phrase that can be compared with signatures of various library syllables to look for a match. One can immediately notice the rather regular repetition of small wavelets. The separation between the packets is known as the pitch period and represents the response of the vocal tract filter to a periodic sequence of pulses stimulated by the opening and closing of the glottis.
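
No R code is given for this example in the text; a minimal sketch along the same lines, assuming the recording has been saved as speech.dat (an assumed file name), would be:

> speech = scan("/mydata/speech.dat")  # file name and directory assumed
> plot.ts(speech, ylab="speech")       # roughly .1 second of the syllable aaa...hhh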

Example 1.4 New York Stock Exchange

As an example of financial time series data, Figure 1.4 shows the daily returns (or percent change) of the New York Stock Exchange (NYSE) from February 2, 1984 to December 31, 1991. It is easy to spot the crash of October 19, 1987 in the figure. The data shown in Figure 1.4 are typical of return data. The mean of the series appears to be stable, with an average return of approximately zero; however, the volatility (or variability) of the data changes over time. In fact, the data show volatility clustering; that is, highly volatile periods tend to be clustered together. A problem in the analysis of these types of financial data is to forecast the volatility of future returns. Models such as ARCH and GARCH models (Engle, 1982; Bollerslev, 1986) and stochastic volatility models (Harvey, Ruiz and Shephard, 1994) have been developed to handle these problems. We will discuss these models and the analysis of financial data in Chapters 5 and 6.

Figure 1.4 Returns of the NYSE. The data are daily value weighted market returns from February 2, 1984 to December 31, 1991 (2000 trading days). The crash of October 19, 1987 occurs at t = 938.
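
The text does not provide code here; a minimal sketch for plotting the returns, assuming they are stored one value per line in a file named nyse.dat (an assumed name), is:

> nyse = scan("/mydata/nyse.dat")    # file name and directory assumed
> plot.ts(nyse, ylab="NYSE Returns")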

Example 1.5 El Niño and Fish Population

We may also be interested in analyzing several time series at once. Figure 1.5 shows monthly values of an environmental series called the Southern Oscillation Index (SOI) and associated Recruitment (number of new fish) furnished by Dr. Roy Mendelssohn of the Pacific Environmental Fisheries Group (personal communication). Both series are for a period of 453 months ranging over the years 1950-1987. The SOI measures changes in air pressure, related to sea surface temperatures in the central Pacific. The central Pacific Ocean warms every three to seven years due to the El Niño effect, which has been blamed, in particular, for the 1997 floods in the midwestern portions of the U.S. Both series in Figure 1.5 tend to exhibit repetitive behavior, with regularly repeating cycles that are easily visible. This periodic behavior is of interest because underlying processes of interest may be regular and the rate or frequency of oscillation characterizing the behavior of the underlying series would help to identify them. One can also remark that the cycles of the SOI are repeating at a faster rate than those of the Recruitment series. The Recruitment series also shows several kinds of oscillations, a faster frequency that seems to repeat about every 12 months and a slower frequency that seems to repeat about every 50 months. The study of the kinds of cycles and their strengths is the subject of Chapter 4. The two series also tend to be somewhat related; it is easy to imagine that somehow the fish population is dependent on the SOI. Perhaps even a lagged relation exists, with the SOI signaling changes in the fish population. This possibility suggests trying some version of regression analysis as a procedure for relating the two series. Transfer function modeling, as considered in Chapter 5, can be applied in this case to obtain a model relating Recruitment to its own past and the past values of the SOI.

Figure 1.5 Monthly SOI and Recruitment (estimated new fish), 1950-1987.
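
As with the previous examples, a minimal sketch for producing a display like Figure 1.5, assuming the two series are stored in files named soi.dat and recruit.dat (assumed names), is:

> soi = scan("/mydata/soi.dat")       # file names and directory assumed
> rec = scan("/mydata/recruit.dat")
> par(mfrow=c(2,1))                   # two panels, as in Figure 1.5
> plot(ts(soi, start=1950, frequency=12), ylab="SOI")
> plot(ts(rec, start=1950, frequency=12), ylab="Recruitment")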



Figure 1.6 fMRI data from various locations in the cortex, thalamus, and cerebellum; n = 128 points, one observation taken every 2 seconds.

Example 1.6 fMRI Imaging

A fundamental problem in classical statistics occurs when we are given a collection of independent series or vectors of series, generated under varying experimental conditions or treatment configurations. Such a set of series is shown in Figure 1.6, where we observe data collected from various locations in the brain via functional magnetic resonance imaging (fMRI). In this example, five subjects were given periodic brushing on the hand. The stimulus was applied for 32 seconds and then stopped for 32 seconds; thus, the signal period is 64 seconds. The sampling rate was one observation every 2 seconds for 256 seconds (n = 128). For this example, we averaged the results over subjects (these were evoked responses, and all subjects were in phase). The series shown in Figure 1.6 are consecutive measures of blood oxygenation-level dependent (bold) signal intensity, which measures areas of activation in the brain. Notice that the periodicities appear strongly in the motor cortex series and less strongly in the thalamus and cerebellum. The fact that one has series from different areas of the brain suggests testing whether the areas are responding differently to the brush stimulus. Analysis of variance techniques accomplish this in classical statistics, and we show in Chapter 7 how these classical techniques extend to the time series case, leading to a spectral analysis of variance.

The data are in a file called fmri.dat, which consists of nine columns; the first column represents time, whereas the second through ninth columns represent the bold signals at eight locations. Assuming the data are located in the directory mydata, use the following commands in R to plot the data as in this example.

> fmri = read.table("/mydata/fmri.dat")
> par(mfrow=c(2,1))  # sets up the graphics
> ts.plot(fmri[,2:5], lty=c(1,4), ylab="BOLD")
> ts.plot(fmri[,6:9], lty=c(1,4), ylab="BOLD")

Figure 1.7 Arrival phases from an earthquake (top) and explosion (bottom) at 40 points per second.

Example 1.7 Earthquakes and Explosions

As a final example, the series in Figure 1.7 represent two phases or arrivals along the surface, denoted by P (t = 1, . . . , 1024) and S (t = 1025, . . . , 2048), at a seismic recording station. The recording instruments in Scandinavia are observing earthquakes and mining explosions, with one of each shown in Figure 1.7. The general problem of interest is in distinguishing or discriminating between waveforms generated by earthquakes and those generated by explosions. Features that may be important are the rough amplitude ratios of the first phase P to the second phase S, which tend to be smaller for earthquakes than for explosions. In the case of the two events in Figure 1.7, the ratio of maximum amplitudes appears to be somewhat less than .5 for the earthquake and about 1 for the explosion. Otherwise, note that a subtle difference exists in the periodic nature of the S phase for the earthquake. We can again think about spectral analysis of variance for testing the equality of the periodic components of earthquakes and explosions. We would also like to be able to classify future P and S components from events of unknown origin, leading to the time series discriminant analysis developed in Chapter 7.

The data are in the file eq5exp6.dat as one column with 4096 entries; the first 2048 observations correspond to an earthquake and the next 2048 observations correspond to an explosion. To read and plot the data as in this example, use the following commands in R:

> x = matrix(scan("/mydata/eq5exp6.dat"), ncol=2)
> par(mfrow=c(2,1))
> plot.ts(x[,1], main="Earthquake", ylab="EQ5")
> plot.ts(x[,2], main="Explosion", ylab="EXP6")

1.3 Time Series Statistical Models

The primary objective of time series analysis is to develop mathematical models that provide plausible descriptions for sample data, like that encountered in the previous section. In order to provide a statistical setting for describing the character of data that seemingly fluctuate in a random fashion over time, we assume a time series can be defined as a collection of random variables indexed according to the order they are obtained in time. For example, we may consider a time series as a sequence of random variables, x1, x2, x3, . . . , where the random variable x1 denotes the value taken by the series at the first time point, the variable x2 denotes the value for the second time period, x3 denotes the value for the third time period, and so on. In general, a collection of random variables, {xt}, indexed by t is referred to as a stochastic process. In this text, t will typically be discrete and vary over the integers t = 0, ±1, ±2, . . ., or some subset of the integers. The observed values of a stochastic process are referred to as a realization of the stochastic process. Because it will be clear from the context of our discussions, we use the term time series whether we are referring generically to the process or to a particular realization and make no notational distinction between the two concepts.



It is conventional to display a sample time series graphically by plotting the values of the random variables on the vertical axis, or ordinate, with the time scale as the abscissa. It is usually convenient to connect the values at adjacent time periods to reconstruct visually some original hypothetical continuous time series that might have produced these values as a discrete sample. Many of the series discussed in the previous section, for example, could have been observed at any continuous point in time and are conceptually more properly treated as continuous time series. The approximation of these series by discrete time parameter series sampled at equally spaced points in time is simply an acknowledgment that sampled data will, for the most part, be discrete because of restrictions inherent in the method of collection. Furthermore, the analysis techniques are then feasible using computers, which are limited to digital computations. Theoretical developments also rest on the idea that a continuous parameter time series should be specified in terms of finite-dimensional distribution functions defined over a finite number of points in time. This is not to say that the selection of the sampling interval or rate is not an extremely important consideration. The appearance of data can be changed completely by adopting an insufficient sampling rate. We have all seen wagon wheels in movies appear to be turning backwards because of the insufficient number of frames sampled by the camera. This phenomenon leads to a distortion called aliasing.
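
A small R sketch (not from the text) makes the aliasing distortion concrete: a cosine making .8 cycles per unit of time, observed only at the integers, produces exactly the same sample values as a cosine making .2 cycles per unit of time.

> t = seq(0, 24, by=.01)             # fine grid standing in for continuous time
> plot(t, cos(2*pi*.8*t), type="l")  # the true, rapidly oscillating signal
> s = 0:24                           # sample only once per unit of time
> points(s, cos(2*pi*.8*s), pch=19)  # the sampled values
> lines(s, cos(2*pi*.2*s), lty=2)    # a slower cosine passing through the same samples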

The fundamental visual characteristic distinguishing the different series shown in Examples 1.1–1.7 is their differing degrees of smoothness. One possible explanation for this smoothness is that it is being induced by the supposition that adjacent points in time are correlated, so the value of the series at time t, say, xt, depends in some way on the past values xt−1, xt−2, . . .. This model expresses a fundamental way in which we might think about generating realistic-looking time series. To begin to develop an approach to using collections of random variables to model time series, consider Example 1.8.

Example 1.8 White Noise

A simple kind of generated series might be a collection of uncorrelated random variables, wt, with mean 0 and finite variance σ_w^2. The time series generated from uncorrelated variables is used as a model for noise in engineering applications, where it is called white noise; we shall sometimes denote this process as wt ∼ wn(0, σ_w^2). The designation white originates from the analogy with white light and indicates that all possible periodic oscillations are present with equal strength.

We will, at times, also require the noise to be iid random variables with mean 0 and variance σ_w^2. We shall distinguish this case by saying white independent noise, or by writing wt ∼ iid(0, σ_w^2). A particularly useful white noise series is Gaussian white noise, wherein the wt are independent normal random variables, with mean 0 and variance σ_w^2; or more succinctly, wt ∼ iid N(0, σ_w^2). Figure 1.8 shows in the upper panel a collection of 500 such random variables, with σ_w^2 = 1, plotted in the order in which they were drawn. The resulting series bears a slight resemblance to the explosion in Figure 1.7 but is not smooth enough to serve as a plausible model for any of the other experimental series. The plot tends to show visually a mixture of many different kinds of oscillations in the white noise series.

Figure 1.8 Gaussian white noise series (top) and three-point moving average of the Gaussian white noise series (bottom).

If the stochastic behavior of all time series could be explained in terms of the white noise model, classical statistical methods would suffice. Two ways of introducing serial correlation and more smoothness into time series models are given in Examples 1.9 and 1.10.

Example 1.9 Moving Averages

We might replace the white noise series wt by a moving average that smoothes the series. For example, consider replacing wt in Example 1.8 by an average of its current value and its immediate neighbors in the past and future. That is, let

vt = (1/3)(wt−1 + wt + wt+1),    (1.1)

which leads to the series shown in the lower panel of Figure 1.8. Inspecting the series shows a smoother version of the first series, reflecting the fact that the slower oscillations are more apparent and some of the faster oscillations are taken out. We begin to notice a similarity to the SOI in Figure 1.5, or perhaps, to some of the fMRI series in Figure 1.6.

To reproduce Figure 1.8 in R, use the following commands (a linear combination of values in a time series, such as in (1.1), is referred to, generically, as a filtered series; hence the command filter):

> w = rnorm(500,0,1)                 # 500 N(0,1) variates
> v = filter(w, sides=2, rep(1,3)/3) # moving average
> par(mfrow=c(2,1))
> plot.ts(w)
> plot.ts(v)

The speech series in Figure 1.3 and the Recruitment series in Figure 1.5, as well as some of the fMRI series in Figure 1.6, differ from the moving average series because one particular kind of oscillatory behavior seems to predominate, producing a sinusoidal type of behavior. A number of methods exist for generating series with this quasi-periodic behavior; we illustrate a popular one based on the autoregressive model considered in Chapter 3.

Example 1.10 Autoregressions

Suppose we consider the white noise series wt of Example 1.8 as input and calculate the output using the second-order equation

xt = xt−1 − .90xt−2 + wt (1.2)

successively for t = 1, 2, . . . , 500. Equation (1.2) represents a regression or prediction of the current value xt of a time series as a function of the past two values of the series, and, hence, the term autoregression is suggested for this model. A problem with startup values exists here because (1.2) also depends on the initial conditions x0 and x−1, but, for now, we assume that we are given these values and generate the succeeding values by substituting into (1.2). The resulting output series is shown in Figure 1.9, and we note the periodic behavior of the series, which is similar to that displayed by the speech series in Figure 1.3. The autoregressive model above and its generalizations can be used as an underlying model for many observed series and will be studied in detail in Chapter 3.

Figure 1.9 Autoregressive series generated from model (1.2).

One way to simulate and plot data from the model (1.2) in R is to use the following commands (another way is to use arima.sim):

> w = rnorm(550,0,1)   # 50 extra to avoid startup problems
> x = filter(w, filter=c(1,-.9), method="recursive")
> plot.ts(x[51:550])

Example 1.11 Random Walk

A model for analyzing trend is the random walk with drift model given by

xt = δ + xt−1 + wt (1.3)

for t = 1, 2, . . ., with initial condition x0 = 0, and where wt is white noise. The constant δ is called the drift, and when δ = 0, (1.3) is called simply a random walk. The term random walk comes from the fact that, when δ = 0, the value of the time series at time t is the value of the series at time t − 1 plus a completely random movement determined by wt. Note that we may rewrite (1.3) as a cumulative sum of white noise variates. That is,

xt = δt + ∑_{j=1}^{t} wj    (1.4)

for t = 1, 2, . . .; either use induction, or plug (1.4) into (1.3) to verify this statement. Figure 1.10 shows 200 observations generated from the model with δ = 0 and .2, and with σ_w = 1. For comparison, we also superimposed the straight line .2t on the graph.

Figure 1.10 Random walk, σ_w = 1, with drift δ = .2 (upper jagged line), without drift, δ = 0 (lower jagged line), and a straight line with slope .2 (dashed line).

To reproduce Figure 1.10 in R:

> set.seed(154)
> w = rnorm(200,0,1); x = cumsum(w)
> wd = w + .2; xd = cumsum(wd)
> plot.ts(xd, ylim=c(-5,55))
> lines(x)
> lines(.2*(1:200), lty="dashed")

Example 1.12 Signal in Noise

Many realistic models for generating time series assume an underlying signal with some consistent periodic variation, contaminated by adding random noise. For example, it is easy to detect the regular cycle in the fMRI series displayed at the top of Figure 1.6. Consider the model

xt = 2 cos(2πt/50 + .6π) + wt (1.5)

for t = 1, 2, . . . , 500, where the first term is regarded as the signal, shown in the upper panel of Figure 1.11. We note that a sinusoidal waveform can be written as

A cos(2πωt + φ), (1.6)

where A is the amplitude, ω is the frequency of oscillation, and φ is a phase shift. In (1.5), A = 2, ω = 1/50 (one cycle every 50 time points), and φ = .6π.

An additive noise term was taken to be white noise with σw = 1 (middlepanel) and σw = 5 (bottom panel), drawn from a normal distribution.Adding the two together obscures the signal, as shown in the lower panelsof Figure 1.11. Of course, the degree to which the signal is obscureddepends on the amplitude of the signal and the size of σw. The ratioof the amplitude of the signal to σw (or some function of the ratio) issometimes called the signal-to-noise ratio (SNR); the larger the SNR, theeasier it is to detect the signal. Note that the signal is easily discernible



Figure 1.11 Cosine wave with period 50 points (top panel) compared with the cosine wave contaminated with additive white Gaussian noise, \sigma_w = 1 (middle panel) and \sigma_w = 5 (bottom panel); see (1.5).

in the middle panel of Figure 1.11, whereas the signal is obscured in the bottom panel. Typically, we will not observe the signal, but the signal obscured by noise.

To reproduce Figure 1.11 in R, use the following commands:

> t = 1:500
> c = 2*cos(2*pi*t/50 + .6*pi)
> w = rnorm(500,0,1)
> par(mfrow=c(3,1))
> plot.ts(c)
> plot.ts(c + w)
> plot.ts(c + 5*w)

In Chapter 4, we will study the use of spectral analysis as a possible technique for detecting regular or periodic signals, such as the one described in Example 1.12. In general, we would emphasize the importance of simple additive models such as given above in the form

x_t = s_t + v_t,    (1.7)

where s_t denotes some unknown signal and v_t denotes a time series that may be white or correlated over time. The problems of detecting a signal and then


in estimating or extracting the waveform of s_t are of great interest in many areas of engineering and the physical and biological sciences. In economics, the underlying signal may be a trend or it may be a seasonal component of a series. Models such as (1.7), where the signal has an autoregressive structure, form the motivation for the state-space model of Chapter 6.

In the above examples, we have tried to motivate the use of various combinations of random variables emulating real time series data. Smoothness characteristics of observed time series were introduced by combining the random variables in various ways. Averaging independent random variables over adjacent time points, as in Example 1.9, or looking at the output of difference equations that respond to white noise inputs, as in Example 1.10, are common ways of generating correlated data. In the next section, we introduce various theoretical measures used for describing how time series behave. As is usual in statistics, the complete description involves the multivariate distribution function of the jointly sampled values x_1, x_2, \ldots, x_n, whereas more economical descriptions can be had in terms of the mean and autocorrelation functions. Because correlation is an essential feature of time series analysis, the most useful descriptive measures are those expressed in terms of covariance and correlation functions.

1.4 Measures of Dependence: Autocorrelation and Cross-Correlation

A complete description of a time series, observed as a collection of n random variables at arbitrary integer time points t_1, t_2, \ldots, t_n, for any positive integer n, is provided by the joint distribution function, evaluated as the probability that the values of the series are jointly less than the n constants, c_1, c_2, \ldots, c_n, i.e.,

F(c_1, c_2, \ldots, c_n) = P(x_{t_1} \le c_1, x_{t_2} \le c_2, \ldots, x_{t_n} \le c_n).    (1.8)

Unfortunately, the multidimensional distribution function cannot usually be written easily unless the random variables are jointly normal, in which case, expression (1.8) comes from the usual multivariate normal distribution (see Anderson, 1984, or Johnson and Wichern, 1992). A particular case in which the multidimensional distribution function is easy would be for independent and identically distributed standard normal random variables, for which the joint distribution function can be expressed as the product of the marginals, say,

F(c_1, c_2, \ldots, c_n) = \prod_{t=1}^{n} \Phi(c_t),    (1.9)

where

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left\{-\frac{z^2}{2}\right\} dz    (1.10)


is the cumulative distribution function of the standard normal.

Although the multidimensional distribution function describes the data completely, it is an unwieldy tool for displaying and analyzing time series data. The distribution function (1.8) must be evaluated as a function of n arguments, so any plotting of the corresponding multivariate density functions is virtually impossible. The one-dimensional distribution functions

Ft(x) = P{xt ≤ x}

or the corresponding one-dimensional density functions

f_t(x) = \frac{\partial F_t(x)}{\partial x},

when they exist, are often informative for determining whether a particular coordinate of the time series has a well-known density function, like the normal (Gaussian) distribution.

Definition 1.1 The mean function is defined as

\mu_{xt} = E(x_t) = \int_{-\infty}^{\infty} x f_t(x)\, dx,    (1.11)

provided it exists, where E denotes the usual expected value operator. When no confusion exists about which time series we are referring to, we will drop a subscript and write \mu_{xt} as \mu_t.

The important thing to realize about \mu_t is that it is a theoretical mean for the series at one particular time point, where the mean is taken over all possible events that could have produced x_t.

Example 1.13 Mean Function of a Moving Average Series

If w_t denotes a white noise series, then \mu_{wt} = E(w_t) = 0 for all t. The top series in Figure 1.8 reflects this, as the series clearly fluctuates around a mean value of zero. Smoothing the series as in Example 1.9 does not change the mean because we can write

\mu_{vt} = E(v_t) = \frac{1}{3}[E(w_{t-1}) + E(w_t) + E(w_{t+1})] = 0.

Example 1.14 Mean Function of a Random Walk with Drift

Consider the random walk with drift model given in (1.4),

x_t = \delta t + \sum_{j=1}^{t} w_j, \quad t = 1, 2, \ldots.


As in the previous example, because E(w_t) = 0 for all t, and \delta is a constant, we have

\mu_{xt} = E(x_t) = \delta t + \sum_{j=1}^{t} E(w_j) = \delta t,

which is a straight line with slope \delta. A realization of a random walk with drift can be compared to its mean function in Figure 1.10.
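As a quick numerical check (a sketch, not part of the text), one might simulate many independent realizations of the drift model with \delta = .2 and \sigma_w = 1, as in Figure 1.10, and average them; the average should track the line \delta t.

> set.seed(1)                          # assumed seed, for reproducibility
> nsim = 1000; n = 200; delta = .2     # number of realizations, series length, drift
> W = matrix(rnorm(nsim*n), nsim, n)   # white noise, one realization per row
> X = t(apply(delta + W, 1, cumsum))   # each row is now a random walk with drift
> plot.ts(colMeans(X))                 # average over realizations at each time t
> lines(delta*(1:n), lty="dashed")     # the theoretical mean function, delta*t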

Example 1.15 Mean Function of Signal Plus Noise

A great many practical applications depend on assuming the observed data have been generated by a fixed signal waveform superimposed on a zero-mean noise process, leading to an additive signal model of the form (1.5). It is clear, because the signal in (1.5) is a fixed function of time, we will have

\mu_{xt} = E(x_t) = E[2\cos(2\pi t/50 + .6\pi) + w_t]
         = 2\cos(2\pi t/50 + .6\pi) + E(w_t)
         = 2\cos(2\pi t/50 + .6\pi),

and the mean function is just the cosine wave.

The lack of independence between two adjacent values x_s and x_t can be assessed numerically, as in classical statistics, using the notions of covariance and correlation. Assuming the variance of x_t is finite, we have the following definition.

Definition 1.2 The autocovariance function is defined as the second moment product

\gamma_x(s, t) = E[(x_s - \mu_s)(x_t - \mu_t)],    (1.12)

for all s and t. When no possible confusion exists about which time series we are referring to, we will drop the subscript and write \gamma_x(s, t) as \gamma(s, t).

Note that \gamma_x(s, t) = \gamma_x(t, s) for all time points s and t. The autocovariance measures the linear dependence between two points on the same series observed at different times. Very smooth series exhibit autocovariance functions that stay large even when the t and s are far apart, whereas choppy series tend to have autocovariance functions that are nearly zero for large separations. The autocovariance (1.12) is the average cross-product relative to the joint distribution F(x_s, x_t). Recall from classical statistics that if \gamma_x(s, t) = 0, x_s and x_t are not linearly related, but there still may be some dependence structure between them. If, however, x_s and x_t are bivariate normal, \gamma_x(s, t) = 0 ensures their independence. It is clear that, for s = t, the autocovariance reduces to the (assumed finite) variance, because

\gamma_x(t, t) = E[(x_t - \mu_t)^2].    (1.13)


Example 1.16 Autocovariance of White Noise

The white noise series w_t, shown in the top panel of Figure 1.8, has E(w_t) = 0 and

\gamma_w(s, t) = E(w_s w_t) = \begin{cases} \sigma_w^2, & s = t \\ 0, & s \ne t \end{cases}

where, in this example, \sigma_w^2 = 1. Noting that w_s and w_t are uncorrelated for s \ne t, we would have E(w_s w_t) = E(w_s)E(w_t) = 0 because the mean values of the white noise variates are zero.

Example 1.17 Autocovariance of a Moving Average

Consider applying a three-point moving average to the white noise series w_t of the previous example, as in Example 1.9 (\sigma_w^2 = 1). Because v_t in (1.1) has mean zero, we have

\gamma_v(s, t) = E[(v_s - 0)(v_t - 0)]
              = \frac{1}{9} E[(w_{s-1} + w_s + w_{s+1})(w_{t-1} + w_t + w_{t+1})].

It is convenient to calculate it as a function of the separation, s - t = h, say, for h = 0, \pm 1, \pm 2, \ldots. For example, with h = 0,

\gamma_v(t, t) = \frac{1}{9} E[(w_{t-1} + w_t + w_{t+1})(w_{t-1} + w_t + w_{t+1})]
              = \frac{1}{9}[E(w_{t-1}w_{t-1}) + E(w_t w_t) + E(w_{t+1}w_{t+1})]
              = \frac{3}{9}.

When h = 1,

\gamma_v(t+1, t) = \frac{1}{9} E[(w_t + w_{t+1} + w_{t+2})(w_{t-1} + w_t + w_{t+1})]
                = \frac{1}{9}[E(w_t w_t) + E(w_{t+1}w_{t+1})]
                = \frac{2}{9},

using the fact that we may drop terms with unequal subscripts. Similar computations give \gamma_v(t-1, t) = 2/9, \gamma_v(t+2, t) = \gamma_v(t-2, t) = 1/9, and 0 for larger separations. We summarize the values for all s and t as

\gamma_v(s, t) = \begin{cases} 3/9, & s = t \\ 2/9, & |s - t| = 1 \\ 1/9, & |s - t| = 2 \\ 0, & |s - t| \ge 3. \end{cases}    (1.14)
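These values are easy to check by simulation (a sketch, not part of the text): generate a long white noise series, apply the three-point moving average as in Example 1.9, and compare the sample autocovariances with 3/9, 2/9, 1/9, and 0.

> set.seed(90210)                                    # assumed seed
> w = rnorm(10000)                                   # long white noise series, sigma_w = 1
> v = filter(w, sides=2, rep(1/3,3))                 # three-point moving average, as in Example 1.9
> acf(na.omit(v), 5, type="covariance", plot=FALSE)  # compare with 3/9, 2/9, 1/9, 0, ...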


Example 1.17 shows clearly that the smoothing operation introduces a covariance function that decreases as the separation between the two time points increases and disappears completely when the time points are separated by three or more time points. This particular autocovariance is interesting because it only depends on the time separation or lag and not on the absolute location of the points along the series. We shall see later that this dependence suggests a mathematical model for the concept of weak stationarity.

Example 1.18 Autocovariance of a Random Walk

For the random walk model, x_t = \sum_{j=1}^{t} w_j, we have

\gamma_x(s, t) = \mathrm{cov}(x_s, x_t) = \mathrm{cov}\left( \sum_{j=1}^{s} w_j, \; \sum_{k=1}^{t} w_k \right) = \min\{s, t\}\,\sigma_w^2,

because the w_t are uncorrelated random variables. Note that, as opposed to the previous examples, the autocovariance function of a random walk depends on the particular time values s and t, and not on the time separation or lag. Also, notice that the variance of the random walk, var(x_t) = \gamma_x(t, t) = t\,\sigma_w^2, increases without bound as time t increases. The effect of this variance increase can be seen in Figure 1.10 as the processes start to move away from their mean functions \delta t (note, \delta = 0 and .2 in that example).
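A one-line simulation (a sketch, not from the text) shows the growth in variance; the sample variance of x_{200} across many independent random walks should be near 200\,\sigma_w^2.

> set.seed(2)                                # assumed seed
> x200 = replicate(2000, sum(rnorm(200)))    # x_200 from 2000 independent random walks
> var(x200)                                  # should be near 200*sigma_w^2 = 200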

As in classical statistics, it is more convenient to deal with a measure of association between −1 and 1, and this leads to the following definition.

Definition 1.3 The autocorrelation function (ACF) is defined as

\rho(s, t) = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\gamma(t, t)}}.    (1.15)

The ACF measures the linear predictability of the series at time t, say, x_t, using only the value x_s. We can show easily that -1 \le \rho(s, t) \le 1 using the Cauchy–Schwarz inequality.2 If we can predict x_t perfectly from x_s through a linear relationship, x_t = \beta_0 + \beta_1 x_s, then the correlation will be 1 when \beta_1 > 0, and −1 when \beta_1 < 0. Hence, we have a rough measure of the ability to forecast the series at time t from the value at time s.

Often, we would like to measure the predictability of another series y_t from the series x_s. Assuming both series have finite variances, we have

Definition 1.4 The cross-covariance function between two series x_t and y_t is

\gamma_{xy}(s, t) = E[(x_s - \mu_{xs})(y_t - \mu_{yt})].    (1.16)

2 Note, the Cauchy–Schwarz inequality implies |\gamma(s, t)|^2 \le \gamma(s, s)\gamma(t, t).


The scaled version of the cross-covariance function is called

Definition 1.5 The cross-correlation function (CCF)

\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\gamma_y(t, t)}}.    (1.17)

We may easily extend the above ideas to the case of more than two series, say, x_{t1}, x_{t2}, \ldots, x_{tr}; that is, multivariate time series with r components. For example, the extension of (1.12) in this case is

\gamma_{jk}(s, t) = E[(x_{sj} - \mu_{sj})(x_{tk} - \mu_{tk})], \quad j, k = 1, 2, \ldots, r.    (1.18)

In the definitions above, the autocovariance and cross-covariance functions may change as one moves along the series because the values depend on both s and t, the locations of the points in time. In Example 1.17, the autocovariance function depends on the separation of x_s and x_t, say, h = |s - t|, and not on where the points are located in time. As long as the points are separated by h units, the location of the two points does not matter. This notion, called weak stationarity, when the mean is constant, is fundamental in allowing us to analyze sample time series data when only a single series is available.

1.5 Stationary Time Series

The preceding definitions of the mean and autocovariance functions are completely general. Although we have not made any special assumptions about the behavior of the time series, many of the preceding examples have hinted that a sort of regularity may exist over time in the behavior of a time series. We introduce the notion of regularity using a concept called stationarity.

Definition 1.6 A strictly stationary time series is one for which the probabilistic behavior of every collection of values

\{x_{t_1}, x_{t_2}, \ldots, x_{t_k}\}

is identical to that of the time shifted set

\{x_{t_1+h}, x_{t_2+h}, \ldots, x_{t_k+h}\}.

That is,

P\{x_{t_1} \le c_1, \ldots, x_{t_k} \le c_k\} = P\{x_{t_1+h} \le c_1, \ldots, x_{t_k+h} \le c_k\}    (1.19)

for all k = 1, 2, \ldots, all time points t_1, t_2, \ldots, t_k, all numbers c_1, c_2, \ldots, c_k, and all time shifts h = 0, \pm 1, \pm 2, \ldots.


If a time series is strictly stationary, then all of the multivariate distribution functions for subsets of variables must agree with their counterparts in the shifted set for all values of the shift parameter h. For example, when k = 1, (1.19) implies that

P{xs ≤ c} = P{xt ≤ c} (1.20)

for any time points s and t. This statement implies, e.g., that the probability that the value of a time series sampled hourly is negative at 1 AM is the same as at 10 AM. In addition, if the mean function, \mu_t, of the series x_t exists, (1.20) implies that \mu_s = \mu_t for all s and t, and hence \mu_t must be constant. Note, e.g., that a random walk process with drift is not strictly stationary because its mean function changes with time (see Example 1.14).

When k = 2, we can write (1.19) as

P{xs ≤ c1, xt ≤ c2} = P{xs+h ≤ c1, xt+h ≤ c2} (1.21)

for any time points s and t and shift h. Thus, if the variance function of the process exists, (1.21) implies that the autocovariance function of the series x_t satisfies

\gamma(s, t) = \gamma(s + h, t + h)

for all s and t and h. We may interpret this result by saying the autocovariance function of the process depends only on the time difference between s and t, and not on the actual times.

The version of stationarity in (1.19) is too strong for most applications. Moreover, it is difficult to assess strict stationarity from a single data set. Rather than impose conditions on all possible distributions of a time series, we will use a milder version that imposes conditions only on the first two moments of the series. We now have the following definition.

Definition 1.7 A weakly stationary time series, x_t, is a finite variance process such that

(i) the mean value function, \mu_t, defined in (1.11) is constant and does not depend on time t, and

(ii) the covariance function, \gamma(s, t), defined in (1.12) depends on s and t only through their difference |s - t|.

Henceforth, we will use the term stationary to mean weak stationarity; if a process is stationary in the strict sense, we will use the term strictly stationary.

It should be clear from the discussion of strict stationarity following Definition 1.6 that a strictly stationary, finite variance, time series is also stationary. The converse is not true unless there are further conditions. One important case where stationarity implies strict stationarity is if the time series is Gaussian [meaning all finite distributions, (1.19), of the series are Gaussian]. We will make this concept more precise at the end of this section.


Because the mean function, E(x_t) = \mu_t, of a stationary time series is independent of time t, we will write

\mu_t = \mu.    (1.22)

Also, because the covariance function of a stationary time series, \gamma(s, t), depends on s and t only through their difference |s - t|, we may simplify the notation. Let s = t + h, where h represents the time shift or lag, then

\gamma(t + h, t) = E[(x_{t+h} - \mu)(x_t - \mu)]
                = E[(x_h - \mu)(x_0 - \mu)]    (1.23)
                = \gamma(h, 0)

does not depend on the time argument t; we have assumed that var(x_t) = \gamma(0, 0) < \infty. Henceforth, for convenience, we will drop the second argument of \gamma(h, 0).

Definition 1.8 The autocovariance function of a stationary time series will be written as

\gamma(h) = E[(x_{t+h} - \mu)(x_t - \mu)].    (1.24)

Definition 1.9 The autocorrelation function (ACF) of a stationary time series will be written using (1.15) as

\rho(h) = \frac{\gamma(t + h, t)}{\sqrt{\gamma(t + h, t + h)\gamma(t, t)}} = \frac{\gamma(h)}{\gamma(0)}.    (1.25)

The Cauchy–Schwarz inequality shows again that -1 \le \rho(h) \le 1 for all h, enabling one to assess the relative importance of a given autocorrelation value by comparing with the extreme values −1 and 1.

Example 1.19 Stationarity of White Noise

The autocovariance function of the white noise series of Examples 1.8 and 1.16 is easily evaluated as

\gamma_w(h) = E(w_{t+h} w_t) = \begin{cases} \sigma_w^2, & h = 0 \\ 0, & h \ne 0, \end{cases}

where, in these examples, \sigma_w^2 = 1. This means that the series is weakly stationary or stationary. If the white noise variates are also normally distributed or Gaussian, the series is also strictly stationary, as can be seen by evaluating (1.19) using the relationship (1.9).


Figure 1.12 Autocovariance function of a three-point moving average.

Example 1.20 Stationarity of a Moving Average

The three-point moving average process used in Examples 1.9 and 1.17 is stationary because we may write the autocovariance function obtained in (1.14) as

\gamma_v(h) = \begin{cases} 3/9, & h = 0 \\ 2/9, & h = \pm 1 \\ 1/9, & h = \pm 2 \\ 0, & |h| \ge 3. \end{cases}

Figure 1.12 shows a plot of the autocovariance as a function of lag h. Interestingly, the autocovariance is symmetric and decays as a function of lag.
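A plot like Figure 1.12 might be produced directly from the values above (a sketch, not the authors' code):

> ACOV = c(1/9, 2/9, 3/9, 2/9, 1/9)                                # gamma_v(h) for h = -2, -1, 0, 1, 2
> plot(-2:2, ACOV, type="h", xlab="lag", ylab="autocovariance")    # zero for |h| >= 3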

The autocovariance function of a stationary process has several useful properties. First, the value at h = 0, namely

\gamma(0) = E[(x_t - \mu)^2]    (1.26)

is the variance of the time series; note that the Cauchy–Schwarz inequality implies

|γ(h)| ≤ γ(0).

A final useful property, noted in the previous example, is that the autocovariance function of a stationary series is symmetric around the origin, that is,

γ(h) = γ(−h) (1.27)

for all h. This property follows because shifting the series by h means that

\gamma(h) = \gamma(t + h - t)
          = E[(x_{t+h} - \mu)(x_t - \mu)]
          = E[(x_t - \mu)(x_{t+h} - \mu)]
          = \gamma(t - (t + h))
          = \gamma(-h),


which shows how to use the notation as well as proving the result.

When several series are available, a notion of stationarity still applies with additional conditions.

Definition 1.10 Two time series, say, x_t and y_t, are said to be jointly stationary if they are each stationary, and the cross-covariance function

\gamma_{xy}(h) = E[(x_{t+h} - \mu_x)(y_t - \mu_y)]    (1.28)

is a function only of lag h.

Definition 1.11 The cross-correlation function (CCF) of jointly stationary time series x_t and y_t is defined as

\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\gamma_y(0)}}.    (1.29)

Again, we have the result -1 \le \rho_{xy}(h) \le 1 which enables comparison with the extreme values −1 and 1 when looking at the relation between x_{t+h} and y_t. The cross-correlation function satisfies

ρxy(h) = ρyx(−h), (1.30)

which can be shown by manipulations similar to those used to show (1.27).

Example 1.21 Joint Stationarity

Consider the two series, x_t and y_t, formed from the sum and difference of two successive values of a white noise process, say,

x_t = w_t + w_{t-1}

and

y_t = w_t - w_{t-1},

where w_t are independent random variables with zero means and variance \sigma_w^2. It is easy to show that \gamma_x(0) = \gamma_y(0) = 2\sigma_w^2 and \gamma_x(1) = \gamma_x(-1) = \sigma_w^2, \gamma_y(1) = \gamma_y(-1) = -\sigma_w^2. Also,

\gamma_{xy}(1) = E[(x_{t+1} - 0)(y_t - 0)]
              = E[(w_{t+1} + w_t)(w_t - w_{t-1})]
              = \sigma_w^2

because only one product is nonzero. Similarly, \gamma_{xy}(0) = 0, \gamma_{xy}(-1) = -\sigma_w^2. We obtain, using (1.29),

\rho_{xy}(h) = \begin{cases} 0, & h = 0 \\ 1/2, & h = 1 \\ -1/2, & h = -1 \\ 0, & |h| \ge 2. \end{cases}

Clearly, the autocovariance and cross-covariance functions depend only on the lag separation, h, so the series are jointly stationary.
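These values can be checked by simulation (a sketch, not part of the text); in R, ccf(x, y) at lag h estimates the correlation between x_{t+h} and y_t, matching (1.28).

> set.seed(3)                   # assumed seed
> w = rnorm(1001)               # white noise w_0, w_1, ..., w_1000
> x = w[-1] + w[-1001]          # x_t = w_t + w_{t-1}
> y = w[-1] - w[-1001]          # y_t = w_t - w_{t-1}
> ccf(x, y, 5, plot=FALSE)      # roughly 1/2 at lag 1, -1/2 at lag -1, near 0 elsewhere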


Example 1.22 Prediction Using Cross-Correlation

As a simple example of cross-correlation, consider the problem of determining possible leading or lagging relations between two series x_t and y_t. If the model

y_t = A x_{t-\ell} + w_t

holds, the series x_t is said to lead y_t for \ell > 0 and is said to lag y_t for \ell < 0. Hence, the analysis of leading and lagging relations might be important in predicting the value of y_t from x_t. Assuming, for convenience, that x_t and y_t have zero means, and the noise w_t is uncorrelated with the x_t series, the cross-covariance function can be computed as

\gamma_{yx}(h) = E(y_{t+h} x_t)
              = A\,E(x_{t+h-\ell}\, x_t) + E(w_{t+h} x_t)
              = A\,\gamma_x(h - \ell).

The cross-covariance function will look like the autocovariance of the input series x_t, with a peak on the positive side if x_t leads y_t and a peak on the negative side if x_t lags y_t.
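This behavior is easy to see numerically (a sketch, not from the text, with the assumed values A = 1 and \ell = 5). In R, ccf(y, x) at lag h estimates the correlation between y_{t+h} and x_t, so a peak near lag +5 indicates x_t leads y_t by five units.

> set.seed(4)                                # assumed seed
> x = rnorm(500)                             # input series (white noise, for simplicity)
> y = c(rep(0,5), x[1:495]) + rnorm(500)/2   # y_t = x_{t-5} + noise, so x leads y by 5 units
> ccf(y, x, 15)                              # sample CCF peaks near lag +5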

The concept of weak stationarity forms the basis for much of the analysis performed with time series. The fundamental properties of the mean and autocovariance functions (1.22) and (1.24) are satisfied by many theoretical models that appear to generate plausible sample realizations. In Examples 1.9 and 1.10, two series were generated that produced stationary looking realizations, and in Example 1.20, we showed that the series in Example 1.9 was, in fact, weakly stationary. Both examples are special cases of the so-called linear process.

Definition 1.12 A linear process, x_t, is defined to be a linear combination of white noise variates w_t, and is given by

x_t = \mu + \sum_{j=-\infty}^{\infty} \psi_j w_{t-j}    (1.31)

where the coefficients satisfy

\sum_{j=-\infty}^{\infty} |\psi_j| < \infty.    (1.32)

For the linear process (see Problem 1.11), we may show that the autocovariance function is given by

\gamma(h) = \sigma_w^2 \sum_{j=-\infty}^{\infty} \psi_{j+h}\psi_j    (1.33)


for h \ge 0; recall that \gamma(-h) = \gamma(h). This method exhibits the autocovariance function of the process in terms of the lagged products of the coefficients. Note that, for Example 1.9, we have \psi_0 = \psi_{-1} = \psi_1 = 1/3 and the result in Example 1.20 comes out immediately. The autoregressive series in Example 1.10 can also be put in this form, as can the general autoregressive moving average processes considered in Chapter 3.
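For a finite set of \psi-weights, (1.33) is simple to evaluate directly; a small R sketch (not from the text) for the three-point moving average, where \psi_{-1} = \psi_0 = \psi_1 = 1/3 and \sigma_w^2 = 1:

> psi = rep(1/3, 3)                                                               # psi_{-1}, psi_0, psi_1
> acvf = function(h) { if (h >= 3) return(0); sum(psi[1:(3-h)] * psi[(1+h):3]) }  # eq (1.33) with sigma_w^2 = 1
> sapply(0:4, acvf)                                                               # 3/9, 2/9, 1/9, 0, 0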

Finally, as previously mentioned, an important case in which a weakly stationary series is also strictly stationary is the normal or Gaussian series.

Definition 1.13 A process, \{x_t\}, is said to be a Gaussian process if the k-dimensional vectors \mathbf{x} = (x_{t_1}, x_{t_2}, \ldots, x_{t_k})', for every collection of time points t_1, t_2, \ldots, t_k, and every positive integer k, have a multivariate normal distribution.

Defining the k \times 1 mean vector E(\mathbf{x}) \equiv \boldsymbol{\mu} = (\mu_{t_1}, \mu_{t_2}, \ldots, \mu_{t_k})' and the k \times k covariance matrix as \mathrm{cov}(\mathbf{x}) \equiv \Gamma = \{\gamma(t_i, t_j);\ i, j = 1, \ldots, k\}, the multivariate normal density function can be written as

f(\mathbf{x}) = (2\pi)^{-k/2} |\Gamma|^{-1/2} \exp\left\{ -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\Gamma^{-1}(\mathbf{x} - \boldsymbol{\mu}) \right\},    (1.34)

where |\cdot| denotes the determinant. This distribution forms the basis for solving problems involving statistical inference for time series. If a Gaussian time series, \{x_t\}, is weakly stationary, then \mu_t = \mu and \gamma(t_i, t_j) = \gamma(|t_i - t_j|), so that the vector \boldsymbol{\mu} and the matrix \Gamma are independent of time. These facts imply that all the finite distributions, (1.34), of the series \{x_t\} depend only on time lag and not on the actual times, and hence the series must be strictly stationary. We use the multivariate normal density in the form given above as well as in a modified version, applicable to complex random variables, in the sequel.

1.6 Estimation of Correlation

Although the theoretical autocorrelation and cross-correlation functions are useful for describing the properties of certain hypothesized models, most of the analyses must be performed using sampled data. This limitation means the sampled points x_1, x_2, \ldots, x_n only are available for estimating the mean, autocovariance, and autocorrelation functions. From the point of view of classical statistics, this poses a problem because we will typically not have iid copies of x_t that are available for estimating the covariance and correlation functions. In the usual situation with only one realization, however, the assumption of stationarity becomes critical. Somehow, we must use averages over this single realization to estimate the population means and covariance functions.


Accordingly, if a time series is stationary, the mean function, (1.22), \mu_t = \mu is constant so that we can estimate it by the sample mean,

\bar{x} = \frac{1}{n} \sum_{t=1}^{n} x_t.    (1.35)

The theoretical autocovariance function, (1.24), is estimated by the sample autocovariance function defined as follows.

Definition 1.14 The sample autocovariance function is defined as

\hat{\gamma}(h) = n^{-1} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(x_t - \bar{x}),    (1.36)

with \hat{\gamma}(-h) = \hat{\gamma}(h) for h = 0, 1, \ldots, n - 1.

The sum in (1.36) runs over a restricted range because x_{t+h} is not available for t + h > n. The estimator in (1.36) is generally preferred to the one that would be obtained by dividing by n - h because (1.36) is a non-negative definite function. This means that if we let \hat{\Gamma} = \{\hat{\gamma}(i - j);\ i, j = 1, \ldots, n\} be the n \times n sample covariance matrix of the data \mathbf{x} = (x_1, \ldots, x_n)', then \hat{\Gamma} is a non-negative definite matrix. So, if we let \mathbf{a} = (a_1, \ldots, a_n)' be an n \times 1 vector of constants, then var(\mathbf{a}'\mathbf{x}) = \mathbf{a}'\hat{\Gamma}\mathbf{a} \ge 0. Thus, the non-negative definite property ensures sample variances of linear combinations of the variates x_t will always be non-negative. Note that neither dividing by n nor n - h in (1.36) yields an unbiased estimate of \gamma(h).
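In R, the acf function computes (1.36) and (1.37); it divides by n and subtracts the sample mean, so its values agree with the definition above. A brief check on a simulated series (a sketch, not from the text):

> set.seed(5)                                                                             # assumed seed
> x = rnorm(100); n = 100                                                                 # any series will do for the check
> byhand = sapply(1:10, function(h) sum((x[(h+1):n]-mean(x))*(x[1:(n-h)]-mean(x)))/n)     # eq (1.36), lags 1-10
> byR = acf(x, 10, type="covariance", plot=FALSE)$acf[-1]                                 # drop lag 0
> max(abs(byhand - byR))                                                                  # essentially zero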

Definition 1.15 The sample autocorrelation function is defined, analogously to (1.25), as

\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}.    (1.37)

The sample autocorrelation function has a sampling distribution that allows us to assess whether the data comes from a completely random or white series or whether correlations are statistically significant at some lags. Precise details are given in Theorem A.7 in Appendix A. We have

Property P1.1: Large Sample Distribution of the ACF
Under general conditions, if x_t is white noise, then for n large, the sample ACF, \hat{\rho}_x(h), for h = 1, 2, \ldots, H, where H is fixed but arbitrary, is approximately normally distributed with zero mean and standard deviation given by

\sigma_{\hat{\rho}_x(h)} = \frac{1}{\sqrt{n}}.    (1.38)

Based on the above result, we obtain a rough method of assessing whether peaks in \hat{\rho}(h) are significant by determining whether the observed peak is


outside the interval \pm 2/\sqrt{n} (or plus/minus two standard errors); for a white noise sequence, approximately 95% of the sample ACFs should be within these limits. The applications of this property develop because many statistical modeling procedures depend on reducing a time series to a white noise series by various kinds of transformations. After such a procedure is applied, the plotted ACFs of the residuals should then lie roughly within the limits given above.
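A quick simulation illustrates the 95% rule (a sketch, not from the text): for Gaussian white noise, roughly 5% of the sample autocorrelations at lags 1 through 20 should fall outside \pm 2/\sqrt{n}.

> set.seed(6)                                                     # assumed seed
> n = 100
> out = replicate(1000, acf(rnorm(n), 20, plot=FALSE)$acf[-1])    # sample ACFs of 1000 white noise series
> mean(abs(out) > 2/sqrt(n))                                      # close to .05 (the bounds are approximate)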

Definition 1.16 The estimators for the cross-covariance function, \gamma_{xy}(h), as given in (1.28) and the cross-correlation, \rho_{xy}(h), in (1.29), are given, respectively, by the sample cross-covariance function

\hat{\gamma}_{xy}(h) = n^{-1} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(y_t - \bar{y}),    (1.39)

where \hat{\gamma}_{xy}(-h) = \hat{\gamma}_{yx}(h) determines the function for negative lags, and the sample cross-correlation function

\hat{\rho}_{xy}(h) = \frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_x(0)\hat{\gamma}_y(0)}}.    (1.40)

The sample cross-correlation function can be examined graphically as a function of lag h to search for leading or lagging relations in the data using the property mentioned in Example 1.22 for the theoretical cross-covariance function. Because -1 \le \hat{\rho}_{xy}(h) \le 1, the practical importance of peaks can be assessed by comparing their magnitudes with their theoretical maximum values. Furthermore, for x_t and y_t independent linear processes of the form (1.31), we have

Property P1.2: Large Sample Distribution of the Cross-Correlation Under Independence
The large sample distribution of \hat{\rho}_{xy}(h) is normal with mean zero and

\sigma_{\hat{\rho}_{xy}} = \frac{1}{\sqrt{n}}    (1.41)

if at least one of the processes is independent white noise (see Theorem A.8 in Appendix A).

Example 1.23 A Simulated Time Series

To give an example of the procedure for calculating numerically the autocovariance and cross-covariance functions, consider a contrived set of data generated by tossing a fair coin, letting x_t = 1 when a head is obtained and x_t = −1 when a tail is obtained. Construct y_t as

yt = 5 + xt − .7xt−1. (1.42)


Table 1.1 Sample Realization of the Contrived Series y_t

  t          1     2     3     4     5     6     7     8     9    10
  Coin       H     H     T     H     T     T     T     H     T     H
  x_t        1     1    −1     1    −1    −1    −1     1    −1     1
  y_t      6.7   5.3   3.3   6.7   3.3   4.7   4.7   6.7   3.3   6.7
  y_t − ȳ 1.56   .16 −1.84  1.56 −1.84  −.44  −.44  1.56 −1.84  1.56

Table 1.1 shows sample realizations of the appropriate processes with x_0 = −1 and n = 10.

The sample autocorrelation for the series y_t can be calculated using (1.36) and (1.37) for h = 0, 1, 2, \ldots. It is not necessary to calculate for negative values because of the symmetry. For example, for h = 3, the autocorrelation becomes the ratio of

\hat{\gamma}_y(3) = 10^{-1} \sum_{t=1}^{7} (y_{t+3} - \bar{y})(y_t - \bar{y})
                 = 10^{-1}[(1.56)(1.56) + (-1.84)(.16) + (-.44)(-1.84)
                          + (-.44)(1.56) + (1.56)(-1.84) + (-1.84)(-.44)
                          + (1.56)(-.44)]
                 = -.04848

to

\hat{\gamma}_y(0) = \frac{1}{10}[(1.56)^2 + (.16)^2 + \cdots + (1.56)^2] = 2.0304

so that

\hat{\rho}_y(3) = \frac{-.04848}{2.0304} = -.02388.

The theoretical ACF can be obtained from the model (1.42) using the fact that the mean of x_t is zero and the variance of x_t is one. It can be shown that

\rho_y(1) = \frac{-.7}{1 + .7^2} = -.47

and \rho_y(h) = 0 for |h| > 1 (Problem 1.23). Table 1.2 compares the theoretical ACF with sample ACFs for a realization where n = 10 and another realization where n = 100; we note the increased variability in the smaller size sample.


Table 1.2 Theoretical and Sample ACFs for n = 10 and n = 100

   h     theoretical ρ_y(h)    sample, n = 10    sample, n = 100
   0           1.00                 1.00               1.00
  ±1           −.47                 −.55               −.45
  ±2            .00                  .17               −.12
  ±3            .00                 −.02                .14
  ±4            .00                  .15                .01
  ±5            .00                 −.46               −.01
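A sketch of how such a realization might be generated and its sample ACF computed in R (not the authors' code; the exact numbers depend on the simulated coin tosses):

> set.seed(7)                              # assumed seed
> x = sample(c(-1,1), 11, replace=TRUE)    # coin tosses giving x_0, x_1, ..., x_10
> y = 5 + x[-1] - .7*x[-11]                # y_t = 5 + x_t - .7 x_{t-1}, t = 1, ..., 10
> acf(y, 5, plot=FALSE)                    # sample ACF, compare with Table 1.2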

Example 1.24 ACF of Speech Signal

Computing the sample ACF as in the previous example can be thought of as matching the time series h units in the future, say, x_{t+h}, against itself, x_t. Figure 1.13 shows the ACF of the speech series of Figure 1.3. The original series appears to contain a sequence of repeating short signals. The ACF confirms this behavior, showing repeating peaks spaced at about 106-109 points. Autocorrelation functions of the short signals appear, spaced at the intervals mentioned above. The distance between the repeating signals is known as the pitch period and is a fundamental parameter of interest in systems that encode and decipher speech. Because the series is sampled at 10,000 points per second, the pitch period appears to be between .0106 and .0109 seconds.

To compute the sample ACF in R, use

> speech = scan("/mydata/speech.dat")
> acf(speech, 250)

Example 1.25 Correlation Analysis of SOI and Recruitment Data

The autocorrelation and cross-correlation functions are also useful for analyzing the joint behavior of two stationary series whose behavior may be related in some unspecified way. In Example 1.5 (see Figure 1.5), we have considered simultaneous monthly readings of the SOI and the number of new fish (Recruitment) computed from a model. Figure 1.14 shows the autocorrelation and cross-correlation functions (ACFs and CCF) for these two series. Both of the ACFs exhibit periodicities corresponding to the correlation between values separated by 12 units. Observations 12 months or one year apart are strongly positively correlated, as are observations at multiples such as 24, 36, 48, .... Observations separated by six months are negatively correlated, showing that positive excursions tend to be associated with negative excursions six months removed. This appearance is rather characteristic of the pattern that would be produced by


Figure 1.13 ACF of the speech series.

a sinusoidal component with a period of 12 months. The cross-correlation function peaks at h = −6, showing that the SOI measured at time t − 6 months is associated with the Recruitment series at time t. We could say the SOI leads the Recruitment series by six months. The sign of the CCF is negative, leading to the conclusion that the two series move in different directions, i.e., increases in SOI lead to decreases in Recruitment and vice versa. Again, note the periodicity of 12 months in the CCF. The flat lines shown on the plots indicate \pm 2/\sqrt{453}, so that upper values would be exceeded about 2.5% of the time if the noise were white [see (1.38) and (1.41)].

To reproduce Figure 1.14 in R, use the following commands.

> soi = scan("/mydata/soi.dat")
> rec = scan("/mydata/recruit.dat")
> par(mfrow=c(3,1))
> acf(soi, 50)
> acf(rec, 50)
> ccf(soi, rec, 50)

1.7 Vector-Valued and Multidimensional Series

We frequently encounter situations in which the relationships between a number of jointly measured time series are of interest. For example, in the previous sections, we considered discovering the relationships between the SOI and Recruitment series. Hence, it will be useful to consider the notion of a vector time


Figure 1.14 Sample ACFs of the SOI series (top) and of the Recruitment series (middle), and the sample CCF of the two series (bottom); negative lags indicate SOI leads Recruitment.

series \mathbf{x}_t = (x_{t1}, x_{t2}, \ldots, x_{tp})', which contains as its components p univariate time series. We denote the p \times 1 column vector of the observed series as \mathbf{x}_t. The row vector \mathbf{x}_t' is its transpose. For the stationary case, the p \times 1 mean vector

\boldsymbol{\mu} = E(\mathbf{x}_t)    (1.43)

of the form \boldsymbol{\mu} = (\mu_{t1}, \mu_{t2}, \ldots, \mu_{tp})' and the p \times p autocovariance matrix

\Gamma(h) = E[(\mathbf{x}_{t+h} - \boldsymbol{\mu})(\mathbf{x}_t - \boldsymbol{\mu})']    (1.44)

can be defined, where the elements of the matrix \Gamma(h) are the cross-covariance


functions

\gamma_{ij}(h) = E[(x_{t+h,i} - \mu_i)(x_{tj} - \mu_j)]    (1.45)

for i, j = 1, \ldots, p. Because \gamma_{ij}(h) = \gamma_{ji}(-h), it follows that

\Gamma(-h) = \Gamma'(h).    (1.46)

Now, the sample autocovariance matrix of the vector series \mathbf{x}_t is the p \times p matrix of sample cross-covariances, defined as

\hat{\Gamma}(h) = n^{-1} \sum_{t=1}^{n-h} (\mathbf{x}_{t+h} - \bar{\mathbf{x}})(\mathbf{x}_t - \bar{\mathbf{x}})',    (1.47)

where

\bar{\mathbf{x}} = n^{-1} \sum_{t=1}^{n} \mathbf{x}_t    (1.48)

denotes the p \times 1 sample mean vector. The symmetry property of the theoretical autocovariance (1.46) extends to the sample autocovariance (1.47), which is defined for negative values by taking

\hat{\Gamma}(-h) = \hat{\Gamma}(h)'.    (1.49)
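In R, passing a multivariate series (a matrix with one column per component) to acf produces the array of sample auto- and cross-correlations of its components; a small sketch with a simulated bivariate series (not from the text):

> set.seed(8)                                    # assumed seed
> w = rnorm(501)                                 # common white noise source
> x = cbind(w[-1] + w[-501], w[-1] - w[-501])    # bivariate series, as in Example 1.21
> colnames(x) = c("sum", "diff")                 # label the two components
> acf(x, 3)                                      # 2 x 2 array of sample ACFs and CCFs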

In many applied problems, an observed series may be indexed by more than time alone. For example, the position in space of an experimental unit might be described by two coordinates, say, s_1 and s_2. We may proceed in these cases by defining a multidimensional process x_{\mathbf{s}} as a function of the r \times 1 vector \mathbf{s} = (s_1, s_2, \ldots, s_r)', where s_i denotes the coordinate of the ith index.

Example 1.26 Soil Surface Temperatures

As an example, the two-dimensional (r = 2) temperature series x_{s_1,s_2} in Figure 1.15 is indexed by a row number s_1 and a column number s_2 that represent positions on a 64 \times 36 spatial grid set out on an agricultural field. The value of the temperature measured at row s_1 and column s_2 is denoted by x_{\mathbf{s}} = x_{s_1,s_2}. We can note from the two-dimensional plot that a distinct change occurs in the character of the two-dimensional surface starting at about row 40, where the oscillations along the row axis become fairly stable and periodic. For example, averaging over the 36 columns, we may compute an average value for each s_1 as in Figure 1.16. It is clear that the noise present in the first part of the two-dimensional series is nicely averaged out, and we see a clear and consistent temperature signal.

The autocovariance function of a stationary multidimensional process, x_{\mathbf{s}}, can be defined as a function of the multidimensional lag vector, say, \mathbf{h} = (h_1, h_2, \ldots, h_r)', as

\gamma(\mathbf{h}) = E[(x_{\mathbf{s}+\mathbf{h}} - \mu)(x_{\mathbf{s}} - \mu)],    (1.50)


Figure 1.15 Two-dimensional time series of temperature measurements taken on a rectangular field (64 \times 36 with 17-foot spacing). Data are from Bazza et al. (1988).

where

\mu = E(x_{\mathbf{s}})    (1.51)

does not depend on the spatial coordinate \mathbf{s}. For the two-dimensional temperature process, (1.50) becomes

\gamma(h_1, h_2) = E[(x_{s_1+h_1, s_2+h_2} - \mu)(x_{s_1, s_2} - \mu)],    (1.52)

which is a function of lag, both in the row (h_1) and column (h_2) directions. The multidimensional sample autocovariance function is defined as

\hat{\gamma}(\mathbf{h}) = (S_1 S_2 \cdots S_r)^{-1} \sum_{s_1} \sum_{s_2} \cdots \sum_{s_r} (x_{\mathbf{s}+\mathbf{h}} - \bar{x})(x_{\mathbf{s}} - \bar{x}),    (1.53)

where \mathbf{s} = (s_1, s_2, \ldots, s_r)' and the range of summation for each argument is 1 \le s_i \le S_i - h_i, for i = 1, \ldots, r. The mean is computed over the r-dimensional array, that is,

\bar{x} = (S_1 S_2 \cdots S_r)^{-1} \sum_{s_1} \sum_{s_2} \cdots \sum_{s_r} x_{s_1, s_2, \ldots, s_r},    (1.54)

where the arguments s_i are summed over 1 \le s_i \le S_i. The multidimensional sample autocorrelation function follows, as usual, by taking the scaled ratio

\hat{\rho}(\mathbf{h}) = \frac{\hat{\gamma}(\mathbf{h})}{\hat{\gamma}(\mathbf{0})}.    (1.55)


Figure 1.16 Row averages of the two-dimensional soil temperature profile, \bar{x}_{s_1} = \sum_{s_2} x_{s_1, s_2} / 36.

Example 1.27 Sample ACF of the Soil Temperature Series

The autocorrelation function of the two-dimensional temperature process can be written in the form

\hat{\rho}(h_1, h_2) = \frac{\hat{\gamma}(h_1, h_2)}{\hat{\gamma}(0, 0)},

where

\hat{\gamma}(h_1, h_2) = (S_1 S_2)^{-1} \sum_{s_1} \sum_{s_2} (x_{s_1+h_1, s_2+h_2} - \bar{x})(x_{s_1, s_2} - \bar{x}).

Figure 1.17 shows the autocorrelation function for the temperature data, and we note the systematic periodic variation that appears along the rows. The autocovariance over columns seems to be strongest for h_1 = 0, implying columns may form replicates of some underlying process that has a periodicity over the rows. This idea can be investigated by examining the mean series over columns as shown in Figure 1.16.

The sampling requirements for multidimensional processes are rather severe because values must be available over some uniform grid in order to compute the ACF. In some areas of application, such as in soil science, we may prefer to sample a limited number of rows or transects and hope these are essentially replicates of the basic underlying phenomenon of interest. One-dimensional methods can then be applied. When observations are irregular in time or space, modifications to the estimators need to be made. Systematic approaches to the


Figure 1.17 Two-dimensional autocorrelation function for the soil temperature data.

problems introduced by irregularly spaced observations have been developed by Journel and Huijbregts (1978) or Cressie (1993). We shall not pursue such methods in detail here, but it is worth noting that the introduction of the variogram

2V_x(\mathbf{h}) = \mathrm{var}\{x_{\mathbf{s}+\mathbf{h}} - x_{\mathbf{s}}\}    (1.56)

and its sample estimator

2\hat{V}_x(\mathbf{h}) = \frac{1}{N(\mathbf{h})} \sum_{\mathbf{s}} (x_{\mathbf{s}+\mathbf{h}} - x_{\mathbf{s}})^2    (1.57)

play key roles, where N(\mathbf{h}) denotes both the number of points located within \mathbf{h}, and the sum runs over the points in the neighborhood. Clearly, substantial indexing difficulties will develop from estimators of the kind, and often it will be difficult to find non-negative definite estimators for the covariance function. Problem 1.26 investigates the relation between the variogram and the autocovariance function in the stationary case.
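For a one-dimensional stationary series, (1.57) is easy to compute, and by the relation investigated in Problem 1.26 it should be close to \hat{\gamma}(0) - \hat{\gamma}(h); a brief R sketch (not from the text), using a simulated AR(1) series as an example:

> set.seed(9)                                                        # assumed seed
> x = filter(rnorm(2000), .8, method="recursive")                    # a stationary AR(1) series
> vario = function(h) mean((x[(h+1):2000] - x[1:(2000-h)])^2)/2      # eq (1.57) in one dimension
> g = acf(x, 5, type="covariance", plot=FALSE)$acf                   # sample autocovariances, lags 0-5
> cbind(sapply(1:5, vario), g[1] - g[-1])                            # the two columns should nearly agree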


Problems

Section 1.2

1.1 To compare the earthquake and explosion signals, plot the data displayed in Figure 1.7 on the same graph using different colors or different line types and comment on the results.

1.2 Consider a signal plus noise model of the general form x_t = s_t + w_t, where w_t is Gaussian white noise with \sigma_w^2 = 1. Simulate and plot n = 200 observations from each of the following two models (save the data generated here for use in Problem 1.21):

(a) x_t = s_t + w_t, for t = 1, \ldots, 200, where

s_t = \begin{cases} 0, & t = 1, \ldots, 100 \\ 10\exp\{-(t-100)/20\}\cos(2\pi t/4), & t = 101, \ldots, 200. \end{cases}

(b) x_t = s_t + w_t, for t = 1, \ldots, 200, where

s_t = \begin{cases} 0, & t = 1, \ldots, 100 \\ 10\exp\{-(t-100)/200\}\cos(2\pi t/4), & t = 101, \ldots, 200. \end{cases}

(c) Compare the general appearance of the series (a) and (b) with the earthquake series and the explosion series shown in Figure 1.7. In addition, plot (or sketch) and compare the signal modulators (a) \exp\{-t/20\} and (b) \exp\{-t/200\}, for t = 1, 2, \ldots, 100.

Section 1.3

1.3 (a) Generate n = 100 observations from the autoregression

xt = −.9xt−2 + wt

with \sigma_w = 1, using the method described in Example 1.10. Next, apply the moving average filter

v_t = (x_t + x_{t-1} + x_{t-2} + x_{t-3})/4

to x_t, the data you generated. Now plot x_t as a line and superimpose v_t as a dashed line. Comment on the behavior of x_t and how applying the moving average filter changes that behavior.

(b) Repeat (a) but with x_t = \cos(2\pi t/4).

(c) Repeat (b) but with added N(0, 1) noise,

xt = cos(2πt/4) + wt.

(d) Compare and contrast (a)–(c).


Section 1.4

1.4 Show that the autocovariance function can be written as

γ(s, t) = E[(xs − µs)(xt − µt)] = E(xsxt) − µsµt,

where E[xt] = µt.

1.5 For the two series, xt, in Problem 1.2 (a) and (b):

(a) compute and sketch the mean functions µx(t); for t = 1, . . . , 200.

(b) calculate the autocovariance functions, γx(s, t), for s, t = 1, . . . , 200.

Section 1.5

1.6 Consider the time series

xt = β1 + β2t + wt,

where \beta_1 and \beta_2 are known constants and w_t is a white noise process with variance \sigma_w^2.

(a) Determine whether xt is stationary.

(b) Show that the process yt = xt − xt−1 is stationary.

(c) Show that the mean of the moving average

v_t = \frac{1}{2q+1} \sum_{j=-q}^{q} x_{t-j}

is \beta_1 + \beta_2 t, and give a simplified expression for the autocovariance function.

1.7 For a moving average process of the form

xt = wt−1 + 2wt + wt+1,

where w_t are independent with zero means and variance \sigma_w^2, determine the autocovariance and autocorrelation functions as a function of lag h = s - t and plot.

1.8 Consider the random walk with drift model

xt = δ + xt−1 + wt,

for t = 1, 2, \ldots, with x_0 = 0, where w_t is white noise with variance \sigma_w^2.

(a) Show that the model can be written as x_t = \delta t + \sum_{k=1}^{t} w_k.


(b) Find the mean function and the autocovariance function of xt

(c) Show \rho_x(t-1, t) = \sqrt{\frac{t-1}{t}} \to 1 as t \to \infty. What is the implication of this result?

(d) Show that the series is not stationary.

(e) Suggest a transformation to make the series stationary, and prove that the transformed series is stationary. (Hint: See Problem 1.6b.)

1.9 A time series with a periodic component can be constructed from

xt = U1 sin(2πω0t) + U2 cos(2πω0t),

where U_1 and U_2 are independent random variables with zero means and E(U_1^2) = E(U_2^2) = \sigma^2. The constant \omega_0 determines the period or time it takes the process to make one complete cycle. Show that this series is weakly stationary with autocovariance function

γ(h) = σ2 cos(2πω0h).

1.10 Suppose we would like to predict a single stationary series x_t with zero mean and autocovariance function \gamma(h) at some time in the future, say, t + \ell, for \ell > 0.

(a) If we predict using only x_t and some scale multiplier A, show that the mean-square prediction error

MSE(A) = E[(x_{t+\ell} - A x_t)^2]

is minimized by the value

A = \rho(\ell).

(b) Show that the minimum mean-square prediction error is

MSE(A) = \gamma(0)[1 - \rho^2(\ell)].

(c) Show that if x_{t+\ell} = A x_t, then \rho(\ell) = 1 if A > 0, and \rho(\ell) = -1 if A < 0.

1.11 Consider the linear process defined in (1.31).

(a) Verify that the autocovariance function of the process is given by (1.33). Use the result to verify your answer to Problem 1.7.

(b) Show that x_t exists as a limit in mean square (see Appendix A) if (1.32) holds.

1.12 For two weakly stationary series xt and yt, verify (1.30).


1.13 Consider the two series

x_t = w_t
y_t = w_t - \theta w_{t-1} + u_t,

where w_t and u_t are independent white noise series with variances \sigma_w^2 and \sigma_u^2, respectively, and \theta is an unspecified constant.

(a) Express the ACF, \rho_y(h), for h = 0, \pm 1, \pm 2, \ldots of the series y_t as a function of \sigma_w^2, \sigma_u^2, and \theta.
(b) Determine the CCF, \rho_{xy}(h), relating x_t and y_t.
(c) Show that x_t and y_t are jointly stationary.

1.14 Let x_t be a stationary normal process with mean \mu_x and autocovariance function \gamma(h). Define the nonlinear time series

y_t = \exp\{x_t\}.

(a) Express the mean function E(y_t) in terms of \mu_x and \gamma(0). The moment generating function of a normal random variable x with mean \mu and variance \sigma^2 is

M_x(\lambda) = E[\exp\{\lambda x\}] = \exp\left\{\mu\lambda + \frac{1}{2}\sigma^2\lambda^2\right\}.

(b) Determine the autocovariance function of y_t. The sum of the two normal random variables x_{t+h} + x_t is still a normal random variable.

1.15 Let w_t, for t = 0, \pm 1, \pm 2, \ldots be a normal white noise process, and consider the series

x_t = w_t w_{t-1}.

Determine the mean and autocovariance function of x_t, and state whether it is stationary.

1.16 Consider the series

x_t = \sin(2\pi U t),

t = 1, 2, . . ., where U has a uniform distribution on the interval (0, 1).

(a) Prove x_t is weakly stationary.
(b) Prove x_t is not strictly stationary. [Hint: consider the joint bivariate cdf (1.19) at the points t = 1, s = 2 with h = 1, and find values of c_t, c_s where strict stationarity does not hold.]

1.17 Suppose we have the linear process xt generated by

xt = wt − θwt−1,

t = 0, 1, 2, \ldots, where \{w_t\} is independent and identically distributed with characteristic function \phi_w(\cdot), and \theta is a fixed constant.


(a) Express the joint characteristic function of x1, x2, . . . , xn, say,

φx1,x2,...,xn(λ1, λ2, . . . , λn),

in terms of \phi_w(\cdot).
(b) Deduce from (a) that x_t is strictly stationary.

1.18 Suppose that x_t is a linear process of the form (1.31) satisfying the absolute summability condition (1.32). Prove

\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty.

Section 1.6

1.19 (a) Simulate a series of n = 500 Gaussian white noise observations as in Example 1.8 and compute the sample ACF, \hat{\rho}(h), to lag 20. Compare the sample ACF you obtain to the actual ACF, \rho(h). [Recall Example 1.19.]

(b) Repeat part (a) using only n = 50. How does changing n affect the results?

1.20 (a) Simulate a series of n = 500 moving average observations as in Example 1.9 and compute the sample ACF, \hat{\rho}(h), to lag 20. Compare the sample ACF you obtain to the actual ACF, \rho(h). [Recall Example 1.20.]

(b) Repeat part (a) using only n = 50. How does changing n affect the results?

1.21 Although the model in Problem 1.2 is not stationary (Why?), the sample ACF can be informative. For the data you generated in that problem, calculate and plot the sample ACF, and then comment.

1.22 Simulate a series of n = 500 observations from the signal-plus-noise model presented in Example 1.12 with \sigma_w^2 = 1. Compute the sample ACF to lag 100 of the data you generated and comment.

1.23 For the time series y_t described in Example 1.23, verify the stated result that \rho_y(1) = -.47 and \rho_y(h) = 0 for h > 1.

1.24 A real-valued function g(t), defined on the integers, is non-negative definite if and only if

\sum_{s=1}^{n} \sum_{t=1}^{n} a_s\, g(s-t)\, a_t \ge 0

for all positive integers n and for all vectors \mathbf{a} = (a_1, a_2, \ldots, a_n)'. For the matrix G = \{g(s - t);\ s, t = 1, 2, \ldots, n\}, this implies that \mathbf{a}'G\mathbf{a} \ge 0 for all vectors \mathbf{a}.


(a) Prove that \gamma(h), the autocovariance function of a stationary process, is a non-negative definite function.

(b) Verify that the sample autocovariance \hat{\gamma}(h) is a non-negative definite function.

Section 1.7

1.25 Consider a collection of time series x_{1t}, x_{2t}, \ldots, x_{Nt} that are observing some common signal \mu_t observed in noise processes e_{1t}, e_{2t}, \ldots, e_{Nt}, with a model for the j-th observed series given by

x_{jt} = \mu_t + e_{jt}.

Suppose the noise series have zero means and are uncorrelated for different j. The common autocovariance functions of all series are given by \gamma_e(s, t). Define the sample mean

\bar{x}_t = \frac{1}{N} \sum_{j=1}^{N} x_{jt}.

(a) Show that E[\bar{x}_t] = \mu_t.

(b) Show that E[(\bar{x}_t - \mu_t)^2] = N^{-1}\gamma_e(t, t).

(c) How can we use the results in estimating the common signal?

1.26 A concept used in geostatistics, see Journel and Huijbregts (1978) or Cressie (1993), is that of the variogram, defined for a spatial process x_{\mathbf{s}}, \mathbf{s} = (s_1, s_2), for s_1, s_2 = 0, \pm 1, \pm 2, \ldots, as

V_x(\mathbf{h}) = \frac{1}{2} E[(x_{\mathbf{s}+\mathbf{h}} - x_{\mathbf{s}})^2],

where \mathbf{h} = (h_1, h_2), for h_1, h_2 = 0, \pm 1, \pm 2, \ldots. Show that, for a stationary process, the variogram and autocovariance functions can be related through

V_x(\mathbf{h}) = \gamma(\mathbf{0}) - \gamma(\mathbf{h}),

where \gamma(\mathbf{h}) is the usual lag \mathbf{h} covariance function and \mathbf{0} = (0, 0). Note the easy extension to any spatial dimension.

The following problems require the supplemental material given in Appendix A

1.27 Suppose x_t = \beta_0 + \beta_1 t, where \beta_0 and \beta_1 are constants. Prove as n \to \infty, \hat{\rho}_x(h) \to 1 for fixed h, where \hat{\rho}_x(h) is the ACF (1.37).


1.28 (a) Suppose x_t is a weakly stationary time series with mean zero and with absolutely summable autocovariance function, \gamma(h), such that

\sum_{h=-\infty}^{\infty} \gamma(h) = 0.

Prove that \sqrt{n}\,\bar{x} \xrightarrow{p} 0, where \bar{x} is the sample mean (1.35).

(b) Give an example of a process that satisfies the conditions of part (a). What is special about this process?

1.29 Let xt be a linear process of the form (A.44)–(A.45). If we define

\tilde{\gamma}(h) = n^{-1} \sum_{t=1}^{n} (x_{t+h} - \mu_x)(x_t - \mu_x),

show that

n^{1/2}\left(\tilde{\gamma}(h) - \hat{\gamma}(h)\right) = o_p(1).

Hint: The Markov Inequality

P\{|x| \ge \epsilon\} \le \frac{E|x|}{\epsilon}

can be helpful for the cross-product terms.

1.30 For a linear process of the form

x_t = \sum_{j=0}^{\infty} \phi^j w_{t-j},

where \{w_t\} satisfies the conditions of Theorem A.7 and |\phi| < 1, show that

\frac{\sqrt{n}\,(\hat{\rho}_x(1) - \rho_x(1))}{\sqrt{1 - \rho_x^2(1)}} \xrightarrow{d} N(0, 1),

and construct a 95% confidence interval for \phi when \hat{\rho}_x(1) = .64 and n = 100.

1.31 Let {xt; t = 0,±1,±2, . . .} be iid (0, σ2).

(a) For h \ge 1 and k \ge 1, show that x_t x_{t+h} and x_s x_{s+k} are uncorrelated for all s \ne t.

(b) For fixed h \ge 1, show that the h \times 1 vector

\sigma^{-2} n^{-1/2} \sum_{t=1}^{n} (x_t x_{t+1}, \ldots, x_t x_{t+h})' \xrightarrow{d} (z_1, \ldots, z_h)'


where z_1, \ldots, z_h are iid N(0, 1) random variables. [Note: the sequence \{x_t x_{t+h};\ t = 1, 2, \ldots\} is h-dependent and white noise (0, \sigma^4). Also, recall the Cramér-Wold device.]

(c) Show, for each h ≥ 1,

n^{-1/2}\left[\sum_{t=1}^{n} x_t x_{t+h} - \sum_{t=1}^{n-h} (x_t - \bar{x})(x_{t+h} - \bar{x})\right] \xrightarrow{p} 0 \quad \text{as } n \to \infty

where \bar{x} = n^{-1}\sum_{t=1}^{n} x_t.

(d) Noting that n^{-1}\sum_{t=1}^{n} x_t^2 \xrightarrow{p} \sigma^2, conclude that

n^{1/2}\,[\hat{\rho}(1), \ldots, \hat{\rho}(h)]' \xrightarrow{d} (z_1, \ldots, z_h)'

where \hat{\rho}(h) is the sample ACF of the data x_1, \ldots, x_n.


Chapter 2

Time Series Regression and Exploratory Data Analysis

2.1 Introduction

The linear model and its applications are at least as dominant in the time series context as in classical statistics. Regression models are important for time domain models discussed in Chapters 3, 5, and 6, and in the frequency domain models considered in Chapters 4 and 7. The primary ideas depend on being able to express a response series, say x_t, as a linear combination of inputs, say z_{t1}, z_{t2}, \ldots, z_{tq}. Estimating the coefficients \beta_1, \beta_2, \ldots, \beta_q in the linear combinations by least squares provides a method for modeling x_t in terms of the inputs.

In the time domain applications of Chapter 3, for example, we will express x_t as a linear combination of previous values, x_{t-1}, x_{t-2}, \ldots, x_{t-p}, of the currently observed series. The outputs x_t may also depend on lagged values of another series, say y_{t-1}, y_{t-2}, \ldots, y_{t-q}, that have influence. It is easy to see that forecasting becomes an option when prediction models can be formulated in this form. Time series smoothing and filtering can be expressed in terms of local regression models. Polynomials and regression splines also provide important techniques for smoothing.

If one admits sines and cosines as inputs, the frequency domain ideas thatlead to the periodogram and spectrum of Chapter 4 follow from a regressionmodel. Extensions to filters of infinite extent can be handled using regressionin the frequency domain. In particular, many regression problems in the fre-quency domain can be carried out as a function of the periodic components ofthe input and output series, providing useful scientific intuition into fields likeacoustics, oceanographics, engineering, biomedicine, and geophysics.

The above considerations motivate us to include a separate chapter on regression and some of its applications that is written on an elementary level and is formulated in terms of time series. The assumption of linearity, stationarity, and homogeneity of variances over time is critical in the regression context, and therefore we include some material on transformations and other techniques useful in exploratory data analysis.

2.2 Classical Regression in the Time Series Context

We begin our discussion of linear regression in the time series context by assuming some output or dependent time series, say, xt, for t = 1, . . . , n, is being influenced by a collection of possible inputs or independent series, say, zt1, zt2, . . . , ztq, where we first regard the inputs as fixed and known. This assumption, necessary for applying conventional linear regression, will be relaxed later on. We express this relation through the linear regression model

xt = β1zt1 + β2zt2 + · · · + βqztq + wt, (2.1)

where β1, β2, . . . , βq are unknown fixed regression coefficients, and {wt} is a random error or noise process consisting of independent and identically distributed (iid) normal variables with mean zero and variance σ²w; we will relax the iid assumption later. A more general setting within which to embed mean square estimation and linear regression is given in Appendix B, where we introduce Hilbert spaces and the Projection Theorem.

Example 2.1 Estimating a Trend

Consider the global temperature data, say xt, shown in Figure 1.2. As discussed in Example 1.2, there is an apparent upward trend in the series that has been used to argue the global warming hypothesis. We might use simple linear regression to estimate that trend by fitting the model

xt = β1 + β2t + wt, t = 1900, 1901, . . . , 1997.

This is in the form of the regression model (2.1) when we make the identification q = 2, zt1 = 1, zt2 = t. Note that we are making the assumption that the errors, wt, are an iid normal sequence, which may not be true. We will address this problem further in §2.3; the problem of autocorrelated errors is discussed in detail in §5.5. Also note that we could have used, e.g., t = 0, . . . , 97, without affecting the interpretation of the slope coefficient, β2; only the intercept, β1, would be affected.

Using simple linear regression, we obtained the estimated coefficients β̂1 = −12.186 and β̂2 = .006 (with a standard error of .0005), yielding a significant estimated increase of .6 degrees centigrade per 100 years. We discuss the precise way in which the solution was accomplished below. Finally, Figure 2.1 shows the global temperature data, say xt, with the estimated trend, say x̂t = −12.186 + .006t, superimposed. It is apparent that the estimated trend line obtained via simple linear regression does not quite capture the trend of the data and better models will be needed.

Figure 2.1 Global temperature deviations shown in Figure 1.2 with fitted linear trend line.

To perform this analysis in R, we note that the data file globtemp.dat has 142 observations starting from the year 1856. We are only using the final 98 observations corresponding to the years 1900 to 1997.

> gtemp = scan("/mydata/globtemp.dat")
> x = gtemp[45:142]
> t = 1900:1997
> fit = lm(x~t)       # regress x on t
> summary(fit)        # regression output
> plot(t, x, type="o", xlab="year", ylab="temp deviation")
> abline(fit)         # add regression line to the plot

The linear model described by (2.1) above can be conveniently written in a more general notation by defining the column vectors zt = (zt1, zt2, . . . , ztq)′ and β = (β1, β2, . . . , βq)′, where ′ denotes transpose, so (2.1) can be written in the alternate form

xt = β′zt + wt, (2.2)

where wt ∼ iid(0, σ²w). It is natural to consider estimating the unknown coefficient vector β by minimizing the residual sum of squares

RSS = ∑_{t=1}^{n} (xt − β′zt)², (2.3)

with respect to β1, β2, . . . , βq. Minimizing RSS yields the ordinary least squares estimator. This minimization can be accomplished by differentiating (2.3) with respect to the vector β or by using the properties of projections. In the notation above, this procedure gives the normal equations

(∑_{t=1}^{n} zt z′t) β̂ = ∑_{t=1}^{n} zt xt. (2.4)

A further simplification of notation results from defining the matrix Z = (z1, z2, . . . , zn)′ as the n × q matrix composed of the n samples of the input variables and the observed n × 1 vector x = (x1, x2, . . . , xn)′. This identification yields

(Z′Z) β̂ = Z′x (2.5)

and the solution

β̂ = (Z′Z)⁻¹Z′x (2.6)

when the matrix Z′Z is of rank q. The minimized residual sum of squares (2.3) has the equivalent matrix forms

RSS = (x − Zβ̂)′(x − Zβ̂)
    = x′x − β̂′Z′x
    = x′x − x′Z(Z′Z)⁻¹Z′x, (2.7)

to give some useful versions for later reference. The ordinary least squares estimators are unbiased, i.e., E(β̂) = β, and have the smallest variance within the class of linear unbiased estimators.

If the errors wt are normally distributed (Gaussian), β̂ is also the maximum likelihood estimator for β and is normally distributed with

cov(β̂) = σ²w (∑_{t=1}^{n} zt z′t)⁻¹ = σ²w (Z′Z)⁻¹ = σ²w C, (2.8)

where

C = (Z′Z)⁻¹ (2.9)

is a convenient notation for later equations. An unbiased estimator for the variance σ²w is

s²w = RSS / (n − q), (2.10)


contrasted with the maximum likelihood estimator σ̂²w = RSS/n, which has the divisor n. Under the normal assumption, s²w is distributed proportionally to a chi-squared random variable with n − q degrees of freedom, denoted by χ²n−q, and independently of β̂. It follows that

tn−q = (β̂i − βi) / (sw √cii) (2.11)

has the t-distribution with n − q degrees of freedom; cii denotes the ith diagonal element of C, as defined in (2.9).

Table 2.1 Analysis of Variance for Regression

Source                  df        Sum of Squares         Mean Square
zt,q1+1, . . . , zt,q   q − q1    SSreg = RSS1 − RSS     MSreg = SSreg/(q − q1)
Error                   n − q     RSS                    s²w = RSS/(n − q)
Total                   n − q1    RSS1
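To make the preceding formulas concrete, the following is a minimal R sketch on simulated data (the coefficients, sample size, and seed are arbitrary assumptions, not from the text); it computes β̂ from (2.6), s²w from (2.10), and the standard errors and t statistics implied by (2.8)–(2.11), and can be checked against summary(lm(x ~ z2)).

> set.seed(1)                       # arbitrary seed (assumption)
> n = 50
> z2 = rnorm(n)
> x = 1 + 2*z2 + rnorm(n)           # simulated response
> Z = cbind(1, z2)                  # n x q design matrix, q = 2
> C = solve(t(Z) %*% Z)             # C = (Z'Z)^{-1}, as in (2.9)
> bhat = C %*% t(Z) %*% x           # beta-hat, as in (2.6)
> RSS = sum((x - Z %*% bhat)^2)     # residual sum of squares (2.3)
> s2w = RSS/(n - 2)                 # unbiased variance estimate (2.10)
> se = sqrt(s2w * diag(C))          # standard errors from (2.8)
> cbind(bhat, se, bhat/se)          # estimates, standard errors, t statistics (2.11)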

Various competing models are of interest to isolate or select the best subset of independent variables. Suppose a proposed model specifies that only a subset q1 < q independent variables, say, z1t = (zt1, zt2, . . . , ztq1)′, is influencing the dependent variable xt, so the model

xt = β′1 z1t + wt (2.12)

becomes the null hypothesis, where β1 = (β1, β2, . . . , βq1)′ is a subset of coefficients of the original q variables. We can test the reduced model (2.12) against the full model (2.2) by comparing the residual sums of squares under the two models using the F-statistic

Fq−q1,n−q = [(RSS1 − RSS)/(q − q1)] / [RSS/(n − q)], (2.13)

which has the central F-distribution with q − q1 and n − q degrees of freedom when (2.12) is the correct model. The statistic, which follows from applying the likelihood ratio criterion, has the improvement per number of parameters added in the numerator compared with the error sum of squares under the full model in the denominator. The information involved in the test procedure is often summarized in an Analysis of Variance (ANOVA) table as given in Table 2.1 for this particular case. The difference in the numerator is often called the regression sum of squares.

In terms of Table 2.1, it is conventional to write the F-statistic (2.13) as the ratio of the two mean squares, obtaining

Fq−q1,n−q = MSreg / s²w. (2.14)
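As a minimal illustration (simulated data, not from the text), the F statistic in (2.13) can be computed directly from the two residual sums of squares and checked against R's anova():

> set.seed(1)                            # arbitrary seed (assumption)
> n = 100
> z2 = rnorm(n); z3 = rnorm(n)
> x = 1 + 2*z2 + rnorm(n)                # z3 is irrelevant by construction
> full = lm(x ~ z2 + z3)
> reduced = lm(x ~ z2)
> RSS = sum(resid(full)^2); RSS1 = sum(resid(reduced)^2)
> ((RSS1 - RSS)/(3 - 2)) / (RSS/(n - 3)) # (2.13) with q = 3, q1 = 2
> anova(reduced, full)                   # reports the same F statistic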


A special case of interest is q1 = 1 and z1t = 1, so the model in (2.12) becomes

xt = β1 + wt,

and we may measure the proportion of variation accounted for by the other variables using

R²xz = (RSS0 − RSS) / RSS0, (2.15)

where the residual sum of squares under the reduced model,

RSS0 = ∑_{t=1}^{n} (xt − x̄)², (2.16)

in this case is just the sum of squared deviations from the mean x̄. The measure R²xz is also the squared multiple correlation between xt and the variables zt2, zt3, . . . , ztq.

The techniques discussed in the previous paragraph can be used to test various models against one another using the F test given in (2.13), (2.14), and the ANOVA table. These tests have been used in the past in a stepwise manner, where variables are added or deleted when the values from the F-test either exceed or fail to exceed some predetermined levels. The procedure, called stepwise multiple regression, is useful in arriving at a set of useful variables. An alternative is to focus on a procedure for model selection that does not proceed sequentially, but simply evaluates each model on its own merits. Suppose we consider a regression model with k coefficients and denote the maximum likelihood estimator for the variance as

σ̂²k = RSSk / n, (2.17)

where RSSk denotes the residual sum of squares under the model with k regression coefficients. Then, Akaike (1969, 1973, 1974) suggested measuring the goodness of fit for this particular model by balancing the error of the fit against the number of parameters in the model; we define

Definition 2.1 Akaike's Information Criterion (AIC)

AIC = ln σ̂²k + (n + 2k)/n, (2.18)

where σ̂²k is given by (2.17) and k is the number of parameters in the model.

The value of k yielding the minimum AIC specifies the best model. The idea is roughly that minimizing σ̂²k would be a reasonable objective, except that it decreases monotonically as k increases. Therefore, we ought to penalize the error variance by a term proportional to the number of parameters. The choice for the penalty term given by (2.18) is not the only one, and a considerable literature is available advocating different penalty terms. A corrected form, suggested by Sugiura (1978), and expanded by Hurvich and Tsai (1989), can be based on small-sample distributional results for the linear regression model (details are provided in Problems 2.4 and 2.5). The corrected form is defined as

Definition 2.2 AIC, Bias Corrected (AICc)

AICc = ln σ̂²k + (n + k)/(n − k − 2), (2.19)

where σ̂²k is given by (2.17), k is the number of parameters in the model, and n is the sample size.

We may also derive a correction term based on Bayesian arguments, as in Schwarz (1978), which leads to

Definition 2.3 Schwarz’s Information Criterion (SIC)

SIC = ln σ̂²k + (k ln n)/n, (2.20)

using the same notation as in Definition 2.2.

SIC is also called the Bayesian Information Criterion (BIC) (see also Rissanen, 1978, for an approach yielding the same statistic based on a minimum description length argument). Various simulation studies have tended to verify that SIC does well at getting the correct order in large samples, whereas AICc tends to be superior in smaller samples where the relative number of parameters is large (see McQuarrie and Tsai, 1998, for detailed comparisons). In fitting regression models, two measures that have been used in the past are adjusted R-squared, which is essentially s²w, and Mallows Cp, Mallows (1973), which we do not consider in this context.
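The criteria (2.18)–(2.20) are easily computed directly from a fitted regression. The following is a minimal R sketch (the simulated series and seed are illustrative assumptions, not from the text); note that these values differ from R's built-in AIC() by constants that do not affect the ranking of models fit to the same data.

> set.seed(1)                          # arbitrary seed (assumption)
> n = 100; t = 1:n
> x = 1 + .05*t + rnorm(n)             # simulated series with a linear trend
> fit = lm(x ~ t)
> k = length(coef(fit))                # number of regression coefficients
> sig2k = sum(resid(fit)^2)/n          # MLE of the variance, (2.17)
> log(sig2k) + (n + 2*k)/n             # AIC, (2.18)
> log(sig2k) + (n + k)/(n - k - 2)     # AICc, (2.19)
> log(sig2k) + k*log(n)/n              # SIC, (2.20)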

Example 2.2 Pollution, Temperature and Mortality

The data shown in Figure 2.2 are extracted series from a study by Shumway et al. (1988) of the possible effects of temperature and pollution on daily mortality in Los Angeles County. Note the strong seasonal components in all of the series, corresponding to winter-summer variations, and the downward trend in the cardiovascular mortality over the 10-year period.

A scatterplot matrix, shown in Figure 2.3, indicates a possible linear relation between mortality and the pollutant particulates and a possible relation to temperature. Note the curvilinear shape of the temperature mortality curve, indicating that higher temperatures as well as lower temperatures are associated with increases in cardiovascular mortality.


Figure 2.2 Average daily cardiovascular mortality (top), temperature (middle), and particulate pollution (bottom) in Los Angeles County. There are 508 six-day smoothed averages obtained by filtering daily values over the 10-year period 1970–1979.

Based on the scatterplot matrix, we entertain, tentatively, four models where Mt denotes cardiovascular mortality, Tt denotes temperature, and Pt denotes the particulate levels. They are

Mt = β0 + β1t + wt                                       (2.21)
Mt = β0 + β1t + β2(Tt − T·) + wt                         (2.22)
Mt = β0 + β1t + β2(Tt − T·) + β3(Tt − T·)² + wt          (2.23)
Mt = β0 + β1t + β2(Tt − T·) + β3(Tt − T·)² + β4Pt + wt   (2.24)

where we adjust temperature for its mean, T· = 74.6, to avoid scaling problems. It is clear that (2.21) is a trend only model, (2.22) is linear temperature, (2.23) is curvilinear temperature, and (2.24) is curvilinear temperature and pollution. We summarize some of the statistics given for this particular case in Table 2.2; the values of R² were computed by noting that RSS0 = 50,687 using (2.16).


Figure 2.3 Scatterplot matrix showing plausible relations between mortality, temperature, and pollution (panels: mortality, temperature, particulates).

Table 2.2 Summary Statistics for Mortality Models

Model    RSS (2.3)   s²w (2.10)   R² (2.15)   AICc (2.19)
(2.21)   40,020      79.09        .21         5.38
(2.22)   31,413      62.20        .38         5.14
(2.23)   27,985      55.52        .45         5.03
(2.24)   20,509      40.77        .60         4.72

We note that each model does substantially better than the one before it and that the model including both temperature, temperature squared, and particulates does the best, accounting for some 60% of the variability and with the best value for AICc. Note that one can compare any two models using the residual sums of squares and (2.13). Hence, a model with only trend could be compared to the full model using q = 5, q1 = 2, n = 508, so

F3,503 = [(40,020 − 20,509)/3] / [20,509/503] = 160,

which exceeds F3,∞(.001) = 5.42. We obtain the best prediction model,

M̂t = 81.59 − .027(.002) t − .473(.032)(Tt − 74.6) + .023(.003)(Tt − 74.6)² + .255(.019) Pt,

for mortality, where the standard errors, computed from (2.8)–(2.10), are given in parentheses. As expected, a negative trend is present in time as well as a negative coefficient for adjusted temperature. The quadratic effect of temperature can clearly be seen in the scatterplots of Figure 2.3. Pollution weights positively and can be interpreted as the incremental contribution to daily deaths per unit of particulate pollution. It would still be essential to check the residuals ŵt = Mt − M̂t for autocorrelation, but we defer this question to the section on correlated least squares, in which the incorporation of time correlation changes the estimated standard errors.

To display the scatterplot matrix, perform the final regression, and compute AIC in R, use the following commands:

> mort = scan("/mydata/cmort.dat")
> temp = scan("/mydata/temp.dat")
> part = scan("/mydata/part.dat")
> temp = temp - mean(temp)
> temp2 = temp^2
> t = 1:length(mort)
> fit = lm(mort~t + temp + temp2 + part)
> summary(fit)                       # results
> AIC(fit)/508                       # R gives n*AIC
> pairs(cbind(mort, temp, part))     # scatterplot matrix
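Continuing with the fit above, a minimal sketch of the model comparison and the residual check mentioned in this example (the choice of 48 lags is an arbitrary assumption):

> fit0 = lm(mort~t)           # reduced, trend-only model (2.21)
> anova(fit0, fit)            # F test of (2.21) against (2.24); compare with F = 160 above
> acf(resid(fit), 48)         # check the residuals for autocorrelation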

2.3 Exploratory Data Analysis

In general, it is necessary for time series data to be stationary, so averaging lagged products over time, as in the previous section, will be a sensible thing to do. With time series data, it is the dependence between the values of the series that is important to measure; we must, at least, be able to estimate autocorrelations with precision. It would be difficult to measure that dependence if the dependence structure is not regular or is changing at every time point.


Hence, to achieve any meaningful statistical analysis of time series data, it will be crucial that, if nothing else, the mean and the autocovariance functions satisfy the conditions of stationarity (for at least some reasonable stretch of time) stated in Definition 1.7. Often, this is not the case, and we will mention some methods in this section for playing down the effects of nonstationarity so the stationary properties of the series may be studied.

A number of our examples came from clearly nonstationary series. The Johnson & Johnson series in Figure 1.1 has a mean that increases exponentially over time, and the increase in the magnitude of the fluctuations around this trend causes changes in the covariance function; the variance of the process, for example, clearly increases as one progresses over the length of the series. Also, the global temperature series shown in Figure 1.2 contains some evidence of a trend over time; human-induced global warming advocates seize on this as empirical evidence to advance their hypothesis that temperatures are increasing.

Perhaps the easiest form of nonstationarity to work with is the trend stationary model wherein the process has stationary behavior around a trend. We may write this type of model as

xt = µt + yt (2.25)

where xt are the observations, µt denotes the trend, and yt is a stationary process. Quite often, strong trend, µt, will obscure the behavior of the stationary process, yt, as we shall see in numerous examples in Chapter 3. Hence, there is some advantage to removing the trend as a first step in an exploratory analysis of such time series. The steps involved are to obtain a reasonable estimate of the trend component, say µ̂t, and then work with the residuals

ŷt = xt − µ̂t. (2.26)

Consider the following example.

Example 2.3 Detrending Global Temperature

Here we suppose the model is of the form of (2.25),

xt = µt + yt,

where, as we suggested in the analysis of the global temperature data presented in Example 2.1, a straight line might be a reasonable model for the trend, i.e.,

µt = β1 + β2 t.

In that example, we estimated the trend using ordinary least squares¹ and found

µ̂t = −12.186 + .006 t.

¹Because the error term, yt, is not assumed to be iid, the reader may feel that weighted least squares is called for in this case. The problem is, we do not know the behavior of yt, and that is precisely what we are trying to assess at this stage. A notable result by Grenander and Rosenblatt (1957, Ch 7), however, is that under mild conditions on yt, for polynomial regression or periodic regression, asymptotically, ordinary least squares is equivalent to weighted least squares.

Figure 2.4 Detrended (top) and differenced (bottom) global temperature series. The original data are shown in Figures 1.2 and 2.1.

Figure 2.1 shows the data with the estimated trend line superimposed. To obtain the detrended series we simply subtract µ̂t from the observations, xt, to obtain the detrended series

ŷt = xt + 12.186 − .006 t.

The top graph of Figure 2.4 shows the detrended series. Figure 2.5 shows the ACF of the original data (top panel) as well as the ACF of the detrended data (middle panel).

To detrend in R, assuming the data are in gtemp:

> x = gtemp[45:142]    # use only 1900 to 1997
> t = 1900:1997
> fit = lm(x~t)        # detrended series in fit$resid
> plot(t, fit$resid, type="o", ylab="detrended gtemp")

In Example 1.11 and the corresponding Figure 1.10 we saw that a random walk might also be a good model for trend. That is, rather than modeling trend as fixed (as in Example 2.3), we might model trend as a stochastic component using the random walk with drift model,

µt = δ + µt−1 + wt, (2.27)


where wt is white noise and is independent of yt. If the appropriate model is (2.25), then differencing the data, xt, yields a stationary process; that is,

xt − xt−1 = (µt + yt) − (µt−1 + yt−1)
          = δ + wt + yt − yt−1. (2.28)

Figure 2.5 Sample ACFs of the global temperature (top), and of the detrended (middle) and the differenced (bottom) series.

We leave it as an exercise (Problem 2.7) to show (2.28) is stationary.²

²The key to establishing the stationarity of these types of processes is to recall that if U = ∑_{j=1}^{m} aj Xj and V = ∑_{k=1}^{r} bk Yk are linear combinations of the random variables {Xj} and {Yk}, respectively, then cov(U, V) = ∑_{j=1}^{m} ∑_{k=1}^{r} aj bk cov(Xj, Yk).

One advantage of differencing over detrending to remove trend is that no parameters are estimated in the differencing operation. One disadvantage, however, is that differencing does not yield an estimate of the stationary process yt, as can be seen in (2.28). If an estimate of yt is essential, then detrending may be more appropriate. If the goal is to coerce the data to stationarity, then differencing may be more appropriate. Differencing is also a viable tool if the trend is fixed, as in Example 2.3. That is, e.g., if µt = β1 + β2 t in the model (2.25), differencing the data produces stationarity (see Problem 2.6):

xt − xt−1 = (µt + yt) − (µt−1 + yt−1) = β2 + yt − yt−1.

Because differencing plays a central role in time series analysis, it receives its own notation. The first difference is denoted as

∇xt = xt − xt−1. (2.29)

As we have seen, the first difference eliminates a linear trend. A second difference, that is, the difference of (2.29), can eliminate a quadratic trend, and so on. In order to define higher differences, we need a variation in notation that we use, for the first time here, and often in our discussion of ARIMA models in Chapter 3.

Definition 2.4 We define the backshift operator by

B xt = xt−1

and extend it to powers B²xt = B(Bxt) = B xt−1 = xt−2, and so on. Thus,

B^k xt = xt−k. (2.30)

It is clear that we may then rewrite (2.29) as

∇xt = (1 − B)xt, (2.31)

and we may extend the notion further. For example, the second difference becomes

∇²xt = (1 − B)²xt = (1 − 2B + B²)xt = xt − 2xt−1 + xt−2

by the linearity of the operator. To check, just take the difference of the first difference: ∇(∇xt) = ∇(xt − xt−1) = (xt − xt−1) − (xt−1 − xt−2).

Definition 2.5 Differences of order d are defined as

∇^d = (1 − B)^d, (2.32)

where we may expand the operator (1 − B)^d algebraically to evaluate for higher integer values of d. When d = 1, we drop it from the notation.
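As a quick illustration of (2.29)–(2.32), the following minimal R sketch (simulated series; the trend coefficients and seed are arbitrary assumptions) shows that one difference removes a linear trend and two differences remove a quadratic trend:

> set.seed(1)                              # arbitrary seed (assumption)
> t = 1:200
> x1 = 1 + .5*t + rnorm(200)               # linear trend plus noise
> x2 = 1 + .5*t + .02*t^2 + rnorm(200)     # quadratic trend plus noise
> plot.ts(diff(x1))                        # first difference: roughly stationary around .5
> plot.ts(diff(x2, differences=2))         # second difference: roughly stationary around .04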

The first difference (2.29) is an example of a linear filter applied to eliminate a trend. Other filters, formed by averaging values near xt, can produce adjusted series that eliminate other kinds of unwanted fluctuations, as in Chapter 3. The differencing technique is an important component of the ARIMA model of Box and Jenkins (1970) (see also Box et al., 1994), to be discussed in Chapter 3.


Example 2.4 Differencing Global Temperature

The first difference of the global temperature series, also shown in Figure 2.4, does not contain the long middle cycle we observe in the detrended series. The ACF of this series is also shown in Figure 2.5. In this case it appears that the differenced process may be white noise, which implies that the global temperature series is a random walk. Finally, notice that removing trend by detrending (i.e., regression techniques) produces different results than removing trend by differencing.

Continuing from Example 2.3, to difference and plot the data in R:

> x = gtemp[44:142]    # start at 1899
> plot(1900:1997, diff(x), type="o", xlab="year")

An alternative to differencing is a less-severe operation that still assumes stationarity of the underlying time series. This alternative, called fractional differencing, extends the notion of the difference operator (2.32) to fractional powers −.5 < d < .5, which still define stationary processes. Granger and Joyeux (1980) and Hosking (1981) introduced long memory time series, which corresponds to the case when 0 < d < .5. This model is often used for environmental time series arising in hydrology. We will discuss long memory processes in more detail in §5.2.

Often, obvious aberrations are present that can contribute nonstationary as well as nonlinear behavior in observed time series. In such cases, transformations may be useful to equalize the variability over the length of a single series. A particularly useful transformation is

yt = ln xt, (2.33)

which tends to suppress larger fluctuations that occur over portions of the series where the underlying values are larger. Other possibilities are power transformations in the Box–Cox family of the form

yt = (xt^λ − 1)/λ   for λ ≠ 0;    yt = ln xt   for λ = 0. (2.34)

Methods for choosing the power λ are available (see Johnson and Wichern, 1992, §4.7) but we do not pursue them here. Often, transformations are also used to improve the approximation to normality or to improve linearity in predicting the value of one series from another.
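One direct way to experiment with (2.34) is to code it as a small function; the following is a minimal sketch (the helper name bc and the illustrative series are our own assumptions, not from the text):

> bc = function(x, lambda){                # Box-Cox transform, as in (2.34)
+   if (lambda == 0) log(x) else (x^lambda - 1)/lambda
+ }
> x = exp(rnorm(100))                      # an illustrative positive-valued series
> y0 = bc(x, 0)                            # the log transform (2.33)
> y5 = bc(x, .5)                           # a milder power transform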

Example 2.5 Paleoclimatic Glacial Varves

Melting glaciers deposit yearly layers of sand and silt during the spring melting seasons, which can be reconstructed yearly over a period ranging from the time deglaciation began in New England (about 12,600 years ago) to the time it ended (about 6,000 years ago). Such sedimentary deposits, called varves, can be used as proxies for paleoclimatic parameters, such as temperature, because, in a warm year, more sand and silt are deposited from the receding glacier. Figure 2.6 shows the thicknesses of the yearly varves collected from one location in Massachusetts for 634 years, beginning 11,834 years ago. For further information, see Shumway and Verosub (1992). Because the variation in thicknesses increases in proportion to the amount deposited, a logarithmic transformation could remove the nonstationarity observable in the variance as a function of time. Figure 2.6 shows the original and transformed varves, and it is clear that this improvement has occurred. We may also plot the histogram of the original and transformed data, as in Problem 2.8, to argue that the approximation to normality is improved. The ordinary first differences (2.31) are also computed in Problem 2.8, and we note that the first differences have a significant negative correlation at lag h = 1. Later, in Chapter 5, we will show that perhaps the varve series has long memory and will propose using fractional differencing.

Figure 2.6 Glacial varve thicknesses (top) from Massachusetts for n = 634 years compared with log transformed thicknesses (bottom).
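A minimal R sketch of the transformation in this example, assuming the varve thicknesses are available in a plain-text file; the file name below follows the path convention used elsewhere in this chapter but is an assumption here:

> varve = scan("/mydata/varve.dat")        # file location is an assumption
> par(mfrow=c(2,1))
> plot.ts(varve, ylab="varve thickness")
> plot.ts(log(varve), ylab="log(varve)")   # variability is much more stable on the log scale
> acf(diff(log(varve)))                    # first differences of the logs, cf. Problem 2.8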

Next, we consider another preliminary data processing technique that is used for the purpose of visualizing the relations between series at different lags, namely, scatterplot matrices. In the definition of the ACF, we are essentially interested in relations between xt and xt−h; the autocorrelation function tells us whether a substantial linear relation exists between the series and its own lagged values. The ACF gives a profile of the linear correlation at all possible lags and shows which values of h lead to the best predictability. The restriction of this idea to linear predictability, however, may mask a possible nonlinear relation between current values, xt, and past values, xt−h. To check for nonlinear relations of this form, it is convenient to display a lagged scatterplot matrix, as in Figure 2.7, that displays values of xt on the vertical axis plotted against xt−h on the horizontal axis for the SOI xt. Similarly, we might want to look at values of one series yt plotted against another series at various lags, xt−h, to look for possible nonlinear relations between the two series. Because, for example, we might wish to predict the Recruitment series, say, yt, from current or past values of the SOI series, xt−h, for h = 0, 1, 2, . . ., it would be worthwhile to examine the scatterplot matrix. Figure 2.8 shows the lagged scatterplot of the Recruitment series yt on the vertical axis plotted against the SOI index xt−h on the horizontal axis.

Figure 2.7 Scatterplot matrix relating current SOI values (xt) to past SOI values (xt−h) at lags h = 1, 2, . . . , 12.

Figure 2.8 Scatterplot matrix of the Recruitment series, yt, on the vertical axis plotted against the SOI series, xt−h, on the horizontal axis at lags h = 0, 1, . . . , 8.

Example 2.6 Scatterplot Matrices, SOI, and Recruitment Series

Consider the possibility of looking for nonlinear functional relations at lags in the SOI series, xt−h, for h = 0, 1, 2, . . ., and the Recruitment series, yt. Noting first the top panel in Figure 2.7, we see strong positive and linear relations at lags h = 1, 2, 11, 12, that is, between xt and xt−1, xt−2, xt−11, xt−12, and a negative linear relation at lags h = 6, 7. These results match up well with peaks noticed in the ACF in Figure 1.14. Figure 2.8 shows linearity in relating Recruitment, yt, with the SOI series at xt−5, xt−6, xt−7, xt−8, indicating the SOI series tends to lead the Recruitment series and the coefficients are negative, implying that increases in the SOI lead to decreases in the Recruitment, and vice versa. Some possible nonlinear behavior shows as the relation tends to flatten out at both extremes, indicating a logistic type transformation may be useful.

To reproduce Figure 2.7 in R assuming the data are in soi and rec as before:

> lag.plot(soi, lags=12, layout=c(3,4), diag=F)

Reproducing Figure 2.8 in R is not as easy, but here is how the figure was generated:

> soi = ts(soi)                              # make the series
> rec = ts(rec)                              # time series objects
> par(mfrow=c(3,3), mar=c(2.5, 4, 4, 1))     # set up plot area
> for(h in 0:8){                             # loop through lags 0-8
+   plot(lag(soi,-h), rec, main=paste("soi(t-",h,")",sep=""),
+        ylab="rec(t)", xlab="")
+ }

As a final exploratory tool, we discuss assessing periodic behavior in time series data using regression analysis and the periodogram; this material may be thought of as an introduction to spectral analysis, which we discuss in detail in Chapter 4. In Example 1.12, we briefly discussed the problem of identifying cyclic or periodic signals in time series. A number of the time series we have seen so far exhibit periodic behavior. For example, the data from the pollution study example shown in Figure 2.2 exhibit strong yearly cycles. Also, the Johnson & Johnson data shown in Figure 1.1 make one cycle every year (four quarters) on top of an increasing trend, and the speech data in Figure 1.2 is highly repetitive. The monthly SOI and Recruitment series in Figure 1.6 show strong yearly cycles, but hidden in the series are clues to the El Niño cycle.

Example 2.7 Using Regression to Discover a Signal in Noise

Recall, in Example 1.12 we generated n = 500 observations from the model

xt = A cos(2πωt + φ) + wt, (2.35)

where ω = 1/50, A = 2, φ = .6π, and σw = 5; the data are shown on the bottom panel of Figure 1.11. At this point we assume the frequency of oscillation ω = 1/50 is known, but A and φ are unknown parameters. In this case the parameters appear in (2.35) in a nonlinear way, so we use a trigonometric identity and write

A cos(2πωt + φ) = A cos(φ) cos(2πωt) − A sin(φ) sin(2πωt)
                = β1 cos(2πωt) + β2 sin(2πωt),

where β1 = A cos(φ) and β2 = −A sin(φ). Now the model (2.35) can be written in the usual linear regression form given by (no intercept term is needed here)

xt = β1 cos(2πt/50) + β2 sin(2πt/50) + wt. (2.36)

Using linear regression on the generated data, the fitted model is

x̂t = −.84(.32) cos(2πt/50) − 1.99(.32) sin(2πt/50) (2.37)

with σ̂w = 5.08, where the values in parentheses are the standard errors. We note the actual values of the coefficients for this example are β1 = 2 cos(.6π) = −.62 and β2 = −2 sin(.6π) = −1.90. Because the parameter estimates are significant and close to the actual values, it is clear that we are able to detect the signal in the noise using regression, even though the signal appears to be obscured by the noise in the bottom panel of Figure 1.11. Figure 2.9 shows data generated by (2.35) with the fitted line, (2.37), superimposed.

Figure 2.9 Data generated by (2.35) [dashed line] with the fitted [solid] line, (2.37), superimposed.
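A minimal R sketch of the fit in (2.36)–(2.37); the seed is an arbitrary assumption, so the estimates will only roughly reproduce the values quoted above:

> set.seed(90210)                                 # arbitrary seed (assumption)
> t = 1:500
> x = 2*cos(2*pi*t/50 + .6*pi) + rnorm(500,0,5)   # data generated as in (2.35)
> c1 = cos(2*pi*t/50); s1 = sin(2*pi*t/50)
> fit = lm(x ~ 0 + c1 + s1)                       # no intercept, as in (2.36)
> summary(fit)                                    # estimates of beta1 and beta2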

Example 2.8 Using the Periodogram to Discover a Signal in Noise

The analysis in Example 2.7 may seem like cheating because we assumed we knew the value of the frequency parameter ω. If we do not know ω, we could try to fit the model (2.35) using nonlinear regression with ω as a parameter. Another method is to try various values of ω in a systematic way. Using the regression results of §2.2 (also, see Problem 4.10), we can show the estimated regression coefficients in Example 2.7 take on the special form³ given by

β̂1 = ∑_{t=1}^{n} xt cos(2πt/50) / ∑_{t=1}^{n} cos²(2πt/50) = (2/n) ∑_{t=1}^{n} xt cos(2πt/50); (2.38)

β̂2 = ∑_{t=1}^{n} xt sin(2πt/50) / ∑_{t=1}^{n} sin²(2πt/50) = (2/n) ∑_{t=1}^{n} xt sin(2πt/50). (2.39)

This suggests looking at all possible regression parameter estimates, say

β̂1(j/n) = (2/n) ∑_{t=1}^{n} xt cos(2πt j/n); (2.40)

β̂2(j/n) = (2/n) ∑_{t=1}^{n} xt sin(2πt j/n), (2.41)

where n = 500 and j = 1, . . . , n/2 − 1, and inspecting the results for large values. For the endpoints, j = 0 and j = n/2, we have β̂1(0) = n⁻¹ ∑_{t=1}^{n} xt, β̂1(n/2) = n⁻¹ ∑_{t=1}^{n} (−1)^t xt, and β̂2(0) = β̂2(n/2) = 0.

For this particular example, the values calculated in (2.38) and (2.39) are β̂1(10/500) and β̂2(10/500). By doing this, we have regressed a series, xt, of length n using n regression parameters, so that we will have a perfect fit. The point, however, is that if the data contain any cyclic behavior we are likely to catch it by performing these saturated regressions.

Next, note that the regression coefficients β̂1(j/n) and β̂2(j/n), for each j, are essentially measuring the correlation of the data with a sinusoid oscillating at j cycles in n time points.⁴ Hence, an appropriate measure of the presence of a frequency of oscillation of j cycles in n time points in the data would be

P(j/n) = β̂1²(j/n) + β̂2²(j/n), (2.42)

which is basically a measure of squared correlation. The quantity (2.42) is sometimes called the periodogram, but we will call P(j/n) the scaled periodogram and we will investigate its properties in Chapter 4. Figure 2.10 shows the scaled periodogram for the data generated by (2.35), and it easily discovers the periodic component with frequency ω = .02 = 10/500 even though it is difficult to visually notice that component in Figure 1.11 due to the noise.

³In the notation of §2.2, the estimates are ∑_{t=1}^{n} xt zt / ∑_{t=1}^{n} zt². Here, zt = cos(2πt/50) or zt = sin(2πt/50).

⁴In the notation of §2.2, the regression coefficients (2.40) and (2.41) are of the form ∑_t xt zt / ∑_t zt², whereas sample correlations are of the form ∑_t xt zt / (∑_t xt² ∑_t zt²)^{1/2}.

Figure 2.10 The scaled periodogram, (2.42), of the 500 observations generated by (2.35). The data are displayed in Figures 1.11 and 2.9.

Finally, we mention that it is not necessary to run a large regression

xt = ∑_{j=0}^{n/2} [β̂1(j/n) cos(2πt j/n) + β̂2(j/n) sin(2πt j/n)] (2.43)

to obtain the values of β̂1(j/n) and β̂2(j/n) [with β̂2(0) = β̂2(1/2) = 0] because they can be computed quickly if n (assumed even here) is a highly composite integer. There is no error in (2.43) because there are n observations and n parameters; the regression fit will be perfect. The discrete Fourier transform (DFT) is a complex-valued weighted average of the data given by

d(j/n) = n^{−1/2} ∑_{t=1}^{n} xt exp(−2πi t j/n), (2.44)

and values j/n are called the Fourier or fundamental frequencies. Because of a large number of redundancies in the calculation, (2.44) may be computed quickly using the fast Fourier transform (FFT), which is available in many computing packages such as Matlab, S-PLUS and R. We note that⁵

|d(j/n)|² = (1/n) (∑_{t=1}^{n} xt cos(2πt j/n))² + (1/n) (∑_{t=1}^{n} xt sin(2πt j/n))² (2.45)

⁵e^{−iα} = cos(α) − i sin(α), and if z = a − ib, then |z|² = z z̄ = (a − ib)(a + ib) = a² + b².


and it is this quantity that is called the periodogram; we will write

I(j/n) = |d(j/n)|².

So, we may calculate the scaled periodogram, (2.42), using the periodogram as

P(j/n) = (4/n) I(j/n). (2.46)

We will discuss this approach in more detail and provide examples with data in Chapter 4.

A figure similar to Figure 2.10 can be created in R using the following commands⁶:

> t = 1:500
> x = 2*cos(2*pi*t/50 + .6*pi) + rnorm(500,0,5)
> I = abs(fft(x)/sqrt(500))^2       # the periodogram
> P = (4/500)*I                     # the scaled periodogram
> f = 0:250/500
> plot(f, P[1:251], type="l", xlab="frequency", ylab=" ")
> abline(v=seq(0,.5,.02), lty="dotted")

Example 2.9 The Periodogram as a Matchmaker

Another way of understanding the results of the previous example is to consider the problem of matching the data with sinusoids oscillating at various frequencies. For example, Figure 2.11 shows n = 100 observations (as a solid line) generated by the model

xt = cos(2πt [2/100]) + wt, (2.47)

where wt is Gaussian white noise with σw = 1. Superimposed on xt are cosines oscillating at frequency 1/100, 2/100, and 3/100 (shown as dashed lines). Also included in the figure are correlations of xt with the particular cosine, cos(2πt j/100), for j = 1, 2, 3. Note that the data match up well with the cosine oscillating at 2 cycles every 100 points (with a correlation of .57), whereas the data do not match up well with the other two cosines. For example, in the top panel of Figure 2.11, there is a decreasing trend in the data until observation 25, and then the data start an increasing trend to observation 50, whereas the cosine making one cycle (1/100) continues to decrease until observation 50.
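A minimal R sketch of this matching idea (the seed is an arbitrary assumption, so the correlations will differ somewhat from those shown in Figure 2.11):

> set.seed(1)                                         # arbitrary seed (assumption)
> t = 1:100
> x = cos(2*pi*t*2/100) + rnorm(100)                  # data generated as in (2.47)
> sapply(1:3, function(j) cor(x, cos(2*pi*t*j/100)))  # match at j = 1, 2, 3 cycles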

⁶Different packages scale the FFT differently; consult the documentation. R calculates (2.44) without scaling by n^{−1/2}.


Figure 2.11 Data generated by (2.47) represented as a solid line with cosines oscillating at various frequencies superimposed (dashed lines): one cycle, j/n = 1/100, correlation .02 (top); two cycles, j/n = 2/100, correlation .57 (middle); three cycles, j/n = 3/100, correlation −.00 (bottom). The correlation indicates the degree to which the two series line up.

2.4 Smoothing in the Time Series Context

In §1.4, we introduced the concept of smoothing a time series, and in Example 1.9, we discussed using a moving average to smooth white noise. This method is useful in discovering certain traits in a time series, such as long-term trend and seasonal components. In particular, if xt represents the observations, then

mt = ∑_{j=−k}^{k} aj xt−j, (2.48)

where aj = a−j ≥ 0 and ∑_{j=−k}^{k} aj = 1, is a symmetric moving average of the data.

Example 2.10 Moving Average Smoother

For example, Figure 2.12 shows the weekly mortality series discussed in Example 2.2, a five-point moving average (which is essentially a monthly average with k = 2) that helps bring out the seasonal component, and a 53-point moving average (which is essentially a yearly average with k = 26) that helps bring out the (negative) trend in cardiovascular mortality. In both cases, the weights, a−k, . . . , a0, . . . , ak, we used were all the same, and equal to 1/(2k + 1).⁷

⁷Sometimes, the end weights, a−k and ak, are set equal to half the value of the other weights.

Figure 2.12 The weekly cardiovascular mortality series discussed in Example 2.2, smoothed using a five-week moving average and a 53-week moving average.

To reproduce Figure 2.12 in R assuming the mortality series is in mort:

> t = 1:length(mort)
> ma5 = filter(mort, sides=2, rep(1,5)/5)
> ma53 = filter(mort, sides=2, rep(1,53)/53)
> plot(t, mort, xlab="week", ylab="mortality")
> lines(ma5)
> lines(ma53)

Many other techniques are available for smoothing time series data based on methods from scatterplot smoothers. The general setup for a time plot is

xt = ft + yt, (2.49)

where ft is some smooth function of time, and yt is a stationary process. We may think of the moving average smoother mt, given in (2.48), as an estimator of ft. An obvious choice for ft in (2.49) is polynomial regression

ft = β0 + β1 t + · · · + βp t^p. (2.50)

We have seen the results of a linear fit on the global temperature data in Example 2.1. For periodic data, one might employ periodic regression

ft = α0 + α1 cos(2πω1t) + β1 sin(2πω1t) + · · · + αp cos(2πωpt) + βp sin(2πωpt), (2.51)


Figure 2.13 The weekly cardiovascular mortality series with a cubic trend and cubic trend plus periodic regression.

where ω1, . . . , ωp are distinct, specified frequencies. In addition, one might consider combining (2.50) and (2.51). These smoothers can be applied using classical linear regression.

Example 2.11 Polynomial and Periodic Regression Smoothers

Figure 2.13 shows the weekly mortality series with an estimated (via ordinary least squares) cubic smoother

f̂t = β̂0 + β̂1 t + β̂2 t² + β̂3 t³

superimposed to emphasize the trend, and an estimated (via ordinary least squares) cubic smoother plus a periodic regression

f̂t = β̂0 + β̂1 t + β̂2 t² + β̂3 t³ + α̂1 cos(2πt/52) + α̂2 sin(2πt/52)

superimposed to emphasize trend and seasonality.

The R commands for this example are:

> t = 1:length(mort)
> t2 = t^2
> t3 = t^3
> c = cos(2*pi*t/52)
> s = sin(2*pi*t/52)
> fit1 = lm(mort~t + t2 + t3)
> fit2 = lm(mort~t + t2 + t3 + c + s)
> plot(t, mort)
> lines(fit1$fit)
> lines(fit2$fit)


Figure 2.14 Kernel smoothers of the mortality data.

Modern regression techniques can be used to fit general smoothers to the pairs of points (t, xt) where the estimate of ft is smooth. Many of the techniques can easily be applied to time series data using the R or S-PLUS statistical packages; see Venables and Ripley (1994, Chapter 10) for details on applying these methods in S-PLUS (R is similar). A problem with the techniques used in Example 2.11 is that they assume ft is the same function over the range of time, t; we might say that the technique is global. The moving average smoothers in Example 2.10 fit the data better because the technique is local; that is, moving average smoothers allow for the possibility that ft is a different function over time. We describe some other local methods in the following examples.

Example 2.12 Kernel Smoothing

Kernel smoothing is a moving average smoother that uses a weight function, or kernel, to average the observations. Figure 2.14 shows kernel smoothing of the mortality series, where ft in (2.49) is estimated by

f̂t = ∑_{i=1}^{n} wt(i) xi, (2.52)

where

wt(i) = K

(t − i

b

)/ n∑j=1

K

(t − j

b

). (2.53)

This estimator is called the Naradaya–Watson estimator (Watson, 1966).In (2.53), K(·) is a kernel function; typically, the normal kernel, K(z) =

1√2π

exp(−z2/2), is used. To implement this in R, use the ksmooth func-tion. The wider the bandwidth, b, the smoother the result. In Fig-ure 2.14, the values of b for this example were b = 10 (roughly weighted

Page 87: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

2.4: Smoothing 75

monthly averages; that is, b/2 is the inner quartile range of the kernel) forthe seasonal component, and b = 104 (roughly weighted yearly averages)for the trend component.

Figure 2.14 can be reproduced in R (or S-PLUS) as follows; we assumet and mort are available from the previous example:

> plot(t, mort)
> lines(ksmooth(t, mort, "normal", bandwidth=5))
> lines(ksmooth(t, mort, "normal", bandwidth=104))
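For readers who want to see the weights in (2.52) and (2.53) at work, here is a minimal sketch (ours, not from the text) that computes the normal-kernel weights directly and overlays the result; it assumes t and mort are available from the earlier example, and the bandwidth b is chosen only for illustration (note that it is not on the same scale as the bandwidth argument of ksmooth, so the curves will not match exactly).

# minimal sketch: hand-rolled Nadaraya-Watson smoother, equations (2.52)-(2.53)
nw.smooth = function(x, b) {
  n = length(x)
  fhat = numeric(n)
  for (k in 1:n) {
    w = dnorm((k - 1:n)/b)          # normal kernel weights K((k - i)/b), i = 1, ..., n
    fhat[k] = sum(w * x) / sum(w)   # (2.52) with normalized weights (2.53)
  }
  fhat
}
plot(t, mort)
lines(nw.smooth(mort, b = 26))      # illustrative bandwidth; larger b gives a smoother curve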

Example 2.13 Nearest Neighbor and Locally Weighted Regression

Another approach to smoothing a time plot is nearest neighbor regression. The technique is based on k-nearest neighbors linear regression, wherein one uses the data {xt−k/2, . . . , xt, . . . , xt+k/2} to predict xt using linear regression; the result is ft. For example, Figure 2.15 shows cardiovascular mortality and the nearest neighbor method using the R (or S-PLUS) smoother supsmu. We used k = n/2 to estimate the trend and k = n/100 to estimate the seasonal component. In general, supsmu uses a variable window for smoothing (see Friedman, 1984), but it can be used for correlated data by fixing the smoothing window, as was done here.

Lowess is a method of smoothing that is rather complex, but the basic idea is close to nearest neighbor regression. Figure 2.15 shows smoothing of mortality using the R or S-PLUS function lowess (see Cleveland, 1979). First, a certain proportion of nearest neighbors to xt are included in a weighting scheme; values closer to xt in time get more weight. Then, a robust weighted regression is used to predict xt and obtain the smoothed estimate of ft. The larger the fraction of nearest neighbors included, the smoother the estimate ft will be. In Figure 2.15, the smoother uses about two-thirds of the data to obtain an estimate of the trend component, and the seasonal component uses 2% of the data.

Figure 2.15 can be reproduced in R or S-PLUS as follows (assuming t and mort are available from the previous example):

> par(mfrow=c(2,1))
> plot(t, mort, main="nearest neighbor")
> lines(supsmu(t, mort, span=.5))
> lines(supsmu(t, mort, span=.01))
> plot(t, mort, main="lowess")
> lines(lowess(t, mort, .02))
> lines(lowess(t, mort, 2/3))


Figure 2.15 Nearest neighbor (supsmu) and locally weighted least squares (lowess) smoothers of the mortality data.

Example 2.14 Smoothing Splines

An extension of polynomial regression is to first divide time t = 1, . . . , n, into k intervals, [t0 = 1, t1], [t1 + 1, t2], . . . , [tk−1 + 1, tk = n]. The values t0, t1, . . . , tk are called knots. Then, in each interval, one fits a regression of the form (2.50); typically, p = 3, and this is called cubic splines.

A related method is smoothing splines, which minimizes a compromise between the fit and the degree of smoothness given by
$$\sum_{t=1}^{n} [x_t - f_t]^2 + \lambda \int \left(f_t''\right)^2 dt, \tag{2.54}$$
where ft is a cubic spline with a knot at each t. The degree of smoothness is controlled by λ > 0. Figure 2.16 shows smoothing splines on mortality using λ = 10⁻⁷ for the seasonal component, and λ = 0.1 for the trend.


Figure 2.16 Smoothing splines fit to the mortality data.

Figure 2.16 can be reproduced in R or S-PLUS as follows (assuming t and mort are available from the previous example):

> plot(t, mort)
> lines(smooth.spline(t, mort, spar=.0000001))
> lines(smooth.spline(t, mort, spar=.1))

Example 2.15 Smoothing One Series as a Function of Another

In addition to smoothing time plots, smoothing techniques can be applied to smoothing a time series as a function of another time series. In this example, we smooth the scatterplot of two contemporaneously measured time series, mortality as a function of temperature. In Example 2.2, we discovered a nonlinear relationship between mortality and temperature. Continuing along these lines, Figure 2.17 shows scatterplots of mortality, Mt, and temperature, Tt, along with Mt smoothed as a function of Tt using lowess and smoothing splines. In both cases, mortality increases at extreme temperatures, but in an asymmetric way; mortality is higher at colder temperatures than at hotter temperatures. The minimum mortality rate seems to occur at approximately 80°F.

Figure 2.17 can be reproduced in R or S-PLUS as follows (assuming mort and temp contain the mortality and temperature data):

> par(mfrow=c(2,1))
> plot(temp, mort, main="lowess")
> lines(lowess(temp, mort))
> plot(temp, mort, main="smoothing splines")
> lines(smooth.spline(temp, mort))


Figure 2.17 Smoothers of mortality as a function of temperature using lowess and smoothing splines.

As a final word of caution, the methods mentioned above do not particularly take into account the fact that the data are serially correlated, and most of the techniques mentioned have been designed for independent observations. That is, for example, the smoothers shown in Figure 2.17 are calculated under the false assumption that the pairs (Mt, Tt), for t = 1, . . . , 508, are iid pairs of observations. In addition, the degree of smoothness used in the previous examples was chosen arbitrarily to bring out what might be considered obvious features in the data set.


Problems

Section 2.2

2.1 For the Johnson & Johnson data, say yt, for t = 1, . . . , 84, shown in Figure 1.1, let xt = ln(yt).

(a) Fit the regression model

xt = βt + α1Q1(t) + α2Q2(t) + α3Q3(t) + α4Q4(t) + wt

where Qi(t) = 1 if time t corresponds to quarter i = 1, 2, 3, 4, and zero otherwise. The Qi(t)'s are called indicator variables. We will assume for now that wt is a Gaussian white noise sequence. What is the interpretation of the parameters β, α1, α2, α3, and α4? [Note: In R, to regress x on z without an intercept, use lm(x~0+z); an easy way to generate Q1(t) is Q1=rep(c(1,0,0,0),21).]

(b) What happens if you include an intercept term in the model in (a)?

(c) Graph the data, xt, and superimpose the fitted values, say $\hat{x}_t$, on the graph. Examine the residuals, $x_t - \hat{x}_t$, and state your conclusions. Does it appear that the model fits the data well?
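The following is a minimal sketch of one way to set up part (a) in R, under the assumption that the quarterly Johnson & Johnson series is available as an object named jj (as in the Chapter 1 examples); it is offered as a starting point, not as the official solution.

# sketch: regression of log(J&J) on trend plus quarterly indicators (no intercept)
x  = log(jj)                      # assumes jj holds the quarterly series, n = 84
t  = 1:84
Q1 = rep(c(1,0,0,0), 21)          # indicator for quarter 1
Q2 = rep(c(0,1,0,0), 21)
Q3 = rep(c(0,0,1,0), 21)
Q4 = rep(c(0,0,0,1), 21)
fit = lm(x ~ 0 + t + Q1 + Q2 + Q3 + Q4)   # no intercept, as suggested in the note
summary(fit)
plot(t, x, type="l")              # data with fitted values superimposed
lines(t, fitted(fit), lty=2)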

2.2 For the mortality data examined in Example 2.2:

(a) Add another component to the regression in (2.24) that accounts for the particulate count four weeks prior; that is, add Pt−4 to the regression in (2.24). State your conclusion. [Note: In R, make sure the data are time series objects by using the ts() command, e.g., mort=ts(mort). Center the temperature series and let t = ts(1:length(mort)). Then use ts.intersect(mort, t, temp, temp^2, part, lag(part,-4)) to combine the series into a time series matrix object with six columns and regress the first column on the other columns.]

(b) Draw a scatterplot matrix of Mt, Tt, Pt and Pt−4 and then calculate the pairwise correlations between the series. Compare the relationship between Mt and Pt versus Mt and Pt−4.

2.3 Generate a random walk with drift, (1.4), of length n = 500 with δ = .1 and σw = 1. Call the data xt for t = 1, . . . , 500. Fit the regression xt = βt + wt using least squares. Plot the data, the mean function (i.e., µt = .1 t) and the fitted line, $\hat{x}_t = \hat{\beta}\, t$, on the same graph. Discuss your results.
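A minimal sketch of how this simulation and fit might be carried out in R is given below; the variable names are ours, and the random walk construction follows (1.4) with drift δ = .1.

# sketch: random walk with drift, then least squares fit of x_t = beta*t
set.seed(1)                  # arbitrary seed for reproducibility
n  = 500
w  = rnorm(n, 0, 1)          # white noise, sigma_w = 1
x  = cumsum(w + .1)          # random walk with drift delta = .1, as in (1.4)
t  = 1:n
fit = lm(x ~ 0 + t)          # regression through the origin: x_t = beta*t + w_t
plot(t, x, type="l")
lines(t, .1*t, lty=2)        # true mean function mu_t = .1 t
lines(t, fitted(fit), lty=3) # fitted line beta-hat * t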

2.4 Kullback-Leibler Information. Given the random vector y, we define the information for discriminating between two densities in the same family, indexed by a parameter θ, say f(y; θ1) and f(y; θ2), as
$$I(\theta_1;\theta_2) = \frac{1}{n}\, E_1 \ln \frac{f(\mathbf{y};\theta_1)}{f(\mathbf{y};\theta_2)}, \tag{2.55}$$
where E1 denotes expectation with respect to the density determined by θ1. For the Gaussian regression model, the parameters are θ = (β′, σ²)′. Show that we obtain
$$I(\theta_1;\theta_2) = \frac{1}{2}\left(\frac{\sigma_1^2}{\sigma_2^2} - \ln\frac{\sigma_1^2}{\sigma_2^2} - 1\right) + \frac{1}{2}\,\frac{(\beta_1-\beta_2)'\, Z'Z\, (\beta_1-\beta_2)}{n\,\sigma_2^2} \tag{2.56}$$
in that case.

2.5 Model Selection. Both selection criteria (2.18) and (2.19) are derived from information theoretic arguments, based on the well-known Kullback–Leibler discrimination information numbers (see Kullback and Leibler, 1951, Kullback, 1978). We give an argument due to Hurvich and Tsai (1989). We think of the measure (2.56) as measuring the discrepancy between the two densities, characterized by the parameter values $\theta_1' = (\beta_1', \sigma_1^2)'$ and $\theta_2' = (\beta_2', \sigma_2^2)'$. Now, if the true value of the parameter vector is θ1, we argue that the best model would be one that minimizes the discrepancy between the theoretical value and the sample, say $I(\theta_1; \hat{\theta})$. Because θ1 will not be known, Hurvich and Tsai (1989) considered finding an unbiased estimator for $E_1[I(\beta_1, \sigma_1^2;\, \hat{\beta}, \hat{\sigma}^2)]$, where
$$I(\beta_1, \sigma_1^2;\, \hat{\beta}, \hat{\sigma}^2) = \frac{1}{2}\left(\frac{\sigma_1^2}{\hat{\sigma}^2} - \ln\frac{\sigma_1^2}{\hat{\sigma}^2} - 1\right) + \frac{1}{2}\,\frac{(\beta_1-\hat{\beta})'\, Z'Z\, (\beta_1-\hat{\beta})}{n\,\hat{\sigma}^2}$$
and β is a k × 1 regression vector. Show that
$$E_1[I(\beta_1, \sigma_1^2;\, \hat{\beta}, \hat{\sigma}^2)] = \frac{1}{2}\left(-\ln\sigma_1^2 + E_1\ln\hat{\sigma}^2 + \frac{n+k}{n-k-2} - 1\right), \tag{2.57}$$
using the distributional properties of the regression coefficients and error variance. An unbiased estimator for $E_1\log\hat{\sigma}^2$ is $\log\hat{\sigma}^2$. Hence, we have shown that the expectation of the above discrimination information is as claimed. As models with differing dimensions k are considered, only the second and third terms in (2.57) will vary and we only need unbiased estimators for those two terms. This gives the form of AICc quoted in (2.19) in the chapter. You will need the two distributional results
$$\frac{n\hat{\sigma}^2}{\sigma_1^2} \sim \chi^2_{n-k} \qquad \text{and} \qquad \frac{(\hat{\beta}-\beta_1)'\, Z'Z\, (\hat{\beta}-\beta_1)}{\sigma_1^2} \sim \chi^2_{k}.$$
The two quantities are distributed independently as chi-squared distributions with the indicated degrees of freedom. If $x \sim \chi^2_n$, then E(1/x) = 1/(n − 2).

Section 2.3

2.6 Consider a process consisting of a linear trend with an additive noise term consisting of independent random variables wt with zero means and variances $\sigma_w^2$, that is,
$$x_t = \beta_0 + \beta_1 t + w_t,$$

where β0, β1 are fixed constants.

(a) Prove xt is nonstationary.

(b) Prove that the first difference series ∇xt = xt − xt−1 is stationary by finding its mean and autocovariance function.

(c) Repeat part (b) if wt is replaced by a general stationary process, say yt, with mean function µy and autocovariance function γy(h).

2.7 Show (2.28) is stationary.

2.8 The glacial varve record plotted in Figure 2.6 exhibits some nonstationarity that can be improved by transforming to logarithms and some additional nonstationarity that can be corrected by differencing the logarithms.

(a) Verify that the untransformed glacial varve series has intervals over which γ(0) changes by computing the zero-lag autocovariance over two different intervals. Argue that the transformation yt = ln xt stabilizes the variance over the series. Plot the histograms of xt and yt to see whether the approximation to normality is improved by transforming the data.

(b) Examine the sample ACF, ρy(h), of yt and comment. Do any time intervals, of the order 100 years, exist where one can observe behavior comparable to that observed in the global temperature records in Figure 1.2?

(c) Compute the first difference ut = yt − yt−1 of the log transformed varve records, and examine its time plot and autocorrelation function, ρu(h), and argue that a first difference produces a reasonably stationary series. Can you think of a practical interpretation for ut?

(d) Based on the sample ACF of the differenced transformed series computed in (c), argue that a generalization of the model given by Example 1.23 might be reasonable. Assume
$$u_t = \mu_u + w_t - \theta w_{t-1}$$
is stationary when the inputs wt are assumed independent with mean 0 and variance $\sigma_w^2$. Show that
$$\gamma_u(h) = \begin{cases} \sigma_w^2(1+\theta^2) & \text{if } h = 0 \\ -\theta\,\sigma_w^2 & \text{if } h = \pm 1 \\ 0 & \text{if } |h| > 1. \end{cases}$$
Using the sample ACF and the printed autocovariance $\hat{\gamma}_u(0)$, derive estimators for θ and $\sigma_w^2$. This is an application of the method of moments from classical statistics, where estimators of the parameters are derived by equating sample moments to theoretical moments.
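As a numerical illustration of the method of moments step in part (d), the sketch below (ours) solves ρ(1) = −θ/(1 + θ²) for θ and then backs out σw² from γu(0); the values r1 and g0 are hypothetical placeholders standing in for the sample quantities you would read off the varve output.

# sketch: method of moments for the MA(1) part of u_t = mu + w_t - theta*w_{t-1}
r1 = -0.4    # hypothetical lag-1 sample ACF value, rho_u(1); must satisfy |r1| <= 1/2
g0 =  0.33   # hypothetical sample autocovariance gamma_u(0)
# rho(1) = -theta/(1 + theta^2)  =>  r1*theta^2 + theta + r1 = 0
theta = (-1 + sqrt(1 - 4*r1^2)) / (2*r1)   # root with |theta| < 1 (invertible choice)
sigw2 = g0 / (1 + theta^2)                 # from gamma_u(0) = sigma_w^2 * (1 + theta^2)
c(theta = theta, sigma_w2 = sigw2)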

2.9 Consider the two time series representing average wholesale U.S. gas and oil prices over 180 months, beginning in July 1973 and ending in December 1987. Analyze the data using some of the techniques in this chapter with the idea that we should be looking at how changes in oil prices influence changes in gas prices. For further reading, see Liu (1991). In particular,

(a) Plot the raw data, and look at the autocorrelation functions to argue that the untransformed data series are nonstationary.

(b) It is often argued in economics that price changes are important, in particular, the percentage change in prices from one month to the next. On this basis, argue that a transformation of the form yt = ln xt − ln xt−1 might be applied to the data, where xt is the oil or gas price series.

(c) Use lagged multiple scatterplots and the autocorrelation and cross-correlation functions of the transformed oil and gas price series to investigate the properties of these series. Is it possible to guess whether gas prices are raised more quickly in response to increasing oil prices than they are decreased when oil prices are decreased? Use the cross-correlation function over the first 100 months compared with the cross-correlation function over the last 80 months. Do you think that it might be possible to predict log percentage changes in gas prices from log percentage changes in oil prices? Plot the two series on the same scale.

2.10 In this problem, we will explore the periodic nature of St, the SOI series displayed in Figure 1.5.

(a) Detrend the series by fitting a regression of St on time t. Is there a significant trend in the sea surface temperature? Comment.

(b) Calculate the periodogram for the detrended series obtained in part (a). Identify the frequencies of the two main peaks (with an obvious one at the frequency of one cycle every 12 months). What is the probable El Niño cycle indicated by the minor peak?


Section 2.4

2.11 For the data plotted in Figure 1.5, let St denote the SOI index series, and let Rt denote the Recruitment series.

(a) Draw a lag plot similar to the one in Figure 2.7 for Rt and comment.

(b) Reexamine the scatterplot matrix of Rt versus St−h shown in Figure 2.8 and the CCF of the two series shown in Figure 1.14, and fit the regression
$$R_t = \alpha + \beta_0 S_t + \beta_1 S_{t-1} + \beta_2 S_{t-2} + \beta_3 S_{t-3} + \beta_4 S_{t-4} + \beta_5 S_{t-5} + \beta_6 S_{t-6} + \beta_7 S_{t-7} + \beta_8 S_{t-8} + w_t.$$
Compare the magnitudes and signs of the coefficients β0, . . . , β8 with the scatterplots in Figure 2.8 and with the CCF in Figure 1.14.

(c) Use some of the smoothing techniques described in §2.4 to discover whether a trend exists in the Recruitment series, Rt, and to explore the periodic behavior of the data.

(d) In Example 2.6, some nonlinear behavior exists between the current value of Recruitment and past values of the SOI index. Use the smoothing techniques described in §2.4 to explore this possibility, concentrating on the scatterplot of Rt versus St−6.

2.12 Use a smoothing technique described in §2.4 to estimate the trend in the global temperature series displayed in Figure 1.2. Use the entire data set (see Example 2.1 for details).


Chapter 3

ARIMA Models

3.1 Introduction

In Chapters 1 and 2, we introduced autocorrelation and cross-correlation functions (ACFs and CCFs) as tools for clarifying relations that may occur within and between time series at various lags. In addition, we explained how to build linear models based on classical regression theory for exploiting the associations indicated by large values of the ACF or CCF. The time domain, or regression, methods of this chapter are appropriate when we are dealing with possibly nonstationary, shorter time series; these series are the rule rather than the exception in many applications. In addition, if the emphasis is on forecasting future values, then the problem is easily treated as a regression problem. This chapter develops a number of regression techniques for time series that are all related to classical ordinary and weighted or correlated least squares.

Classical regression is often insufficient for explaining all of the interesting dynamics of a time series. For example, the ACF of the residuals of the simple linear regression fit to the global temperature data (see Example 2.3 of Chapter 2) reveals additional structure in the data that the regression did not capture. Instead, the introduction of correlation as a phenomenon that may be generated through lagged linear relations leads to proposing the autoregressive (AR) and autoregressive moving average (ARMA) models. Adding nonstationary models to the mix leads to the autoregressive integrated moving average (ARIMA) model popularized in the landmark work by Box and Jenkins (1970). The Box–Jenkins method for identifying a plausible ARIMA model is given in this chapter along with techniques for parameter estimation and forecasting for these models. A partial theoretical justification of the use of ARMA models is discussed in Appendix B, §B.4.


3.2 Autoregressive Moving Average Models

The classical regression model of Chapter 2 was developed for the static case, namely, we only allow the dependent variable to be influenced by current values of the independent variables. In the time series case, it is desirable to allow the dependent variable to be influenced by the past values of the independent variables and possibly by its own past values. If the present can be plausibly modeled in terms of only the past values of the independent inputs, we have the enticing prospect that forecasting will be possible.

Introduction to Autoregressive Models

Autoregressive models are based on the idea that the current value of the series, xt, can be explained as a function of p past values, xt−1, xt−2, . . . , xt−p, where p determines the number of steps into the past needed to forecast the current value. As a typical case, recall Example 1.10 in which data were generated using the model
$$x_t = x_{t-1} - .90\, x_{t-2} + w_t,$$
where wt is white Gaussian noise with $\sigma_w^2 = 1$. We have now assumed the current value is a particular linear function of past values. The regularity that persists in Figure 1.9 gives an indication that forecasting for such a model might be a distinct possibility, say, through some version such as
$$x_{n+1}^{n} = x_n - .90\, x_{n-1},$$
where the quantity on the left-hand side denotes the forecast at the next period n + 1 based on the observed data, x1, x2, . . . , xn. We will make this notion more precise in our discussion of forecasting (§3.5).

The extent to which it might be possible to forecast a real data series from its own past values can be assessed by looking at the autocorrelation function and the lagged scatterplot matrices discussed in Chapter 2. For example, the lagged scatterplot matrix for the Southern Oscillation Index (SOI), shown in Figure 2.7, gives a distinct indication that lags 1 and 2, for example, are linearly associated with the current value. The ACF shown in Figure 1.14 shows relatively large positive values at lags 1, 2, 12, 24, and 36 and large negative values at 18, 30, and 42. We note also the possible relation between the SOI and Recruitment series indicated in the scatterplot matrix shown in Figure 2.8. We will indicate in later sections on transfer function and vector AR modeling how to handle the dependence on values taken by other series.

The preceding discussion motivates the following definition.

Definition 3.1 An autoregressive model of order p, abbreviated AR(p), is of the form
$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + w_t, \tag{3.1}$$
where xt is stationary, and φ1, φ2, . . . , φp are constants (φp ≠ 0). Unless otherwise stated, we assume that wt is a Gaussian white noise series with mean zero and variance $\sigma_w^2$. The mean of xt in (3.1) is zero. If the mean, µ, of xt is not zero, replace xt by xt − µ in (3.1), i.e.,
$$x_t - \mu = \phi_1(x_{t-1} - \mu) + \phi_2(x_{t-2} - \mu) + \cdots + \phi_p(x_{t-p} - \mu) + w_t,$$
or write
$$x_t = \alpha + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + w_t, \tag{3.2}$$
where α = µ(1 − φ1 − · · · − φp).

We note that (3.2) is similar to the regression model of §2.2, and hence the term auto (or self) regression. Some technical difficulties, however, develop from applying that model because the regressors, xt−1, . . . , xt−p, are random components, whereas zt was assumed to be fixed. A useful form follows by using the backshift operator (2.30) to write the AR(p) model, (3.1), as
$$(1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p)\, x_t = w_t, \tag{3.3}$$
or even more concisely as
$$\phi(B)\, x_t = w_t. \tag{3.4}$$

The properties of φ(B) are important in solving (3.4) for xt. This leads to the following definition.

Definition 3.2 The autoregressive operator is defined to be
$$\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p. \tag{3.5}$$

We initiate the investigation of AR models by considering the first-order model, AR(1), given by xt = φxt−1 + wt. Iterating backwards k times, we get
$$x_t = \phi x_{t-1} + w_t = \phi(\phi x_{t-2} + w_{t-1}) + w_t = \phi^2 x_{t-2} + \phi w_{t-1} + w_t = \cdots = \phi^k x_{t-k} + \sum_{j=0}^{k-1}\phi^j w_{t-j}.$$
This method suggests that, by continuing to iterate backwards, and provided that |φ| < 1 and xt is stationary, we can represent an AR(1) model as a linear process given by¹
$$x_t = \sum_{j=0}^{\infty}\phi^j w_{t-j}. \tag{3.6}$$

¹Note that $\lim_{k\to\infty} E\big(x_t - \sum_{j=0}^{k-1}\phi^j w_{t-j}\big)^2 = \lim_{k\to\infty}\phi^{2k}E\big(x_{t-k}^2\big) = 0$, so (3.6) exists in the mean square sense (see Appendix A for a definition).


The AR(1) process defined by (3.6) is stationary with mean
$$E(x_t) = \sum_{j=0}^{\infty}\phi^j E(w_{t-j}) = 0,$$
and autocovariance function,
$$\gamma(h) = \mathrm{cov}(x_{t+h}, x_t) = E\left[\left(\sum_{j=0}^{\infty}\phi^j w_{t+h-j}\right)\left(\sum_{k=0}^{\infty}\phi^k w_{t-k}\right)\right] = \sigma_w^2\sum_{j=0}^{\infty}\phi^{j}\phi^{j+h} = \sigma_w^2\,\phi^h\sum_{j=0}^{\infty}\phi^{2j} = \frac{\sigma_w^2\,\phi^h}{1-\phi^2}, \quad h \ge 0. \tag{3.7}$$
Recall that γ(h) = γ(−h), so we will only exhibit the autocovariance function for h ≥ 0. From (3.7), the ACF of an AR(1) is
$$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = \phi^h, \quad h \ge 0, \tag{3.8}$$
and ρ(h) satisfies the recursion
$$\rho(h) = \phi\,\rho(h-1), \quad h = 1, 2, \ldots. \tag{3.9}$$
We will discuss the ACF of a general AR(p) model in §3.4.
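A quick numerical check of (3.8) and (3.9), not part of the text, can be done in R by comparing the closed form φ^h with the ACF returned by ARMAacf for an arbitrary choice of φ:

# sketch: verify that the AR(1) ACF equals phi^h and so obeys rho(h) = phi*rho(h-1)
phi = .9
h   = 0:10
rho = ARMAacf(ar = phi, lag.max = 10)   # theoretical ACF at lags 0, ..., 10
all.equal(as.numeric(rho), phi^h)       # should be TRUE (up to rounding)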

Example 3.1 The Sample Path of an AR(1) Process

Figure 3.1 shows a time plot of two AR(1) processes, one with φ = .9 and one with φ = −.9; in both cases, $\sigma_w^2 = 1$. In the first case, $\rho(h) = .9^h$, for h ≥ 0, so observations close together in time are positively correlated with each other. This result means that observations at contiguous time points will tend to be close in value to each other; this fact shows up in the top of Figure 3.1 as a very smooth sample path for xt. Now, contrast this to the case in which φ = −.9, so that $\rho(h) = (-.9)^h$, for h ≥ 0. This result means that observations at contiguous time points are negatively correlated but observations two time points apart are positively correlated. This fact shows up in the bottom of Figure 3.1, where, for example, if an observation, xt, is positive, the next observation, xt+1, is typically negative, and the next observation, xt+2, is typically positive. Thus, in this case, the sample path is very choppy.

A figure similar to Figure 3.1 can be created in R using the following commands:

> par(mfrow=c(2,1))
> plot(arima.sim(list(order=c(1,0,0), ar=.9), n=100),
+   ylab="x", main=(expression("AR(1) "*phi*" = +.9")))
> plot(arima.sim(list(order=c(1,0,0), ar=-.9), n=100),
+   ylab="x", main=(expression("AR(1) "*phi*" = -.9")))


Figure 3.1 Simulated AR(1) models: φ = .9 (top); φ = −.9 (bottom).

Example 3.2 Explosive AR Models and Causality

In Example 1.18, it was discovered that the random walk xt = xt−1 + wt is not stationary. We might wonder whether there is a stationary AR(1) process with |φ| > 1. Such processes are called explosive because the values of the time series quickly become large in magnitude. Clearly, because $|\phi|^j$ increases without bound as j → ∞, $\sum_{j=0}^{k-1}\phi^j w_{t-j}$ will not converge (in mean square) as k → ∞, so the intuition used to get (3.6) will not work directly. We can, however, modify that argument to obtain a stationary model as follows. Write xt+1 = φxt + wt+1, in which case,
$$x_t = \phi^{-1}x_{t+1} - \phi^{-1}w_{t+1} = \phi^{-1}\left(\phi^{-1}x_{t+2} - \phi^{-1}w_{t+2}\right) - \phi^{-1}w_{t+1} = \cdots = \phi^{-k}x_{t+k} - \sum_{j=1}^{k-1}\phi^{-j}w_{t+j}, \tag{3.10}$$
by iterating forward k steps. Because $|\phi|^{-1} < 1$, this result suggests the stationary future dependent AR(1) model
$$x_t = -\sum_{j=1}^{\infty}\phi^{-j}w_{t+j}.$$
The reader can verify that this is stationary and of the AR(1) form xt = φxt−1 + wt. Unfortunately, this model is useless because it requires us to know the future to be able to predict the future. When a process does not depend on the future, such as the AR(1) when |φ| < 1, we will say the process is causal. In the explosive case of this example, the process is stationary, but it is also future dependent, and not causal.

The technique of iterating backwards to get an idea of the stationary solution of AR models works well when p = 1, but not for larger orders. A general technique is that of matching coefficients. Consider the AR(1) model in operator form
$$\phi(B)\, x_t = w_t, \tag{3.11}$$
where φ(B) = 1 − φB, and |φ| < 1. Also, write the model in equation (3.6) using operator form as
$$x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j} = \psi(B)\, w_t, \tag{3.12}$$
where $\psi(B) = \sum_{j=0}^{\infty}\psi_j B^j$ and $\psi_j = \phi^j$. Suppose we did not know that $\psi_j = \phi^j$. We could substitute ψ(B)wt from (3.12) for xt in (3.11) to obtain
$$\phi(B)\,\psi(B)\, w_t = w_t. \tag{3.13}$$
The coefficients of B on the left-hand side of (3.13) must be equal to those on the right-hand side of (3.13), which means
$$(1 - \phi B)(1 + \psi_1 B + \psi_2 B^2 + \cdots + \psi_j B^j + \cdots) = 1. \tag{3.14}$$
Reorganizing the coefficients in (3.14),
$$1 + (\psi_1 - \phi)B + (\psi_2 - \psi_1\phi)B^2 + \cdots + (\psi_j - \psi_{j-1}\phi)B^j + \cdots = 1,$$
we see that for each j = 1, 2, . . ., the coefficient of $B^j$ on the left must be zero because it is zero on the right. The coefficient of B on the left is (ψ1 − φ), and equating this to zero, ψ1 − φ = 0, leads to ψ1 = φ. Continuing, the coefficient of $B^2$ is (ψ2 − ψ1φ), so ψ2 = φ². In general,
$$\psi_j = \psi_{j-1}\phi,$$
with ψ0 = 1, which leads to the general solution $\psi_j = \phi^j$.


Another way to think about the operations we just performed is to consider the AR(1) model in operator form, φ(B)xt = wt. Now multiply both sides by $\phi^{-1}(B)$ (assuming the inverse operator exists) to get
$$\phi^{-1}(B)\,\phi(B)\, x_t = \phi^{-1}(B)\, w_t,$$
or
$$x_t = \phi^{-1}(B)\, w_t.$$
We know already that
$$\phi^{-1}(B) = 1 + \phi B + \phi^2 B^2 + \cdots + \phi^j B^j + \cdots,$$
that is, $\phi^{-1}(B)$ is ψ(B) in (3.12). Thus, we notice that working with operators is like working with polynomials. That is, consider the polynomial φ(z) = 1 − φz, where z is a complex number and |φ| < 1. Then,
$$\phi^{-1}(z) = \frac{1}{1-\phi z} = 1 + \phi z + \phi^2 z^2 + \cdots + \phi^j z^j + \cdots, \quad |z| \le 1,$$
and the coefficients of $B^j$ in $\phi^{-1}(B)$ are the same as the coefficients of $z^j$ in $\phi^{-1}(z)$. In other words, we may treat the backshift operator, B, as a complex number, z. These results will be generalized in our discussion of ARMA models. We will find the polynomials corresponding to the operators useful in exploring the general properties of ARMA models.

Introduction to Moving Average Models

As an alternative to the autoregressive representation in which the xt on the left-hand side of the equation are assumed to be combined linearly, the moving average model of order q, abbreviated as MA(q), assumes the white noise wt on the right-hand side of the defining equation are combined linearly to form the observed data.

Definition 3.3 The moving average model of order q, or MA(q) model, is defined to be
$$x_t = w_t + \theta_1 w_{t-1} + \theta_2 w_{t-2} + \cdots + \theta_q w_{t-q}, \tag{3.15}$$
where there are q lags in the moving average and θ1, θ2, . . . , θq (θq ≠ 0) are parameters.² The noise wt is assumed to be Gaussian white noise.

The system is the same as the infinite moving average defined as the linear process (3.12), where ψ0 = 1, ψj = θj, for j = 1, . . . , q, and ψj = 0 for other values. We may also write the MA(q) process in the equivalent form
$$x_t = \theta(B)\, w_t, \tag{3.16}$$
using the following definition.

²Some texts and software packages write the MA model with negative coefficients; that is, xt = wt − θ1wt−1 − θ2wt−2 − · · · − θqwt−q.


Definition 3.4 The moving average operator is
$$\theta(B) = 1 + \theta_1 B + \theta_2 B^2 + \cdots + \theta_q B^q. \tag{3.17}$$

Unlike the autoregressive process, the moving average process is stationary for any values of the parameters θ1, . . . , θq; details of this result are provided in §3.4.

Example 3.3 Autocorrelation and Sample Path of an MA(1) Process

Consider the MA(1) model xt = wt + θwt−1. Then,
$$\gamma(h) = \begin{cases} (1+\theta^2)\sigma_w^2, & h = 0 \\ \theta\sigma_w^2, & h = 1 \\ 0, & h > 1, \end{cases}$$
and the autocorrelation function is
$$\rho(h) = \begin{cases} \dfrac{\theta}{1+\theta^2}, & h = 1 \\[4pt] 0, & h > 1. \end{cases}$$
Note |ρ(1)| ≤ 1/2 for all values of θ (Problem 3.1). Also, xt is correlated with xt−1, but not with xt−2, xt−3, . . . . Contrast this with the case of the AR(1) model in which the correlation between xt and xt−k is never zero. When θ = .5, for example, xt and xt−1 are positively correlated, and ρ(1) = .4. When θ = −.5, xt and xt−1 are negatively correlated, ρ(1) = −.4. Figure 3.2 shows a time plot of these two processes with $\sigma_w^2 = 1$. The series in Figure 3.2 where θ = .5 is smoother than the series in Figure 3.2, where θ = −.5.

A figure similar to Figure 3.2 can be created in R as follows:

> par(mfrow=c(2,1))
> plot(arima.sim(list(order=c(0,0,1), ma=.5), n=100),
+   ylab="x", main=(expression("MA(1) "*theta*" = +.5")))
> plot(arima.sim(list(order=c(0,0,1), ma=-.5), n=100),
+   ylab="x", main=(expression("MA(1) "*theta*" = -.5")))

Example 3.4 Non-uniqueness of MA Models and Invertibility

Using Example 3.3, we note that for an MA(1) model, ρ(h) is the same for θ and 1/θ; try 5 and 1/5, for example. In addition, the pair $\sigma_w^2 = 1$ and θ = 5 yield the same autocovariance function as the pair $\sigma_w^2 = 25$ and θ = 1/5, namely,
$$\gamma(h) = \begin{cases} 26, & h = 0 \\ 5, & h = 1 \\ 0, & h > 1. \end{cases}$$


Figure 3.2 Simulated MA(1) models: θ = .5 (top); θ = −.5 (bottom).

Thus, the MA(1) processes
$$x_t = w_t + \tfrac{1}{5}w_{t-1}, \qquad w_t \sim \text{iid } N(0, 25),$$
and
$$x_t = v_t + 5v_{t-1}, \qquad v_t \sim \text{iid } N(0, 1),$$
are the same because of normality (i.e., all finite distributions are the same). We can only observe the time series xt and not the noise, wt or vt, so we cannot distinguish between the models. Hence, we will have to choose only one of them. For convenience, by mimicking the criterion of causality for AR models, we will choose the model with an infinite AR representation. Such a process is called an invertible process.

To discover which model is the invertible model, we can reverse the roles of xt and wt (because we are mimicking the AR case) and write the MA(1) model as wt = −θwt−1 + xt. Following the steps that led to (3.6), if |θ| < 1, then $w_t = \sum_{j=0}^{\infty}(-\theta)^j x_{t-j}$, which is the desired infinite AR representation of the model. Hence, given a choice, we will choose the model with $\sigma_w^2 = 25$ and θ = 1/5 because it is invertible.

As in the AR case, the polynomial, θ(z), corresponding to the moving average operator, θ(B), will be useful in exploring general properties of MA processes. For example, following the steps of equations (3.11)–(3.14), we can write the MA(1) model as xt = θ(B)wt, where θ(B) = 1 + θB. If |θ| < 1, then we can write the model as π(B)xt = wt, where $\pi(B) = \theta^{-1}(B)$. Let θ(z) = 1 + θz, for |z| ≤ 1; then $\pi(z) = \theta^{-1}(z) = 1/(1+\theta z) = \sum_{j=0}^{\infty}(-\theta)^j z^j$, and we determine that $\pi(B) = \sum_{j=0}^{\infty}(-\theta)^j B^j$.

Autoregressive Moving Average Models

We now proceed with the general development of autoregressive, moving average, and mixed autoregressive moving average (ARMA) models for stationary time series.

Definition 3.5 A time series {xt; t = 0, ±1, ±2, . . .} is ARMA(p, q) if it is stationary and
$$x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q}, \tag{3.18}$$
with φp ≠ 0, θq ≠ 0, and $\sigma_w^2 > 0$. The parameters p and q are called the autoregressive and the moving average orders, respectively. If xt has a nonzero mean µ, we set α = µ(1 − φ1 − · · · − φp) and write the model as
$$x_t = \alpha + \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + w_t + \theta_1 w_{t-1} + \cdots + \theta_q w_{t-q}. \tag{3.19}$$
Unless stated otherwise, {wt; t = 0, ±1, ±2, . . .} is a Gaussian white noise sequence.

As previously noted, when q = 0, the model is called an autoregressive model of order p, AR(p), and when p = 0, the model is called a moving average model of order q, MA(q). To aid in the investigation of ARMA models, it will be useful to write them using the AR operator, (3.5), and the MA operator, (3.17). In particular, the ARMA(p, q) model in (3.18) can then be written in concise form as
$$\phi(B)\, x_t = \theta(B)\, w_t. \tag{3.20}$$
Before we discuss the conditions under which (3.18) is causal and invertible, we point out a potential problem with the ARMA model.

Example 3.5 Parameter Redundancy

Consider a white noise process xt = wt. Equivalently, we can write this as .5xt−1 = .5wt−1 by shifting back one unit of time and multiplying by .5. Now, subtract the two representations to obtain
$$x_t - .5x_{t-1} = w_t - .5w_{t-1},$$
or
$$x_t = .5x_{t-1} - .5w_{t-1} + w_t, \tag{3.21}$$
which looks like an ARMA(1, 1) model. Of course, xt is still white noise; nothing has changed in this regard [i.e., xt = wt is the solution to (3.21)], but we have hidden the fact that xt is white noise because of the parameter redundancy or over-parameterization. Write the parameter redundant model in operator form as φ(B)xt = θ(B)wt, or
$$(1 - .5B)\, x_t = (1 - .5B)\, w_t.$$
Apply the operator $\phi(B)^{-1} = (1 - .5B)^{-1}$ to both sides to obtain
$$x_t = (1 - .5B)^{-1}(1 - .5B)\, x_t = (1 - .5B)^{-1}(1 - .5B)\, w_t = w_t,$$
which is the original model. We can easily detect the problem of over-parameterization with the use of the operators or their associated polynomials. That is, write the AR polynomial φ(z) = (1 − .5z), the MA polynomial θ(z) = (1 − .5z), and note that both polynomials have a common factor, namely (1 − .5z). This common factor immediately identifies the parameter redundancy. Discarding the common factor in each leaves φ(z) = 1 and θ(z) = 1, from which we conclude φ(B) = 1 and θ(B) = 1, and we deduce that the model is actually white noise. The consideration of parameter redundancy will be crucial when we discuss estimation for general ARMA models. As this example points out, we might fit an ARMA(1, 1) model to white noise data and find that the parameter estimates are significant. If we were unaware of parameter redundancy, we might claim the data are correlated when in fact they are not (Problem 3.19).

Examples 3.2, 3.4, and 3.5 point to a number of problems with the general definition of ARMA(p, q) models, as given by (3.18), or, equivalently, by (3.20). To summarize, we have seen the following problems:

(i) parameter redundant models,

(ii) stationary AR models that depend on the future, and

(iii) MA models that are not unique.

To overcome these problems, we will require some additional restrictions on the model parameters. First, we make the following definitions.

Definition 3.6 The AR and MA polynomials are defined as
$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p, \quad \phi_p \ne 0, \tag{3.22}$$
and
$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q, \quad \theta_q \ne 0, \tag{3.23}$$
respectively, where z is a complex number.


To address the first problem, we will henceforth refer to an ARMA(p, q) model to mean that it is in its simplest form. That is, in addition to the original definition given in equation (3.18), we will also require that φ(z) and θ(z) have no common factors. So, the process, xt = .5xt−1 − .5wt−1 + wt, discussed in Example 3.5 is not referred to as an ARMA(1, 1) process because, in its reduced form, xt is white noise.

To address the problem of future-dependent models, we formally introduce the concept of causality.

Definition 3.7 An ARMA(p, q) model, φ(B)xt = θ(B)wt, is said to be causal, if the time series {xt; t = 0, ±1, ±2, . . .} can be written as a one-sided linear process:
$$x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j} = \psi(B)\, w_t, \tag{3.24}$$
where $\psi(B) = \sum_{j=0}^{\infty}\psi_j B^j$, and $\sum_{j=0}^{\infty}|\psi_j| < \infty$; we set ψ0 = 1.

In Example 3.2, the AR(1) process, xt = φxt−1 + wt, is causal only when |φ| < 1. Equivalently, the process is causal only when the root of φ(z) = 1 − φz is bigger than one in absolute value. That is, the root, say, z0, of φ(z) is z0 = 1/φ (because φ(z0) = 0) and |z0| > 1 because |φ| < 1. In general, we have the following property.

Property P3.1: Causality of an ARMA(p, q) Process
An ARMA(p, q) model is causal if and only if φ(z) ≠ 0 for |z| ≤ 1. The coefficients of the linear process given in (3.24) can be determined by solving
$$\psi(z) = \sum_{j=0}^{\infty}\psi_j z^j = \frac{\theta(z)}{\phi(z)}, \quad |z| \le 1.$$

Another way to phrase Property P3.1 is that an ARMA process is causal only when the roots of φ(z) lie outside the unit circle; that is, φ(z) = 0 only when |z| > 1. Finally, to address the problem of uniqueness discussed in Example 3.4, we choose the model that allows an infinite autoregressive representation.

Definition 3.8 An ARMA(p, q) model, φ(B)xt = θ(B)wt, is said to be invertible, if the time series {xt; t = 0, ±1, ±2, . . .} can be written as
$$\pi(B)\, x_t = \sum_{j=0}^{\infty}\pi_j x_{t-j} = w_t, \tag{3.25}$$
where $\pi(B) = \sum_{j=0}^{\infty}\pi_j B^j$, and $\sum_{j=0}^{\infty}|\pi_j| < \infty$; we set π0 = 1.

Analogous to Property P3.1, we have the following property.


Property P3.2: Invertibility of an ARMA(p, q) Process
An ARMA(p, q) model is invertible if and only if θ(z) ≠ 0 for |z| ≤ 1. The coefficients πj of π(B) given in (3.25) can be determined by solving
$$\pi(z) = \sum_{j=0}^{\infty}\pi_j z^j = \frac{\phi(z)}{\theta(z)}, \quad |z| \le 1.$$

Another way to phrase Property P3.2 is that an ARMA process is invertible only when the roots of θ(z) lie outside the unit circle; that is, θ(z) = 0 only when |z| > 1. The proof of Property P3.1 is given in Appendix B (the proof of Property P3.2 is similar and, hence, is not provided). The following examples illustrate these concepts.

Example 3.6 Parameter Redundancy, Causality, and Invertibility

Consider the process
$$x_t = .4x_{t-1} + .45x_{t-2} + w_t + w_{t-1} + .25w_{t-2},$$
or, in operator form,
$$(1 - .4B - .45B^2)\, x_t = (1 + B + .25B^2)\, w_t.$$
At first, xt appears to be an ARMA(2, 2) process. But, the associated polynomials
$$\phi(z) = 1 - .4z - .45z^2 = (1 + .5z)(1 - .9z)$$
$$\theta(z) = (1 + z + .25z^2) = (1 + .5z)^2$$
have a common factor that can be canceled. After cancellation, the polynomials become φ(z) = (1 − .9z) and θ(z) = (1 + .5z), so the model is an ARMA(1, 1) model, (1 − .9B)xt = (1 + .5B)wt, or
$$x_t = .9x_{t-1} + .5w_{t-1} + w_t. \tag{3.26}$$
The model is causal because φ(z) = (1 − .9z) = 0 when z = 10/9, which is outside the unit circle. The model is also invertible because the root of θ(z) = (1 + .5z) is z = −2, which is outside the unit circle.

To write the model as a linear process, we can obtain the ψ-weights using Property P3.1:
$$\psi(z) = \frac{\theta(z)}{\phi(z)} = \frac{1 + .5z}{1 - .9z} = (1 + .5z)(1 + .9z + .9^2z^2 + .9^3z^3 + \cdots), \quad |z| \le 1.$$
The coefficient of $z^j$ in ψ(z) is $\psi_j = (.5 + .9)\,.9^{j-1}$, for j ≥ 1, so (3.26) can be written as
$$x_t = w_t + 1.4\sum_{j=1}^{\infty} .9^{j-1}w_{t-j}.$$
Similarly, to find the invertible representation using Property P3.2:
$$\pi(z) = \frac{\phi(z)}{\theta(z)} = (1 - .9z)(1 - .5z + .5^2z^2 - .5^3z^3 + \cdots), \quad |z| \le 1.$$
In this case, the π-weights are given by $\pi_j = (-1)^j(.9 + .5)\,.5^{j-1}$, for j ≥ 1, and hence, we can also write (3.26) as
$$x_t = 1.4\sum_{j=1}^{\infty}(-.5)^{j-1}x_{t-j} + w_t.$$
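These calculations can be double-checked in R; the sketch below (ours) uses polyroot to confirm that the reduced AR and MA polynomials have roots outside the unit circle and ARMAtoMA to confirm the ψ-weights 1.4(.9)^{j−1}:

# sketch: roots of the reduced polynomials and the first few psi-weights
Mod(polyroot(c(1, -.9)))          # AR root 10/9 > 1  => causal
Mod(polyroot(c(1,  .5)))          # MA root 2    > 1  => invertible
ARMAtoMA(ar = .9, ma = .5, 10)    # should match 1.4 * .9^(0:9)
1.4 * .9^(0:9)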

Example 3.7 Causal Conditions for an AR(2) Process

For an AR(1) model, (1 − φB)xt = wt, to be causal, the root of φ(z) = 1 − φz must lie outside of the unit circle. In this case, the root (or zero) occurs at z0 = 1/φ, i.e., φ(z0) = 0, so it is easy to go from the causal requirement on the root, that is, |1/φ| > 1, to a requirement on the parameter, that is, |φ| < 1. It is not so easy to establish this relationship for higher order models.

For example, the AR(2) model, (1 − φ1B − φ2B²)xt = wt, is causal when the two roots of φ(z) = 1 − φ1z − φ2z² lie outside of the unit circle. Using the quadratic formula, this requirement can be written as
$$\left|\frac{\phi_1 \pm \sqrt{\phi_1^2 + 4\phi_2}}{-2\phi_2}\right| > 1.$$
The roots of φ(z) may be real and distinct, real and equal, or a complex conjugate pair. If we denote those roots by z1 and z2, we can write $\phi(z) = (1 - z_1^{-1}z)(1 - z_2^{-1}z)$; note that φ(z1) = φ(z2) = 0. The model can be written in operator form as $(1 - z_1^{-1}B)(1 - z_2^{-1}B)\,x_t = w_t$. From this representation, it follows that $\phi_1 = (z_1^{-1} + z_2^{-1})$ and $\phi_2 = -(z_1 z_2)^{-1}$. This relationship can be used to establish the following equivalent condition for causality:
$$\phi_1 + \phi_2 < 1, \quad \phi_2 - \phi_1 < 1, \quad \text{and} \quad |\phi_2| < 1. \tag{3.27}$$
This causality condition specifies a triangular region in the parameter space. We leave the details of the equivalence to the reader (Problem 3.4).
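The condition (3.27) is easy to check numerically; a small sketch (ours) verifies it for the AR(2) model used in Example 3.9 below, φ1 = 1.5 and φ2 = −.75, and confirms with polyroot that both roots lie outside the unit circle:

# sketch: check the causality conditions (3.27) and the roots for phi1 = 1.5, phi2 = -.75
phi1 = 1.5; phi2 = -.75
c(phi1 + phi2 < 1, phi2 - phi1 < 1, abs(phi2) < 1)   # all TRUE => causal
Mod(polyroot(c(1, -phi1, -phi2)))                    # both moduli exceed 1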


3.3 Difference Equations

The study of the behavior of ARMA processes and their ACFs is greatly enhanced by a basic knowledge of difference equations, simply because they are difference equations. This topic is also useful in the study of time domain models and stochastic processes in general. We will give a brief and heuristic account of the topic along with some examples of the usefulness of the theory. For details, the reader is referred to Mickens (1987).

Suppose we have a sequence of numbers u0, u1, u2, . . . such that
$$u_n - \alpha u_{n-1} = 0, \quad \alpha \ne 0, \quad n = 1, 2, \ldots. \tag{3.28}$$
For example, recall (3.9) in which we showed that the ACF of an AR(1) process is a sequence, ρ(h), satisfying
$$\rho(h) - \phi\rho(h-1) = 0, \quad h = 1, 2, \ldots.$$
Equation (3.28) represents a homogeneous difference equation of order 1. To solve the equation, we write:
$$u_1 = \alpha u_0, \quad u_2 = \alpha u_1 = \alpha^2 u_0, \quad \ldots, \quad u_n = \alpha u_{n-1} = \alpha^n u_0.$$
Given an initial condition u0 = c, we may solve (3.28), namely, $u_n = \alpha^n c$.

In operator notation, (3.28) can be written as (1 − αB)un = 0. The polynomial associated with (3.28) is α(z) = 1 − αz, and the root, say, z0, of this polynomial is z0 = 1/α; that is, α(z0) = 0. We know the solution to (3.28), with initial condition u0 = c, is
$$u_n = \alpha^n c = \left(z_0^{-1}\right)^n c.$$
That is, the solution to the difference equation (3.28) depends only on the initial condition and the inverse of the root to the associated polynomial α(z).

Now suppose that the sequence satisfies
$$u_n - \alpha_1 u_{n-1} - \alpha_2 u_{n-2} = 0, \quad \alpha_2 \ne 0, \quad n = 2, 3, \ldots. \tag{3.29}$$
This equation is a homogeneous difference equation of order 2. The corresponding polynomial is
$$\alpha(z) = 1 - \alpha_1 z - \alpha_2 z^2,$$
which has two roots, say, z1 and z2; that is, α(z1) = α(z2) = 0. We will consider two cases. First suppose z1 ≠ z2. Then the general solution to (3.29) is
$$u_n = c_1 z_1^{-n} + c_2 z_2^{-n}, \tag{3.30}$$


where c1 and c2 depend on the initial conditions. This claim can be verified by direct substitution of (3.30) into (3.29):
$$c_1 z_1^{-n} + c_2 z_2^{-n} - \alpha_1\left(c_1 z_1^{-(n-1)} + c_2 z_2^{-(n-1)}\right) - \alpha_2\left(c_1 z_1^{-(n-2)} + c_2 z_2^{-(n-2)}\right)$$
$$= c_1 z_1^{-n}\left(1 - \alpha_1 z_1 - \alpha_2 z_1^2\right) + c_2 z_2^{-n}\left(1 - \alpha_1 z_2 - \alpha_2 z_2^2\right) = c_1 z_1^{-n}\alpha(z_1) + c_2 z_2^{-n}\alpha(z_2) = 0.$$
Given two initial conditions u0 and u1, we may solve for c1 and c2:
$$u_0 = c_1 + c_2 \quad \text{and} \quad u_1 = c_1 z_1^{-1} + c_2 z_2^{-1},$$
where z1 and z2 can be solved for in terms of α1 and α2 using the quadratic formula, for example.

When the roots are equal, z1 = z2 (= z0), the general solution to (3.29) is
$$u_n = z_0^{-n}(c_1 + c_2 n). \tag{3.31}$$
This claim can also be verified by direct substitution of (3.31) into (3.29):
$$z_0^{-n}(c_1 + c_2 n) - \alpha_1\left(z_0^{-(n-1)}[c_1 + c_2(n-1)]\right) - \alpha_2\left(z_0^{-(n-2)}[c_1 + c_2(n-2)]\right)$$
$$= z_0^{-n}(c_1 + c_2 n)\left(1 - \alpha_1 z_0 - \alpha_2 z_0^2\right) + c_2 z_0^{-n+1}\left(\alpha_1 + 2\alpha_2 z_0\right) = c_2 z_0^{-n+1}\left(\alpha_1 + 2\alpha_2 z_0\right).$$
To show that $(\alpha_1 + 2\alpha_2 z_0) = 0$, write $1 - \alpha_1 z - \alpha_2 z^2 = (1 - z_0^{-1}z)^2$, and take derivatives with respect to z on both sides of the equation to obtain $(\alpha_1 + 2\alpha_2 z) = 2z_0^{-1}(1 - z_0^{-1}z)$. Thus, $(\alpha_1 + 2\alpha_2 z_0) = 2z_0^{-1}(1 - z_0^{-1}z_0) = 0$, as was to be shown. Finally, given two initial conditions, u0 and u1, we can solve for c1 and c2:
$$u_0 = c_1 \quad \text{and} \quad u_1 = (c_1 + c_2)z_0^{-1}.$$
To summarize these results, in the case of distinct roots, the solution to the homogeneous difference equation of degree two was
$$u_n = z_1^{-n} \times (\text{a polynomial in } n \text{ of degree } m_1 - 1) + z_2^{-n} \times (\text{a polynomial in } n \text{ of degree } m_2 - 1),$$
where m1 is the multiplicity of the root z1 and m2 is the multiplicity of the root z2. In this example, of course, m1 = m2 = 1, and we called the polynomials of degree zero c1 and c2, respectively. In the case of the repeated root, the solution was
$$u_n = z_0^{-n} \times (\text{a polynomial in } n \text{ of degree } m_0 - 1),$$


where m0 is the multiplicity of the root z0; that is, m0 = 2. In this case, we wrote the polynomial of degree one as c1 + c2n. In both cases, we solved for c1 and c2 given two initial conditions, u0 and u1.

Example 3.8 The ACF of an AR(2) Process

Suppose xt = φ1xt−1 + φ2xt−2 + wt is a causal AR(2) process. Multiply each side of the model by xt−h for h > 0, and take expectation:
$$E(x_t x_{t-h}) = \phi_1 E(x_{t-1}x_{t-h}) + \phi_2 E(x_{t-2}x_{t-h}) + E(w_t x_{t-h}).$$
The result is
$$\gamma(h) = \phi_1\gamma(h-1) + \phi_2\gamma(h-2), \quad h = 1, 2, \ldots. \tag{3.32}$$
In (3.32), we used the fact that E(xt) = 0 and for h > 0,
$$E(w_t x_{t-h}) = E\left(w_t\sum_{j=0}^{\infty}\psi_j w_{t-h-j}\right) = 0.$$
Divide (3.32) through by γ(0) to obtain the difference equation for the ACF of the process:
$$\rho(h) - \phi_1\rho(h-1) - \phi_2\rho(h-2) = 0, \quad h = 1, 2, \ldots. \tag{3.33}$$
The initial conditions are ρ(0) = 1 and ρ(−1) = φ1/(1 − φ2), which is obtained by evaluating (3.33) for h = 1 and noting that ρ(1) = ρ(−1).

Using the results for the homogeneous difference equation of order two, let z1 and z2 be the roots of the associated polynomial, φ(z) = 1 − φ1z − φ2z². Because the model is causal, we know the roots are outside the unit circle: |z1| > 1 and |z2| > 1. Now, consider the solution for three cases:

(i) When z1 and z2 are real and distinct, then
$$\rho(h) = c_1 z_1^{-h} + c_2 z_2^{-h},$$
so ρ(h) → 0 exponentially fast as h → ∞.

(ii) When z1 = z2 (= z0) are real and equal, then
$$\rho(h) = z_0^{-h}(c_1 + c_2 h),$$
so ρ(h) → 0 exponentially fast as h → ∞.

(iii) When $z_1 = \bar{z}_2$ are a complex conjugate pair, then $c_2 = \bar{c}_1$ (because ρ(h) is real), and
$$\rho(h) = c_1 z_1^{-h} + \bar{c}_1\bar{z}_1^{-h}.$$
Write c1 and z1 in polar coordinates, for example, $z_1 = |z_1|e^{i\theta}$, where θ is the angle whose tangent is the ratio of the imaginary part and the real part of z1 (sometimes called arg(z1); the range of θ is [−π, π]). Then, using the fact that $e^{i\alpha} + e^{-i\alpha} = 2\cos(\alpha)$, the solution has the form
$$\rho(h) = a|z_1|^{-h}\cos(h\theta + b),$$
where a and b are determined by the initial conditions. Again, ρ(h) dampens to zero exponentially fast as h → ∞, but it does so in a sinusoidal fashion. The implication of this result is shown in the next example.
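A quick numerical illustration (ours) of the recursion (3.33): for any causal AR(2), the ACF values returned by ARMAacf should satisfy ρ(h) = φ1ρ(h−1) + φ2ρ(h−2) for h ≥ 2.

# sketch: the AR(2) ACF obeys the order-two difference equation (3.33)
phi1 = 1.5; phi2 = -.75
rho  = ARMAacf(ar = c(phi1, phi2), lag.max = 20)    # rho(0), ..., rho(20)
h    = 3:21                                         # positions of rho(2), ..., rho(20)
max(abs(rho[h] - (phi1*rho[h-1] + phi2*rho[h-2])))  # essentially zero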

Example 3.9 The Sample Path of an AR(2) with Complex Roots

Figure 3.3 shows n = 144 observations from the AR(2) model
$$x_t = 1.5x_{t-1} - .75x_{t-2} + w_t,$$
with $\sigma_w^2 = 1$, and with complex roots chosen so the process exhibits pseudo-cyclic behavior at the rate of one cycle every 12 time points. The autoregressive polynomial for this model is $\phi(z) = 1 - 1.5z + .75z^2$. The roots of φ(z) are $1 \pm i/\sqrt{3}$, and $\theta = \tan^{-1}(1/\sqrt{3}) = 2\pi/12$ radians per unit time. To convert the angle to cycles per unit time, divide by 2π to get 1/12 cycles per unit time. The ACF for this model is shown in §3.4, Figure 3.4.

To reproduce Figure 3.3 in R:

> set.seed(5)
> ar2 = arima.sim(list(order = c(2,0,0), ar = c(1.5,-.75)),
+   n = 144)
> plot.ts(ar2, axes=F); box(); axis(2)
> axis(1, seq(0,144,24))
> abline(v=seq(0,144,12), lty="dotted")

To calculate and display the ACF for this model in R:

> acf = ARMAacf(ar=c(1.5,-.75), ma=0, 50)
> plot(acf, type="h", xlab="lag")
> abline(h=0)

We now exhibit the solution for the general homogeneous difference equation of order p:
$$u_n - \alpha_1 u_{n-1} - \cdots - \alpha_p u_{n-p} = 0, \quad \alpha_p \ne 0, \quad n = p, p+1, \ldots. \tag{3.34}$$
The associated polynomial is
$$\alpha(z) = 1 - \alpha_1 z - \cdots - \alpha_p z^p.$$


Figure 3.3 Simulated AR(2) model, n = 144 with φ1 = 1.5 and φ2 = −.75.

Suppose α(z) has r distinct roots, z1 with multiplicity m1, z2 with multiplicity m2, . . . , and zr with multiplicity mr, such that m1 + m2 + · · · + mr = p. The general solution to the difference equation (3.34) is
$$u_n = z_1^{-n}P_1(n) + z_2^{-n}P_2(n) + \cdots + z_r^{-n}P_r(n), \tag{3.35}$$
where Pj(n), for j = 1, 2, . . . , r, is a polynomial in n, of degree mj − 1. Given p initial conditions u0, . . . , up−1, we can solve for the Pj(n) explicitly.

Example 3.10 Determining the ψ-weights for a Causal ARMA(p, q)

For a causal ARMA(p, q) model, φ(B)xt = θ(B)wt, where the zeros of φ(z) are outside the unit circle, recall that we may write
$$x_t = \sum_{j=0}^{\infty}\psi_j w_{t-j},$$
where the ψ-weights are determined using Property P3.1.

For the pure MA(q) model, ψ0 = 1, ψj = θj, for j = 1, . . . , q, and ψj = 0, otherwise. For the general case of ARMA(p, q) models, the task of solving for the ψ-weights is much more complicated, as was demonstrated in Example 3.6. The use of the theory of homogeneous difference equations can help here. To solve for the ψ-weights in general, we must match the coefficients in ψ(z)φ(z) = θ(z):
$$(\psi_0 + \psi_1 z + \psi_2 z^2 + \cdots)(1 - \phi_1 z - \phi_2 z^2 - \cdots) = (1 + \theta_1 z + \theta_2 z^2 + \cdots).$$
The first few values are
$$\psi_0 = 1, \quad \psi_1 - \phi_1\psi_0 = \theta_1, \quad \psi_2 - \phi_1\psi_1 - \phi_2\psi_0 = \theta_2, \quad \psi_3 - \phi_1\psi_2 - \phi_2\psi_1 - \phi_3\psi_0 = \theta_3, \quad \ldots$$
where we would take φj = 0 for j > p, and θj = 0 for j > q. The ψ-weights satisfy the homogeneous difference equation given by
$$\psi_j - \sum_{k=1}^{p}\phi_k\psi_{j-k} = 0, \quad j \ge \max(p, q+1), \tag{3.36}$$
with initial conditions
$$\psi_j - \sum_{k=1}^{j}\phi_k\psi_{j-k} = \theta_j, \quad 0 \le j \le \max(p, q+1). \tag{3.37}$$
The general solution depends on the roots of the AR polynomial $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$, as seen from (3.36). The specific solution will, of course, depend on the initial conditions.

Consider the ARMA process given in (3.26), xt = .9xt−1 + .5wt−1 + wt. Because max(p, q + 1) = 2, using (3.37), we have ψ0 = 1 and ψ1 = .9 + .5 = 1.4. By (3.36), for j = 2, 3, . . . , the ψ-weights satisfy ψj − .9ψj−1 = 0. The general solution is $\psi_j = c\,.9^j$. To find the specific solution, use the initial condition ψ1 = 1.4, so 1.4 = .9c or c = 1.4/.9. Finally, $\psi_j = 1.4(.9)^{j-1}$, for j ≥ 1, as we saw in Example 3.6.

To view, for example, the first 50 ψ-weights in R, use:

> ARMAtoMA(ar=.9, ma=.5, 50) # for a list> plot(ARMAtoMA(ar=.9, ma=.5, 50)) # for a graph
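As a quick numerical check (a minimal sketch, not from the text), the closed-form solution ψj = 1.4(.9)^{j−1} can be compared with the ψ-weights returned by ARMAtoMA; the two columns below should agree up to rounding:

> psi = ARMAtoMA(ar=.9, ma=.5, 10)   # psi_1, ..., psi_10 computed by R
> psi.closed = 1.4*.9^(0:9)          # 1.4(.9)^{j-1} for j = 1, ..., 10
> cbind(psi, psi.closed)             # the columns match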

3.4 Autocorrelation and Partial Autocorrelation Functions

We begin by exhibiting the ACF of an MA(q) process, xt = θ(B)wt, where θ(B) = 1 + θ1B + · · · + θq B^q. Because xt is a finite linear combination of white noise terms, the process is stationary with mean

E(xt) = ∑_{j=0}^{q} θj E(wt−j) = 0,


where we have written θ0 = 1, and with autocovariance function

γ(h) = cov(xt+h, xt) = E[(∑_{j=0}^{q} θj wt+h−j)(∑_{k=0}^{q} θk wt−k)]
     = σ_w² ∑_{j=0}^{q−h} θj θ_{j+h} for 0 ≤ h ≤ q, and γ(h) = 0 for h > q.    (3.38)

Recall that γ(h) = γ(−h), so we will only display the values for h ≥ 0. The cutting off of γ(h) after q lags is the signature of the MA(q) model. Dividing (3.38) by γ(0) yields the ACF of an MA(q):

ρ(h) = [∑_{j=0}^{q−h} θj θ_{j+h}] / [1 + θ1² + · · · + θq²] for 1 ≤ h ≤ q, and ρ(h) = 0 for h > q.    (3.39)
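As an illustration (a minimal sketch, not from the text), (3.39) can be evaluated directly for an MA(2) with hypothetical coefficients θ1 = .5 and θ2 = −.3 and compared with ARMAacf:

> theta = c(.5, -.3)                 # hypothetical MA(2) coefficients
> th = c(1, theta)                   # theta_0 = 1, theta_1, theta_2
> rho = sapply(1:2, function(h)
+    sum(th[1:(3-h)]*th[(1+h):3]) / sum(th^2))   # equation (3.39)
> rho
> ARMAacf(ma=theta, lag.max=2)[-1]   # agrees with rho; lags beyond 2 are zero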

For a causal ARMA(p, q) model, φ(B)xt = θ(B)wt, where the zeros of φ(z) are outside the unit circle, write

xt = ∑_{j=0}^{∞} ψj wt−j.

It follows immediately that E(xt) = 0. Also, the autocovariance function of xt can be written as

γ(h) = cov(xt+h, xt) = σ_w² ∑_{j=0}^{∞} ψj ψ_{j+h},   h ≥ 0.    (3.40)

We could then use (3.36) and (3.37) to solve for the ψ-weights. In turn, we could solve for γ(h), and the ACF ρ(h) = γ(h)/γ(0). As in Example 3.8, it is also possible to obtain a homogeneous difference equation directly in terms of γ(h). First, we write

γ(h) = cov(xt+h, xt) = E[(∑_{j=1}^{p} φj xt+h−j + ∑_{j=0}^{q} θj wt+h−j) xt]
     = ∑_{j=1}^{p} φj γ(h − j) + σ_w² ∑_{j=h}^{q} θj ψ_{j−h},   h ≥ 0,    (3.41)

where we have used the fact that xt = ∑_{k=0}^{∞} ψk wt−k and, for h ≥ 0,

E(wt+h−j xt) = E[wt+h−j (∑_{k=0}^{∞} ψk wt−k)] = ψ_{j−h} σ_w².


From (3.41), we can write a general homogeneous equation for the ACF of a causal ARMA process:

γ(h) − φ1γ(h − 1) − · · · − φpγ(h − p) = 0,   h ≥ max(p, q + 1),    (3.42)

with initial conditions

γ(h) − ∑_{j=1}^{p} φj γ(h − j) = σ_w² ∑_{j=h}^{q} θj ψ_{j−h},   0 ≤ h < max(p, q + 1).    (3.43)

Dividing (3.42) and (3.43) through by γ(0) will allow us to solve for the ACF, ρ(h) = γ(h)/γ(0).

Example 3.11 The ACF of an ARMA(1, 1)

Consider the causal ARMA(1, 1) process xt = φxt−1 + θwt−1 + wt, where |φ| < 1. Based on (3.42), the autocovariance function satisfies

γ(h) − φγ(h − 1) = 0,   h = 2, 3, . . . ,

so the general solution is γ(h) = cφ^h, for h = 1, 2, . . . . To obtain the initial conditions, we use (3.43):

γ(0) = φγ(1) + σ_w²[1 + θφ + θ²]
γ(1) = φγ(0) + σ_w²θ.

Solving for γ(0) and γ(1), we obtain:

γ(0) = σ_w² (1 + 2θφ + θ²)/(1 − φ²)
γ(1) = σ_w² (1 + θφ)(φ + θ)/(1 − φ²).

To solve for c, note that γ(1) = cφ, in which case c = γ(1)/φ. Hence, the specific solution is

γ(h) = σ_w² [(1 + θφ)(φ + θ)/(1 − φ²)] φ^{h−1}.

Finally, dividing through by γ(0) yields the ACF

ρ(h) = [(1 + θφ)(φ + θ)/(1 + 2θφ + θ²)] φ^{h−1},   h ≥ 1.    (3.44)
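A quick numerical check of (3.44) (a minimal sketch, not from the text), using hypothetical values φ = .9 and θ = .5:

> phi = .9; theta = .5; h = 1:10
> rho = (1 + theta*phi)*(phi + theta) /
+       (1 + 2*theta*phi + theta^2) * phi^(h-1)              # equation (3.44)
> max(abs(rho - ARMAacf(ar=phi, ma=theta, lag.max=10)[-1]))  # essentially zero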


Example 3.12 The ACF of an AR(p)

For a causal AR(p), it follows immediately from (3.42) that

ρ(h) − φ1ρ(h − 1) − · · · − φpρ(h − p) = 0,   h ≥ p.    (3.45)

Let z1, . . . , zr denote the roots of φ(z), each with multiplicity m1, . . . , mr, respectively, where m1 + · · · + mr = p. Then, from (3.35), the general solution is

ρ(h) = z1^{−h} P1(h) + z2^{−h} P2(h) + · · · + zr^{−h} Pr(h),   h ≥ p,    (3.46)

where Pj(h) is a polynomial in h of degree mj − 1.

Recall that for a causal model, all of the roots are outside the unit circle, |zi| > 1, for i = 1, . . . , r. If all the roots are real, then ρ(h) dampens exponentially fast to zero as h → ∞. If some of the roots are complex, then they will be in conjugate pairs and ρ(h) will dampen, in a sinusoidal fashion, exponentially fast to zero as h → ∞. In the case of complex roots, the time series will appear to be cyclic in nature. This, of course, is also true for ARMA models in which the AR part has complex roots.

The Partial Autocorrelation Function (PACF)

We have seen in (3.39), for MA(q) models, the ACF will be zero for lags greater than q. Moreover, because θq ≠ 0, the ACF will not be zero at lag q. Thus, the ACF provides a considerable amount of information about the order of the dependence when the process is a moving average process. If the process, however, is ARMA or AR, the ACF alone tells us little about the orders of dependence. Hence, it is worthwhile pursuing a function that will behave like the ACF of MA models, but for AR models, namely, the partial autocorrelation function (PACF).

To motivate the idea, consider a causal AR(1) model, xt = φxt−1 + wt. Then,

γ(2) = cov(xt, xt−2) = cov(φxt−1 + wt, xt−2) = cov(φ²xt−2 + φwt−1 + wt, xt−2) = φ²γ(0).

This result follows from causality because xt−2 involves {wt−2, wt−3, . . .}, which are all uncorrelated with wt and wt−1. The correlation between xt and xt−2 is not zero, as it would be for an MA(1), because xt is dependent on xt−2 through xt−1. Suppose we break this chain of dependence by removing (or partialing out) xt−1. That is, we consider the correlation between xt − φxt−1 and xt−2 − φxt−1, because it is the correlation between xt and xt−2 with the linear dependence of each on xt−1 removed. In this way, we have broken the dependence chain between xt and xt−2. In fact,

cov(xt − φxt−1, xt−2 − φxt−1) = cov(wt, xt−2 − φxt−1) = 0.


To formally define the PACF for mean-zero stationary time series, let x_h^{h−1} denote the regression of xh on {xh−1, xh−2, . . . , x1}, which we write as³

x_h^{h−1} = β1 xh−1 + β2 xh−2 + · · · + β_{h−1} x1.    (3.47)

No intercept term is needed in (3.47) because the mean of xt is zero. In addition, let x_0^{h−1} denote the regression of x0 on {x1, x2, . . . , xh−1}; then

x_0^{h−1} = β1 x1 + β2 x2 + · · · + β_{h−1} xh−1.    (3.48)

The coefficients β1, . . . , β_{h−1} are the same in (3.47) and (3.48); we will explain this result in the next section.

Definition 3.9 The partial autocorrelation function (PACF) of a stationary process, xt, denoted φhh, for h = 1, 2, . . . , is

φ11 = corr(x1, x0) = ρ(1)    (3.49)

and

φhh = corr(xh − x_h^{h−1}, x0 − x_0^{h−1}),   h ≥ 2.    (3.50)

Both (xh − x_h^{h−1}) and (x0 − x_0^{h−1}) are uncorrelated with {x1, x2, . . . , xh−1}. By stationarity, the PACF, φhh, is the correlation between xt and xt−h with the linear dependence of {xt−1, . . . , xt−(h−1)} on each, removed. If the process xt is Gaussian, then φhh = corr(xt, xt−h | xt−1, . . . , xt−(h−1)). That is, φhh is the correlation coefficient between xt and xt−h in the bivariate distribution of (xt, xt−h) conditional on {xt−1, . . . , xt−(h−1)}.

Example 3.13 The PACF of a Causal AR(1)

Consider the PACF of the AR(1) process given by xt = φxt−1 + wt, with |φ| < 1. By definition, φ11 = ρ(1) = φ. To calculate φ22, consider the regression of x2 on x1, say, x_2^1 = βx1. We choose β to minimize

E(x2 − βx1)² = γ(0) − 2βγ(1) + β²γ(0).

Taking derivatives and setting the result equal to zero, we have β = γ(1)/γ(0) = ρ(1) = φ. Thus, x_2^1 = φx1. Next, consider the regression of x0 on x1, say x_0^1 = βx1. We choose β to minimize

E(x0 − βx1)² = γ(0) − 2βγ(1) + β²γ(0).

This is the same equation as before, so β = φ and x_0^1 = φx1. Hence, φ22 = corr(x2 − φx1, x0 − φx1). But, note

cov(x2 − φx1, x0 − φx1) = γ(2) − 2φγ(1) + φ²γ(0) = 0

because γ(h) = γ(0)φ^h. Thus, φ22 = 0. In the next example, we will see that in this case, φhh = 0 for all h > 1.

³The term regression here refers to regression in the population sense. That is, x_h^{h−1} is the linear combination of {xh−1, xh−2, . . . , x1} that minimizes E(xh − ∑_{j=1}^{h−1} αj xj)².


Example 3.14 The PACF of a Causal AR(p)

Let xt = ∑_{j=1}^{p} φj xt−j + wt, where the roots of φ(z) are outside the unit circle. In particular, xh = ∑_{j=1}^{p} φj xh−j + wh. When h > p, the regression of xh on xh−1, . . . , x1 is

x_h^{h−1} = ∑_{j=1}^{p} φj xh−j.

We have not proved this obvious result yet, but we will prove it in the next section. Thus, when h > p,

φhh = corr(xh − x_h^{h−1}, x0 − x_0^{h−1}) = corr(wh, x0 − x_0^{h−1}) = 0,

because, by causality, x0 − x_0^{h−1} depends only on {wh−1, wh−2, . . .}; recall equation (3.48). When h ≤ p, φpp is not zero, and φ11, . . . , φ_{p−1,p−1} are not necessarily zero. Figure 3.4 shows the ACF and the PACF of the AR(2) model presented in Example 3.9.

To reproduce Figure 3.4 in R, use the following commands:

> acf = ARMAacf(ar=c(1.5,-.75), ma=0, 24)
> pacf = ARMAacf(ar=c(1.5,-.75), ma=0, 24, pacf=T)
> par(mfrow=c(1,2))
> plot(acf, type="h", xlab="lag")
> abline(h=0)
> plot(pacf, type="h", xlab="lag")
> abline(h=0)

Example 3.15 The PACF of an Invertible MA(q)

For an invertible MA(q), we can write xt = −∑_{j=1}^{∞} πj xt−j + wt. Moreover, no finite representation exists. From this result, it should be apparent that the PACF will never cut off, as in the case of an AR(p).

For an MA(1), xt = wt + θwt−1, with |θ| < 1, calculations similar to Example 3.13 will yield φ22 = −θ²/(1 + θ² + θ⁴). For the MA(1) in general, we can show that

φhh = −(−θ)^h (1 − θ²) / (1 − θ^{2(h+1)}),   h ≥ 1.
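The closed form can be checked numerically (a minimal sketch, not from the text), for instance with θ = .5:

> theta = .5; h = 1:10
> phi.hh = -(-theta)^h*(1 - theta^2)/(1 - theta^(2*(h+1)))     # the formula above
> max(abs(phi.hh - ARMAacf(ma=theta, lag.max=10, pacf=TRUE)))  # essentially zero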

In the next section, we will discuss methods of calculating the PACF. The PACF for MA models behaves much like the ACF for AR models. Also, the PACF for AR models behaves much like the ACF for MA models. Because an invertible ARMA model has an infinite AR representation, the PACF will not cut off. We may summarize these results in Table 3.1.


Figure 3.4 The ACF and PACF, to lag 24, of an AR(2) model with φ1 = 1.5 and φ2 = −.75.

Table 3.1 Behavior of the ACF and PACF for Causal and Invertible ARMA Models

           AR(p)                   MA(q)                   ARMA(p, q)
  ACF      Tails off               Cuts off after lag q    Tails off
  PACF     Cuts off after lag p    Tails off               Tails off

Example 3.16 Preliminary Analysis of the Recruitment Series

We consider the problem of modeling the Recruitment series (number of new fish) shown in Figure 1.5. There are 453 months of observed recruitment ranging over the years 1950-1987. The ACF and the PACF given in Figure 3.5 are consistent with the behavior of an AR(2). The ACF has cycles corresponding roughly to a 12-month period, and the PACF has large values for h = 1, 2 and then is essentially zero for higher order lags. Based on Table 3.1, these results suggest that a second-order (p = 2) autoregressive model might provide a good fit. Although we will discuss estimation in detail in §3.6, we ran a regression (see §2.2) using the data triplets {(y; z1, z2) : (x3; x2, x1), (x4; x3, x2), . . . , (x453; x452, x451)} to fit a model of the form

xt = φ0 + φ1xt−1 + φ2xt−2 + wt


Figure 3.5 ACF and PACF of the Recruitment series.

for t = 3, 4, . . . , 453. The values of the estimates were φ̂0 = 6.74(1.11), φ̂1 = 1.35(.04), φ̂2 = −.46(.04), and σ̂_w² = 90.31, where the estimated standard errors are in parentheses.

To reproduce this analysis and the ACF and PACF in Figure 3.5 in R:

> rec = scan("/mydata/recruit.dat")
> par(mfrow=c(2,1))
> acf(rec, 48)
> pacf(rec, 48)
> fit = ar.ols(rec, aic=F, order.max=2, demean=F, intercept=T)
> fit          # estimates
> fit$asy.se   # standard errors

3.5 Forecasting

In forecasting, the goal is to predict future values of a time series, xn+m, m = 1, 2, . . ., based on the data collected to the present, x = {xn, xn−1, . . . , x1}. Throughout this section, we will assume xt is stationary and the model parameters are known. The problem of forecasting when the model parameters are unknown will be discussed in the next section; also, see Problem 3.25. The minimum mean square error predictor of xn+m is

x_{n+m}^n = E(xn+m | xn, xn−1, . . . , x1)


because the conditional expectation minimizes the mean square error

E[xn+m − g(x)]²,    (3.51)

where g(x) is a function of the observations x; see Problem 3.13.

First, we will restrict attention to predictors that are linear functions of the data, that is, predictors of the form

x_{n+m}^n = α0 + ∑_{k=1}^{n} αk xk,    (3.52)

where α0, α1, . . . , αn are real numbers. Linear predictors of the form (3.52) that minimize the mean square prediction error (3.51) are called best linear predictors (BLPs). As we shall see, linear prediction depends only on the second-order moments of the process, which are easy to estimate from the data. Much of the material in this section is enhanced by the theoretical material presented in Appendix B. For example, Theorem B.3 states that if the process is Gaussian, minimum mean square error predictors and best linear predictors are the same. The following property, which is based on the projection theorem, Theorem B.1 of Appendix B, is a key result.

Property P3.3: Best Linear Prediction for Stationary Processes
Given data x1, . . . , xn, the best linear predictor, x_{n+m}^n = α0 + ∑_{k=1}^{n} αk xk, of xn+m, for m ≥ 1, is found by solving

E[(xn+m − x_{n+m}^n) xk] = 0,   k = 0, 1, . . . , n,    (3.53)

where x0 = 1.

The equations specified in (3.53) are called the prediction equations, and they are used to solve for the coefficients {α0, α1, . . . , αn}. If E(xt) = µ, the first equation (k = 0) of (3.53) implies

E(x_{n+m}^n) = E(xn+m) = µ.

Thus, taking expectation in (3.52), we have

µ = α0 + ∑_{k=1}^{n} αk µ   or   α0 = µ(1 − ∑_{k=1}^{n} αk).

Hence, the form of the BLP is

x_{n+m}^n = µ + ∑_{k=1}^{n} αk(xk − µ).

Thus, until we discuss estimation, there is no loss of generality in considering the case that µ = 0, in which case, α0 = 0.


Consider, first, one-step-ahead prediction. That is, given {x1, . . . , xn}, we wish to forecast the value of the time series at the next time point, xn+1. The BLP of xn+1 is

x_{n+1}^n = φn1 xn + φn2 xn−1 + · · · + φnn x1,    (3.54)

where, for purposes that will become clear shortly, we have written αk in (3.52) as φ_{n,n+1−k} in (3.54), for k = 1, . . . , n. Using Property P3.3, the coefficients {φn1, φn2, . . . , φnn} satisfy

E[(xn+1 − ∑_{j=1}^{n} φnj xn+1−j) xn+1−k] = 0,   k = 1, . . . , n,

or

∑_{j=1}^{n} φnj γ(k − j) = γ(k),   k = 1, . . . , n.    (3.55)

The prediction equations (3.55) can be written in matrix notation as

Γn φn = γn,    (3.56)

where Γn = {γ(k − j)}_{j,k=1}^{n} is an n × n matrix, φn = (φn1, . . . , φnn)′ is an n × 1 vector, and γn = (γ(1), . . . , γ(n))′ is an n × 1 vector.

The matrix Γn is nonnegative definite. If Γn is singular, there are many solutions to (3.56), but, by the projection theorem (Theorem B.1), x_{n+1}^n is unique. If Γn is nonsingular, the elements of φn are unique, and are given by

φn = Γn^{−1} γn.    (3.57)

For ARMA models, the fact that σ_w² > 0 and γ(h) → 0 as h → ∞ is enough to ensure that Γn is positive definite (Problem 3.11). It is sometimes convenient to write the one-step-ahead forecast in vector notation

x_{n+1}^n = φn′ x,    (3.58)

where x = (xn, xn−1, . . . , x1)′.

The mean square one-step-ahead prediction error is

P_{n+1}^n = E(xn+1 − x_{n+1}^n)² = γ(0) − γn′ Γn^{−1} γn.    (3.59)

To verify (3.59) using (3.57) and (3.58),

E(xn+1 − x_{n+1}^n)² = E(xn+1 − φn′ x)² = E(xn+1 − γn′ Γn^{−1} x)²
  = E(xn+1² − 2γn′ Γn^{−1} x xn+1 + γn′ Γn^{−1} x x′ Γn^{−1} γn)
  = γ(0) − 2γn′ Γn^{−1} γn + γn′ Γn^{−1} Γn Γn^{−1} γn
  = γ(0) − γn′ Γn^{−1} γn.
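As an illustration (a minimal sketch, not from the text), (3.57) and (3.59) can be evaluated directly for the AR(2) of Example 3.9 with σ_w² = 1, using γ(h) = γ(0)ρ(h); the sample size n = 5 here is arbitrary:

> phi = c(1.5, -.75); n = 5
> gamma0 = 1 + sum(ARMAtoMA(ar=phi, ma=0, 500)^2)   # gamma(0) = sigma_w^2*sum(psi_j^2)
> gamma = gamma0*ARMAacf(ar=phi, lag.max=n)         # gamma(0), ..., gamma(n)
> Gam = toeplitz(gamma[1:n])                        # Gamma_n
> gam = gamma[2:(n+1)]                              # gamma_n = (gamma(1), ..., gamma(n))'
> phin = solve(Gam, gam)                            # equation (3.57): essentially (1.5, -.75, 0, 0, 0)
> gamma0 - sum(gam*phin)                            # P_{n+1}^n from (3.59): about sigma_w^2 = 1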


Example 3.17 Prediction for an AR(2)

Suppose we have a causal AR(2) process xt = φ1xt−1 + φ2xt−2 + wt, and one observation x1. Then, using equation (3.57), the one-step-ahead prediction of x2 based on x1 is

x_2^1 = φ11 x1 = [γ(1)/γ(0)] x1 = ρ(1) x1.

Now, suppose we want the one-step-ahead prediction of x3 based on two observations x1 and x2. We could use (3.57) again and solve

x_3^2 = φ21 x2 + φ22 x1 = (γ(1), γ(2)) [ γ(0)  γ(1) ; γ(1)  γ(0) ]^{−1} (x2, x1)′,

but, it should be apparent from the model that x_3^2 = φ1x2 + φ2x1. Because φ1x2 + φ2x1 satisfies the prediction equations (3.53),

E{[x3 − (φ1x2 + φ2x1)] x1} = E(w3x1) = 0,
E{[x3 − (φ1x2 + φ2x1)] x2} = E(w3x2) = 0,

it follows that, indeed, x_3^2 = φ1x2 + φ2x1, and by the uniqueness of the coefficients in this case, that φ21 = φ1 and φ22 = φ2. Continuing in this way, it is easy to verify that, for n ≥ 2,

x_{n+1}^n = φ1xn + φ2xn−1.

That is, φn1 = φ1, φn2 = φ2, and φnj = 0, for j = 3, 4, . . . , n.

From Example 3.17, it should be clear (Problem 3.38) that, if the time series is a causal AR(p) process, then, for n ≥ p,

x_{n+1}^n = φ1xn + φ2xn−1 + · · · + φp x_{n−p+1}.    (3.60)

For ARMA models in general, the prediction equations will not be as simple as the pure AR case. In addition, for n large, the use of (3.57) is prohibitive because it requires the inversion of a large matrix. There are, however, iterative solutions that do not require any matrix inversion. In particular, we mention the recursive solution due to Levinson (1947) and Durbin (1960).

Property P3.4: The Durbin–Levinson Algorithm
Equations (3.57) and (3.59) can be solved iteratively as follows:

φ00 = 0,   P_1^0 = γ(0).    (3.61)

For n ≥ 1,

φnn = [ρ(n) − ∑_{k=1}^{n−1} φ_{n−1,k} ρ(n − k)] / [1 − ∑_{k=1}^{n−1} φ_{n−1,k} ρ(k)],   P_{n+1}^n = P_n^{n−1}(1 − φnn²),    (3.62)


where, for n ≥ 2,

φ_{nk} = φ_{n−1,k} − φnn φ_{n−1,n−k},   k = 1, 2, . . . , n − 1.    (3.63)

The proof of Property P3.4 is left as an exercise; see Problem 3.12.

Example 3.18 Using the Durbin–Levinson Algorithm

To use the algorithm, start with φ00 = 0, P_1^0 = γ(0). Then, for n = 1,

φ11 = ρ(1)   and   P_2^1 = γ(0)[1 − φ11²].

For n = 2,

φ22 = [ρ(2) − φ11 ρ(1)] / [1 − φ11 ρ(1)] = [ρ(2) − ρ(1)²] / [1 − ρ(1)²],
φ21 = φ11 − φ22 φ11 = ρ(1)[1 − φ22],
P_3^2 = γ(0)[1 − φ11²][1 − φ22²].

For n = 3,

φ33 = [ρ(3) − φ21 ρ(2) − φ22 ρ(1)] / [1 − φ21 ρ(1) − φ22 ρ(2)],

and so on.
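A direct R transcription of (3.61)-(3.63) might look as follows (a minimal sketch, not from the text; durbin.levinson is simply an illustrative name). Applied to the ACF of the AR(2) in Example 3.9, the returned φnn agree with the PACF:

> durbin.levinson = function(rho, gamma0) {
+   # rho = (rho(1),...,rho(N)); returns phi_nn and P_{n+1}^n for n = 1,...,N
+   N = length(rho); phi = matrix(0, N, N); P = numeric(N)
+   phi[1,1] = rho[1]; P[1] = gamma0*(1 - phi[1,1]^2)
+   for (n in 2:N) {
+     phi[n,n] = (rho[n] - sum(phi[n-1,1:(n-1)]*rho[(n-1):1])) /
+                (1 - sum(phi[n-1,1:(n-1)]*rho[1:(n-1)]))            # (3.62)
+     phi[n,1:(n-1)] = phi[n-1,1:(n-1)] - phi[n,n]*phi[n-1,(n-1):1]  # (3.63)
+     P[n] = P[n-1]*(1 - phi[n,n]^2)                                 # (3.62)
+   }
+   list(pacf = diag(phi), P = P)
+ }
> rho = ARMAacf(ar=c(1.5,-.75), lag.max=5)[-1]
> durbin.levinson(rho, gamma0=1)$pacf   # compare with ARMAacf(ar=c(1.5,-.75), lag.max=5, pacf=TRUE)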

An important consequence of the Durbin–Levinson algorithm is (see Problem 3.12) as follows.

Property P3.5: Iterative Solution for the PACF
The PACF of a stationary process xt can be obtained iteratively via (3.62) as φnn, for n = 1, 2, . . . .

Example 3.19 The PACF of an AR(2)

From Example 3.14, we know that for an AR(2), φhh = 0 for h > 2, but we will use the results of Example 3.17 and Property P3.5 to calculate the first three values of the PACF. Recall (Example 3.8) that for an AR(2), ρ(1) = φ1/(1 − φ2), and in general ρ(h) − φ1ρ(h − 1) − φ2ρ(h − 2) = 0, for h ≥ 2. Then,

φ11 = ρ(1) = φ1/(1 − φ2)

φ22 = [ρ(2) − ρ(1)²] / [1 − ρ(1)²]
    = {[φ1 (φ1/(1 − φ2)) + φ2] − (φ1/(1 − φ2))²} / {1 − (φ1/(1 − φ2))²} = φ2

φ21 = φ1

φ33 = [ρ(3) − φ1ρ(2) − φ2ρ(1)] / [1 − φ1ρ(1) − φ2ρ(2)] = 0.


So far, we have concentrated on one-step-ahead prediction, but Property P3.3 allows us to calculate the BLP of xn+m for any m ≥ 1. Given data, {x1, . . . , xn}, the m-step-ahead predictor is

x_{n+m}^n = φ_{n1}^{(m)} xn + φ_{n2}^{(m)} xn−1 + · · · + φ_{nn}^{(m)} x1,    (3.64)

where {φ_{n1}^{(m)}, φ_{n2}^{(m)}, . . . , φ_{nn}^{(m)}} satisfy the prediction equations,

∑_{j=1}^{n} φ_{nj}^{(m)} E(xn+1−j xn+1−k) = E(xn+m xn+1−k),   k = 1, . . . , n,

or

∑_{j=1}^{n} φ_{nj}^{(m)} γ(k − j) = γ(m + k − 1),   k = 1, . . . , n.    (3.65)

The prediction equations can again be written in matrix notation as

Γn φn^{(m)} = γn^{(m)},    (3.66)

where γn^{(m)} = (γ(m), . . . , γ(m + n − 1))′ and φn^{(m)} = (φ_{n1}^{(m)}, . . . , φ_{nn}^{(m)})′ are n × 1 vectors.

The mean square m-step-ahead prediction error is

P_{n+m}^n = E(xn+m − x_{n+m}^n)² = γ(0) − γn^{(m)′} Γn^{−1} γn^{(m)}.    (3.67)

Another useful algorithm for calculating forecasts was given by Brockwell and Davis (1991, Chapter 5). This algorithm follows directly from applying the projection theorem (Theorem B.1) to the innovations, xt − x_t^{t−1}, for t = 1, . . . , n, using the fact that the innovations xt − x_t^{t−1} and xs − x_s^{s−1} are uncorrelated for s ≠ t (see Problem 3.39). We present the case in which xt is a mean-zero stationary time series.

Property P3.6: The Innovations Algorithm
The one-step-ahead predictors, x_{t+1}^t, and their mean-squared errors, P_{t+1}^t, can be calculated iteratively as

x_1^0 = 0,   P_1^0 = γ(0)

x_{t+1}^t = ∑_{j=1}^{t} θtj (xt+1−j − x_{t+1−j}^{t−j}),   t = 1, 2, . . .    (3.68)

P_{t+1}^t = γ(0) − ∑_{j=0}^{t−1} θ_{t,t−j}² P_{j+1}^j,   t = 1, 2, . . . ,    (3.69)

where, for j = 0, 1, . . . , t − 1,

θ_{t,t−j} = (γ(t − j) − ∑_{k=0}^{j−1} θ_{j,j−k} θ_{t,t−k} P_{k+1}^k) (P_{j+1}^j)^{−1}.    (3.70)


Given data x1, . . . , xn, the innovations algorithm can be calculated successively for t = 1, then t = 2 and so on, in which case the calculation of x_{n+1}^n and P_{n+1}^n is made at the final step t = n. The m-step-ahead predictor and its mean-square error based on the innovations algorithm (Problem 3.39) are given by

x_{n+m}^n = ∑_{j=m}^{n+m−1} θ_{n+m−1,j} (x_{n+m−j} − x_{n+m−j}^{n+m−j−1}),    (3.71)

P_{n+m}^n = γ(0) − ∑_{j=m}^{n+m−1} θ_{n+m−1,j}² P_{n+m−j}^{n+m−j−1},    (3.72)

where the θ_{n+m−1,j} are obtained by continued iteration of (3.70).

Example 3.20 Prediction for an MA(1)

The innovations algorithm lends itself well to prediction for moving average processes. Consider an MA(1) model, xt = wt + θwt−1. Recall that γ(0) = (1 + θ²)σ_w², γ(1) = θσ_w², and γ(h) = 0 for h > 1. Then, using Property P3.6, we have

θn1 = θσ_w²/P_n^{n−1}
θnj = 0,   j = 2, . . . , n
P_1^0 = (1 + θ²)σ_w²
P_{n+1}^n = (1 + θ² − θθn1)σ_w².

Finally, from (3.68), the one-step-ahead predictor is

x_{n+1}^n = θ(xn − x_n^{n−1}) σ_w²/P_n^{n−1}.
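The MA(1) recursions above are simple to run directly (a minimal sketch, not from the text), here on simulated data with θ = .9 and σ_w² = 1:

> set.seed(1)
> theta = .9; n = 100
> x = arima.sim(list(ma=theta), n=n)        # MA(1) with sigma_w^2 = 1
> xhat = numeric(n+1); P = numeric(n+1)     # xhat[t+1] = x_{t+1}^t, P[t+1] = P_{t+1}^t
> xhat[1] = 0; P[1] = 1 + theta^2           # x_1^0 = 0, P_1^0 = (1+theta^2)sigma_w^2
> for (t in 1:n) {
+   theta.t1 = theta/P[t]                   # theta_{t1} = theta*sigma_w^2/P_t^{t-1}
+   xhat[t+1] = theta.t1*(x[t] - xhat[t])   # one-step-ahead predictor (3.68)
+   P[t+1] = 1 + theta^2 - theta*theta.t1   # its mean square error
+ }
> c(xhat[n+1], P[n+1])                      # forecast of x_{n+1} and P_{n+1}^n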

Forecasting ARMA Processes

The general prediction equations (3.53) provide little insight into forecasting for ARMA models in general. There are a number of different ways to express these forecasts, and each aids in understanding the special structure of ARMA prediction. Throughout, we assume xt is a causal and invertible ARMA(p, q) process, φ(B)xt = θ(B)wt, where wt ∼ iid N(0, σ_w²). In the non-zero mean case, E(xt) = µ, simply replace xt with xt − µ in the model. First, we consider two types of forecasts. We write x_{n+m}^n to mean the minimum mean square error predictor of xn+m based on the data {xn, . . . , x1}, that is,

x_{n+m}^n = E(xn+m | xn, . . . , x1).

For ARMA models, it is easier to calculate the predictor of xn+m assuming we have the complete history of the process {xn, xn−1, . . .}. We will denote the predictor of xn+m based on the infinite past as

x̃_{n+m} = E(xn+m | xn, xn−1, . . .).


The idea here is that, for large samples, x̃_{n+m} will provide a good approximation to x_{n+m}^n.

Now, write xn+m in its causal and invertible forms:

xn+m = ∑_{j=0}^{∞} ψj wn+m−j,   ψ0 = 1    (3.73)

wn+m = ∑_{j=0}^{∞} πj xn+m−j,   π0 = 1.    (3.74)

Then, taking conditional expectations in (3.73), we have

x̃_{n+m} = ∑_{j=0}^{∞} ψj w̃_{n+m−j} = ∑_{j=m}^{∞} ψj wn+m−j,    (3.75)

because, by (3.74),

w̃_t ≡ E(wt | xn, xn−1, . . .) = 0 for t > n, and w̃_t = wt for t ≤ n.

Similarly, taking conditional expectations in (3.74), we have

0 = x̃_{n+m} + ∑_{j=1}^{∞} πj x̃_{n+m−j},

or

x̃_{n+m} = −∑_{j=1}^{m−1} πj x̃_{n+m−j} − ∑_{j=m}^{∞} πj xn+m−j,    (3.76)

using the fact E(xt | xn, xn−1, . . .) = xt, for t ≤ n. Prediction is accomplished recursively using (3.76), starting with the one-step-ahead predictor, m = 1, and then continuing for m = 2, 3, . . .. Using (3.75), we can write

xn+m − x̃_{n+m} = ∑_{j=0}^{m−1} ψj wn+m−j,

so the mean square prediction error can be written as

P_{n+m}^n = E(xn+m − x̃_{n+m})² = σ_w² ∑_{j=0}^{m−1} ψj².    (3.77)

Also, we note, for a fixed sample size, n, the prediction errors are correlated. That is, for k ≥ 1,

E{(xn+m − x̃_{n+m})(xn+m+k − x̃_{n+m+k})} = σ_w² ∑_{j=0}^{m−1} ψj ψ_{j+k}.    (3.78)


Example 3.21 Long-Range Forecasts

Consider forecasting an ARMA process with mean µ. From the zero-mean case in (3.75) we can deduce that the m-step-ahead forecast can be written as

x̃_{n+m} = µ + ∑_{j=m}^{∞} ψj wn+m−j.    (3.79)

Noting that the ψ-weights dampen to zero exponentially fast, it is clear that

x̃_{n+m} → µ

exponentially fast (in the mean square sense) as m → ∞. Moreover, by (3.77), the mean square prediction error

P_{n+m}^n → σ_w² ∑_{j=0}^{∞} ψj²,    (3.80)

exponentially fast as m → ∞.

It should be clear from (3.79) and (3.80) that ARMA forecasts quickly settle to the mean with a constant prediction error as the forecast horizon, m, grows. This effect can be seen in Figure 3.6 where the recruitment series is forecast for 24 months; see Example 3.23.

When n is small, the general prediction equations (3.53) can be used easily. When n is large, we would use (3.76) by truncating, because only the data x1, x2, . . . , xn are available. In this case, we can truncate (3.76) by setting ∑_{j=n+m}^{∞} πj xn+m−j = 0. The truncated predictor is then written as

x_{n+m}^n = −∑_{j=1}^{m−1} πj x_{n+m−j}^n − ∑_{j=m}^{n+m−1} πj xn+m−j,    (3.81)

which is also calculated recursively, m = 1, 2, . . .. The mean square prediction error, in this case, is approximated using (3.77).

For AR(p) models, and when n > p, equation (3.60) yields the exact predictor, x_{n+m}^n, of xn+m, and there is no need for approximations. That is, for n > p, x_{n+m}^n = x̃_{n+m}, and the truncated predictor (3.81) is exact as well. Also, in this case, the one-step-ahead prediction error is E(xn+1 − x_{n+1}^n)² = σ_w². For general ARMA(p, q) models, the truncated predictors (Problem 3.15) for m = 1, 2, . . . , are

x_{n+m}^n = φ1 x_{n+m−1}^n + · · · + φp x_{n+m−p}^n + θ1 w_{n+m−1}^n + · · · + θq w_{n+m−q}^n,    (3.82)

where x_t^n = xt for 1 ≤ t ≤ n and x_t^n = 0 for t ≤ 0. The truncated prediction errors are given by: w_t^n = 0 for t ≤ 0 or t > n, and w_t^n = φ(B)x_t^n − θ1 w_{t−1}^n − · · · − θq w_{t−q}^n for 1 ≤ t ≤ n.


Example 3.22 Forecasting an ARMA(1, 1) Series

Given data x1, . . . , xn, for forecasting purposes, write the model as

xn+1 = φxn + wn+1 + θwn.

Then, based on (3.82), the one-step-ahead truncated forecast is

x_{n+1}^n = φxn + 0 + θ w_n^n.

For m ≥ 2, we have

x_{n+m}^n = φ x_{n+m−1}^n,

which can be calculated recursively, m = 2, 3, . . . .

To calculate w_n^n, which is needed to initialize the successive forecasts, the model can be written as wt = xt − φxt−1 − θwt−1 for t = 1, . . . , n. For truncated forecasting, using (3.82), put w_0^n = 0, w_1^n = x1, and then iterate the errors forward in time

w_t^n = xt − φxt−1 − θ w_{t−1}^n,   t = 2, . . . , n.

The approximate forecast variance is computed from (3.77) using the ψ-weights determined as in Example 3.10. In particular, the ψ-weights satisfy ψj = (φ + θ)φ^{j−1}, for j ≥ 1. This result gives

P_{n+m}^n = σ_w² [1 + (φ + θ)² ∑_{j=1}^{m−1} φ^{2(j−1)}] = σ_w² [1 + (φ + θ)²(1 − φ^{2(m−1)})/(1 − φ²)].

To assess the precision of the forecasts, prediction intervals are typically calculated along with the forecasts. In general, (1 − α) prediction intervals are of the form

x_{n+m}^n ± c_{α/2} √(P_{n+m}^n),    (3.83)

where c_{α/2} is chosen to get the desired degree of confidence. For example, if the process is Gaussian, then choosing c_{α/2} = 2 will yield an approximate 95% prediction interval for xn+m. If we are interested in establishing prediction intervals over more than one time period, then c_{α/2} should be adjusted appropriately, for example, by using Bonferroni's inequality [see (4.55) in Chapter 4 or Johnson and Wichern, 1992, Chapter 5].


Figure 3.6 Twenty-four month forecasts for the Recruitment series. The actual data shown are from January 1980 to September 1987, and then forecasts plus and minus one standard error are displayed.

Example 3.23 Forecasting the Recruitment Series

Using the parameter estimates as the actual parameter values, Figure 3.6 shows the result of forecasting the Recruitment series given in Example 3.16 over a 24-month horizon, m = 1, 2, . . . , 24. The actual forecasts are calculated as

x_{n+m}^n = 6.74 + 1.35 x_{n+m−1}^n − .46 x_{n+m−2}^n

for n = 453 and m = 1, 2, . . . , 24. Recall that x_t^s = xt when t ≤ s. The forecast errors P_{n+m}^n are calculated using (3.77). Recall that σ̂_w² = 90.31, and using (3.36) from Example 3.10, we have ψj = 1.35ψj−1 − .46ψj−2 for j ≥ 2, where ψ0 = 1 and ψ1 = 1.35. Thus, for n = 453,

P_{n+1}^n = 90.31,
P_{n+2}^n = 90.31(1 + 1.35²),
P_{n+3}^n = 90.31(1 + 1.35² + [1.35² − .46]²),

and so on.

Note how the forecast levels off quickly and the prediction intervals are wide, even though in this case the forecast limits are only based on one standard error; that is, x_{n+m}^n ± √(P_{n+m}^n). We will revisit this problem, including appropriate R commands, in Example 3.26.

We complete this section with a brief discussion of backcasting. In backcasting, we want to predict x1−m, m = 1, 2, . . ., based on the data {x1, . . . , xn}.


Write the backcast as

x_{1−m}^n = ∑_{j=1}^{n} αj xj.    (3.84)

Analogous to (3.65), the prediction equations (assuming µ = 0) are

∑_{j=1}^{n} αj E(xj xk) = E(x1−m xk),   k = 1, . . . , n,    (3.85)

or

∑_{j=1}^{n} αj γ(k − j) = γ(m + k − 1),   k = 1, . . . , n.    (3.86)

These equations are precisely the prediction equations for forward prediction. That is, αj ≡ φ_{nj}^{(m)}, for j = 1, . . . , n, where the φ_{nj}^{(m)} are given by (3.66). Finally, the backcasts are given by

x_{1−m}^n = φ_{n1}^{(m)} x1 + · · · + φ_{nn}^{(m)} xn,   m = 1, 2, . . . .    (3.87)

Example 3.24 Backcasting an ARMA(1, 1)

Consider a causal and invertible ARMA(1,1) process, xt = φxt−1 + θwt−1 + wt; we will call this the forward model. We have just seen that best linear prediction backward in time is the same as best linear prediction forward in time for stationary models. Because we are assuming ARMA models are Gaussian, we also have that minimum mean square error prediction backward in time is the same as forward in time for ARMA models. Thus, the process can equivalently be generated by the backward model xt = φxt+1 + θvt+1 + vt, where {vt} is a Gaussian⁴ white noise process with variance σ_w². We may write xt = ∑_{j=0}^{∞} ψj vt+j, where ψ0 = 1; this means that xt is uncorrelated with {vt−1, vt−2, . . .}, in analogy to the forward model.

Given data {x1, . . . , xn}, truncate v_n^n = E(vn | x1, . . . , xn) to zero. That is, put v_n^n = 0 as an initial approximation, and then generate the errors backward

v_t^n = xt − φxt+1 − θ v_{t+1}^n,   t = (n − 1), (n − 2), . . . , 1.

Then,

x_0^n = φx1 + θ v_1^n + v_0^n = φx1 + θ v_1^n,

because v_t^n = 0 for t ≤ 0. Continuing, the general truncated backcasts are given by

x_{1−m}^n = φ x_{2−m}^n,   m = 2, 3, . . . .

⁴In the stationary Gaussian case, (a) the distribution of {xn+1, xn, . . . , x1} is the same as (b) the distribution of {x0, x1, . . . , xn}. In forecasting we use (a) to obtain E(xn+1 | xn, . . . , x1); in backcasting we use (b) to obtain E(x0 | x1, . . . , xn). Because (a) and (b) are the same, the two problems are equivalent.


3.6 Estimation

Throughout this section, we assume we have n observations, x1, . . . , xn, from a causal and invertible Gaussian ARMA(p, q) process in which, initially, the order parameters, p and q, are known. Our goal is to estimate the parameters, φ1, . . . , φp, θ1, . . . , θq, and σ_w². We will discuss the problem of determining p and q later in this section.

We begin with method of moments estimators. The idea behind these estimators is that of equating population moments to sample moments and then solving for the parameters in terms of the sample moments. We immediately see that, if E(xt) = µ, then the method of moments estimator of µ is the sample average, x̄. Thus, while discussing method of moments, we will assume µ = 0. Although the method of moments can produce good estimators, they can sometimes lead to suboptimal estimators. We first consider the case in which the method leads to optimal (efficient) estimators, that is, AR(p) models.

When the process is AR(p),

xt = φ1xt−1 + · · · + φpxt−p + wt,

the first p + 1 equations of (3.42) and (3.43), h = 0, 1, . . . , p, lead to the following:

Definition 3.10 The Yule–Walker equations are given by

γ(h) = φ1γ(h − 1) + · · · + φpγ(h − p),   h = 1, 2, . . . , p,    (3.88)
σ_w² = γ(0) − φ1γ(1) − · · · − φpγ(p).    (3.89)

In matrix notation, the Yule–Walker equations are

Γp φ = γp,   σ_w² = γ(0) − φ′γp,    (3.90)

where Γp = {γ(k − j)}_{j,k=1}^{p} is a p × p matrix, φ = (φ1, . . . , φp)′ is a p × 1 vector, and γp = (γ(1), . . . , γ(p))′ is a p × 1 vector. Using the method of moments, we replace γ(h) in (3.90) by the sample autocovariance γ̂(h) [see equation (1.36)] and solve

φ̂ = Γ̂p^{−1} γ̂p,   σ̂_w² = γ̂(0) − γ̂p′ Γ̂p^{−1} γ̂p.    (3.91)

These estimators are typically called the Yule–Walker estimators. For calculation purposes, it is sometimes more convenient to work with the sample ACF. By factoring γ̂(0) in (3.91), we can write the Yule–Walker estimates as

φ̂ = R̂p^{−1} ρ̂p,   σ̂_w² = γ̂(0)[1 − ρ̂p′ R̂p^{−1} ρ̂p],    (3.92)

where R̂p = {ρ̂(k − j)}_{j,k=1}^{p} is a p × p matrix and ρ̂p = (ρ̂(1), . . . , ρ̂(p))′ is a p × 1 vector.
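These equations are easy to evaluate directly. The following sketch (not from the text) computes the Yule–Walker estimates (3.92) for simulated AR(2) data from the sample ACF; the estimates of φ1 and φ2 should essentially agree with those from ar.yw:

> set.seed(3)
> x = arima.sim(list(ar=c(1.5,-.75)), n=144)
> r = acf(x, lag.max=2, plot=FALSE)$acf[2:3]                     # rho-hat(1), rho-hat(2)
> g0 = acf(x, lag.max=0, type="covariance", plot=FALSE)$acf[1]   # gamma-hat(0)
> R = matrix(c(1, r[1], r[1], 1), 2, 2)                          # R-hat_p
> phi.hat = solve(R, r)                                          # equation (3.92)
> sig2.hat = g0*(1 - sum(r*phi.hat))
> phi.hat; sig2.hat
> ar.yw(x, order.max=2, aic=FALSE)$ar                            # compare the phi estimates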


For AR(p) models, if the sample size is large, the Yule–Walker estimators are approximately normally distributed, and σ̂_w² is close to the true value of σ_w². We state these results in Property P3.7. For details, see Appendix B, §B.3.

Property P3.7: Large Sample Results for Yule–Walker Estimators
The asymptotic (n → ∞) behavior of the Yule–Walker estimators in the case of causal AR(p) processes is as follows:

√n (φ̂ − φ) →d N(0, σ_w² Γp^{−1}),   σ̂_w² →p σ_w².    (3.93)

The Durbin–Levinson algorithm, (3.61)-(3.63), can be used to calculate φ̂ without inverting Γ̂p or R̂p, by replacing γ(h) by γ̂(h) in the algorithm. In running the algorithm, we will iteratively calculate the h × 1 vector, φ̂h = (φ̂h1, . . . , φ̂hh)′, for h = 1, 2, . . .. Thus, in addition to obtaining the desired forecasts, the Durbin–Levinson algorithm yields φ̂hh, the sample PACF. Using (3.93), we can show the following property.

Property P3.8: Large Sample Distribution of the PACF
For a causal AR(p) process, asymptotically (n → ∞),

√n φ̂hh →d N(0, 1),   for h > p.    (3.94)

Example 3.25 Yule–Walker Estimation for an AR(2) Process

The data shown in Figure 3.3 were n = 144 simulated observations from the AR(2) model

xt = 1.5xt−1 − .75xt−2 + wt,

where wt ∼ iid N(0, 1). For these data, γ̂(0) = 8.434, ρ̂(1) = .834, and ρ̂(2) = .476. Thus,

φ̂ = (φ̂1, φ̂2)′ = [ 1  .834 ; .834  1 ]^{−1} (.834, .476)′ = (1.439, −.725)′

and

σ̂_w² = 8.434 [1 − (.834, .476)(1.439, −.725)′] = 1.215.

By Property P3.7, the asymptotic variance–covariance matrix of φ̂,

(1/144)(1.215/8.434) [ 1  .834 ; .834  1 ]^{−1} = [ .057²  −.003 ; −.003  .057² ],

can be used to get confidence regions for, or make inferences about, φ̂ and its components. For example, an approximate 95% confidence interval


for φ2 is −.725 ± 2(.057), or (−.839, −.611), which contains the true value of φ2 = −.75.

For these data, the first three sample partial autocorrelations were φ̂11 = ρ̂(1) = .834, φ̂22 = φ̂2 = −.725, and φ̂33 = −.075. According to Property P3.8, the asymptotic standard error of φ̂33 is 1/√144 = .083, and the observed value, −.075, is less than one standard deviation from φ33 = 0.

Example 3.26 Yule–Walker Estimation of the Recruitment Series

In Example 3.16 we fit an AR(2) model to the recruitment series using regression. Below are the results of fitting the same model using Yule–Walker estimation in R (assuming the data are in rec), which are nearly identical to the values in Example 3.16.

> rec.yw = ar.yw(rec, order=2)
> rec.yw$x.mean
[1] 62.26278                # mean estimate
> rec.yw$ar
[1] 1.3315874 -.4445447     # phi1 and phi2 estimates
> sqrt(diag(rec.yw$asy.var.coef))
[1] .04222637 .04222637     # their standard errors
> rec.yw$var.pred
[1] 94.79912                # error variance estimate

To obtain the 24-month-ahead predictions and their standard errors, and then plot the results as in Example 3.23, use the R commands:

> rec.pr = predict(rec.yw, n.ahead=24)
> U = rec.pr$pred + rec.pr$se
> L = rec.pr$pred - rec.pr$se
> month = 360:453
> plot(month, rec[month], type="o", xlim=c(360,480), ylab="recruits")
> lines(rec.pr$pred, col="red", type="o")
> lines(U, col="blue", lty="dashed")
> lines(L, col="blue", lty="dashed")

In the case of AR(p) models, the Yule–Walker estimators given in (3.92) are optimal in the sense that the asymptotic distribution, (3.93), is the best asymptotic normal distribution. This is because, given initial conditions, AR(p) models are linear models, and the Yule–Walker estimators are essentially least squares estimators. If we use method of moments for MA or ARMA models, we will not get optimal estimators because such processes are nonlinear in the parameters.


Example 3.27 Method of Moments Estimation for an MA(1) Process

Consider the time series

xt = wt + θwt−1,

where |θ| < 1. The model can then be written as

xt = ∑_{j=1}^{∞} (−θ)^j xt−j + wt,

which is nonlinear in θ. The first two population autocovariances are γ(0) = σ_w²(1 + θ²) and γ(1) = σ_w²θ, so the estimate of θ is found by solving

ρ̂(1) = γ̂(1)/γ̂(0) = θ̂/(1 + θ̂²).

Two solutions exist, so we would pick the invertible one. If |ρ̂(1)| ≤ 1/2, the solutions are real, otherwise, a real solution does not exist. Even though |ρ(1)| < 1/2 for an invertible MA(1), it may happen that |ρ̂(1)| ≥ 1/2 because it is an estimator. When |ρ̂(1)| < 1/2, the invertible estimate is

θ̂ = [1 − √(1 − 4ρ̂(1)²)] / (2ρ̂(1)).

It can be shown⁵ that

θ̂ ∼ AN(θ, [1 + θ² + 4θ⁴ + θ⁶ + θ⁸] / [n(1 − θ²)²]).

The maximum likelihood estimator (which we discuss next) of θ, in this case, has an asymptotic variance of (1 − θ²)/n. When θ = .5, for example, the ratio of the asymptotic variance of the method of moments estimator to the maximum likelihood estimator of θ is about 3.5. That is, for large samples, the variance of the method of moments estimator is about 3.5 times larger than the variance of the MLE of θ when θ = .5.
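A small sketch of the method of moments calculation (not from the text), on simulated MA(1) data with θ = .5:

> set.seed(4)
> x = arima.sim(list(ma=.5), n=500)
> r1 = acf(x, lag.max=1, plot=FALSE)$acf[2]               # rho-hat(1)
> theta.mom = if (abs(r1) < .5) (1 - sqrt(1 - 4*r1^2))/(2*r1) else NA
> theta.mom                                               # invertible solution, near .5
> sqrt((1 + theta.mom^2 + 4*theta.mom^4 + theta.mom^6 + theta.mom^8) /
+      (500*(1 - theta.mom^2)^2))                         # approximate standard error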

Maximum Likelihood and Least Squares Estimation

To fix ideas, we first focus on the causal AR(1) case. Let

xt = µ + φ(xt−1 − µ) + wt

where |φ| < 1 and wt ∼ iid N(0, σ_w²). Given data x1, x2, . . . , xn, we seek the likelihood

L(µ, φ, σ_w²) = f_{µ,φ,σ_w²}(x1, x2, . . . , xn).

⁵The result follows by using the delta method and Theorem A.7 given in Appendix A. See the proof of Theorem A.7 for details on the delta method.


In the case of an AR(1), we may write the likelihood as

L(µ, φ, σ_w²) = f(x1) f(x2 | x1) · · · f(xn | xn−1),

where we have dropped the parameters in the densities, f(·), to ease the notation. Because xt | xt−1 ∼ N(µ + φ(xt−1 − µ), σ_w²), we have

f(xt | xt−1) = fw[(xt − µ) − φ(xt−1 − µ)],

where fw(·) is the density of wt, that is, the normal density with mean zero and variance σ_w². We may then write the likelihood as

L(µ, φ, σ_w²) = f(x1) ∏_{t=2}^{n} fw[(xt − µ) − φ(xt−1 − µ)].

To find f(x1), we can use the causal representation

x1 = µ + ∑_{j=0}^{∞} φ^j w1−j

to see that x1 is normal, with mean µ and variance σ_w²/(1 − φ²). Finally, for an AR(1), the likelihood is

L(µ, φ, σ_w²) = (2πσ_w²)^{−n/2} (1 − φ²)^{1/2} exp[−S(µ, φ)/(2σ_w²)],    (3.95)

where

S(µ, φ) = (1 − φ²)(x1 − µ)² + ∑_{t=2}^{n} [(xt − µ) − φ(xt−1 − µ)]².    (3.96)

Typically, S(µ, φ) is called the unconditional sum of squares. We could have also considered the estimation of µ and φ using unconditional least squares, that is, estimation by minimizing S(µ, φ).

Taking the partial derivative of the log of (3.95) with respect to σ_w² and setting the result equal to zero, we see that for any given values of µ and φ in the parameter space, σ_w² = n^{−1}S(µ, φ) maximizes the likelihood. Thus, the maximum likelihood estimate of σ_w² is

σ̂_w² = n^{−1} S(µ̂, φ̂),    (3.97)

where µ̂ and φ̂ are the MLEs of µ and φ, respectively. If we replace n in (3.97) by n − 2, we would obtain the unconditional least squares estimate of σ_w². If, in (3.95), we take logs, replace σ_w² by σ̂_w², and ignore constants, µ̂ and φ̂ are the values that minimize the criterion function

l(µ, φ) = ln[n^{−1}S(µ, φ)] − n^{−1} ln(1 − φ²).    (3.98)


That is, l(µ, φ) ∝ −2 ln L(µ, φ, σ̂_w²).⁶ Because (3.96) and (3.98) are complicated functions of the parameters, the minimization of l(µ, φ) or S(µ, φ) is accomplished numerically. In the case of AR models, we have the advantage that, conditional on initial values, they are linear models. That is, we can drop the term in the likelihood that causes the nonlinearity. Conditioning on x1, the conditional likelihood becomes

L(µ, φ, σ_w² | x1) = ∏_{t=2}^{n} fw[(xt − µ) − φ(xt−1 − µ)]
                  = (2πσ_w²)^{−(n−1)/2} exp[−Sc(µ, φ)/(2σ_w²)],    (3.99)

where the conditional sum of squares is

Sc(µ, φ) = ∑_{t=2}^{n} [(xt − µ) − φ(xt−1 − µ)]².    (3.100)

The conditional MLE of σ_w² is

σ̂_w² = Sc(µ̂, φ̂)/(n − 1),    (3.101)

and µ̂ and φ̂ are the values that minimize the conditional sum of squares, Sc(µ, φ). Letting α = µ(1 − φ), the conditional sum of squares can be written as

Sc(µ, φ) = ∑_{t=2}^{n} [xt − (α + φxt−1)]².    (3.102)

The problem is now the linear regression problem stated in §2.2. Following the results from least squares estimation, we have α̂ = x̄(2) − φ̂x̄(1), where x̄(1) = (n − 1)^{−1} ∑_{t=1}^{n−1} xt and x̄(2) = (n − 1)^{−1} ∑_{t=2}^{n} xt, and the conditional estimates are then

µ̂ = [x̄(2) − φ̂x̄(1)] / (1 − φ̂),    (3.103)

φ̂ = ∑_{t=2}^{n} (xt − x̄(2))(xt−1 − x̄(1)) / ∑_{t=2}^{n} (xt−1 − x̄(1))².    (3.104)

From (3.103) and (3.104), we see that µ̂ ≈ x̄ and φ̂ ≈ ρ̂(1). That is, the Yule–Walker estimators and the conditional least squares estimators are approximately the same. The only difference is the inclusion or exclusion of terms involving the end points, x1 and xn. We can also adjust the estimate of σ_w² in (3.101) to be equivalent to the least squares estimator, that is, divide Sc(µ̂, φ̂) by (n − 3) instead of (n − 1) in (3.101).
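For instance (a minimal sketch, not from the text), the conditional least squares fit (3.102)-(3.104) for an AR(1) is just a regression of xt on xt−1:

> set.seed(6)
> x = arima.sim(list(ar=.8), n=200) + 10       # AR(1) with mean mu = 10, say
> fit = lm(x[2:200] ~ x[1:199])                # regression form (3.102)
> alpha = unname(coef(fit)[1]); phi = unname(coef(fit)[2])
> c(phi = phi, mu = alpha/(1 - phi))           # equation (3.103)
> sum(resid(fit)^2)/(200 - 3)                  # adjusted estimate of sigma_w^2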

For general AR(p) models, maximum likelihood estimation, unconditional least squares, and conditional least squares follow analogously to the AR(1)

⁶The criterion function is sometimes called the profile likelihood.


example. For general ARMA models, it is difficult to write the likelihood as an explicit function of the parameters. Instead, it is advantageous to write the likelihood in terms of the innovations, or one-step-ahead prediction errors, xt − x_t^{t−1}. This will also be useful in Chapter 6 when we study state-space models.

Suppose xt is a causal ARMA(p, q) process with wt ∼ iid N(0, σ_w²). Let β = (µ, φ1, . . . , φp, θ1, . . . , θq)′ be the (p + q + 1) × 1 vector of the model parameters. The likelihood can be written as

L(β, σ_w²) = ∏_{t=1}^{n} f(xt | xt−1, . . . , x1).

The conditional distribution of xt given xt−1, . . . , x1 is Gaussian with mean x_t^{t−1} and variance P_t^{t−1}. In addition, for ARMA models, we may write P_t^{t−1} = σ_w² r_t^{t−1}, where r_t^{t−1} does not depend on σ_w² (this can readily be seen from Property P3.4 by noting P_1^0 = γ(0) = σ_w² ∑_{j=0}^{∞} ψj²).

The likelihood of the data can now be written as

L(β, σ_w²) = (2πσ_w²)^{−n/2} [r_1^0(β) r_2^1(β) · · · r_n^{n−1}(β)]^{−1/2} exp[−S(β)/(2σ_w²)],    (3.105)

where

S(β) = ∑_{t=1}^{n} [(xt − x_t^{t−1}(β))² / r_t^{t−1}(β)].    (3.106)

Both x_t^{t−1} and r_t^{t−1} are functions of β, and we make that fact explicit in (3.105)-(3.106). Given values for β and σ_w², the likelihood may be evaluated using the techniques of §3.5. Maximum likelihood estimation would now proceed by maximizing (3.105) with respect to β and σ_w². As in the AR(1) example, we have

σ̂_w² = n^{−1} S(β̂),    (3.107)

where β̂ is the value of β that minimizes the criterion function

l(β) = ln[n^{−1}S(β)] + n^{−1} ∑_{t=1}^{n} ln r_t^{t−1}(β).    (3.108)

For example, for the AR(1) model previously discussed, the generic l(β) in (3.108) is l(µ, φ) in (3.98), and the generic S(β) in (3.106) is S(µ, φ) given in (3.96). From (3.96) and (3.98) we see x_1^0 = µ, and x_t^{t−1} = µ + φ(xt−1 − µ) for t = 2, . . . , n. Also, r_1^0 = (1 − φ²)^{−1} and r_t^{t−1} = 1 for t = 2, . . . , n.

Unconditional least squares would be performed by minimizing (3.106) with respect to β. Conditional least squares estimation would involve minimizing (3.106) with respect to β but where, to ease the computational burden, the predictions and their errors are obtained by conditioning on initial values of the data. In general, numerical optimization routines are used to obtain the actual estimates and their standard errors.


Example 3.28 The Newton–Raphson and Scoring Algorithms

Two common numerical optimization routines for accomplishing maximum likelihood estimation are Newton–Raphson and scoring. We will give a brief account of the mathematical ideas here. The actual implementation of these algorithms is much more complicated than our discussion might imply. For details, the reader is referred to any of the Numerical Recipes books, for example, Press et al. (1993).

Let l(β) be a criterion function of k parameters β = (β1, . . . , βk) that we wish to minimize with respect to β. For example, consider the likelihood function given by (3.98) or by (3.108). Suppose l(β̂) is the extremum that we are interested in finding, and β̂ is found by solving ∂l(β)/∂βj = 0, for j = 1, . . . , k. Let l^{(1)}(β) denote the k × 1 vector of partials

l^{(1)}(β) = (∂l(β)/∂β1, . . . , ∂l(β)/∂βk)′.

Note, l^{(1)}(β̂) = 0, the k × 1 zero vector. Let l^{(2)}(β) denote the k × k matrix of second-order partials

l^{(2)}(β) = {−∂²l(β)/∂βi ∂βj}_{i,j=1}^{k},

and assume l^{(2)}(β) is nonsingular. Let β(0) be an initial estimator of β. Then, using a Taylor expansion, we have the following approximation:

0 = l^{(1)}(β̂) ≈ l^{(1)}(β(0)) − l^{(2)}(β(0))[β̂ − β(0)].

Setting the right-hand side equal to zero and solving for β̂ (call the solution β(1)), we get

β(1) = β(0) + [l^{(2)}(β(0))]^{−1} l^{(1)}(β(0)).

The Newton–Raphson algorithm proceeds by iterating this result, replacing β(0) by β(1) to get β(2), and so on, until convergence. Under a set of appropriate conditions, the sequence of estimators, β(1), β(2), . . ., will converge to β̂, the MLE of β.

For maximum likelihood estimation, the criterion function used is l(β) given by (3.108); l^{(1)}(β) is called the score vector, and l^{(2)}(β) is called the Hessian. In the method of scoring, we replace l^{(2)}(β) by E[l^{(2)}(β)], the information matrix. Under appropriate conditions, the inverse of the information matrix is the asymptotic variance–covariance matrix of the estimator β̂. This is sometimes approximated by the inverse of the Hessian at β̂. If the derivatives are difficult to obtain, it is possible to use quasi-maximum likelihood estimation where numerical techniques are used to approximate the derivatives.


Example 3.29 MLE for the Recruitment Series

So far, we have fit an AR(2) model to the recruitment series using ordinary least squares (Example 3.16) and using Yule–Walker (Example 3.26). The following is an R session used to fit an AR(2) model via maximum likelihood estimation to the recruitment series; these results can be compared to the results in Examples 3.16 and 3.26. As before, we assume the data have been read into R as rec.

> rec.mle = ar.mle(rec, order=2)
> rec.mle$x.mean
[1] 62.26153
> rec.mle$ar
[1] 1.3512809 -.4612736
> sqrt(diag(rec.mle$asy.var.coef))
[1] .04099159 .04099159
> rec.mle$var.pred
[1] 89.33597

We now discuss least squares for ARMA(p, q) models via Gauss–Newton. For general and complete details of the Gauss–Newton procedure, the reader is referred to Fuller (1995). Let xt be a causal and invertible Gaussian ARMA(p, q) process. Write β = (φ1, . . . , φp, θ1, . . . , θq)′, and for ease of discussion, we will put µ = 0. We write the model in terms of the errors

wt(β) = xt − ∑_{j=1}^{p} φj xt−j − ∑_{k=1}^{q} θk wt−k(β),    (3.109)

emphasizing the dependence of the errors on the parameters.

For conditional least squares, we approximate the residual sum of squares by conditioning on x1, . . . , xp (p > 0) and wp = wp−1 = wp−2 = · · · = w1−q = 0 (q > 0), in which case we may evaluate (3.109) for t = p + 1, p + 2, . . . , n. Using this conditioning argument, the conditional error sum of squares is

Sc(β) = ∑_{t=p+1}^{n} wt²(β).

Minimizing Sc(β) with respect to β yields the conditional least squares estimates. If q = 0, the problem is linear regression, and no iterative technique is needed to minimize Sc(φ1, . . . , φp). If q > 0, the problem becomes nonlinear regression, and we will have to rely on numerical optimization.

When n is large, conditioning on a few initial values will have little influence on the final parameter estimates. In the case of small to moderate sample sizes, one may wish to rely on unconditional least squares. The unconditional least squares problem is to choose β to minimize the unconditional sum of squares, which we have generically denoted by S(β) in this section. The unconditional


sum of squares can be written in various ways, and one useful form in the case of ARMA(p, q) models is derived in Box et al. (1994, Appendix A7.3). They showed (see Problem 3.18) the unconditional sum of squares can be written as

S(β) = ∑_{t=−∞}^{n} ŵt²(β),

where ŵt(β) = E(wt | x1, . . . , xn). When t ≤ 0, the ŵt(β) are obtained by backcasting. As a practical matter, we approximate S(β) by starting the sum at t = −M + 1, where M is chosen large enough to guarantee ∑_{t=−∞}^{−M} ŵt²(β) ≈ 0. In the case of unconditional least squares estimation, a numerical optimization technique is needed even when q = 0.

To employ Gauss–Newton, let β(0) = (φ1^{(0)}, . . . , φp^{(0)}, θ1^{(0)}, . . . , θq^{(0)})′ be an initial estimate of β. For example, we could obtain β(0) by method of moments. The first-order Taylor expansion of wt(β) is

wt(β) ≈ wt(β(0)) − (β − β(0))′ zt(β(0)),    (3.110)

where

zt(β(0)) = (−∂wt(β(0))/∂β1, . . . , −∂wt(β(0))/∂βp+q)′,   t = 1, . . . , n.

The linear approximation of Sc(β) is

Q(β) = ∑_{t=p+1}^{n} [wt(β(0)) − (β − β(0))′ zt(β(0))]²    (3.111)

and this is the quantity that we will minimize. For approximate unconditional least squares, we would start the sum in (3.111) at t = −M + 1, for a large value of M, and work with the backcasted values.

Using the results of ordinary least squares (§2.2), we know

(β̂ − β(0)) = (n^{−1} ∑_{t=p+1}^{n} zt(β(0)) zt′(β(0)))^{−1} (n^{−1} ∑_{t=p+1}^{n} zt(β(0)) wt(β(0)))    (3.112)

minimizes Q(β). From (3.112), we write the one-step Gauss–Newton estimate as

β(1) = β(0) + ∆(β(0)),    (3.113)

where ∆(β(0)) denotes the right-hand side of (3.112). Gauss–Newton estimation is accomplished by replacing β(0) by β(1) in (3.113). This process is repeated by calculating, at iteration j = 2, 3, . . .,

β(j) = β(j−1) + ∆(β(j−1))

until convergence.


Example 3.30 Gauss–Newton for an MA(1)

Consider an invertible MA(1) process, xt = wt + θwt−1. Write the truncated errors as

wt(θ) = xt − θwt−1(θ),   t = 1, . . . , n,    (3.114)

where we condition on w0(θ) = 0. Taking derivatives,

−∂wt(θ)/∂θ = wt−1(θ) + θ ∂wt−1(θ)/∂θ,   t = 1, . . . , n,    (3.115)

where ∂w0(θ)/∂θ = 0. Using the notation of (3.110), we can also write (3.115) as

zt(θ) = wt−1(θ) − θzt−1(θ),   t = 1, . . . , n,    (3.116)

where z0(θ) = 0.

Let θ(0) be an initial estimate of θ, for example, the estimate given in Example 3.27. Then, the Gauss–Newton procedure for conditional least squares is given by

θ(j+1) = θ(j) + [∑_{t=1}^{n} zt(θ(j)) wt(θ(j))] / [∑_{t=1}^{n} zt²(θ(j))],   j = 0, 1, 2, . . . ,    (3.117)

where the values in (3.117) are calculated recursively using (3.114) and (3.116). The calculations are stopped when |θ(j+1) − θ(j)|, or |Q(θ(j+1)) − Q(θ(j))|, are smaller than some preset amount.
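A direct transcription of (3.114)-(3.117) (a minimal sketch, not from the text), run on simulated MA(1) data:

> set.seed(7)
> x = arima.sim(list(ma=-.7), n=200)
> theta = -.1                                 # initial estimate theta_(0)
> for (j in 1:10) {                           # a few Gauss-Newton iterations
+   w = numeric(200); z = numeric(200)
+   w[1] = x[1]; z[1] = 0                     # from w_0(theta) = 0 and z_0(theta) = 0
+   for (t in 2:200) {
+     w[t] = x[t] - theta*w[t-1]              # equation (3.114)
+     z[t] = w[t-1] - theta*z[t-1]            # equation (3.116)
+   }
+   theta = theta + sum(z*w)/sum(z^2)         # update (3.117)
+ }
> theta                                       # conditional least squares estimate, near -.7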

Example 3.31 Fitting the Glacial Varve Series

Consider the series of glacial varve thicknesses from Massachusetts for n = 634 years, as analyzed in Example 2.5 and in Problem 1.8, where it was argued that a first-order moving average model might fit the logarithmically transformed and differenced varve series, say,

∇[ln(xt)] = ln(xt) − ln(xt−1) = ln(xt / xt−1),

which can be interpreted as being proportional to the percentage change in the thickness.

The sample ACF and PACF, shown in Figure 3.7, confirm the tendency of ∇[ln(xt)] to behave as a first-order moving average process as the ACF has only a significant peak at lag one and the PACF decreases exponentially. Using Table 3.1, this sample behavior fits that of the MA(1) very well.


Figure 3.7 ACF and PACF of transformed glacial varves.

Nine iterations of the Gauss–Newton procedure, (3.117), starting withθ0 = −.1 yielded the values

−.442,−.624,−.717,−.750,−.763,−.768,−.771,−.772,−.772

for θ(1), . . . , θ(9), and a final estimated error variance σ2w = .236. Using

the final value of θ = θ(9) = −.772 and the vectors zt of partial derivativesin (3.116) leads to a standard error of .025 and a t-value of −.772/.025 =−30.88 with 632 degrees of freedom (one is lost in differencing).

In the general case of causal and invertible ARMA(p, q) models, maximum likelihood estimation and conditional and unconditional least squares estimation (and Yule–Walker estimation in the case of AR models) all lead to optimal estimators. The proof of this general result can be found in a number of texts on theoretical time series analysis (for example, Brockwell and Davis, 1991, or Hannan, 1970, to mention a few). We will denote the ARMA coefficient parameters by β = (φ1, . . . , φp, θ1, . . . , θq)′.

Property P3.9: Large Sample Distribution of the Estimators
Under appropriate conditions, for causal and invertible ARMA processes, the maximum likelihood, the unconditional least squares, and the conditional least squares estimators, each initialized by the method of moments estimator, all provide optimal estimators of σ²w and β, in the sense that σ̂²w is consistent, and the asymptotic distribution of β̂ is the best asymptotic normal distribution. In particular, as n → ∞,

√n (β̂ − β) →d N( 0, σ²w Γ⁻¹p,q ).                                          (3.118)


In (3.118), the variance–covariance matrix of the estimator β̂ is the inverse of the information matrix. In this case, the (p + q) × (p + q) matrix Γp,q has the form

Γp,q = ( Γφφ  Γφθ
         Γθφ  Γθθ ).                                                       (3.119)

The p × p matrix Γφφ is given by (3.90), that is, the ij-th element of Γφφ, for i, j = 1, . . . , p, is γx(i − j) from an AR(p) process, φ(B)xt = wt. Similarly, Γθθ is a q × q matrix with the ij-th element, for i, j = 1, . . . , q, equal to γy(i − j) from an AR(q) process, θ(B)yt = wt. The p × q matrix Γφθ = {γxy(i − j)}, for i = 1, . . . , p; j = 1, . . . , q; that is, the ij-th element is the cross-covariance between the two AR processes given by φ(B)xt = wt and θ(B)yt = wt. Finally, Γθφ = Γ′φθ is q × p. Further discussion of Property P3.9, including a proof for the case of least squares estimators for AR(p) processes, can be found in Appendix B, §B.3.

Example 3.32 Some Specific Asymptotic Distributions

The following are some specific cases of Property P3.9.

AR(1): γx(0) = σ²w/(1 − φ²), so σ²w Γ⁻¹1,0 = (1 − φ²). Thus,

φ̂ ∼ AN[ φ, n⁻¹(1 − φ²) ].                                                  (3.120)

AR(2): The reader can verify that

γx(0) = ( (1 − φ2)/(1 + φ2) ) · σ²w / ( (1 − φ2)² − φ1² )

and γx(1) = φ1 γx(0) + φ2 γx(1). From these facts, we can compute Γ⁻¹2,0. In particular, we have

(φ̂1, φ̂2)′ ∼ AN[ (φ1, φ2)′, n⁻¹ ( 1 − φ2²   −φ1(1 + φ2) ; sym   1 − φ2² ) ].      (3.121)

MA(1): In this case, write θ(B)yt = wt, or yt + θyt−1 = wt. Then, analogous to the AR(1) case, γy(0) = σ²w/(1 − θ²), so σ²w Γ⁻¹0,1 = (1 − θ²). Thus,

θ̂ ∼ AN[ θ, n⁻¹(1 − θ²) ].                                                  (3.122)

MA(2): Write yt + θ1yt−1 + θ2yt−2 = wt, so, analogous to the AR(2) case, we have

(θ̂1, θ̂2)′ ∼ AN[ (θ1, θ2)′, n⁻¹ ( 1 − θ2²   θ1(1 + θ2) ; sym   1 − θ2² ) ].       (3.123)


ARMA(1, 1): To calculate Γφθ, we must find γxy(0), where xt − φxt−1 = wt and yt + θyt−1 = wt. We have

γxy(0) = cov(xt, yt) = cov(φxt−1 + wt, −θyt−1 + wt) = −φθ γxy(0) + σ²w.

Solving, we find γxy(0) = σ²w/(1 + φθ). Thus,

(φ̂, θ̂)′ ∼ AN[ (φ, θ)′, n⁻¹ ( (1 − φ²)⁻¹   (1 + φθ)⁻¹ ; sym   (1 − θ²)⁻¹ )⁻¹ ].    (3.124)
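As a quick numerical aside (ours, not the authors'), the asymptotic standard errors implied by these formulas are simple to evaluate in R. For instance, under (3.121) with hypothetical values φ2 = −.75 and n = 200, the approximate standard error of φ̂1 is

> n = 200; phi2 = -.75
> sqrt((1 - phi2^2)/n)     # asymptotic s.e. of phi1-hat, from the (1,1) element in (3.121)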

The reader might wonder, for example, why the asymptotic distributions of φ̂ from an AR(1) [equation (3.120)] and θ̂ from an MA(1) [equation (3.122)] are of the same form. It is possible to explain this unexpected result heuristically using the intuition of linear regression. That is, for the normal regression model presented in §2.2 with no intercept term, xt = βzt + wt, we know β̂ is normally distributed with mean β, and from (2.8),

var{ √n (β̂ − β) } = nσ²w ( ∑_{t=1}^{n} zt² )⁻¹ = σ²w ( n⁻¹ ∑_{t=1}^{n} zt² )⁻¹.

For the causal AR(1) model given by xt = φxt−1 + wt, the intuition of regression tells us to expect that, for n large,

√n (φ̂ − φ)

is approximately normal with mean zero and with variance given by

σ²w ( n⁻¹ ∑_{t=2}^{n} x²t−1 )⁻¹.

Now, n⁻¹ ∑_{t=2}^{n} x²t−1 is the sample variance (recall that the mean of xt is zero) of the xt, so as n becomes large we would expect it to approach var(xt) = γ(0) = σ²w/(1 − φ²). Thus, the large sample variance of √n (φ̂ − φ) is

σ²w γx(0)⁻¹ = σ²w ( σ²w / (1 − φ²) )⁻¹ = (1 − φ²);

that is, (3.120) holds.

In the case of an MA(1), we may use the discussion of Example 3.30 to write an approximate regression model for the MA(1). That is, consider the approximation (3.116) as the regression model

zt(θ) = −θzt−1(θ) + wt−1,


where now zt−1(θ), as defined in Example 3.30, plays the role of the regressor. Continuing with the analogy, we would expect the asymptotic distribution of √n (θ̂ − θ) to be normal, with mean zero, and approximate variance

σ²w ( n⁻¹ ∑_{t=2}^{n} z²t−1(θ) )⁻¹.

As in the AR(1) case, n⁻¹ ∑_{t=2}^{n} z²t−1(θ) is the sample variance of the zt(θ) so, for large n, this should be var{zt(θ)} = γz(0), say. But note, as seen from (3.116), zt(θ) is approximately an AR(1) process with parameter −θ. Thus,

σ²w γz(0)⁻¹ = σ²w ( σ²w / (1 − (−θ)²) )⁻¹ = (1 − θ²),

which agrees with (3.122). Finally, the asymptotic distributions of the AR parameter estimates and the MA parameter estimates are of the same form because in the MA case, the “regressors” are the differential processes zt(θ) that have AR structure, and it is this structure that determines the asymptotic variance of the estimators. For a rigorous account of this approach for the general case, see Fuller (1995, Theorem 5.5.4).

In Example 3.31, the estimated standard error of θ̂ was .025. In the example, this value was calculated as the square root of

s²w ( n⁻¹ ∑_{t=2}^{n} z²t−1(θ̂) )⁻¹,

where n = 633, s²w = .236, and θ̂ = −.772. Using (3.122), we could have also calculated this value using the asymptotic approximation, the square root of (1 − .772²)/633, which is also .025.
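For the record (our check, not the authors'), the asymptotic approximation is a one-liner in R:

> sqrt((1 - .772^2)/633)    # approximately .025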

The asymptotic behavior of the parameter estimators gives us an additional insight into the problem of fitting ARMA models to data. For example, suppose a time series follows an AR(1) process and we decide to fit an AR(2) to the data. Does any problem occur in doing this? More generally, why not simply fit large-order AR models to make sure that we capture the dynamics of the process? After all, if the process is truly an AR(1), the other autoregressive parameters will not be significant. The answer is that if we overfit, we will lose efficiency. For example, if we fit an AR(1) to an AR(1) process, for large n, var(φ̂1) ≈ n⁻¹(1 − φ1²). But if we fit an AR(2) to the AR(1) process, for large n, var(φ̂1) ≈ n⁻¹(1 − φ2²) = n⁻¹ because φ2 = 0. Thus, the variance of φ̂1 has been inflated, making the estimator less precise. We do want to mention that overfitting can be used as a diagnostic tool. For example, if we fit an AR(2) model to the data and are satisfied with that model, then adding one more parameter and fitting an AR(3) should lead to approximately the same model as in the AR(2) fit. We will discuss model diagnostics in more detail in §3.8.
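This loss of efficiency is easy to see in a small simulation (ours, not from the text); the AR coefficient, sample size, and number of replications below are arbitrary choices.

> set.seed(1)
> phi1.fit1 = numeric(500); phi1.fit2 = numeric(500)
> for (i in 1:500) {
+   x = arima.sim(n = 200, list(ar = .5))                        # data are truly AR(1)
+   phi1.fit1[i] = ar.yw(x, aic = FALSE, order.max = 1)$ar[1]    # fit an AR(1)
+   phi1.fit2[i] = ar.yw(x, aic = FALSE, order.max = 2)$ar[1]    # overfit with an AR(2)
+ }
> var(phi1.fit1)   # should be near (1 - .5^2)/200
> var(phi1.fit2)   # should be inflated, near 1/200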


Figure 3.8 One hundred observations generated from the AR(1) model in Example 3.33.

If n is small, or if the parameters are close to the boundaries, the asymptotic approximations can be quite poor. The bootstrap can be helpful in this case; for a broad treatment of the bootstrap, see Efron and Tibshirani (1994). We discuss the case of an AR(1) here and leave the general discussion for Chapter 6. For now, we give a simple example of the bootstrap for an AR(1) process.

Example 3.33 Bootstrapping an AR(1)

We consider an AR(1) model with a regression coefficient near the boundary of causality and an error process that is symmetric but not normal. Specifically, consider the stationary and causal model

xt = µ + φ(xt−1 − µ) + wt,                                                 (3.125)

where µ = 50, φ = .95, and wt are iid double exponential with location zero, and scale parameter β = 2. The density of wt is given by

fwt(w) = (1/2β) exp{ −|w|/β },   −∞ < w < ∞.

In this example, E(wt) = 0 and var(wt) = 2β² = 8. Figure 3.8 shows n = 100 simulated observations from this process. This particular realization is interesting; the data look like they were generated from a nonstationary process with three different mean levels. In fact, the data were generated from a well-behaved, albeit non-normal, stationary and causal model. To show the advantages of the bootstrap, we will act as if we do not know the actual error distribution and we will proceed as if it were normal; of course, this means, for example, that the normal based MLE of φ will not be the actual MLE because the data are not normal.


Figure 3.9 Finite sample density of the Yule–Walker estimate of φ in Example 3.33.

Using the data shown in Figure 3.8, we obtained the Yule–Walker estimates µ̂ = 40.048, φ̂ = .957, and s²w = 15.302, where s²w is the estimate of var(wt). Based on Property P3.9, we would say that φ̂ is approximately normal with mean φ (which we supposedly do not know) and variance (1 − φ²)/100, which we would approximate by (1 − .957²)/100 = .029².

To assess the finite sample distribution of φ̂ when n = 100, we simulated 1000 realizations of this AR(1) process and estimated the parameters via Yule–Walker. The finite sampling density of the Yule–Walker estimate of φ, based on the 1000 repeated simulations, is shown in Figure 3.9. Clearly the sampling distribution is not close to normality for this sample size. The mean of the distribution shown in Figure 3.9 is .907, and the variance of the distribution is .052²; these values are considerably different than the asymptotic values. Some of the quantiles of the finite sample distribution are .81 (5%), .84 (10%), .88 (25%), .92 (50%), .95 (75%), .96 (90%), and .97 (95%).

Before discussing the bootstrap, we first investigate the sample innovation process, xt − x_t^{t−1}, with corresponding variances P_t^{t−1}. For the AR(1) model in this example,

x_t^{t−1} = µ + φ(xt−1 − µ),   t = 2, . . . , 100.


From this, it follows that

P_t^{t−1} = E(xt − x_t^{t−1})² = σ²w,   t = 2, . . . , 100.

When t = 1, we have

x_1^0 = µ   and   P_1^0 = σ²w/(1 − φ²).

Thus, the innovations have zero mean but different variances; in order that all of the innovations have the same variance, σ²w, we will write them as

ε1 = (x1 − µ) √(1 − φ²)
εt = (xt − µ) − φ(xt−1 − µ),   for t = 2, . . . , 100.                       (3.126)

From these equations, we can write the model in terms of the innovations εt as

x1 = µ + ε1/√(1 − φ²)
xt = µ + φ(xt−1 − µ) + εt,   for t = 2, . . . , 100.                         (3.127)

Next, replace the parameters with their estimates in (3.126), that is, n = 100, µ̂ = 40.048, and φ̂ = .957, and denote the resulting sample innovations as {ε̂1, . . . , ε̂100}. To obtain one bootstrap sample, first randomly sample, with replacement, n = 100 values from the set of sample innovations; call the sampled values {ε*1, . . . , ε*100}. Now, generate a bootstrapped data set sequentially by setting

x*1 = 40.048 + ε*1/√(1 − .957²)
x*t = 40.048 + .957(x*t−1 − 40.048) + ε*t,   t = 2, . . . , n.               (3.128)

Next, estimate the parameters as if the data were x*t. Call these estimates µ̂(1), φ̂(1), and s²w(1). Repeat this process a large number, B, of times, generating a collection of bootstrapped parameter estimates, {µ̂(b), φ̂(b), s²w(b), b = 1, . . . , B}. We can then approximate the finite sample distribution of an estimator from the bootstrapped parameter values. For example, we can approximate the distribution of φ̂ − φ by the empirical distribution of φ̂(b) − φ̂, for b = 1, . . . , B.

Figure 3.10 shows the bootstrap histogram of 200 bootstrapped estimates of φ using the data shown in Figure 3.8. In particular, the mean of the distribution of φ̂(b) is .918 with a variance of .046². Some quantiles of this distribution are .83 (5%), .85 (10%), .90 (25%), .93 (50%), .95 (75%), .97 (90%), and .98 (95%). Clearly, the bootstrap distribution of φ̂ is closer to the distribution of φ̂ shown in Figure 3.9 than to the asymptotic (normal) approximation.


To perform a similar bootstrap exercise in R, use the following commands. We note that the R estimation procedure is conditional on the first observation, so the first residual is not returned. To get around this problem, we simply fix the first observation and bootstrap the remaining data. The simulated data are available in the file ar1boot.dat.⁷

> x = scan("/mydata/ar1boot.dat")
> m = mean(x)                     # estimate of mu
> fit = ar.yw(x, order=1)
> phi = fit$ar                    # estimate of phi
> nboot = 200                     # number of bootstrap replicates
> resids = fit$resid
> resids = resids[2:100]          # the first resid is NA
> x.star = x                      # initialize x.star
> phi.star = matrix(0, nboot, 1)
> for (i in 1:nboot) {
+   resid.star = sample(resids, replace=TRUE)   # resample the innovations with replacement
+   for (t in 1:99){
+     x.star[t+1] = m + phi*(x.star[t]-m) + resid.star[t]
+   }
+   phi.star[i] = ar.yw(x.star, order=1)$ar
+ }

Now, 200 bootstrapped estimates are available in phi.star, and various methods can be used to evaluate the estimates. For example, to obtain a histogram of the estimates, hist(phi.star) can be used. Also consider the statistics mean(phi.star) and sd(phi.star), for the mean and standard deviation, and quantile(phi.star, probs = seq(0, 1, .25)) for some quantiles. Other interesting graphics are boxplot(phi.star) for a boxplot and stem(phi.star) for a stem-and-leaf diagram.

3.7 Integrated Models for Nonstationary Data

In Chapters 1 and 2, we saw that if xt is a random walk, xt = xt−1 + wt, then by differencing xt, we find that ∇xt = wt is stationary. In many situations, time series can be thought of as being composed of two components, a nonstationary trend component and a zero-mean stationary component. For example, in §2.2 we considered the model

xt = µt + yt,                                                              (3.129)

⁷ If you want to simulate your own data, use the following commands:
> e = rexp(150, rate = .5); u = runif(150,-1,1); de = e*sign(u)
> x = 50 + arima.sim(n = 100, list(ar = .95), innov = de, n.start = 50)


Figure 3.10 Bootstrap histogram of φ based on 200 bootstraps.

where µt = β0 + β1t and yt is stationary. Differencing such a process will lead to a stationary process:

∇xt = xt − xt−1 = β1 + yt − yt−1 = β1 + ∇yt.

Another model that leads to first differencing is the case in which µt in (3.129) is stochastic and slowly varying according to a random walk. That is, in (3.129)

µt = µt−1 + vt

where vt is stationary. In this case,

∇xt = vt + ∇yt,

is stationary. If µt in (3.129) is a k-th order polynomial, µt = ∑_{j=0}^{k} βj t^j, then (Problem 3.26) the differenced series ∇^k xt is stationary. Stochastic trend models can also lead to higher order differencing. For example, suppose in (3.129)

µt = µt−1 + vt   and   vt = vt−1 + et,

where et is stationary. Then, ∇xt = vt + ∇yt is not stationary, but

∇²xt = et + ∇²yt

is stationary.

The integrated ARMA, or ARIMA model, is a broadening of the class of ARMA models to include differencing.


Definition 3.11 A process xt is said to be ARIMA(p, d, q) if

∇dxt = (1 − B)dxt

is ARMA(p, q). In general, we will write the model as

φ(B)(1 − B)dxt = θ(B)wt. (3.130)

If E(∇dxt) = µ, we write the model as

φ(B)(1 − B)dxt = α + θ(B)wt,

where α = µ(1 − φ1 − · · · − φp).

Example 3.34 IMA(1, 1) and EWMA

The ARIMA(0, 1, 1), or IMA(1, 1) model is of interest because many economic time series can be successfully modeled this way. In addition, the model leads to a frequently used, and abused, forecasting method called exponentially weighted moving averages (EWMA). We will write the model as

xt = xt−1 + wt − λwt−1                                                     (3.131)

because this model formulation is easier to work with here, and it leads to the standard representation for EWMA. When |λ| < 1, the model has an invertible representation,

xt = ∑_{j=1}^{∞} (1 − λ)λ^{j−1} xt−j + wt.                                  (3.132)

Verification of (3.132) is left to the reader (Problem 3.27). From (3.132), we have that the one-step-ahead prediction, using the notation of §3.5, is

x̃n+1 = ∑_{j=1}^{∞} (1 − λ)λ^{j−1} xn+1−j
      = (1 − λ)xn + λ ∑_{j=1}^{∞} (1 − λ)λ^{j−1} xn−j
      = (1 − λ)xn + λx̃n.                                                   (3.133)

Based on (3.133), the truncated forecasts are obtained by setting x̃_1^0 = 0, and then updating as follows:

x̃_{n+1}^n = (1 − λ)xn + λx̃_n^{n−1},   n ≥ 1.                               (3.134)


From (3.134), we see that the new forecast is a linear combination of the old forecast and the new observation. In EWMA, the parameter λ is called the smoothing constant and is restricted to be between zero and one. Larger values of λ lead to smoother forecasts. This method of forecasting is popular because it is easy to use; we need only retain the previous forecast value and the current observation to forecast the next time period. Unfortunately, as previously suggested, the method is often abused because some forecasters do not verify that the observations follow an IMA(1, 1) process, and often arbitrarily pick values of λ.
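As a small illustration (ours, not the authors'), the updating scheme in (3.134) can be coded directly; the function name ewma and its arguments are our own conventions. Roughly the same computation is available through HoltWinters(x, alpha = 1 - lambda, beta = FALSE, gamma = FALSE), up to how the recursion is initialized.

> ewma = function(x, lambda) {
+   n = length(x)
+   xhat = numeric(n + 1)      # xhat[t+1] is the truncated forecast of x[t+1] given x[1:t]
+   xhat[1] = 0                # the recursion starts at zero, as in (3.134)
+   for (t in 1:n) xhat[t + 1] = (1 - lambda) * x[t] + lambda * xhat[t]
+   xhat[n + 1]                # the forecast of the next observation
+ }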

Finally, the model for the glacial varve series in Example 3.31 is an IMA(1, 1) on the logarithms of the data. Recall that the fitted model there was ln xt = ln xt−1 + wt − .772wt−1 and var(wt) = .236.

3.8 Building ARIMA Models

There are a few basic steps to fitting ARIMA models to time series data. These steps involve plotting the data, possibly transforming the data, identifying the dependence orders of the model, parameter estimation, diagnostics, and model choice. First, as with any data analysis, we should construct a time plot of the data, and inspect the graph for any anomalies. If, for example, the variability in the data grows with time, it will be necessary to transform the data to stabilize the variance. In such cases, the Box–Cox class of power transformations, equation (2.34), could be employed. Also, the particular application might suggest an appropriate transformation. For example, suppose a process evolves as a fairly small and stable percent change, such as an investment. For example, we might have

xt = (1 + pt)xt−1,

where xt is the value of the investment at time t and pt is the percentage change from period t − 1 to t, which may be negative. Taking logs we have

ln(xt) = ln(1 + pt) + ln(xt−1),

or

∇[ln(xt)] = ln(1 + pt).

If the percent change pt stays relatively small in magnitude, then ln(1 + pt) ≈ pt and, thus,

∇[ln(xt)] ≈ pt,

will be a relatively stable process. Frequently, ∇[ln(xt)] is called the return or growth rate. This general idea was used in Example 3.31, and we will use it again in Example 3.35.

After suitably transforming the data, the next step is to identify preliminary values of the autoregressive order, p, the order of differencing, d, and the moving average order, q.


Figure 3.11 Quarterly U.S. GNP from 1947(1) to 2002(3).

We have already addressed, in part, the problem of selecting d. A time plot of the data will typically suggest whether any differencing is needed. If differencing is called for, then difference the data once, d = 1, and inspect the time plot of ∇xt. If additional differencing is necessary, then try differencing again and inspect a time plot of ∇²xt. Be careful not to overdifference because this may introduce dependence where none exists. For example, xt = wt is serially uncorrelated, but ∇xt = wt − wt−1 is MA(1). In addition to time plots, the sample ACF can help in indicating whether differencing is needed. Because the polynomial φ(z)(1 − z)^d has a unit root, the sample ACF, ρ̂(h), will not decay to zero fast as h increases. Thus, a slow decay in ρ̂(h) is an indication that differencing may be needed.

When preliminary values of d have been settled, the next step is to look at the sample ACF and PACF of ∇^d xt for whatever values of d have been chosen. Using Table 3.1 as a guide, preliminary values of p and q are chosen. Recall that, if p = 0 and q > 0, the ACF cuts off after lag q, and the PACF tails off. If q = 0 and p > 0, the PACF cuts off after lag p, and the ACF tails off. If p > 0 and q > 0, both the ACF and PACF will tail off. Because we are dealing with estimates, it will not always be clear whether the sample ACF or PACF is tailing off or cutting off. Also, two models that are seemingly different can actually be very similar. With this in mind, we should not worry about being so precise at this stage of the model fitting. At this stage, a few preliminary values of p, d, and q should be at hand, and we can start estimating the parameters.

Example 3.35 Analysis of GNP Data

In this example, we consider the analysis of quarterly U.S. GNP from 1947(1) to 2002(3), n = 223 observations. The data are Real U.S. Gross National Product in billions of chained 1996 dollars and they have been seasonally adjusted. The data were obtained from the Federal Reserve Bank of St. Louis (http://research.stlouisfed.org/).


Figure 3.12 Sample ACF of the GNP data.

Figure 3.13 First difference of the U.S. GNP data.

Figure 3.11 shows a plot of the data, say, yt. Because strong trend hides any other effect, it is not clear from Figure 3.11 that the variance is increasing with time. For the purpose of demonstration, the sample ACF of the data is displayed in Figure 3.12. Figure 3.13 shows the first difference of the data, ∇yt, and now that the trend has been removed we are able to notice that the variability in the second half of the data is larger than in the first half of the data. Also, it appears as though a trend is still present after differencing. The growth rate, say, xt = ∇ln(yt), is plotted in Figure 3.14, and appears to be a stable process. Moreover, we may interpret the values of xt as the percentage quarterly growth of U.S. GNP.

The sample ACF and PACF of the quarterly growth rate are plotted in Figure 3.15. Inspecting the sample ACF and PACF, we might feel that the ACF is cutting off at lag 2 and the PACF is tailing off. This would suggest the GNP growth rate follows an MA(2) process, or log GNP follows an ARIMA(0, 1, 2) model.


Figure 3.14 U.S. GNP quarterly growth rate.

Figure 3.15 Sample ACF and PACF of the GNP quarterly growth rate.

Rather than focus on one model, we will also suggest that it appears that the ACF is tailing off and the PACF is cutting off at lag 1. This suggests an AR(1) model for the growth rate, or ARIMA(1, 1, 0) for log GNP. As a preliminary analysis, we will fit both models.

Using MLE to fit the MA(2) model for the growth rate, xt, the estimated model is

xt = .008(.001) + .303(.065)wt−1 + .204(.064)wt−2 + wt, (3.135)

where σ̂w = .0094 is based on 219 degrees of freedom. The values in parentheses are the corresponding estimated standard errors. All of the regression coefficients are significant, including the constant. We make a special note of this because, as a default, some computer packages do not fit a constant in a differenced model. That is, these packages assume, by default, that there is no drift. In this example, not including a constant leads to the wrong conclusions about the nature of the U.S. economy. Not including a constant assumes the average quarterly growth rate is zero, whereas the U.S. GNP average quarterly growth rate is about 1% (which can be seen easily in Figure 3.14). We leave it to the reader to investigate what happens when the constant is not included.

The estimated AR(1) model is

xt = .005(.0006) + .347(.063)xt−1 + wt,                                     (3.136)

where σ̂w = .0095 on 220 degrees of freedom.

We will discuss diagnostics next, but assuming both of these models fit well, how are we to reconcile the apparent differences of the estimated models (3.135) and (3.136)? In fact, the fitted models are nearly the same. To show this, consider an AR(1) model of the form in (3.136) without a constant term; that is,

xt = .35xt−1 + wt,

and write it in its causal form, xt = ∑_{j=0}^{∞} ψj wt−j, where we recall ψj = .35^j. Thus, ψ0 = 1, ψ1 = .350, ψ2 = .123, ψ3 = .043, ψ4 = .015, ψ5 = .005, ψ6 = .002, ψ7 = .001, ψ8 = 0, ψ9 = 0, ψ10 = 0, and so forth. Thus,

xt ≈ .35wt−1 + .12wt−2 + wt,

which is similar to the fitted MA(2) model in (3.135).

The analyses and graphics of the example can be performed in R using the following commands. We note that we did not fit integrated models to log GNP, but rather we fit nonintegrated models to the growth rate, xt. We believe at the time of writing that there is a problem with fitting ARIMA models with a nonzero constant in R. The data are in a file called gnp96.dat; the file contains two columns, the first column is the quarter and the second column is the GNP.

> gnp96 = read.table("/mydata/gnp96.dat")
> gnp = ts(gnp96[,2], start=1947, frequency=4)
> plot(gnp)


> acf(gnp, 50)
> gnpgr = diff(log(gnp))        # growth rate
> plot.ts(gnpgr)
> par(mfrow=c(2,1))
> acf(gnpgr, 24)
> pacf(gnpgr, 24)
> # ARIMA fits:
> gnpgr.ar = arima(gnpgr, order = c(1, 0, 0))
> gnpgr.ma = arima(gnpgr, order = c(0, 0, 2))
> # to view the results:
> gnpgr.ar                      # potential problem here (see below *)
> gnpgr.ma
> ARMAtoMA(ar=.35, ma=0, 10)    # prints psi-weights

∗At this time, the R output for the AR fit lists the estimated mean and its standard error, but calls it the intercept. That is, the output says it is giving you α when in fact it's listing µ̂. In this case, α = µ(1 − φ).

The next step in model fitting is diagnostics. This investigation includes the analysis of the residuals as well as model comparisons. Again, the first step involves a time plot of the innovations (or residuals), xt − x_t^{t−1}, or of the standardized innovations

et = ( xt − x_t^{t−1} ) / √P_t^{t−1},                                       (3.137)

where x_t^{t−1} is the one-step-ahead prediction of xt based on the fitted model and P_t^{t−1} is the estimated one-step-ahead error variance. If the model fits well, the standardized residuals should behave as an iid sequence with mean zero and variance one. The time plot should be inspected for any obvious departures from this assumption. Unless the time series is Gaussian, it is not enough that the residuals are uncorrelated. For example, it is possible in the non-Gaussian case to have an uncorrelated process for which values contiguous in time are highly dependent. As an example, we mention the family of GARCH models that are discussed in Chapter 5.

Investigation of marginal normality can be accomplished visually by looking at a histogram of the residuals. In addition to this, a normal probability plot or a Q-Q plot can help in identifying departures from normality. See Johnson and Wichern (1992, Chapter 4) for details of this test as well as additional tests for multivariate normality.

There are several tests of randomness, for example the runs test, that could be applied to the residuals. We could also inspect the sample autocorrelations of the residuals, say, ρ̂e(h), for any patterns or large values. Recall that, for a white noise sequence, the sample autocorrelations are approximately independently and normally distributed with zero means and variances 1/n. Hence, a good check on the correlation structure of the residuals is to plot ρ̂e(h) versus h along with the error bounds of ±2/√n. The residuals from a model fit,


Figure 3.16 Diagnostics of the residuals from MA(2) fit on GNP growth rate.

however, will not quite have the properties of a white noise sequence and the variance of ρ̂e(h) can be much less than 1/n. Details can be found in Box and Pierce (1970) and McLeod (1978). This part of the diagnostics can be viewed as a visual inspection of ρ̂e(h) with the main concern being the detection of obvious departures from the independence assumption.

In addition to plotting ρ̂e(h), we can perform a general test that takes into consideration the magnitudes of ρ̂e(h) as a group. For example, it may be the case that, individually, each ρ̂e(h) is small in magnitude, say, each one is just slightly less than 2/√n in magnitude, but, collectively, the values are large.

The Ljung–Box–Pierce Q-statistic given by

Q = n(n + 2) ∑_{h=1}^{H} ρ̂²e(h)/(n − h)                                    (3.138)

can be used to perform such a test. The value H in (3.138) is chosen somewhat arbitrarily, typically, H = 20. Under the null hypothesis of model adequacy, asymptotically (n → ∞), Q ∼ χ²_{H−p−q}. Thus, we would reject the null hypothesis at level α if the value of Q exceeds the (1 − α)-quantile of the χ²_{H−p−q} distribution. Details can be found in Box and Pierce (1970), Ljung and Box (1978), and Davies et al. (1977).
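As an aside (ours, not the authors'), the statistic in (3.138) can be computed directly in R with Box.test; for the MA(2) fit of Example 3.35, for instance, with H = 20 and p + q = 2 estimated coefficients:

> Box.test(gnpgr.ma$resid, lag = 20, type = "Ljung-Box", fitdf = 2)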


Figure 3.17 Histogram of the residuals (top), and a normal Q-Q plot of the residuals (bottom).

Example 3.36 Diagnostics for GNP Growth Rate Example

We will focus on the MA(2) fit from Example 3.35; the analysis of the AR(1) residuals is similar. Figure 3.16 displays a plot of the standardized residuals, the ACF of the residuals (note that R includes the correlation at lag zero, which is always one), and the value of the Q-statistic, (3.138), at lags H = 1 through H = 20. These diagnostics are provided by issuing the command

> tsdiag(gnpgr.ma, gof.lag=20)

where gnpgr.ma was described in the previous example.

Inspection of the time plot of the standardized residuals in Figure 3.16 shows no obvious patterns. Notice that there are outliers, however, with a few values exceeding 3 standard deviations in magnitude. The ACF of the standardized residuals shows no apparent departure from the model assumptions, and the Q-statistic is never significant at the lags shown.

Finally, Figure 3.17 shows a histogram of the residuals (top), and a normal Q-Q plot of the residuals (bottom). Here we see the residuals are somewhat close to normality except for a few extreme values in the tails.


Figure 3.18 Diagnostics for the ARIMA(0, 1, 1) fit to the logged varve data.

Running a Shapiro–Wilk test (Royston, 1982) yields a p-value of .003, which indicates the residuals are not normal. Hence, the model appears to fit well except for the fact that a distribution with heavier tails than the normal distribution should be employed. We discuss some possibilities in Chapters 5 and 6. These diagnostics can be performed in R by issuing the commands:

> hist(gnpgr.ma$resid, br=12)
> qqnorm(gnpgr.ma$resid)
> shapiro.test(gnpgr.ma$resid)

Example 3.37 Diagnostics for the Glacial Varve Series

In Example 3.31, we fit an ARIMA(0, 1, 1) model to the logarithms of the glacial varve data. Figure 3.18 shows the diagnostics from that fit, and we notice a significant lag 1 correlation. In addition, the Q-statistic is significant for every value of H displayed. Because the ACF of the residuals appears to be tailing off, an AR term is suggested.


Figure 3.19 Diagnostics for the ARIMA(1, 1, 1) fit to the logged varve data.

Next, we fit an ARIMA(1, 1, 1) to the logged varve data and obtained the estimates φ̂ = .23(.05), θ̂ = −.89(.03), and σ̂²w = .23. Hence the AR term is significant. Diagnostics for this model are displayed in Figure 3.19, and it appears this model fits the data well.

To implement these analyses in R, use the following commands (we assume the data are in varve):

> varve.ma = arima(log(varve), order = c(0, 1, 1))
> varve.ma                      # to display results
> tsdiag(varve.ma)
> varve.arma = arima(log(varve), order = c(1, 1, 1))
> varve.arma                    # to display results
> tsdiag(varve.arma, gof.lag=20)

In Example 3.35, we have two competing models, an AR(1) and an MA(2) on the GNP growth rate, that each appear to fit the data well. In addition, we might also consider that an AR(2) or an MA(3) might do better for forecasting. Perhaps combining both models, that is, fitting an ARMA(1, 2) to the GNP growth rate, would be the best.


Figure 3.20 A perfect fit and a terrible forecast.

As previously mentioned, we have to be concerned with overfitting the model; it is not always the case that more is better. Overfitting leads to less-precise estimators, and adding more parameters may fit the data better but may also lead to bad forecasts. This result is illustrated in the following example.

Example 3.38 A Problem with Overfitting

Figure 3.20 shows the U.S. population by official census, every 10 years from 1910 to 1990, as points. If we use these nine observations to predict the future population of the U.S., we can use an eight-degree polynomial so the fit to the nine observations is perfect. The model in this case is

xt = β0 + β1t + β2t² + · · · + β8t⁸ + wt.

The fitted model, which is plotted through the year 2010 as a line, passes through the nine observations. The model predicts that the population of the U.S. will be close to zero in the year 2000, and will cross zero sometime in the year 2002!
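A rough sketch (ours) of this overfit in R is given below; pop is assumed to hold the nine census counts, which are not reproduced here.

> t = seq(1910, 1990, by = 10)                 # the nine census years
> fit = lm(pop ~ poly(t, 8))                   # eighth-degree polynomial: a perfect fit
> new = data.frame(t = seq(1910, 2010, by = 2))
> plot(new$t, predict(fit, new), type = "l")   # the extrapolation beyond 1990 is absurd
> points(t, pop)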

The final step of model fitting is model choice or model selection. That is, we must decide which model we will retain for forecasting. The most popular techniques, AIC, AICc, and SIC, were described in §2.2 in the context of regression models. A discussion of AIC based on Kullback–Leibler distance was given in Problems 2.4 and 2.5.


Example 3.39 Model Choice for the U.S. GNP Series

Returning to the analysis of the U.S. GNP data presented in Examples 3.35 and 3.36, recall that two models, an AR(1) and an MA(2), fit the GNP growth rate well. To choose the final model, we compare the AIC, the AICc, and the SIC for both models.

Below are the R commands for the comparison. The effective sample size in this example is 222. We note that R returns AIC⁸ as part of the ARIMA fit.

> # AIC
> gnpgr.ma$aic
[1] -1431.929                    # MA(2)
> gnpgr.ar$aic
[1] -1431.221                    # AR(1)
> # AICc - see Section 2.2
> log(gnpgr.ma$sigma2)+(222+2)/(222-2-2)
[1] -8.297199                    # MA(2)
> log(gnpgr.ar$sigma2)+(222+1)/(222-1-2)
[1] -8.294156                    # AR(1)
> # SIC or BIC - see Section 2.2
> log(gnpgr.ma$sigma2)+(2*log(222)/222)
[1] -9.276049                    # MA(2)
> log(gnpgr.ar$sigma2)+(1*log(222)/222)
[1] -9.288084                    # AR(1)

The AIC and AICc both prefer the MA(2) fit, whereas the SIC (or BIC) prefers the simpler AR(1) model. It is often the case that the SIC will select a model of smaller order than the AIC or AICc. It would not be unreasonable in this case to retain the AR(1) because pure autoregressive models are easier to work with.

3.9 Multiplicative Seasonal ARIMA Models

In this section, we introduce several modifications made to the ARIMA model to account for seasonal and nonstationary behavior. Often, the dependence on the past tends to occur most strongly at multiples of some underlying seasonal lag s. For example, with monthly economic data, there is a strong yearly component occurring at lags that are multiples of s = 12, because of the strong connections of all activity to the calendar year.

⁸ R calculates this value as AIC = −2 ln Lx(β̂, σ̂²w) + 2(p + q), where Lx(β̂, σ̂²w) is the likelihood of the data evaluated at the MLE; see (3.105). Note that AIC consists of two parts, one measuring model fit and one penalizing for the addition of parameters. Dividing this quantity by n, writing k = p + q, and ignoring constants and terms involving initial conditions, we obtain AIC as given in §2.2. Details are provided in Problems 2.4 and 2.5.


Data taken quarterly will exhibit the yearly repetitive period at s = 4 quarters. Natural phenomena such as temperature also have strong components corresponding to seasons. Hence, the natural variability of many physical, biological, and economic processes tends to match with seasonal fluctuations. Because of this, it is appropriate to introduce autoregressive and moving average polynomials that identify with the seasonal lags. The resulting pure seasonal autoregressive moving average model, say, ARMA(P, Q)s, then takes the form

ΦP (Bs)xt = ΘQ(Bs)wt, (3.139)

with the following definition.

Definition 3.12 The operators

ΦP(B^s) = 1 − Φ1B^s − Φ2B^{2s} − · · · − ΦP B^{Ps}                          (3.140)

and

ΘQ(B^s) = 1 + Θ1B^s + Θ2B^{2s} + · · · + ΘQ B^{Qs}                          (3.141)

are the seasonal autoregressive operator and the seasonal moving average operator of orders P and Q, respectively, with seasonal period s.

Analogous to the properties of nonseasonal ARMA models, the pure seasonal ARMA(P, Q)s is causal only when the roots of ΦP(z^s) lie outside the unit circle, and it is invertible only when the roots of ΘQ(z^s) lie outside the unit circle.

Example 3.40 A Seasonal ARMA Series

A first-order seasonal autoregressive moving average series that might run over months could be written as

(1 − ΦB¹²)xt = (1 + ΘB¹²)wt

or

xt = Φxt−12 + wt + Θwt−12.

This model exhibits the series xt in terms of past lags at the multiple of the yearly seasonal period s = 12 months. It is clear from the above form that estimation and forecasting for such a process involves only straightforward modifications of the unit lag case already treated. In particular, the causal condition requires |Φ| < 1, and the invertible condition requires |Θ| < 1.

For the first-order seasonal (s = 12) MA model, xt = wt + Θwt−12, it is easy to verify that

γ(0) = (1 + Θ²)σ²
γ(±12) = Θσ²
γ(h) = 0, otherwise.

Table 3.2 Behavior of the ACF and PACF for Causal and Invertible Pure Seasonal ARMA Models

           AR(P)s                    MA(Q)s                   ARMA(P, Q)s
  ACF*     Tails off at lags ks,     Cuts off after lag Qs    Tails off at lags ks
           k = 1, 2, . . .
  PACF*    Cuts off after lag Ps     Tails off at lags ks,    Tails off at lags ks
                                     k = 1, 2, . . .

*The values at nonseasonal lags h ≠ ks, for k = 1, 2, . . ., are zero.

Thus, the only nonzero correlation, aside from lag zero, is

ρ(±12) = Θ/(1 + Θ2).

For the first-order seasonal (s = 12) AR model, using the techniques of the nonseasonal AR(1), we have

γ(0) = σ²/(1 − Φ²)
γ(±12k) = σ²Φᵏ/(1 − Φ²),   k = 1, 2, . . .
γ(h) = 0, otherwise.

In this case, the only non-zero correlations are

ρ(±12k) = Φᵏ,   k = 0, 1, 2, . . . .

These results can be verified using the general result that γ(h) = Φγ(h − 12), for h ≥ 1. For example, when h = 1, γ(1) = Φγ(11), but when h = 11, we have γ(11) = Φγ(1), which implies that γ(1) = γ(11) = 0. In addition to these results, the PACF has the analogous extensions from nonseasonal to seasonal models.

As an initial diagnostic criterion, we can use the properties for the pure seasonal autoregressive and moving average series listed in Table 3.2. These properties may be considered as generalizations of the properties for nonseasonal models that were presented in Table 3.1.

In general, we can combine the seasonal and nonseasonal operators into a multiplicative seasonal autoregressive moving average model, denoted by ARMA(p, q) × (P, Q)s, and write

ΦP (Bs)φ(B)xt = ΘQ(Bs)θ(B)wt (3.142)

Page 169: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

3.9: Seasonal ARIMA 157

as the overall model. Although the diagnostic properties in Table 3.2 are not strictly true for the overall mixed model, the behavior of the ACF and PACF tends to show rough patterns of the indicated form. In fact, for mixed models, we tend to see a mixture of the facts listed in Tables 3.1 and 3.2. In fitting such models, focusing on the seasonal autoregressive and moving average components first generally leads to more satisfactory results.

Example 3.41 A Mixed Seasonal Model

Consider an ARMA(0, 1) × (1, 0)12 model

xt = Φxt−12 + wt + θwt−1,

where |Φ| < 1 and |θ| < 1. Then, because xt−12, wt, and wt−1 are uncorrelated, and xt is stationary, γ(0) = Φ²γ(0) + σ²w + θ²σ²w, or

γ(0) = [ (1 + θ²)/(1 − Φ²) ] σ²w.

In addition, multiplying the model by xt−h, h > 0, and taking expectations, we have γ(1) = Φγ(11) + θσ²w, and γ(h) = Φγ(h − 12), for h ≥ 2. Thus, the ACF for this model is

ρ(12h) = Φʰ,   h = 1, 2, . . .
ρ(12h − 1) = ρ(12h + 1) = [ θ/(1 + θ²) ] Φʰ,   h = 0, 1, 2, . . . ,
ρ(h) = 0, otherwise.

The ACF and PACF for this model, with Φ = .8 and θ = −.5, are shown in Figure 3.21. These types of correlation relationships, although idealized here, are typically seen with seasonal data.

To reproduce Figure 3.21 in R, use the following commands:
> phi = c(rep(0,11),.8)
> acf = ARMAacf(ar=phi, ma=-.5, 50)
> pacf = ARMAacf(ar=phi, ma=-.5, 50, pacf=T)
> par(mfrow=c(1,2))
> plot(acf, type="h", xlab="lag")
> abline(h=0)
> plot(pacf, type="h", xlab="lag")
> abline(h=0)

Seasonal nonstationarity can occur, for example, when the process is nearly periodic in the season. For example, with average monthly temperatures over the years, each January would be approximately the same, each February would be approximately the same, and so on. In this case, we might think of average monthly temperature xt as being modeled as

xt = St + wt,

Page 170: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

158 ARIMA Models

Figure 3.21 ACF and PACF of the mixed seasonal ARMA model xt = .8xt−12 + wt − .5wt−1.

where St is a seasonal component that varies slowly from one year to the next, according to a random walk,

St = St−12 + vt.

In this model, wt and vt are uncorrelated white noise processes. The tendency of data to follow this type of model will be exhibited in a sample ACF that is large and decays very slowly at lags h = 12k, for k = 1, 2, . . . . If we subtract the effect of successive years from each other, we find that

(1 − B12)xt = xt − xt−12 = vt + wt − wt−12.

This model is a stationary MA(1)12, and its ACF will have a peak only at lag 12. In general, seasonal differencing can be indicated when the ACF decays slowly at multiples of some season s, but is negligible between the periods. Then, a seasonal difference of order D is defined as

∇_s^D xt = (1 − B^s)^D xt,                                                  (3.143)

where D = 1, 2, . . . takes integer values. Typically, D = 1 is sufficient to obtain seasonal stationarity.
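As a practical aside (ours, not the authors'), seasonal differences are easy to form in R for a hypothetical monthly series x:

> diff(x, lag = 12)          # the seasonal difference (1 - B^12) x_t
> diff(diff(x), lag = 12)    # ordinary then seasonal difference, as used later in Example 3.43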

Incorporating these ideas into a general model leads to the following definition.

Page 171: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

3.9: Seasonal ARIMA 159

Definition 3.13 The multiplicative seasonal autoregressive integrated moving average model, or SARIMA model, of Box and Jenkins (1970) is given by

ΦP(B^s) φ(B) ∇_s^D ∇^d xt = α + ΘQ(B^s) θ(B) wt,                            (3.144)

where wt is the usual Gaussian white noise process. The general model is denoted as ARIMA(p, d, q) × (P, D, Q)s. The ordinary autoregressive and moving average components are represented by polynomials φ(B) and θ(B) of orders p and q, respectively [see (3.5) and (3.17)], and the seasonal autoregressive and moving average components by ΦP(B^s) and ΘQ(B^s) [see (3.140) and (3.141)] of orders P and Q, and ordinary and seasonal difference components by ∇^d = (1 − B)^d and ∇_s^D = (1 − B^s)^D.

Example 3.42 A SARIMA Model

Consider the following model, which often provides a reasonable representation for seasonal, nonstationary, economic time series. We exhibit the equations for the model, denoted by ARIMA(0, 1, 1) × (0, 1, 1)12 in the notation given above, where the seasonal fluctuations occur every 12 months. Then, the model (3.144) becomes

(1 − B12)(1 − B)xt = (1 + ΘB12)(1 + θB)wt. (3.145)

Expanding both sides of (3.145) leads to the representation

(1 − B − B12 + B13)xt = (1 + θB + ΘB12 + ΘθB13)wt,

or in difference equation form

xt = xt−1 + xt−12 − xt−13 + wt + θwt−1 + Θwt−12 + Θθwt−13.
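In R, a model of this form would be specified with the seasonal argument of arima, as is done for the production index in Example 3.43 below; a sketch (ours) for a hypothetical monthly series x is

> arima(x, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))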

Selecting the appropriate model for a given set of data from all of those represented by the general form (3.144) is a daunting task, and we usually think first in terms of finding difference operators that produce a roughly stationary series and then in terms of finding a set of simple autoregressive moving average or multiplicative seasonal ARMA to fit the resulting residual series. Differencing operations are applied first, and then the residuals are constructed from a series of reduced length. Next, the ACF and the PACF of these residuals are evaluated. Peaks that appear in these functions can often be eliminated by fitting an autoregressive or moving average component in accordance with the general properties of Tables 3.1 and 3.2. In considering whether the model is satisfactory, the diagnostic techniques discussed in §3.8 still apply.


Figure 3.22 Values of the Monthly Federal Reserve Board Production Index and Unemployment (1948-1978, n = 372 months).

Example 3.43 Analysis of the Federal Reserve Board Production Index

A problem of great interest in economics involves first identifying a model within the Box–Jenkins class for a given time series and then producing forecasts based on the model. For example, we might consider applying this methodology to the Federal Reserve Board Production Index shown in Figure 3.22. The ACFs and PACFs for this series are shown in Figure 3.23, and we note the slow decay in the ACF and the peak at lag h = 1 in the PACF, indicating nonstationary behavior.

Following the recommended procedure, a first difference was taken, and the ACF and PACF of the first difference

∇xt = xt − xt−1

are shown in Figure 3.24. Noting the peaks at 12, 24, 36, and 48 with relatively slow decay suggested a seasonal difference and Figure 3.25 shows the seasonal difference of the differenced production, say,

∇12∇xt = (1 − B¹²)(1 − B)xt.

Characteristics of the ACF and PACF of this series tend to show a strong peak at h = 12 in the autocorrelation function, with smaller peaks appearing at h = 24, 36, combined with peaks at h = 12, 24, 36, 48, in the partial autocorrelation function.


Figure 3.23 ACF and PACF of the production series.

Using Table 3.2, this suggests either a seasonal moving average of order Q = 1, a seasonal autoregression of possible order P = 2, or, due to the fact that both the ACF and PACF may be tailing off at the seasonal lags, perhaps both components, P = 2 and Q = 1, are needed.

Inspecting the ACF and the PACF at the within season lags, h = 1, . . . , 11, it appears that both the ACF and PACF are tailing off. Based on Table 3.1, this result indicates that we should consider fitting a model with both p > 0 and q > 0 for the nonseasonal components. Hence, at first we will consider p = 1 and q = 1.

Fitting the three models suggested by these observations and computing the AIC for each, we obtain:

(i) ARIMA(1, 1, 1) × (0, 1, 1)12, AIC = 1162.30

(ii) ARIMA(1, 1, 1) × (2, 1, 0)12, AIC = 1169.04

(iii) ARIMA(1, 1, 1) × (2, 1, 1)12, AIC = 1148.43

On the basis of the AICs, we prefer the ARIMA(1, 1, 1) × (2, 1, 1)12 model. Figure 3.26 shows the diagnostics for this model, leading to the conclusion that the model is adequate. We note, however, the presence of a few outliers.


Figure 3.24 ACF and PACF of differenced production, (1 − B)xt.

Figure 3.25 ACF and PACF of first differenced and then seasonally differenced production, (1 − B)(1 − B¹²)xt.


Figure 3.26 Diagnostics for the ARIMA(1, 1, 1) × (2, 1, 1)12 fit on the Production data.

The fitted ARIMA(1, 1, 1) × (2, 1, 1)12 model is

(1 + .22(.08)B¹² + .28(.06)B²⁴)(1 − .58(.11)B) ∇12∇xt = (1 − .50(.07)B¹²)(1 − .27(.13)B)wt

with σ̂²w = 1.35. Forecasts based on the fitted model for the next 12 months are shown in Figure 3.27.

Finally, we present the R code necessary to reproduce most of the analyses performed in Example 3.43.

> prod=scan("/mydata/prod.dat")
> par(mfrow=c(2,1))              # (P)ACF of data
> acf(prod, 48)
> pacf(prod, 48)
> par(mfrow=c(2,1))              # (P)ACF of d1 data
> acf(diff(prod), 48)
> pacf(diff(prod), 48)


Figure 3.27 Forecasts and limits for production index. The vertical dotted line separates the data from the predictions.

> par(mfrow=c(2,1))              # (P)ACF of d1-d12 data
> acf(diff(diff(prod),12), 48)
> pacf(diff(diff(prod),12), 48)

> ### fit model (iii)
> prod.fit3 = arima(prod, order=c(1,1,1),
+   seasonal=list(order=c(2,1,1), period=12))
> prod.fit3                      # to view the results
> tsdiag(prod.fit3, gof.lag=48)  # diagnostics

> ### forecasts for the final model
> prod.pr = predict(prod.fit3, n.ahead=12)
> U = prod.pr$pred + 2*prod.pr$se
> L = prod.pr$pred - 2*prod.pr$se
> month=337:372
> plot(month, prod[month], type="o", xlim=c(337,384),
+   ylim=c(100,180), ylab="Production")
> lines(prod.pr$pred, col="red", type="o")
> lines(U, col="blue", lty="dashed")
> lines(L, col="blue", lty="dashed")
> abline(v=372.5, lty="dotted")


Problems

Section 3.2

3.1 For an MA(1), xt = wt + θwt−1, show that |ρx(1)| ≤ 1/2 for any number θ. For which values of θ does ρx(1) attain its maximum and minimum?

3.2 Let wt be white noise with variance σ²w and let |φ| < 1 be a constant. Consider the process

x1 = w1
xt = φxt−1 + wt,   t = 2, 3, . . . .

(a) Find the mean and the variance of {xt, t = 1, 2, . . .}. Is xt stationary?

(b) Show

corr(xt, xt−h) = φʰ [ var(xt−h)/var(xt) ]^{1/2}

for h ≥ 0.

(c) Argue that for large t,

var(xt) ≈ σ²w/(1 − φ²)

and

corr(xt, xt−h) ≈ φʰ,   h ≥ 0,

so in a sense, xt is “asymptotically stationary.”

(d) Comment on how you could use these results to simulate n observations of a stationary Gaussian AR(1) model from simulated iid N(0,1) values.

(e) Now suppose x1 = w1/√(1 − φ²). Is this process stationary?

3.3 Identify the following models as ARMA(p, q) models (watch out for parameter redundancy), and determine whether they are causal and/or invertible:

(a) xt = .80xt−1 − .15xt−2 + wt − .30wt−1.

(b) xt = xt−1 − .50xt−2 + wt − wt−1.

3.4 Verify the causal conditions for an AR(2) model given in (3.27). That is, show that an AR(2) is causal if and only if (3.27) holds.


Section 3.3

3.5 For the AR(2) model given by xt = −.9xt−2 + wt, find the roots of the autoregressive polynomial, and then sketch the ACF, ρ(h).

3.6 For the AR(2) autoregressive series shown below, determine a set of difference equations that can be used to find ψj, j = 0, 1, . . . in the representation (3.24) and the autocorrelation function ρ(h), h = 0, 1, . . .. Solve for the constants in the ACF using the known initial conditions, and plot the first eight values.

(a) xt + 1.6xt−1 + .64xt−2 = wt.

(b) xt − .40xt−1 − .45xt−2 = wt.

(c) xt − 1.2xt−1 + .85xt−2 = wt.

Section 3.4

3.7 Verify the calculations for the autocorrelation function of an ARMA(1, 1) process given in Example 3.11. Compare the form with that of the ACF for the ARMA(1, 0) and the ARMA(0, 1) series. Plot the ACFs of the three series on the same graph for φ = .6, θ = .9, and comment on the diagnostic capabilities of the ACF in this case.

3.8 Generate n = 100 observations from each of the three models discussed in Problem 3.7. Compute the sample ACF for each model and compare it to the theoretical values. Compute the sample PACF for each of the generated series and compare the sample ACFs and PACFs with the general results given in Table 3.1.

Section 3.5

3.9 Let Mt represent the cardiovascular mortality series discussed in Chapter 2, Example 2.2.

(a) Fit an AR(2) to Mt using linear regression as in Example 3.16.

(b) Assuming the fitted model in (a) is the true model, find the forecasts over a four-week horizon, x^n_{n+m}, for m = 1, 2, 3, 4, and the corresponding 95% prediction intervals.

3.10 Consider the MA(1) series

xt = wt + θwt−1,

where wt is white noise with variance σ_w^2.


(a) Derive the minimum mean square error one-step forecast based on the infinite past, and determine the mean square error of this forecast.

(b) Let x^n_{n+1} be the truncated one-step-ahead forecast as given in (3.82). Show that

E[(x_{n+1} − x^n_{n+1})^2] = σ^2(1 + θ^{2+2n}).

Compare the result with (a), and indicate how well the finite approximation works in this case.

3.11 In the context of equation (3.56), show that, if γ(0) > 0 and γ(h) → 0 as h → ∞, then Γn is positive definite.

3.12 Suppose xt is stationary with zero mean and recall the definition of the PACF given by (3.49) and (3.50). That is, let

εt = xt − Σ_{i=1}^{h−1} ai xt−i

and

δt−h = xt−h − Σ_{j=1}^{h−1} bj xt−j

be the two residuals where {a1, . . . , ah−1} and {b1, . . . , bh−1} are chosen so that they minimize the mean-squared errors

E[εt^2] and E[δt−h^2].

The PACF at lag h was defined as the cross-correlation between εt and δt−h; that is,

φhh = E(εt δt−h) / √[ E(εt^2) E(δt−h^2) ].

Let Rh be the h × h matrix with elements ρ(i − j), i, j = 1, . . . , h, let ρ_h = (ρ(1), ρ(2), . . . , ρ(h))′ be the vector of lagged autocorrelations, ρ(h) = corr(xt+h, xt), and let ρ̃_h = (ρ(h), ρ(h − 1), . . . , ρ(1))′ be the reversed vector. In addition, let x^h_t denote the BLP of xt given {xt−1, . . . , xt−h}:

x^h_t = αh1 xt−1 + · · · + αhh xt−h,

as described in Property P3.3. Prove

φhh = [ ρ(h) − ρ̃′_{h−1} R^{−1}_{h−1} ρ_{h−1} ] / [ 1 − ρ̃′_{h−1} R^{−1}_{h−1} ρ̃_{h−1} ] = αhh.

In particular, this result proves Property P3.4.


Hint: Divide the prediction equations [see (3.56)] by γ(0) and write the matrix equation in the partitioned form as

[ R_{h−1}   ρ̃_{h−1} ;  ρ̃′_{h−1}   ρ(0) ] [ α_1 ;  αhh ] = [ ρ_{h−1} ;  ρ(h) ],

where the h × 1 vector of coefficients α = (αh1, . . . , αhh)′ is partitioned as α = (α′_1, αhh)′.

3.13 Suppose we wish to find a prediction function g(x) that minimizes

MSE = E[(y − g(x))^2],

where x and y are jointly distributed random variables with density function f(x, y).

(a) Show that MSE is minimized by the choice

g(x) = E(y | x).

Hint:

MSE = ∫ [ ∫ (y − g(x))^2 f(y|x) dy ] f(x) dx.

(b) Apply the above result to the model

y = x^2 + z,

where x and z are independent zero-mean normal variables with variance one. Show that MSE = 1.

(c) Suppose we restrict our choices for the function g(x) to linear functions of the form

g(x) = a + bx

and determine a and b to minimize MSE. Show that a = 1 and

b = E(xy)/E(x^2) = 0

and MSE = 3. What do you interpret this to mean?

3.14 For an AR(1) model, determine the general form of the m-step-ahead forecast x^t_{t+m} and show

E[(x_{t+m} − x^t_{t+m})^2] = σ_w^2 (1 − φ^{2m}) / (1 − φ^2).

3.15 Consider the ARMA(1,1) model discussed in Example 3.6, equation (3.26); that is, xt = .9xt−1 + .5wt−1 + wt. Show that truncated prediction as defined in (3.81) is equivalent to truncated prediction using the recursive formula (3.82).

3.16 Verify statement (3.78), that for a fixed sample size, the ARMA prediction errors are correlated.


Section 3.6

3.17 Let Mt represent the cardiovascular mortality series discussed in Chapter 2, Example 2.2. Fit an AR(2) model to the data using linear regression and using Yule–Walker.

(a) Compare the parameter estimates obtained by the two methods.

(b) Compare the estimated standard errors of the coefficients obtained by linear regression with their corresponding asymptotic approximations, as given in Property P3.9.

3.18 Suppose x1, . . . , xn are observations from an AR(1) process with µ = 0.

(a) Show the backcasts can be written as x^n_t = φ^{1−t} x1, for t ≤ 1.

(b) In turn, show, for t ≤ 1, the backcasted errors are wt(φ) = x^n_t − φ x^n_{t−1} = φ^{1−t}(1 − φ^2) x1.

(c) Use the result of (b) to show Σ_{t=−∞}^{1} wt^2(φ) = (1 − φ^2) x1^2.

(d) Use the result of (c) to verify the unconditional sum of squares, S(φ), can be written in the innovations form as Σ_{t=−∞}^{n} wt^2(φ).

(e) Find x^{t−1}_t and r^{t−1}_t, and show that S(φ) can also be written as Σ_{t=1}^{n} (xt − x^{t−1}_t)^2 / r^{t−1}_t.

3.19 Generate n = 500 observations from the ARMA model given by

xt = .9xt−1 + wt − .9wt−1,

with wt ∼ iid N(0, 1). Plot the simulated data, compute the sample ACF and PACF of the simulated data, and fit an ARMA(1, 1) model to the data. What happened and how do you explain the results?
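A possible R starting point for generating and inspecting the data in Problem 3.19 is sketched below; it is not part of the original problem statement, the seed is arbitrary, and the interpretation of the output is left to the reader.

> set.seed(1)                            # arbitrary seed, for reproducibility only
> x = arima.sim(list(order=c(1,0,1), ar=.9, ma=-.9), n=500)
>            # note R's MA sign convention: ma = -.9 gives wt - .9 wt-1
> plot.ts(x)                             # the simulated series
> par(mfrow=c(2,1))
> acf(x, 50)                             # sample ACF
> pacf(x, 50)                            # sample PACF
> arima(x, order=c(1,0,1))               # fit an ARMA(1,1) and inspect the estimates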

3.20 Generate 10 realizations of length n = 200 of a series from an ARMA(1,1) model with φ1 = .90, θ1 = .2 and σ^2 = .25. Fit the model by nonlinear least squares or maximum likelihood in each case and compare the estimators to the true values.

3.21 Generate n = 50 observations from a Gaussian AR(1) model with φ = .99 and σw = 1. Using an estimation technique of your choice, compare the approximate asymptotic distribution of your estimate (the one you would use for inference) with the results of a bootstrap experiment (use B = 200).

3.22 Using Example 3.30 as your guide, find the Gauss–Newton procedure for estimating the autoregressive parameter, φ, from the AR(1) model, xt = φxt−1 + wt, given data x1, . . . , xn. Does this procedure produce the unconditional or the conditional estimator? Hint: Write the model as wt(φ) = xt − φxt−1; your solution should work out to be a non-recursive procedure.


3.23 Consider the stationary series generated by

xt = α + φxt−1 + wt + θwt−1,

where E(xt) = µ, |θ| < 1, |φ| < 1 and the wt are iid random variables with zero mean and variance σ_w^2.

(a) Determine the mean as a function of α for the above model. Find the autocovariance and ACF of the process xt, and show that the process is weakly stationary. Is the process strictly stationary?

(b) Prove the limiting distribution as n → ∞ of the sample mean,

x̄ = n^{−1} Σ_{t=1}^{n} xt,

is normal, and find its limiting mean and variance in terms of α, φ, θ, and σ_w^2. (Note: This part uses results from Appendix A.)

3.24 A problem of interest in the analysis of geophysical time series involves a simple model for observed data containing a signal and a reflected version of the signal with unknown amplification factor a and unknown time delay δ. For example, the depth of an earthquake is proportional to the time delay δ for the P wave and its reflected form pP on a seismic record. Assume the signal is white and Gaussian with variance σ_s^2, and consider the generating model

xt = st + a st−δ.

(a) Prove the process xt is stationary. If |a| < 1, show that

st = Σ_{j=0}^{∞} (−a)^j xt−δj

is a mean square convergent representation for the signal st, for t = 0, ±1, ±2, . . . .

(b) If the time delay δ is assumed to be known, suggest an approximate computational method for estimating the parameters a and σ_s^2 using maximum likelihood and the Gauss–Newton method.

(c) If the time delay δ is an unknown integer, specify how we could estimate the parameters including δ. Generate a n = 500 point series with a = .9, σ_w^2 = 1 and δ = 5. Estimate the integer time delay δ by searching over δ = 3, 4, . . . , 7.


3.25 Forecasting with estimated parameters: Let x1, x2, . . . , xn be a sample of size n from a causal AR(1) process, xt = φxt−1 + wt. Let φ̂ be the Yule–Walker estimator of φ.

(a) Show φ̂ − φ = Op(n^{−1/2}). See Appendix A for the definition of Op(·).

(b) Let x^n_{n+1} be the one-step-ahead forecast of x_{n+1} given the data x1, . . . , xn, based on the known parameter, φ, and let x̂^n_{n+1} be the one-step-ahead forecast when the parameter is replaced by φ̂. Show x^n_{n+1} − x̂^n_{n+1} = Op(n^{−1/2}).

Section 3.7

3.26 Suppose

yt = β0 + β1 t + · · · + βq t^q + xt,   βq ≠ 0,

where xt is stationary. First, show that ∇^k xt is stationary for any k = 1, 2, . . . , and then show that ∇^k yt is not stationary for k < q, but is stationary for k ≥ q.

3.27 Verify that the IMA(1,1) model given in (3.131) can be inverted and written as (3.132).

3.28 For the logarithm of the glacial varve data, say, xt, presented in Example 3.31, use the first 100 observations and calculate the EWMA, x^t_{t+1}, given in (3.134) for t = 1, . . . , 100, using λ = .25, .50, and .75, and plot the EWMAs and the data superimposed on each other. Comment on the results.

Section 3.8

3.29 In Example 3.36, we presented the diagnostics for the MA(2) fit to the GNP growth rate series. Using that example as a guide, complete the diagnostics for the AR(1) fit.

3.30 Using the gas price series described in Problem 2.9, fit an ARIMA(p, d, q) model to the data, performing all necessary diagnostics. Comment.

3.31 The second column in the data file globtemp2.dat contains annual global temperature deviations from 1880 to 2004. The data are an update to the Hansen-Lebedeff global temperature data and the url of the data source is in the file. Fit an ARIMA(p, d, q) model to the data, performing all of the necessary diagnostics. After deciding on an appropriate model, forecast (with limits) the next 10 years. Comment. In R, use read.table to load the data file.


3.32 One of the series collected along with particulates, temperature, and mortality described in Example 2.2 is the sulfur dioxide series. Fit an ARIMA(p, d, q) model to the data, performing all of the necessary diagnostics. After deciding on an appropriate model, forecast the data into the future four time periods ahead (about one month) and calculate 95% prediction intervals for each of the four forecasts. Comment.

Section 3.9

3.33 Consider the ARIMA model

xt = wt + Θwt−2.

(a) Identify the model using the notation ARIMA(p, d, q) × (P, D, Q)s.

(b) Show that the series is invertible for |Θ| < 1, and find the coefficients in the representation

wt = Σ_{k=0}^{∞} πk xt−k.

(c) Develop equations for the m-step ahead forecast, xn+m, and its variance based on the infinite past, xn, xn−1, . . . .

3.34 Sketch the ACF of the seasonal ARIMA(0, 1) × (1, 0)_{12} model with Φ = .8 and θ = .5.

3.35 Fit a seasonal ARIMA model of your choice to the unemployment data displayed in Figure 3.22. Use the estimated model to forecast the next 12 months.

3.36 Fit a seasonal ARIMA model of your choice to the U.S. Live Birth Series (birth.dat). Use the estimated model to forecast the next 12 months.

3.37 Fit an appropriate seasonal ARIMA model to the log-transformed Johnson and Johnson earnings series of Example 1.1. Use the estimated model to forecast the next 4 quarters.

The following problems require the supplemental material given in Appendix B

3.38 Suppose xt = Σ_{j=1}^{p} φj xt−j + wt, where φp ≠ 0 and wt is white noise such that wt is uncorrelated with {xk; k < t}. Use the Projection Theorem to show that, for n > p, the BLP of xn+1 on sp{xk, k ≤ n} is

x̂_{n+1} = Σ_{j=1}^{p} φj x_{n+1−j}.


3.39 Use the Projection Theorem to derive the Innovations Algorithm, Property P3.6, equations (3.68)-(3.70). Then, use Theorem B.2 to derive the m-step-ahead forecast results given in (3.71) and (3.72).

3.40 Consider the series xt = wt − wt−1, where wt is a white noise process with mean zero and variance σ_w^2. Suppose we consider the problem of predicting xn+1, based on only x1, . . . , xn. Use the Projection Theorem to answer the questions below.

(a) Show the best linear predictor is

x^n_{n+1} = − (1/(n + 1)) Σ_{k=1}^{n} k xk.

(b) Prove the mean square error is

E(x_{n+1} − x^n_{n+1})^2 = ((n + 2)/(n + 1)) σ_w^2.

3.41 Use Theorem B.2 and B.3 to verify (3.105).

3.42 Prove Theorem B.2.

3.43 Prove Property P3.2.


Chapter 4

Spectral Analysis and Filtering

4.1 Introduction

The notion that a time series exhibits repetitive or regular behavior over time is of fundamental importance because it distinguishes time series analysis from classical statistics, which assumes complete independence over time. We have seen how dependence over time can be introduced through models that describe in detail the way certain empirical data behaves, even to the extent of producing forecasts based on the models. It is natural that models based on predicting the present as a regression on the past, such as are provided by the celebrated ARIMA or state-space forms, will be attractive to statisticians, who are trained to view nature in terms of linear models. In fact, the difference equations used to represent these kinds of models are simply the discrete versions of linear differential equations that may, in some instances, provide the ideal physical model for a certain phenomenon. An alternate version of the way nature behaves exists, however, and is based on a decomposition of an empirical series into its regular components.

In this chapter, we argue, the concept of regularity of a series can best be expressed in terms of periodic variations of the underlying phenomenon that produced the series, expressed as Fourier frequencies being driven by sines and cosines. Such a possibility was discussed in Chapters 1 and 2. From a regression point of view, we may imagine a system responding to various driving frequencies by producing linear combinations of sine and cosine functions. Expressed in these terms, the time domain approach may be thought of as regression of the present on the past, whereas the frequency domain approach may be considered as regression of the present on periodic sines and cosines. The frequency domain approaches are the focus of this chapter and


Chapter 7. To illustrate the two methods for generating series with a single primary periodic component, consider Figure 1.9, which was generated from a simple second-order autoregressive model, and the middle and bottom panels of Figure 1.11, which were generated by adding a cosine wave with a period of 50 points to white noise. Both series exhibit strong periodic fluctuations, illustrating that both models can generate time series with regular behavior. As discussed in Examples 2.7–2.9, a fundamental objective of spectral analysis is to identify the dominant frequencies in a series and to find an explanation of the system from which the measurements were derived.

Of course, the primary justification for any alternate model must lie in its potential for explaining the behavior of some empirical phenomenon. In this sense, an explanation involving only a few kinds of primary oscillations becomes simpler and more physically meaningful than a collection of parameters estimated for some selected difference equation. It is the tendency of observed data to show periodic kinds of fluctuations that justifies the use of frequency domain methods. Many of the examples in §1.2 are time series representing real phenomena that are driven by periodic components. The speech recording of the syllable aa...hh in Figure 1.3 contains a complicated mixture of frequencies related to the opening and closing of the glottis. Figure 1.5 shows the monthly SOI, which we later explain as a combination of two kinds of periodicities, a seasonal periodic component of 12 months and an El Nino component of about three to five years. Of fundamental interest is the return period of the El Nino phenomenon, which can have profound effects on local climate. Also of interest is whether the different periodic components of the new fish population depend on corresponding seasonal and El Nino-type oscillations. We introduce the coherence as a tool for relating the common periodic behavior of two series. Seasonal periodic components are often pervasive in economic time series; this phenomenon can be seen in the quarterly earnings series shown in Figure 1.1. In Figure 1.6, we see the extent to which various parts of the brain will respond to a periodic stimulus generated by having the subject do alternate left and right finger tapping. Figure 1.7 shows series from an earthquake and a nuclear explosion. The relative amounts of energy at various frequencies for the two phases can produce statistics, useful for discriminating between earthquakes and explosions.

In this chapter, we summarize an approach to handling correlation generated in stationary time series that begins by transforming the series to the frequency domain. This simple linear transformation essentially matches sines and cosines of various frequencies against the underlying data and serves two purposes as discussed in Examples 2.7 and 2.8. The periodogram that was introduced in Example 2.8 has its population counterpart called the power spectrum, and its estimation is a main goal of spectral analysis. Another purpose of exploring this topic is statistical convenience resulting from the periodic components being nearly uncorrelated. This property facilitates writing likelihoods based on classical statistical methods.

An important part of analyzing data in the frequency domain, as well as


the time domain, is the investigation and exploitation of the properties of the time-invariant linear filter. This special linear transformation is used similarly to linear regression in conventional statistics, and we use many of the same terms in the time series context. We have previously mentioned the coherence as a measure of the relation between two series at a given frequency, and we show later that this coherence also measures the performance of the best linear filter relating the two series. Linear filtering can also be an important step in isolating a signal embedded in noise. For example, the lower panels of Figure 1.11 contain a signal contaminated with an additive noise, whereas the upper panel contains the pure signal. It might also be appropriate to ask whether a linear filter transformation exists that could be applied to the lower panel to produce a series closer to the signal in the upper panel. The use of filtering for reducing noise will also be a part of the presentation in this chapter. We emphasize, throughout, the analogy between filtering techniques and conventional linear regression.

Many frequency scales will often coexist, depending on the nature of the problem. For example, in the Johnson & Johnson data set in Figure 1.1, the predominant frequency of oscillation is one cycle per year (4 quarters), or .25 cycles per observation. The predominant frequency in the SOI and fish populations series in Figure 1.5 is also one cycle per year, but this corresponds to 1 cycle every 12 months, or .083 cycles per observation. For simplicity, we measure frequency, ω, at cycles per time point and discuss the implications of certain frequencies in terms of the problem context. Of descriptive interest is the period of a time series, defined as the number of points in a cycle, i.e.,

T = 1/ω.   (4.1)

Hence, the predominant period of the Johnson & Johnson series is 1/.25 or 4 quarters per cycle, whereas the predominant period of the SOI series is 12 months per cycle.

4.2 Cyclical Behavior and Periodicity

As previously mentioned, we have already encountered the notion of periodicity in numerous examples in Chapters 1 and 2. The general notion of periodicity can be made more precise by introducing some terminology. In order to define the rate at which a series oscillates, we first define a cycle as one complete period of a sine or cosine function defined over a time interval of length 2π. As in (1.5), we consider the periodic process

xt = A cos(2πωt + φ) (4.2)

for t = 0,±1,±2, . . ., where ω is a frequency index, defined in cycles per unit time with A determining the height or amplitude of the function and φ, called


the phase, determining the start point of the cosine function. We can introduce random variation in this time series by allowing the amplitude and phase to vary randomly.

As discussed in Example 2.7, for purposes of data analysis, it is easier to use a trigonometric identity1 and write (4.2) as

xt = U1 cos(2πωt) + U2 sin(2πωt), (4.3)

where U1 = A cos φ and U2 = −A sin φ are often taken to be normally distributed random variables. In this case, the amplitude is A = √(U1^2 + U2^2) and the phase is φ = tan^{−1}(−U2/U1). From these facts we can show that if, and only if, in (4.2), A and φ are independent random variables, where A^2 is chi-squared with 2 degrees of freedom, and φ is uniformly distributed on (−π, π), then U1 and U2 are independent, standard normal random variables (see Problem 4.2).

The above random process is also a function of its frequency, defined by the parameter ω. The frequency is measured in cycles per unit time, or in cycles per point in the above illustration. For ω = 1, the series makes one cycle per time unit; for ω = .50, the series makes a cycle every two time units; for ω = .25, every four units, and so on. In general, data that occurs at discrete time points will need at least two points to determine a cycle, so the highest frequency of interest is .5 cycles per point. This frequency is called the folding frequency and defines the highest frequency that can be seen in discrete sampling. Higher frequencies sampled this way will appear at lower frequencies, called aliases; an example is the way a camera samples a rotating wheel on a moving automobile in a movie, in which the wheel appears to be rotating at a different rate. For example, movies are recorded at 24 frames per second. If the camera is filming a wheel that is rotating at the rate of 24 cycles per second (or 24 Hertz), the wheel will appear to stand still (that's about 110 miles per hour in case you were wondering).
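A minimal R sketch of aliasing (not from the text; the frequencies and series length are arbitrary) shows that a cosine at .8 cycles per point, which is above the folding frequency .5, is indistinguishable at the sampled time points from a cosine at its alias .2 = 1 − .8 cycles per point:

> t = 1:100
> x.fast  = cos(2*pi*0.8*t)     # frequency .8 cycles per point (above the folding frequency)
> x.alias = cos(2*pi*0.2*t)     # its alias at .2 cycles per point
> max(abs(x.fast - x.alias))    # essentially zero; the two agree at every sampled time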

Consider a generalization of (4.3) that allows mixtures of periodic series, with multiple frequencies and amplitudes.

xt = Σ_{k=1}^{q} [Uk1 cos(2πωkt) + Uk2 sin(2πωkt)],   (4.4)

where Uk1, Uk2, for k = 1, 2, . . . , q, are independent zero-mean random variables with variances σ_k^2, and the ωk are distinct frequencies. Notice that (4.4) exhibits the process as a sum of independent components, with variance σ_k^2 for frequency ωk. Using the independence of the Us and a trig identity,1 it is easy to show (Problem 4.3) that the autocovariance function of the process is

γ(h) = Σ_{k=1}^{q} σ_k^2 cos(2πωkh),   (4.5)

1cos(α ± β) = cos(α) cos(β) ∓ sin(α) sin(β).


Figure 4.1 Periodic components and their sum as described in Example 4.1.

and we note the autocovariance function is the sum of periodic components with weights proportional to the variances σ_k^2. Hence, xt is a mean-zero stationary process with variance

γ(0) = E(xt^2) = Σ_{k=1}^{q} σ_k^2,   (4.6)

which exhibits the overall variance as a sum of variances of each of the component parts.

Example 4.1 A Periodic Series

Figure 4.1 shows an example of the mixture (4.4) with q = 3 constructed in the following way. First, for t = 1, . . . , 100, we generated three series

xt1 = 2 cos(2πt 6/100) + 3 sin(2πt 6/100)

xt2 = 4 cos(2πt 10/100) + 5 sin(2πt 10/100)

xt3 = 6 cos(2πt 40/100) + 7 sin(2πt 40/100)

These three series are displayed in Figure 4.1 along with the corresponding frequencies and squared amplitudes. For example, the squared amplitude of xt1 is 2^2 + 3^2 = 13. Hence, the maximum and minimum values that xt1 will attain are ±√13 = ±3.61.


Finally, we constructed

xt = xt1 + xt2 + xt3

and this series is also displayed in Figure 4.1. We note that xt appears to behave as some of the periodic series we saw in Chapters 1 and 2. The systematic sorting out of the essential frequency components in a time series, including their relative contributions, constitutes one of the main objectives of spectral analysis.

The R code to reproduce Figure 4.1 is

> t = 1:100
> x1 = 2*cos(2*pi*t*6/100) + 3*sin(2*pi*t*6/100)
> x2 = 4*cos(2*pi*t*10/100) + 5*sin(2*pi*t*10/100)
> x3 = 6*cos(2*pi*t*40/100) + 7*sin(2*pi*t*40/100)
> x = x1 + x2 + x3
> par(mfrow=c(2,2))
> plot.ts(x1, ylim=c(-16,16), main="freq=6/100, amp^2=13")
> plot.ts(x2, ylim=c(-16,16), main="freq=10/100, amp^2=41")
> plot.ts(x3, ylim=c(-16,16), main="freq=40/100, amp^2=85")
> plot.ts(x, ylim=c(-16,16), main="sum")

Example 4.2 The Scaled Periodogram for Example 4.1

In §2.3, Example 2.8, we introduced the periodogram as a way to discover the periodic components of a time series. Recall that the scaled periodogram is given by

P(j/n) = ( (2/n) Σ_{t=1}^{n} xt cos(2πtj/n) )^2 + ( (2/n) Σ_{t=1}^{n} xt sin(2πtj/n) )^2   (4.7)

and it may be regarded as a measure of the squared correlation of the data with sinusoids oscillating at a frequency of ωj = j/n, or j cycles in n time points. Recall that we are basically computing the regression of the data on the sinusoids varying at the fundamental frequencies, j/n. As discussed in Example 2.8, the periodogram may be computed quickly using the fast Fourier transform (FFT), and there is no need to run repeated regressions.

The scaled periodogram of the data, xt, simulated in Example 4.1 is shown in Figure 4.2, and it clearly identifies the three components xt1, xt2, and xt3 of xt. Moreover, the heights of the scaled periodogram shown in the figure are

P (6/100) = 13, P (10/100) = 41, P (40/100) = 85

and P(j/n) = 0 otherwise. These are exactly the values of the squared amplitudes of the components generated in Example 4.1. This outcome


Figure 4.2 Periodogram of the data generated in Example 4.1.

suggests that the periodogram may provide some insight into the variance components, (4.6), of a real set of data.

Assuming the simulated data, x, were retained from the previous example, the R code to reproduce Figure 4.2 is

> P = abs(2*fft(x)/100)^2
> f = 0:50/100
> plot(f, P[1:51], type="o", xlab="frequency",
+   ylab="periodogram")

A curious reader may also wish to plot the entire periodogram over all fundamental frequencies between zero and one. A quick and easy way to do this is to use the command plot.ts(P).

If we consider the data xt in Example 4.1 as a color (waveform) made up of primary colors xt1, xt2, xt3 at various strengths (amplitudes), then we might consider the periodogram as a prism that decomposes the color xt into its primary colors (spectrum). Hence the term spectral analysis.

Another fact that may be of use in understanding the periodogram is that for any time series sample x1, . . . , xn, where n is odd, we may write, exactly

xt = a0 + Σ_{j=1}^{(n−1)/2} [aj cos(2πt j/n) + bj sin(2πt j/n)],   (4.8)

for t = 1, . . . , n and suitably chosen coefficients. If n is even, the representation (4.8) can be modified by summing to (n/2 − 1) and adding an additional component given by a_{n/2} cos(2πt 1/2) = a_{n/2}(−1)^t. The crucial point here is that (4.8) is exact for any sample. Hence (4.4) may be thought of as an


approximation to (4.8), the idea being that many of the coefficients in (4.8) may be close to zero. Recall from Example 2.8, that

P(j/n) = aj^2 + bj^2,   (4.9)

so the scaled periodogram indicates which periodic components in (4.8) are large and which components are small. We also saw (4.9) in Example 4.2.

The periodogram, which was introduced in Schuster (1898) and used in Schuster (1906) for studying the periodicities in the sunspot series (shown in Figure 4.31 in the Problems section) is a sample based statistic. In Example 4.2, we discussed the fact that the periodogram may be giving us an idea of the variance components associated with each frequency, as presented in (4.6), of a time series. These variance components, however, are population parameters. The concepts of population parameters and sample statistics, as they relate to spectral analysis of time series can be generalized to cover stationary time series and that is the topic of the next section.

4.3 The Spectral Density

The idea that a time series is composed of periodic components, appearing in proportion to their underlying variances, is fundamental in the spectral representation given in Theorem C.2 of Appendix C. The result is quite technical because it involves stochastic integration; that is, integration with respect to a stochastic process. In nontechnical terms, Theorem C.2 says that (4.4) is approximately true for any stationary time series. In other words, any stationary time series may be thought of, approximately, as the random superposition of sines and cosines oscillating at various frequencies.

Given that (4.4) is approximately true for all stationary time series, the next question is whether a meaningful representation for its autocovariance function, like the one displayed in (4.5), also exists. The answer is yes, and this representation is given in Theorem C.1 of Appendix C. The following example will help explain the result.

Example 4.3 A Periodic Stationary Process

Consider a periodic stationary random process given by (4.3), with a fixed frequency ω0, say,

xt = U1 cos(2πω0t) + U2 sin(2πω0t),

where U1 and U2 are independent zero-mean random variables with equal variance σ^2. The number of time periods needed for the above series to


complete one cycle is exactly 1/ω0, and the process makes exactly ω0 cycles per point for t = 0,±1,±2, . . .. It is easily shown that2

γ(h) = σ^2 cos(2πω0h) = (σ^2/2) e^{−2πiω0h} + (σ^2/2) e^{2πiω0h} = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω)

using a Riemann–Stieltjes integration, where F(ω) is the function defined by

F(ω) = 0 for ω < −ω0,   F(ω) = σ^2/2 for −ω0 ≤ ω < ω0,   and F(ω) = σ^2 for ω ≥ ω0.

The function F(ω) behaves like a cumulative distribution function for a discrete random variable, except that F(∞) = σ^2 = γx(0) instead of one. In fact, F(ω) is a cumulative distribution function, not of probabilities, but rather of variances associated with the frequency ω0 in an analysis of variance, with F(∞) being the total variance of the process xt. Hence, we term F(ω) the spectral distribution function.
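A minimal simulation check of the autocovariance computed in Example 4.3 (the frequency, variance, time point, lag, and seed below are arbitrary illustration values, not from the text):

> set.seed(2)                             # arbitrary seed
> w0 = 1/10; sigma2 = 4; t = 5; h = 3     # arbitrary illustration values
> U1 = rnorm(10000, 0, sqrt(sigma2))
> U2 = rnorm(10000, 0, sqrt(sigma2))
> x.t  = U1*cos(2*pi*w0*t)     + U2*sin(2*pi*w0*t)
> x.th = U1*cos(2*pi*w0*(t+h)) + U2*sin(2*pi*w0*(t+h))
> mean(x.t * x.th)                        # Monte Carlo estimate of gamma(h)
> sigma2 * cos(2*pi*w0*h)                 # theoretical value sigma^2 cos(2 pi w0 h)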

Theorem C.1 in Appendix C states that a representation such as the one given in Example 4.3 always exists for a stationary process. In particular, if xt is stationary with autocovariance γ(h) = E[(xt+h − µ)(xt − µ)], then there exists a unique monotonically increasing function F(ω), called the spectral distribution function, that is bounded, with F(−∞) = F(−1/2) = 0, and F(∞) = F(1/2) = γ(0) such that

γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω).   (4.10)

A more important situation we use repeatedly is the one covered by Theorem C.3, where it is shown that, subject to absolute summability of the autocovariance, the spectral distribution function is absolutely continuous with dF(ω) = f(ω) dω, and the representation (4.10) becomes the motivation for the property given below.

Property P4.1: The Spectral Density
If the autocovariance function, γ(h), of a stationary process satisfies

Σ_{h=−∞}^{∞} |γ(h)| < ∞,   (4.11)

2 Some identities may be helpful here: e^{iα} = cos(α) + i sin(α), so cos(α) = (e^{iα} + e^{−iα})/2 and sin(α) = (e^{iα} − e^{−iα})/2i.


then it has the representation

γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} f(ω) dω,   h = 0,±1,±2, . . .   (4.12)

as the inverse transform of the spectral density, which has the representation

f(ω) = Σ_{h=−∞}^{∞} γ(h) e^{−2πiωh},   −1/2 ≤ ω ≤ 1/2.   (4.13)

This spectral density is the analogue of the probability density function; the fact that γ(h) is non-negative definite ensures

f(ω) ≥ 0

for all ω (see Appendix C, Theorem C.3 for details). It follows immediately from (4.12) and (4.13) that

f(ω) = f(−ω)

and f(ω + 1) = f(ω),

verifying the spectral density is an even function of period one. Because of the evenness, we will typically only plot f(ω) for ω ≥ 0. In addition, putting h = 0 in (4.12) yields

γ(0) = var(xt) = ∫_{−1/2}^{1/2} f(ω) dω,

which expresses the total variance as the integrated spectral density over all of the frequencies. We show later on that a linear filter can isolate the variance in certain frequency intervals or bands.

Analogous to probability theory, γ(h) in (4.12) is the characteristic function of the spectral density f(ω) in (4.13). These facts should make it clear that, when the condition of Property P4.1 is satisfied, the autocovariance function γ(h) and the spectral density function f(ω) contain the same information. That information, however, is expressed in different ways. The autocovariance function expresses information in terms of lags, whereas the spectral density expresses the same information in terms of cycles. Some problems are easier to work with when considering lagged information and we would tend to handle those problems in the time domain. Nevertheless, other problems are easier to work with when considering periodic information and we would tend to handle those problems in the spectral domain.

We also mention, at this point, that we have been focusing on the frequency ω, expressed in cycles per point rather than the more common (in statistics)


alternative λ = 2πω that would give radians per point. Finally, the absolute summability condition, (4.11), is not satisfied by (4.5), the example that we have used to introduce the idea of a spectral representation. The condition, however, is satisfied for ARMA models.

We note that the autocovariance function, γ(h), in (4.12) and the spectral density, f(ω), in (4.13) are Fourier transform pairs. In general, we have the following definition.

Definition 4.1 For a general function {at; t = 0,±1,±2, . . .} satisfying the absolute summability condition

Σ_{t=−∞}^{∞} |at| < ∞,   (4.14)

we define a Fourier transform pair to be of the form

A(ω) = Σ_{t=−∞}^{∞} at e^{−2πiωt}   (4.15)

and

at = ∫_{−1/2}^{1/2} A(ω) e^{2πiωt} dω.   (4.16)

The use of (4.12) and (4.13) as Fourier transform pairs is fundamental in the study of stationary discrete time processes. Under the summability condition (4.11), the Fourier transform pair (4.12) and (4.13) will exist and this relation is unique. If f(ω) and g(ω) are two spectral densities for which

∫_{−1/2}^{1/2} f(ω) e^{2πiωh} dω = ∫_{−1/2}^{1/2} g(ω) e^{2πiωh} dω   (4.17)

for all h = 0,±1,±2, . . . , then

f(ω) = g(ω) (4.18)

almost everywhere.

It is illuminating to examine the spectral density for the series that we have

looked at in earlier discussions.

Example 4.4 White Noise Series

As a simple example, consider the theoretical power spectrum of a sequence of uncorrelated random variables, wt, with variance σ_w^2. A simulated set of data is displayed in the top of Figure 1.8. Because the autocovariance function was computed in Example 1.16 as γw(h) = σ_w^2 for h = 0, and zero, otherwise, it follows from (4.13) that

fw(ω) = σ_w^2


Figure 4.3 Theoretical spectra of white noise (top), smoothed white noise (middle), and a second-order autoregressive process (bottom).

for −1/2 ≤ ω ≤ 1/2 with the resulting equal power at all frequencies. This property is seen in the realization, which seems to contain all different frequencies in a roughly equal mix. In fact, the name white noise comes from the analogy to white light, which contains all frequencies in the color spectrum. Figure 4.3 shows a plot of the white noise spectrum for σ_w^2 = 1.

Example 4.5 A Simple Moving Average

A series that does not have an equal mix of frequencies is the smoothed white noise series shown in the bottom panel of Figure 1.8. Specifically, we construct the three-point moving average series, defined by

vt = (1/3)(wt−1 + wt + wt+1).

It is clear from the sample realization that the series has less of the higher or faster frequencies, and we calculate its power spectrum to verify this observation. We have previously computed the autocovariance of this


process in Example 1.17, obtaining

γv(h) = (σ_w^2/9)(3 − |h|) for |h| ≤ 2 and γv(h) = 0 for |h| > 2. Then, using (4.13) gives

fv(ω) = Σ_{h=−2}^{2} γv(h) e^{−2πiωh}
      = (σ_w^2/9)(e^{−4πiω} + e^{4πiω}) + (2σ_w^2/9)(e^{−2πiω} + e^{2πiω}) + 3σ_w^2/9
      = (σ_w^2/9)[3 + 4 cos(2πω) + 2 cos(4πω)].

Plotting the spectrum for σ_w^2 = 1, as in Figure 4.3, shows the lower frequencies near zero have greater power and the higher or faster frequencies, say, ω > .2, tend to have less power.

Example 4.6 A Second-Order Autoregressive Series

As a final example, we consider the spectrum of an AR(2) series of the form

xt − φ1xt−1 − φ2xt−2 = wt,

for the special case φ1 = 1 and φ2 = −.9. Recall Example 1.10 and Figure 1.9, which shows a sample realization of such a process for σw = 1. We note the data exhibit a strong periodic component that makes a cycle about every six points. First, computing the autocovariance function of the right side and equating it to the autocovariance on the left yields

γw(h) = E[(xt+h − φ1xt+h−1 − φ2xt+h−2)(xt − φ1xt−1 − φ2xt−2)]
      = [1 + φ1^2 + φ2^2]γx(h) + (φ1φ2 − φ1)[γx(h + 1) + γx(h − 1)] − φ2[γx(h + 2) + γx(h − 2)]
      = 2.81γx(h) − 1.90[γx(h + 1) + γx(h − 1)] + .90[γx(h + 2) + γx(h − 2)],

where we have substituted the values of φ1 = 1 and φ2 = −.9 in the equation. Now, substituting the spectral representation (4.12) for γx(h) in the above equation yields

γw(h) = ∫_{−1/2}^{1/2} [2.81 − 1.90(e^{2πiω} + e^{−2πiω}) + .90(e^{4πiω} + e^{−4πiω})] e^{2πiωh} fx(ω) dω
      = ∫_{−1/2}^{1/2} [2.81 − 3.80 cos(2πω) + 1.80 cos(4πω)] e^{2πiωh} fx(ω) dω.


If the spectrum of the white noise process is gw(ω), the uniqueness of the Fourier transform allows us to identify

gw(ω) = [2.81 − 3.80 cos(2πω) + 1.80 cos(4πω)] fx(ω).

But, as we have already seen, gw(ω) = σ_w^2, from which we deduce that

fx(ω) = σ_w^2 / [2.81 − 3.80 cos(2πω) + 1.80 cos(4πω)]

is the spectrum of the autoregressive series. Setting σw = 1, Figure 4.3 displays fx(ω) and shows a strong power component at about ω = .16 cycles per point, or a period of between six and seven points per cycle, and very little power at other frequencies. In this case, modifying the white noise series by applying the second-order AR operator has concentrated the power or variance of the resulting series in a very narrow frequency band.
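The closed forms obtained in Examples 4.4–4.6 can be plotted directly; the following sketch (with σ_w^2 = 1 and an arbitrary frequency grid) is one way to produce curves like those in Figure 4.3.

> w = seq(0, .5, length=500)                            # frequency grid on [0, 1/2]
> f.wn = rep(1, length(w))                              # white noise: f(w) = sigma_w^2 = 1
> f.ma = (3 + 4*cos(2*pi*w) + 2*cos(4*pi*w))/9          # three-point moving average
> f.ar = 1/(2.81 - 3.8*cos(2*pi*w) + 1.8*cos(4*pi*w))   # AR(2) with phi1 = 1, phi2 = -.9
> par(mfrow=c(3,1))
> plot(w, f.wn, type="l", xlab="frequency", ylab="power", main="White Noise")
> plot(w, f.ma, type="l", xlab="frequency", ylab="power", main="Smoothed White Noise")
> plot(w, f.ar, type="l", xlab="frequency", ylab="power", main="2nd-order AR")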

The above examples have been given primarily to motivate the use of the power spectrum for describing the theoretical variance fluctuations of a stationary time series. Indeed, the interpretation of the spectral density function as the variance of the time series over a given frequency band gives us the intuitive explanation for its physical meaning. The plot of the function f(ω) over the frequency argument ω can even be thought of as an analysis of variance, in which the columns or block effects are the frequencies, indexed by ω.

4.4 Periodogram and Discrete Fourier Transform

We are now ready to tie together the periodogram, which is the sample-based concept presented in §4.2, with the spectral density, which is the population-based concept of §4.3.

Definition 4.2 Given data x1, . . . , xn, we define the discrete Fourier transform (DFT) to be

d(ωj) = n^{−1/2} Σ_{t=1}^{n} xt e^{−2πiωjt}   (4.19)

for j = 0, 1, . . . , n − 1, where the frequencies ωj = j/n are called the Fourier or fundamental frequencies.

If n is a highly composite integer (i.e., it has many factors), the DFT can be computed by the fast Fourier transform (FFT) introduced in Cooley and Tukey (1965). Also, different packages scale the FFT differently, so it is a good idea to consult the documentation. R computes the DFT defined in (4.19) without the factor n^{−1/2}, but with an additional factor of e^{2πiωj} that


can be ignored because we will be interested in the squared modulus of the DFT. Sometimes it is helpful to exploit the inversion result for DFTs which shows the linear transformation is one-to-one. For the inverse DFT we have,

xt = n^{−1/2} Σ_{j=0}^{n−1} d(ωj) e^{2πiωjt}   (4.20)

for t = 1, . . . , n. The following example shows how to calculate the DFT and its inverse in R for the data set {1, 2, 3, 4}; note that R writes a complex number z = a + ib as a+bi.

> x = 1:4
> dft = fft(x)/sqrt(4)
> dft
[1] 5+0i -1+1i -1+0i -1-1i
> idft = fft(dft, inverse=T)/sqrt(4)
> idft
[1] 1+0i 2+0i 3+0i 4+0i

We now define the periodogram as the squared modulus3 of the DFT.

Definition 4.3 Given data x1, . . . , xn, we define the periodogram to be

I(ωj) = |d(ωj)|^2   (4.21)

for j = 0, 1, 2, . . . , n − 1.

Note that I(0) = n x̄^2, where x̄ is the sample mean. In addition, because Σ_{t=1}^{n} exp(−2πiωjt) = 0 for j ≠ 0,4 we can write the DFT as

d(ωj) = n^{−1/2} Σ_{t=1}^{n} (xt − x̄) e^{−2πiωjt}   (4.22)

for j ≠ 0. Thus, for j ≠ 0,

I(ωj) = |d(ωj)|^2 = n^{−1} Σ_{t=1}^{n} Σ_{s=1}^{n} (xt − x̄)(xs − x̄) e^{−2πiωj(t−s)}
      = n^{−1} Σ_{h=−(n−1)}^{n−1} Σ_{t=1}^{n−|h|} (x_{t+|h|} − x̄)(xt − x̄) e^{−2πiωjh}
      = Σ_{h=−(n−1)}^{n−1} γ̂(h) e^{−2πiωjh}   (4.23)

3 If z = a + ib is a complex number, then z̄ = a − ib, and |z|^2 = z z̄ = a^2 + b^2.
4 Note Σ_{t=1}^{n} z^t = z (1 − z^n)/(1 − z) for z ≠ 1.


where we have put h = t − s, with γ̂(h) as given in (1.36).

Recall, P(ωj) = (4/n)I(ωj) where P(ωj) is the scaled periodogram defined in (4.7). Henceforth we will work with I(ωj) instead of P(ωj). Note that, in view of (4.23), I(ωj) in (4.21) is the sample version of f(ωj) given in (4.13). That is, we may think of the periodogram, I(ωj), as the “sample spectral density” of xt.

It is sometimes useful to work with the real and imaginary parts of the DFT individually. To this end, we define the following transforms.

Definition 4.4 Given data x1, . . . , xn, we define the cosine transform

dc(ωj) = n^{−1/2} Σ_{t=1}^{n} xt cos(2πωjt)   (4.24)

and the sine transform

ds(ωj) = n^{−1/2} Σ_{t=1}^{n} xt sin(2πωjt)   (4.25)

where ωj = j/n for j = 0, 1, . . . , n − 1.

We note that d(ωj) = dc(ωj) − i ds(ωj) and hence

I(ωj) = dc^2(ωj) + ds^2(ωj).   (4.26)

We have also discussed the fact that spectral analysis can be thought of as an analysis of variance. The next example examines this notion.

Example 4.7 Spectral ANOVA

Let x1, . . . , xn be a sample of size n, where for ease, n is odd. Then, recalling Example 2.8 and the discussion around (4.8) and (4.9),

xt = a0 + Σ_{j=1}^{m} [aj cos(2πωjt) + bj sin(2πωjt)],   (4.27)

where m = (n − 1)/2, is exact for t = 1, . . . , n. In particular, using multiple regression formulas, we have a0 = x̄,

aj = (2/n) Σ_{t=1}^{n} xt cos(2πωjt) = (2/√n) dc(ωj)

bj = (2/n) Σ_{t=1}^{n} xt sin(2πωjt) = (2/√n) ds(ωj).

Hence, we may write

(xt − x̄) = (2/√n) Σ_{j=1}^{m} [dc(ωj) cos(2πωjt) + ds(ωj) sin(2πωjt)]


for t = 1, . . . , n. Squaring both sides and summing we have5

Σ_{t=1}^{n} (xt − x̄)^2 = 2 Σ_{j=1}^{m} [dc^2(ωj) + ds^2(ωj)] = 2 Σ_{j=1}^{m} I(ωj).

Thus, we have partitioned the sum of squares into harmonic components represented by frequency ωj with the periodogram, I(ωj), being the mean square regression. This leads to the ANOVA table:

Source   df     SS                        MS
ω1       2      2I(ω1)                    I(ω1)
ω2       2      2I(ω2)                    I(ω2)
...      ...    ...                       ...
ωm       2      2I(ωm)                    I(ωm)
Total    n − 1  Σ_{t=1}^{n} (xt − x̄)^2

This decomposition means that if the data contain some strong periodic components, then the periodogram values corresponding to those frequencies (or near those frequencies) will be large. On the other hand, the corresponding values of the periodogram will be small for periodic components not present in the data. The following is an R example to help explain this concept. We consider n = 5 observations given by x1 = 1, x2 = 2, x3 = 3, x4 = 2, x5 = 1. Note that the data complete one cycle, but not in a sinusoidal way. Thus, we should expect the ω1 = 1/5 component to be relatively large but not exhaustive, and the ω2 = 2/5 component to be small.

> x = c(1,2,3,2,1)
> t = 1:5
> c1 = cos(2*pi*t*1/5)
> s1 = sin(2*pi*t*1/5)
> c2 = cos(2*pi*t*2/5)
> s2 = sin(2*pi*t*2/5)
> creg = lm(x~c1+s1+c2+s2)
> anova(creg)   # partial output and combined ANOVA shown
                           #           ANOVA
            Df  Sum Sq     # Source     df  SS        MS
  c1         1  1.79443    #
  s1         1  0.94721    # freq=1/5    2  2.74164   1.37082
  c2         1  0.00557    #
  s2         1  0.05279    # freq=2/5    2  0.05836   0.02918
  Residuals  0  0.00000    #

5 Recall Σ_{t=1}^{n} cos^2(2πωjt) = Σ_{t=1}^{n} sin^2(2πωjt) = n/2 for j ≠ 0 or a multiple of n. Also Σ_{t=1}^{n} cos(2πωjt) sin(2πωkt) = 0 for any j and k.


> abs(fft(x))^2/5   # the periodogram (as a check)
[1] 16.2000 1.3708 0.02918 0.02918 1.3708
>     # I(0)   I(1/5)  I(2/5)  I(3/5)  I(4/5)

Note that x̄ = 1.8 so I(0) = 5 × 1.8^2 = 16.2. Also, as a check

I(1/5) = [SS(c1) + SS(s1)]/2 = (1.79443 + .94721)/2 = 1.3708,

I(2/5) = [SS(c2) + SS(s2)]/2 = (.00557 + .05279)/2 = .02918,

and I(j/5) = I(1 − j/5), for j = 3, 4. Finally, we note that the sum of squares associated with the residuals is zero, indicating an exact fit.

We are now ready to present some large sample properties of the periodogram. First, let µ be the mean of a stationary process xt with absolutely summable autocovariance function γ(h) and spectral density f(ω). We can use the same argument as in (4.23), replacing x̄ by µ in (4.22), to write

I(ωj) = n^{−1} Σ_{h=−(n−1)}^{n−1} Σ_{t=1}^{n−|h|} (x_{t+|h|} − µ)(xt − µ) e^{−2πiωjh}   (4.28)

where ωj is a non-zero fundamental frequency. Taking expectation in (4.28) we obtain

E[I(ωj)] = Σ_{h=−(n−1)}^{n−1} ((n − |h|)/n) γ(h) e^{−2πiωjh}.   (4.29)

For any given ω ≠ 0, choose a fundamental frequency ωj:n → ω as n → ∞,6 from which it follows by (4.29) that

E[I(ωj:n)] → f(ω) = Σ_{h=−∞}^{∞} γ(h) e^{−2πihω}   (4.30)

as n → ∞.7 In other words, under absolute summability of γ(h), the spectral density is the long-term average of the periodogram.

To examine the asymptotic distribution of the periodogram, we note that if xt is a normal time series, the sine and cosine transforms will also be jointly normal, because they are linear combinations of the jointly normal random variables x1, x2, . . . , xn. In that case, the assumption that the covariance function satisfies the condition

θ = Σ_{h=−∞}^{∞} |h| |γ(h)| < ∞   (4.31)

6 By this we mean ωj:n is a frequency of the form jn/n, where {jn} is a sequence of integers chosen so that jn/n → ω as n → ∞.
7 From Definition 4.3 we have I(0) = n x̄^2, so the analogous result for the case ω = 0 is E[I(0)] − nµ^2 = n var(x̄) → f(0) as n → ∞.


is enough to obtain simple large sample approximations for the variances and covariances. Using the same argument used to develop (4.29) we have

cov[dc(ωj), dc(ωk)] = n^{−1} Σ_{s=1}^{n} Σ_{t=1}^{n} γ(s − t) cos(2πωjs) cos(2πωkt),   (4.32)

cov[dc(ωj), ds(ωk)] = n^{−1} Σ_{s=1}^{n} Σ_{t=1}^{n} γ(s − t) cos(2πωjs) sin(2πωkt),   (4.33)

and

cov[ds(ωj), ds(ωk)] = n^{−1} Σ_{s=1}^{n} Σ_{t=1}^{n} γ(s − t) sin(2πωjs) sin(2πωkt),   (4.34)

where the variance terms are obtained by setting ωj = ωk in (4.32) and (4.34). In Appendix C, §C.2, we show the terms in (4.32)-(4.34) have interesting properties under assumption (4.31), namely, for ωj, ωk ≠ 0 or 1/2,

cov[dc(ωj), dc(ωk)] = f(ωj)/2 + εn for ωj = ωk, and εn for ωj ≠ ωk,   (4.35)

cov[ds(ωj), ds(ωk)] = f(ωj)/2 + εn for ωj = ωk, and εn for ωj ≠ ωk,   (4.36)

and

cov[dc(ωj), ds(ωk)] = εn,   (4.37)

where the error term εn in the approximations can be bounded,

|εn| ≤ θ/n, (4.38)

and θ is given by (4.31). If ωj = ωk = 0 or 1/2 in (4.35), the multiplier 1/2 disappears; note that ds(0) = ds(1/2) = 0, so (4.36) does not apply.

Example 4.8 Covariance of Sines and Cosines for an MA Process

For the three-point moving average series of Example 4.5, the theoretical spectrum is shown in Figure 4.3. For n = 256 points, the theoretical covariance matrix of the vector

d = (dc(ω26), ds(ω26), dc(ω27), ds(ω27))′

is

cov(d) =
    .3752   −.0009   −.0022   −.0010
   −.0009    .3777   −.0009    .0003
   −.0022   −.0009    .3667   −.0010
   −.0010    .0003   −.0010    .3692 .


The diagonal elements can be compared with the theoretical spectral values of .7548 for the spectrum at frequency ω26 = .102, and of .7378 for the spectrum at ω27 = .105. Hence, the cosine and sine transforms produce nearly uncorrelated variables with variances approximately equal to one half of the theoretical spectrum. For this particular case, the uniform bound is determined from θ = 8/9, yielding |ε256| ≤ .0035 for the bound on the approximation error.
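A rough Monte Carlo check of Example 4.8 is sketched below (not from the text; the seed and number of replications are arbitrary). The sample covariance of the simulated transforms should be close to the displayed matrix, up to simulation error.

> set.seed(4)                                   # arbitrary seed
> n = 256; nrep = 2000; t = 1:n                 # arbitrary number of replications
> w = c(26, 27)/n                               # the frequencies w_26 and w_27
> d = matrix(0, nrep, 4)
> for (i in 1:nrep) {
+   wn = rnorm(n + 2)
+   v = filter(wn, rep(1/3, 3), sides=2)[2:(n+1)]    # three-point moving average
+   d[i,] = c(sum(v*cos(2*pi*w[1]*t)), sum(v*sin(2*pi*w[1]*t)),
+             sum(v*cos(2*pi*w[2]*t)), sum(v*sin(2*pi*w[2]*t)))/sqrt(n)
+ }
> round(cov(d), 4)                              # compare with the theoretical matrix above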

If xt ∼ iid(0, σ^2), then it follows from (4.31)-(4.37) and the central limit theorem8 that

dc(ωj:n) ∼ AN(0, σ^2/2) and ds(ωj:n) ∼ AN(0, σ^2/2)   (4.39)

jointly and independently, and independent of dc(ωk:n) and ds(ωk:n) provided ωj:n → ω1 and ωk:n → ω2 where 0 < ω1 ≠ ω2 < 1/2. We note that in this case, f(ω) = σ^2. In view of (4.39), it follows immediately that as n → ∞,

2I(ωj:n)/σ^2 →d χ_2^2   and   2I(ωk:n)/σ^2 →d χ_2^2   (4.40)

with I(ωj:n) and I(ωk:n) being asymptotically independent, where χ_ν^2 denotes a chi-squared random variable with ν degrees of freedom.

Using the central limit theory of §C.2, it is fairly easy to extend the results

of the iid case to the case of a linear process.

Property P4.2: Distribution of the Periodogram Ordinates
If

xt = Σ_{j=−∞}^{∞} ψj wt−j,   Σ_{j=−∞}^{∞} |ψj| < ∞   (4.41)

where wt ∼ iid(0, σ_w^2), and (4.31) holds, then for any collection of m distinct frequencies ωj with ωj:n → ωj

2I(ωj:n)/f(ωj) →d iid χ_2^2   (4.42)

provided f(ωj) > 0, for j = 1, . . . , m.

This result is stated more precisely in Theorem C.7 of §C.3. Other approaches to large sample normality of the periodogram ordinates are in terms of cumulants, as in Brillinger (1981), or in terms of mixing conditions, such as in Rosenblatt (1956). We adopt the approach here used by Hannan (1970), Fuller (1995), and Brockwell and Davis (1991).
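A small simulation consistent with (4.40) and Property P4.2, for Gaussian white noise with σ^2 = 1 so that f(ω) = 1 (the seed, series length, and frequency index below are arbitrary):

> set.seed(5)                                   # arbitrary seed
> n = 128; nrep = 2000; j = 10                  # one fixed nonzero fundamental frequency
> stat = replicate(nrep, {
+   w = rnorm(n)                                # iid N(0,1), so f(omega) = 1
+   2 * Mod(fft(w - mean(w))[j+1])^2 / n        # 2 I(omega_j) / f(omega_j)
+ })
> c(mean(stat), var(stat))                      # near 2 and 4, the chi-squared(2) moments
> qqplot(qchisq(ppoints(nrep), df=2), stat)     # points fall roughly on a straight line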

8 If Yj ∼ iid(0, σ^2) and {aj} are constants for which Σ_{j=1}^{n} aj^2 / max_{1≤j≤n} aj^2 → ∞ as n → ∞, then Σ_{j=1}^{n} aj Yj ∼ AN(0, σ^2 Σ_{j=1}^{n} aj^2); the notation AN is explained in Definition A.5.


The distributional result (4.42) can be used to derive an approximate confidence interval for the spectrum in the usual way. Let χ_ν^2(α) denote the lower α probability tail for the chi-squared distribution with ν degrees of freedom; that is,

Pr{χ_ν^2 ≤ χ_ν^2(α)} = α.   (4.43)

Then, an approximate 100(1 − α)% confidence interval for the spectral density function would be of the form

2 I(ωj:n)/χ_2^2(1 − α/2) ≤ f(ω) ≤ 2 I(ωj:n)/χ_2^2(α/2).   (4.44)

Often, nonstationary trends are present that should be eliminated before computing the periodogram. Trends introduce extremely low frequency components in the periodogram that tend to obscure the appearance at higher frequencies. For this reason, it is usually conventional to center the data prior to a spectral analysis using either mean-adjusted data of the form xt − x̄ to eliminate the zero or d-c component or to use detrended data of the form xt − β1 − β2t to eliminate the term that will be considered a half cycle by the spectral analysis. Note that higher order polynomial regressions in t or nonparametric smoothing (linear filtering) could be used in cases where the trend is nonlinear.

As previously indicated, it is often convenient to calculate the DFTs, and hence the periodogram, using the fast Fourier transform algorithm. The FFT utilizes a number of redundancies in the calculation of the DFT when n is highly composite; that is, an integer with many factors of 2, 3, or 5, the best case being when n = 2^p is a power of 2. Details may be found in Cooley and Tukey (1965). To accommodate this property, we can pad the centered (or detrended) data of length n to the next highly composite integer n′ by adding zeros, i.e., setting x^c_{n+1} = x^c_{n+2} = · · · = x^c_{n′} = 0, where x^c_t denotes the centered data. This means that the fundamental frequency ordinates will be ω_j = j/n′ instead of j/n. We illustrate by considering the periodogram of the SOI and Recruitment series, as has been given in Figure 1.5 of Chapter 1. Recall that the series are monthly series and n = 453. To find n′ in R, use the command nextn(453) to see that n′ = 480 will be used in the spectral analyses by default (use help(spec.pgram) to see how to override this default).
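The padding step can also be done by hand; the following sketch (an illustration, not code from the text, assuming soi is the monthly series used in Example 4.9) shows the mechanics. R's internal scaling conventions differ slightly, so the point is the construction rather than an exact numerical match with spec.pgram.

> n = length(soi)                         # 453
> nprime = nextn(n)                       # 480
> soi.c = soi - mean(soi)                 # centered data
> soi.pad = c(soi.c, rep(0, nprime - n))  # zeros appended up to n'
> P = Mod(fft(soi.pad))^2/nprime          # raw periodogram ordinates at j/n'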

Example 4.9 Periodogram of SOI and Recruitment Series

Figure 4.4 shows the periodograms of each series. As previously indicated, the centered data have been padded to a series of length 480. We notice a narrow band peak at the obvious yearly cycle, ω = 1/12. In addition, there is a considerable amount of power in a wide band at the lower frequencies that is centered around the four-year cycle ω = 1/48, representing a possible El Nino effect. This wide band activity suggests that the possible El Nino cycle is irregular, but tends to be around four


[Figure 4.4 about here: raw periodograms of the soi (top) and rec (bottom) series; frequency on the horizontal axis, spectrum on the vertical axis; bandwidth = 0.000601.]

Figure 4.4 Periodogram of SOI and Recruitment, n = 453 (n′ = 480), showing common peaks at ω = 1/12 = .083 and ω = 1/48 = .021 cycles/month.

years on average. We will continue to address this problem as we move to more sophisticated analyses.

Noting χ²_2(.025) = .05 and χ²_2(.975) = 7.38, we can obtain approximate 95% confidence intervals for the frequencies of interest. For example, the periodogram of the SOI series is I_S(1/12) = 11.67 at the yearly cycle. An approximate 95% confidence interval for the spectrum f_S(1/12) is then

[2(11.67)/7.38, 2(11.67)/.05] = [3.16, 460.81],

which is too wide to be of much use. We do notice, however, that the lower value of 3.16 is higher than any other periodogram ordinate, so it is safe to say that this value is significant. On the other hand, an approximate 95% confidence interval for the spectrum at the four-year cycle, f_S(1/48), is

[2(.64)/7.38, 2(.64)/.05] = [.17, 25.47],

which again is extremely wide, and with which we are unable to establish


significance of the peak.

We now give the R commands that can be used to reproduce Figure 4.4. To calculate and graph the periodogram, we used the spec.pgram command in R. We have set log="no" because R will plot the periodogram on a log10 scale by default. Figure 4.4 displays a bandwidth and, by default, R tapers the data (which we override in the commands below). We will discuss bandwidth and tapering in the next section, so ignore these concepts for the time being.

> soi = scan("/mydata/soi.dat")
> rec = scan("/mydata/rec.dat")
> par(mfrow=c(2,1))
> soi.per = spec.pgram(soi, taper=0, log="no")
> abline(v=1/12, lty="dotted")
> abline(v=1/48, lty="dotted")
> rec.per = spec.pgram(rec, taper=0, log="no")
> abline(v=1/12, lty="dotted")
> abline(v=1/48, lty="dotted")

The confidence intervals for the SOI series at the yearly cycle, ω = 1/12 = 40/480, and the possible El Nino cycle of four years, ω = 1/48 = 10/480, can be computed in R as follows:

> soi.per$spec[40]       # soi pgram at freq 1/12 = 40/480
[1] 11.66677
> soi.per$spec[10]       # soi pgram at freq 1/48 = 10/480
[1] 0.6447554
> # -- conf intervals --   # returned value:
> U = qchisq(.025,2)       # 0.05063562
> L = qchisq(.975,2)       # 7.377759
> 2*soi.per$spec[10]/L     # 0.1747835
> 2*soi.per$spec[10]/U     # 25.46648
> 2*soi.per$spec[40]/L     # 3.162688
> 2*soi.per$spec[40]/U     # 460.813
> #-- replace soi with rec above to get recruit values

The example above makes it fairly clear that the periodogram, as an estimator, is susceptible to large uncertainties, and we need to find a way to reduce the variance. Not surprisingly, this result follows if we think about the periodogram, I(ω_j), as an estimator of the spectral density f(ω) and realize that it is the sum of squares of only two random variables for any sample size. The solution to this dilemma is suggested by the analogy with classical statistics where we look for independent random variables with the same variance and average the squares of these common variance observations. Independence and equality of variance do not hold in the time series case, but the covariance structure of the two adjacent estimators given in Example 4.8 suggests that for neighboring frequencies, these assumptions are approximately true.


4.5 Nonparametric Spectral Estimation

To continue the discussion that ended the previous section, we define a frequency band, B, of L ≪ n contiguous fundamental frequencies, centered around ω_j = j/n, that are close to the frequency of interest, ω, as

B = { ω : ω_j − m/n ≤ ω ≤ ω_j + m/n },        (4.45)

where

L = 2m + 1        (4.46)

is an odd number, chosen such that the spectral values in the interval B,

f(ω_j + k/n),   k = −m, . . . , 0, . . . , m,

are approximately equal to f(ω). This structure can be realized for large sample sizes, as shown formally in §C.2. Values of the spectrum in this band should be relatively constant, as well, for the smoothed spectra defined below to be good estimators.

Using the above band, we may now define an averaged or smoothed periodogram as the average of the periodogram values, say,

f̄(ω) = (1/L) Σ_{k=−m}^{m} I(ω_j + k/n),        (4.47)

as the average over the band B.

Under the assumption that the spectral density is fairly constant in the band B, and in view of (4.42), we can show that, under appropriate conditions,⁹ for large n, the periodograms in (4.47) are approximately distributed as independent f(ω)χ²_2/2 random variables, for 0 < ω < 1/2, as long as we keep L fairly small relative to n. This result is discussed formally in §C.2. Thus, under these conditions, Lf̄(ω) is the sum of L approximately independent f(ω)χ²_2/2 random variables. It follows that, for large n,

2Lf̄(ω)/f(ω)  ·∼  χ²_{2L}        (4.48)

where ·∼ means approximately distributed as.

In this scenario, it seems reasonable to call the length of the interval defined

by (4.45),

B_w = L/n        (4.49)

the bandwidth. Bandwidth, of course, refers to the width of the frequency band used in smoothing the periodogram. The concept of the bandwidth, however, becomes more complicated with the introduction of spectral estimators

⁹The conditions, which are sufficient, are that x_t is a linear process, as described in Property P4.2, with Σ_{j>0} √j |ψ_j| < ∞, and w_t has a finite fourth moment.


that smooth with unequal weights. Note (4.49) implies the degrees of freedom can be expressed as

2L = 2B_w n,        (4.50)

or twice the time-bandwidth product. The result (4.48) can be rearranged to obtain an approximate 100(1 − α)% confidence interval of the form

2Lf̄(ω) / χ²_{2L}(1 − α/2)  ≤  f(ω)  ≤  2Lf̄(ω) / χ²_{2L}(α/2)        (4.51)

for the true spectrum, f(ω).

Many times, the visual impact of a spectral density plot will be improved by plotting the logarithm of the spectrum instead of the spectrum.¹⁰ This phenomenon can occur when regions of the spectrum exist with peaks of interest much smaller than some of the main power components. For the log spectrum, we obtain an interval of the form

[ ln f̄(ω) + ln 2L − ln χ²_{2L}(1 − α/2),   ln f̄(ω) + ln 2L − ln χ²_{2L}(α/2) ].        (4.52)

We can also test hypotheses relating to the equality of spectra using the fact that the distributional result (4.48) implies that the ratio of spectra based on roughly independent samples will have an approximate F_{2L,2L} distribution. The independent estimators can either be from different frequency bands or from different series.
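For instance, a rough version of this comparison (a hypothetical illustration, not from the text) can be made with the smoothed estimates soi.ave computed in Example 4.10 below, where L = 9 and hence 2L = 18:

> ratio = soi.ave$spec[40]/soi.ave$spec[10]   # yearly band versus El Nino band
> pf(ratio, 18, 18, lower.tail=FALSE)         # approximate F(2L, 2L) tail probability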

If zeros are appended before computing the spectral estimators, we need to adjust the degrees of freedom, and an approximation is to replace 2L by 2Ln/n′. Hence, we define the adjusted degrees of freedom as

df = 2Ln/n′        (4.53)

and use it instead of 2L in the confidence intervals (4.51) and (4.52). For example, (4.51) becomes

df f̄(ω) / χ²_{df}(1 − α/2)  ≤  f(ω)  ≤  df f̄(ω) / χ²_{df}(α/2).        (4.54)

A number of assumptions are made in computing the approximate confidence intervals given above, which may not hold in practice. In such cases, it may be reasonable to employ resampling techniques such as one of the parametric bootstraps proposed by Hurvich and Zeger (1987) or a nonparametric local bootstrap proposed by Paparoditis and Politis (1999). To develop the bootstrap distributions, we assume that the contiguous DFTs in a frequency band of the form (4.45) all came from a time series with identical spectrum f(ω). This, in fact, is exactly the same assumption made in deriving the large-sample theory. We may then simply resample the L DFTs in the band, with

¹⁰The log transformation is the variance stabilizing transformation in this situation.


[Figure 4.5 about here: smoothed periodograms of the soi (top) and rec (bottom) series; frequency on the horizontal axis, spectrum on the vertical axis; bandwidth = 0.00541.]

Figure 4.5 The averaged periodogram of the SOI and Recruitment series, n = 453, n′ = 480, L = 9, df = 17, showing common peaks at the four year period, ω = 1/48 = .021 cycles/month, the yearly period, ω = 1/12 = .083 cycles/month, and some of its harmonics, ω = k/12 for k = 2, 3.

replacement, calculating a spectral estimate from each bootstrap sample. The sampling distribution of the bootstrap estimators approximates the distribution of the nonparametric spectral estimator. For further details, including the theoretical properties of such estimators, see Paparoditis and Politis (1999).
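A stripped-down version of the resampling idea can be sketched as follows (this is only a cartoon of the Paparoditis and Politis procedure: it resamples periodogram ordinates rather than the DFTs themselves, and it assumes soi.per from Example 4.9 is available):

> j = 40; m = 4                              # band centered at the yearly cycle, L = 2m+1 = 9
> band = soi.per$spec[(j-m):(j+m)]           # periodogram ordinates in the band
> fboot = replicate(500, mean(sample(band, replace=TRUE)))
> quantile(fboot, c(.025, .975))             # rough bootstrap interval for f(1/12)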

Before proceeding further, we pause to consider computing the average periodograms for the SOI and Recruitment series, as shown in Figure 4.5.

Example 4.10 Averaged Periodogram of SOI and Recruitment Series

Generally, it is a good idea to try several bandwidths that seem to be compatible with the general overall shape of the spectrum, as suggested by the periodogram. The SOI and Recruitment series periodograms,


previously computed in Figure 4.4, suggest the power in the lower El Nino frequency needs smoothing to identify the predominant overall period. Trying values of L leads to the choice L = 9 as a reasonable value, and the result is displayed in Figure 4.5. In our notation, the bandwidth in this case is B_w = 9/480 = .01875 cycles per month for the spectral estimator. This bandwidth means we are assuming a relatively constant spectrum over about .01875/.5 = 3.75% of the entire frequency interval (0, 1/2). The bandwidth reported in R is taken from Bloomfield (2000), and in the current case amounts to dividing (4.49) by √12. An excellent discussion of the concept of bandwidth may be found in Percival and Walden (1993, §6.7). To obtain the bandwidth, B_w = .01875, from the one reported by R in Figure 4.5, we can multiply .00541 by √12.

The smoothed spectra shown in Figure 4.5 provide a sensible compromise between the noisy version, shown in Figure 4.4, and a more heavily smoothed spectrum, which might lose some of the peaks. An undesirable effect of averaging can be noticed at the yearly cycle, ω = 1/12, where the narrow band peaks that appeared in the periodograms in Figure 4.4 have been flattened and spread out to nearby frequencies. We also notice, and have marked, the appearance of harmonics of the yearly cycle, that is, frequencies of the form ω = k/12 for k = 1, 2, . . . . Harmonics typically occur when a periodic component is present, but not in a sinusoidal fashion.

Figure 4.5 can be reproduced in R using the following commands. The basic call is to the function spec.pgram. To compute averaged periodograms, use the Daniell kernel, and specify m, where L = 2m + 1 (L = 9 and m = 4 in this example). We will explain the kernel concept later in this section, specifically just prior to Example 4.11.

> par(mfrow=c(2,1))
> k = kernel("daniell", 4)
> soi.ave = spec.pgram(soi, k, taper=0, log="no")
> abline(v=1/12, lty="dotted")
> abline(v=2/12, lty="dotted")
> abline(v=3/12, lty="dotted")
> abline(v=1/48, lty="dotted")
> #-- Repeat 5 lines above using rec in place of soi
> soi.ave$bandwidth            # reported bandwidth
[1] 0.005412659
> soi.ave$bandwidth*sqrt(12)   # Bw
[1] 0.01875

The adjusted degrees of freedom are df = 2(9)(453)/480 ≈ 17. We can use this value for the 95% confidence intervals, with χ²_{df}(.025) = 7.56 and χ²_{df}(.975) = 30.17. Substituting into (4.54) gives the intervals in Table 4.1 for the two frequency bands identified as having the maximum


Table 4.1 Confidence Intervals for the Spectra of the SOI and Recruitment Series

Series            ω      Period    Power   Lower   Upper
SOI               1/48   4 years     .59     .33    1.34
                  1/12   1 year     1.43     .80    3.21
Recruits (×10³)   1/48   4 years    7.91    4.45   17.78
                  1/12   1 year     2.63    1.48    5.92

power. To examine the two peak power possibilities, we may look at the 95% confidence intervals and see whether the lower limits are substantially larger than adjacent baseline spectral levels. For example, the El Nino frequency of 48 months has lower limits that exceed the values the spectrum would have if there were simply a smooth underlying spectral function without the peaks. The relative distribution of power over frequencies is different, with the SOI index having less power at the lower frequency, relative to the seasonal periods, and the recruit series having relatively more power at the lower or El Nino frequency.

The entries in Table 4.1 for SOI can be obtained in R as follows:

> df = soi.ave$df           # df = 16.9875 (returned values)
> U = qchisq(.025,df)       # U = 7.555916
> L = qchisq(.975,df)       # L = 30.17425
> soi.ave$spec[10]          # 0.5942431
> soi.ave$spec[40]          # 1.428959
> # -- intervals --
> df*soi.ave$spec[10]/L     # 0.334547
> df*soi.ave$spec[10]/U     # 1.336000
> df*soi.ave$spec[40]/L     # 0.8044755
> df*soi.ave$spec[40]/U     # 3.212641
> #-- repeat above commands with soi replaced by rec

Finally, Figure 4.6 shows the averaged periodograms in Figure 4.5 plotted on a log10 scale. This is the default plot in R, and these graphs can be obtained by removing the statement log="no" in the spec.pgram call. Notice that the default plot also shows a generic confidence interval of the form (4.52) (with ln replaced by log10) in the upper right-hand corner. To use it, imagine placing the tick mark on the averaged periodogram ordinate of interest; the resulting bar then constitutes an approximate 95% confidence interval for the spectrum at that frequency. Of course, actual intervals may be computed as was done in this example. We note that displaying the estimates on a log scale tends to emphasize the harmonic components.

This example points out the necessity for having some relatively systematic procedure for deciding whether peaks are significant. The question of deciding


[Figure 4.6 about here: smoothed periodograms of the soi (top) and rec (bottom) series plotted on a log10 scale; bandwidth = 0.00541.]

Figure 4.6 Figure 4.5 with the average periodogram ordinates plotted on a log10 scale. The display in the upper right-hand corner represents a generic 95% confidence interval.

whether a single peak is significant usually rests on establishing what we might think of as a baseline level for the spectrum, defined rather loosely as the shape that one would expect to see if no spectral peaks were present. This profile can usually be guessed by looking at the overall shape of the spectrum that includes the peaks; usually, a kind of baseline level will be apparent, with the peaks seeming to emerge from this baseline level. If the lower confidence limit for the spectral value is still greater than the baseline level at some predetermined level of significance, we may claim that frequency value as a statistically significant peak. To maintain an α that is consistent with our stated indifference to the upper limits, we might use a one-sided confidence interval.

An important aspect of interpreting the significance of confidence intervals and tests involving spectra is that typically, more than one frequency will be of interest, so that we will potentially be interested in simultaneous statements


about a whole collection of frequencies. For example, it would be unfair to claim in Table 4.1 the two frequencies of interest as being statistically significant and all other potential candidates as nonsignificant at the overall level of α = .05. In this case, we follow the usual statistical approach, noting that if K statements S₁, S₂, . . . , S_K are made at significance level α, i.e., P{S_k} = 1 − α, then the overall probability that all statements are true satisfies the Bonferroni inequality

P{all S_k true} ≥ 1 − Kα.        (4.55)

For this reason, it is desirable to set the significance level for testing each frequency at α/K if there are K potential frequencies of interest. If, a priori, potentially K = 10 frequencies are of interest, setting α = .01 would give an overall significance level bound of .10.

The use of the confidence intervals and the necessity for smoothing requires that we make a decision about the bandwidth B_w over which the spectrum will be essentially constant. Taking too broad a band will tend to smooth out valid peaks in the data when the constant variance assumption is not met over the band. Taking too narrow a band will lead to confidence intervals so wide that peaks are no longer statistically significant. Thus, we note that there is a conflict here between variance properties, or bandwidth stability, which can be improved by increasing B_w, and resolution, which can be improved by decreasing B_w. A common approach is to try a number of different bandwidths and to look qualitatively at the spectral estimators for each case.

To address the problem of resolution, it should be evident that the flattening of the peaks in Figures 4.5 and 4.6 was due to the fact that simple averaging was used in computing f̄(ω) defined in (4.47). There is no particular reason to use simple averaging, and we might improve the estimator by employing a weighted average, say

f̂(ω) = Σ_{k=−m}^{m} h_k I(ω_j + k/n),        (4.56)

using the same definitions as in (4.47) but where now the weights satisfy

h_{−k} = h_k > 0 for all k   and   Σ_{k=−m}^{m} h_k = 1.

In particular, it seems reasonable that the resolution of the estimator will improve if we use weights that decrease as distance from the center weight h₀ increases; we will return to this idea shortly. To obtain the averaged periodogram, f̄(ω), in (4.56), set h_k = L⁻¹, for all k, where L = 2m + 1. The asymptotic theory established for f̄(ω) still holds for f̂(ω) provided that the weights satisfy the additional condition that if m → ∞ as n → ∞ but m/n → 0, then

Σ_{k=−m}^{m} h_k² → 0.


Under these conditions, as n → ∞,

(i) E( f̂(ω) ) → f(ω)

(ii) ( Σ_{k=−m}^{m} h_k² )⁻¹ cov( f̂(ω), f̂(λ) ) → f²(ω)   for ω = λ ≠ 0, 1/2.

In (ii), replace f²(ω) by 0 if ω ≠ λ and by 2f²(ω) if ω = λ = 0 or 1/2.

We have already seen these results in the case of f̄(ω), where the weights are constant, h_k = L⁻¹, in which case Σ_{k=−m}^{m} h_k² = L⁻¹. The distributional properties of (4.56) are more difficult now because f̂(ω) is a weighted linear combination of asymptotically independent χ² random variables. An approximation that seems to work well is to replace L by ( Σ_{k=−m}^{m} h_k² )⁻¹. That is, define

L_h = ( Σ_{k=−m}^{m} h_k² )⁻¹        (4.57)

and use the approximation¹¹

2L_h f̂(ω)/f(ω)  ·∼  χ²_{2L_h}.        (4.58)

In analogy to (4.49), we will define the bandwidth in this case to be

B_w = L_h/n.        (4.59)

Using the approximation (4.58) we obtain an approximate 100(1 − α)% confidence interval of the form

2L_h f̂(ω) / χ²_{2L_h}(1 − α/2)  ≤  f(ω)  ≤  2L_h f̂(ω) / χ²_{2L_h}(α/2)        (4.60)

for the true spectrum, f(ω). If the data are padded to n′, then replace 2L_h in (4.60) with df = 2L_h n/n′ as in (4.53).

An easy way to generate the weights in R is by repeated use of the Daniell kernel. For example, with m = 1 and L = 2m + 1 = 3, the Daniell kernel has weights {h_k} = {1/3, 1/3, 1/3}; applying this kernel to a sequence of numbers, {u_t}, produces

û_t = (1/3)u_{t−1} + (1/3)u_t + (1/3)u_{t+1}.

We can apply the same kernel again to the û_t,

û̂_t = (1/3)û_{t−1} + (1/3)û_t + (1/3)û_{t+1},

¹¹The approximation proceeds as follows: If f̂ ·∼ cχ²_ν, where c is a constant, then Ef̂ ≈ cν and var f̂ ≈ f² Σ_k h_k² ≈ c²2ν. Solving, c ≈ f Σ_k h_k²/2 = f/(2L_h) and ν ≈ 2( Σ_k h_k² )⁻¹ = 2L_h.


which simplifies to

û̂_t = (1/9)u_{t−2} + (2/9)u_{t−1} + (3/9)u_t + (2/9)u_{t+1} + (1/9)u_{t+2}.

The modified Daniell kernel puts half weights at the end points, so with m = 1 the weights are {h_k} = {1/4, 2/4, 1/4} and

û_t = (1/4)u_{t−1} + (1/2)u_t + (1/4)u_{t+1}.

Applying the same kernel again yields

û̂_t = (1/16)u_{t−2} + (4/16)u_{t−1} + (6/16)u_t + (4/16)u_{t+1} + (1/16)u_{t+2}.

These coefficients can be obtained in R by issuing the kernel command. For example, kernel("modified.daniell", c(1,1)) would produce the coefficients of the last example. It is also possible to use different values of m, e.g., try kernel("modified.daniell", c(1,2)) or kernel("daniell", c(1,2)). The other kernels that are currently available in R are the Dirichlet kernel and the Fejer kernel, which we will discuss shortly.
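As a quick check (an illustration, not part of the original text), that kernel call returns the coefficients {1/16, 4/16, 6/16, 4/16, 1/16} of the twice-applied modified Daniell kernel:

> k = kernel("modified.daniell", c(1,1))
> k$coef        # 0.3750 0.2500 0.0625, i.e., 6/16, 4/16, 1/16 for lags 0, 1, 2
> plot(k)       # display the smoothing weights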

Example 4.11 Smoothed Periodogram of the SOI and Recruitment Series

In this example, we estimate the spectra of the SOI and Recruitment series using the smoothed periodogram estimate in (4.56). We used a modified Daniell kernel twice, with m = 3 both times. This yields L_h = 1/Σ_{k=−m}^{m} h_k² = 9.232, which is close to the value of L = 9 used in Example 4.10. In this case, the bandwidth is B_w = 9.232/480 = .019 and the modified degrees of freedom is df = 2L_h 453/480 = 17.43. The weights, h_k, can be obtained in R as follows:

> kernel("modified.daniell", c(3,3))coef[-6] = 0.006944 # = coef[ 6]coef[-5] = 0.027778 # = coef[ 5]coef[-4] = 0.055556 # = coef[ 4]coef[-3] = 0.083333 # = coef[ 3]coef[-2] = 0.111111 # = coef[ 2]coef[-1] = 0.138889 # = coef[ 1]coef[ 0] = 0.152778

The resulting spectral estimates can be viewed in Figure 4.7 and we noticethat the estimates more appealing than those in Figure 4.5. Figure 4.7was generated in R as follows; we also show how to obtain df and Bw.

> par(mfrow=c(2,1))> k = kernel("modified.daniell", c(3,3))> soi.smo = spec.pgram(soi, k, taper=0, log="no")


[Figure 4.7 about here: smoothed periodograms of the soi (top) and rec (bottom) series; frequency on the horizontal axis, spectrum on the vertical axis; bandwidth = 0.00528.]

Figure 4.7 Smoothed spectral estimates of the SOI and Recruitment series; see Example 4.11 for details.

> abline(v=1/12, lty="dotted")
> abline(v=1/48, lty="dotted")
> #-- Repeat above 3 lines with rec replacing soi
> df = soi.smo$df             # df = 17.42618
> Lh = 1/sum(k[-k$m:k$m]^2)   # Lh = 9.232413
> Bw = Lh/480                 # Bw = 0.01923419

The bandwidth reported by R is .00528, which is approximately B_w/√12; type bandwidth.kernel to see how R computes bandwidth. Reissuing the spec.pgram commands with log="no" removed will result in a figure similar to Figure 4.6. Finally, we mention that R uses the modified Daniell kernel by default. For example, an easier way to obtain soi.smo is to issue the command:

> soi.smo = spectrum(soi, spans=c(7,7), taper=0)

Notice that spans is a vector of odd integers, given in terms of L = 2m + 1 instead of m. These values give the widths of the modified Daniell smoothers to be used to smooth the periodogram.


We are now ready to introduce the concept of tapering; this will lead us to the notion of a spectral window. For example, suppose x_t is a mean-zero, stationary process with spectral density f_x(ω). If we replace the original series by the tapered series

y_t = h_t x_t,        (4.61)

for t = 1, 2, . . . , n, and use the modified DFT

d_y(ω_j) = n^{−1/2} Σ_{t=1}^{n} h_t x_t e^{−2πiω_j t},        (4.62)

and let I_y(ω_j) = |d_y(ω_j)|², we obtain (see Problem 4.15)

E[I_y(ω_j)] = ∫_{−1/2}^{1/2} W_n(ω_j − ω) f_x(ω) dω        (4.63)

where

W_n(ω) = |H_n(ω)|²        (4.64)

and

H_n(ω) = n^{−1/2} Σ_{t=1}^{n} h_t e^{−2πiωt}.        (4.65)

The value W_n(ω) is called a spectral window because, in view of (4.63), it is determining which part of the spectral density f_x(ω) is being "seen" by the estimator I_y(ω_j) on average. In the case that h_t = 1 for all t, I_y(ω_j) = I_x(ω_j) is simply the periodogram of the data and the window is

W_n(ω) = sin²(nπω) / (n sin²(πω))        (4.66)

with W_n(0) = n, which is known as the Fejer or modified Bartlett kernel. If we consider the averaged periodogram in (4.47), namely

f̄_x(ω) = (1/L) Σ_{k=−m}^{m} I_x(ω_j + k/n),

the window, W_n(ω), in (4.63) will take the form

W_n(ω) = (1/nL) Σ_{k=−m}^{m} sin²[nπ(ω + k/n)] / sin²[π(ω + k/n)].        (4.67)

Tapers generally have a shape that enhances the center of the data relative to the extremities, such as a cosine bell of the form

h_t = .5[ 1 + cos( 2π(t − t̄)/n ) ],        (4.68)


[Figure 4.8 about here: four panels showing the smoothed Fejer window and its logged version (top row) and the cosine taper window and its logged version (bottom row), each plotted against frequency.]

Figure 4.8 Averaged Fejer window (top row) and the corresponding cosine taper window (bottom row) for L = 9, n = 480.

where t̄ = (n + 1)/2, favored by Blackman and Tukey (1959). In Figure 4.8, we have plotted the shapes of two windows, W_n(ω), for n = 480 and L = 9, when (i) h_t ≡ 1, in which case (4.67) applies, and (ii) h_t is the cosine taper in (4.68). In both cases the predicted bandwidth should be B_w = 9/480 = .01875 cycles per point, which corresponds to the "width" of the windows shown in Figure 4.8. Both windows produce an integrated average spectrum over this band, but the untapered window in the top panels shows considerable ripples over the band and outside the band. The ripples outside the band are called sidelobes and tend to introduce frequencies from outside the interval that may contaminate the desired spectral estimate within the band. For example, a large dynamic range for the values in the spectrum introduces spectra in contiguous frequency intervals several orders of magnitude greater than the value in the interval of interest. This effect is sometimes called leakage. Finally,


[Figure 4.9 about here: three smoothed spectral estimates of the SOI series on a log10 scale, labeled No Taper, 10% Taper, and 50% Taper; bandwidth = 0.00528 in each panel.]

Figure 4.9 Smoothed spectral estimates of the SOI (on a log10 scale) without tapering (top), with 10% tapering (middle), and with 50% or complete tapering (bottom); see Example 4.12 for details.

the logged values in Figure 4.8 emphasize the suppression of the sidelobes in the Fejer kernel when a cosine taper is used.
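To make the taper in (4.68) concrete, here is a minimal sketch, not from the text, that builds the full cosine bell by hand and applies it to the centered SOI series (assuming soi is loaded as in Example 4.9). Unlike the taper argument of spectrum, this sketch does not rescale the estimate for the power removed by the taper, so only the shape of the result should be compared.

> n = length(soi); t = 1:n
> h = .5*(1 + cos(2*pi*(t - (n+1)/2)/n))    # cosine bell (4.68) with t.bar = (n+1)/2
> soi.tap = h*(soi - mean(soi))             # fully (50%) tapered, centered series
> spec.pgram(soi.tap, taper=0, log="no")    # compare the shape with Figure 4.4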

Example 4.12 The Effect of Tapering the SOI Series

In this example we examine the effect of various tapers on the estimate of the spectrum of the SOI series. The results for the Recruitment series are similar. Figure 4.9 shows three spectral estimates plotted on a log10 scale along with the corresponding approximate 95% confidence intervals in the upper right. The degree of smoothing here is the same as in Example 4.11. The top of Figure 4.9 shows the estimate without any tapering and hence it is the same as the estimated spectrum displayed in the top of Figure 4.7. The middle panel in Figure 4.9 shows the effect of


10% tapering (the R default), which means that the cosine taper is being applied only to the ends of the series, 10% on each side. The bottom panel shows the results with 50% tapering; that is, (4.68) is being applied to the entire set of data.

The three spectral estimates are qualitatively similar, but note that in the fully tapered case, the peak El Nino cycle is at the 42 month (3.5 year) cycle instead of the 48 month (4 year) cycle. Also, notice that the confidence interval bands are increasing as the tapering increases. This occurrence is due to the fact that by tapering we are decreasing the amount of information, and hence the degrees of freedom; details, which are similar to the ideas discussed in (4.57)–(4.58), may be found in Bloomfield (2000, §9.5).

The following R session was used to generate Figure 4.9:

> par(mfrow=c(3,1))
> spectrum(soi, spans=c(7,7), taper=0, main="No Taper")
> abline(v=1/12, lty="dashed")
> abline(v=1/48, lty="dashed")
> spectrum(soi, spans=c(7,7), main="10% Taper")
> abline(v=1/12, lty="dashed")
> abline(v=1/48, lty="dashed")
> spectrum(soi, spans=c(7,7), taper=.5, main="50% Taper")
> abline(v=1/12, lty="dashed")
> abline(v=1/48, lty="dashed")

Example 4.13 Spectra of P and S Components for Earthquake and Explosion

Figure 4.10 shows the spectra computed separately from the two phases of the earthquake and explosion in Figure 1.7 of Chapter 1. In all cases we used a modified Daniell smoother with L = 21 being passed twice, and with 10% tapering. This leads to approximately 54 degrees of freedom. Because the sampling rate is 40 points per second, the folding frequency is 20 cycles per second or 20 Hertz (Hz). The highest frequency shown in the plots is .25 cycles per point or 10 Hz because there is no signal activity at frequencies beyond 10 Hz. A fundamental problem in the analysis of seismic data is discriminating between earthquakes and explosions using the kind of instruments that might be used in monitoring a nuclear test ban treaty. If we plot an ensemble of earthquakes and explosions comparable to Figure 1.7, some gross features appear that may lead to discrimination. The most common differences that we look for are subtle differences between the spectra of the two classes of events. In this case, note the strong frequency components of the P and S components of the explosion are close to the frequency .10 cycles per point or 1 Hz. On the other hand, the spectral content of the earthquakes tends to


occur along a broader frequency band and at lower frequencies for both components. Often, we assume that the ratio of P to S power is in different proportions at different frequencies, and this distinction can form a basis for discriminating between the two classes. In §7.7, we test formally for discrimination using a random effects analysis of variance approach.

Figure 4.10 was generated in R as follows:

> x = matrix(scan("/mydata/eq5exp6.dat"), ncol=2)
> eqP = x[1:1024, 1]; eqS = x[1025:2048, 1]
> exP = x[1:1024, 2]; exS = x[1025:2048, 2]
> par(mfrow=c(2,2))
> eqPs = spectrum(eqP, spans=c(21,21),
+    log="no", xlim=c(0,.25), ylim=c(0,.04))
> eqSs = spectrum(eqS, spans=c(21,21),
+    log="no", xlim=c(0,.25), ylim=c(0,.4))
> exPs = spectrum(exP, spans=c(21,21),
+    log="no", xlim=c(0,.25), ylim=c(0,.04))
> exSs = spectrum(exS, spans=c(21,21),
+    log="no", xlim=c(0,.25), ylim=c(0,.4))
> exSs$df
[1] 53.87862

We close this section with a brief discussion of lag window estimators. First, consider the periodogram, I(ω_j), which was shown in (4.23) to be of the form

I(ω_j) = Σ_{|h|<n} γ̂(h) e^{−2πiω_j h}.

Thus, (4.56) can be written as

f̂(ω) = Σ_{|k|≤m} h_k I(ω_j + k/n)
      = Σ_{|k|≤m} h_k Σ_{|h|<n} γ̂(h) e^{−2πi(ω_j + k/n)h}
      = Σ_{|h|<n} g(h/n) γ̂(h) e^{−2πiω_j h},        (4.69)

where g(h/n) = Σ_{|k|≤m} h_k exp(−2πikh/n). Equation (4.69) suggests estimators of the form

f̃(ω) = Σ_{|h|≤r} w(h/r) γ̂(h) e^{−2πiωh}        (4.70)

where w(·) is a weight function, called the lag window, that satisfies


[Figure 4.10 about here: smoothed periodograms of the earthquake P and S components (eqP, eqS) and the explosion P and S components (exP, exS); frequency on the horizontal axis (0 to .25 cycles per point), spectrum on the vertical axis; bandwidth = 0.008 in each panel.]

Figure 4.10 Spectral analysis of P and S components of an earthquake and an explosion, n = 1024. Each estimate is based on a modified Daniell smoother with L = 21 being passed twice, and with 10% tapering. This leads to approximately 54 degrees of freedom. Multiply frequency by 40 to convert to Hertz (cycles/second).

(i) w(0) = 1

(ii) |w(x)| ≤ 1 and w(x) = 0 for |x| > 1,

(iii) w(x) = w(−x).

Note that if w(x) = 1 for |x| < 1 and r = n, then f̃(ω_j) = I(ω_j), the periodogram. This result indicates the problem with the periodogram as an estimator of the spectral density is that it gives too much weight to the values of γ̂(h) when h is large, and hence is unreliable [e.g., there is only one pair of


observations used in the estimate γ̂(n − 1), and so on]. The smoothing window is defined to be

W(ω) = Σ_{h=−r}^{r} w(h/r) e^{−2πiωh},        (4.71)

and it determines which part of the periodogram will be used to form the estimate of f(ω). The asymptotic theory for f̂(ω) holds for f̃(ω) under the same conditions and provided r → ∞ as n → ∞ but with r/n → 0. We have

E{f̃(ω)} → f(ω);        (4.72)

(n/r) cov( f̃(ω), f̃(λ) ) → f²(ω) ∫_{−1}^{1} w²(x) dx,   ω = λ ≠ 0, 1/2.        (4.73)

In (4.73), replace f²(ω) by 0 if ω ≠ λ and by 2f²(ω) if ω = λ = 0 or 1/2.

Many authors have developed various windows and Brillinger (2001, Ch 3) and Brockwell and Davis (1991, Ch 10) are good sources of detailed information on this topic. We mention a few.

The rectangular lag window, which gives uniform weight in (4.70),

w(x) = 1,   |x| ≤ 1,

corresponds to the Dirichlet smoothing window given by

W(ω) = sin[(2r + 1)πω] / sin(πω).        (4.74)

This smoothing window takes on negative values, which may lead to estimates of the spectral density that are negative at various frequencies. Using (4.73) in this case, for large n we have

var{f̃(ω)} ≈ (2r/n) f²(ω).

The Parzen lag window is defined to be

w(x) = 1 − 6x² + 6|x|³   for |x| < 1/2,
       2(1 − |x|)³        for 1/2 ≤ |x| ≤ 1,
       0                  otherwise.

This leads to an approximate smoothing window of

W(ω) = (6/πr³) sin⁴(rω/4) / sin⁴(ω/2).

For large n, the variance of the estimator is approximately

var{f̃(ω)} ≈ .539 r f²(ω)/n.


The Tukey-Hanning lag window has the form

w(x) = (1/2)(1 + cos(πx)),   |x| ≤ 1,

which leads to the smoothing window

W(ω) = (1/4)D_r(2πω − π/r) + (1/2)D_r(2πω) + (1/4)D_r(2πω + π/r)

where D_r(ω) is the Dirichlet kernel in (4.74). The approximate large sample variance of the estimator is

var{f̃(ω)} ≈ (3r/4n) f²(ω).

The triangular lag window, also known as the Bartlett or Fejer window, given by

w(x) = 1 − |x|,   |x| ≤ 1,

leads to the Fejer smoothing window:

W(ω) = sin²(πrω) / (r sin²(πω)).

In this case, (4.73) yields

var{f̃(ω)} ≈ (2r/3n) f²(ω).

The idealized rectangular smoothing window, also called the Daniell window, is given by

W(ω) = r   for |ω| ≤ 1/2r,   and 0 otherwise,

and leads to the sinc lag window, namely

w(x) = sin(πx)/(πx),   |x| ≤ 1.

From (4.73) we have

var{f̃(ω)} ≈ (r/n) f²(ω).

For lag window estimators, the width of the rectangular window that leads to the same asymptotic variance as a given lag window estimator is sometimes called the bandwidth. For example, the bandwidth of the rectangular window is b_r = 1/r and the asymptotic variance is (1/(n b_r)) f². The asymptotic variance of the triangular window is (2r/3n) f², so setting (1/(n b_r)) f² = (2r/3n) f² and solving, we get b_r = 3/(2r) as the corresponding bandwidth.


4.6 Multiple Series and Cross-Spectra

The notion of analyzing frequency fluctuations using classical statistical ideas extends to the case in which there are several jointly stationary series, for example, x_t and y_t. In this case, we can introduce the idea of a correlation indexed by frequency, called the coherence. The results in Appendix C, §C.2, imply the covariance function

γ_xy(h) = E[(x_{t+h} − µ_x)(y_t − µ_y)]

has the representation

γ_xy(h) = ∫_{−1/2}^{1/2} f_xy(ω) e^{2πiωh} dω,   h = 0, ±1, ±2, ...,        (4.75)

where the cross-spectrum is defined as the Fourier transform

f_xy(ω) = Σ_{h=−∞}^{∞} γ_xy(h) e^{−2πiωh},   −1/2 ≤ ω ≤ 1/2,        (4.76)

assuming that the cross-covariance function is absolutely summable, as was the case for the autocovariance. The cross-spectrum is generally a complex-valued function, and it is often written as¹²

f_xy(ω) = c_xy(ω) − i q_xy(ω),        (4.77)

where

c_xy(ω) = Σ_{h=−∞}^{∞} γ_xy(h) cos(2πωh)        (4.78)

and

q_xy(ω) = Σ_{h=−∞}^{∞} γ_xy(h) sin(2πωh)        (4.79)

are defined as the cospectrum and quadspectrum, respectively. Because of the relationship γ_yx(h) = γ_xy(−h), it follows, by substituting into (4.76) and rearranging, that

f_yx(ω) = f̄_xy(ω).        (4.80)

This result, in turn, implies that the cospectrum and quadspectrum satisfy

c_yx(ω) = c_xy(ω)        (4.81)

and

q_yx(ω) = −q_xy(ω).        (4.82)

¹²For this section, it will be useful to recall the facts e^{−iα} = cos(α) − i sin(α) and, if z = a + ib, then z̄ = a − ib.


An important example of the application of the cross-spectrum is to the problem of predicting an output series y_t from some input series x_t through a linear filter relation such as the three-point moving average considered below. A measure of the strength of such a relation is the squared coherence function, defined as

ρ²_{y·x}(ω) = |f_yx(ω)|² / ( f_xx(ω) f_yy(ω) ),        (4.83)

where f_xx(ω) and f_yy(ω) are the individual spectra of the x_t and y_t series, respectively. Although we consider a more general form of this that applies to multiple inputs later, it is instructive to display the single input case as (4.83) to emphasize the analogy with conventional squared correlation, which takes the form

ρ²_yx = σ²_yx / (σ²_x σ²_y),

for random variables with variances σ²_x and σ²_y and covariance σ_yx = σ_xy. This motivates the interpretation of squared coherence as the squared correlation between two time series at frequency ω.

Example 4.14 Cross-Spectrum and Coherence of a Process and a Three-Point Moving Average

As a simple example, we compute the cross-spectrum between x_t and the three-point moving average y_t = (x_{t−1} + x_t + x_{t+1})/3, where x_t is a stationary input process with spectral density f_xx(ω). First,

γ_xy(h) = E[x_{t+h} y_t]
        = (1/3) E[ x_{t+h}(x_{t−1} + x_t + x_{t+1}) ]
        = (1/3) ( γ_xx(h + 1) + γ_xx(h) + γ_xx(h − 1) )
        = (1/3) ∫_{−1/2}^{1/2} (e^{2πiω} + 1 + e^{−2πiω}) e^{2πiωh} f_xx(ω) dω
        = (1/3) ∫_{−1/2}^{1/2} [1 + 2 cos(2πω)] f_xx(ω) e^{2πiωh} dω.

Using the uniqueness of the Fourier transform, we argue from the spectral representation (4.75) that the above must be the transform of f_xy(ω), implying that

f_xy(ω) = (1/3)[1 + 2 cos(2πω)] f_xx(ω)

so that the cross-spectrum is real in this case. From Example 4.5, the spectral density of y_t is

f_yy(ω) = (1/9)[3 + 4 cos(2πω) + 2 cos(4πω)] f_xx(ω)


        = (1/9)[1 + 2 cos(2πω)]² f_xx(ω),

using the identity cos(2α) = 2cos²(α) − 1 in the last step. Substituting into (4.83) yields the squared coherence between x_t and y_t as unity over all frequencies. This is a characteristic inherited by more general linear filters, as will be shown in Problem 4.23. However, if some noise is added to the three-point moving average, the coherence is not unity; these kinds of models will be considered in detail later.
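The unit coherence claimed in this example can be checked numerically with a small simulation (a hypothetical illustration, not code from the text): filter a white noise series with the three-point moving average and estimate the squared coherence between input and output.

> set.seed(1)
> x = rnorm(1024)
> y = as.numeric(filter(x, rep(1/3,3), sides=2))   # three-point moving average
> u = ts(cbind(x, y)[2:1023, ])                    # drop the NA end points
> s = spec.pgram(u, kernel("daniell", 9), taper=0, plot=FALSE)
> summary(s$coh)                                   # estimated squared coherence near 1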

Property P4.3: Spectral Representation of a Vector Stationary Process
If the p × p autocovariance function matrix

Γ(h) = E[(x_{t+h} − µ)(x_t − µ)′]

of a p-dimensional stationary time series, x_t = (x_{t1}, x_{t2}, . . . , x_{tp})′, has elements satisfying

Σ_{h=−∞}^{∞} |γ_jk(h)| < ∞        (4.84)

for all j, k = 1, . . . , p, then Γ(h) has the representation

Γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} f(ω) dω,   h = 0, ±1, ±2, ...,        (4.85)

as the inverse transform of the spectral density matrix, f(ω) = {f_jk(ω)}, for j, k = 1, . . . , p, with elements equal to the cross-spectral components. The matrix f(ω) has the representation

f(ω) = Σ_{h=−∞}^{∞} Γ(h) e^{−2πiωh},   −1/2 ≤ ω ≤ 1/2.        (4.86)

Example 4.15 Spectral Matrix of a Bivariate Process

Consider a jointly stationary bivariate process (x_t, y_t). We arrange the autocovariances in the matrix

Γ(h) = [ γ_xx(h)  γ_xy(h) ;  γ_yx(h)  γ_yy(h) ].

The spectral matrix would be given by

f(ω) = [ f_xx(ω)  f_xy(ω) ;  f_yx(ω)  f_yy(ω) ],

where the Fourier transforms (4.85) and (4.86) relate the autocovariance and spectral matrices.


The extension of spectral estimation to vector series is fairly obvious. For the vector series x_t = (x_{t1}, x_{t2}, . . . , x_{tp})′, we may use the vector of DFTs, say d(ω_j) = (d₁(ω_j), d₂(ω_j), . . . , d_p(ω_j))′, and estimate the spectral matrix by

f̄(ω) = L⁻¹ Σ_{k=−m}^{m} I(ω_j + k/n)        (4.87)

where now

I(ω_j) = d(ω_j) d*(ω_j)        (4.88)

is a p × p complex matrix.¹³

Again, the series may be tapered before the DFT is taken in (4.87) and we can use weighted estimation,

f̂(ω) = Σ_{k=−m}^{m} h_k I(ω_j + k/n)        (4.89)

where {h_k} are weights as defined in (4.56). The estimate of squared coherence between two series, y_t and x_t, is

ρ̂²_{y·x}(ω) = |f̂_yx(ω)|² / ( f̂_xx(ω) f̂_yy(ω) ).        (4.90)

If the spectral estimates in (4.90) are obtained using equal weights, we will write ρ̄²_{y·x}(ω) for the estimate.

Under general conditions, if ρ²_{y·x}(ω) > 0, then

|ρ̂_{y·x}(ω)| ∼ AN( |ρ_{y·x}(ω)|, (1 − ρ²_{y·x}(ω))² / 2L_h )        (4.91)

where L_h is defined in (4.57); the details of this result may be found in Brockwell and Davis (1991, Ch 11). We may use (4.91) to obtain approximate confidence intervals for the squared coherency ρ²_{y·x}(ω).

We can test the hypothesis that ρ²_{y·x}(ω) = 0 if we use ρ̄²_{y·x}(ω) for the estimate with L > 1,¹⁴ that is,

ρ̄²_{y·x}(ω) = |f̄_yx(ω)|² / ( f̄_xx(ω) f̄_yy(ω) ).        (4.92)

In this case, under the null hypothesis, the statistic

F_{2,2L−2} = ρ̄²_{y·x}(ω) (L − 1) / ( 1 − ρ̄²_{y·x}(ω) )        (4.93)

¹³If Z is a complex matrix, then Z* = Z̄′ denotes the conjugate transpose operation. That is, Z* is the result of replacing each element of Z by its complex conjugate and transposing the resulting matrix.

¹⁴If L = 1, then ρ̄²_{y·x}(ω) ≡ 1.


[Figure 4.11 about here: squared coherency between SOI and Recruits plotted against frequency (0 to .5), with the horizontal significance threshold; panel title "Coherence − SOI and Recruits".]

Figure 4.11 Coherence function between the SOI and Recruitment series; L = 19, n = 453, n′ = 480, and α = .001.

has an approximate F-distribution with 2 and 2L − 2 degrees of freedom. When the series have been extended to length n′, we replace 2L − 2 by df − 2, where df is defined in (4.53). Solving (4.93) for a particular significance level α leads to

C_α = F_{2,2L−2}(α) / ( L − 1 + F_{2,2L−2}(α) )        (4.94)

as the approximate value that must be exceeded for the original squared coherence to be able to reject ρ²_{y·x}(ω) = 0 at an a priori specified frequency.

Example 4.16 Coherence Between SOI and Recruitment Series

Figure 4.11 shows the squared coherence between the SOI and Recruitment series over a wider band than was used for the spectrum. In this case, we used L = 19, df = 2(19)(453/480) ≈ 36, and F_{2,df−2}(.001) ≈ 8.53 at the significance level α = .001. Hence, we may reject the hypothesis of no coherence for values of squared coherence that exceed C_{.001} = .32. We emphasize that this method is crude because, in addition to the fact that the F-statistic is approximate, we are examining the squared coherence across all frequencies with the Bonferroni inequality, (4.55), in mind. Figure 4.11 also exhibits confidence bands as part of the R plotting routine. We emphasize that these bands are only valid for ω where ρ²_{y·x}(ω) > 0.


In this case, the seasonal frequency and the El Nino frequencies ranging between about 3 and 7 year periods are strongly coherent. Other frequencies are also strongly coherent, although the strong coherence is less impressive because the underlying power spectrum at these higher frequencies is fairly small. Finally, we note that the coherence is persistent at the seasonal harmonic frequencies.

This example may be reproduced using the following R commands.

> x = ts(cbind(soi,rec))
> s = spec.pgram(x, kernel("daniell",9), taper=0)
> s$df                      # df = 35.8625
> f = qf(.999, 2, s$df-2)   # f = 8.529792
> c = f/(18+f)              # c = 0.3188779
> plot(s, plot.type = "coh", ci.lty = 2)
> abline(h = c)

4.7 Linear Filters

Some of the examples of the previous sections have hinted at the possibility that the distribution of power or variance in a time series can be modified by making a linear transformation. In this section, we explore that notion further by defining a linear filter and showing how it can be used to extract signals from a time series. The linear filter modifies the spectral characteristics of a time series in a predictable way, and the systematic development of methods for taking advantage of the special properties of linear filters is an important topic in time series analysis.

A linear filter uses a set of specified coefficients a_t, for t = 0, ±1, ±2, . . ., to transform a stationary input series, x_t, producing an output series, y_t, of the form

y_t = Σ_{r=−∞}^{∞} a_r x_{t−r}.        (4.95)

The form (4.95) is also called a convolution in some statistical contexts. The coefficients, collectively called the impulse response function, are required to satisfy absolute summability

Σ_{t=−∞}^{∞} |a_t| < ∞,        (4.96)

so (4.95) exists as a limit in mean square, and the infinite Fourier transform

A_yx(ω) = Σ_{t=−∞}^{∞} a_t e^{−2πiωt},        (4.97)


called the frequency response function, is well defined. We have already encountered several linear filters, for example, the simple three-point moving average in Example 4.5, which can be put into the form of (4.95) by letting a_{−1} = a₀ = a₁ = 1/3 and taking a_t = 0 for |t| ≥ 2.

The importance of the linear filter stems from its ability to enhance certain parts of the spectrum of the input series. To see this, the autocovariance function of the filtered output (4.95) can be derived as

γ_yy(h) = E[(y_{t+h} − Ey_{t+h})(y_t − Ey_t)]
        = E[ Σ_r Σ_s a_r (x_{t+h−r} − µ)(x_{t−s} − µ) a_s ]
        = Σ_r Σ_s a_r γ_xx(h − r + s) a_s
        = Σ_r Σ_s a_r [ ∫_{−1/2}^{1/2} e^{2πiω(h−r+s)} f_xx(ω) dω ] a_s
        = ∫_{−1/2}^{1/2} ( Σ_r a_r e^{−2πiωr} )( Σ_s a_s e^{2πiωs} ) e^{2πiωh} f_xx(ω) dω
        = ∫_{−1/2}^{1/2} e^{2πiωh} |A_yx(ω)|² f_xx(ω) dω,

where we have first replaced γ_xx(·) by its representation (4.12) and then substituted A_yx(ω) from (4.97). The computation is one we do repeatedly, exploiting the uniqueness of the Fourier transform. Now, because the left-hand side is the Fourier transform of the spectral density of the output, say, f_yy(ω), we get the important filtering property as follows.

Property P4.4: Output Spectrum of a Filtered Stationary Series
The spectrum of the filtered output y_t in (4.95) is related to the spectrum of the input x_t by

f_yy(ω) = |A_yx(ω)|² f_xx(ω),        (4.98)

where the frequency response function A_yx(ω) is defined in (4.97).

The result (4.98) enables us to calculate the exact effect on the spectrum of any given filtering operation. This important property shows the spectrum of the input series is changed by filtering and the effect of the change can be characterized as a frequency-by-frequency multiplication by the squared magnitude of the frequency response function. Again, an obvious analogy to a property of the variance in classical statistics holds, namely, if x is a random variable with variance σ²_x, then y = ax will have variance σ²_y = a²σ²_x, so the variance of the linearly transformed random variable is changed by multiplication by a² in much the same way as the linearly filtered spectrum is changed in (4.98).
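As an informal numerical illustration of Property P4.4 (a hypothetical check, not code from the text), we can filter Gaussian white noise, whose spectrum is flat at f_xx(ω) = σ² = 1, with the first difference and compare a smoothed estimate of the output spectrum with |A_yx(ω)|² · 1 = 2[1 − cos(2πω)], the squared frequency response derived in Example 4.17 below.

> set.seed(2)
> v = rnorm(2048)                               # white noise, f(omega) = 1
> y = diff(v)                                   # first difference filter
> s = spec.pgram(y, kernel("daniell", 20), taper=0, plot=FALSE)
> A2 = 2*(1 - cos(2*pi*s$freq))                 # squared frequency response, see (4.99)
> plot(s$freq, s$spec, type="l")                # estimated output spectrum
> lines(s$freq, A2, lty=2)                      # theoretical |A|^2 * f_xx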


[Figure 4.12 about here: three time series panels labeled SOI Index, First Difference SOI, and 12 Mo. Moving Avge. SOI, each plotted over time with values between −1 and 1.]

Figure 4.12 SOI series (top) compared with the differenced SOI (middle) and a centered 12-month moving average (bottom).

Example 4.17 First Difference and Moving Average Filters

We illustrate the effect of filtering with two common examples, the first difference filter

y_t = ∇x_t = x_t − x_{t−1}

and the symmetric moving average filter

y_t = (1/24)(x_{t−6} + x_{t+6}) + (1/12) Σ_{r=−5}^{5} x_{t−r},

which is a modified Daniell kernel with m = 6. The results of filtering the SOI series using the two filters are shown in the middle and bottom panels of Figure 4.12. Notice that the effect of differencing is to roughen the series because it tends to retain the higher or faster frequencies. The centered moving average smoothes the series because it retains the lower frequencies and tends to attenuate the higher frequencies. In general, differencing is an example of a high-pass filter because it retains or passes


[Figure 4.13 about here: smoothed spectrum of the 12-month filtered SOI series; frequency on the horizontal axis, spectrum on the vertical axis; bandwidth = 0.00525.]

Figure 4.13 Spectral analysis of the SOI series after applying a 12-month moving average filter. The vertical line corresponds to the 52-month cycle.

the higher frequencies, whereas the moving average is a low-pass filter because it passes the lower or slower frequencies.

Notice that the slower periods are enhanced in the symmetric moving average and the seasonal or yearly frequencies are attenuated. The filtered series makes about 9 cycles in the length of the data (about one cycle every 52 months) and the moving average filter tends to enhance or extract the signal that is associated with El Nino. Moreover, by the low-pass filtering of the data, we get a better sense of the El Nino effect and its irregularity. Figure 4.13 shows the results of a spectral analysis on the low-pass filtered SOI series. It is clear that all high frequency behavior has been removed and the El Nino cycle is accentuated; the dotted vertical line in the figure corresponds to the 52 months cycle.

Now, having done the filtering, it is essential to determine the exact way in which the filters change the input spectrum. We shall use (4.97) and (4.98) for this purpose. The first difference filter can be written in the form (4.95) by letting a₀ = 1, a₁ = −1, and a_r = 0 otherwise. This implies that

A_yx(ω) = 1 − e^{−2πiω},

and the squared frequency response becomes

|A_yx(ω)|² = (1 − e^{−2πiω})(1 − e^{2πiω}) = 2[1 − cos(2πω)].        (4.99)

The top panel of Figure 4.14 shows that the first difference filter will attenuate the lower frequencies and enhance the higher frequencies


[Figure 4.14 about here: squared frequency responses (power versus frequency) of the first difference filter (top) and the 12-month moving average filter (bottom).]

Figure 4.14 Squared frequency response functions of the first difference and 12-month moving average filters.

because the multiplier of the spectrum, |A_yx(ω)|², is large for the higher frequencies and small for the lower frequencies. Generally, the slow rise of this kind of filter does not particularly recommend it as a procedure for retaining only the high frequencies.

For the centered 12-month moving average, we can take a_{−6} = a₆ = 1/24, a_k = 1/12 for −5 ≤ k ≤ 5, and a_k = 0 elsewhere. Substituting and recognizing the cosine terms gives

A_yx(ω) = (1/12)[ 1 + cos(12πω) + 2 Σ_{k=1}^{5} cos(2πωk) ].        (4.100)

Plotting the squared frequency response of this function as in Figure 4.14 shows that we can expect this filter to cut most of the frequency content above .05 cycles per point. This corresponds to eliminating periods shorter than T = 1/.05 = 20 points. In particular, this drives down the yearly components with periods of T = 12 months and enhances the El Nino frequency, which is somewhat lower. The filter is not completely efficient at attenuating high frequencies; some power contributions are left at higher frequencies, as shown in the function |A_yx(ω)|² and in the filtered series in Figure 4.3.

The following R session shows how to filter the data, perform the spectral analysis of this example, and plot the squared frequency response curve of the difference filter.

> par(mfrow=c(3,1))
> plot.ts(soi)            # the data
> plot.ts(diff(soi))      # first difference
> k = kernel("modified.daniell", 6)    #-- 12 month filter
> soif = kernapply(soi, k)
> plot.ts(soif)
> windows()    # open new graphics device - use x11() in unix
> spectrum(soif, spans=9, log="no")    #-- spectral analysis
> abline(v=1/52, lty="dotted")
> windows()
> w = seq(0, .5, length=1000)          #-- frequency response
> FR = abs(1-exp(2i*pi*w))^2
> plot(w, FR, type="l")
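For comparison, the squared frequency response of the 12-month moving average in (4.100) can be evaluated in the same way. The following lines are a small sketch, not part of the original session, and reproduce only the lower panel of Figure 4.14.

> # squared frequency response of the centered 12-month moving average, (4.100)
> w = seq(0, .5, length=1000)
> Ayx = (1 + cos(12*pi*w))/12
> for (k in 1:5) Ayx = Ayx + (2/12)*cos(2*pi*w*k)
> plot(w, Ayx^2, type="l", xlab="frequency", ylab="power")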

The two filters discussed in the previous example were different in that the frequency response function of the first difference was complex-valued, whereas the frequency response of the moving average was purely real. A short derivation similar to that used to verify (4.98) shows, when x_t and y_t are related by the linear filter relation (4.95), the cross-spectrum satisfies

fyx(ω) = Ayx(ω)fxx(ω),

so the frequency response is of the form

A_{yx}(ω) = f_{yx}(ω)/f_{xx}(ω)   (4.101)

         = c_{yx}(ω)/f_{xx}(ω) − i q_{yx}(ω)/f_{xx}(ω),   (4.102)

where we have used (4.77) to get the last form. Then, we may write (4.102) in polar coordinates as

Ayx(ω) = |Ayx(ω)| exp{−i φyx(ω)}, (4.103)

where the amplitude and phase of the filter are defined by

|A_{yx}(ω)| = [c_{yx}^2(ω) + q_{yx}^2(ω)]^{1/2} / f_{xx}(ω)   (4.104)

and

φ_{yx}(ω) = tan^{−1}( −q_{yx}(ω) / c_{yx}(ω) ).   (4.105)

A simple interpretation of the phase of a linear filter is that it exhibits time delays as a function of frequency in the same way as the spectrum represents the variance as a function of frequency. Additional insight can be gained by considering the simple delaying filter

y_t = A x_{t−D},

where the series gets replaced by a version, amplified by multiplying by A and delayed by D points. For this case,

f_{yx}(ω) = A e^{−2πiωD} f_{xx}(ω),

so the amplitude is |A| and the phase is

φyx(ω) = −2πωD,

or just a linear function of frequency ω. For this case, applying a simple time delay causes phase delays that depend on the frequency of the periodic component being delayed. Interpretation is further enhanced by setting x_t = cos(2πωt), in which case y_t = A cos(2πωt − 2πωD). Thus, the output series, y_t, has the same period as the input series, x_t, but the amplitude of the output has increased by a factor of |A| and the phase has been changed by a factor of −2πωD.

Example 4.18 Amplitude and Phase of Difference and Moving Average

We consider calculating the amplitude and phase of the two filters discussed in Example 4.17. The case for the moving average is easy because A_{yx}(ω) given in (4.100) is purely real. So, the amplitude is just |A_{yx}(ω)| and the phase is φ_{yx}(ω) = 0. In general, symmetric (a_t = a_{−t}) filters have zero phase. The first difference, however, changes this, as we might expect from the example above involving the time delay filter. In this case, the squared amplitude is given in (4.99). To compute the phase, we write

A_{yx}(ω) = 1 − e^{−2πiω} = e^{−iπω}(e^{iπω} − e^{−iπω}) = 2i e^{−iπω} sin(πω)
          = 2 sin^2(πω) + 2i cos(πω) sin(πω)
          = c_{yx}(ω)/f_{xx}(ω) − i q_{yx}(ω)/f_{xx}(ω),

so

φ_{yx}(ω) = tan^{−1}( −q_{yx}(ω) / c_{yx}(ω) ) = tan^{−1}( cos(πω) / sin(πω) ).


Noting that cos(πω) = sin(−πω + π/2) and that sin(πω) = cos(−πω + π/2), we get

φ_{yx}(ω) = −πω + π/2,

and the phase is again a linear function of frequency.
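A quick numerical check of this calculation can be done in R. The following lines are a small sketch, not part of the original example; they evaluate the complex frequency response of the first difference and compare its argument to π/2 − πω.

> w = seq(.001, .499, length=500)     # interior frequencies
> A = 1 - exp(-2i*pi*w)               # frequency response of the first difference
> plot(w, Mod(A), type="l", ylab="amplitude")   # equals sqrt(2*(1-cos(2*pi*w)))
> plot(w, Arg(A), type="l", ylab="phase")       # compare to the line pi/2 - pi*w
> max(abs(Arg(A) - (pi/2 - pi*w)))              # numerically zero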

The above tendency of the frequencies to arrive at different times in the filtered version of the series remains as one of two annoying features of the difference type filters. The other weakness is the gentle increase in the frequency response function. If low frequencies are really unimportant and high frequencies are to be preserved, we would like to have a somewhat sharper response than is obvious in Figure 4.14. Similarly, if low frequencies are important and high frequencies are not, the moving average filters are also not very efficient at passing the low frequencies and attenuating the high frequencies. Improvement is possible by using longer filters, obtained by approximations to the infinite inverse Fourier transform. The design of filters will be discussed in §4.10 and §4.11.

We will occasionally use results for multivariate series x_t = (x_{t1}, . . . , x_{tp})′ that are comparable to the simple property shown in (4.98). Consider the matrix filter

y_t = Σ_{r=−∞}^{∞} A_r x_{t−r},   (4.106)

where {A_r} denotes a sequence of q × p matrices such that Σ_{r=−∞}^{∞} ||A_r|| < ∞, x_t = (x_{t1}, . . . , x_{tp})′ is a p × 1 stationary vector process with mean vector µ_x, p × p matrix covariance function Γ_{xx}(h), and spectral matrix f_{xx}(ω), and y_t is the q × 1 vector output process. Then, we can obtain the following property.

Property P4.5: Output Spectral Matrix of a Linearly Filtered Stationary Vector Series
The spectral matrix of the filtered output y_t in (4.106) is related to the spectrum of the input x_t by

f_{yy}(ω) = A(ω) f_{xx}(ω) A*(ω),   (4.107)

where the matrix frequency response function A(ω) is defined by

A(ω) = Σ_{t=−∞}^{∞} A_t exp(−2πiωt).   (4.108)


4.8 Parametric Spectral Estimation

The methods of §4.5 lead to estimators generally referred to as nonparametric spectra because no assumption is made about the parametric form of the spectral density. In Example 4.6, we derived the spectrum of a second-order autoregressive series and we might consider basing a spectral estimator on this function, using the estimated parameters φ_1, φ_2, and σ_w^2. Then, substituting the parameter estimates into the spectral density f_x(ω) determined in that example would lead to a parametric estimator for the spectrum. Similarly, we might fit a p-th order autoregression, with the order p determined by one of the model selection criteria, such as AIC, AICc, and SIC, defined in (2.18)-(2.20) for the regression model. Parametric autoregressive spectral estimators will often have superior resolution in problems when several closely spaced narrow spectral peaks are present and are preferred by engineers for a broad variety of problems (see Kay, 1988). The development of autoregressive spectral estimators has been summarized by Parzen (1983).

To be specific, consider the equation determining the order p autoregressive model (2.1), written in the form

x_t − Σ_{k=1}^{p} φ_k x_{t−k} = w_t,   (4.109)

where w_t is a white noise process with mean zero and variance σ_w^2. Then, the linear filter Property P4.4, combined with equating the spectra of the left- and right-hand sides of the defining equation above, yields

|φ(e^{−2πiω})|^2 f_x(ω) = σ_w^2,   (4.110)

where

φ(e^{−2πiω}) = 1 − Σ_{k=1}^{p} φ_k e^{−2πiωk}.   (4.111)

Then, denoting the maximum likelihood or least squares estimators of the model parameters by φ̂_1, φ̂_2, . . . , φ̂_p and σ̂_w^2, we may substitute them into the form of the spectrum implied by (4.110), obtaining

f̂_x(ω) = σ̂_w^2 / |φ̂(e^{−2πiω})|^2.   (4.112)

The asymptotic distribution of the autoregressive spectral estimator has been obtained by Berk (1974) under the conditions p → ∞, p^3/n → 0 as p, n → ∞, which may be too severe for most applications. The limiting results imply a confidence interval of the form

f̂_x(ω)/(1 + C z_{α/2}) ≤ f_x(ω) ≤ f̂_x(ω)/(1 − C z_{α/2}),   (4.113)

where

C = sqrt(2p/n)   (4.114)

and z_{α/2} is the ordinate corresponding to the upper α/2 probability of the standard normal distribution. If the sampling distribution is to be checked, we suggest applying the bootstrap estimator to get the sampling distribution of f̂_x(ω) using a procedure similar to the one used for p = 1 in Example 3.33. An alternative for higher order autoregressive series is to put the AR(p) in state-space form and use the bootstrap procedure discussed in §6.7.

An interesting fact about rational spectra of the form (4.110) is that any spectral density can be approximated, arbitrarily closely, by the spectrum of an AR process.

Property P4.6: Approximating a Spectral Density with an AR Spectrum
Let g(ω) be the spectral density of a stationary process. Then, given ε > 0, there is a time series with the representation

x_t = Σ_{k=1}^{p} φ_k x_{t−k} + w_t,

where w_t is white noise with variance σ_w^2, such that

|f_x(ω) − g(ω)| < ε  for all ω ∈ [−1/2, 1/2].

Moreover, p is finite and the roots of φ(z) = 1 − Σ_{k=1}^{p} φ_k z^k are outside the unit circle.

One drawback of the property is that it does not tell us how large p must be before the approximation is reasonable; in some situations p may be extremely large. Property P4.6 also holds for MA and for ARMA processes in general, and a proof of the result may be found in Fuller (1996, Ch 4). For an ARMA(p, q) process we would have

f_x(ω) = σ_w^2 |θ(e^{−2πiω})|^2 / |φ(e^{−2πiω})|^2,   (4.115)

where θ(z) = 1 + Σ_{k=1}^{q} θ_k z^k. We demonstrate the technique in the following example.

Example 4.19 Autoregressive Spectral Estimator of the SOI Series

Consider obtaining results comparable to the nonparametric estimators shown in Figure 4.5 for the SOI series. Fitting successively higher order models for p = 1, 2, . . . , 30 yields a minimum SIC at p = 15 and a minimum AICc at p = 16, as shown in Figure 4.15. We can see from Figure 4.15 that SIC is very definite about which model it chooses; that is, the minimum SIC is very distinct. On the other hand, it is not clear what is going to happen with AICc; that is, the minimum is not so clear, and there is some concern that AICc will start decreasing after p = 30. Minimum AIC selects the p = 15 model (but suffers from the same uncertainty as AICc), as will be seen in the R example. The spectra of the two cases are almost identical, as shown in Figure 4.16, and we note the strong peaks at 52 months and 12 months corresponding to the nonparametric estimators obtained in §4.5. In addition, the harmonics of the yearly period are evident in the estimated spectrum.

Figure 4.15 Model selection criteria AICc and SIC as a function of order p for autoregressive models fitted to the SOI series.

To perform a similar analysis in R, the command spec.ar can be used to fit the best model via AIC and plot the resulting spectrum. A quick way to obtain the AIC values is to run the ar command as follows.

> spec.ar(soi, log="no")            # plot min AIC spectrum
> abline(v=1/52, lty="dotted")      # locate El Nino period
> abline(v=1/12, lty="dotted")      # locate yearly period
> soi.ar = ar(soi, order.max=30)    # obtain AICs
> plot(0:30, soi.ar$aic, type="l")  # plot AICs
> soi.ar                            # results

Figure 4.16 Autoregressive spectral estimators for the SOI series using models selected by AIC and SIC (p = 15, solid line) and by AICc (p = 16, dashed line). The first peak corresponds to the El Nino period of 52 months.

Coefficients:
      1        2        3        4        5
 0.4237   0.0803   0.1411   0.0750  -0.0446
      6        7        8        9       10
-0.0816  -0.0686  -0.0640   0.0159   0.1099
     11       12       13       14       15
 0.1656   0.1482   0.0231  -0.1814  -0.1406

Order selected 15  sigma^2 estimated as 0.07575

Use the command spec.ar(soi, order=16, log="no") to obtain the AR(16) spectrum.
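The parametric estimator (4.112) can also be evaluated directly from the fitted coefficients. The following lines are a minimal sketch, not part of the original example; they assume soi has been loaded as above and simply plug the Yule-Walker estimates from ar into (4.111)-(4.112).

> soi.ar15 = ar(soi, order.max=15, aic=FALSE)   # force the AR(15) fit
> phi = soi.ar15$ar; sig2 = soi.ar15$var.pred   # coefficient and variance estimates
> w = seq(0, .5, length=500)
> fhat = rep(0, length(w))
> for (j in 1:length(w)) {
+   phi.w = 1 - sum(phi*exp(-2i*pi*w[j]*(1:15)))   # phi(e^{-2 pi i w}), as in (4.111)
+   fhat[j] = sig2/Mod(phi.w)^2                    # equation (4.112)
+ }
> plot(w, fhat, type="l", xlab="frequency", ylab="spectrum")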

Finally, it should be mentioned that any parametric spectrum, say f(ω; θ), depending on the vector parameter θ, can be estimated via the approximate Whittle likelihood, see Whittle (1961), using the approximate properties of the discrete Fourier transform derived in Appendix C. We have that the DFTs, d(ω_j), are approximately complex normally distributed with mean zero and variance f(ω_j; θ) and are approximately independent for ω_j ≠ ω_k. This implies that an approximate log likelihood can be written in the form

ln L(x; θ) ≈ − Σ_{0<ω_j<1/2} ( ln f_x(ω_j; θ) + |d(ω_j)|^2 / f_x(ω_j; θ) ),   (4.116)

where the sum is sometimes expanded to include the frequencies ω_j = 0, 1/2. If the form with the two additional frequencies is used, the multiplier of the sum will be unity, except for the purely real points at ω_j = 0, 1/2, for which the multiplier is 1/2. For a discussion of applying the Whittle approximation to the problem of estimating parameters in an ARMA spectrum, see Anderson (1978). Although this yields valid answers, it seems more involved than simply using the time domain methods discussed in Chapter 3. The Whittle likelihood will be useful in fitting long memory models that will be discussed in Chapter 5.
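As an illustration of (4.116), the following sketch (not from the text) evaluates the approximate Whittle log likelihood for an AR(1) spectrum from the periodogram of a simulated series and maximizes it numerically; the function name whittle.ll and the trial values are hypothetical.

> whittle.ll = function(theta, x) {        # theta = c(phi, sig2) for an AR(1)
+   n = length(x)
+   per = Mod(fft(x))^2/n                  # |d(w_j)|^2 at w_j = j/n
+   j = 1:floor((n-1)/2)                   # frequencies 0 < w_j < 1/2
+   w = j/n
+   f = theta[2]/Mod(1 - theta[1]*exp(-2i*pi*w))^2   # AR(1) spectral density
+   -sum(log(f) + per[j+1]/f)              # equation (4.116)
+ }
> x = arima.sim(list(ar=.9), n=256)
> whittle.ll(c(.9, 1), x)                  # log likelihood at the true values
> optim(c(.5, .5), function(th) -whittle.ll(th, x), method="L-BFGS-B",
+       lower=c(-.99, .01), upper=c(.99, 10))$par   # approximate Whittle estimates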

4.9 Dynamic Fourier Analysis and Wavelets

If a time series, x_t, is stationary, its second-order behavior remains the same, regardless of the time t. It makes sense to match a stationary time series with sines and cosines because they, too, behave the same forever. Indeed, based on the Spectral Representation Theorem (Appendix C, §C.1), we may regard a stationary series as the superposition of sines and cosines that oscillate at various frequencies. As seen in this text, however, many time series are not stationary. Typically, the data are coerced into stationarity via transformations, or we restrict attention to parts of the data where stationarity appears to adhere. In some cases, the nonstationarity of a time series is of interest. That is to say, it is the local behavior of the process, and not the global behavior of the process, that is of concern to the investigator. As a case in point, we mention the explosion and earthquake series first presented in Example 1.7 (see Figure 1.7) and subsequently analyzed using Fourier methods in Example 4.13. The following example emphasizes the importance of dynamic (or time-frequency) Fourier analysis.

Example 4.20 Dynamic Fourier Analysis of the Explosion and Earthquake Series

Consider the earthquake and explosion series displayed in Figure 1.7. As a summary of the local behavior of these series, the estimated spectra of the P and S waves in Example 4.13 leave a lot to be desired. Figures 4.17 and 4.18 show the time-frequency analysis of the earthquake and explosion series, respectively. The idea here is to summarize the spectral behavior of the signal as it evolves over time. First, a Fourier analysis is performed on a short section of the data. Then, the section is shifted, and a Fourier analysis is performed on the new section. This process is repeated until the end of the data, and the results are plotted as in Figures 4.17 and 4.18. Specifically, in this example, let x_t, for t = 1, . . . , 2048, represent the series of interest. Then, the sections of the data that were analyzed were {x_{t_k+1}, . . . , x_{t_k+256}}, for t_k = 128k, and k = 0, 1, . . . , 14. Each section was tapered using a cosine bell, and spectral estimation was performed using a triangular set of L = 5 weights. The sections overlap each other; however, this practice is not necessary and sometimes not desirable; see Percival and Walden (1993, §6.17) for a further discussion of this problem.

Figure 4.17 Time-frequency plot for the dynamic Fourier analysis of the earthquake series shown in Figure 1.7.

The results of the dynamic analysis are shown as the estimated spectra (for frequencies up to ω = .25) for each starting location (time), t_k = 128k, with k = 0, 1, . . . , 14. The S component for the earthquake shows power at the low frequencies only, and the power remains strong for a long time. In contrast, the explosion shows power at higher frequencies than the earthquake, and the power of the signals (P and S waves) does not last as long as in the case of the earthquake.

The following is an R session that corresponds to a similar analysis of the explosion series in this example.

> eqexp = matrix(scan("/mydata/eq5exp6.dat"), ncol=2)
> ex = eqexp[,2]        # the explosion series
> ## -- dynamic spectral analysis -- ##
> nobs = length(ex)     # number of observations
> wsize = 256           # window size
> overlap = 128         # overlap
> ovr = wsize-overlap
> nseg = floor(nobs/ovr)-1;            # number of segments
> krnl = kernel("daniell", c(1,1))     # kernel
> ex.spec = matrix(0, wsize/2, nseg)
> for (k in 1:nseg) {
+   a = ovr*(k-1)+1
+   b = wsize+ovr*(k-1)
+   ex.spec[,k] = spectrum(ex[a:b], krnl, taper=.5, plot=F)$spec
+ }
> ## -- plot results -- ##
> x = seq(0, .5, len = nrow(ex.spec))
> y = seq(0, ovr*nseg, len = ncol(ex.spec))
> persp(x, y, ex.spec, zlab="Power", xlab="frequency",
+   ylab="time", ticktype = "detailed", theta=25, d=2)

Figure 4.18 Time-frequency plot for the dynamic Fourier analysis of the explosion series shown in Figure 1.7.

One way to view the time-frequency analysis of Example 4.20 is to consider it as being based on local transforms of the data x_t of the form

d_{j,k} = n^{−1/2} Σ_{t=1}^{n} x_t ψ_{j,k}(t),   (4.117)


Figure 4.19 Local, tapered cosines at various frequencies.

where

ψ_{j,k}(t) = (n/m)^{1/2} h_t e^{−2πitj/m}  for t ∈ [t_k + 1, t_k + m],  and  ψ_{j,k}(t) = 0  otherwise,   (4.118)

where h_t is a taper and m is some fraction of n. In Example 4.20, n = 2048, m = 256, t_k = 128k, for k = 0, 1, . . . , 14, and h_t was a cosine bell taper over 256 points. In (4.117) and (4.118), j indexes frequency, ω_j = j/m, for j = 1, 2, . . . , [m/2], and k indexes the location, or time shift, of the transform. In this case, the transforms are based on tapered cosines and sines that have been zeroed out over various regions in time. The key point here is that the transforms are based on local sinusoids. Figure 4.19 shows an example of four local, tapered cosine functions at various frequencies. In that figure, the length of the data is considered to be one, and the cosines are localized to a fourth of the data length.
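To make (4.118) concrete, the following sketch (not from the text) constructs and plots the real part of one local, tapered sinusoid of the kind displayed in Figure 4.19; the block start t_k, the frequency index j, and the particular cosine bell used for h_t are chosen only for illustration.

> n = 2048; m = 256; tk = 512; j = 10    # hypothetical block start and frequency index
> tt = (tk+1):(tk+m)
> h = .5*(1 - cos(2*pi*(1:m)/m))         # a full cosine bell over the block
> psi = rep(0, n)
> psi[tt] = sqrt(n/m)*h*cos(2*pi*tt*j/m) # real part of (4.118)
> plot.ts(psi, ylab="local tapered cosine")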

In addition to dynamic Fourier analysis as a method to overcome the restriction of stationarity, researchers have sought various alternative methods. A recent, and successful, alternative is wavelet analysis. A website http://www.wavelet.org is devoted to wavelets, which includes information about books, technical papers, software, and links to other sites. In addition, we mention the monograph on wavelets by Daubechies (1992), the text by Percival and Walden (2000), and we note that many statistical software manufacturers have wavelet modules that sit on top of their base package. In this section, we rely primarily on the S-PLUS wavelets module (with a manual written by Bruce and Gao, 1996); however, we will present some R code where possible. The basic idea of wavelet analysis is to imitate dynamic Fourier analysis, but with functions (wavelets) that may be better suited to capture the local behavior of nonstationary time series.

Wavelets come in families generated by a father wavelet, φ, and a mother wavelet, ψ. The father wavelets are used to capture the smooth, low-frequency nature of the data, whereas the mother wavelets are used to capture the detailed, high-frequency nature of the data. The father wavelet integrates to one, and the mother wavelet integrates to zero:

∫ φ(t) dt = 1  and  ∫ ψ(t) dt = 0.   (4.119)

For a simple example, consider the Haar function,

ψ(t) =  1,   0 ≤ t < 1/2,
       −1,   1/2 ≤ t < 1,
        0,   otherwise.        (4.120)

The father in this case is φ(t) = 1 for t ∈ [0, 1) and zero otherwise. The Haar functions are useful for demonstrating properties of wavelets, but they do not have good time-frequency localization properties. Figure 4.20 displays two of the more commonly used wavelets that are available with the S-PLUS wavelets module, the daublet4 and symmlet8 wavelets, which are described in detail in Daubechies (1992). The number after the name refers to the width and smoothness of the wavelet; for example, the symmlet10 wavelet is wider and smoother than the symmlet8 wavelet. Daublets are one of the first type of continuous orthogonal wavelets with compact support, and symmlets were constructed to be closer to symmetry than daublets. In general, wavelets do not have an analytical form, but instead they are generated using numerical methods.
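Because the Haar functions in (4.120) do have closed forms, they can be drawn and checked against (4.119) with a few lines of base R; this is a small sketch, not from the text, using a crude Riemann sum for the integrals.

> haar.psi = function(t) ifelse(t >= 0 & t < .5, 1, ifelse(t >= .5 & t < 1, -1, 0))
> haar.phi = function(t) ifelse(t >= 0 & t < 1, 1, 0)
> t = seq(-.5, 1.5, length=2001); dt = diff(t)[1]
> sum(haar.phi(t))*dt     # approximately 1, as in (4.119)
> sum(haar.psi(t))*dt     # approximately 0
> par(mfrow=c(1,2))
> plot(t, haar.phi(t), type="s", ylab="father")
> plot(t, haar.psi(t), type="s", ylab="mother")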

Figure 4.20 was generated in S-PLUS using the wavelet module as follows:

> d4f <- wavelet("d4", mother=F)
> d4m <- wavelet("d4")
> s8f <- wavelet("s8", mother=F)
> s8m <- wavelet("s8")
> par(mfrow=c(2,2))
> plot(d4f)
> plot(d4m)
> plot(s8f)
> plot(s8m)

Figure 4.20 Father and mother daublet4 wavelets (top row); father and mother symmlet8 wavelets (bottom row).

It is possible to draw some wavelets in R using the wavethresh package. In that package, daublets are called DaubExPhase and symmlets are called DaubLeAsymm. The following R session displays some of the available wavelets (this will not reproduce Figure 4.20), and it assumes the wavethresh package has been downloaded into R and then loaded at the start of the session. The filter.number determines the width and smoothness of the wavelet.

> par(mfrow=c(2,2))
> draw.default(filter.number=2, family="DaubExPhase")
> draw.default(filter.number=4, family="DaubExPhase")
> draw.default(filter.number=4, family="DaubLeAsymm")
> draw.default(filter.number=9, family="DaubLeAsymm")

Figure 4.21 Scaled and translated daublet4 wavelets, ψ_{1,0}(t) and ψ_{2,1}(t) (top row); scaled and translated symmlet8 wavelets, ψ_{1,0}(t) and ψ_{2,1}(t) (bottom row).

When we depart from periodic functions, such as sines and cosines, the precise meaning of frequency, or cycles per unit time, is lost. When using wavelets, we typically refer to scale rather than frequency. The orthogonal wavelet decomposition of a time series, x_t, for t = 1, . . . , n, is

x_t = Σ_k s_{J,k} φ_{J,k}(t) + Σ_k d_{J,k} ψ_{J,k}(t) + Σ_k d_{J−1,k} ψ_{J−1,k}(t) + · · · + Σ_k d_{1,k} ψ_{1,k}(t),   (4.121)

where J is the number of scales, and k ranges from one to the number of coefficients associated with the specified component (see Example 4.21). In (4.121), the wavelet functions φ_{J,k}(t), ψ_{J,k}(t), ψ_{J−1,k}(t), . . . , ψ_{1,k}(t) are generated from the father wavelet, φ(t), and the mother wavelet, ψ(t), by translation (shift) and scaling:

φ_{J,k}(t) = 2^{−J/2} φ( (t − 2^J k) / 2^J ),   (4.122)

ψ_{j,k}(t) = 2^{−j/2} ψ( (t − 2^j k) / 2^j ),  j = 1, . . . , J.   (4.123)


The choice of dyadic shifts and scales is arbitrary but convenient. The shift or translation parameter is 2^j k, and the scale parameter is 2^j. The wavelet functions are spread out and shorter for larger values of j (or scale parameter 2^j) and tall and narrow for small values of the scale. Figure 4.21 shows ψ_{1,0}(t) and ψ_{2,1}(t) generated from the daublet4 (top row), and the symmlet8 (bottom row) mother wavelets. We may think of 1/2^j (or 1/scale) in wavelet analysis as being the analogue of frequency (ω_j = j/n) in Fourier analysis. For example, when j = 1, the scale parameter of 2 is akin to the Nyquist frequency of 1/2, and when j = 6, the scale parameter of 2^6 is akin to a low frequency (1/2^6 ≈ 0.016). In other words, larger values of the scale refer to slower, smoother (or coarser) movements of the signal, and smaller values of the scale refer to faster, choppier (or finer) movements of the signal. Figure 4.21 was generated in S-PLUS using the wavelet module as follows:

> d4.1 <- wavelet("d4", level=1, shift=0)
> d4.2 <- wavelet("d4", level=2, shift=1)
> s8.1 <- wavelet("s8", level=1, shift=0)
> s8.2 <- wavelet("s8", level=2, shift=1)
> par(mfrow=c(2,2))
> plot(d4.1, ylim=c(-.8,.8), xlim=c(-6,20))
> plot(d4.2, ylim=c(-.8,.8), xlim=c(-6,20))
> plot(s8.1, ylim=c(-.8,.8), xlim=c(-6,20))
> plot(s8.2, ylim=c(-.8,.8), xlim=c(-6,20))

The discrete wavelet transform (DWT) of the data x_t are the coefficients s_{J,k} and d_{j,k} for j = J, J − 1, . . . , 1, in (4.121). To some degree of approximation, they are given by (see footnote 15)

s_{J,k} = n^{−1/2} Σ_{t=1}^{n} x_t φ_{J,k}(t),   (4.124)

d_{j,k} = n^{−1/2} Σ_{t=1}^{n} x_t ψ_{j,k}(t),  j = J, J − 1, . . . , 1.   (4.125)

It is the magnitudes of the coefficients that measure the importance of the corresponding wavelet term in describing the behavior of x_t. As in Fourier analysis, the DWT is not computed as shown but is calculated using a fast algorithm. The s_{J,k} are called the smooth coefficients because they represent the smooth behavior of the data. The d_{j,k} are called the detail coefficients because they tend to represent the finer, more high-frequency nature, of the data.

15 The actual DWT coefficients are defined via a set of filters whose coefficients are close to what you would get by sampling the father and mother wavelets, but not exactly so; see the discussion surrounding Figures 471 and 478 in Percival and Walden (2000).

Figure 4.22 Discrete wavelet transform of the earthquake series using the symmlet8 wavelets, and J = 6 levels of scale.

Figure 4.23 Discrete wavelet transform of the explosion series using the symmlet8 wavelets and J = 6 levels of scale.


Example 4.21 Wavelet Analysis of the Explosion and Earthquake Series

Figures 4.22 and 4.23 show the DWTs, based on the symmlet8 wavelet basis, for the earthquake and explosion series, respectively. Each series is of length n = 2^{11} = 2048, and in this example, the DWTs are calculated using J = 6 levels. In this case, n/2 = 2^{10} = 1024 values are in d1 = {d_{1,k}; k = 1, . . . , 2^{10}}, n/2^2 = 2^9 = 512 values are in d2 = {d_{2,k}; k = 1, . . . , 2^9}, and so on, until finally, n/2^6 = 2^5 = 32 values are in d6 and in s6. The detail values d_{1,k}, . . . , d_{6,k} are plotted at the same scale, and hence, the relative importance of each value can be seen from the graph. The smooth values s_{6,k} are typically larger than the detail values and plotted on a different scale. The top of Figures 4.22 and 4.23 show the inverse DWT (IDWT) computed from all of the coefficients. The displayed IDWT is a reconstruction of the data, and it reproduces the data except for round-off error.

Comparing the DWTs, the earthquake is best represented by wavelets with larger scale than the explosion. One way to measure the importance of each level, d1, d2, . . . , d6, s6, is to evaluate the proportion of the total power (or energy) explained by each. The total power of a time series x_t, for t = 1, . . . , n, is TP = Σ_{t=1}^{n} x_t^2. The total power associated with each level of scale is (recall n = 2^{11})

TP_6^s = Σ_{k=1}^{n/2^6} s_{6,k}^2  and  TP_j^d = Σ_{k=1}^{n/2^j} d_{j,k}^2,  j = 1, . . . , 6.

Because we are working with an orthogonal basis, we have

TP = TP_6^s + Σ_{j=1}^{6} TP_j^d,

and the proportion of the total power explained by each level of detail would be the ratios TP_j^d/TP for j = 1, . . . , 6, and for the smooth level, it would be TP_6^s/TP. These values are listed in Table 4.2. From that table, nearly 80% of the total power of the earthquake series is explained by the higher scale details d4 and d5, whereas 90% of the total power is explained by the smaller scale details d2 and d3 for the explosion.

Table 4.2 Fraction of the Total Power for the DWTs of the Earthquake and the Explosion

  Component   Earthquake   Explosion
     s6          0.009       0.002
     d6          0.043       0.002
     d5          0.377       0.007
     d4          0.367       0.015
     d3          0.160       0.559
     d2          0.040       0.349
     d1          0.003       0.066

Figure 4.24 Time-scale plot of the earthquake series.

Figures 4.24 and 4.25 show the time-scale plots based on the DWT of the earthquake series and the explosion series, respectively. These figures are the wavelet analog of the time-frequency plots shown in Figures 4.17 and 4.18. The power axis represents the magnitude of each value d_{j,k} or s_{6,k}. The time axis matches the time axis in the DWTs shown in Figures 4.22 and 4.23, and the scale axis is plotted as 1/scale, listed from the coarsest scale to the finest scale. On the 1/scale axis, the coarsest scale values, represented by the smooth coefficients s6, are plotted over the range [0, 2^{−6}), the coarsest detail values, d6, are plotted over [2^{−6}, 2^{−5}), and so on. In these figures, we did not plot the finest scale values, d1, so the finest scale values exhibited in Figures 4.24 and 4.25 are in d2, which are plotted over the range [2^{−2}, 2^{−1}).

The conclusions drawn from these plots are the same as those drawn from Figures 4.17 and 4.18. That is, the S wave for the earthquake shows power at the high scales (or low 1/scale) only, and the power remains strong for a long time. In contrast, the explosion shows power at smaller scales (or higher 1/scale) than the earthquake, and the power of the signals (P and S waves) does not last as long as in the case of the earthquake.

Figure 4.25 Time-scale plot of the explosion series.

The analyses of this example were performed using the S-PLUS wavelets module (which must be loaded prior to the analyses) as follows:

> eqexp <- matrix(scan("/mydata/eq5exp6.dat"), ncol=2)
> eq <- eqexp[,1]     # the earthquake series
> ex <- eqexp[,2]     # the explosion series
> eq.dwt <- dwt(eq)
> ex.dwt <- dwt(ex)
> plot(eq.dwt)
> plot(ex.dwt)
> # -- energy distributions (Table 4.2) --#
> dotchart(eq.dwt)    # a graphic
> summary(eq.dwt)     # numerical details
> dotchart(ex.dwt)
> summary(ex.dwt)
> #-- time scale plots (Figs 4.24-4.25 but not in 3d) --#
> time.scale.plot(eq.dwt)
> time.scale.plot(ex.dwt)

Similar analyses may be performed in R using the wavethresh or the waveslim packages. We exhibit the analysis for the earthquake series using waveslim, assuming it has been downloaded into R and then loaded at the start of the R session.

> eq.dwt = dwt(eq, n.levels=6)
> # -- plot the dwt and calculate TP -- #
> TP = matrix(0,7,1)
> par(mfcol=c(7,1), pty="m", mar=c(3,4,2,2))
> for(i in 1:6){
+   plot.ts(up.sample(eq.dwt[[i]], 2^i), type="h", axes=F,
+     ylab=names(eq.dwt)[i])
+   abline(h=0)
+   axis(side=2)
+   TP[i] = sum(eq.dwt[[i]]^2)
+ }
> plot.ts(up.sample(eq.dwt[[7]], 2^6), type="h", axes=F,
+   ylab=names(eq.dwt)[7])
> abline(h=0)
> axis(side=2)
> axis(side=1)
> TP[7] = sum(eq.dwt[[7]]^2)
> TP/sum(eq^2)     # the energy distribution

In the R code, we plotted the wavelet transform on different scales. To plot the ordinates of the wavelet transforms on the same scale, include a command like ylim=c(-1.5,1.5) in each plot.ts() command.

Wavelets can be used to perform nonparametric smoothing along the lines first discussed in §2.4, but with an emphasis on localized behavior. Although a considerable amount of literature exists on this topic, we will present the basic ideas. For further information, we refer the reader to Donoho and Johnstone (1994, 1995). As in §2.4, we suppose the data x_t can be written in terms of a signal plus noise model as

xt = st + εt. (4.126)

The goal here is to remove the noise from the data, and obtain an estimate of the signal, s_t, without having to specify a parametric form of the signal. The technique based on wavelets is referred to as waveshrink.

The basic idea behind waveshrink is to shrink the wavelet coefficients in the DWT of x_t toward zero in an attempt to denoise the data and then to estimate the signal via (4.121) with the new coefficients. One obvious way to shrink the coefficients toward zero is to simply zero out any coefficient smaller in magnitude than some predetermined value, λ. Such a shrinkage rule is discontinuous, and sometimes it is preferable to use a continuous shrinkage function. One such method, termed soft shrinkage, proceeds as follows. If the value of a coefficient is a, we set that coefficient to zero if |a| ≤ λ, and to sign(a)(|a| − λ) if |a| > λ. The choice of a shrinkage method is based on the goal of the signal extraction. This process entails choosing a value for the shrinkage threshold, λ, and we may wish to use a different threshold value, say, λ_j, for each level of scale j = 1, . . . , J. One particular method that works well if we are interested in a relatively high degree of smoothness in the estimate is to choose λ = σ̂_ε sqrt(2 log n) for all scale levels, where σ̂_ε is an estimate of the scale of the noise, σ_ε. Typically a robust estimate of σ_ε is used, e.g., the median of the absolute deviations of the data from the median (MAD). For other thresholding techniques or for a better understanding of waveshrink, see Donoho and Johnstone (1994, 1995), or the S-PLUS wavelets module manual (Bruce and Gao, 1996, Ch 6).
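The soft shrinkage rule and the universal threshold are easy to express directly. The following lines are a minimal sketch, not from the text, applied to a stand-in vector d of detail coefficients; in practice d would come from the DWT of the data.

> soft = function(d, lambda) sign(d)*pmax(abs(d) - lambda, 0)   # soft shrinkage rule
> d = rnorm(1024)                  # stand-in for a vector of detail coefficients
> n = length(d)
> sigma.eps = mad(d)               # robust (MAD) estimate of the noise scale
> lambda = sigma.eps*sqrt(2*log(n))
> d.shrunk = soft(d, lambda)       # most pure-noise coefficients are set to zero
> mean(d.shrunk == 0)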

Example 4.22 Waveshrink Analysis of the Explosion and Earthquake Series

Figure 4.26 shows the results of a waveshrink analysis on the earthquake and explosion series. In this example, soft shrinkage was used with a universal threshold of λ = σ̂_ε sqrt(2 log n), where σ̂_ε is the MAD. Figure 4.26 displays the data x_t, the estimated signal, as well as the residuals. According to this analysis, the earthquake is mostly signal and characterized by prolonged energy, whereas the explosion is comprised of short bursts of energy.

Figure 4.26 was generated in S-PLUS using the wavelets module. For example, the analysis of the earthquake series was performed as follows.

> eq.dwt <- dwt(eq)
> eq.shrink <- waveshrink(eq.dwt, shrink.rule="universal",
+   shrink.fun="soft")

In R, using the waveslim package for the earthquake series, use the following commands.

> eq.dwt = dwt(eq, n.levels=6)
> eq.trsh = universal.thresh(eq.dwt, hard=F)
> eq.smo = idwt(eq.trsh)
> par(mfrow=c(3,1))
> plot.ts(eq, ylab="Earthquake", ylim=c(-.5,.5))
> plot.ts(eq.smo, ylab="Smoothed Earthquake", ylim=c(-.5,.5))
> plot.ts(eq-eq.smo, ylab="Noise", ylim=c(-.5,.5))

4.10 Lagged Regression Models

One of the intriguing possibilities offered by the coherence analysis of the relation between the SOI and Recruitment series discussed in Example 4.16 would be extending classical regression to the analysis of lagged regression models of the form

y_t = Σ_{r=−∞}^{∞} β_r x_{t−r} + v_t,   (4.127)

where v_t is a stationary noise process, x_t is the observed input series, and y_t is the observed output series. We are interested in estimating the filter coefficients β_r relating the adjacent lagged values of x_t to the output series y_t.

Figure 4.26 Waveshrink estimates of the earthquake signal and of the explosion signal.

In the case of SOI and Recruitment series, we might identify the El Nino driving series, SOI, as the input, x_t, and y_t, the Recruitment series, as the output. In general, there will be more than a single possible input series and we may envision a q × 1 vector of driving series. This multivariate input situation is covered in Chapter 7. The model given by (4.127) is useful under several different scenarios, corresponding to different assumptions that can be made about the components.

We assume that the inputs and outputs have zero means and are jointly stationary with the 2 × 1 vector process (x_t, y_t)′ having a spectral matrix of the form

f(ω) = ( f_{xx}(ω)  f_{xy}(ω)
         f_{yx}(ω)  f_{yy}(ω) ).   (4.128)

Here, f_{xy}(ω) is the cross-spectrum relating the input x_t to the output y_t, and f_{xx}(ω) and f_{yy}(ω) are the spectra of the input and output series, respectively. Generally, we observe two series, regarded as input and output, and search for regression functions {β_t} relating the inputs to the outputs. We assume all autocovariance functions satisfy the absolute summability conditions of the form (4.31).

Then, minimizing the mean squared error

MSE = E( y_t − Σ_{r=−∞}^{∞} β_r x_{t−r} )^2   (4.129)

leads to the usual orthogonality conditions

E[ ( y_t − Σ_{r=−∞}^{∞} β_r x_{t−r} ) x_{t−s} ] = 0   (4.130)

for all s = 0, ±1, ±2, . . .. Taking the expectations inside leads to the normal equations

Σ_{r=−∞}^{∞} β_r γ_{xx}(s − r) = γ_{yx}(s)   (4.131)

for s = 0, ±1, ±2, . . .. These equations might be solved, with some effort, if the covariance functions were known exactly. If data (x_t, y_t) for t = 1, . . . , n are available, we might use a finite approximation to the above equations with the estimates γ̂_{xx}(h) and γ̂_{yx}(h) substituted into (4.131). If the regression vectors are essentially zero for |s| ≥ M/2, and M < n, the system (4.131) would be of full rank and the solution would involve inverting an (M − 1) × (M − 1) matrix.

A frequency domain approximate solution is easier in this case for two reasons. First, the computations depend on spectra and cross-spectra that can be estimated from sample data using the techniques of §4.6. In addition, no matrices will have to be inverted, although the frequency domain ratio will have to be computed for each frequency. In order to develop the frequency domain solution, substitute the representation (4.85) into the normal equations, using the convention defined in (4.128). The left side of (4.131) can then be written in the form

∫_{−1/2}^{1/2} Σ_{r=−∞}^{∞} β_r e^{2πiω(s−r)} f_{xx}(ω) dω = ∫_{−1/2}^{1/2} e^{2πiωs} B(ω) f_{xx}(ω) dω,

where

B(ω) = Σ_{r=−∞}^{∞} β_r e^{−2πiωr}   (4.132)


is the Fourier transform of the regression coefficients β_t. Now, because γ_{yx}(s) is the inverse transform of the cross-spectrum f_{yx}(ω), we might write the system of equations in the frequency domain, using the uniqueness of the Fourier transform, as

B(ω)fxx(ω) = fyx(ω), (4.133)

which then become the analogs of the usual normal equations. Then, we may take

B̂(ω_k) = f̂_{yx}(ω_k) / f̂_{xx}(ω_k)   (4.134)

as the estimator for the Fourier transform of the regression coefficients, evaluated at some subset of fundamental frequencies ω_k = k/M with M << n. Generally, we assume smoothness of B(·) over intervals of the form {ω_k + ℓ/n; ℓ = −(L − 1)/2, . . . , (L − 1)/2}. The inverse transform of the function B̂(ω) would give β̂_t, and we note that the discrete time approximation can be taken as

β̂_t = M^{−1} Σ_{k=0}^{M−1} B̂(ω_k) e^{2πiω_k t}   (4.135)

for t = 0, ±1, ±2, . . . , ±(M/2 − 1). If we were to use (4.135) to define β̂_t for |t| ≥ M/2, we would end up with a sequence of coefficients that is periodic with a period of M. In practice we define β̂_t = 0 for |t| ≥ M/2 instead. Problem 4.32 explores the error resulting from this approximation.
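A rough version of the frequency domain calculation in (4.134)-(4.135) can be pieced together from smoothed DFTs. The following sketch is not the ASTSA code used in the example below; it assumes the objects soi and rec hold the SOI and Recruitment series, smooths with L = 15 weights, and, for simplicity, inverts B̂(ω) over all n Fourier frequencies rather than the coarser grid ω_k = k/M.

> n = length(soi)
> dx = fft(soi - mean(soi)); dy = fft(rec - mean(rec))   # DFTs (common scaling cancels)
> k = kernel("daniell", 7)                               # L = 15 smoothing weights
> fxx = kernapply(Re(dx*Conj(dx)), k, circular=TRUE)     # smoothed spectrum of the input
> fyx = complex(real     = kernapply(Re(dy*Conj(dx)), k, circular=TRUE),
+               imaginary= kernapply(Im(dy*Conj(dx)), k, circular=TRUE))
> B = fyx/fxx                             # estimate of B(w_j), as in (4.134)
> beta = Re(fft(B, inverse=TRUE))/n       # inverse transform, as in (4.135)
> # beta[1] is lag 0, beta[2] is lag 1, ..., beta[n] is lag -1
> plot(-15:15, c(beta[(n-14):n], beta[1:16]), type="h",
+      xlab="lag", ylab="impulse response")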

Example 4.23 Lagged Regression Results for SOI and Recruitment Series

The high coherence between the SOI and Recruitment series noted in Example 4.16 suggests a lagged regression relation between the two series. A natural direction of implication is suggested because we feel that the sea surface temperature or SOI should be the input and the Recruitment series should be the output. With this in mind, let x_t be the SOI series and y_t the Recruitment series.

Although we think naturally of the SOI as the input and the Recruitment as the output, two input-output configurations are of interest. With SOI as the input, the model is

y_t = Σ_{r=−∞}^{∞} a_r x_{t−r} + w_t,

whereas a model that reverses the two roles would be

x_t = Σ_{r=−∞}^{∞} b_r y_{t−r} + v_t,

Figure 4.27 Estimated impulse response functions relating SOI to Recruitment (top) and Recruitment to SOI (bottom), L = 15, M = 32.

where w_t and v_t are white noise processes. Even though there is no plausible environmental explanation for the second of these two models, displaying both possibilities helps to settle on a parsimonious transfer function model. The two estimated regression or impulse response functions with M = 32 and L = 15 are shown in Figure 4.27. Note the negative peak at a lag of five points in the first of the two situations where the SOI series is assumed to be the input. The fall-off after lag five seems to be approximately exponential. A possible model for this situation is

y_t = −22x_{t−5} − 15x_{t−6} − 11x_{t−7} − 10x_{t−8} − 7x_{t−9} − · · · + w_t.

If we examine the inverse relation, namely, a regression model with the Recruitment series y_t as the input, we get a much simpler model that seems to depend on only two coefficients, namely,

xt = .012yt+4 − .018yt+5 + vt,

or, shifting by five points and transposing,

yt = .667yt−1 − 56xt−5 + εt,

where εt is white noise. Using the backshift operator, we may write

(1 − .667B)yt = −56B5xt + εt.

The analysis of this example was performed using the time series package ASTSA, which is available for download from the website of this text.

The example shows we can get a clean estimator for the transfer functions relating the two series if the coherence ρ^2_{xy}(ω) is large. The reason is that we can write the minimized mean squared error (4.129) as

MSE = E[ ( y_t − Σ_{r=−∞}^{∞} β_r x_{t−r} ) y_t ] = γ_{yy}(0) − Σ_{r=−∞}^{∞} β_r γ_{xy}(−r),

using the result about the orthogonality of the data and error term in the Projection theorem. Then, substituting the spectral representations of the autocovariance and cross-covariance functions and identifying the Fourier transform (4.132) in the result leads to

MSE = ∫_{−1/2}^{1/2} [ f_{yy}(ω) − B(ω) f_{xy}(ω) ] dω = ∫_{−1/2}^{1/2} f_{yy}(ω) [ 1 − ρ^2_{yx}(ω) ] dω,   (4.136)

where ρ^2_{yx}(ω) is just the squared coherence given by (4.83). The similarity of (4.136) to the usual mean square error that results from predicting y from x is obvious. In that case, we would have

E(y − βx)^2 = σ_y^2 (1 − ρ_{xy}^2)

for jointly distributed random variables x and y with zero means, variances σ_x^2 and σ_y^2, and covariance σ_{xy} = ρ_{xy} σ_x σ_y. Because the mean squared error in (4.136) satisfies MSE ≥ 0 with f_{yy}(ω) a non-negative function, it follows that the coherence satisfies

0 ≤ ρ2xy(ω) ≤ 1

for all ω. Furthermore, Problem 4.33 shows the squared coherence is one when the output is linearly related to the input by the filter relation (4.127), and there is no noise, i.e., v_t = 0. Hence, the multiple coherence gives a measure of the association or correlation between the input and output series as a function of frequency.

The matter of verifying that the F-distribution claimed for (4.93) will hold when the sample coherence values are substituted for theoretical values still remains. Again, the form of the F-statistic is exactly analogous to the usual t-test for no correlation in a regression context. We give an argument leading to this conclusion later using the results in Appendix C, §C.3. Another question that has not been resolved in this section is the extension to the case of multiple inputs x_{t1}, x_{t2}, . . . , x_{tq}. Often, more than just a single input series is present that can possibly form a lagged predictor of the output series y_t. An example is the cardiovascular mortality series that depended on possibly a number of pollution series and temperature. We discuss this particular extension as a part of the multivariate time series techniques considered in Chapter 7.

4.11 Signal Extraction and Optimum Filtering

A model closely related to regression can be developed by assuming again that

y_t = Σ_{r=−∞}^{∞} β_r x_{t−r} + v_t,   (4.137)

but where the βs are known and x_t is some unknown random signal that is uncorrelated with the noise process v_t. In this case, we observe only y_t and are interested in an estimator for the signal x_t of the form

x̂_t = Σ_{r=−∞}^{∞} a_r y_{t−r}.   (4.138)

In the frequency domain, it is convenient to make the additional assumptions that the series x_t and v_t are both mean-zero stationary series with spectra f_{xx}(ω) and f_{vv}(ω), often referred to as the signal spectrum and noise spectrum, respectively. Often, the special case β_t = δ_t, in which δ_t is the Kronecker delta, is of interest because (4.137) reduces to the simple signal plus noise model

yt = xt + vt (4.139)

in that case. In general, we seek the set of filter coefficients a_t that minimize the mean squared error of estimation, say,

MSE = E[ ( x_t − Σ_{r=−∞}^{∞} a_r y_{t−r} )^2 ].   (4.140)

This problem was originally solved by Kolmogorov (1941) and by Wiener (1949), who derived the result in 1941 and published it in classified reports during World War II.


We can apply the orthogonality principle to write

E[ ( x_t − Σ_{r=−∞}^{∞} a_r y_{t−r} ) y_{t−s} ] = 0

for s = 0,±1,±2, . . ., which leads to

Σ_{r=−∞}^{∞} a_r γ_{yy}(s − r) = γ_{xy}(s),

to be solved for the filter coefficients. Substituting the spectral representations for the autocovariance functions into the above and identifying the spectral densities through the uniqueness of the Fourier transform produces

A(ω)fyy(ω) = fxy(ω), (4.141)

where A(ω) and the optimal filter a_t are Fourier transform pairs, as in Definition 4.1 for B(ω) and β_t. Now, a special consequence of the model is that (see Problem 4.23)

fxy(ω) = B(ω)fxx(ω) (4.142)

and

f_{yy}(ω) = |B(ω)|^2 f_{xx}(ω) + f_{vv}(ω),   (4.143)

implying the optimal filter would be the Fourier transform of

A(ω) = B(ω) / ( |B(ω)|^2 + f_{vv}(ω)/f_{xx}(ω) ),   (4.144)

where the second term in the denominator is just the inverse of the signal-to-noise ratio, say,

SNR(ω) = f_{xx}(ω) / f_{vv}(ω).   (4.145)

The result shows the optimum filters can be computed for this model if the signal and noise spectra are both known or if we can assume knowledge of the signal-to-noise ratio SNR(ω) as a function of frequency. In Chapter 7, we show some methods for estimating these two parameters in conjunction with random effects analysis of variance models, but we assume here that it is possible to specify the signal-to-noise ratio a priori. If the signal-to-noise ratio is known, the optimal filter can be computed by the inverse transform of the function A(ω). It is more likely that the inverse transform will be intractable and a finite filter approximation like that used in the previous section can be applied to the data. In this case, we will have

a_t^M = M^{−1} Σ_{k=0}^{M−1} A(ω_k) e^{2πiω_k t}   (4.146)

Figure 4.28 Impulse response and frequency response functions for designed SOI filters. Note the ripples in the top panel frequency response of the untapered filter.

as the estimated filter function. It will often be the case that the form of the specified frequency response will have some rather sharp transitions between regions where the signal-to-noise ratio is high and regions where there is little signal. In these cases, the shape of the frequency response function will have ripples that can introduce frequencies at different amplitudes. An aesthetic solution to this problem is to introduce tapering, as was done with spectral estimation in (4.61)-(4.68). We use below the tapered filter ã_t = h_t a_t, where h_t is the cosine taper given in (4.68). The squared frequency response of the resulting filter will be |Ã(ω)|^2, where

Ã(ω) = Σ_{t=−∞}^{∞} a_t h_t e^{−2πiωt}.   (4.147)

The results are illustrated in the following example that extracts the El Nino component of the sea surface temperature series.

Figure 4.29 Original SOI series (top) compared to filtered version showing the estimated El Nino temperature signal (bottom).

Example 4.24 Estimating the El Nino Signal Using Optimal Filters

Figure 4.5 shows the spectrum of the SOI series, and we note that essentially two components have power, the El Nino frequency of about .02 cycles per month and a yearly frequency of about .08 cycles per month. We assume, for this example, that we wish to preserve the lower frequency as signal and to eliminate the higher order frequencies. In this case, we assume the simple signal plus noise model

yt = xt + vt,

so that there is no convolving function β_t. Furthermore, the signal-to-noise ratio is assumed to be high to about .06 cycles per month and zero thereafter. The optimal frequency response was assumed to be unity to .05 cycles per point and then to decay linearly to zero in several steps. Figure 4.28 shows the Fourier transform, (4.146), at M = 64 frequencies, say, a_t^M, and the tapered version h_t a_t^M. The estimated squared frequency response, approximated as a long (256 point) transform of the form (4.147), has ripples when tapering is not applied and is relatively smooth for the tapered filter. Figure 4.28 shows both positive and negative frequencies. Figure 4.29 shows the original and filtered SOI index, and we see a smooth extracted signal that conveys the essence of the underlying El Nino signal. The frequency response of the designed filter can be compared with that of the symmetric 12-month moving average applied to the same series in Example 4.17. The filtered series, shown in Figure 4.3, shows a good deal of higher frequency chatter riding on the smoothed version, which has been introduced by the higher frequencies that leak through in the squared frequency response, as in Figure 4.14.

The analysis of this example was performed using the time series package ASTSA, which is available for download from the website of this text.
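The construction can be imitated with a few lines of base R. The following sketch is not the ASTSA code used in the example; it specifies a target response that is one below .05 cycles per point and decays linearly to zero by .1 (the exact target of the example is not reproduced), computes the filter coefficients as in (4.146) with M = 64, tapers them with a cosine bell, and applies the result to soi.

> M = 64
> wk = (0:(M-1))/M                      # DFT frequencies
> w = pmin(wk, 1 - wk)                  # fold onto [0, 1/2]
> Aw = rep(0, M)
> Aw[w < .05] = 1                       # pass band below .05
> ramp = (w >= .05 & w < .1)
> Aw[ramp] = (.1 - w[ramp])/.05         # decay to zero in several steps
> a = Re(fft(Aw, inverse=TRUE))/M       # a_t^M for t = 0, 1, ..., M-1, as in (4.146)
> a2 = c(a[34:64], a[1:32])             # re-order to lags -31, ..., 31
> h = .5*(1 + cos(2*pi*(-31:31)/64))    # cosine bell taper
> soif = filter(soi, a2*h, sides=2)     # apply the tapered two-sided filter
> plot.ts(cbind(soi, soif))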

The design of finite filters with a specified frequency response requires some experimentation with various target frequency response functions, and we have only touched on the methodology here. The filter designed here, sometimes called a low-pass filter, reduces the high frequencies and keeps or passes the low frequencies. Alternately, we could design a high-pass filter to keep high frequencies if that is where the signal is located. An example of a simple high-pass filter is the first difference with a frequency response that is shown in Figure 4.14. We can also design band-pass filters that keep frequencies in specified bands. For example, seasonal adjustment filters are often used in economics to reject seasonal frequencies while keeping high frequencies, lower frequencies, and trend (see, for example, Grether and Nerlove, 1970).

The filters we have discussed here are all symmetric two-sided filters, because the designed frequency response functions were purely real. Alternatively, we may design recursive filters to produce a desired response. An example of a recursive filter is one that replaces the input x_t by the filtered output

y_t = Σ_{k=1}^{p} φ_k y_{t−k} + x_t − Σ_{k=1}^{q} θ_k x_{t−k}.   (4.148)

Note the similarity between (4.148) and the ARMA(p, q) model, in which the white noise component is replaced by the input. Transposing the terms involving y_t and using the basic linear filter result in Property P4.4 leads to

f_y(ω) = [ |θ(e^{−2πiω})|^2 / |φ(e^{−2πiω})|^2 ] f_x(ω),   (4.149)

where

φ(e^{−2πiω}) = 1 − Σ_{k=1}^{p} φ_k e^{−2πikω}

and

θ(e^{−2πiω}) = 1 − Σ_{k=1}^{q} θ_k e^{−2πikω}.

Recursive filters such as those given by (4.149) distort the phases of arriving frequencies, and we do not consider the problem of designing such filters in any detail.
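To see the effect described by (4.149), the following sketch (not from the text) evaluates the rational frequency response for a simple recursive filter with p = 1 and q = 1; the coefficients are chosen only for illustration.

> phi = .9; theta = .5                 # illustrative AR and MA coefficients
> w = seq(0, .5, length=500)
> z = exp(-2i*pi*w)
> A = (1 - theta*z)/(1 - phi*z)        # theta(e^{-2 pi i w})/phi(e^{-2 pi i w})
> par(mfrow=c(2,1))
> plot(w, Mod(A)^2, type="l", ylab="squared frequency response")   # multiplier in (4.149)
> plot(w, Arg(A), type="l", ylab="phase")    # nonzero, frequency-dependent phase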


4.12 Spectral Analysis of Multidimensional Series

Multidimensional series of the form x_s, where s = (s_1, s_2, . . . , s_r)′ is an r-dimensional vector of spatial coordinates or a combination of space and time coordinates, were introduced in §1.7. The example given there, shown in Figure 1.15, was a collection of temperature measurements taken on a rectangular field. These data would form a two-dimensional process, indexed by row and column in space. In that section, the multidimensional autocovariance function of an r-dimensional stationary series was given as γ_x(h) = E[x_{s+h} x_s], where the multidimensional lag vector is h = (h_1, h_2, . . . , h_r)′.

The multidimensional wavenumber spectrum is given as the Fourier transform of the autocovariance, namely,

f_x(ω) = ∑_h γ_x(h) e^{−2πiω′h}.   (4.150)

Again, the inverse result

γ_x(h) = ∫_{−1/2}^{1/2} f_x(ω) e^{2πiω′h} dω   (4.151)

holds, where the integral is over the multidimensional range of the vector ω. The wavenumber argument is exactly analogous to the frequency argument, and we have the corresponding intuitive interpretation as the cycling rate ω_i per distance traveled s_i in the i-th direction.

Two-dimensional processes occur often in practical applications, and the representations above reduce to

f_x(ω_1, ω_2) = ∑_{h_1=−∞}^{∞} ∑_{h_2=−∞}^{∞} γ_x(h_1, h_2) e^{−2πi(ω_1h_1+ω_2h_2)}   (4.152)

and

γ_x(h_1, h_2) = ∫_{−1/2}^{1/2} ∫_{−1/2}^{1/2} f_x(ω_1, ω_2) e^{2πi(ω_1h_1+ω_2h_2)} dω_1 dω_2   (4.153)

in the case r = 2. The notion of linear filtering generalizes easily to the two-dimensional case by defining the impulse response function a_{s_1,s_2} and the spatial filter output as

y_{s_1,s_2} = ∑_{u_1} ∑_{u_2} a_{u_1,u_2} x_{s_1−u_1, s_2−u_2}.   (4.154)

The spectrum of the output of this filter can be derived as

f_y(ω_1, ω_2) = |A(ω_1, ω_2)|² f_x(ω_1, ω_2),   (4.155)

where

A(ω_1, ω_2) = ∑_{u_1} ∑_{u_2} a_{u_1,u_2} e^{−2πi(ω_1u_1+ω_2u_2)}.   (4.156)

These results are analogous to those in the one-dimensional case, described by Property P4.4.

The multidimensional DFT is also a straightforward generalization of the univariate expression. In the two-dimensional case with data on a rectangular grid, {x_{s_1,s_2}; s_1 = 1, ..., n_1, s_2 = 1, ..., n_2}, we will write, for −1/2 ≤ ω_1, ω_2 ≤ 1/2,

d(ω_1, ω_2) = (n_1n_2)^{−1/2} ∑_{s_1=1}^{n_1} ∑_{s_2=1}^{n_2} x_{s_1,s_2} e^{−2πi(ω_1s_1+ω_2s_2)}   (4.157)

as the two-dimensional DFT, where the frequencies ω_1, ω_2 are evaluated at multiples of (1/n_1, 1/n_2) on the spatial frequency scale. The two-dimensional wavenumber spectrum can be estimated by the smoothed sample wavenumber spectrum

f̄_x(ω_1, ω_2) = (L_1L_2)^{−1} ∑_{ℓ_1,ℓ_2} |d(ω_1 + ℓ_1/n_1, ω_2 + ℓ_2/n_2)|²,   (4.158)

where the sum is taken over the grid {−m_j ≤ ℓ_j ≤ m_j; j = 1, 2}, where L_1 = 2m_1 + 1 and L_2 = 2m_2 + 1. The statistic

2L_1L_2 f̄_x(ω_1, ω_2) / f_x(ω_1, ω_2)  ·∼  χ²_{2L_1L_2}   (4.159)

can be used to set confidence intervals or make approximate tests against a fixed assumed spectrum f_0(ω_1, ω_2). We may also extend this analysis to weighted estimation and window estimation as discussed in §4.5.

Example 4.25 Wavenumber Spectrum of Soil Surface Temperatures

As an example, consider the periodogram of the two-dimensional temperature series shown in Figure 1.15 and analyzed by Bazza et al. (1988). We recall the spatial coordinates in this case will be (s_1, s_2), which define rows and columns, so that the frequencies in the two directions will be expressed as cycles per row and cycles per column. Figure 4.30 shows the periodogram of the two-dimensional temperature series, and we note the ridge of strong spectral peaks running over rows at a column frequency of zero. An obvious periodic component appears at frequencies of .0625 and −.0625 cycles per row, which corresponds to 16 rows or about 272 ft. On further investigation of previous irrigation patterns over this field, treatment levels of salt varied periodically over columns. This analysis is extended in Problem 4.17, where we recover the salt treatment profile over rows and compare it to a signal computed by averaging over columns.
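A two-dimensional periodogram of this kind can be computed with the multidimensional FFT in R. The sketch below assumes the temperature field is stored in a matrix called soiltemp (rows by columns); the object name, and the choice to display only the lowest 16 wavenumbers in each direction, are illustrative assumptions.

# Two-dimensional periodogram |d(w1,w2)|^2 of a field stored in the matrix
# soiltemp (an assumed data object); see (4.157).
n1 = nrow(soiltemp); n2 = ncol(soiltemp)
per = Mod(fft(soiltemp - mean(soiltemp)))^2 / (n1*n2)
w1 = (0:(n1-1))/n1                            # cycles per row
w2 = (0:(n2-1))/n2                            # cycles per column
persp(w1[1:16], w2[1:16], per[1:16, 1:16], theta = 30, phi = 30,
      xlab = "cycles/row", ylab = "cycles/column", zlab = "periodogram")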

Figure 4.30 Two-dimensional periodogram of soil temperature profile showing peak at .0625 cycles/row. The period is 16 rows, and this corresponds to 16 × 17 ft = 272 ft.

Another application of two-dimensional spectral analysis of agricultural field trials is given in McBratney and Webster (1981), who used it to detect ridge and furrow patterns in yields. The requirement for regular, equally spaced samples on fairly large grids has tended to limit enthusiasm for strict two-dimensional spectral analysis. An exception is when a propagating signal from a given velocity and azimuth is present, so predicting the wavenumber spectrum as a function of velocity and azimuth becomes feasible (see Shumway et al., 1999).

Problems

Section 4.2

4.1 Repeat the simulations and analyses in Examples 4.1 and 4.2 with the following changes:

(a) Change the sample size to n = 128 and generate and plot the same series as in Example 4.1:

x_{t1} = 2 cos(2πt 6/100) + 3 sin(2πt 6/100),
x_{t2} = 4 cos(2πt 10/100) + 5 sin(2πt 10/100),
x_{t3} = 6 cos(2πt 40/100) + 7 sin(2πt 40/100),
x_t = x_{t1} + x_{t2} + x_{t3}.

What is the major difference between these series and the series generated in Example 4.1? (Hint: The answer is fundamental. But if your answer is the series are longer, you may be punished severely.)

(b) As in Example 4.2, compute and plot the periodogram of the series, x_t, generated in (a) and comment.

(c) Repeat the analyses of (a) and (b) but with n = 100 (as in Example 4.1), and adding noise to x_t; that is

x_t = x_{t1} + x_{t2} + x_{t3} + w_t

where w_t ∼ iid N(0, 25). That is, you should simulate and plot the data, and then plot the periodogram of x_t and comment.

4.2 With reference to equations (4.2) and (4.3), let Z_1 = U_1 and Z_2 = −U_2 be independent, standard normal variables. Consider the polar coordinates of the point (Z_1, Z_2), that is,

A² = Z_1² + Z_2²  and  φ = tan⁻¹(Z_2/Z_1).

(a) Find the joint density of A² and φ, and from the result, conclude that A² and φ are independent random variables, where A² is a chi-squared random variable with 2 df, and φ is uniformly distributed on (−π, π).

(b) Going in reverse from polar coordinates to rectangular coordinates, suppose we assume that A² and φ are independent random variables, where A² is chi-squared with 2 df, and φ is uniformly distributed on (−π, π). With Z_1 = A cos(φ) and Z_2 = A sin(φ), where A is the positive square root of A², show that Z_1 and Z_2 are independent, standard normal random variables.

4.3 Verify (4.5).

Section 4.3

4.4 A time series was generated by first drawing the white noise series w_t from a normal distribution with mean zero and variance one. The observed series x_t was generated from

x_t = w_t − θw_{t−1},  t = 0, ±1, ±2, . . . ,

where θ is a parameter.

(a) Derive the theoretical mean value and autocovariance functions for the series x_t and w_t. Are the series x_t and w_t stationary? Give your reasons.

(b) Give a formula for the power spectrum of x_t, expressed in terms of θ and ω.

4.5 A first-order autoregressive model is generated from the white noise series w_t using the generating equations

x_t = φx_{t−1} + w_t,

where φ, for |φ| < 1, is a parameter and the w_t are independent random variables with mean zero and variance σ²_w.

(a) Show the power spectrum of x_t is given by

f_x(ω) = σ²_w / [1 + φ² − 2φ cos(2πω)].

(b) Verify the autocovariance function of this process is

γ_x(h) = σ²_w φ^{|h|} / (1 − φ²),

h = 0, ±1, ±2, . . ., by showing that the inverse transform of γ_x(h) is the spectrum derived in part (a).

4.6 In applications, we will often observe series containing a signal that has been delayed by some unknown time D, i.e.,

x_t = s_t + A s_{t−D} + n_t,

where s_t and n_t are stationary and independent with zero means and spectral densities f_s(ω) and f_n(ω), respectively. The delayed signal is multiplied by some unknown constant A.

(a) Prove

f_x(ω) = [1 + A² + 2A cos(2πωD)] f_s(ω) + f_n(ω).

(b) How could the periodicity expected in the spectrum derived in (a) be used to estimate the delay D? (Hint: Consider the case where f_n(ω) = 0; i.e., there is no noise.)

4.7 Suppose x_t and y_t are stationary zero-mean time series with x_t independent of y_s for all s and t. Consider the product series

z_t = x_t y_t.

Prove the spectral density for z_t can be written as

f_z(ω) = ∫_{−1/2}^{1/2} f_x(ω − ν) f_y(ν) dν.

Figure 4.31 Smoothed 12-month sunspot numbers sampled twice per year, n = 459.

Section 4.4

4.8 Figure 4.31 shows the biyearly smoothed (12-month moving average) number of sunspots from June 1749 to December 1978 with n = 459 points that were taken twice per year. With Example 4.9 as a guide, perform a periodogram analysis of the sunspot data (the data are in the file sunspots.dat) identifying the predominant periods and obtaining confidence intervals for the identified periods. Interpret your findings.

4.9 The levels of salt concentration known to have occurred over rows, corresponding to the average temperature levels for the soil science data considered in Figures 1.15 and 1.16, are shown in Figure 4.32. The data are in the file salt.dat, which consists of one column of 128 observations; the first 64 observations correspond to the temperature series. Identify the dominant frequencies by performing separate spectral analyses on the two series. Include confidence intervals for the dominant frequencies and interpret your findings.

4.10 Let the observed series x_t be composed of a periodic signal and noise so it can be written as

x_t = β_1 cos(2πω_k t) + β_2 sin(2πω_k t) + w_t,

where w_t is a white noise process with variance σ²_w. The frequency ω_k is assumed to be known and of the form k/n in this problem. Suppose we consider estimating β_1, β_2 and σ²_w by least squares, or equivalently, by maximum likelihood if the w_t are assumed to be Gaussian.

(a) Prove, for a fixed ω_k, the minimum squared error is attained by

(β̂_1, β̂_2)′ = 2n^{−1/2} (d_c(ω_k), d_s(ω_k))′,

where the cosine and sine transforms (4.24) and (4.25) appear on the right-hand side.

Figure 4.32 Temperature and salt profiles over 64 rows at 17-ft spacing.

(b) Prove that the error sum of squares can be written as

SSE = ∑_{t=1}^{n} x_t² − 2 I_x(ω_k)

so that the value of ω_k that minimizes squared error is the same as the value that maximizes the periodogram estimator I_x(ω_k) in (4.21).

(c) Under the Gaussian assumption and fixed ω_k, show that the F-test of no regression leads to an F-statistic that is a monotone function of I_x(ω_k).

4.11 Prove the convolution property of the DFT, namely,

∑_{s=1}^{n} a_s x_{t−s} = ∑_{k=0}^{n−1} d_A(ω_k) d_x(ω_k) exp{2πiω_k t},

for t = 1, 2, . . . , n, where d_A(ω_k) and d_x(ω_k) are the discrete Fourier transforms of a_t and x_t, respectively, and we assume that x_t = x_{t+n} is periodic.

Section 4.5

4.12 Repeat Problem 4.8 using a nonparametric spectral estimation procedure. In addition to discussing your findings in detail, comment on your choice of a spectral estimate with regard to smoothing and tapering.

4.13 Repeat Problem 4.9 using a nonparametric spectral estimation procedure. In addition to discussing your findings in detail, comment on your choice of a spectral estimate with regard to smoothing and tapering.

4.14 The periodic behavior of a time series induced by echoes can also be observed in the spectrum of the series; this fact can be seen from the results stated in Problem 4.6(a). Using the notation of that problem, suppose we observe x_t = s_t + A s_{t−D} + n_t, which implies the spectra satisfy f_x(ω) = [1 + A² + 2A cos(2πωD)] f_s(ω) + f_n(ω). If the noise is negligible (f_n(ω) ≈ 0) then log f_x(ω) is approximately the sum of a periodic component, log[1 + A² + 2A cos(2πωD)], and log f_s(ω). Bogert et al. (1962) proposed treating the detrended log spectrum as a pseudo time series and calculating its spectrum, or cepstrum, which should show a peak at a quefrency corresponding to 1/D. The cepstrum can be plotted as a function of quefrency, from which the delay D can be estimated.

For the speech series presented in Example 1.3, estimate the pitch period using cepstral analysis as follows. The data are in the file speech.dat.

(a) Calculate and display the log-periodogram of the data. Is the periodogram periodic, as predicted?

(b) Perform a cepstral (spectral) analysis on the detrended logged periodogram, and use the results to estimate the delay D. How does your answer compare with the analysis of Example 1.24, which was based on the ACF?

4.15 Use Property P4.1 to verify (4.63). Then verify (4.66) and (4.67).

4.16 Consider two time series

x_t = w_t − w_{t−1},
y_t = (1/2)(w_t + w_{t−1}),

formed from the white noise series w_t with variance σ²_w = 1.

(a) Are x_t and y_t jointly stationary? Recall the cross-covariance function must also be a function only of the lag h and cannot depend on time.

(b) Compute the spectra f_y(ω) and f_x(ω), and comment on the difference between the two results.

(c) Suppose sample spectral estimators f̂_y(.10) are computed for the series using L = 3. Find a and b such that

P{a ≤ f̂_y(.10) ≤ b} = .90.

This expression gives two points that will contain 90% of the sample spectral values. Put 5% of the area in each tail.

Section 4.6

4.17 Analyze the coherency between the temperature and salt data discussed in Problem 4.9. Discuss your findings.

4.18 Consider two processes

x_t = w_t  and  y_t = φx_{t−D} + v_t

where w_t and v_t are independent white noise processes with common variance σ², φ is a constant, and D is a fixed integer delay.

(a) Compute the coherency between x_t and y_t.

(b) Simulate n = 1024 normal observations from x_t and y_t for φ = .9, σ² = 1, and D = 0. Then estimate and plot the coherency between the simulated series for the following values of L and comment:
(i) L = 1, (ii) L = 3, (iii) L = 41, and (iv) L = 101.

Section 4.7

4.19 For the processes in Problem 4.18,

(a) Compute the phase between x_t and y_t.

(b) Simulate n = 1024 observations from x_t and y_t for φ = .9, σ² = 1, and D = 1. Then estimate and plot the phase between the simulated series for the following values of L and comment:
(i) L = 1, (ii) L = 3, (iii) L = 41, and (iv) L = 101.

4.20 Consider the bivariate time series records containing monthly U.S. production as measured by the Federal Reserve Board Production Index and unemployment as given in Figure 3.22.

(a) Compute the spectrum and the log spectrum for each series, and identify statistically significant peaks. Explain what might be generating the peaks. Compute the coherence, and explain what is meant when a high coherence is observed at a particular frequency.

(b) What would be the effect of applying the filter

u_t = x_t − x_{t−1}  followed by  v_t = u_t − u_{t−12}

to the series given above? Plot the predicted frequency responses of the simple difference filter and of the seasonal difference of the first difference.

(c) Apply the filters successively to one of the two series and plot the output. Examine the output after taking a first difference and comment on whether stationarity is a reasonable assumption. Why or why not? Plot after taking the seasonal difference of the first difference. What can be noticed about the output that is consistent with what you have predicted from the frequency response? Verify by computing the spectrum of the output after filtering.

4.21 Determine the theoretical power spectrum of the series formed by combining the white noise series w_t to form

yt = wt−2 + 4wt−1 + 6wt + 4wt+1 + wt+2.

Determine which frequencies are present by plotting the power spectrum.

4.22 Let x_t = cos(2πωt), and consider the output

y_t = ∑_{k=−∞}^{∞} a_k x_{t−k},

where ∑_k |a_k| < ∞. Show

y_t = |A(ω)| cos(2πωt + φ(ω)),

where |A(ω)| and φ(ω) are the amplitude and phase of the filter, respectively. Interpret the result in terms of the relationship between the input series, x_t, and the output series, y_t.

4.23 Suppose x_t is a stationary series, and we apply two filtering operations in succession, say,

y_t = ∑_r a_r x_{t−r},  and then  z_t = ∑_s b_s y_{t−s}.

(a) Show the spectrum of the output is

f_z(ω) = |A(ω)|² |B(ω)|² f_x(ω),

where A(ω) and B(ω) are the Fourier transforms of the filter sequences a_t and b_t, respectively.

(b) What would be the effect of applying the filter

ut = xt − xt−1

followed by

v_t = u_t − u_{t−12}

to a time series?

(c) Plot the predicted frequency responses of the simple difference filter and of the seasonal difference of the first difference. Filters like these are called seasonal adjustment filters in economics because they tend to attenuate frequencies at multiples of the monthly periods. The difference filter tends to attenuate low-frequency trends.

4.24 Suppose we are given a stationary zero-mean series x_t with spectrum f_x(ω) and then construct the derived series

y_t = a y_{t−1} + x_t,  t = ±1, ±2, ... .

(a) Show how the theoretical f_y(ω) is related to f_x(ω).

(b) Plot the function that multiplies f_x(ω) in part (a) for a = .1 and for a = .8. This filter is called a recursive filter.

Section 4.8

4.25 Often, the periodicities in the sunspot series are investigated by fitting an autoregressive spectrum of sufficiently high order. The main periodicity is often stated to be in the neighborhood of 11 years. Fit an autoregressive spectral estimator to the sunspot data using a model selection method of your choice. Compare the result with a conventional nonparametric spectral estimator found in Problem 4.8.

4.26 Fit an autoregressive spectral estimator to the Recruitment series and compare it to the results of Example 4.11.

4.27 Suppose a sample time series with n = 256 points is available from the first-order autoregressive model. Furthermore, suppose a sample spectrum computed with L = 3 yields the estimated value f̂_x(1/8) = 2.25. Is this sample value consistent with σ²_w = 1, φ = .5? Repeat using L = 11 if we just happen to obtain the same sample value.

4.28 Suppose we wish to test the noise alone hypothesis H_0: x_t = n_t against the signal-plus-noise hypothesis H_1: x_t = s_t + n_t, where s_t and n_t are uncorrelated zero-mean stationary processes with spectra f_s(ω) and f_n(ω). Suppose that we want the test over a band of L = 2m + 1 frequencies of the form ω_{j:n} + k/n, for k = 0, ±1, ±2, . . . , ±m near some fixed frequency ω. Assume that both the signal and noise spectra are approximately constant over the interval.

(a) Prove the approximate likelihood-based test statistic for testing H_0 against H_1 is proportional to

T = ∑_k |d_x(ω_{j:n} + k/n)|² ( 1/f_n(ω) − 1/[f_s(ω) + f_n(ω)] ).

(b) Find the approximate distributions of T under H_0 and H_1.

(c) Define the false alarm and signal detection probabilities as P_F = P{T > K | H_0} and P_d = P{T > K | H_1}, respectively. Express these probabilities in terms of the signal-to-noise ratio f_s(ω)/f_n(ω) and appropriate chi-squared integrals.

Section 4.9

4.29 Repeat the dynamic Fourier analysis of Example 4.20 on the remaining seven earthquakes and seven explosions in the data file eq+exp.dat. Do the conclusions about the difference between earthquakes and explosions stated in the example still seem valid?

4.30 Repeat the wavelet analyses of Examples 4.21 and 4.22 on all earthquake and explosion series in the data file eq+exp.dat. Do the conclusions about the difference between earthquakes and explosions stated in Examples 4.21 and 4.22 still seem valid?

4.31 Using Examples 4.20-4.22 as a guide, perform a dynamic Fourier analysis and wavelet analyses (dwt and waveshrink analysis) on the event of unknown origin that took place near the Russian nuclear test facility in Novaya Zemlya. State your conclusion about the nature of the event at Novaya Zemlya.

Section 4.10

4.32 Consider the problem of approximating the filter output

y_t = ∑_{k=−∞}^{∞} a_k x_{t−k},   ∑_{k=−∞}^{∞} |a_k| < ∞,

by

y_t^M = ∑_{|k|<M/2} a_k^M x_{t−k}

for t = M/2 − 1, M/2, . . . , n − M/2, where x_t is available for t = 1, . . . , n and

a_t^M = M^{−1} ∑_{k=0}^{M−1} A(ω_k) exp{2πiω_k t}

with ω_k = k/M. Prove

E{(y_t − y_t^M)²} ≤ 4γ_x(0) ( ∑_{|k|≥M/2} |a_k| )².

Figure 4.33 Monthly values of weather and inflow at Shasta Lake (air temperature, dew point, cloud cover, wind speed, precipitation, and inflow).

4.33 Prove the squared coherence ρ²_{y·x}(ω) = 1 for all ω when

y_t = ∑_{r=−∞}^{∞} a_r x_{t−r},

that is, when x_t and y_t can be related exactly by a linear filter.

4.34 Figure 4.33 contains 454 months of measured values for the climatic variables air temperature, dew point, cloud cover, wind speed, precipitation, and inflow at Shasta Lake in California. We would like to look at possible relations among the weather factors and between the weather factors and the inflow to Shasta Lake.

(a) Argue the strongest determinant of the inflow series is precipitation using the coherence functions. Use transformed inflow I_t = log i_t, where i_t is inflow, and transformed precipitation P_t = √p_t, where p_t is precipitation. It should be mentioned here that Chapter 6 discusses methods for determining whether inflow might depend jointly on several input series.

(b) Using the estimated impulse response function, argue for the model

I_t = α_0 + [α_1 / (1 − φB)] P_t,

where the notation is as discussed in Chapter 2. What would be a reasonable value for φ? Assume the means are taken out of the series before the analysis begins.

Section 4.11

4.35 Consider the signal plus noise model

y_t = ∑_{r=−∞}^{∞} β_r x_{t−r} + v_t,

where the signal and noise series, x_t and v_t, are both stationary with spectra f_x(ω) and f_v(ω), respectively. Assuming that x_t and v_t are independent of each other for all t, verify (4.142) and (4.143).

4.36 Consider the model

y_t = x_t + v_t,

where

x_t = φ_1 x_{t−1} + w_t,

such that v_t is Gaussian white noise and independent of x_t with var(v_t) = σ²_v, and w_t is Gaussian white noise and independent of v_t, with var(w_t) = σ²_w, and |φ_1| < 1 and E x_0 = 0. Prove that the spectrum of the observed series y_t is

f_y(ω) = σ² |1 − θ_1 e^{−2πiω}|² / |1 − φ_1 e^{−2πiω}|²,

where

θ_1 = [c ± √(c² − 4)] / 2,

σ² = σ²_v φ_1 / θ_1,

and

c = [σ²_w + σ²_v(1 + φ_1²)] / (σ²_v φ_1).

4.37 Consider the same model as in the preceding problem.

(a) Prove the optimal smoothed estimator of the form

x̂_t = ∑_{s=−∞}^{∞} a_s y_{t−s}

has

a_s = (σ²_w / σ²) θ_1^{|s|} / (1 − θ_1²).

(b) Show the mean square error is given by

E{(x_t − x̂_t)²} = σ²_v σ²_w / [σ²(1 − θ_1²)].

(c) Compare mean square error of the estimator in part (b) with that of the optimal finite estimator of the form

x̂_t = a_1 y_{t−1} + a_2 y_{t−2}

when σ²_v = .053, σ²_w = .172, and φ_1 = .9.

Section 4.12

4.38 Consider the two-dimensional linear filter given as the output (4.154).

(a) Express the two-dimensional autocovariance function of the output, say, γ_y(h_1, h_2), in terms of an infinite sum involving the autocovariance function of x_s and the filter coefficients a_{s_1,s_2}.

(b) Use the expression derived in (a), combined with (4.153) and (4.156), to derive the spectrum of the filtered output (4.155).

The following problems require the supplemental material given in Appendix C.

4.39 Let w_t be a Gaussian white noise series with variance σ²_w. Prove that the results of Theorem C.4 hold without error for the DFT of w_t.

4.40 Show that condition (4.41) implies (C.19) under the assumption that w_t ∼ wn(0, σ²_w).

4.41 Prove Lemma C.4.

4.42 Finish the proof of Theorem C.5.

4.43 For the zero-mean complex random vector z = x_c − i x_s, with cov(z) = Σ = C − iQ, with Σ = Σ*, define

w = 2 Re(a*z),

where a = a_c − i a_s is an arbitrary non-zero complex vector. Prove

cov(w) = 2a*Σa.

Recall ∗ denotes the complex conjugate transpose.

Chapter 5

Additional Time Domain Topics

5.1 Introduction

In this chapter, we present material that may be considered special or advanced topics in the time domain. Chapter 6 is devoted to one of the most useful and interesting time domain topics, state-space models. So, we do not cover state-space models or related topics (of which there are many) in this chapter. This chapter consists of sections of independent topics that may be read in any order. Most of the sections depend on a basic knowledge of ARMA models, forecasting and estimation, which is the material covered in Chapter 3, §3.1-§3.8. A few sections, for example the first section on long memory models, require some knowledge of spectral analysis and related topics covered in Chapter 4. In addition to long memory, we discuss GARCH models, threshold models, regression with autocorrelated errors, lagged regression or transfer functions, and selected topics in multivariate ARMAX models.

5.2 Long Memory ARMA and Fractional Differencing

The conventional ARMA(p, q) process is often referred to as a short memory process because the coefficients in the representation

x_t = ∑_{j=0}^{∞} ψ_j w_{t−j},

obtained by solving

φ(z)ψ(z) = θ(z),

are dominated by exponential decay. As pointed out in §3.3, this result implies the ACF of the short memory process ρ(h) → 0 exponentially fast as h → ∞. When the sample ACF of a time series decays slowly, the advice given in Chapter 3 has been to difference the series until it seems stationary. Following this advice with the glacial varve series first presented in Example 3.31 leads to the first difference of the logarithms of the data being represented as a first-order moving average. In Example 3.37, further analysis of the residuals leads to fitting an ARIMA(1, 1, 1) model,

∇xt = φ∇xt−1 + wt + θwt−1,

where we understand x_t is the log-transformed varve series. In particular, the estimates of the parameters (and the standard errors) were φ̂ = .23(.05), θ̂ = −.89(.03), and σ̂²_w = .23. The use of the first difference ∇x_t = (1 − B)x_t can be too severe a modification in the sense that the nonstationary model might represent an overdifferencing of the original process.

Long memory (or persistent) time series were considered in Hosking (1981) and Granger and Joyeux (1980) as intermediate compromises between the short memory ARMA type models and the fully integrated nonstationary processes in the Box–Jenkins class. The easiest way to generate a long memory series is to think of using the difference operator (1 − B)^d for fractional values of d, say, 0 < d < .5, so a basic long memory series gets generated as

(1 − B)^d x_t = w_t,   (5.1)

where w_t still denotes white noise with variance σ²_w. Now, d becomes a parameter to be estimated along with σ²_w. Differencing the original process, as in the Box–Jenkins approach, may be thought of as simply assigning a value of d = 1. This idea has been extended to the class of fractionally integrated ARMA, or ARFIMA models, where we allow −.5 < d < .5; when d is negative, the term antipersistent is used. Long memory processes occur in hydrology (see Hurst, 1951, and McLeod and Hipel, 1978) and in environmental series, such as the varve data we have previously analyzed, to mention a few examples. Long memory time series data tend to exhibit sample autocorrelations that are not necessarily large (as in the case of d = 1), but persist for a long time. Figure 5.1 shows the sample ACF, to lag 100, of the log-transformed varve series, which exhibits classic long memory behavior.

The fractionally differenced series (5.1), for |d| < .5, is often called fractional noise. To investigate its properties, we can use the binomial expansion (d > −.5) to write

w_t = (1 − B)^d x_t = ∑_{j=0}^{∞} π_j B^j x_t = ∑_{j=0}^{∞} π_j x_{t−j}   (5.2)


Figure 5.1 Sample ACF of the log transformed varve series.

where

π_j = Γ(j − d) / [Γ(j + 1)Γ(−d)]   (5.3)

with Γ(x + 1) = xΓ(x) being the gamma function. Similarly (d < .5), we can write

x_t = (1 − B)^{−d} w_t = ∑_{j=0}^{∞} ψ_j B^j w_t = ∑_{j=0}^{∞} ψ_j w_{t−j}   (5.4)

where

ψ_j = Γ(j + d) / [Γ(j + 1)Γ(d)].   (5.5)

The processes (5.2) and (5.4) are well-defined stationary processes (see Brockwell and Davis, 1991, for details). In the case of fractional differencing, however, the coefficients satisfy ∑ π_j² < ∞ and ∑ ψ_j² < ∞, as opposed to the absolute summability of the coefficients in ARMA processes.

Using the representation (5.4)–(5.5), the ACF of x_t is seen to be

ρ(h) = Γ(h + d)Γ(1 − d) / [Γ(h − d + 1)Γ(d)] ∼ h^{2d−1}   (5.6)

for large h. From this we see that for 0 < d < .5

∑_{h=−∞}^{∞} |ρ(h)| = ∞

and hence the term long memory.
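The coefficients (5.3) and (5.5) and the ACF (5.6) can be evaluated directly from the gamma function; the short sketch below, using the illustrative value d = .4, displays the slow hyperbolic decay of ρ(h) that is responsible for the divergent sum.

# pi_j, psi_j, and the long memory ACF (5.6) for an illustrative d = .4.
d = .4
j = 0:30; h = 1:100
pi.j  = gamma(j - d)/(gamma(j + 1)*gamma(-d))                   # (5.3)
psi.j = gamma(j + d)/(gamma(j + 1)*gamma(d))                    # (5.5)
rho.h = gamma(h + d)*gamma(1 - d)/(gamma(h - d + 1)*gamma(d))   # (5.6)
plot(h, rho.h, type = "h", xlab = "lag", ylab = "ACF")          # decays like h^(2d-1)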

In order to examine a series such as the varve series for a possible long memory pattern, it is convenient to look at ways of estimating d. Using (5.3) it is easy to derive the recursions

π_{j+1}(d) = (j − d)π_j(d) / (j + 1),   (5.7)

for j = 0, 1, . . ., with π_0(d) = 1. Maximizing the joint likelihood of the errors under normality, say, w_t(d), will involve minimizing the sum of squared errors

Q(d) = ∑ w_t²(d).

The usual Gauss–Newton method, described in §3.6, leads to the expansion

wt(d) = wt(d0) + w′t(d0)(d − d0),

where

w′_t(d_0) = ∂w_t/∂d |_{d=d_0}

and d_0 is an initial estimate (guess) at the value of d. Setting up the usual regression leads to

d = d_0 − ∑_t w′_t(d_0) w_t(d_0) / ∑_t w′_t(d_0)².   (5.8)

The derivatives are computed recursively by differentiating (5.7) successively with respect to d: π′_{j+1}(d) = [(j − d)π′_j(d) − π_j(d)]/(j + 1), where π′_0(d) = 0. The errors are computed from an approximation to (5.2), namely,

w_t(d) = ∑_{j=0}^{t} π_j(d) x_{t−j}.   (5.9)

It is advisable to omit a number of initial terms from the computation and start the sum, (5.8), at some fairly large value of t to have a reasonable approximation.
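For readers who want to experiment, the recursion (5.7) and the truncated errors (5.9) take only a few lines of R. Rather than coding the Gauss–Newton step (5.8), the sketch below simply minimizes Q(d) with optimize(); here x is an assumed mean-adjusted series, and the code is only an illustration of the idea, not the computation behind the example that follows.

# Estimate d in (1-B)^d x_t = w_t by minimizing Q(d) = sum of squared w_t(d),
# using the recursion (5.7) and the truncated errors (5.9); x is assumed
# to be a mean-adjusted series.
Q = function(d, x, burn = 30) {
  n = length(x)
  pi.d = numeric(n); pi.d[1] = 1                        # pi_0(d) = 1
  for (k in 1:(n-1)) pi.d[k+1] = ((k-1) - d)*pi.d[k]/k  # (5.7)
  w = numeric(n)
  for (t in 1:n) w[t] = sum(pi.d[1:t] * x[t:1])         # (5.9)
  sum(w[-(1:burn)]^2)                                   # omit initial terms
}
# d.hat = optimize(Q, interval = c(0, .5), x = x)$minimum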

Example 5.1 Long Memory Fitting of the Glacial Varve Series

We consider analyzing the glacial varve series discussed in Examples 2.5 and 3.31. Figure 2.6 shows the original and log-transformed series (which we denote by x_t). In Example 3.37, we noted that x_t could be modeled as an ARIMA(1, 1, 1) process. We fit the fractionally differenced model, (5.1), to the mean-adjusted series, x_t − x̄. Applying the Gauss–Newton iterative procedure previously described, starting with d = .1 and omitting the first 30 points from the computation, leads to a final value of d = .384, which implies the set of coefficients π_j(.384), as given in Figure 5.2 with π_0(.384) = 1.


Figure 5.2 Coefficients πj(.384), j = 1, 2, . . . , 30 in the representation (5.7).

We can compare roughly the performance of the fractional difference operator with the ARIMA model by examining the autocorrelation functions of the two residual series as shown in Figure 5.3. The ACFs of the two residual series are roughly comparable with the white noise model.

To perform this analysis in R, first download and install the fracdiff package from CRAN. Then, load the package and issue the following commands (assuming the data are in varve).

> library(fracdiff)
> lvarve = log(varve) - mean(log(varve))
> varve.fd = fracdiff(lvarve, nar=0, nma=0, M=30)
> varve.fd$d
[1] 0.3841688
> varve.fd$stderror.dpq
[1] 4.589514e-06

The R package uses a truncated maximum likelihood procedure that was discussed in Haslett and Raftery (1989), which is a little more elaborate than simply zeroing out initial values. The default truncation value in R is M = 100. In the default case, the estimate is d = .37 with approximately the same standard error. The standard error is obtained from the Hessian as described in Example 3.28. At this time the R package fracdiff does not supply the residuals for diagnostics or an estimate of σ²_w, hence some additional programming would be necessary for a full analysis.

Figure 5.3 ACF of residuals from the ARIMA(1, 1, 1) fit to the varve series (top) and of the residuals from the long memory model fit, (1 − B)^d x_t = w_t, with d = .384 (bottom).

Forecasting long memory processes is similar to forecasting ARIMA models. That is, (5.2) and (5.7) can be used to obtain the truncated forecasts

x^n_{n+m} = −∑_{j=1}^{n} π_j(d) x^n_{n+m−j},   (5.10)

for m = 1, 2, . . . . Error bounds can be approximated by using

P^n_{n+m} = σ²_w ( ∑_{j=0}^{m−1} ψ_j²(d) )   (5.11)

where, as in (5.7),

ψ_{j+1}(d) = (j + d)ψ_j(d) / (j + 1),   (5.12)

with ψ_0(d) = 1.

No obvious short memory ARMA-type component can be seen in the ACF of the residuals from the fractionally differenced varve series shown in Figure 5.3. It is natural, however, that cases will exist in which substantial short memory-type components will also be present in data that exhibits long memory. Hence, it is natural to define the general ARFIMA(p, d, q), −.5 < d < .5, process as

φ(B)∇^d (x_t − µ) = θ(B)w_t,   (5.13)

where φ(B) and θ(B) are as given in Chapter 3. Writing the model in the form

φ(B)π_d(B)(x_t − µ) = θ(B)w_t   (5.14)

makes it clear how we go about estimating the parameters for the more general model. Forecasting for the ARFIMA(p, d, q) series can be easily done, noting that we may equate coefficients in

φ(z)ψ(z) = (1 − z)^{−d} θ(z)   (5.15)

and

θ(z)π(z) = (1 − z)^{d} φ(z)   (5.16)

to obtain the representations

x_t = µ + ∑_{j=0}^{∞} ψ_j w_{t−j}

and

w_t = ∑_{j=0}^{∞} π_j (x_{t−j} − µ).

We then can proceed as discussed in (5.10) and (5.11).

A comprehensive treatment of long memory models is given in Beran (1994), and it should be noted that several other techniques for estimating the parameters, especially the long memory parameter, can be developed in the frequency domain. In this case, we may think of the equations as generated by an infinite order autoregressive series with coefficients π_j given by (5.7). Using the same approach as before, we obtain

f_x(ω) = σ²_w / |∑_{k=0}^{∞} π_k e^{−2πikω}|²   (5.17)
       = σ²_w |1 − e^{−2πiω}|^{−2d}   (5.18)
       = [4 sin²(πω)]^{−d} σ²_w   (5.19)

as equivalent representations of the spectrum of a long memory process. The long memory spectrum approaches infinity as the frequency ω → 0.

The main reason for defining the Whittle approximation to the log likelihood is to propose its use for estimating the parameter d in the long memory case as an alternative to the time domain method previously mentioned. The time domain approach is useful because of its simplicity and easily computed standard errors. One may also use an exact likelihood approach by developing an innovations form of the likelihood as in Brockwell and Davis (1991).

For the approximate approach using the Whittle likelihood (4.116), we consider using the approach of Fox and Taqqu (1986) who showed that maximizing the Whittle log likelihood leads to a consistent estimator with the usual asymptotic normal distribution that would be obtained by treating (4.116) as a conventional log likelihood (see also Dahlhaus, 1989; Robinson, 1995; Hurvich et al., 1998). Unfortunately, the periodogram ordinates are not asymptotically independent (Hurvich and Beltrao, 1993), although a quasi-likelihood in the form of the Whittle approximation works well and has good asymptotic properties.

To see how this would work for the purely long memory case, write the long memory spectrum as

f_x(ω_k; d, σ²_w) = σ²_w g_k^{−d},   (5.20)

where

g_k = 4 sin²(πω_k).   (5.21)

Then, differentiating the log likelihood, say,

ln L(x; d, σ²_w) ≈ −m ln σ²_w + d ∑_{k=1}^{m} ln g_k − (1/σ²_w) ∑_{k=1}^{m} g_k^d I(ω_k)   (5.22)

at m = n/2 − 1 frequencies and solving for σ²_w yields

σ²_w(d) = (1/m) ∑_{k=1}^{m} g_k^d I(ω_k)   (5.23)

as the approximate maximum likelihood estimator for the variance parameter. To estimate d, we use a grid scan of the concentrated log likelihood

ln L(x; d) ≈ −m ln σ²_w(d) + d ∑_{k=1}^{m} ln g_k − m   (5.24)

over the interval (−.5, .5), followed by a Newton–Raphson procedure to convergence.
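The grid scan is straightforward to carry out; the following sketch computes the periodogram, forms g_k, and evaluates the concentrated log likelihood (5.24) over a grid of d values. Here x is an assumed mean-adjusted series, and the code is schematic rather than the exact computation used in the next example.

# Grid scan of the concentrated Whittle log likelihood (5.24) for d;
# x is assumed to be a mean-adjusted series.
n = length(x)
m = floor(n/2) - 1
per = Mod(fft(x))^2 / n                   # periodogram
I.k = per[2:(m+1)]                        # I(w_k), w_k = k/n, k = 1,...,m
g.k = 4*sin(pi*(1:m)/n)^2                 # (5.21)
conc = function(d) {                      # concentrated log likelihood (5.24)
  sig2 = mean(g.k^d * I.k)                # (5.23)
  -m*log(sig2) + d*sum(log(g.k)) - m
}
d.grid = seq(-.45, .45, by = .01)
d.hat = d.grid[which.max(sapply(d.grid, conc))]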

Example 5.2 Long Memory Spectra for the Varve Series

We have previously examined the fit of the long memory model for the glacial varve data that is thought to be a reasonable surrogate for temperature. Fitting the long memory model using the Whittle approximation above gives d̂ = .394, with an estimated standard error of .022. The earlier time domain method gave d̂ = .384, with a standard error of 4.6 × 10⁻⁶, so the results of the two methods are different. The error variance estimated was σ̂²_w = .2320. One might also consider fitting an autoregressive model to this data using a procedure similar to that used in Example 4.19. Following this approach gave an autoregressive model with p = 8 and φ̂ = (.34, .11, .03, .09, .09, .08, .02, .09)′, with σ̂²_w = .2303 as the error variance.

Figure 5.4 Long memory (d = .394) and autoregressive AR(8) spectral estimators for the paleoclimatic glacial varve series.

The two log spectra are plotted in Figure 5.4 for ω > 0, and we note that the long memory spectrum is lower for the first frequency estimated (ω_1 = 1/512) but will eventually become infinite, whereas the AR(8) spectrum is higher at that point, but takes a finite value at ω = 0.

It should be noted that there is a strong likelihood that the spectrum will not be purely long memory, as it seemed to be in the example given above. A common situation has the long memory component multiplied by a short memory component, leading to an alternate version of (5.20) of the form

f_x(ω_k; d, θ) = g_k^{−d} f_0(ω_k; θ),   (5.25)

where f_0(ω_k; θ) might be the spectrum of an autoregressive moving average process with vector parameter θ, or it might be unspecified. If the spectrum has a parametric form, the Whittle likelihood can be used. However, there is a substantial amount of semiparametric literature that develops the estimators when the underlying spectrum f_0(ω; θ) is unknown. A class of Gaussian semi-parametric estimators simply uses the same Whittle likelihood (5.24), evaluated over a sub-band of low frequencies, say m′ = √n. There is some latitude in selecting a band that is relatively free from low frequency interference due to the short memory component in (5.25).


Geweke and Porter–Hudak (1983) developed an approximate method for estimating d based on a regression model, derived from (5.24). Note that we may write a simple equation for the logarithm of the spectrum as

ln f_x(ω_k; d) = ln f_0(ω_k; θ) − d ln[4 sin²(πω_k)],   (5.26)

with the frequencies ω_k = k/n restricted to a range k = 1, 2, . . . , m′ near the zero frequency with m′ = √n as the recommended value. Relationship (5.26) suggests using a simple linear regression model of the form

ln I(ω_k) = β_0 − d ln[4 sin²(πω_k)] + e_k   (5.27)

for the periodogram to estimate the parameters σ²_w and d. In this case, one performs least squares using ln I(ω_k) as the dependent variable, and ln[4 sin²(πω_k)] as the independent variable for k = 1, 2, . . . , m. The resulting slope estimate is then used as an estimate of −d. For a good discussion of various alternative methods for selecting m, see Hurvich and Deo (1999).

One of the above two procedures works well for estimating the long memory component, but there will be cases (such as ARFIMA) where there will be a parameterized short memory component f_0(ω_k; θ) that needs to be estimated. If the spectrum is highly parameterized, one might estimate using the Whittle log likelihood (5.22) and

f_x(ω_k; θ) = g_k^{−d} f_0(ω_k; θ)

and jointly estimating the parameters d and θ using the Newton–Raphson method. If we are interested in a nonparametric estimator, using the conventional smoothed spectral estimator for the periodogram, adjusted for the long memory component, say g_k^d I(ω_k), might be a possible approach.

5.3 GARCH Models

Recent problems in finance have motivated the study of the volatility, or variability, of a time series. Although ARMA models assume a constant variance, models such as the autoregressive conditionally heteroscedastic or ARCH model, first introduced by Engle (1982), were developed to model changes in volatility. These models were later extended to generalized ARCH, or GARCH models by Bollerslev (1986).

In §3.8, we discussed the return or growth rate of a series. For example, if x_t is the value of a stock at time t, then the return or relative gain, y_t, of the stock at time t is

y_t = (x_t − x_{t−1}) / x_{t−1}.   (5.28)

Definition (5.28) implies that x_t = (1 + y_t)x_{t−1}. Thus, based on the discussion in §3.8, if the return represents a small (in magnitude) percentage change then

∇[ln(xt)] ≈ yt. (5.29)

Either value, ∇[ln(x_t)] or (x_t − x_{t−1})/x_{t−1}, will be called the return, and will be denoted by y_t. It is the study of y_t that is the focus of ARCH, GARCH, and other volatility models. Recently there has been interest in stochastic volatility models, and we will discuss these models in Chapter 6 because they are state-space models.

Typically, for financial series, the return y_t does not have a constant variance, and highly volatile periods tend to be clustered together. In other words, there is a strong dependence of sudden bursts of variability in a return on the series' own past. For example, Figure 1.4 shows the daily returns of the New York Stock Exchange (NYSE) from February 2, 1984 to December 31, 1991. In this case, as is typical, the return y_t is fairly stable, except for short-term bursts of high volatility.

The simplest ARCH model, the ARCH(1), models the return as

y_t = σ_t ε_t   (5.30)
σ_t² = α_0 + α_1 y_{t−1}²,   (5.31)

where ε_t is standard Gaussian white noise; that is, ε_t ∼ iid N(0, 1). As with ARMA models, we must impose some constraints on the model parameters to obtain desirable properties. One obvious constraint is that α_1 must not be negative, or else σ_t² may be negative.

As we shall see, the ARCH(1) models the return as a white noise process with nonconstant conditional variance, and that conditional variance depends on the previous return. First, notice that the conditional distribution of y_t given y_{t−1} is Gaussian:

y_t | y_{t−1} ∼ N(0, α_0 + α_1 y_{t−1}²).   (5.32)

In addition, it is possible to write the ARCH(1) model as a non-Gaussian AR(1) model in the square of the returns y_t². To do this, rewrite (5.30)-(5.31) as

y_t² = σ_t² ε_t²
α_0 + α_1 y_{t−1}² = σ_t²,

and subtract the two equations to obtain

y_t² − (α_0 + α_1 y_{t−1}²) = σ_t² ε_t² − σ_t².

Now, write this equation as

y_t² = α_0 + α_1 y_{t−1}² + v_t,   (5.33)

where v_t = σ_t²(ε_t² − 1). Because ε_t² is the square of a N(0, 1) random variable, ε_t² − 1 is a shifted (to have mean zero) χ²_1 random variable.

To explore the properties of ARCH, we define Y_s = {y_s, y_{s−1}, ...}. Then, using (5.32), we immediately see that y_t has a zero mean:

E(y_t) = E E(y_t | Y_{t−1}) = E E(y_t | y_{t−1}) = 0.   (5.34)

Because E(y_t | Y_{t−1}) = 0, the process y_t is said to be a martingale difference. Because y_t is a martingale difference, it is also an uncorrelated sequence. For example, with h > 0,

cov(y_{t+h}, y_t) = E(y_t y_{t+h}) = E E(y_t y_{t+h} | Y_{t+h−1})
                  = E{y_t E(y_{t+h} | Y_{t+h−1})} = 0.   (5.35)

The last line of (5.35) follows because y_t belongs to the information set Y_{t+h−1} for h > 0, and E(y_{t+h} | Y_{t+h−1}) = 0, as determined in (5.34).

An argument similar to (5.34) and (5.35) will establish the fact that the error process v_t in (5.33) is also a martingale difference and, consequently, an uncorrelated sequence. If the variance of v_t is finite and constant with respect to time, and 0 ≤ α_1 < 1, then based on Property P3.1, (5.33) specifies a causal AR(1) process for y_t². Therefore, E(y_t²) and var(y_t²) must be constant with respect to time t. This implies that

E(y_t²) = var(y_t) = α_0 / (1 − α_1)   (5.36)

and, after some manipulations,

E(y_t⁴) = [3α_0² / (1 − α_1)²] [(1 − α_1²) / (1 − 3α_1²)],   (5.37)

provided 3α_1² < 1. These results imply that the kurtosis, κ, of y_t is

κ = E(y_t⁴) / [E(y_t²)]² = 3(1 − α_1²) / (1 − 3α_1²),   (5.38)

which is always larger than 3 (unless α_1 = 0), the kurtosis of the normal distribution. Thus, the marginal distribution of the returns, y_t, is leptokurtic, or has "fat tails."

In summary, an ARCH(1) process, y_t, as given by (5.30)-(5.31), or equivalently (5.32), is characterized by the following properties.

• If 0 ≤ α_1 < 1, the process y_t itself is white noise and its unconditional distribution is symmetrically distributed around zero; this distribution is leptokurtic.

• If, in addition, 3α_1² < 1, the square of the process, y_t², follows a causal AR(1) model with ACF given by ρ_{y²}(h) = α_1^h ≥ 0, for all h > 0. If 3α_1² ≥ 1, but α_1 < 1, then y_t² is strictly stationary with infinite variance.

Estimation of the parameters α_0 and α_1 of the ARCH(1) model is typically accomplished by conditional MLE. The conditional likelihood of the data y_2, . . . , y_n given y_1, is given by

L(α_0, α_1 | y_1) = ∏_{t=2}^{n} f_{α_0,α_1}(y_t | y_{t−1}),   (5.39)

where the density f_{α_0,α_1}(y_t | y_{t−1}) is the normal density specified in (5.32). Hence, the criterion function to be minimized, l(α_0, α_1) ∝ − ln L(α_0, α_1 | y_1), is given by

l(α_0, α_1) = (1/2) ∑_{t=2}^{n} ln(α_0 + α_1 y_{t−1}²) + (1/2) ∑_{t=2}^{n} [ y_t² / (α_0 + α_1 y_{t−1}²) ].   (5.40)

Estimation is accomplished by numerical methods, as described in §3.6. In this case, analytic expressions for the gradient vector, l⁽¹⁾(α_0, α_1), and Hessian matrix, l⁽²⁾(α_0, α_1), as described in Example 3.28, can be obtained by straightforward calculations. For example, the 2 × 1 gradient vector, l⁽¹⁾(α_0, α_1), is given by

(∂l/∂α_0, ∂l/∂α_1)′ = ∑_{t=2}^{n} (1, y_{t−1}²)′ × [ (α_0 + α_1 y_{t−1}² − y_t²) / 2(α_0 + α_1 y_{t−1}²)² ].   (5.41)

The calculation of the Hessian matrix is left as an exercise (Problem 5.7). The likelihood of the ARCH model tends to be flat unless n is very large. A discussion of this problem can be found in Shephard (1996).
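To see the numerical optimization explicitly, the criterion (5.40) can be coded and handed to optim(); the analytic gradient (5.41) could be supplied, but the sketch below lets optim() use numerical derivatives. The vector y is an assumed return series, for instance the ARCH(1) series simulated earlier.

# Conditional negative log likelihood (5.40) for ARCH(1) and its minimization;
# y is an assumed return series.
arch1.nll = function(par, y) {
  s2 = par[1] + par[2]*y[-length(y)]^2       # sigma_t^2 for t = 2,...,n
  .5*sum(log(s2)) + .5*sum(y[-1]^2/s2)       # (5.40)
}
# fit = optim(c(.1, .1), arch1.nll, y = y, method = "L-BFGS-B",
#             lower = c(1e-6, 0), upper = c(Inf, .99), hessian = TRUE)
# fit$par                            # estimates of alpha0 and alpha1
# sqrt(diag(solve(fit$hessian)))     # approximate standard errors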

It is also possible to combine a regression or an ARMA model for the mean with an ARCH model for the errors. For example, a regression with ARCH(1) errors model would have the observations x_t as linear function of p regressors, z_t = (z_{t1}, ..., z_{tp})′, and ARCH(1) noise y_t, say,

x_t = β′z_t + y_t,

where y_t satisfies (5.30)-(5.31), but, in this case, is unobserved. Similarly, for example, an AR(1) model for data x_t exhibiting ARCH(1) errors would be

xt = φ0 + φ1xt−1 + yt.

These types of models were explored by Weiss (1984).

Example 5.3 Analysis of U.S. GNP

In Example 3.35, we fit an MA(2) model and an AR(1) model to the U.S. GNP series and we concluded that the residuals from both fits appeared to behave like a white noise process. In Example 3.39 we concluded that the AR(1) is probably the better model in this case. It has been suggested that the U.S. GNP series has ARCH errors, and in this example, we will investigate this claim. If the GNP noise term is ARCH, the squares of the residuals from the fit should behave like a non-Gaussian AR(1) process, as pointed out in (5.33). Figure 5.5 shows the ACF and PACF of the squared residuals; it appears that there may be some dependence, albeit small, left in the residuals.

We used the S-PLUS GARCH module to fit an AR(1)-ARCH(1) model to the U.S. GNP returns with the following results:

Figure 5.5 ACF and PACF of the squares of the residuals from the AR(1) fit on U.S. GNP.

> gnp96 <- matrix(scan("/mydata/gnp96.dat"), ncol=2, byrow=T)
> gnpr <- diff(log(gnp96[,2]))                 # gnp returns
> gnpr.mod <- garch(gnpr~ar(1), ~garch(1,0))   # model call
> summary(gnpr.mod)

Estimated Coefficients:
            Value  Std.Error t value   Pr(>|t|)
      C   0.00522 8.264e-004   6.326 6.990e-010   # AR cnst
  AR(1)   0.36721 7.888e-002   4.656 2.798e-006   # AR coef
      A   0.00007 6.978e-006  10.349 0.000e+000   # ARCH cnst
ARCH(1)   0.20242 7.031e-002   2.879 2.193e-003   # ARCH coef

Residual Tests:
     Jarque-Bera  P-value     # tests normal skewness & kurtosis
           8.643  0.01328
    Shapiro-Wilk  P-value     # tests normal order statistics
          0.9827   0.4829
     Q-Statistic  P-value  Chi^2-d.f.
           13.88   0.3087          12

In this example, we obtain φ̂_0 = .005 and φ̂_1 = .367 for the AR(1) parameter estimates; in Example 3.35 the values were .005 and .347, respectively. The ARCH(1) parameter estimates are α̂_0 = 0 for the constant and α̂_1 = .202, which is highly significant with a p-value of about .002. The Jarque–Bera statistic tests the residuals of the fit for normality based on the observed skewness and kurtosis, and it appears that the residuals have some non-normal skewness and kurtosis. The Shapiro–Wilk statistic tests the residuals of the fit for normality based on the empirical order statistics. In this case, the residuals appear to be normal. Finally, the Q-statistic is used on the squared residuals, and we conclude that the squared residuals appear to be an uncorrelated sequence.

To repeat the analysis in R without the simultaneous estimation, download the package tseries from CRAN and load it. Then, perform the AR estimation first and use those residuals for the ARCH fit as follows (assuming gnpr is available as in the S-PLUS example). We note that the results are similar to the simultaneous estimation results from S-PLUS.

> library(tseries)
> gnpr.ar = ar.mle(gnpr, order.max=1)      # recall phi1 = .347
> y = gnpr.ar$resid[2:length(gnpr)]        # first resid is NA
> arch.y = garch(y, order=c(0,1))
> summary.garch(arch.y)                    # partial output below

Coefficient(s):
     Estimate  Std. Error  t value  Pr(>|t|)
a0  7.403e-05   7.275e-06   10.175   < 2e-16   # ARCH cnst
a1  1.939e-01   6.781e-02    2.859   0.00425   # ARCH coef

Jarque Bera Test:
X-squared = 8.4801, df = 2, p-value = 0.01441

Box-Ljung test (squared residuals):
X-squared = 3e-04, df = 1, p-value = 0.9865

The ARCH(1) model can be extended to the general ARCH(m) model in an obvious way. That is, (5.30) is retained,

y_t = σ_t ε_t,   (5.30)

but (5.31) is extended to

σ_t² = α_0 + α_1 y_{t−1}² + · · · + α_m y_{t−m}².   (5.42)

Estimation for ARCH(m) also follows in an obvious way from the discussion of estimation for ARCH(1) models. That is, the conditional likelihood of the data y_{m+1}, . . . , y_n given y_1, . . . , y_m, is given by

L(α | y_1, . . . , y_m) = ∏_{t=m+1}^{n} f_α(y_t | y_{t−1}, . . . , y_{t−m}),   (5.43)

where α = (α_0, α_1, . . . , α_m) and the conditional densities f_α(·|·) in (5.43) are normal densities; that is, for t > m,

y_t | y_{t−1}, . . . , y_{t−m} ∼ N(0, α_0 + α_1 y_{t−1}² + · · · + α_m y_{t−m}²).

Another extension of ARCH is the generalized ARCH or GARCH model developed by Bollerslev (1986). For example, a GARCH(1, 1) model retains (5.30),

y_t = σ_t ε_t,   (5.30)

but extends (5.31) as follows:

σ_t² = α_0 + α_1 y_{t−1}² + β_1 σ_{t−1}².   (5.44)

Under the condition that α_1 + β_1 < 1, using similar manipulations as in (5.33), the GARCH(1, 1) model, (5.30) and (5.44), admits a non-Gaussian ARMA(1, 1) model for the squared process

y_t² = α_0 + (α_1 + β_1) y_{t−1}² + v_t − β_1 v_{t−1},   (5.45)

where v_t is as defined in (5.33). Representation (5.45) follows by writing (5.30) as

y_t² − σ_t² = σ_t²(ε_t² − 1)
β_1(y_{t−1}² − σ_{t−1}²) = β_1 σ_{t−1}²(ε_{t−1}² − 1),

subtracting the second equation from the first, and using the fact that, from (5.44), σ_t² − β_1 σ_{t−1}² = α_0 + α_1 y_{t−1}², on the left-hand side of the result. The GARCH(m, r) model retains (5.30) and extends (5.44) to

σ_t² = α_0 + ∑_{j=1}^{m} α_j y_{t−j}² + ∑_{j=1}^{r} β_j σ_{t−j}².   (5.46)

Conditional maximum likelihood estimation of the GARCH(m, r) model parameters is similar to the ARCH(m) case, wherein the conditional likelihood, (5.43), is the product of N(0, σ_t²) densities with σ_t² given by (5.46) and where the conditioning is on the first max(m, r) observations, with σ_1² = · · · = σ_r² = 0. Once the parameter estimates are obtained, the model can be used to obtain one-step-ahead forecasts of the volatility, say σ̂²_{t+1}, given by

σ̂²_{t+1} = α̂_0 + ∑_{j=1}^{m} α̂_j y_{t+1−j}² + ∑_{j=1}^{r} β̂_j σ̂²_{t+1−j}.   (5.47)

We explore these concepts in the following example.

Example 5.4 GARCH Analysis of the NYSE Returns

As previously mentioned, the daily returns of the NYSE shown in Figure 1.4 exhibit classic GARCH features. We used the R tseries package to fit a GARCH(1, 1) model to the series with the following results:


> nyse = scan("/mydata/nyse.dat")
> nyse.g = garch(nyse, order=c(1,1))
> summary.garch(nyse.g)

Coefficient(s):
      Estimate  Std. Error  t value  Pr(>|t|)
 a0  6.552e-06   6.761e-07    9.691    <2e-16   # alpha0
 a1  1.118e-01   4.056e-03   27.554    <2e-16   # alpha1
 b1  8.086e-01   1.292e-02   62.566    <2e-16   # beta1

Diagnostic Tests:
 Jarque Bera Test - data: Residuals
   X-squared = 3983.873, df = 2, p-value < 2.2e-16
 Box-Ljung Test - data: Squared.Residuals
   X-squared = 1.5874, df = 1, p-value = 0.2077

To explore the GARCH predictions, we calculated and plotted the middle portion of the data (which includes the October 19, 1987 crash) along with the one-step-ahead predictions of the corresponding volatility, \hat\sigma_t^2. The results are displayed as \pm\hat\sigma_t, dashed lines surrounding the data, in Figure 5.6. These predictions can be obtained easily in R using the tseries package.

> u = predict.garch(nyse.g)
> plot(800:1000, nyse[800:1000], type="l", xlab="Time",
+   ylab="NYSE Returns")
> lines(u[,1], col="blue", lty="dashed")
> lines(u[,2], col="blue", lty="dashed")

Some key points can be gleaned from the examples of this section. First, it is apparent that the conditional distribution of the returns is rarely normal. S-PLUS allows for long tailed distributions to be fit to the data, whereas R does not. In particular, aside from the Gaussian distribution (the default), the S-PLUS Garch module allows for t, double exponential, and generalized double exponential¹ conditional distributions. Also, the predictions shown in Figure 5.6 leave something to be desired. It appears the model is better at telling you what the volatility was rather than what it is going to be; basically, increases or decreases in predicted volatility are a day late. In addition to these points, some other drawbacks of the GARCH model are: (i) the model assumes positive and negative returns have the same effect because volatility depends on squared returns; (ii) the model is restrictive because of the tight constraints on the model parameters (e.g., for an ARCH(1), 0 \le \alpha_1^2 < \tfrac{1}{3}); (iii) the likelihood is flat unless n is very large; (iv) the model tends to overpredict volatility because it responds slowly to large isolated returns.

¹ f(x) = p\alpha \exp(-\alpha x) I_{(0,\infty)}(x) + (1-p)\beta \exp(\beta x) I_{(-\infty,0)}(x); \quad 0 < p < 1.


Figure 5.6 GARCH predictions of the NYSE volatility, \pm\hat\sigma_t, displayed as dashed lines.

Various extensions to the original model have been proposed to overcome some of the shortcomings we have just mentioned. For example, we have already discussed the fact that the S-PLUS Garch module will fit some non-normal, albeit symmetric, distributions. For asymmetric return dynamics, one can use the EGARCH (exponential GARCH) model, which is a complex model that has different components for positive returns and for negative returns. In the case of persistence in volatility, the integrated GARCH (IGARCH) model may be used. Recall (5.45), where we showed the GARCH(1, 1) model can be written as

    y_t^2 = \alpha_0 + (\alpha_1 + \beta_1) y_{t-1}^2 + v_t - \beta_1 v_{t-1}

and y_t^2 is stationary if \alpha_1 + \beta_1 < 1. The IGARCH model sets \alpha_1 + \beta_1 = 1, in which case the IGARCH(1, 1) model is

    y_t = \sigma_t \epsilon_t  \quad and \quad  \sigma_t^2 = \alpha_0 + (1 - \beta_1) y_{t-1}^2 + \beta_1 \sigma_{t-1}^2.

There are many different extensions to the basic ARCH model that were developed to handle the various situations noticed in practice. Interested readers might find the general discussions in Bollerslev et al. (1994) and Shephard (1996) worthwhile reading. Also, Gourieroux (1997) gives a detailed presentation of ARCH and related models with financial applications and contains an extensive bibliography. Two excellent texts on financial time series analysis are Chan (2002) and Tsay (2001).


Finally, we briefly discuss stochastic volatility models; a detailed treatment of these models is given in Chapter 6. The volatility component, \sigma_t^2, in the GARCH model is conditionally nonstochastic. In the ARCH(1) model, for example, any time the previous return is zero, i.e., y_{t-1} = 0, it must be the case that \sigma_t^2 = \alpha_0, and so on. This assumption seems a bit unrealistic. The stochastic volatility model adds a stochastic component to the volatility in the following way. In the GARCH model, a return, say y_t, is

    y_t = \sigma_t \epsilon_t  \quad\Rightarrow\quad  \log y_t^2 = \log \sigma_t^2 + \log \epsilon_t^2.    (5.48)

In this way, we see that the observations \log y_t^2 are made up of two components, the unobserved volatility \log \sigma_t^2, which may be considered a latent variable, and unobserved noise \log \epsilon_t^2. While, for example, the GARCH(1, 1) model treats volatility as evolving without error, \sigma_{t+1}^2 = \alpha_0 + \alpha_1 y_t^2 + \beta_1 \sigma_t^2, the basic stochastic volatility model assumes the latent variable is an autoregressive process,

    \log \sigma_{t+1}^2 = \phi_0 + \phi_1 \log \sigma_t^2 + w_t,    (5.49)

where w_t \sim iid N(0, \sigma_w^2). The introduction of the noise term w_t makes the latent volatility process stochastic. Together, (5.48) and (5.49) comprise the stochastic volatility model. Given n observations, the goals are to estimate the parameters \phi_0, \phi_1 and \sigma_w^2, and then predict future observations \log y_{n+m}^2. Details are provided in §6.10.
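As a quick illustration of (5.48) and (5.49), the following R sketch simulates a stochastic volatility series; the parameter values and the random seed are arbitrary assumptions made only for the demonstration.

  # Sketch: simulate from the stochastic volatility model (5.48)-(5.49)
  set.seed(90210)
  n = 1000; phi0 = -.2; phi1 = .95; sw = .5     # illustrative parameter values
  h = numeric(n)                                 # h_t = log sigma_t^2
  h[1] = phi0 / (1 - phi1)                       # start at the stationary mean
  for (t in 2:n)
    h[t] = phi0 + phi1 * h[t-1] + rnorm(1, 0, sw)
  y = exp(h/2) * rnorm(n)                        # returns y_t = sigma_t * eps_t
  # log(y^2) is the latent volatility plus log chi-square noise, as in (5.48)
  plot.ts(cbind(y, log(y^2)))

The plot of log y_t^2 shows the latent AR(1) signal buried in heavy-tailed noise, which is the estimation problem taken up in §6.10.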

5.4 Threshold Models

In §3.5 we discussed the fact that, for a stationary time series, best linear prediction forward in time is the same as best linear prediction backward in time. This result followed from the fact that the variance–covariance matrix of x_{1:n} = (x_1, x_2, \ldots, x_n)', say, \Gamma = \{\gamma(i-j)\}_{i,j=1}^{n}, is the same as the variance–covariance matrix of x_{n:1} = (x_n, x_{n-1}, \ldots, x_1)'. In addition, if the process is Gaussian, the distributions of x_{1:n} and x_{n:1} are identical. In this case, a time plot of x_{1:n} (that is, the data plotted forward in time) should look similar to a time plot of x_{n:1} (that is, the data plotted backward in time).

There are, however, many series that do not fit into this category. For example, Figure 5.7 shows a plot of monthly pneumonia and influenza deaths per 10,000 in the U.S. for 11 years, 1968 to 1978. Typically, the number of deaths tends to increase more slowly than it decreases. Thus, if the data were plotted backward in time, the backward series would tend to increase faster than it decreases. Also, if monthly pneumonia and influenza deaths followed a linear Gaussian process, we would not expect to see such large bursts of positive and negative changes that occur periodically in this series. Moreover, although the number of deaths is typically largest during the winter months, the data are not perfectly seasonal. That is, although the peak of the series often occurs in January, in other years, the peak occurs in December, February, or March.

Figure 5.7 U.S. monthly pneumonia and influenza deaths per 10,000 over 11 years from 1968 to 1978.

If our goal is to predict flu epidemics, then it should be clear that a Gaussian linear model would not be appropriate. Many approaches to modeling nonlinear series exist that could be used (see Priestley, 1988); here, we focus on the class of threshold autoregressive models presented in Tong (1983, 1990). The basic idea of these models is that of fitting local linear AR(p) models, and their appeal is that we can use the intuition from fitting global linear AR(p) models. Suppose we know p, and given the vectors x_{t-1} = (x_{t-1}, \ldots, x_{t-p})', we can identify r mutually exclusive and exhaustive regions for x_{t-1}, say, R_1, \ldots, R_r, where the dynamics of the system changes. The threshold model is then written as r AR(p) models,

    x_t = \alpha^{(j)} + \phi_1^{(j)} x_{t-1} + \cdots + \phi_p^{(j)} x_{t-p} + w_t^{(j)}, \quad x_{t-1} \in R_j,    (5.50)

for j = 1, \ldots, r. In (5.50), the w_t^{(j)} are independent white noise series, each with variance \sigma_j^2, for j = 1, \ldots, r. Model estimation, identification, and diagnostics proceed as in the case in which r = 1.

Example 5.5 Threshold Modeling of the Influenza Series

As previously discussed, examination of Figure 5.7 leads us to believe that the monthly pneumonia and influenza deaths time series, say flu_t, is not linear. It is also evident from Figure 5.7 that there is a slight negative trend in the data. We have found that the most convenient way


to fit a threshold model to this data set, while removing the trend, is to work with the first differences of the data. The differenced data,

    x_t = flu_t - flu_{t-1},

are exhibited in Figure 5.8 as the dark solid line with circles representing observations. The dashed line with squares in Figure 5.8 shows the one-month-ahead predictions, which we will discuss later.

The nonlinearity of the data is more pronounced in the plot of the first differences, x_t. Clearly, the change in the number of deaths, x_t, slowly rises for some months and then, sometime in the winter, has a possibility of jumping to a large number once x_t exceeds about .05. If the process does make a large jump, then a subsequent significant decrease occurs in flu deaths. As an initial analysis, we fit the following threshold model

    x_t = \alpha^{(1)} + \sum_{j=1}^{p} \phi_j^{(1)} x_{t-j} + w_t^{(1)}, \quad x_{t-1} < .05,

    x_t = \alpha^{(2)} + \sum_{j=1}^{p} \phi_j^{(2)} x_{t-j} + w_t^{(2)}, \quad x_{t-1} \ge .05,    (5.51)

with p = 6, assuming this would be larger than necessary.

Model (5.51) is easy to fit using two linear regression runs. That is, let \delta_t^{(1)} = 1 if x_{t-1} < .05, and zero otherwise, and let \delta_t^{(2)} = 1 if x_{t-1} \ge .05, and zero otherwise. Then, using the notation of §2.2, for t = p + 1, \ldots, n, either equation in (5.51) can be written as

    y_t = \beta' z_t + w_t,

where, for i = 1, 2,

    y_t = \delta_t^{(i)} x_t, \quad z_t' = \delta_t^{(i)} (1, x_{t-1}, \ldots, x_{t-p}), \quad w_t = \delta_t^{(i)} w_t^{(i)},

and

    \beta' = (\alpha^{(i)}, \phi_1^{(i)}, \phi_2^{(i)}, \ldots, \phi_p^{(i)}).

Parameter estimates can then be obtained using the regression techniques of §2.2 twice, once for i = 1 and again for i = 2.
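A minimal R sketch of this two-regression fit is given below. It assumes the influenza series is available as an object called flu and uses the threshold .05 from the text; the object name, the choice p = 4 (the order of the final model), and the use of embed() are our own illustrative choices.

  # Sketch: fit the two-regime threshold AR in (5.51) via two OLS runs;
  # 'flu' is assumed to hold the monthly pneumonia and influenza deaths.
  p = 4; thrsh = .05
  x = diff(flu)                              # x_t = flu_t - flu_{t-1}
  Z = embed(x, p + 1)                        # columns: x_t, x_{t-1}, ..., x_{t-p}
  ind = Z[, 2] < thrsh                       # regime indicator based on x_{t-1}
  fit1 = lm(Z[ind, 1]  ~ Z[ind, 2:(p+1)])    # lower regime
  fit2 = lm(Z[!ind, 1] ~ Z[!ind, 2:(p+1)])   # upper regime
  summary(fit1); summary(fit2)               # coefficients and standard errors

Each lm() call plays the role of one of the indicator-weighted regressions described above.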

For each regime, an order p = 4 model was finally selected. The final model was

    x_t = .51_{(.08)} x_{t-1} - .20_{(.06)} x_{t-2} + .12_{(.05)} x_{t-3} - .11_{(.5)} x_{t-4} + w_t^{(1)},  when x_{t-1} < .05,

    x_t = .40 - .75_{(.17)} x_{t-1} - 1.03_{(.21)} x_{t-2} - 2.05_{(1.05)} x_{t-3} - 6.71_{(1.25)} x_{t-4} + w_t^{(2)},  when x_{t-1} \ge .05,

where \hat\sigma_1 = .05 and \hat\sigma_2 = .07. The threshold of .05 was exceeded 17 times. Using the final model, one-month-ahead predictions can be made, and these are shown in Figure 5.8 as a dashed line with squares. The model does extremely well at predicting a flu epidemic; the peak at t = 96, however, was missed by this model. When we fit a model with a smaller threshold of .04, flu epidemics were somewhat underestimated, but the flu epidemic in the eighth year was predicted one month early. We chose the model with a threshold of .05 because the residual diagnostics showed no obvious departure from the model assumptions (except for one outlier at t = 96); the model with a threshold of .04 still had some correlation left in the residuals and had more than one outlier. Finally, prediction beyond one month ahead for this model is very complicated, but some approximate techniques exist (see Tong, 1983).

Figure 5.8 First differenced U.S. monthly pneumonia and influenza deaths per 10,000 (solid line with circles); one-month-ahead predictions (dashed line with squares).


5.5 Regression with Autocorrelated Errors

In §2.2, we covered the classical regression model with uncorrelated errors w_t. In this section, we discuss the modifications that might be considered when the errors are correlated. That is, consider the regression model

    y_t = \beta' z_t + x_t,    (5.52)

for t = 1, \ldots, n, where x_t is a process with some covariance function \gamma(s, t). Then, we have the matrix form

    y = Z\beta + x,    (5.53)

where x = (x_1, \ldots, x_n)' is an n \times 1 vector with n \times n covariance matrix \Gamma = \{\gamma(s, t)\}. Note that Z = [z_1, z_2, \ldots, z_n]' is the n \times q matrix of input variables, as before. If we know the covariance matrix \Gamma, it is possible to find a transformation matrix A, such that A\Gamma A' = \sigma^2 I, where I denotes the n \times n identity matrix. Then, the underlying model can be transformed into

    Ay = AZ\beta + Ax = U\beta + w,

where U = AZ and w is a white noise vector with covariance matrix \sigma^2 I, as in §2.2. Then, applying least squares or maximum likelihood to the vector Ay gives

    \hat\beta_w = (U'U)^{-1} U'Ay = (Z'A'AZ)^{-1} Z'A'Ay = (Z'\Gamma^{-1}Z)^{-1} Z'\Gamma^{-1}y    (5.54)

because \sigma^2 \Gamma^{-1} = A'A.

The difficulty in applying (5.54) is that we do not know the form of the matrix \Gamma. It may be possible, however, in the time series case, to assume a stationary covariance structure for the error process x_t that corresponds to a linear process and try to find an ARMA representation for x_t. For example, if we have a pure AR(p) error, then

    \phi(B) x_t = w_t,

and \phi(B) is the linear transformation that, when applied to the error process, produces the white noise w_t. Regarding this transformation as the appropriate matrix A of the preceding paragraph produces the transformed regression equation

    \phi(B) y_t = \beta' \phi(B) z_t + w_t,

and we are back to the same model as before. Defining u_t = \phi(B) y_t and v_t = \phi(B) z_t leads to the simple regression problem

    u_t = \beta' v_t + w_t    (5.55)


considered before. The preceding discussion suggests an algorithm, due to Cochrane and Orcutt (1949), for fitting a regression model with autocorrelated errors.

(i) First, run an ordinary regression of y_t on z_t (acting as if the errors are uncorrelated). Retain the residuals.

(ii) Fit an ARMA model to the residuals \hat x_t = y_t - \hat\beta' z_t, say,

    \phi(B) \hat x_t = \theta(B) w_t.    (5.56)

(iii) Then, apply the ARMA transformation to both sides of (5.52), that is,

    u_t = \frac{\phi(B)}{\theta(B)} y_t  \quad and \quad  v_t = \frac{\phi(B)}{\theta(B)} z_t,

to obtain the transformed regression model (5.55).

(iv) Run an ordinary least squares regression, assuming uncorrelated errors, on the transformed model (5.55), obtaining

    \hat\beta_w = (V'V)^{-1} V'u,    (5.57)

where V = [v_1, \ldots, v_n]' and u = (u_1, \ldots, u_n)' are the corresponding transformed components.

The above procedure can be repeated until convergence and will approach the maximum likelihood solution under normality of the errors (for details, see Sargan, 1964).
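The following R sketch outlines one pass of this procedure for the special case of pure AR(p) errors; the helper name co.fit, and the assumption that the regressor matrix Z already contains an intercept column, are our own illustrative choices.

  # Sketch of one pass of steps (i)-(iv) above for pure AR(p) errors;
  # y is the response vector and Z the regressor matrix (intercept included).
  co.fit = function(y, Z, p = 1) {
    fit0  = lm(y ~ Z - 1)                                      # step (i)
    arfit = ar.mle(resid(fit0), order.max = p, aic = FALSE)    # step (ii)
    phi   = arfit$ar
    trans = function(v) filter(v, c(1, -phi), method = "convolution", sides = 1)
    u  = trans(y)                                              # step (iii): u_t = phi(B) y_t
    V  = apply(Z, 2, trans)                                    #             v_t = phi(B) z_t
    ok = !is.na(u)                                             # drop the first p values
    lm(u[ok] ~ V[ok, ] - 1)                                    # step (iv)
  }

Repeating the call, refitting the AR model to the new residuals each time, approximates the maximum likelihood solution mentioned above.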

Example 5.6 Pollution, Temperature, Mortality with Correlated Errors

We consider further the best regression obtained in Example 2.2 of Chapter 2, relating adjusted temperature T_t - T_\cdot, (T_t - T_\cdot)^2, and particulate levels P_t to cardiovascular mortality M_t. Identifying the vector

    z_t = (1, t, (T_t - T_\cdot), (T_t - T_\cdot)^2, P_t)'

leads to a model of the form (5.52). Taking the residuals from the least squares regression, as described in Step (i), the sample ACF and PACF, shown in Figure 5.9, suggest an AR(2) model for the residuals. Note, \hat\sigma^2 = 40.77 and R^2 = .59 for this model.


Figure 5.9 Sample ACF and PACF of the mortality residuals indicating an AR(2) process.

For the residuals, we obtain a second-order autoregressive model with operator

    \hat\phi(B) = 1 - .2207B - .3627B^2,

which is applied to both sides of the defining equation (5.52) to produce the transformed equation (5.55), as in Step (iii) above. Running the regression, as in Step (iv), yields the model

    \hat M_t = 83.54 - .028_{(.004)} t - .196_{(.039)} (T_t - 74.6) + .017_{(.002)} (T_t - 74.6)^2 + .229_{(.023)} P_t

as the model for transformed mortality, where the coefficients and estimated variances have changed slightly because of the transformation. The linear temperature component has decreased in magnitude from -.473 to -.196, whereas the other components stayed almost the same. The new residuals from the transformed model have sample ACF and PACF, shown in Figure 5.10, with no prominent peaks, and can probably be taken as white noise.

5.6 Lagged Regression: Transfer Function Modeling

In §4.10, we considered lagged regression in a frequency domain approach based on coherency. In this section we focus on a time domain approach to the same


Figure 5.10 Sample ACF and PACF of the mortality residuals after fitting an AR(2) model.

problem. In the previous section, we looked at autocorrelated errors but still regarded the input series z_t as being fixed unknown functions of time. This consideration made sense for the time argument t, but was less satisfactory for the other inputs, which are probably stochastic processes. For example, consider the SOI and Recruitment series that were presented in Example 1.5. The series are displayed in Figure 1.5. In this case, the interest is in predicting the output Recruitment series, say, y_t, from the input SOI, say x_t. We might consider the lagged regression model

    y_t = \sum_{j=0}^{\infty} \alpha_j x_{t-j} + \eta_t = \alpha(B) x_t + \eta_t,    (5.58)

where \sum_j |\alpha_j| < \infty. We assume the input process x_t and noise process \eta_t in (5.58) are both stationary and mutually independent. The coefficients \alpha_0, \alpha_1, \ldots describe the weights assigned to past values of x_t used in predicting y_t, and we have used the notation

    \alpha(B) = \sum_{j=0}^{\infty} \alpha_j B^j.    (5.59)

In the Box and Jenkins (1970) formulation, we assign ARIMA models, say, ARIMA(p, d, q) and ARIMA(p_\eta, d_\eta, q_\eta), to the series x_t and \eta_t, respectively. The components of (5.58) in backshift notation, for the case of simple ARMA(p, q) modeling of the input and noise, would have the representation

    \phi(B) x_t = \theta(B) w_t    (5.60)

and

    \phi_\eta(B) \eta_t = \theta_\eta(B) z_t,    (5.61)

where w_t and z_t are independent white noise processes with variances \sigma_w^2 and \sigma_z^2, respectively. Box and Jenkins (1970) proposed that systematic patterns often observed in the coefficients \alpha_j, for j = 1, 2, \ldots, could often be expressed as a ratio of polynomials involving a small number of coefficients, along with a specified delay, d, so

    \alpha(B) = \frac{\delta(B) B^d}{\omega(B)},    (5.62)

where

    \omega(B) = 1 - \omega_1 B - \omega_2 B^2 - \cdots - \omega_r B^r    (5.63)

and

    \delta(B) = \delta_0 + \delta_1 B + \cdots + \delta_s B^s    (5.64)

are the indicated operators; in this section, we find it convenient to represent the inverse of an operator, say, [\omega(B)]^{-1}, as 1/\omega(B).

Determining a parsimonious model involving a simple form for \alpha(B) and estimating all of the parameters in the above model are the main tasks in the transfer function methodology. Because of the large number of parameters, it is necessary to develop a sequential methodology. Suppose we focus first on finding the ARIMA model for the input x_t and apply this operator to both sides of (5.58), obtaining the new model

    \tilde y_t = \frac{\phi(B)}{\theta(B)} y_t = \alpha(B) w_t + \frac{\phi(B)}{\theta(B)} \eta_t = \alpha(B) w_t + \tilde\eta_t,

where w_t and the transformed noise \tilde\eta_t are independent.

The series w_t is a prewhitened version of the input series, and its cross-correlation with the transformed output series \tilde y_t will be just

    \gamma_{\tilde y w}(h) = E[\tilde y_{t+h} w_t] = E\Big[\sum_{j=0}^{\infty} \alpha_j w_{t+h-j} w_t\Big] = \sigma_w^2 \alpha_h,    (5.65)

because the autocovariance function of white noise will be zero except when j = h in (5.65). Hence, computing the cross-correlation between the prewhitened input series and the transformed output series should yield a rough estimate of the behavior of \alpha(B).
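A short R sketch of this prewhitening step, assuming detrended series soi and rec are available as in the text's other examples, might look as follows; the choice of ar.mle for the input model and the lag window are our own.

  # Sketch: prewhiten the input with a fitted AR(1) and cross-correlate it
  # with the similarly filtered output, as suggested above.
  fit = ar.mle(soi, order.max = 1, aic = FALSE)     # AR(1) for the input series
  phi = fit$ar
  soi.pw = filter(soi, c(1, -phi), sides = 1)       # prewhitened input, w_t
  rec.fl = filter(rec, c(1, -phi), sides = 1)       # transformed output, y~_t
  ccf(as.numeric(soi.pw)[-1], as.numeric(rec.fl)[-1], lag.max = 24)

The resulting CCF, up to the scaling by \sigma_w^2 in (5.65), is the rough estimate of the \alpha_j weights used in Example 5.7.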


Figure 5.11 Sample ACF and PACF of SOI.

Figure 5.12 Sample CCF of the prewhitened, detrended SOI and the similarly transformed Recruitment series; negative lags indicate that SOI leads Recruitment.

Example 5.7 Relating the Prewhitened SOI to the Transformed Recruitment Series

We give a simple example of the suggested procedure for the SOI and the Recruitment series. Figure 5.11 shows the sample ACF and PACF of the detrended SOI index, and it is clear, from the PACF, that an


autoregressive series with p = 1 will do a reasonable job. Fitting the series gave \hat\phi = .589 and \hat\sigma_w^2 = .092, and we applied the operator (1 - .589B) to both x_t and y_t and computed the cross-correlation function, which is shown in Figure 5.12. Noting the apparent shift of d = 5 months and the exponential decrease thereafter, it seems plausible to hypothesize a model of the form

    \alpha(B) = \delta_0 B^5 (1 + \omega_1 B + \omega_1^2 B^2 + \cdots) = \frac{\delta_0 B^5}{1 - \omega_1 B}

for the transfer function. In this case, because the cross-correlations are negative, we would expect \delta_0 to be negative.

In some cases, we may postulate the form of the separate components \delta(B) and \omega(B), so we might write the equation

    y_t = \frac{\delta(B) B^d}{\omega(B)} x_t + \eta_t

as

    \omega(B) y_t = \delta(B) B^d x_t + \omega(B) \eta_t,

or in regression form

    y_t = \sum_{k=1}^{r} \omega_k y_{t-k} + \sum_{k=0}^{s} \delta_k x_{t-d-k} + u_t,    (5.66)

where

    u_t = \omega(B) \eta_t.    (5.67)

The form of (5.66) suggests doing a regression on the lagged versions of both the input and output series to obtain \hat\beta, the estimate of the (r + s + 1) \times 1 regression vector

    \beta = (\omega_1, \ldots, \omega_r, \delta_0, \delta_1, \ldots, \delta_s)'.

The residuals from the regression above, say,

    \hat u_t = y_t - \hat\beta' z_t,

where

    z_t = (y_{t-1}, \ldots, y_{t-r}, x_{t-d}, \ldots, x_{t-d-s})'

denotes the usual vector of independent variables, could be used to approximate the best ARMA model for the noise process \eta_t, because we can compute an estimator for that process from (5.67), using \hat u_t and \hat\omega(B) and applying the moving average operator to get \hat\eta_t. Fitting an ARMA(p_\eta, q_\eta) model to this estimated noise then completes the specification. The preceding suggests the following sequential procedure for fitting the transfer function model to data.


(i) Fit an ARMA model to the input series x_t to estimate the parameters \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma_w^2 in the specification (5.60). Retain the ARMA coefficients for use in Step (ii) and the fitted residuals \hat w_t for use in Step (iii).

(ii) Apply the operator determined in Step (i), that is,

    \hat\phi(B) y_t = \hat\theta(B) \tilde y_t,

to determine the transformed output series \tilde y_t.

(iii) Use the cross-correlation function between \tilde y_t and \hat w_t from Steps (i) and (ii) to suggest a form for the components of the polynomial

    \alpha(B) = \frac{\delta(B) B^d}{\omega(B)}

and the estimated time delay d.

(iv) Obtain \hat\beta = (\hat\omega_1, \ldots, \hat\omega_r, \hat\delta_0, \hat\delta_1, \ldots, \hat\delta_s)' by fitting a linear regression of the form (5.66). Retain the residuals \hat u_t for use in Step (v).

(v) Apply the moving average transformation (5.67) to the residuals \hat u_t to find the noise series \hat\eta_t, and fit an ARMA model to the noise, obtaining the estimated coefficients in \hat\phi_\eta(B) and \hat\theta_\eta(B).

The above procedure is fairly reasonable, but does not have any recognizable overall optimality. Simultaneous least squares estimation, based on the observed x_t and y_t, can be accomplished by noting that the transfer function model can be written as

    y_t = \frac{\delta(B) B^d}{\omega(B)} x_t + \frac{\theta_\eta(B)}{\phi_\eta(B)} z_t,

which can be put in the form

    \omega(B) \phi_\eta(B) y_t = \phi_\eta(B) \delta(B) B^d x_t + \omega(B) \theta_\eta(B) z_t,    (5.68)

and it is clear that we may use least squares to minimize \sum_t z_t^2, as in earlier sections. We may also express the transfer function in state-space form (see Brockwell and Davis, 1991, Chapter 12). It is often easier to fit a transfer function model in the spectral domain as presented in §4.10.

Example 5.8 Transfer Function Model for the SOI and Recruitment Series

We illustrate the procedure for fitting a transfer function model of the form suggested in Example 5.7 to the detrended SOI series (x_t) and the detrended Recruitment series (y_t). The results reported here can be compared with the results obtained from the frequency domain approach used in Example 4.23. Note first that Steps (i)-(iii) have already been applied to determine the ARMA model

    (1 - .589B) x_t = w_t,

where \hat\sigma_w^2 = .092. Using the model determined in Example 5.7, we run the regression

    y_t = \omega_1 y_{t-1} + \delta_0 x_{t-5} + u_t,

yielding \hat\omega_1 = .848 and \hat\delta_0 = -20.54, where the residuals satisfy

    \hat u_t = (1 - .848B) \hat\eta_t.

This completes Step (iv). To complete the specification, we apply the moving average operator above to estimate the original noise series \hat\eta_t and fit a second-order autoregressive model, based on the ACF and PACF shown in Figure 5.13. We obtain

    (1 - 1.255B + .410B^2) \hat\eta_t = \hat z_t,

with \hat\sigma_z^2 = 52.46 as the estimated error variance.

Figure 5.13 ACF and PACF of the estimated noise \hat\eta_t departures from the transfer function model.
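For readers who wish to reproduce Steps (iv) and (v) numerically, the following R sketch carries out the regression (5.66) with r = 1, s = 0, d = 5 and then recovers and models the noise; it assumes the detrended soi and rec series are available, and the lag alignment and omission of an intercept (as in the regression above) are our own construction.

  # Sketch of Steps (iv) and (v) for the SOI/Recruitment model of Example 5.7
  d = 5
  n = length(rec)
  y  = rec[(d+2):n]                        # y_t
  y1 = rec[(d+1):(n-1)]                    # y_{t-1}
  x5 = soi[2:(n-d)]                        # x_{t-5}
  fit = lm(y ~ y1 + x5 - 1)                # step (iv): regression (5.66)
  omega1 = coef(fit)[1]; delta0 = coef(fit)[2]
  u = resid(fit)
  # step (v): recover eta_t from u_t = (1 - omega1 B) eta_t, then fit an AR(2)
  eta = filter(u, filter = omega1, method = "recursive")
  ar.mle(eta, order.max = 2, aic = FALSE)

The recursive filter simply inverts the moving average operator (5.67) so that an ordinary AR fit can be applied to the estimated noise.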


5.7 Multivariate ARMAX Models

To understand multivariate time series models and their capabilities, we first present an introduction to multivariate time series regression techniques. A useful extension of the basic univariate regression model presented in §2.2 is the case in which we have more than one output series, that is, multivariate regression analysis. Suppose, instead of a single output variable y_t, a collection of k output variables y_{t1}, y_{t2}, \ldots, y_{tk} exist that are related to the inputs as

    y_{ti} = \beta_{i1} z_{t1} + \beta_{i2} z_{t2} + \cdots + \beta_{ir} z_{tr} + w_{ti}    (5.69)

for each of the i = 1, 2, \ldots, k output variables. We assume the w_{ti} variables are correlated over the variable identifier i, but are still independent over time. Formally, we assume cov\{w_{si}, w_{tj}\} = \sigma_{ij} for s = t, and is zero otherwise. Then, writing (5.69) in matrix notation, with y_t = (y_{t1}, y_{t2}, \ldots, y_{tk})' being the vector of outputs, and B = \{\beta_{ij}\}, i = 1, \ldots, k, j = 1, \ldots, r, being the k \times r matrix containing the regression coefficients, leads to the simple looking form

    y_t = B z_t + w_t.    (5.70)

Here, the k \times 1 vector process w_t is assumed to be a collection of independent vectors with common covariance matrix E\{w_t w_t'\} = \Sigma_w, the k \times k matrix containing the covariances \sigma_{ij}. The maximum likelihood estimator, under the assumption of normality, for the regression matrix in this case is

    \hat B = Y'Z(Z'Z)^{-1},    (5.71)

where Z' = [z_1, z_2, \ldots, z_n] is as before and Y' = [y_1, y_2, \ldots, y_n]. The error covariance matrix \Sigma_w is estimated by

    \hat\Sigma_w = \frac{1}{n - r} \sum_{t=1}^{n} (y_t - \hat B z_t)(y_t - \hat B z_t)'.    (5.72)

The uncertainty in the estimators can be evaluated from

    se(\hat\beta_{ij}) = \sqrt{\hat\sigma_{jj} c_{ii}},    (5.73)

for i = 1, \ldots, r and j = 1, \ldots, k, where se denotes estimated standard error, \hat\sigma_{jj} is the j-th diagonal element of \hat\Sigma_w, and c_{ii} is the i-th diagonal element of (\sum_{t=1}^{n} z_t z_t')^{-1}.

Also, the information theoretic criterion changes to

    AIC = \ln|\hat\Sigma_w| + \frac{2}{n}\Big(kr + \frac{k(k+1)}{2}\Big),    (5.74)

and SIC replaces the second term in (5.74) by K \ln n / n, where K = kr + k(k+1)/2. Bedrick and Tsai (1994) have given a corrected form for AIC in the multivariate case as

    AICc = \ln|\hat\Sigma_w| + \frac{k(r + n)}{n - k - r - 1}.    (5.75)
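The estimators (5.71)-(5.74) are easy to compute directly. The following R sketch does so; the helper name mvreg and the assumption that Z already includes a column of ones are our own choices, not part of the text.

  # Sketch: multivariate regression estimates following (5.71)-(5.74);
  # Y is n x k (outputs), Z is n x r (inputs, including a column of ones).
  mvreg = function(Y, Z) {
    n = nrow(Y); k = ncol(Y); r = ncol(Z)
    B   = t(Y) %*% Z %*% solve(t(Z) %*% Z)            # (5.71), k x r
    E   = Y - Z %*% t(B)                              # residual matrix
    Sig = t(E) %*% E / (n - r)                        # (5.72)
    C   = solve(t(Z) %*% Z)
    SE  = t(sqrt(outer(diag(C), diag(Sig))))          # se(B_ij) as in (5.73), aligned with B
    AIC = log(det(Sig)) + (2/n) * (k*r + k*(k+1)/2)   # (5.74)
    list(B = B, Sigma = Sig, SE = SE, AIC = AIC)
  }

Calling mvreg(Y, cbind(1, X)) on a data matrix of outputs Y and inputs X reproduces the quantities used in the examples that follow.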


Many data sets involve more than one time series, and we are often interested in the possible dynamics relating all series. In this situation, we are interested in modeling and forecasting k \times 1 vector-valued time series x_t = (x_{t1}, \ldots, x_{tk})', t = 0, \pm 1, \pm 2, \ldots. Unfortunately, extending univariate ARMA models to the multivariate case is not so simple. The multivariate autoregressive model, however, is a straightforward extension of the univariate AR model.

For the first-order vector autoregressive model, VAR(1), we take

    x_t = \alpha + \Phi x_{t-1} + w_t,    (5.76)

where \Phi is a k \times k transition matrix that expresses the dependence of x_t on x_{t-1}. The vector white noise process w_t is assumed to be multivariate normal with mean zero and covariance matrix

    E(w_t w_t') = \Sigma_w.    (5.77)

The vector \alpha = (\alpha_1, \alpha_2, \ldots, \alpha_k)' appears as the constant in the regression setting. If E(x_t) = \mu, then \alpha = (I - \Phi)\mu.

Note the similarity between the VAR model and the multivariate linear regression model (5.70). The regression formulas carry over, and we can, on observing x_1, \ldots, x_n, set up the model (5.76) with y_t = x_t, B = (\alpha, \Phi) and z_t = (1, x_{t-1}')'. Then, write the solution as (5.71) with the conditional maximum likelihood estimator for the covariance matrix given by

    \hat\Sigma_w = (n - 1)^{-1} \sum_{t=2}^{n} (x_t - \hat\alpha - \hat\Phi x_{t-1})(x_t - \hat\alpha - \hat\Phi x_{t-1})'.    (5.78)

Example 5.9 Pollution, Weather, and Mortality

For example, for the three-dimensional series composed of detrended cardiovascular mortality x_{t1}, temperature x_{t2}, and particulate levels x_{t3}, introduced in Example 2.2, take x_t = (x_{t1}, x_{t2}, x_{t3})' as a vector of dimension k = 3. We might envision dynamic relations among the three series defined as the first-order relation,

    x_{t1} = \alpha_1 + \phi_{11} x_{t-1,1} + \phi_{12} x_{t-1,2} + \phi_{13} x_{t-1,3} + w_{t1},

which expresses the current value of mortality as a linear combination of its immediate past value and the past values of temperature and particulate levels. Similarly,

    x_{t2} = \alpha_2 + \phi_{21} x_{t-1,1} + \phi_{22} x_{t-1,2} + \phi_{23} x_{t-1,3} + w_{t2}

and

    x_{t3} = \alpha_3 + \phi_{31} x_{t-1,1} + \phi_{32} x_{t-1,2} + \phi_{33} x_{t-1,3} + w_{t3}


express the dependence of temperature and particulate levels on the other series. Of course, methods for the preliminary identification of these models exist, and we will discuss these methods shortly.

For this particular case, we obtain \hat\alpha = (-4.57, 6.09, 19.78)' and

    \hat\Phi =
         .47(.04)   -.36(.03)    .10(.02)
        -.24(.04)    .49(.04)   -.13(.02)
        -.13(.08)   -.48(.07)    .58(.04)

where the standard errors, computed as in (5.73), are given in parentheses. Hence, for the vector (x_{t1}, x_{t2}, x_{t3}) = (M_t, T_t, P_t), with M_t, T_t and P_t denoting mortality, temperature, and particulate level, respectively, we obtain the prediction equation for mortality,

    \hat M_t = -4.57 + .47 M_{t-1} - .36 T_{t-1} + .10 P_{t-1}.

Comparing observed and predicted mortality with this model leads to an R^2 of about .78, whereas the regression model fitted by the method of Example 2.2 gave R^2 = .69.

It is easy to extend the VAR(1) process to higher orders, VAR(p). To do this, we use the notation of (5.70) and write the vector of regressors as

    z_t = (1, x_{t-1}', x_{t-2}', \ldots, x_{t-p}')'

and the regression matrix as B = (\alpha, \Phi_1, \Phi_2, \ldots, \Phi_p). Then, this regression model can be written as

    x_t = \alpha + \sum_{j=1}^{p} \Phi_j x_{t-j} + w_t    (5.79)

for t = p + 1, \ldots, n. The k \times k error sum of products matrix becomes

    RSP = \sum_{t=p+1}^{n} (x_t - \hat B z_t)(x_t - \hat B z_t)',    (5.80)

so that the conditional maximum likelihood estimator for the error covariance matrix \Sigma_w is

    \hat\Sigma_w = RSP/(n - p),    (5.81)

as in the multivariate regression case, except now only n - p residuals exist in (5.80). For the multivariate case, we have found that the Schwarz criterion

    SIC = \log|\hat\Sigma_w| + k^2 p \ln n / n    (5.82)

gives more reasonable classifications than either AIC or the corrected version AICc. The result is consistent with those reported in simulations by Lutkepohl (1985).
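A sketch of the calculations behind a comparison like Table 5.1 is given below; it fits successive VAR(p) models by conditional least squares and evaluates (5.82), assuming x is an n x k data matrix (the helper name var.sic is our own).

  # Sketch: conditional least squares VAR(p) fits and SIC, as in (5.79)-(5.82)
  var.sic = function(x, pmax = 5) {
    n = nrow(x); k = ncol(x)
    out = NULL
    for (p in 1:pmax) {
      Z = embed(x, p + 1)                        # rows: (x_t, x_{t-1}, ..., x_{t-p})
      Y = Z[, 1:k]                               # responses x_t
      X = cbind(1, Z[, -(1:k)])                  # regressors 1, x_{t-1}, ..., x_{t-p}
      B = solve(t(X) %*% X, t(X) %*% Y)          # regression estimates
      E = Y - X %*% B
      Sig = t(E) %*% E / (n - p)                 # (5.81)
      sic = log(det(Sig)) + k^2 * p * log(n)/n   # (5.82)
      out = rbind(out, c(p = p, SIC = sic))
    }
    out
  }

Running var.sic on the detrended mortality, temperature, and particulate data would reproduce the kind of order comparison summarized in Table 5.1.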


Table 5.1 Summary Statistics for Example 5.10

  Order (p)    k²p     |Σ̂w|      SIC     AICc
      1        505    118,520   11.79   14.71
      2        503     74,708   11.44   14.26
      3        501     70,146   11.49   14.21
      4        499     65,268   11.53   14.15
      5        497     59,684   11.55   14.08

Example 5.10 Mortality, Pollution and Temperature Data

A trivariate AR(2) model for the data in Example 5.9 yields

    \hat\Phi_1 =
         .30(.04)   -.20(.04)    .04(.02)
        -.11(.05)    .26(.05)   -.05(.03)
         .08(.09)   -.39(.09)    .39(.05)

    \hat\Phi_2 =
         .28(.04)   -.08(.04)    .07(.03)
        -.04(.05)    .36(.05)   -.09(.03)
        -.33(.09)    .05(.09)    .38(.05)

In Table 5.1, fitting successively higher order models beyond p = 2 does not improve the value of SIC, and we would tend to settle on the second-order model. Note that the value of AICc continues to decrease as the model order increases.

A k \times 1 vector-valued time series x_t, for t = 0, \pm 1, \pm 2, \ldots, is said to be VARMA(p, q) if x_t is stationary and

    x_t = \alpha + \Phi_1 x_{t-1} + \cdots + \Phi_p x_{t-p} + w_t + \Theta_1 w_{t-1} + \cdots + \Theta_q w_{t-q},    (5.83)

with \Phi_p \ne 0, \Theta_q \ne 0, and \Sigma_w > 0 (that is, \Sigma_w is positive definite). The coefficient matrices \Phi_j, j = 1, \ldots, p, and \Theta_j, j = 1, \ldots, q, are, of course, k \times k matrices. If x_t has mean \mu, then \alpha = (I - \Phi_1 - \cdots - \Phi_p)\mu. As in the univariate case, we will have to place a number of conditions on the multivariate ARMA model to ensure the model is unique and has desirable properties such as causality. These conditions will be discussed shortly.

The special form assumed for the constant component, \alpha, of the vector ARMA model in (5.83) can be generalized to include a fixed r \times 1 vector of inputs, u_t. That is, we could have proposed the vector ARMAX model,

    x_t = \Gamma u_t + \sum_{j=1}^{p} \Phi_j x_{t-j} + \sum_{k=1}^{q} \Theta_k w_{t-k} + w_t,    (5.84)

where \Gamma is a k \times r parameter matrix. The X in ARMAX refers to the exogenous vector process we have denoted here by u_t. The introduction of exogenous


variables through replacing \alpha by \Gamma u_t does not present any special problems in making inferences. For example, the case of the ARX model, that is, q = 0 in (5.84), can be estimated using standard regression results. In this case, the model can be written as a multivariate regression model in which the vector of regressors is

    z_t = (u_t', x_{t-1}', \ldots, x_{t-p}')'    (5.85)

and the new regression matrix is

    B = [\Gamma, \Phi_1, \Phi_2, \ldots, \Phi_p].    (5.86)

The general VARMA model, (5.83), is a special case of the vector ARMAX model, (5.84), with r = 1, u_t = 1, and \Gamma = \alpha.

As previously indicated, extending univariate AR (or pure MA) models to the vector case is fairly easy, but extending univariate ARMA models to the multivariate case is not a simple matter. Our discussion will be brief, but interested readers can get more details in Lutkepohl (1993), Reinsel (1997), and Tiao and Tsay (1989).

In the multivariate case, the autoregressive operator is

    \Phi(B) = I - \Phi_1 B - \cdots - \Phi_p B^p,    (5.87)

and the moving average operator is

    \Theta(B) = I + \Theta_1 B + \cdots + \Theta_q B^q.    (5.88)

The zero-mean VARMA(p, q) model is then written in the concise form as

    \Phi(B) x_t = \Theta(B) w_t.    (5.89)

The model is said to be causal if the roots of |\Phi(z)| (where |\cdot| denotes determinant) are outside the unit circle, |z| > 1; that is, |\Phi(z)| \ne 0 for any value z such that |z| \le 1. In this case, we can write

    x_t = \Psi(B) w_t,

where \Psi(B) = \sum_{j=0}^{\infty} \Psi_j B^j, \Psi_0 = I, and \sum_{j=0}^{\infty} \|\Psi_j\| < \infty. The model is said to be invertible if the roots of |\Theta(z)| lie outside the unit circle. Then, we can write

    w_t = \Pi(B) x_t,

where \Pi(B) = \sum_{j=0}^{\infty} \Pi_j B^j, \Pi_0 = I, and \sum_{j=0}^{\infty} \|\Pi_j\| < \infty. Analogous to the univariate case, we can determine the matrices \Psi_j by solving \Psi(z) = \Phi(z)^{-1}\Theta(z), |z| \le 1, and the matrices \Pi_j by solving \Pi(z) = \Theta(z)^{-1}\Phi(z), |z| \le 1.
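The matrices \Psi_j can also be computed recursively from the relation \Phi(B)\Psi(B) = \Theta(B), which gives \Psi_j = \Theta_j + \sum_{i=1}^{\min(j,p)} \Phi_i \Psi_{j-i} with \Psi_0 = I and \Theta_j = 0 for j > q. The R sketch below implements this recursion; the helper name and the example coefficient values are illustrative assumptions only.

  # Sketch: first few Psi-weight matrices of a causal VARMA model.
  # Phi and Theta are lists of k x k coefficient matrices.
  psi.weights = function(Phi, Theta, J = 10) {
    k = nrow(Phi[[1]]); p = length(Phi); q = length(Theta)
    Psi = vector("list", J + 1)
    Psi[[1]] = diag(k)                               # Psi_0 = I
    for (j in 1:J) {
      A = if (j <= q) Theta[[j]] else matrix(0, k, k)
      for (i in 1:min(j, p))
        A = A + Phi[[i]] %*% Psi[[j - i + 1]]        # add Phi_i Psi_{j-i}
      Psi[[j + 1]] = A
    }
    Psi
  }
  # Example: the bivariate AR(1) of (5.94) with phi = .5 (an arbitrary value)
  Phi = list(matrix(c(0, 0, .5, 0), 2, 2)); Theta = list(matrix(0, 2, 2))
  psi.weights(Phi, Theta, J = 3)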

For a causal model, we can write x_t = \Psi(B) w_t, so the general autocovariance structure of an ARMA(p, q) model is

    \Gamma(h) = cov(x_{t+h}, x_t) = E(x_{t+h} x_t') = \sum_{j=0}^{\infty} \Psi_{j+h} \Sigma_w \Psi_j'.    (5.90)


Note, \Gamma(-h) = \Gamma(h)', so we will only exhibit the autocovariances for h \ge 0. For pure MA(q) processes, (5.90) becomes

    \Gamma(h) = \sum_{j=0}^{q-h} \Theta_{j+h} \Sigma_w \Theta_j',    (5.91)

where \Theta_0 = I. Of course, (5.91) implies \Gamma(h) = 0 for h > q. For pure AR(p) models, the autocovariance structure leads to the multivariate version of the Yule–Walker equations:

    \Gamma(h) = \sum_{j=1}^{p} \Phi_j \Gamma(h - j), \quad h = 1, 2, \ldots,    (5.92)

    \Gamma(0) = \sum_{j=1}^{p} \Phi_j \Gamma(-j) + \Sigma_w.    (5.93)

As in the univariate case, we will need conditions for model uniqueness. These conditions are similar to the condition in the univariate case that the autoregressive and moving average polynomials have no common factors. To explore the uniqueness problems that we encounter with multivariate ARMA models, consider a bivariate AR(1) process, x_t = (x_{t,1}, x_{t,2})', given by

    x_{t,1} = \phi x_{t-1,2} + w_{t,1},
    x_{t,2} = w_{t,2},

where w_{t,1} and w_{t,2} are independent white noise processes and |\phi| < 1. Both processes, x_{t,1} and x_{t,2}, are causal and invertible. Moreover, the processes are jointly stationary because cov(x_{t+h,1}, x_{t,2}) = \phi\,cov(x_{t+h-1,2}, x_{t,2}) \equiv \phi\,\gamma_{2,2}(h-1) = \phi \sigma_{w_2}^2 \delta_1^h does not depend on t; note, \delta_1^h = 1 when h = 1, otherwise \delta_1^h = 0.

In matrix notation, we can write this model as

    x_t = \Phi x_{t-1} + w_t,    (5.94)

where

    \Phi = [ 0  \phi ; 0  0 ].

We can write (5.94) in operator notation as \Phi(B) x_t = w_t, where

    \Phi(z) = [ 1  -\phi z ; 0  1 ].

In addition, model (5.94) can be written as a bivariate ARMA(1,1) model

    x_t = \Phi_1 x_{t-1} + \Theta_1 w_{t-1} + w_t,    (5.95)


where

    \Phi_1 = [ 0  \phi+\theta ; 0  0 ]  and  \Theta_1 = [ 0  -\theta ; 0  0 ],

and \theta is arbitrary. To verify this, we write (5.95) as \Phi_1(B) x_t = \Theta_1(B) w_t, or

    \Theta_1(B)^{-1} \Phi_1(B) x_t = w_t,

where

    \Phi_1(z) = [ 1  -(\phi+\theta)z ; 0  1 ]  and  \Theta_1(z) = [ 1  -\theta z ; 0  1 ].

Then,

    \Theta_1(z)^{-1} \Phi_1(z) = [ 1  \theta z ; 0  1 ][ 1  -(\phi+\theta)z ; 0  1 ] = [ 1  -\phi z ; 0  1 ] = \Phi(z),

where \Phi(z) is the polynomial associated with the bivariate AR(1) model in (5.94). Because \theta is arbitrary, the parameters of the ARMA(1,1) model given in (5.95) are not identifiable. No problem exists, however, in fitting the AR(1) model given in (5.94).

The problem in the previous discussion was caused by the fact that both \Theta(B) and \Theta(B)^{-1} are finite; such a matrix operator is called unimodular. If U(B) is unimodular, |U(z)| is constant. It is also possible for two seemingly different multivariate ARMA(p, q) models, say, \Phi(B) x_t = \Theta(B) w_t and \Phi_*(B) x_t = \Theta_*(B) w_t, to be related through a unimodular operator, U(B), as \Phi_*(B) = U(B)\Phi(B) and \Theta_*(B) = U(B)\Theta(B), in such a way that the orders of \Phi(B) and \Theta(B) are the same as the orders of \Phi_*(B) and \Theta_*(B), respectively. For example, consider the bivariate ARMA(1,1) models given by

    \Phi(B) x_t \equiv [ 1  -\phi B ; 0  1 ] x_t = [ 1  \theta B ; 0  1 ] w_t \equiv \Theta(B) w_t

and

    \Phi_*(B) x_t \equiv [ 1  (\alpha-\phi)B ; 0  1 ] x_t = [ 1  (\alpha+\theta)B ; 0  1 ] w_t \equiv \Theta_*(B) w_t,

where \alpha, \phi, and \theta are arbitrary constants. Note,

    \Phi_*(B) \equiv [ 1  (\alpha-\phi)B ; 0  1 ] = [ 1  \alpha B ; 0  1 ][ 1  -\phi B ; 0  1 ] \equiv U(B)\Phi(B)

and

    \Theta_*(B) \equiv [ 1  (\alpha+\theta)B ; 0  1 ] = [ 1  \alpha B ; 0  1 ][ 1  \theta B ; 0  1 ] \equiv U(B)\Theta(B).

In this case, both models have the same infinite MA representation x_t = \Psi(B) w_t, where

    \Psi(B) = \Phi(B)^{-1}\Theta(B) = \Phi(B)^{-1}U(B)^{-1}U(B)\Theta(B) = \Phi_*(B)^{-1}\Theta_*(B).


This result implies the two models have the same autocovariance function \Gamma(h). Two such ARMA(p, q) models are said to be observationally equivalent.

As previously mentioned, in addition to requiring causality and invertibility, we will need some additional assumptions in the multivariate case to make sure that the model is unique. To ensure the identifiability of the parameters of the multivariate ARMA(p, q) model, we need the following two additional conditions: (i) the matrix operators \Phi(B) and \Theta(B) have no common left factors other than unimodular ones; that is, if \Phi(B) = U(B)\Phi_*(B) and \Theta(B) = U(B)\Theta_*(B), the common factor U(B) must be unimodular; and (ii) with q as small as possible and p as small as possible for that q, the matrix [\Phi_p, \Theta_q] must be of full rank, k. One suggestion for avoiding most of the aforementioned problems is to fit only vector AR(p) models in multivariate situations. Although this suggestion might be reasonable for many situations, this philosophy is not in accordance with the law of parsimony because we might have to fit a large number of parameters to describe the dynamics of a process.

Analogous to the univariate case, we can define a sequence of matrices, \Phi_{hh}, for h = 1, 2, \ldots, called the partial autoregression matrices at lag h. These matrices are obtained by solving the Yule–Walker equations of order h, namely,

    \Gamma(\ell) = \sum_{j=1}^{h} \Phi_{jh} \Gamma(\ell - j), \quad \ell = 1, 2, \ldots, h.    (5.96)

The partial autoregression matrices can be viewed as the result of successive AR(h) fits to the data; that is,

    x_t = \sum_{j=1}^{h} \Phi_{jh} x_{t-j} + w_t, \quad h = 1, 2, \ldots .    (5.97)

If the process is truly an AR(p), the partial autoregression matrices have the property that \Phi_{pp} = \Phi_p and \Phi_{hh} = 0 for h > p. Unlike the univariate case, however, the elements of these matrices are not partial correlations, or correlations of any kind. As in the univariate case, the \Phi_{hh} can be obtained iteratively using a multivariate extension of the Durbin–Levinson algorithm; details can be found in Reinsel (1997).
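A direct, if less efficient, way to compute \Phi_{hh} is simply to solve the Yule–Walker system (5.96) with estimated autocovariances. The R sketch below assumes a user-supplied function Gam(h) returning \hat\Gamma(h) (with Gam(-h) = t(Gam(h))), which is our own illustrative device, not an object defined in the text.

  # Sketch: partial autoregression matrix at lag h by solving (5.96)
  phi.hh = function(Gam, h, k) {
    G = matrix(0, k*h, k*h)                # block (j, l) holds Gamma(l - j)
    for (j in 1:h) for (l in 1:h)
      G[(k*(j-1)+1):(k*j), (k*(l-1)+1):(k*l)] = Gam(l - j)
    R   = do.call(cbind, lapply(1:h, Gam)) # [Gamma(1), ..., Gamma(h)], k x kh
    Phi = R %*% solve(G)                   # [Phi_1h, ..., Phi_hh]
    Phi[, (k*(h-1)+1):(k*h)]               # last block is Phi_hh
  }

Applying phi.hh for h = 1, 2, ... reproduces the successive AR(h) fits described in (5.97).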

The partial canonical correlations can be viewed as the multivariate extension of the PACF in the univariate case. In general, the first canonical correlation, \lambda_1, between the k_1 \times 1 random vector X_1 and the k_2 \times 1 random vector X_2, k_1 \le k_2, with variance–covariance matrices \Sigma_{11} and \Sigma_{22}, respectively, is the largest possible correlation between a linear combination of the components of X_1, say, \alpha'X_1, and a linear combination of the components of X_2, say, \beta'X_2, where \alpha is k_1 \times 1 and \beta is k_2 \times 1. That is,

    \lambda_1 = \max_{\alpha,\beta} corr(\alpha'X_1, \beta'X_2),

subject to the constraints var(\alpha'X_1) = \alpha'\Sigma_{11}\alpha = 1 and var(\beta'X_2) = \beta'\Sigma_{22}\beta = 1. If we let \Sigma_{ij} = cov(X_i, X_j), for i, j = 1, 2, then \lambda_1^2 is the largest eigenvalue of the matrix \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}; see Johnson and Wichern (1992, Chapter 10) for details. We call the solutions U_1 = \alpha_1'X_1 and V_1 = \beta_1'X_2 the first canonical variates, that is, \lambda_1 = corr(U_1, V_1), and \alpha_1 and \beta_1 are the coefficients of the linear combinations that maximize the correlation. In a similar fashion, the second canonical correlation, \lambda_2, is then the largest possible correlation between \alpha'X_1 and \beta'X_2 such that \alpha is orthogonal to \alpha_1 (that is, \alpha'\alpha_1 = 0) and \beta is orthogonal to \beta_1 (\beta'\beta_1 = 0). If we call the solutions U_2 = \alpha_2'X_1 and V_2 = \beta_2'X_2, then corr(U_1, U_2) = 0 = corr(V_1, V_2), corr(U_i, V_j) = 0 for i \ne j, and, by design, \lambda_1^2 \ge \lambda_2^2. Also, \lambda_2^2 is the second largest eigenvalue of \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}. Continuing this way, we obtain the squared canonical correlations 1 \ge \lambda_1^2 \ge \lambda_2^2 \ge \cdots \ge \lambda_{k_1}^2 \ge 0 as the ordered eigenvalues of \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}. The canonical correlations, \lambda_j, are typically taken to be nonnegative.

We can extend this idea to obtain partial canonical correlations between X_1 and X_2 given another random k_3 \times 1 vector X_3. Let \Sigma_{ij} = cov(X_i, X_j), for i, j = 1, 2, 3. The regression of X_1 on X_3 is \Sigma_{13}\Sigma_{33}^{-1}X_3, so that X_{1|3} = X_1 - \Sigma_{13}\Sigma_{33}^{-1}X_3 can be thought of as X_1 with the linear effects of X_3 removed (partialled out). Similarly, X_{2|3} = X_2 - \Sigma_{23}\Sigma_{33}^{-1}X_3 can be thought of as X_2 with the linear effects of X_3 partialled out. The partial variance–covariance matrices are \Sigma_{ij|3} = cov(X_{i|3}, X_{j|3}) = \Sigma_{ij} - \Sigma_{i3}\Sigma_{33}^{-1}\Sigma_{3j}, for i, j = 1, 2. The squared partial canonical correlations between X_1 and X_2 given X_3 are the ordered eigenvalues of \Sigma_{11|3}^{-1}\Sigma_{12|3}\Sigma_{22|3}^{-1}\Sigma_{21|3}.

For a stationary vector process x_t, the partial canonical correlations at lag h, for h = 2, 3, \ldots, denoted \lambda_1(h) \ge \lambda_2(h) \ge \cdots \ge \lambda_k(h) \ge 0, are defined to be the partial canonical correlations between x_h and x_0 with the effects of X = (x_{h-1}', \ldots, x_1')' removed. For ease of notation, we put r = h - 1. Let \Sigma_{00|X} = \Gamma(0) - \Gamma_1^{(r)}\Gamma_{r,r}^{-1}\Gamma_1^{(r)\prime}, where \Gamma_{r,r} = \{\Gamma(i-j)\}_{i,j=1}^{r} is a kr \times kr symmetric matrix and \Gamma_1^{(r)} = [\Gamma(r)', \Gamma(r-1)', \ldots, \Gamma(1)'] is k \times kr. Similarly, let \Sigma_{hh|X} = \Gamma(0) - \Gamma_r^{(1)}\Gamma_{r,r}^{-1}\Gamma_r^{(1)\prime}, where \Gamma_r^{(1)} = [\Gamma(1), \Gamma(2), \ldots, \Gamma(r)] is k \times kr. Also needed are \Sigma_{h0|X} = \Gamma(r) - \Gamma_r^{(1)}\Gamma_{r,r}^{-1}\Gamma_1^{(r)\prime} and \Sigma_{0h|X} = \Sigma_{h0|X}'. The squared partial canonical correlations, \lambda_j^2(h), j = 1, \ldots, k, at lag h, h = 2, 3, \ldots, are given by the ordered eigenvalues of \Sigma_{00|X}^{-1}\Sigma_{0h|X}\Sigma_{hh|X}^{-1}\Sigma_{h0|X}. The inversion of \Gamma_{r,r} will be a problem when h is large; see Reinsel (1997) for methods that avoid having to invert \Gamma_{r,r}. Finally, we define the partial canonical correlations between x_t and x_{t-1} to be the lag-one canonical correlations. In this case, \lambda_j^2(1), j = 1, \ldots, k, are the ordered eigenvalues of \Gamma(0)^{-1}\Gamma(1)\Gamma(0)^{-1}\Gamma(1)'.

Prediction and estimation for identifiable multivariate ARMA models follow analogously to the univariate case, except that, in the general case, the estimation of the coefficient parameters and \Sigma_w must be done simultaneously. Preliminary identification of the model uses the sample autocovariance matrices, the sample partial autoregression matrices, and the sample partial canonical correlations. We illustrate the techniques using the mortality data of Examples 2.2, 5.9, and 5.10.


Example 5.11 Identification, Estimation and Prediction for the Mortality Series

As in Example 5.10, we consider the trivariate series composed of detrended cardiovascular mortality x_{t1}, temperature x_{t2}, and particulate levels x_{t3}, and set x_t = (x_{t1}, x_{t2}, x_{t3})' as the three-dimensional data vector.

Estimation of the autocovariance matrix is similar to the univariate case, that is, with \bar x = n^{-1}\sum_{t=1}^{n} x_t as an estimate of \mu = E x_t,

    \hat\Gamma(h) = n^{-1} \sum_{t=1}^{n-h} (x_{t+h} - \bar x)(x_t - \bar x)', \quad h = 0, 1, 2, \ldots, n-1,    (5.98)

and \hat\Gamma(-h) = \hat\Gamma(h)'. If \hat\gamma_{i,j}(h) denotes the element in the i-th row and j-th column of \hat\Gamma(h), the cross-correlation functions (CCF), as discussed in (1.35), are estimated by

    \hat\rho_{i,j}(h) = \frac{\hat\gamma_{i,j}(h)}{\sqrt{\hat\gamma_{i,i}(0)}\sqrt{\hat\gamma_{j,j}(0)}}, \quad h = 0, 1, 2, \ldots, n-1.    (5.99)

When i = j in (5.99), we get the estimated autocorrelation function (ACF) of the individual series. The first six estimated autocovariance matrices, \hat\Gamma(h), h = 0, 1, \ldots, 5, are (we have rounded the entries to integers to ease the display):

    \hat\Gamma(0) =   79  -37   62        \hat\Gamma(1) =   56  -46   52
                     -37   81   -2                         -45   49  -45
                      62   -2  227                          44  -35  125

    \hat\Gamma(2) =   56  -42   62        \hat\Gamma(3) =   47  -42   59
                     -42   50  -48                         -41   44  -55     (5.100)
                      35  -20  136                          27  -18  123

    \hat\Gamma(4) =   44  -34   72        \hat\Gamma(5) =   38  -35   68
                     -39   46  -53                         -39   39  -67
                      16   -9  120                           7    3  104

Inspecting the autocovariance matrices, we find that mortality, x_{t1}, and temperature, x_{t2}, are negatively correlated at about the same strength for both positive and negative lags. The strongest cross-correlation occurs at lag \pm 1, where \hat\rho_{12}(-1) \approx -45/(\sqrt{79}\sqrt{81}) = -.56 and \hat\rho_{12}(1) \approx -46/(\sqrt{79}\sqrt{81}) = -.58. Also, mortality, x_{t1}, and particulates, x_{t3}, are positively correlated, the strongest correlation being when particulates lead mortality by about one month, \hat\rho_{13}(4) \approx 72/(\sqrt{79}\sqrt{227}) = .54. Finally, we note that particulates and temperature are negatively correlated, the strongest displayed value (which is approximately the strongest overall correlation between the two series) being when particulates lead temperature by about five weeks, \hat\rho_{23}(5) \approx -67/(\sqrt{81}\sqrt{227}) = -.49. The autocovariance matrices do not cut off at any small lag, and hence a pure moving average model is not indicated.

Replacing \Gamma(h) by \hat\Gamma(h) in (5.96), we can obtain estimates of the partial autoregression matrices. The first four estimated matrices are

    \hat\Phi_{11} =   .47  -.36   .10        \hat\Phi_{22} =   .27  -.08   .07
                     -.25   .49  -.13                         -.04   .35  -.09
                     -.12  -.48   .58                         -.33   .05   .38

    \hat\Phi_{33} =  -.04   .02  -.01        \hat\Phi_{44} =  -.04   .08   .06
                      .00   .11  -.03                         -.07   .17   .01
                     -.21   .07   .17                         -.26   .12   .13

As explained above (5.97), we can use (5.96) to estimate successive AR(h) models with parameter estimates \hat\Phi_j = \hat\Phi_{jh}, j = 1, \ldots, h, and h = 1, 2, \ldots. Note, \hat\Phi_{11} is practically the same as \hat\Phi in Example 5.9, and \hat\Phi_{22} is practically the same as \hat\Phi_2 in Example 5.10. The only difference in the estimates is that we are using Yule–Walker here, whereas regression was used in the other examples. These matrices contain small components after lag two, indicating an AR(2) relationship, although there is evidence of some relationship between mortality and particulates at lags of three and four weeks.

The estimated autocovariance matrices can also be used to obtain estimates of the partial canonical correlations. For example, to estimate the lag h = 3 partial canonical correlations, \{\lambda_1^2(3), \lambda_2^2(3), \lambda_3^2(3)\}, we would put

    \hat\Gamma_{22} = [ \hat\Gamma(0)  \hat\Gamma(1) ; \hat\Gamma(1)'  \hat\Gamma(0) ],    (5.101)

which represents, in this case, a 6 \times 6 matrix of the estimated autocovariances that were displayed in (5.100). In addition, we will need the matrices

    \hat\Gamma_1^{(2)} = [\hat\Gamma(2)', \hat\Gamma(1)']  and  \hat\Gamma_2^{(1)} = [\hat\Gamma(1), \hat\Gamma(2)],

which are both, in this example, 3 \times 6 matrices. From these matrices, we construct the 3 \times 3 matrices

    \hat\Sigma_{00|21} = \hat\Gamma(0) - \hat\Gamma_1^{(2)}\hat\Gamma_{22}^{-1}\hat\Gamma_1^{(2)\prime},

    \hat\Sigma_{33|21} = \hat\Gamma(0) - \hat\Gamma_2^{(1)}\hat\Gamma_{22}^{-1}\hat\Gamma_2^{(1)\prime},

and

    \hat\Sigma_{30|21} = \hat\Gamma(2) - \hat\Gamma_2^{(1)}\hat\Gamma_{22}^{-1}\hat\Gamma_1^{(2)\prime} = \hat\Sigma_{03|21}'.

Finally, the squared partial canonical correlations, \hat\lambda_j^2(3), for j = 1, 2, 3, are obtained as the ordered eigenvalues of \hat\Sigma_{00|21}^{-1}\hat\Sigma_{03|21}\hat\Sigma_{33|21}^{-1}\hat\Sigma_{30|21}.

In this example we obtain

    (\hat\lambda_1^2(h), \hat\lambda_2^2(h), \hat\lambda_3^2(h)) =  (.81, .24, .02)   h = 1
                                                                    (.22, .14, .06)   h = 2
                                                                    (.05, .01, .00)   h = 3
                                                                    (.05, .02, .00)   h = 4,

which also suggests an AR(2) model for the data.

In addition, successive Yule–Walker estimates, for h = 1, 2, \ldots, of the error variance–covariance matrix can be obtained from (5.93), that is,

    \hat\Sigma_w^{(h)} = \hat\Gamma(0) - \sum_{j=1}^{h} \hat\Phi_{jh}\hat\Gamma(-j).    (5.102)

For these data, we obtained (entries are rounded to integers)

    \hat\Sigma_w^{(1)} =   31    6   17        \hat\Sigma_w^{(2)} =   28    7   16
                            6   41   42                                7   37   40
                           17   42  144                               16   40  123

    \hat\Sigma_w^{(3)} =   28    7   16        \hat\Sigma_w^{(4)} =   27    6   14
                            7   37   40                                6   36   38
                           16   40  118                               14   38  114

The estimates stabilize (except perhaps for the variance of the particulate series) after h = 2, indicating the AR(3) and AR(4) fits do not improve much over the AR(2) fit. Recall that the comparison of the autoregressions of order one to five using SIC, as reported in Table 5.1, also indicated the AR(2) model.

At this point, we would settle on the AR(2) model estimated in Example 5.10 on the detrended data. We will write the estimated model as

    x_t = \hat\Phi_1 x_{t-1} + \hat\Phi_2 x_{t-2} + w_t,    (5.103)

where \hat\Phi_1 and \hat\Phi_2 are given in Example 5.10. The estimate of \Sigma_w for this model is \hat\Sigma_w^{(2)}, which is listed below (5.102). Residual analysis, performed on the residuals \hat w_t = x_t - \hat\Phi_1 x_{t-1} - \hat\Phi_2 x_{t-2}, for t = 3, \ldots, 508, suggests the model fits well. Individual residual analyses on the \hat w_{ti}, for i = 1, 2, 3, show that, except for the particulate series, \hat w_{t3}, the residuals are Gaussian white noise. For the particulate series, a small, but significant, amount of autocorrelation is still left in that series. In this case, we may wish to fit a higher order (higher than two) model to the particulate series only. In addition, we might be inclined to fit a reduced rank model, and we


will discuss this matter later. Inspection of the pairwise CCF between all residual series shows no obvious departures from independence.

Once the model has been estimated, forecasts can be obtained. Analogous to the univariate case, the m-step-ahead forecast, m = 1, 2, \ldots, in this example (n = 508), is obtained as follows:

    \hat x_{n+m}^{n} = \hat\Phi_1 \hat x_{n+m-1}^{n} + \hat\Phi_2 \hat x_{n+m-2}^{n},    (5.104)

where \hat x_t^n = x_t when 1 \le t \le n. The mean square prediction error matrices can be calculated in a manner similar to the univariate case, (3.67). In the general case of vector ARMA or ARMAX models, forecasts and their mean square prediction errors can be obtained by using the state-space formulation of the model and the Kalman filter (see §6.6). Analogous to (3.67), the general form of the m-step-ahead mean square prediction error matrix is

    P_{n+m}^{n} = E(x_{n+m} - \hat x_{n+m}^{n})(x_{n+m} - \hat x_{n+m}^{n})'    (5.105)
                = \Gamma(0) - \Gamma_n^{(m)}\Gamma_{nn}^{-1}\Gamma_n^{(m)\prime},    (5.106)

where \Gamma_n^{(m)} = [\Gamma(m), \Gamma(m+1), \ldots, \Gamma(m+n-1)] is a k \times nk matrix and \Gamma_{nn} = \{\Gamma(i-j)\}_{i,j=1}^{n} is an nk \times nk symmetric matrix. Of course, P_{n+m}^{n} can be estimated by substituting \hat\Gamma(h) for \Gamma(h) in (5.106). The analogue of (3.77) for multivariate ARMA models is

    P_{n+m}^{n} = \sum_{j=0}^{m-1} \Psi_j \Sigma_w \Psi_j'.    (5.107)

When the model is autoregressive, as in this example, a simplification occurs by noticing that a k-dimensional AR(p) model can be written as a kp-dimensional AR(1) model. For example, we can write the vector AR(2) model as

    X_t = \alpha + A(X_{t-1} - \alpha) + \eta_t,    (5.108)

where

    X_t = [ x_t ; x_{t-1} ], \quad \alpha = [ \mu ; \mu ], \quad A = [ \Phi_1  \Phi_2 ; I  0 ], \quad \eta_t = [ w_t ; 0 ].

Of course, this technique generalizes to any dimension k and any order p. From (5.108) we immediately obtain the forecasts and mean square prediction errors as

    \hat X_{n+m}^{n} = \alpha + A^m(X_n - \alpha)

and

    Q_{n+m}^{n} = E(X_{n+m} - \hat X_{n+m}^{n})(X_{n+m} - \hat X_{n+m}^{n})' = \Gamma_X(0) - A^m \Gamma_X(0) A^{m\prime},


where

    \Gamma_X(0) = [ \Gamma(0)  \Gamma(1) ; \Gamma(1)'  \Gamma(0) ].

We can then obtain the desired mean square prediction error matrices P_{n+m}^{n} as submatrices of Q_{n+m}^{n}. In addition, Yule–Walker estimation and forecasting can be accomplished by substituting the autocovariance matrices by their sample equivalents obtained via (5.98).

For this numerical example,

    \hat A = [ \hat\Gamma(1)  \hat\Gamma(2) ; \hat\Gamma(0)  \hat\Gamma(1) ][ \hat\Gamma(0)  \hat\Gamma(1) ; \hat\Gamma(1)'  \hat\Gamma(0) ]^{-1} = [ \hat\Phi_1  \hat\Phi_2 ; I  0 ],

where \hat\Phi_1 and \hat\Phi_2 are as given in Example 5.10. In the general case, we obtain the coefficient estimates from the top k rows of \hat A. Similarly, estimated forecasts in this example are found as follows:

    [ \hat x_{n+m}^{n} ; \hat x_{n+m-1}^{n} ] = \hat A^m [ x_n ; x_{n-1} ].

Because xxx507 = (8.62,−1.85, 12.16)′ and xxx508 = (4.71,−4.67, 17.20)′, wecan, for example, calculate the one-step-ahead and two-step-ahead fore-casts by putting m = 2 and using the numerical values given in Example5.10 to construct A2,

⎡⎣ xxx508510

xxx508509

⎤⎦ = A2[

xxx508xxx507

]=

⎡⎢⎢⎢⎢⎢⎢⎢⎣

6.13−5.9411.23

6.43−4.7710.53

⎤⎥⎥⎥⎥⎥⎥⎥⎦.

Substituting the autocovariance matrices with their estimates, we may write

$$ \widehat{Q}_{510}^{508} = \begin{bmatrix} \widehat{\Gamma}(0) & \widehat{\Gamma}(1) \\ \widehat{\Gamma}(1)' & \widehat{\Gamma}(0) \end{bmatrix} - \widehat{A}^2 \begin{bmatrix} \widehat{\Gamma}(0) & \widehat{\Gamma}(1) \\ \widehat{\Gamma}(1)' & \widehat{\Gamma}(0) \end{bmatrix} \widehat{A}^{2\prime} = \begin{bmatrix} \widehat{P}_{510}^{508} & \widehat{P}_{510,509}^{508} \\ \widehat{P}_{509,510}^{508} & \widehat{P}_{509}^{508} \end{bmatrix}, $$

where we have written $\widehat{P}_{s,t}^n$ to be the estimate of $E\{(x_s - x_s^n)(x_t - x_t^n)'\}$.

In this example, we found (entries are rounded)

$$ \widehat{Q}_{510}^{508} = \begin{bmatrix} 31 & 5 & 19 & 8 & -4 & 2 \\ 5 & 39 & 38 & -2 & 7 & 2 \\ 19 & 38 & 135 & 6 & 2 & 33 \\ 8 & -2 & 6 & 28 & 7 & 16 \\ -4 & 7 & 2 & 7 & 37 & 40 \\ 2 & 2 & 33 & 16 & 40 & 123 \end{bmatrix}. $$

Note that $\widehat{P}_{509}^{508} = \widehat{\Sigma}_w = \widehat{\Sigma}_w^{(2)}$. The diagonal elements of $\widehat{Q}_{510}^{508}$ give the individual mean square prediction errors. For example, an approximate 95% prediction interval for $x_{510,1}$, the first element of $x_{510}$, is $6.13 \pm 2\sqrt{31}$, or $(-5.0, 17.2)$.
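For readers who want to reproduce this type of calculation, a minimal R sketch of the companion-form forecasts and mean square prediction errors follows. The coefficient matrices and autocovariance matrices below are placeholders, not the estimates of Example 5.10; with the actual fitted values, the same few lines apply. The data are treated as deviations from the mean, as in the example.

    # m-step-ahead VAR(2) forecasts via the companion (AR(1)) form (5.108)
    k <- 3
    Phi1 <- diag(0.3, k); Phi2 <- diag(0.1, k)        # hypothetical estimates
    x508 <- c(4.71, -4.67, 17.20); x507 <- c(8.62, -1.85, 12.16)
    A <- rbind(cbind(Phi1, Phi2),
               cbind(diag(k), matrix(0, k, k)))       # 2k x 2k companion matrix
    mpow <- function(M, p) Reduce(`%*%`, replicate(p, M, simplify = FALSE))
    m <- 2
    X_fc  <- mpow(A, m) %*% c(x508, x507)             # (x_{n+m}^n', x_{n+m-1}^n')'
    x_510 <- X_fc[1:k]                                # top k rows: x_{510}^{508}
    # mean square prediction errors: Q^n_{n+m} = Gamma_X(0) - A^m Gamma_X(0) A^m'
    G0 <- diag(30, k); G1 <- matrix(5, k, k)          # hypothetical Gamma(0), Gamma(1)
    GX0 <- rbind(cbind(G0, G1), cbind(t(G1), G0))
    Qfc <- GX0 - mpow(A, m) %*% GX0 %*% t(mpow(A, m))
    P_510 <- Qfc[1:k, 1:k]                            # upper-left block: P_{510}^{508}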

Although the estimation in Example 5.11 was performed using Yule–Walker estimation, we could also have used conditional or unconditional maximum likelihood estimation, or conditional (as in Example 5.10) or unconditional least squares estimation. Because, as we have seen, any k-dimensional AR(p) model can be written as a kp-dimensional AR(1) model, each of these estimation techniques is a straightforward multivariate extension of the univariate case presented in equations (2.124)-(2.133). Also, as in the univariate case, the Yule–Walker estimators, the maximum likelihood estimators, and the least squares estimators are asymptotically equivalent. To exhibit the asymptotic distribution of the autoregression parameter estimators, we write

$$ \phi = \mathrm{vec}(\Phi_1, \ldots, \Phi_p), $$

where the vec operator stacks the columns of a matrix into a vector. For example, for a bivariate AR(2) model,

$$ \phi = \mathrm{vec}(\Phi_1, \Phi_2) = (\Phi_{1,11}, \Phi_{1,21}, \Phi_{1,12}, \Phi_{1,22}, \Phi_{2,11}, \Phi_{2,21}, \Phi_{2,12}, \Phi_{2,22})', $$

where $\Phi_{\ell,ij}$ is the ij-th element of $\Phi_\ell$, $\ell = 1, 2$. Because $(\Phi_1, \ldots, \Phi_p)$ is a $k \times kp$ matrix, $\phi$ is a $k^2 p \times 1$ vector. We now state the following property.

Property P5.1: Large Sample Distribution of the Vector Autoregression Estimators
Let $\widehat{\phi}$ denote the vector of parameter estimators (obtained via Yule–Walker, least squares, or maximum likelihood) for a k-dimensional AR(p) model. Then,

$$ \sqrt{n}\,\big(\widehat{\phi} - \phi\big) \sim \mathrm{AN}\big(0,\; \Sigma_w \otimes \Gamma_{pp}^{-1}\big), \qquad (5.109) $$

where $\Gamma_{pp} = \{\Gamma(i-j)\}_{i,j=1}^p$ is a $kp \times kp$ matrix, $\Sigma_w \otimes \Gamma_{pp}^{-1} = \{\sigma_{ij}\Gamma_{pp}^{-1}\}_{i,j=1}^k$ is a $k^2p \times k^2p$ matrix, and $\sigma_{ij}$ is the ij-th element of $\Sigma_w$.

The variance–covariance matrix of the estimator $\widehat{\phi}$ is approximated by replacing $\Sigma_w$ by $\widehat{\Sigma}_w$, and replacing $\Gamma(h)$ by $\widehat{\Gamma}(h)$ in $\Gamma_{pp}$. The square root of the diagonal elements of $\widehat{\Sigma}_w \otimes \widehat{\Gamma}_{pp}^{-1}$ divided by $\sqrt{n}$ gives the individual standard errors. For the mortality data example, the estimated standard errors for the VAR(2) fit are listed in Example 5.10; although those standard errors were taken from a regression run, they could also have been calculated using Property P5.1 along with the numerical values taken from $\widehat{\Sigma}_w^{(2)}$ given below (5.102) and $\widehat{\Gamma}_{22}$ given in (5.101).
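In R, the standard errors of Property P5.1 amount to a Kronecker product and a matrix inverse. The sketch below uses placeholder values for $\widehat{\Sigma}_w$ and $\widehat{\Gamma}_{pp}$; with the quantities from (5.101) and (5.102) it would reproduce the mortality-example calculation.

    k <- 3; p <- 2; n <- 508
    Sigma_w  <- diag(c(28, 37, 123))      # placeholder for the k x k innovation covariance
    Gamma_pp <- diag(2 * k) * 50          # placeholder for the kp x kp matrix {Gamma(i-j)}
    V  <- Sigma_w %x% solve(Gamma_pp)     # asymptotic covariance of sqrt(n)(phi.hat - phi)
    se <- sqrt(diag(V) / n)               # standard errors for vec(Phi1, ..., Phip)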

Asymptotic inference for the general case of vector ARMA models is more complicated than for pure AR models; details can be found in Reinsel (1997) or Lütkepohl (1993), for example. We also note that estimation for VARMA models can be recast into the problem of estimation for state-space models, which will be discussed in Chapter 6.

A simple algorithm for fitting multivariate ARMA models from Spliid (1983) is worth mentioning because it repeatedly uses the multivariate regression equations. Consider a general ARMA(p, q) model for a time series with a nonzero mean,

$$ x_t = \alpha + \Phi_1 x_{t-1} + \cdots + \Phi_p x_{t-p} + w_t + \Theta_1 w_{t-1} + \cdots + \Theta_q w_{t-q}. \qquad (5.110) $$

If $\mu = E x_t$, then $\alpha = (I - \Phi_1 - \cdots - \Phi_p)\mu$. If $w_{t-1}, \ldots, w_{t-q}$ were observed, we could rearrange (5.110) as a multivariate regression model

$$ x_t = B z_t + w_t, \qquad (5.111) $$

with

$$ z_t = (1, x_{t-1}', \ldots, x_{t-p}', w_{t-1}', \ldots, w_{t-q}')' \qquad (5.112) $$

and

$$ B = [\alpha, \Phi_1, \ldots, \Phi_p, \Theta_1, \ldots, \Theta_q], \qquad (5.113) $$

for $t = p+1, \ldots, n$. Given an initial estimator $\widehat{B}_0$ of B, we can reconstruct $\{w_{t-1}, \ldots, w_{t-q}\}$ by setting

$$ w_{t-j} = x_{t-j} - \widehat{B}_0 z_{t-j}, \qquad t = p+1, \ldots, n, \quad j = 1, \ldots, q, \qquad (5.114) $$

where, if q > p, we put $w_{t-j} = 0$ for $t - j \le 0$. The new values of $\{w_{t-1}, \ldots, w_{t-q}\}$ are then put into the regressors $z_t$ and a new estimate, say, $\widehat{B}_1$, is obtained. The initial value, $\widehat{B}_0$, can be computed by fitting a pure autoregression of order p or higher, and taking $\Theta_1 = \cdots = \Theta_q = 0$. The procedure is then iterated until the parameter estimates stabilize. The algorithm usually converges, but not to the maximum likelihood estimators. Experience suggests the estimators are reasonably close to the maximum likelihood estimators.
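A minimal R sketch of this iteration follows. It is our own skeleton of the procedure, not Spliid's code: the helper spliid() below assumes $q \le p$ (otherwise pre-sample shocks would be set to zero, as in the text) and iterates a fixed number of times rather than testing for stabilization.

    spliid <- function(x, p, q, n.iter = 20) {
      x <- as.matrix(x); n <- nrow(x); k <- ncol(x); idx <- (p + 1):n
      lagmat <- function(z, lags)
        do.call(cbind, lapply(lags, function(j) z[idx - j, , drop = FALSE]))
      # step 0: pure VAR(p) fit (Theta_j = 0) to get initial residuals
      Z0 <- cbind(1, lagmat(x, 1:p))
      B0 <- solve(crossprod(Z0), crossprod(Z0, x[idx, ]))
      w  <- matrix(0, n, k); w[idx, ] <- x[idx, ] - Z0 %*% B0
      B  <- B0
      for (iter in 1:n.iter) {
        Z <- cbind(1, lagmat(x, 1:p), lagmat(w, 1:q))   # regressors z_t of (5.112)
        B <- solve(crossprod(Z), crossprod(Z, x[idx, ]))
        w[idx, ] <- x[idx, ] - Z %*% B                  # reconstructed shocks, (5.114)
      }
      list(B = t(B), resid = w[idx, ])                  # B is k x (1 + kp + kq)
    }
    # illustrative use with simulated bivariate VARMA(1,1) data (not the book's series)
    set.seed(1); n <- 300; x <- matrix(0, n, 2); w <- matrix(rnorm(2 * n), n, 2)
    for (t in 2:n) x[t, ] <- 0.5 * x[t - 1, ] + w[t, ] + 0.3 * w[t - 1, ]
    round(spliid(x, p = 1, q = 1)$B, 2)   # columns: alpha, Phi_1, Theta_1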

As previously discussed, the special form assumed for the constant component, $\alpha$, of the general ARMA model in (5.110) can be generalized to include a fixed $r \times 1$ vector of inputs, say, $u_t$. In this case we have a k-dimensional ARMAX model:

$$ x_t = \Gamma u_t + \sum_{j=1}^{p} \Phi_j x_{t-j} + \sum_{j=1}^{q} \Theta_j w_{t-j} + w_t, \qquad (5.115) $$

where $\Gamma$ is a $k \times r$ parameter matrix. Recall that the X in ARMAX refers to the exogenous vector process we have denoted here by $u_t$, and the introduction of exogenous variables through setting $\alpha = \Gamma u_t$ does not present any special problems in making inferences.

Figure 5.14 CCF between prewhitened mortality and temperature (positive lag means temperature leads mortality).

Example 5.12 An ARMAX Model for Cardiovascular Mortality

In Example 2.2, we regressed the cardiovascular mortality series, $M_t$, on time t, temperature $T_t$, and particulate pollution $P_t$. There, the interest was an analysis of the effect of temperature and pollution on cardiovascular mortality. In Example 5.10, we fit a multivariate ARMA model to the trivariate vector $(M_t, T_t, P_t)$, as if modeling the behavior of temperature and pollution were equally as important as modeling the behavior of mortality. In this example, we are interested in using temperature and pollution to explain some of the variation in the mortality series.

To examine the CCF between mortality and temperature, and between mortality and pollution, we first prewhitened mortality by fitting an AR(2) to the detrended data. That is, we first fit the model

$$ M_t = \beta_0 + \beta_1 t + \phi_1 M_{t-1} + \phi_2 M_{t-2} + \epsilon_t. $$

Using the residuals of the fit, say, $\hat{\epsilon}_t$, we then calculated the CCF between $\hat{\epsilon}_t$ and $T_t$, and between $\hat{\epsilon}_t$ and $P_t$. Figure 5.14 shows the cross-correlation of prewhitened mortality and temperature (positive lag means temperature leads mortality); a significant correlation is seen at lag h = 1.

Figure 5.15 CCF between prewhitened mortality and particulate pollution (positive lag means pollution leads mortality).

Figure 5.15 shows a similar plot for the CCF of prewhitened mortality and pollution; significant correlations are seen at lags h = 0, 2, 4, 7. After some preliminary fitting, the final model uses the exogenous variables

$$ u_t = (1,\, t,\, T_{t-1},\, T_{t-1}^2,\, P_t,\, P_{t-4})', $$

along with an AR(2) on mortality, $M_t$; the inclusion of particulate pollution at lags two and seven was not significant when lags zero and four are in the model. In this case, the ARMAX model is

$$ M_t = \Gamma u_t + \phi_1 M_{t-1} + \phi_2 M_{t-2} + w_t, $$

where $\Gamma = [\gamma_0, \gamma_1, \ldots, \gamma_5]$.

Estimation was accomplished using the regression approach described in (5.85) and (5.86). In this case, the fitted model was (values are rounded, with estimated standard errors given as subscripts)

$$ M_t = 42.9 - .01_{(.002)}\, t - .18_{(.03)}\, T_{t-1} + .11_{(.02)}\, P_t + .05_{(.02)}\, P_{t-4} + .31_{(.04)}\, M_{t-1} + .30_{(.04)}\, M_{t-2} + \hat{w}_t, $$

where $\hat{\sigma}_w^2 = 25.7$ and $R^2 = 74.3\%$. Each coefficient is significant, as seen from the estimated standard errors. Finally, an analysis of the residuals, $\hat{w}_t$, shows that, except for a few outliers, the model fits well. The value of the Ljung–Box–Pierce statistic for H = 24 was Q = 25.7, which, when compared to a $\chi^2_{22}$, is not significant. In addition, a Q–Q plot shows no departure from the Gaussian assumption, except for the few outliers. Our general conclusions are that a decrease in cardiovascular mortality occurred during the period studied, and that an increase in mortality is associated with lower temperatures the previous week and with higher particulate pollution both currently and one month prior.
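Because the fit reduces to a lagged regression, it can be sketched with lm(). In the code below the objects mort, temp and part are random placeholders standing in for the mortality, temperature, and particulate series, so the output is illustrative only; with the actual data, the same regression produces a fit of the form above.

    set.seed(1)
    n <- 508
    mort <- rnorm(n, 90, 8); temp <- rnorm(n); part <- rnorm(n)   # placeholders
    i <- 5:n                                  # lose leading values for P_{t-4}, M_{t-2}
    dat <- data.frame(M = mort[i], trnd = i,
                      T1 = temp[i - 1], T1sq = temp[i - 1]^2,
                      P = part[i], P4 = part[i - 4],
                      M1 = mort[i - 1], M2 = mort[i - 2])
    fit <- lm(M ~ trnd + T1 + T1sq + P + P4 + M1 + M2, data = dat)
    summary(fit)                              # coefficients, standard errors, R^2
    acf(resid(fit))                           # residual diagnostics
    Box.test(resid(fit), lag = 24, type = "Ljung-Box")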

Problems

Section 5.2

5.1 The data set labeled fracdiff.dat is n = 1000 simulated observations from a fractionally differenced ARIMA(1, 1, 0) model with $\phi = .75$ and $d = .4$.

(a) Plot the data and comment.

(b) Plot the ACF and PACF of the data and comment.

(c) Estimate the parameters and test for the significance of the estimates $\hat{\phi}$ and $\hat{d}$.

(d) Explain why, using the results of parts (a) and (b), it would seem reasonable to difference the data prior to the analysis. That is, if $x_t$ represents the data, explain why we might choose to fit an ARMA model to $\nabla x_t$.

(e) Plot the ACF and PACF of $\nabla x_t$ and comment.

(f) Fit an ARMA model to $\nabla x_t$ and comment.

5.2 The data in globtemp2.dat are annual global temperature deviations from 1880 to 2004 (there are three columns in the data file; work with the annual means and not the 5-year smoothed data). The data are an update to the Hansen–Lebedeff global temperature data displayed in Figure 1.2. The URL of the data source is in the file; you can go there for further explanation of the data. Fit an ARFIMA model to this series.

5.3 Compute the sample ACF of the absolute values of the NYSE returns displayed in Figure 1.4 up to lag 200 and comment on whether the ACF indicates long memory. Fit an ARFIMA model to the absolute values and comment.

Section 5.3

5.4 Investigate whether the monthly returns of a stock dividend yield listed in the file sdyr.dat exhibit GARCH behavior. If so, fit an appropriate model to the returns. The data are monthly returns of a stock dividend yield from January 1947 through May 1993 and are taken from Hamilton and Lin (1996).


5.5 Investigate whether the growth rate of the monthly oil price series exhibits GARCH behavior. If so, fit an appropriate model to the growth rate.

5.6 The stats package of R contains the daily closing prices of four major European stock indices; type help(EuStockMarkets) for details. Fit a GARCH model to the returns of these series and discuss your findings. (Note: The data set contains actual values, and not returns. Hence, the data must be transformed prior to the model fitting.)

5.7 The $2 \times 1$ gradient vector, $l^{(1)}(\alpha_0, \alpha_1)$, given for an ARCH(1) model was displayed in (5.41). Verify (5.41) and then use the result to calculate the $2 \times 2$ Hessian matrix

$$ l^{(2)}(\alpha_0, \alpha_1) = \begin{pmatrix} \partial^2 l/\partial\alpha_0^2 & \partial^2 l/\partial\alpha_0\partial\alpha_1 \\ \partial^2 l/\partial\alpha_0\partial\alpha_1 & \partial^2 l/\partial\alpha_1^2 \end{pmatrix}. $$

Section 5.4

5.8 The sunspot data are plotted in Chapter 4, Figure 4.31. From a time plot of the data, discuss why it is reasonable to fit a threshold model to the data, and then fit a threshold model.

Section 5.5

5.9 Let $S_t$ represent the monthly sales data listed in sales.dat (n = 150), and let $L_t$ be the leading indicator listed in lead.dat. Fit the regression model $\nabla S_t = \beta_0 + \beta_1 \nabla L_{t-3} + x_t$, where $x_t$ is an ARMA process.

5.10 Consider the correlated regression model, defined in the text by (5.53), say,

$$ y = Z\beta + x, $$

where x has mean zero and covariance matrix $\Gamma$. In this case, we know that the weighted least squares estimator is (5.54), namely,

$$ \widehat{\beta}_w = (Z'\Gamma^{-1}Z)^{-1}Z'\Gamma^{-1}y. $$

Now, a problem of interest in spatial series can be formulated in terms of this basic model. Suppose $y_i = y(\sigma_i)$, $i = 1, 2, \ldots, n$, is a function of the spatial vector coordinates $\sigma_i = (s_{i1}, s_{i2}, \ldots, s_{ir})'$, the error is $x_i = x(\sigma_i)$, and the rows of Z are defined as $z(\sigma_i)'$, $i = 1, 2, \ldots, n$. The Kriging estimator is defined as the best spatial predictor of $y_0 = z_0'\beta + x_0$ using the estimator

$$ \widehat{y}_0 = a'y, $$

subject to the unbiased condition $E\widehat{y}_0 = Ey_0$, and such that the mean square prediction error

$$ \mathrm{MSE} = E[(y_0 - \widehat{y}_0)^2] $$

is minimized.

(a) Prove the estimator is unbiased when $Z'a = z_0$.

(b) Show the MSE is minimized by solving the equations

$$ \Gamma a + Z\lambda = \gamma_0 \quad \text{and} \quad Z'a = z_0, $$

where $\gamma_0 = E[x\,x_0]$ represents the vector of covariances between the error vector of the observed data and the error of the new point, and $\lambda$ is a $q \times 1$ vector of Lagrange multipliers.

(c) Show the predicted value can be expressed as

$$ \widehat{y}_0 = z_0'\widehat{\beta}_w + \gamma_0'\Gamma^{-1}(y - Z\widehat{\beta}_w), $$

so the optimal prediction is a linear combination of the usual predictor and the least squares residuals.

Section 5.6

5.11 The file labeled clim-hyd has 454 months of measured values for the climatic variables air temperature, dew point, cloud cover, wind speed, precipitation ($p_t$), and inflow ($i_t$), at Shasta Lake. We would like to look at possible relations between the weather factors and the inflow to Shasta Lake.

(a) Fit ARIMA(0, 0, 0) $\times$ (0, 1, 1)$_{12}$ models to (i) transformed precipitation $P_t = \sqrt{p_t}$ and (ii) transformed inflow $I_t = \log i_t$.

(b) Apply the ARIMA model fitted in part (a) for transformed precipitation to the flow series to generate the prewhitened flow residuals, assuming the precipitation model. Compute the cross-correlation between the flow residuals using the precipitation ARIMA model and the precipitation residuals using the precipitation model, and interpret. Use the coefficients from the ARIMA model to construct the transformed flow residuals.

5.12 Consider predicting the transformed flows $I_t = \log i_t$ from transformed precipitation values $P_t = \sqrt{p_t}$ using a transfer function model of the form

$$ (1 - B^{12})I_t = \alpha(B)(1 - B^{12})P_t + n_t, $$

where we assume that seasonal differencing is a reasonable thing to do. The data are the 454 monthly values of precipitation and inflow from the Shasta Lake reservoir in the file clim-hyd. You may think of it as fitting

$$ y_t = \alpha(B)x_t + n_t, $$

where $y_t$ and $x_t$ are the seasonally differenced transformed flows and precipitations.


(a) Argue that $x_t$ can be fitted by a first-order seasonal moving average, and use the transformation obtained to prewhiten the series $x_t$.

(b) Apply the transformation applied in (a) to the series $y_t$, and compute the cross-correlation function relating the prewhitened series to the transformed series. Argue for a transfer function of the form

$$ \alpha(B) = \frac{\delta_0}{1 - \omega_1 B}. $$

(c) Write the overall model obtained in regression form to estimate $\delta_0$ and $\omega_1$. Note that you will be minimizing the sums of squared residuals for the transformed noise series $(1 - \omega_1 B)n_t$. Retain the residuals for further modeling involving the noise $n_t$. The observed residual is $u_t = (1 - \omega_1 B)n_t$.

(d) Fit the noise residuals obtained in (c) with an ARMA model, and give the final form suggested by your analysis in the previous parts.

(e) Discuss the problem of forecasting $y_{t+m}$ using the infinite past of $y_t$ and the present and infinite past of $x_t$. Determine the predicted value and the forecast variance.

Section 5.7

5.13 Consider the data set containing quarterly U.S. unemployment, U.S. GNP, consumption, and government and private investment from 1948-III to 1988-II. The seasonal component has been removed from the data. Concentrating on unemployment ($U_t$), GNP ($G_t$), and consumption ($C_t$), fit a vector ARMA model to the data after first logging each series, and then removing the linear trend. That is, fit a vector ARMA model to $x_t = (x_{1t}, x_{2t}, x_{3t})'$, where, for example, $x_{1t} = \log(U_t) - \hat{\beta}_0 - \hat{\beta}_1 t$, where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the least squares estimates for the regression of $\log(U_t)$ on time, t. Run a complete set of diagnostics on the residuals.


Chapter 6

State-Space Models

6.1 Introduction

A very general model that seems to subsume a whole class of special cases of interest in much the same way that linear regression does is the state-space model or the dynamic linear model, which was introduced in Kalman (1960) and Kalman and Bucy (1961). Although the model was originally introduced as a method primarily for use in aerospace-related research, it has been applied to modeling data from economics (Harrison and Stevens, 1976; Harvey and Pierse, 1984; Harvey and Todd, 1983; Kitagawa and Gersch, 1984; Shumway and Stoffer, 1982), medicine (Jones, 1984) and the soil sciences (Shumway, 1985). An excellent modern treatment of time series analysis based on the state-space model is the text by Durbin and Koopman (2001).

Although there are some packages available for R that focus on various aspects of state-space modeling and Kalman filtering, we prefer to write our own code. As a result, the code we have written is long and will most likely be subject to frequent updates. Hence, we have decided to distribute the R code for this chapter on the website for the text.

The state-space model or dynamic linear model (DLM), in its basic form, employs an order one, vector autoregression as the state equation,

$$ x_t = \Phi x_{t-1} + w_t, \qquad (6.1) $$

where the state equation determines the rule for the generation of the $p \times 1$ state vector $x_t$ from the past $p \times 1$ state $x_{t-1}$, for time points $t = 1, \ldots, n$. We assume the $w_t$ are $p \times 1$ independent and identically distributed, zero-mean normal vectors with covariance matrix Q. In the DLM, we assume the process starts with a normal vector $x_0$ that has mean $\mu_0$ and $p \times p$ covariance matrix $\Sigma_0$.

The DLM, however, adds an additional component to the model in assuming we do not observe the state vector $x_t$ directly, but only a linearly transformed

version of it with noise added, say,

$$ y_t = A_t x_t + v_t, \qquad (6.2) $$

where $A_t$ is a $q \times p$ measurement or observation matrix; equation (6.2) is called the observation equation. The model arose originally in the space tracking setting, where the state equation defines the motion equations for the position or state of a spacecraft with location $x_t$, and $y_t$ reflects information that can be observed from a tracking device such as velocity and azimuth. The observed data are in the $q \times 1$ vectors $y_t$, which can be larger than or smaller than p, the dimension of the underlying series of interest. The additive observation noise $v_t$ is assumed to be white and Gaussian with $q \times q$ covariance matrix R. In addition, we initially assume, for simplicity, $\{w_t\}$ and $\{v_t\}$ are uncorrelated; this assumption is not necessary, but it helps in the explanation of first concepts. The case of correlated errors is discussed in §6.6. Of course, we can further modify the basic model, (6.1) and (6.2), to include exogenous variables, and we will also discuss this in §6.6.

As in the ARMAX model of §5.7, exogenous variables, or fixed inputs, may enter into the states or into the observations. In this case, we suppose we have an $r \times 1$ vector of inputs $u_t$, and write the model as

$$ x_t = \Phi x_{t-1} + \Upsilon u_t + w_t \qquad (6.3) $$
$$ y_t = A_t x_t + \Gamma u_t + v_t \qquad (6.4) $$

where $\Upsilon$ is $p \times r$ and $\Gamma$ is $q \times r$.

Example 6.1 A Biomedical Example

Suppose we consider the problem of monitoring the levels of several biomedical parameters after a cancer patient undergoes a bone marrow transplant. The data in Figure 6.1, used by Jones (1984), are measurements made for 91 days on the three variables, log(white blood count), log(platelet), and hematocrit (HCT), denoted $y_{t1}$, $y_{t2}$, and $y_{t3}$, respectively. Approximately 40% of the values are missing, with missing values occurring primarily after the 35th day. The main objectives are to model the three variables using the state-space approach, and to estimate the missing values. According to Jones, "Platelet count at about 100 days post transplant has previously been shown to be a good indicator of subsequent long term survival." For this particular situation, we model the three variables in terms of the state equation (6.1); that is,

$$ \begin{pmatrix} x_{t1} \\ x_{t2} \\ x_{t3} \end{pmatrix} = \begin{pmatrix} \phi_{11} & \phi_{12} & \phi_{13} \\ \phi_{21} & \phi_{22} & \phi_{23} \\ \phi_{31} & \phi_{32} & \phi_{33} \end{pmatrix} \begin{pmatrix} x_{t-1,1} \\ x_{t-1,2} \\ x_{t-1,3} \end{pmatrix} + \begin{pmatrix} w_{t1} \\ w_{t2} \\ w_{t3} \end{pmatrix}. \qquad (6.5) $$

The $3 \times 3$ observation matrix, $A_t$, is either the identity matrix, or the identity matrix with all zeros in a row when that variable is missing. The covariance matrices R and Q are $3 \times 3$ matrices, with $R = \mathrm{diag}\{r_{11}, r_{22}, r_{33}\}$, a diagonal matrix, required for a simple approach when data are missing.

Figure 6.1 Longitudinal series of blood parameter levels monitored, log(white blood count) [top], log(platelet) [middle], and hematocrit (HCT) [bottom], after a bone marrow transplant (n = 91 days).

The model given in (6.1) involving only a single lag is not unduly restrictive. A multivariate model with m lags, such as the VAR(m) discussed in §5.7, could be developed by replacing the $p \times 1$ state vector, $x_t$, by the $pm \times 1$ state vector $X_t = (x_t', x_{t-1}', \ldots, x_{t-m+1}')'$ and the transition matrix by

$$ \Phi = \begin{pmatrix} \Phi_1 & \Phi_2 & \cdots & \Phi_{m-1} & \Phi_m \\ I & 0 & \cdots & 0 & 0 \\ 0 & I & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & I & 0 \end{pmatrix}. \qquad (6.6) $$

Letting $W_t = (w_t', 0', \ldots, 0')'$ be the new $pm \times 1$ state error vector, the new state equation will be

$$ X_t = \Phi X_{t-1} + W_t, \qquad (6.7) $$

where the new matrix "Q" now has the form of a $pm \times pm$ matrix with Q in the upper left-hand corner and zeros elsewhere. The observation equation can then be written as

$$ y_t = \begin{bmatrix} A_t & 0 & \cdots & 0 \end{bmatrix} X_t + v_t. \qquad (6.8) $$

This simple recoding shows one way of handling higher order lags within the context of the single lag structure. Further discussion of this notion is given in §6.6.

Figure 6.2 Two average global temperature deviations for n = 108 years in degrees centigrade (1880-1987). The solid line is the land-based series, whereas the dotted line shows the marine-based series.

The real advantages of the state-space formulation, however, do not really come through in the simple example given above. The special forms that can be developed for various versions of the matrix $A_t$ and for the transition scheme defined by the matrix $\Phi$ allow fitting more parsimonious structures with fewer parameters needed to describe a multivariate time series. We will give some examples of structural models in §6.5, but the simple example shown below is instructive.

Example 6.2 Global Warming

Figure 6.2 shows two different estimators for the global temperature series from 1880 to 1987, plotted on the same scale. The solid line is the series considered in the first chapter (see Jones, 1994), which gives the average surface air temperature computed from land-based observation stations. The second series (see Parker et al., 1996) gives averages from a number of marine-based stations. Conceptually, both series should be measuring the same underlying climatic signal, and we may consider the problem of extracting this underlying signal. We suppose both series are observing the same signal with different noises; that is,

$$ y_{t1} = x_t + v_{t1} \quad \text{and} \quad y_{t2} = x_t + v_{t2}, $$

where $x_t$ is the unknown common signal. Suppose it can be modeled as a random walk of the form

$$ x_t = x_{t-1} + w_t, \qquad (6.9) $$

which we can argue for by noting the stability of the first difference, as has been noted in Chapter 2. Furthermore, the first difference of the observed series will be a first-order moving average under this model, arguing from the fact that the first difference of the land-based series has a peak at lag 1. In this example, p = 1, q = 2, and $\Phi = 1$ is held at a constant value. The observation equation (6.2) can be written in the form

$$ \begin{pmatrix} y_{t1} \\ y_{t2} \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} x_t + \begin{pmatrix} v_{t1} \\ v_{t2} \end{pmatrix}, \qquad (6.10) $$

and we have the covariance matrices given by $Q = q_{11}$ and

$$ R = \begin{pmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{pmatrix}. $$

The introduction of the state-space approach as a tool for modeling data in the social and biological sciences requires model identification and parameter estimation because there is rarely a well-defined differential equation describing the state transition. The questions of general interest for the dynamic linear model (6.3) and (6.4) relate to estimating the unknown parameters contained in $\Phi$, $\Upsilon$, Q, $\Gamma$, $A_t$, and R that define the particular model, and to estimating or forecasting values of the underlying unobserved process $x_t$. The advantages of the state-space formulation are in the ease with which we can treat various missing data configurations and in the incredible array of models that can be generated from (6.1) and (6.2). The analogy between the observation matrix $A_t$ and the design matrix in the usual regression and analysis of variance setting is a useful one. We can generate fixed and random effect structures that are either constant or vary over time simply by making appropriate choices for the matrix $A_t$ and the transition structure $\Phi$. We will give only a few examples in this chapter; for further examples, see Durbin and Koopman (2001), Harvey (1993) or Shumway (1988), to mention a few.

Before continuing our investigation of the more complex model, it is instructive to consider a simple univariate state-space model.

Example 6.3 An AR(1) Process with Observational Noise

Consider a univariate state-space model where the observations satisfy

$$ y_t = x_t + v_t, \qquad (6.11) $$

and the signal (state) is an AR(1) process,

$$ x_t = \phi x_{t-1} + w_t, \qquad (6.12) $$

for $t = 1, 2, \ldots, n$, where $v_t \sim \mathrm{iid}\; N(0, \sigma_v^2)$, $w_t \sim \mathrm{iid}\; N(0, \sigma_w^2)$, and $x_0 \sim N(0, \sigma_w^2/(1-\phi^2))$; $\{v_t\}$, $\{w_t\}$, and $x_0$ are independent.

In Chapter 3, we investigated the properties of the state, $x_t$, because it is a stationary AR(1) process (recall Problem 3.2e). For example, we know the autocovariance function of $x_t$ is

$$ \gamma_x(h) = \frac{\sigma_w^2}{1-\phi^2}\,\phi^h, \qquad h = 0, 1, 2, \ldots. \qquad (6.13) $$

But, here, we must investigate how the addition of observation noise affects the dynamics. Although it is not a necessary assumption, we have assumed in this example that $x_t$ is stationary. In this case, the observations are also stationary because $y_t$ is the sum of two independent stationary components, $x_t$ and $v_t$. We have

$$ \gamma_y(0) = \mathrm{var}(y_t) = \mathrm{var}(x_t + v_t) = \frac{\sigma_w^2}{1-\phi^2} + \sigma_v^2, \qquad (6.14) $$

and, when $h \ge 1$,

$$ \gamma_y(h) = \mathrm{cov}(y_t, y_{t-h}) = \mathrm{cov}(x_t + v_t,\; x_{t-h} + v_{t-h}) = \gamma_x(h). \qquad (6.15) $$

Consequently, for $h \ge 1$, the ACF of the observations is

$$ \rho_y(h) = \frac{\gamma_y(h)}{\gamma_y(0)} = \left(1 + \frac{\sigma_v^2}{\sigma_w^2}(1-\phi^2)\right)^{-1}\phi^h. \qquad (6.16) $$

It should be clear from the correlation structure given by (6.16) that the observations, $y_t$, are not AR(1) unless $\sigma_v^2 = 0$. In addition, the autocorrelation structure of $y_t$ is identical to the autocorrelation structure of an ARMA(1,1) process, as presented in Example 3.11. Thus, the observations can also be written in an ARMA(1,1) form,

$$ y_t = \phi y_{t-1} + \theta u_{t-1} + u_t, $$

where $u_t$ is Gaussian white noise with variance $\sigma_u^2$, and with $\theta$ and $\sigma_u^2$ suitably chosen (see Example 6.11).
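The claim in (6.16) is easy to check by simulation. The following R sketch (with illustrative parameter values) compares the theoretical ACF of $y_t$ with its sample ACF and fits an ARMA(1,1) to the simulated observations.

    set.seed(90210)
    n <- 10000; phi <- .8; sw <- 1; sv <- 1
    x <- arima.sim(list(ar = phi), n = n, sd = sw)     # stationary AR(1) state (6.12)
    y <- x + rnorm(n, 0, sv)                           # observations (6.11)
    h <- 1:10
    rho.theory <- (1 + (sv^2 / sw^2) * (1 - phi^2))^(-1) * phi^h   # (6.16)
    rho.sample <- acf(y, lag.max = 10, plot = FALSE)$acf[-1]
    cbind(h, rho.theory, rho.sample)                   # the two columns should roughly agree
    arima(y, order = c(1, 0, 1), include.mean = FALSE) # an ARMA(1,1) fit is appropriate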

Although an equivalence exists between stationary ARMA models and stationary state-space models (see §6.6), it is sometimes easier to work with one form than another. As previously mentioned, in the case of missing data, complex multivariate systems, mixed effects, and certain types of nonstationarity, it is easier to work in the framework of state-space models; in this chapter, we explore some of these situations.

6.2 Filtering, Smoothing, and Forecasting

From a practical view, the primary aims of any analysis involving the state-space model as defined by (6.1)-(6.2), or by (6.3)-(6.4), would be to produce estimators for the underlying unobserved signal $x_t$, given the data $Y_s = \{y_1, \ldots, y_s\}$, to time s. When s < t, the problem is called forecasting or prediction. When s = t, the problem is called filtering, and when s > t, the problem is called smoothing. In addition to these estimates, we would also want to measure their precision. The solution to these problems is accomplished via the Kalman filter and smoother and is the focus of this section.

Throughout this chapter, we will use the following definitions:

$$ x_t^s = E(x_t \mid Y_s) \qquad (6.17) $$

and

$$ P_{t_1,t_2}^s = E\{(x_{t_1} - x_{t_1}^s)(x_{t_2} - x_{t_2}^s)'\}. \qquad (6.18) $$

When $t_1 = t_2$ ($= t$, say) in (6.18), we will write $P_t^s$ for convenience.

In obtaining the filtering and smoothing equations, we will rely heavily on the Gaussian assumption. Some knowledge of the material covered in Appendix B, §B.1, will be helpful in understanding the details of this section (although these details may be skipped on a casual reading of the material). Even in the non-Gaussian case, the estimators we obtain are the minimum mean-squared error estimators within the class of linear estimators. That is, we can think of E in (6.17) as the projection operator in the sense of §B.1 rather than expectation, and $P_t^s$ as the corresponding mean-squared error. When we assume, as in this section, the processes are Gaussian, (6.18) is also the conditional error covariance; that is,

$$ P_{t_1,t_2}^s = E\{(x_{t_1} - x_{t_1}^s)(x_{t_2} - x_{t_2}^s)' \mid Y_s\}. $$

This fact can be seen, for example, by noting that the covariance matrix between $(x_t - x_t^s)$ and $Y_s$, for any t and s, is zero; we could say they are orthogonal in the sense of §B.1. This result implies that $(x_t - x_t^s)$ and $Y_s$ are independent (because of the normality), and hence, the conditional distribution of $(x_t - x_t^s)$ given $Y_s$ is the unconditional distribution of $(x_t - x_t^s)$. Derivations of the filtering and smoothing equations from a Bayesian perspective are given in Meinhold and Singpurwalla (1983); more traditional approaches based on the concept of projection and on multivariate normal distribution theory are given in Jazwinski (1970) and Anderson and Moore (1979).

First, we present the Kalman filter, which gives the filtering and forecasting equations. The name filter comes from the fact that $x_t^t$ is a linear filter of the observations $y_1, \ldots, y_t$; that is, $x_t^t = \sum_{s=1}^t B_s y_s$ for suitably chosen $p \times q$ matrices $B_s$. The advantage of the Kalman filter is that it specifies how to update the filter from $x_{t-1}^{t-1}$ to $x_t^t$ once a new observation $y_t$ is obtained, without having to reprocess the entire data set $y_1, \ldots, y_t$.

Property P6.1: The Kalman Filter
For the state-space model specified in (6.3) and (6.4), with initial conditions $x_0^0 = \mu_0$ and $P_0^0 = \Sigma_0$, for $t = 1, \ldots, n$,

$$ x_t^{t-1} = \Phi x_{t-1}^{t-1} + \Upsilon u_t, \qquad (6.19) $$
$$ P_t^{t-1} = \Phi P_{t-1}^{t-1}\Phi' + Q, \qquad (6.20) $$

with

$$ x_t^t = x_t^{t-1} + K_t(y_t - A_t x_t^{t-1} - \Gamma u_t), \qquad (6.21) $$
$$ P_t^t = [I - K_t A_t]P_t^{t-1}, \qquad (6.22) $$

where

$$ K_t = P_t^{t-1}A_t'[A_t P_t^{t-1}A_t' + R]^{-1} \qquad (6.23) $$

is called the Kalman gain. Prediction for t > n is accomplished via (6.19) and (6.20) with initial conditions $x_n^n$ and $P_n^n$.

Proof. The derivations of (6.19) and (6.20) follow from straightforward calculations, because from (6.3) we have

$$ x_t^{t-1} = E(x_t \mid Y_{t-1}) = E(\Phi x_{t-1} + \Upsilon u_t + w_t \mid Y_{t-1}) = \Phi x_{t-1}^{t-1} + \Upsilon u_t, $$

and thus

$$ P_t^{t-1} = E\{(x_t - x_t^{t-1})(x_t - x_t^{t-1})'\} = E\{[\Phi(x_{t-1} - x_{t-1}^{t-1}) + w_t][\Phi(x_{t-1} - x_{t-1}^{t-1}) + w_t]'\} = \Phi P_{t-1}^{t-1}\Phi' + Q. $$

To derive (6.21), we first define the innovations as

$$ \epsilon_t = y_t - E(y_t \mid Y_{t-1}) = y_t - A_t x_t^{t-1} - \Gamma u_t, \qquad (6.24) $$

for $t = 1, \ldots, n$. Note, $E(\epsilon_t) = 0$ and

$$ \Sigma_t \stackrel{\mathrm{def}}{=} \mathrm{var}(\epsilon_t) = \mathrm{var}[A_t(x_t - x_t^{t-1}) + v_t] = A_t P_t^{t-1}A_t' + R. \qquad (6.25) $$

In addition, $E(\epsilon_t y_s') = 0$ for s < t, which, in view of the fact the innovation sequence is a Gaussian process, implies that the innovations are independent of the past observations. Furthermore, the conditional covariance between $x_t$ and $\epsilon_t$ given $Y_{t-1}$ is

$$ \mathrm{cov}(x_t, \epsilon_t \mid Y_{t-1}) = \mathrm{cov}(x_t,\; y_t - A_t x_t^{t-1} - \Gamma u_t \mid Y_{t-1}) = \mathrm{cov}(x_t - x_t^{t-1},\; y_t - A_t x_t^{t-1} - \Gamma u_t \mid Y_{t-1}) = \mathrm{cov}[x_t - x_t^{t-1},\; A_t(x_t - x_t^{t-1}) + v_t] = P_t^{t-1}A_t'. \qquad (6.26) $$

Using these results, we have that the joint conditional distribution of $x_t$ and $\epsilon_t$ given $Y_{t-1}$ is normal,

$$ \begin{pmatrix} x_t \\ \epsilon_t \end{pmatrix} \Big|\; Y_{t-1} \sim N\left(\begin{bmatrix} x_t^{t-1} \\ 0 \end{bmatrix},\; \begin{bmatrix} P_t^{t-1} & P_t^{t-1}A_t' \\ A_t P_t^{t-1} & \Sigma_t \end{bmatrix}\right). \qquad (6.27) $$

Thus, using (B.9) of Appendix B, we can write

$$ x_t^t = E(x_t \mid y_1, \ldots, y_{t-1}, y_t) = E(x_t \mid Y_{t-1}, \epsilon_t) = x_t^{t-1} + K_t\epsilon_t, \qquad (6.28) $$

where

$$ K_t = P_t^{t-1}A_t'\Sigma_t^{-1} = P_t^{t-1}A_t'(A_t P_t^{t-1}A_t' + R)^{-1}. $$

The evaluation of $P_t^t$ is easily computed from (6.27) [see (B.10)] as

$$ P_t^t = \mathrm{cov}(x_t \mid Y_{t-1}, \epsilon_t) = P_t^{t-1} - P_t^{t-1}A_t'\Sigma_t^{-1}A_t P_t^{t-1}, $$

which simplifies to (6.22).
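The recursions in Property P6.1 translate directly into a few lines of R. The following is a minimal sketch written from the property statement itself (it is not the code distributed on the text's website); it assumes a constant observation matrix A and no inputs, and it also accumulates the innovations form of $-\ln L_Y(\Theta)$, anticipating (6.62).

    kfilter <- function(y, A, mu0, Sigma0, Phi, Q, R) {
      y <- as.matrix(y); n <- nrow(y); p <- length(mu0)
      xp <- matrix(0, n, p); Pp <- array(0, c(p, p, n))   # x_t^{t-1}, P_t^{t-1}
      xf <- matrix(0, n, p); Pf <- array(0, c(p, p, n))   # x_t^t,     P_t^t
      like <- 0
      x <- mu0; P <- Sigma0
      for (t in 1:n) {
        x <- Phi %*% x                        # (6.19), no inputs
        P <- Phi %*% P %*% t(Phi) + Q         # (6.20)
        xp[t, ] <- x; Pp[, , t] <- P
        sig <- A %*% P %*% t(A) + R           # innovation variance (6.25)
        eps <- y[t, ] - A %*% x               # innovation (6.24)
        K <- P %*% t(A) %*% solve(sig)        # Kalman gain (6.23)
        x <- x + K %*% eps                    # (6.21)
        P <- P - K %*% A %*% P                # (6.22)
        xf[t, ] <- x; Pf[, , t] <- P
        like <- like + 0.5 * (log(det(sig)) + t(eps) %*% solve(sig) %*% eps)
      }
      list(xp = xp, Pp = Pp, xf = xf, Pf = Pf, neg.loglike = as.numeric(like))
    }
    # e.g., for the model of Example 6.3 one could call
    # kfilter(y, matrix(1), 0, matrix(1), matrix(.8), matrix(1), matrix(1))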

Next, we explore the model, prediction, and filtering from a density point of view. For the sake of brevity, consider the Gaussian DLM without inputs, as described in (6.1) and (6.2); that is,

$$ x_t = \Phi x_{t-1} + w_t \quad \text{and} \quad y_t = A_t x_t + v_t. $$

Recall $w_t$ and $v_t$ are independent, white Gaussian sequences, and the initial state is normal, say, $x_0 \sim N(\mu_0, \Sigma_0)$; we will denote the initial p-variate state normal density by $f_0(x_0)$. Now, letting $p_\Theta(\cdot)$ denote a generic density function with parameters represented by $\Theta$, we could describe the state relationship as

$$ p_\Theta(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_0) = p_\Theta(x_t \mid x_{t-1}) = f_w(x_t - \Phi x_{t-1}), \qquad (6.29) $$

where $f_w(\cdot)$ denotes the p-variate normal density with mean zero and variance-covariance matrix Q. In (6.29), we are stating the process is Markovian, linear, and Gaussian. The relationship of the observations to the state process is written as

$$ p_\Theta(y_t \mid x_t, Y_{t-1}) = p_\Theta(y_t \mid x_t) = f_v(y_t - A_t x_t), \qquad (6.30) $$

where $f_v(\cdot)$ denotes the q-variate normal density with mean zero and variance-covariance matrix R. In (6.30), we are stating the observations are conditionally independent given the state, and the observations are linear and Gaussian. Note, (6.29), (6.30), and the initial density, $f_0(\cdot)$, completely specify the model in terms of densities, namely,

$$ p_\Theta(x_0, x_1, \ldots, x_n, y_1, \ldots, y_n) = f_0(x_0)\prod_{t=1}^n f_w(x_t - \Phi x_{t-1})\,f_v(y_t - A_t x_t), \qquad (6.31) $$

where $\Theta = \{\mu_0, \Sigma_0, \Phi, Q, R\}$.

Given the data, $Y_{t-1} = \{y_1, \ldots, y_{t-1}\}$, and the current filter density, $p_\Theta(x_{t-1} \mid Y_{t-1})$, Property P6.1 tells us, via conditional means and variances, how to recursively generate the Gaussian forecast density, $p_\Theta(x_t \mid Y_{t-1})$, and how to update the density given the current observation, $y_t$, to obtain the Gaussian filter density, $p_\Theta(x_t \mid Y_t)$. In terms of densities, the Kalman filter can be seen as a simple Bayesian updating scheme, where, to determine the forecast and filter densities, we have

$$ p_\Theta(x_t \mid Y_{t-1}) = \int_{R^p} p_\Theta(x_t, x_{t-1} \mid Y_{t-1})\,dx_{t-1} = \int_{R^p} p_\Theta(x_t \mid x_{t-1})\,p_\Theta(x_{t-1} \mid Y_{t-1})\,dx_{t-1} = \int_{R^p} f_w(x_t - \Phi x_{t-1})\,p_\Theta(x_{t-1} \mid Y_{t-1})\,dx_{t-1}, \qquad (6.32) $$

which simplifies to the p-variate $N(x_t^{t-1}, P_t^{t-1})$ density, and

$$ p_\Theta(x_t \mid Y_t) = p_\Theta(x_t \mid y_t, Y_{t-1}) \propto p_\Theta(y_t \mid x_t)\,p_\Theta(x_t \mid Y_{t-1}) = f_v(y_t - A_t x_t)\,p_\Theta(x_t \mid Y_{t-1}), \qquad (6.33) $$

from which we can deduce that $p_\Theta(x_t \mid Y_t)$ is the p-variate $N(x_t^t, P_t^t)$ density. These statements are true for $t = 1, \ldots, n$, with initial condition $p_\Theta(x_0 \mid Y_0) = f_0(x_0)$. The prediction and filter recursions of Property P6.1 could also have been calculated directly from the density relationships (6.32) and (6.33) using multivariate normal distribution theory. The following example illustrates the Bayesian updating scheme.

Example 6.4 Bayesian Analysis of a Local Level Model

In this example, we suppose that we observe a univariate series $y_t$ that consists of a trend component, $\mu_t$, and a noise component, $v_t$, where

$$ y_t = \mu_t + v_t \qquad (6.34) $$

and $v_t \sim \mathrm{iid}\; N(0, \sigma_v^2)$. In particular, we assume the trend is a random walk given by

$$ \mu_t = \mu_{t-1} + w_t, \qquad (6.35) $$

where $w_t \sim \mathrm{iid}\; N(0, \sigma_w^2)$ is independent of $\{v_t\}$. Recall Example 6.2, where we suggested this type of trend model for the global temperature series.

The model is, of course, a state-space model with (6.34) being the observation equation, and (6.35) being the state equation. For forecasting, we seek the posterior density $p(\mu_t \mid Y_{t-1})$. We will use the following notation, introduced in Blight (1974) for the multivariate case. Let

$$ \{x; \mu, \sigma^2\} = \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}; \qquad (6.36) $$

then simple manipulation shows

$$ \{x; \mu, \sigma^2\} = \{\mu; x, \sigma^2\} \qquad (6.37) $$

and

$$ \{x; \mu_1, \sigma_1^2\}\{x; \mu_2, \sigma_2^2\} = \left\{x;\; \frac{\mu_1/\sigma_1^2 + \mu_2/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2},\; (1/\sigma_1^2 + 1/\sigma_2^2)^{-1}\right\} \times \{\mu_1;\, \mu_2,\, \sigma_1^2 + \sigma_2^2\}. \qquad (6.38) $$

Thus, using (6.32), (6.37) and (6.38) we have

$$ p(\mu_t \mid Y_{t-1}) \propto \int \{\mu_t; \mu_{t-1}, \sigma_w^2\}\{\mu_{t-1}; \mu_{t-1}^{t-1}, P_{t-1}^{t-1}\}\,d\mu_{t-1} = \int \{\mu_{t-1}; \mu_t, \sigma_w^2\}\{\mu_{t-1}; \mu_{t-1}^{t-1}, P_{t-1}^{t-1}\}\,d\mu_{t-1} = \{\mu_t;\, \mu_{t-1}^{t-1},\, P_{t-1}^{t-1} + \sigma_w^2\}. \qquad (6.39) $$

From (6.39) we conclude that

$$ \mu_t \mid Y_{t-1} \sim N(\mu_t^{t-1}, P_t^{t-1}), \qquad (6.40) $$

where

$$ \mu_t^{t-1} = \mu_{t-1}^{t-1} \quad \text{and} \quad P_t^{t-1} = P_{t-1}^{t-1} + \sigma_w^2, \qquad (6.41) $$

which agrees with the first part of Property P6.1.

To derive the filter density, using (6.33) and (6.37) we have

$$ p(\mu_t \mid Y_t) \propto \{y_t; \mu_t, \sigma_v^2\}\{\mu_t; \mu_t^{t-1}, P_t^{t-1}\} = \{\mu_t; y_t, \sigma_v^2\}\{\mu_t; \mu_t^{t-1}, P_t^{t-1}\}. \qquad (6.42) $$

An application of (6.38) gives

$$ \mu_t \mid Y_t \sim N(\mu_t^t, P_t^t), \qquad (6.43) $$

with

$$ \mu_t^t = \frac{\sigma_v^2\,\mu_t^{t-1}}{P_t^{t-1} + \sigma_v^2} + \frac{P_t^{t-1}\,y_t}{P_t^{t-1} + \sigma_v^2} = \mu_t^{t-1} + K_t(y_t - \mu_t^{t-1}), \qquad (6.44) $$

where we have defined

$$ K_t = \frac{P_t^{t-1}}{P_t^{t-1} + \sigma_v^2}, \qquad (6.45) $$

and

$$ P_t^t = \left(\frac{1}{\sigma_v^2} + \frac{1}{P_t^{t-1}}\right)^{-1} = \frac{\sigma_v^2\,P_t^{t-1}}{P_t^{t-1} + \sigma_v^2} = (1 - K_t)P_t^{t-1}. \qquad (6.46) $$

The filter for this specific case, of course, agrees with Property P6.1.

Next, we consider the problem of obtaining estimators for $x_t$ based on the entire data sample $y_1, \ldots, y_n$, where $t \le n$, namely, $x_t^n$. These estimators are called smoothers because a time plot of the sequence $\{x_t^n;\; t = 1, \ldots, n\}$ is typically smoother than the forecasts $\{x_t^{t-1};\; t = 1, \ldots, n\}$ or the filters $\{x_t^t;\; t = 1, \ldots, n\}$. As is obvious from the above remarks, smoothing implies that each estimated value is a function of the present, future, and past, whereas the filtered estimator depends on the present and past. The forecast depends only on the past, as usual.

Property P6.2: The Kalman Smoother
For the state-space model specified in (6.3) and (6.4), with initial conditions $x_n^n$ and $P_n^n$ obtained via Property P6.1, for $t = n, n-1, \ldots, 1$,

$$ x_{t-1}^n = x_{t-1}^{t-1} + J_{t-1}(x_t^n - x_t^{t-1}), \qquad (6.47) $$
$$ P_{t-1}^n = P_{t-1}^{t-1} + J_{t-1}(P_t^n - P_t^{t-1})J_{t-1}', \qquad (6.48) $$

where

$$ J_{t-1} = P_{t-1}^{t-1}\Phi'\,[P_t^{t-1}]^{-1}. \qquad (6.49) $$

Proof. The smoother can be derived in many ways. Here we provide a proof that was given in Ansley and Kohn (1982). First, for $1 \le t \le n$, define

$$ Y_{t-1} = \{y_1, \ldots, y_{t-1}\} \quad \text{and} \quad \eta_t = \{v_t, \ldots, v_n, w_{t+1}, \ldots, w_n\}, $$

with $Y_0$ being empty, and let

$$ q_{t-1} = E\{x_{t-1} \mid Y_{t-1},\, x_t - x_t^{t-1},\, \eta_t\}. $$

Then, because $Y_{t-1}$, $\{x_t - x_t^{t-1}\}$, and $\eta_t$ are mutually independent, and $x_{t-1}$ and $\eta_t$ are independent, using (B.9) we have

$$ q_{t-1} = x_{t-1}^{t-1} + J_{t-1}(x_t - x_t^{t-1}), \qquad (6.50) $$

where

$$ J_{t-1} = \mathrm{cov}(x_{t-1},\, x_t - x_t^{t-1})\,[P_t^{t-1}]^{-1} = P_{t-1}^{t-1}\Phi'[P_t^{t-1}]^{-1}. $$

Finally, because $Y_{t-1}$, $x_t - x_t^{t-1}$, and $\eta_t$ generate $Y_n = \{y_1, \ldots, y_n\}$,

$$ x_{t-1}^n = E\{x_{t-1} \mid Y_n\} = E\{q_{t-1} \mid Y_n\} = x_{t-1}^{t-1} + J_{t-1}(x_t^n - x_t^{t-1}), $$

which establishes (6.47).

The recursion for the error covariance, $P_{t-1}^n$, is obtained by straightforward calculation. Using (6.47) we obtain

$$ x_{t-1} - x_{t-1}^n = x_{t-1} - x_{t-1}^{t-1} - J_{t-1}(x_t^n - \Phi x_{t-1}^{t-1}), $$

or

$$ (x_{t-1} - x_{t-1}^n) + J_{t-1}x_t^n = (x_{t-1} - x_{t-1}^{t-1}) + J_{t-1}\Phi x_{t-1}^{t-1}. \qquad (6.51) $$

Multiplying each side of (6.51) by the transpose of itself and taking expectation, we have

$$ P_{t-1}^n + J_{t-1}E(x_t^n x_t^{n\prime})J_{t-1}' = P_{t-1}^{t-1} + J_{t-1}\Phi E(x_{t-1}^{t-1}x_{t-1}^{t-1\prime})\Phi'J_{t-1}', \qquad (6.52) $$

using the fact the cross-product terms are zero. But,

$$ E(x_t^n x_t^{n\prime}) = E(x_t x_t') - P_t^n = \Phi E(x_{t-1}x_{t-1}')\Phi' + Q - P_t^n, $$

and

$$ E(x_{t-1}^{t-1}x_{t-1}^{t-1\prime}) = E(x_{t-1}x_{t-1}') - P_{t-1}^{t-1}, $$

so (6.52) simplifies to (6.48).

Example 6.5 Prediction, Filtering and Smoothing for the Local Level Model

For this example, we simulated n = 50 observations from the local level trend model discussed in Example 6.4. We generated a random walk

$$ \mu_t = \mu_{t-1} + w_t \qquad (6.53) $$

with $w_t \sim \mathrm{iid}\; N(0,1)$ and $\mu_0 \sim N(0,1)$. We then supposed that we observe a univariate series $y_t$ consisting of the trend component, $\mu_t$, and a noise component, $v_t \sim \mathrm{iid}\; N(0,1)$, where

$$ y_t = \mu_t + v_t. \qquad (6.54) $$

The sequences $\{w_t\}$, $\{v_t\}$ and $\mu_0$ were generated independently. We then ran the Kalman filter and smoother, Properties P6.1 and P6.2, using the actual parameters. The top panel of Figure 6.3 shows the actual values of $\mu_t$ as points, and the predictions $\mu_t^{t-1}$ superimposed on the graph as a line. In addition, we display $\mu_t^{t-1} \pm 2\sqrt{P_t^{t-1}}$ as dashed lines on the plot. The middle panel displays the filters, $\mu_t^t$, as a line with $\mu_t^t \pm 2\sqrt{P_t^t}$ as dashed lines. The bottom panel of Figure 6.3 shows a similar plot for the smoothers $\mu_t^n$.

Table 6.1 shows the first 10 observations as well as the corresponding state values, the predictions, filters and smoothers. Note that in Table 6.1, one-step-ahead prediction is more uncertain than the corresponding filtered value, which, in turn, is more uncertain than the corresponding smoother value (that is, $P_t^{t-1} > P_t^t > P_t^n$). Also, in each case, the error variances stabilize quickly. The R code for this example may be found on the website for the text.
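A minimal, self-contained R sketch in the same spirit follows (our own illustration, not the website code); because the scalar recursions of Properties P6.1 and P6.2 are used directly and a different seed is chosen, the numbers will not reproduce Table 6.1 exactly.

    set.seed(1)
    n <- 50; sw <- 1; sv <- 1
    mu <- cumsum(rnorm(n + 1, 0, sw))            # mu_0, ..., mu_n (random walk, mu_0 ~ N(0,1))
    y  <- mu[-1] + rnorm(n, 0, sv)               # y_t = mu_t + v_t
    # forward pass: prediction and filter
    xp <- Pp <- xf <- Pf <- numeric(n)
    x <- 0; P <- 1                               # x_0^0 = 0, P_0^0 = 1
    for (t in 1:n) {
      xp[t] <- x; Pp[t] <- P + sw^2              # (6.41)
      K     <- Pp[t] / (Pp[t] + sv^2)            # (6.45)
      xf[t] <- xp[t] + K * (y[t] - xp[t])        # (6.44)
      Pf[t] <- (1 - K) * Pp[t]                   # (6.46)
      x <- xf[t]; P <- Pf[t]
    }
    # backward pass: smoother (6.47)-(6.49), with Phi = 1
    xs <- xf; Ps <- Pf
    for (t in n:2) {
      J <- Pf[t - 1] / Pp[t]
      xs[t - 1] <- xf[t - 1] + J * (xs[t] - xp[t])
      Ps[t - 1] <- Pf[t - 1] + J^2 * (Ps[t] - Pp[t])
    }
    plot(mu[-1]); lines(xs)                      # smoother overlaid on the true states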

Figure 6.3 Displays for Example 6.5. The simulated values of $\mu_t$, for $t = 1, \ldots, 50$, given by (6.53), are shown as points. Top: The predictions $\mu_t^{t-1}$ obtained via the Kalman filter are shown as a line; error bounds, $\mu_t^{t-1} \pm 2\sqrt{P_t^{t-1}}$, are shown as dashed lines. Middle: The filters $\mu_t^t$ obtained via the Kalman filter are shown as a line; error bounds, $\mu_t^t \pm 2\sqrt{P_t^t}$, are shown as dashed lines. Bottom: The smoothers $\mu_t^n$ obtained via the Kalman smoother are shown as a line; error bounds, $\mu_t^n \pm 2\sqrt{P_t^n}$, are shown as dashed lines.

In the next section, we will need a set of recursions for obtaining $P_{t,t-1}^n$, as defined in (6.18). We give the necessary recursion in the following property.

Property P6.3: The Lag-One Covariance Smoother
For the state-space model specified in (6.3) and (6.4), with $K_t$, $J_t$ ($t = 1, \ldots, n$), and $P_n^n$ obtained from Properties P6.1 and P6.2, and with initial condition

$$ P_{n,n-1}^n = (I - K_n A_n)\Phi P_{n-1}^{n-1}, \qquad (6.55) $$

Table 6.1 Forecasts, Filters, and Smoothers for Example 6.5.

     t     y_t     mu_t   mu_t^{t-1}  P_t^{t-1}   mu_t^t   P_t^t   mu_t^n   P_t^n
     0      --     -.63       --         --        .00     1.00     -.32     .62
     1    -1.05    -.44      .00        2.00       -.70      .67     -.65     .47
     2     -.94   -1.28     -.70        1.67       -.85      .63     -.57     .45
     3     -.81     .32     -.85        1.63       -.83      .62     -.11     .45
     4     2.08     .65     -.83        1.62        .97      .62     1.04     .45
     5     1.81    -.17      .97        1.62       1.49      .62     1.16     .45
     6     -.05     .31     1.49        1.62        .53      .62      .63     .45
     7      .01    1.05      .53        1.62        .21      .62      .78     .45
     8     2.20    1.63      .21        1.62       1.44      .62     1.70     .45
     9     1.19    1.32     1.44        1.62       1.28      .62     2.12     .45
    10     5.24    2.83     1.28        1.62       3.73      .62     3.48     .45

for $t = n, n-1, \ldots, 2$,

$$ P_{t-1,t-2}^n = P_{t-1}^{t-1}J_{t-2}' + J_{t-1}\big(P_{t,t-1}^n - \Phi P_{t-1}^{t-1}\big)J_{t-2}'. \qquad (6.56) $$

Proof. Because we are computing covariances, we may assume $u_t \equiv 0$ without loss of generality. To derive the initial term (6.55), we first define the estimation error

$$ \tilde{x}_t^s = x_t - x_t^s. $$

Then, using (6.21) and (6.47), we write

$$ P_{t,t-1}^t = E\big(\tilde{x}_t^t\,\tilde{x}_{t-1}^{t\prime}\big) = E\big\{[\tilde{x}_t^{t-1} - K_t(y_t - A_t x_t^{t-1})][\tilde{x}_{t-1}^{t-1} - J_{t-1}K_t(y_t - A_t x_t^{t-1})]'\big\} = E\big\{[\tilde{x}_t^{t-1} - K_t(A_t\tilde{x}_t^{t-1} + v_t)][\tilde{x}_{t-1}^{t-1} - J_{t-1}K_t(A_t\tilde{x}_t^{t-1} + v_t)]'\big\}. $$

Expanding terms and taking expectation, we arrive at

$$ P_{t,t-1}^t = P_{t,t-1}^{t-1} - P_t^{t-1}A_t'K_t'J_{t-1}' - K_t A_t P_{t,t-1}^{t-1} + K_t(A_t P_t^{t-1}A_t' + R)K_t'J_{t-1}', $$

noting $E(\tilde{x}_t^{t-1}v_t') = 0$. The final simplification occurs by realizing that $K_t(A_t P_t^{t-1}A_t' + R) = P_t^{t-1}A_t'$, and $P_{t,t-1}^{t-1} = \Phi P_{t-1}^{t-1}$. These relationships hold for any $t = 1, \ldots, n$, and (6.55) is the case t = n.

We give the basic steps in the derivation of (6.56). The first step is to use (6.47) to write

$$ \tilde{x}_{t-1}^n + J_{t-1}x_t^n = \tilde{x}_{t-1}^{t-1} + J_{t-1}\Phi x_{t-1}^{t-1} \qquad (6.57) $$

and

$$ \tilde{x}_{t-2}^n + J_{t-2}x_{t-1}^n = \tilde{x}_{t-2}^{t-2} + J_{t-2}\Phi x_{t-2}^{t-2}. \qquad (6.58) $$

Next, multiply the left-hand side of (6.57) by the transpose of the left-hand side of (6.58), and equate that to the corresponding result of the right-hand sides of (6.57) and (6.58). Then, taking expectation of both sides, the left-hand side result reduces to

$$ P_{t-1,t-2}^n + J_{t-1}E(x_t^n x_{t-1}^{n\prime})J_{t-2}' \qquad (6.59) $$

and the right-hand side result reduces to

$$ P_{t-1,t-2}^{t-2} - K_{t-1}A_{t-1}P_{t-1,t-2}^{t-2} + J_{t-1}\Phi K_{t-1}A_{t-1}P_{t-1,t-2}^{t-2} + J_{t-1}\Phi E(x_{t-1}^{t-1}x_{t-2}^{t-2\prime})\Phi'J_{t-2}'. \qquad (6.60) $$

In (6.59), write

$$ E(x_t^n x_{t-1}^{n\prime}) = E(x_t x_{t-1}') - P_{t,t-1}^n = \Phi E(x_{t-1}x_{t-2}')\Phi' + \Phi Q - P_{t,t-1}^n, $$

and in (6.60), write

$$ E(x_{t-1}^{t-1}x_{t-2}^{t-2\prime}) = E(x_{t-1}^{t-2}x_{t-2}^{t-2\prime}) = E(x_{t-1}x_{t-2}') - P_{t-1,t-2}^{t-2}. $$

Equating (6.59) to (6.60) using these relationships and simplifying the result leads to (6.56).

6.3 Maximum Likelihood Estimation

The estimation of the parameters that specify the state-space model, (6.3) and (6.4), is quite involved. We use $\Theta = \{\mu_0, \Sigma_0, \Phi, Q, R, \Upsilon, \Gamma\}$ to represent the vector of parameters containing the elements of the initial mean and covariance $\mu_0$ and $\Sigma_0$, the transition matrix $\Phi$, the state and observation covariance matrices Q and R, and the input coefficient matrices $\Upsilon$ and $\Gamma$. We use maximum likelihood under the assumption that the initial state is normal, $x_0 \sim N(\mu_0, \Sigma_0)$, and the errors $w_1, \ldots, w_n$ and $v_1, \ldots, v_n$ are jointly normal and uncorrelated vector variables. We continue to assume, for simplicity, $\{w_t\}$ and $\{v_t\}$ are uncorrelated.

The likelihood is computed using the innovations $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$, defined by (6.24),

$$ \epsilon_t = y_t - A_t x_t^{t-1} - \Gamma u_t. $$

The innovations form of the likelihood function, which was first given by Schweppe (1965), is obtained using an argument similar to the one leading to (3.105), and proceeds by noting the innovations are independent Gaussian random vectors with zero means and, as shown in (6.25), covariance matrices

$$ \Sigma_t = A_t P_t^{t-1}A_t' + R. \qquad (6.61) $$

Hence, ignoring a constant, we may write the likelihood, $L_Y(\Theta)$, as

$$ -\ln L_Y(\Theta) = \frac{1}{2}\sum_{t=1}^n \log|\Sigma_t(\Theta)| + \frac{1}{2}\sum_{t=1}^n \epsilon_t(\Theta)'\Sigma_t(\Theta)^{-1}\epsilon_t(\Theta), \qquad (6.62) $$

where we have emphasized the dependence of the innovations on the parameters $\Theta$. Of course, (6.62) is a highly nonlinear and complicated function of the unknown parameters. The usual procedure is to fix $x_0$ and then develop a set of recursions for the log likelihood function and its first two derivatives (for example, Gupta and Mehra, 1974). Then, a Newton–Raphson algorithm (see Example 3.28) can be used successively to update the parameter values until the negative of the log likelihood is minimized. This approach is advocated, for example, by Jones (1980), who developed ARMA estimation by putting the ARMA model in state-space form. For the univariate case, (6.62) is identical, in form, to the likelihood for the ARMA model given in (3.105).

The steps involved in performing a Newton–Raphson estimation procedure are as follows.

1. Select initial values for the parameters, say, $\Theta^{(0)}$.

2. Run the Kalman filter, Property P6.1, using the initial parameter values, $\Theta^{(0)}$, to obtain a set of innovations and error covariances, say, $\{\epsilon_t^{(0)};\, t = 1, \ldots, n\}$ and $\{\Sigma_t^{(0)};\, t = 1, \ldots, n\}$.

3. Run one iteration of a Newton–Raphson procedure with $-\ln L_Y(\Theta)$ as the criterion function (refer to Example 3.28 for details), to obtain a new set of estimates, say $\Theta^{(1)}$.

4. At iteration j, $(j = 1, 2, \ldots)$, repeat step 2 using $\Theta^{(j)}$ in place of $\Theta^{(j-1)}$ to obtain a new set of innovation values $\{\epsilon_t^{(j)};\, t = 1, \ldots, n\}$ and $\{\Sigma_t^{(j)};\, t = 1, \ldots, n\}$. Then repeat step 3 to obtain a new estimate $\Theta^{(j+1)}$. Stop when the estimates or the likelihood stabilize; for example, stop when the values of $\Theta^{(j+1)}$ differ from $\Theta^{(j)}$, or when $L_Y(\Theta^{(j+1)})$ differs from $L_Y(\Theta^{(j)})$, by some predetermined, but small, amount.

Example 6.6 Newton–Raphson for Example 6.3

In this example, we generated n = 100 observations, $y_1, \ldots, y_{100}$, using the model in Example 6.3, to perform a Newton–Raphson estimation of the parameters $\phi$, $\sigma_w^2$, and $\sigma_v^2$. In the notation of §6.2, we would have $\Phi = \phi$, $Q = \sigma_w^2$ and $R = \sigma_v^2$. The actual values of the parameters are $\phi = .8$ and $\sigma_w^2 = \sigma_v^2 = 1$.

In the simple case of an AR(1) with observational noise, initial estimation can be accomplished using the results of Example 6.3. For example, using (6.16), we set

$$ \phi^{(0)} = \widehat{\rho}_y(2)/\widehat{\rho}_y(1). $$

Similarly, from (6.15), $\gamma_x(1) = \gamma_y(1) = \phi\sigma_w^2/(1-\phi^2)$, so that, initially, we set

$$ \sigma_w^{2(0)} = (1 - \phi^{(0)2})\,\widehat{\gamma}_y(1)/\phi^{(0)}. $$

Finally, using (6.14), we obtain an initial estimate of $\sigma_v^2$, namely,

$$ \sigma_v^{2(0)} = \widehat{\gamma}_y(0) - \big[\sigma_w^{2(0)}/(1 - \phi^{(0)2})\big]. $$

Newton–Raphson estimation was accomplished using the R program optim. The code used for this example can be obtained on the website for the text. In that program, we must provide an evaluation of the function to be minimized, namely, $-\ln L_Y(\Theta)$. In this case, the "function call" combines steps 2 and 3, using the current values of the parameters, $\Theta^{(j-1)}$, to obtain first the filtered values, then the innovation values, and then calculating the criterion function, $-\ln L_Y(\Theta^{(j-1)})$, to be minimized. We can also provide analytic forms of the gradient or score vector, $-\partial\ln L_Y(\Theta)/\partial\Theta$, and the Hessian matrix, $-\partial^2\ln L_Y(\Theta)/\partial\Theta\,\partial\Theta'$, in the optimization routine, or allow the program to calculate these values numerically. In this example, we let the program proceed numerically, and we note the need to be cautious when calculating gradients numerically. For better stability, we can also provide an iterative solution for obtaining analytic gradients and Hessians of the log likelihood function; for details, see Problems 6.11 and 6.12; also, see Gupta and Mehra (1974). The final estimates, along with their standard errors (in parentheses), were

$$ \widehat{\phi} = .81\ (.08), \qquad \widehat{\sigma}_w = .85\ (.17), \qquad \widehat{\sigma}_v = .87\ (.14), $$

and the algorithm converged in seven steps. The standard errors are a byproduct of the estimation procedure, and we will discuss their evaluation later in this section, after Property P6.4.
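A minimal version of this fit can be written directly from (6.62) and Property P6.1. The sketch below is our own illustration rather than the program on the text's website: it simulates its own data, wraps a scalar Kalman filter inside the criterion function passed to optim, and reports standard errors on the working scale ($\phi$, $\log\sigma_w$, $\log\sigma_v$) rather than the transformed scale used in the example.

    set.seed(999)
    n <- 100; phi <- .8; sw <- 1; sv <- 1
    x <- arima.sim(list(ar = phi), n = n, sd = sw)
    y <- as.numeric(x + rnorm(n, 0, sv))
    negloglike <- function(par) {                 # par = (phi, log sigma_w, log sigma_v)
      phi <- par[1]; Q <- exp(par[2])^2; R <- exp(par[3])^2
      xf <- 0; Pf <- Q / (1 - min(phi^2, .999))   # start the filter near stationarity
      ll <- 0
      for (t in 1:n) {
        xp <- phi * xf; Pp <- phi^2 * Pf + Q      # (6.19)-(6.20)
        sig <- Pp + R; eps <- y[t] - xp           # innovation and its variance
        K  <- Pp / sig
        xf <- xp + K * eps; Pf <- (1 - K) * Pp    # (6.21)-(6.22)
        ll <- ll + 0.5 * (log(sig) + eps^2 / sig) # (6.62), constants dropped
      }
      ll
    }
    fit <- optim(c(.5, 0, 0), negloglike, method = "BFGS", hessian = TRUE)
    est <- c(phi = fit$par[1], sigw = exp(fit$par[2]), sigv = exp(fit$par[3]))
    se  <- sqrt(diag(solve(fit$hessian)))         # SEs on the working scale
    est; se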

Example 6.7 Newton–Raphson for the Global Temperature Series in Example 6.2

In Example 6.2 we considered two different global temperature series of n = 108 observations each, and they are plotted in Figure 6.2. In that example, we argued that both series should be measuring the same underlying climatic signal, $x_t$, which we model as a random walk,

$$ x_t = x_{t-1} + w_t. $$

Recall that the observation equation was written as

$$ \begin{pmatrix} y_{t1} \\ y_{t2} \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} x_t + \begin{pmatrix} v_{t1} \\ v_{t2} \end{pmatrix}, $$

and the model covariance matrices are given by $Q = q_{11}$ and

$$ R = \begin{pmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{pmatrix}. $$

Hence, there are four parameters to estimate, namely, $q_{11}$, $r_{11}$, $r_{12}$, and $r_{22}$, noting that $r_{21} = r_{12}$. We hold the initial state parameters fixed in this example at $\mu_0 = -.35$ and $\Sigma_0 = .01$ (these are, approximately, the mean and variance of the first observation in each series).

The final estimates are $\widehat{q}_{11} = .05$, $\widehat{r}_{11} = .019$, $\widehat{r}_{12} = .006$, $\widehat{r}_{22} = .005$, with all values being significant. The observations and the smoothed estimate of the signal, $x_t^n$, are displayed in Figure 6.4.

Figure 6.4 Plot for Example 6.7. The thin solid and dashed lines are the two average global temperature deviations shown in Figure 6.2. The thick solid line is the estimated smoother $x_t^n$.

In addition to Newton–Raphson, Shumway and Stoffer (1982) presented a conceptually simpler estimation procedure based on the EM (expectation-maximization) algorithm (Dempster et al., 1977). For the sake of brevity, we ignore the inputs and consider the model in the form of (6.1) and (6.2); the general case is left as an exercise (Problem 6.9). The basic idea is that if we could observe the states, $X_n = \{x_0, x_1, \ldots, x_n\}$, in addition to the observations $Y_n = \{y_1, \ldots, y_n\}$, then we would consider $\{X_n, Y_n\}$ as the complete data, with the joint density

$$ f_\Theta(X_n, Y_n) = f_{\mu_0,\Sigma_0}(x_0)\prod_{t=1}^n f_{\Phi,Q}(x_t \mid x_{t-1})\prod_{t=1}^n f_R(y_t \mid x_t). \qquad (6.63) $$

Under the Gaussian assumption and ignoring constants, the complete data likelihood, (6.63), can be written as
$$
\begin{aligned}
-2\ln L_{X,Y}(\Theta) &= \ln|\Sigma_0| + (x_0 - \mu_0)'\Sigma_0^{-1}(x_0 - \mu_0) \\
&\quad + n\ln|Q| + \sum_{t=1}^n (x_t - \Phi x_{t-1})'Q^{-1}(x_t - \Phi x_{t-1}) \\
&\quad + n\ln|R| + \sum_{t=1}^n (y_t - A_tx_t)'R^{-1}(y_t - A_tx_t). \tag{6.64}
\end{aligned}
$$

Thus, in view of (6.64), if we did have the complete data, we could then use the results from multivariate normal theory to easily obtain the MLEs of $\Theta$. We do not have the complete data; however, the EM algorithm gives us an iterative method for finding the MLEs of $\Theta$ based on the incomplete data, $Y_n$, by successively maximizing the conditional expectation of the complete data likelihood. To implement the EM algorithm, we write, at iteration j, (j = 1, 2, \ldots),
$$Q\big(\Theta \mid \Theta^{(j-1)}\big) = E\big\{-2\ln L_{X,Y}(\Theta) \mid Y_n, \Theta^{(j-1)}\big\}. \tag{6.65}$$
Calculation of (6.65) is the expectation step. Of course, given the current value of the parameters, $\Theta^{(j-1)}$, we can use Property P6.2 to obtain the desired conditional expectations as smoothers. This property yields

$$
\begin{aligned}
Q\big(\Theta \mid \Theta^{(j-1)}\big) &= \ln|\Sigma_0| + \operatorname{tr}\big\{\Sigma_0^{-1}\big[P_0^n + (x_0^n - \mu_0)(x_0^n - \mu_0)'\big]\big\} \\
&\quad + n\ln|Q| + \operatorname{tr}\big\{Q^{-1}\big[S_{11} - S_{10}\Phi' - \Phi S_{10}' + \Phi S_{00}\Phi'\big]\big\} \\
&\quad + n\ln|R| + \operatorname{tr}\Big\{R^{-1}\sum_{t=1}^n\big[(y_t - A_tx_t^n)(y_t - A_tx_t^n)' + A_tP_t^nA_t'\big]\Big\}, \tag{6.66}
\end{aligned}
$$
where
$$S_{11} = \sum_{t=1}^n (x_t^n x_t^{n\prime} + P_t^n), \tag{6.67}$$
$$S_{10} = \sum_{t=1}^n (x_t^n x_{t-1}^{n\prime} + P_{t,t-1}^n), \tag{6.68}$$
and
$$S_{00} = \sum_{t=1}^n (x_{t-1}^n x_{t-1}^{n\prime} + P_{t-1}^n). \tag{6.69}$$


In (6.66)–(6.69), the smoothers are calculated under the current value of the parameters $\Theta^{(j-1)}$; for simplicity, we have not explicitly displayed this fact.

Minimizing (6.66) with respect to the parameters, at iteration j, constitutes the maximization step, and is analogous to the usual multivariate regression approach, which yields the updated estimates

$$\Phi^{(j)} = S_{10}S_{00}^{-1}, \tag{6.70}$$
$$Q^{(j)} = n^{-1}\big(S_{11} - S_{10}S_{00}^{-1}S_{10}'\big), \tag{6.71}$$
and
$$R^{(j)} = n^{-1}\sum_{t=1}^n\big[(y_t - A_tx_t^n)(y_t - A_tx_t^n)' + A_tP_t^nA_t'\big]. \tag{6.72}$$
The updates for the initial mean and variance–covariance matrix are
$$\mu_0^{(j)} = x_0^n \quad \text{and} \quad \Sigma_0^{(j)} = P_0^n, \tag{6.73}$$

obtained from minimizing (6.66).

The overall procedure can be regarded as simply alternating between the Kalman filtering and smoothing recursions and the multivariate normal maximum likelihood estimators, as given by (6.70)–(6.73). Convergence results for the EM algorithm under general conditions can be found in Wu (1983). We summarize the iterative procedure as follows.

1. Initialize the procedure by selecting starting values for the parameters $\Theta^{(0)} = \{\mu_0, \Sigma_0, \Phi, Q, R\}$.

On iteration j, (j = 1, 2, . . .):

2. Compute the incomplete-data likelihood, $-\ln L_Y(\Theta^{(j-1)})$; see equation (6.62).

3. Perform the E-Step. Use Properties P6.1, P6.2, and P6.3 to obtain the smoothed values $x_t^n$, $P_t^n$, and $P_{t,t-1}^n$, for $t = 1, \ldots, n$, using the parameters $\Theta^{(j-1)}$. Use the smoothed values to calculate $S_{11}$, $S_{10}$, $S_{00}$ given in (6.67)–(6.69).

4. Perform the M-Step. Update the estimates, $\mu_0$, $\Sigma_0$, $\Phi$, $Q$, and $R$, using (6.70)–(6.73), to obtain $\Theta^{(j)}$.

5. Repeat Steps 2 – 4 to convergence.
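As a sketch of how the M-step computations in (6.67)–(6.73) might be coded for a univariate state and observation, the following R fragment assumes the E-step smoother output has already been collected into vectors xs ($x_t^n$, $t = 0, \ldots, n$), Ps ($P_t^n$), and Pcs ($P_{t,t-1}^n$, $t = 1, \ldots, n$); the argument names are illustrative and this is not one of the text's programs.

  # sketch of the M-step (6.70)-(6.73) for scalar state and observation
  em.mstep <- function(y, xs, Ps, Pcs, A = 1) {
    n   <- length(y)
    cur <- 2:(n + 1); lag <- 1:n              # indices for t = 1..n and t-1 = 0..n-1
    S11 <- sum(xs[cur]^2 + Ps[cur])           # (6.67)
    S10 <- sum(xs[cur]*xs[lag] + Pcs)         # (6.68)
    S00 <- sum(xs[lag]^2 + Ps[lag])           # (6.69)
    Phi <- S10/S00                            # (6.70)
    Q   <- (S11 - S10^2/S00)/n                # (6.71)
    R   <- sum((y - A*xs[cur])^2 + A^2*Ps[cur])/n   # (6.72)
    list(Phi = Phi, Q = Q, R = R, mu0 = xs[1], Sigma0 = Ps[1])  # (6.73)
  }

In practice this function would be called once per iteration, sandwiched between the filter/smoother run (the E-step) and the likelihood evaluation in step 2.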


Example 6.8 EM Algorithm for Example 6.3

Using the same data generated in Example 6.6, we performed an EM algorithm estimation of the parameters $\phi$, $\sigma_w^2$, and $\sigma_v^2$, as well as the initial parameters $\mu_0$ and $\Sigma_0$.

The convergence rate of the EM algorithm is slow compared with that of the Newton–Raphson procedure. In this example, with convergence being claimed when the log likelihood does not change by more than .001, convergence was attained after 38 iterations.

The final estimates, along with their standard errors (in parentheses), were

$$\widehat\phi = .83\ (.08), \quad \widehat\sigma_w = .81\ (.17), \quad \widehat\sigma_v = .91\ (.14),$$

with $\widehat\mu_0 = -.06$ and $\widehat\Sigma_0 = .44$.

Evaluation of the standard errors used a call to fdHess in the nlme R package to evaluate the Hessian at the final estimates. The nlme package must be loaded prior to the call to fdHess.
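Continuing the illustrative sketch from Example 6.6, such a standard-error calculation might look as follows; theta.hat, negloglik, and y are the hypothetical names used in that sketch, and the stated values are the EM estimates above.

  library(nlme)
  theta.hat <- c(.83, .81, .91)                       # phi, sigma_w, sigma_v at convergence
  H  <- fdHess(theta.hat, function(p) negloglik(p, y = y))$Hessian
  SE <- sqrt(diag(solve(H)))                          # estimated standard errors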

Asymptotic Distribution of the MLEs

The asymptotic distribution of estimators of the model parameters, say, $\widehat\Theta_n$, is studied extensively in Caines (1988, Chapters 7 and 8) and in Hannan and Deistler (1988, Chapter 4). In both of these references, the consistency and asymptotic normality of the estimators is established under general conditions. Although we will only state the basic result, some crucial elements are needed to establish large sample properties of the estimators. An essential condition is the stability of the filter. Stability of the filter assures that, for large t, the innovations $\varepsilon_t$ are basically copies of each other (that is, independent and identically distributed) with a stable covariance matrix $\Sigma$ that does not depend on t and that, asymptotically, the innovations contain all of the information about the unknown parameters. Although it is not necessary, for simplicity, we shall assume here that $A_t \equiv A$ for all t. Details on departures from this assumption can be found in Jazwinski (1970, Sections 7.6 and 7.8). We also drop the inputs and use the model in the form of (6.1) and (6.2).

For stability of the filter, we assume the eigenvalues of $\Phi$ are less than one in absolute value; this assumption can be weakened (for example, see Harvey, 1991, Section 4.3), but we retain it for simplicity. This assumption is enough to ensure the stability of the filter in that, as $t \to \infty$, the filter error covariance matrix $P_t^t$ converges to $P$, the steady-state error covariance matrix, the gain matrix $K_t$ converges to $K$, the steady-state gain matrix, from which it follows that the innovation variance–covariance matrix $\Sigma_t$ converges to $\Sigma$, the steady-state variance–covariance matrix of the stable innovations; details can be found in Jazwinski (1970, Sections 7.6 and 7.8) and Anderson and Moore (1979, Section 4.4). In particular, the steady-state filter error covariance matrix, $P$,


satisfies the Riccati equation:
$$P = \Phi\big[P - PA'(APA' + R)^{-1}AP\big]\Phi' + Q;$$
the steady-state gain matrix satisfies $K = PA'[APA' + R]^{-1}$. In Example 6.5, for all practical purposes, stability was reached by the fourth observation.
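A small R sketch, with illustrative scalar values, of iterating the Riccati recursion to its steady-state solution and forming the steady-state gain:

  # sketch: steady-state P, K, and Sigma for scalar Phi = .8, A = 1, Q = R = 1
  Phi <- .8; A <- 1; Q <- 1; R <- 1
  P <- Q                                   # any positive starting value
  repeat {
    Pnew <- Phi*(P - P*A*(A*P*A + R)^(-1)*A*P)*Phi + Q
    if (abs(Pnew - P) < 1e-10) break
    P <- Pnew
  }
  K     <- P*A/(A*P*A + R)                 # steady-state gain
  Sigma <- A*P*A + R                       # steady-state innovation variance
  c(P = P, K = K, Sigma = Sigma)

With these values, the convergence is rapid, which is consistent with the remark about Example 6.5 above.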

When the process is in steady-state, we may consider $x_{t+1}^t$ as the steady-state predictor and interpret it as $x_{t+1}^t = E(x_{t+1} \mid y_t, y_{t-1}, \ldots)$. As can be seen from (6.19) and (6.21), the steady-state predictor can be written as
$$x_{t+1}^t = \Phi[I - KA]x_t^{t-1} + \Phi Ky_t = \Phi x_t^{t-1} + \Phi K\varepsilon_t, \tag{6.74}$$
where $\varepsilon_t$ is the steady-state innovation process given by
$$\varepsilon_t = y_t - E(y_t \mid y_{t-1}, y_{t-2}, \ldots).$$
In this case, $\varepsilon_t \sim \text{iid } N(0, \Sigma)$, where $\Sigma = APA' + R$. In steady-state, the observations can be written as
$$y_t = Ax_t^{t-1} + \varepsilon_t. \tag{6.75}$$
Together, (6.74) and (6.75) make up the steady-state innovations form of the dynamic linear model.

Two other conditions worth mentioning are observability and controllability. Observability focuses on the question of how much information can be gained about the p-dimensional state vector $x_t$ from p future observations $\{y_t, y_{t+1}, \ldots, y_{t+p-1}\}$. Consider the state without any noise term,
$$x_{t+p} = \Phi x_{t+p-1} = \cdots = \Phi^px_t.$$
Then, the data (without observational noise) satisfy
$$y_{t+j} = Ax_{t+j} = A\Phi^jx_t, \quad j = 0, \ldots, p-1,$$
or
$$(y_t', \ldots, y_{t+p-1}') = x_t'\big[A', \Phi'A', \ldots, \Phi'^{\,p-1}A'\big].$$
Hence, if the observability matrix $\mathcal{O}' = [A', \Phi'A', \ldots, \Phi'^{\,p-1}A']$ has full rank p, we may explicitly solve for $x_t$ in terms of $y_{t:p} = (y_t', \ldots, y_{t+p-1}')'$, namely, $x_t = (\mathcal{O}'\mathcal{O})^{-1}\mathcal{O}'y_{t:p}$, and the system is said to be observable.

In a similar manner, to define controllability, write the state noise as $w_t = Bu_t$, where $B$ is $p \times r$ and $u_t$ is an r-dimensional, nonsingular, white noise process. Thus, the state equation is $x_t = \Phi x_{t-1} + Bu_t$. If the matrix $C = [B, \Phi B, \Phi^2B, \ldots, \Phi^{p-1}B]$ has full rank p, the process is said to be controllable. Controllability has to do with the fact that the state equation satisfies
$$x_{t+p} = \sum_{j=0}^{p-1}\Phi^jBu_{t+p-j} + \Phi^px_t = CU_t + \Phi^px_t,$$


where $U_t = (u_{t+p}', \ldots, u_{t+1}')'$. If we think of the variables $\{u_{t+p}, \ldots, u_{t+1}\}$ as "controlling" the state output $x_t$, and we act as if we are free to choose the $u_t$ at will, the fact that $C$ is full rank means any desired value of $x_{t+p}$ can be obtained from any initial state $x_t$ by control of $U_t$. In particular, we can put $U_t = C'(CC')^{-1}(x_{t+p} - \Phi^px_t)$.

The key point about controllability and observability is that these conditions are necessary and sufficient to ensure the state-space model has the smallest possible dimension; details can be found in Hannan and Deistler (1988, Section 2.3). As a simple example, suppose the state system is bivariate, $x_t = (x_{t1}, x_{t2})'$, where $x_{t1}$ and $x_{t2}$ are independent components with $\Phi = \operatorname{diag}\{\phi_1, \phi_2\}$, and $y_t = [1, 0]x_t + v_t$; that is, $y_t = x_{t1} + v_t$. Clearly we could not hope to reasonably estimate $\phi_2$. This system is not observable because $\mathcal{O}$ has rank one. Additional details on this point can be found in Jazwinski (1970, Section 7.5).

In the following property, we assume the Gaussian state-space model (6.1) and (6.2) is time invariant, i.e., $A_t \equiv A$, the eigenvalues of $\Phi$ are within the unit circle, and the system is observable and controllable. We denote the true parameters by $\Theta_0$, and we assume the dimension of $\Theta_0$ is the dimension of the parameter space. Although it is not necessary to assume $w_t$ and $v_t$ are Gaussian, certain additional conditions would have to apply and adjustments to the asymptotic covariance matrix would have to be made (see Caines, 1988, Chapter 8).

Property P6.4: Asymptotic Distribution of the Estimators
Under general conditions, let $\widehat\Theta_n$ be the estimator of $\Theta_0$ obtained by maximizing the innovations likelihood, $L_Y(\Theta)$, as given in (6.62). Then, as $n \to \infty$,
$$\sqrt{n}\big(\widehat\Theta_n - \Theta_0\big) \stackrel{d}{\to} N\big[0,\ \mathcal{I}(\Theta_0)^{-1}\big],$$
where $\mathcal{I}(\Theta)$ is the asymptotic information matrix given by
$$\mathcal{I}(\Theta) = \lim_{n\to\infty} n^{-1}E\big[-\partial^2\ln L_Y(\Theta)/\partial\Theta\,\partial\Theta'\big].$$

Precise details and the proof of Property P6.4 are given in Caines (1988, Chapter 7) and in Hannan and Deistler (1988, Chapter 4). For a Newton procedure, the Hessian matrix (as described in Example 6.6) at the time of convergence can be used as an estimate of $n\mathcal{I}(\Theta_0)$ to obtain estimates of the standard errors. In the case of the EM algorithm, no derivatives are calculated, but we may include a numerical evaluation of the Hessian matrix at the time of convergence to obtain estimated standard errors. Also, extensions of the EM algorithm exist, such as the SEM algorithm (Meng and Rubin, 1991), that include a procedure for the estimation of standard errors. In the examples of this section, the estimated standard errors were obtained from the numerical Hessian matrix of $-\ln L_Y(\widehat\Theta)$, where $\widehat\Theta$ is the vector of parameter estimates at the time of convergence.


6.4 Missing Data Modifications

An attractive feature available within the state-space framework is its ability to treat time series that have been observed irregularly over time. For example, Palma and Chan (1997) used the state-space model for estimation and forecasting of long memory (specifically, fractionally integrated ARMA, or ARFIMA, processes) time series with missing observations. Throughout this section we assume the model is of the form (6.1) and (6.2). The EM algorithm allows parts of the observation vector $y_t$ to be missing at a number of observation times. Shumway and Stoffer (1982) described the modifications necessary for the special case in which the subvectors of $v_t$ corresponding to the observed and unobserved parts of $y_t$ happen to be uncorrelated. Here, we will also discuss the general case.

Suppose, at a given time t, we define the partition of the $q \times 1$ observation vector $y_t = (y_t^{(1)\prime}, y_t^{(2)\prime})'$, where the first $q_{1t} \times 1$ component is observed and the second $q_{2t} \times 1$ component is unobserved, $q_{1t} + q_{2t} = q$. Then, write the partitioned observation equation
$$\begin{pmatrix} y_t^{(1)} \\ y_t^{(2)} \end{pmatrix} = \begin{bmatrix} A_t^{(1)} \\ A_t^{(2)} \end{bmatrix}x_t + \begin{pmatrix} v_t^{(1)} \\ v_t^{(2)} \end{pmatrix}, \tag{6.76}$$
where $A_t^{(1)}$ and $A_t^{(2)}$ are, respectively, the $q_{1t} \times p$ and $q_{2t} \times p$ partitioned observation matrices, and
$$\operatorname{cov}\begin{pmatrix} v_t^{(1)} \\ v_t^{(2)} \end{pmatrix} = \begin{bmatrix} R_{11t} & R_{12t} \\ R_{21t} & R_{22t} \end{bmatrix} \tag{6.77}$$

denotes the covariance matrix of the measurement errors between the observed and unobserved parts. Stoffer (1982) established the filtering equations, Property P6.1, hold for the missing data case if, at update t, we make the replacements
$$y_{(t)} = \begin{pmatrix} y_t^{(1)} \\ 0 \end{pmatrix}, \quad A_{(t)} = \begin{bmatrix} A_t^{(1)} \\ 0 \end{bmatrix}, \quad R_{(t)} = \begin{bmatrix} R_{11t} & 0 \\ 0 & R_{22t} \end{bmatrix}, \tag{6.78}$$
for $y_t$, $A_t$, and $R$, respectively, in (6.21)–(6.23).

Once the "missing data" filtered values have been obtained, Stoffer (1982) also established the smoother values can be processed using Properties P6.2 and P6.3 with the values obtained from the missing data-filtered values. The implication of these results is that, if $y_t$ is incomplete, the filtered and smoothed estimators can be calculated from the usual equations by entering zeros in the observation vector when data are missing, by zeroing out the corresponding rows of the design matrix $A_t$, and by entering zeros in the off-diagonal elements of $R$ that correspond to $R_{12t}$ and $R_{21t}$ at update t in the filter equation (6.23). In doing this procedure, the state estimators are

$$x_t^{(s)} = E\big(x_t \mid y_1^{(1)}, \ldots, y_s^{(1)}\big), \tag{6.79}$$


with error variance–covariance matrix
$$P_t^{(s)} = E\big\{\big(x_t - x_t^{(s)}\big)\big(x_t - x_t^{(s)}\big)'\big\}. \tag{6.80}$$
The missing data lag-one smoother covariances will be denoted by $P_{t,t-1}^{(n)}$.
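A minimal R sketch of the replacements in (6.78) at a single time point; the function and argument names are illustrative, and the sketch assumes the full observation vector, design matrix, and error covariance are available along with a logical vector marking the observed components.

  # sketch: form y_(t), A_(t), R_(t) of (6.78) for one time t
  missing.mods <- function(yt, At, R, obs) {
    y.t <- ifelse(obs, yt, 0)                 # zeros where data are missing
    A.t <- At; A.t[!obs, ] <- 0               # zero the corresponding rows of A_t
    R.t <- R;  R.t[obs, !obs] <- 0; R.t[!obs, obs] <- 0  # kill cross-covariances
    list(y = y.t, A = A.t, R = R.t)
  }
  # e.g., q = 3 with the second component missing at time t:
  missing.mods(yt = c(1.2, NA, .7), At = diag(3), R = diag(3),
               obs = c(TRUE, FALSE, TRUE))

The modified quantities would then be passed to the usual filter equations (6.21)–(6.23) in place of $y_t$, $A_t$, and $R$.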

The maximum likelihood estimators, as computed in the EM procedure, must also be modified in the missing data case. Now, we consider
$$Y_n^{(1)} = \big\{y_1^{(1)}, \ldots, y_n^{(1)}\big\} \tag{6.81}$$
as the incomplete data, and $X_n$, $Y_n$, as defined in (6.63), as the complete data. In this case, the complete data likelihood, (6.63), or equivalently (6.64), is the same, but to implement the E-step, at iteration j, we must calculate
$$
\begin{aligned}
Q\big(\Theta \mid \Theta^{(j-1)}\big) &= E\big\{-2\ln L_{X,Y}(\Theta) \mid Y_n^{(1)}, \Theta^{(j-1)}\big\} \\
&= E_*\big\{\ln|\Sigma_0| + \operatorname{tr}\,\Sigma_0^{-1}(x_0 - \mu_0)(x_0 - \mu_0)' \mid Y_n^{(1)}\big\} \\
&\quad + E_*\Big\{n\ln|Q| + \sum_{t=1}^n\operatorname{tr}\big[Q^{-1}(x_t - \Phi x_{t-1})(x_t - \Phi x_{t-1})'\big] \mid Y_n^{(1)}\Big\} \\
&\quad + E_*\Big\{n\ln|R| + \sum_{t=1}^n\operatorname{tr}\big[R^{-1}(y_t - A_tx_t)(y_t - A_tx_t)'\big] \mid Y_n^{(1)}\Big\}, \tag{6.82}
\end{aligned}
$$

where $E_*$ denotes the conditional expectation under $\Theta^{(j-1)}$ and tr denotes trace. The first two terms in (6.82) will be like the first two terms of (6.66) with the smoothers $x_t^n$, $P_t^n$, and $P_{t,t-1}^n$ replaced by their missing data counterparts, $x_t^{(n)}$, $P_t^{(n)}$, and $P_{t,t-1}^{(n)}$. What changes in the missing data case is the third term of (6.82), where we must evaluate $E_*(y_t^{(2)} \mid Y_n^{(1)})$ and $E_*(y_t^{(2)}y_t^{(2)\prime} \mid Y_n^{(1)})$. In Stoffer (1982), it is shown that

$$
\begin{aligned}
E_*\big\{(y_t &- A_tx_t)(y_t - A_tx_t)' \mid Y_n^{(1)}\big\} \\
&= \begin{pmatrix} y_t^{(1)} - A_t^{(1)}x_t^{(n)} \\ R_{*21t}R_{*11t}^{-1}\big(y_t^{(1)} - A_t^{(1)}x_t^{(n)}\big) \end{pmatrix}
   \begin{pmatrix} y_t^{(1)} - A_t^{(1)}x_t^{(n)} \\ R_{*21t}R_{*11t}^{-1}\big(y_t^{(1)} - A_t^{(1)}x_t^{(n)}\big) \end{pmatrix}' \\
&\quad + \begin{pmatrix} A_t^{(1)} \\ R_{*21t}R_{*11t}^{-1}A_t^{(1)} \end{pmatrix} P_t^{(n)} \begin{pmatrix} A_t^{(1)} \\ R_{*21t}R_{*11t}^{-1}A_t^{(1)} \end{pmatrix}'
 + \begin{pmatrix} 0 & 0 \\ 0 & R_{*22t} - R_{*21t}R_{*11t}^{-1}R_{*12t} \end{pmatrix}. \tag{6.83}
\end{aligned}
$$
In (6.83), the values of $R_{*ikt}$, for $i, k = 1, 2$, are the current values specified by $\Theta^{(j-1)}$. In addition, $x_t^{(n)}$ and $P_t^{(n)}$ are the values obtained by running the smoother under the current parameter estimates specified by $\Theta^{(j-1)}$.


In the case in which observed and unobserved components have uncorrelated errors, that is, $R_{*12t}$ is the zero matrix, (6.83) can be simplified to
$$
E_*\big\{(y_t - A_tx_t)(y_t - A_tx_t)' \mid Y_n^{(1)}\big\} = \big(y_{(t)} - A_{(t)}x_t^{(n)}\big)\big(y_{(t)} - A_{(t)}x_t^{(n)}\big)' + A_{(t)}P_t^{(n)}A_{(t)}' + \begin{pmatrix} 0 & 0 \\ 0 & R_{*22t} \end{pmatrix}, \tag{6.84}
$$
where $y_{(t)}$ and $A_{(t)}$ are defined in (6.78).

In this simplified case, the "missing data" M-step looks like the M-step given in (6.67)–(6.73). That is, with

$$S_{(11)} = \sum_{t=1}^n\big(x_t^{(n)}x_t^{(n)\prime} + P_t^{(n)}\big), \tag{6.85}$$
$$S_{(10)} = \sum_{t=1}^n\big(x_t^{(n)}x_{t-1}^{(n)\prime} + P_{t,t-1}^{(n)}\big), \tag{6.86}$$
and
$$S_{(00)} = \sum_{t=1}^n\big(x_{t-1}^{(n)}x_{t-1}^{(n)\prime} + P_{t-1}^{(n)}\big), \tag{6.87}$$
where the smoothers are calculated under the present value of the parameters $\Theta^{(j-1)}$ using the missing data modifications, at iteration j, the maximization step is

$$\Phi^{(j)} = S_{(10)}S_{(00)}^{-1}, \tag{6.88}$$
$$Q^{(j)} = n^{-1}\big(S_{(11)} - S_{(10)}S_{(00)}^{-1}S_{(10)}'\big), \tag{6.89}$$

and

$$R^{(j)} = n^{-1}\sum_{t=1}^n D_t\Big\{\big(y_{(t)} - A_{(t)}x_t^{(n)}\big)\big(y_{(t)} - A_{(t)}x_t^{(n)}\big)' + A_{(t)}P_t^{(n)}A_{(t)}' + \begin{pmatrix} 0 & 0 \\ 0 & R_{22t}^{(j-1)} \end{pmatrix}\Big\}D_t', \tag{6.90}$$
where $D_t$ is a permutation matrix that reorders the variables at time t in their original order, and $y_{(t)}$ and $A_{(t)}$ are defined in (6.78). For example, suppose $q = 3$ and at time t, $y_{t2}$ is missing. Then,
$$y_{(t)} = \begin{pmatrix} y_{t1} \\ y_{t3} \\ 0 \end{pmatrix}, \quad A_{(t)} = \begin{bmatrix} A_{t1} \\ A_{t3} \\ 0' \end{bmatrix}, \quad \text{and} \quad D_t = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix},$$


Figure 6.5 Smoothed values for various components in the blood parameter tracking problem (panels: log(white blood count), log(platelet), and HCT, each plotted against Time). The actual data are shown as points, the smoothed values are shown as solid lines, and ±3 standard error bounds are shown as dashed lines.

where $A_{ti}$ is the ith row of $A_t$ and $0'$ is a $1 \times p$ vector of zeros. In (6.90), only $R_{11t}$ gets updated, and $R_{22t}$ at iteration j is simply set to its value from the previous iteration, $j-1$. Of course, if we cannot assume $R_{12t} = 0$, (6.90) must be changed accordingly using (6.83), but (6.88) and (6.89) remain the same. As before, the parameter estimates for the initial state are updated as
$$\mu_0^{(j)} = x_0^{(n)} \quad \text{and} \quad \Sigma_0^{(j)} = P_0^{(n)}. \tag{6.91}$$


Example 6.9 Longitudinal Biomedical Data

We consider the biomedical data in Example 6.1, which has portions of the three-dimensional vector missing after the 40th day. The maximum likelihood procedure yielded the estimators
$$\widehat\Phi = \begin{pmatrix} 1.02 & -.09 & .01 \\ .08 & .90 & .01 \\ -.90 & 1.42 & .87 \end{pmatrix}, \quad \widehat Q = \begin{pmatrix} .018 & .002 & .000 \\ .002 & .004 & .017 \\ .000 & .017 & 2.27 \end{pmatrix},$$
and $\widehat R = \operatorname{diag}\{.004, .022, 1.69\}$ for the transition, state error covariance, and observation error covariance matrices, respectively. The coupling between the first and second series is relatively weak, whereas the third series, HCT, is strongly related to the first two; that is,
$$\widehat x_{t3} = -.90x_{t-1,1} + 1.42x_{t-1,2} + .87x_{t-1,3}.$$
Hence, the HCT is negatively correlated with white blood count and positively correlated with platelet count. Byproducts of the procedure are estimated trajectories for all three longitudinal series and their respective prediction intervals. In particular, Figure 6.5 shows the data as points, the estimated smoothed values $x_t^{(n)}$ as solid lines, and error bounds, $x_t^{(n)} \pm 3\sqrt{P_t^{(n)}}$, as dotted lines, for the critical post-transplant platelet count.

6.5 Structural Models: Signal Extraction and Forecasting

In order to develop computing techniques for handling a versatile cross section of possible models, it is necessary to restrict the state-space model somewhat, and we consider one possible class of specializations in this section. The components of the model are taken as linear processes that can be adapted to represent fixed and disturbed trends and periodicities as well as classical autoregressions. The observed series is regarded as being a sum of component signal series. To illustrate the possibilities, consider the economic example given below that shows how to fit a sum of trend, seasonal, and irregular components to the quarterly earnings data that we have considered before.

Example 6.10 Johnson & Johnson Quarterly Earnings

Consider the quarterly earnings series from the U.S. company Johnson & Johnson as given in Figure 1.1. The series is highly nonstationary, and there is both a trend signal that is gradually increasing over time and a seasonal component that cycles every four quarters, or once per year. The


Figure 6.6 Estimated trend component, $T_t^n$ (top), and estimated trend plus seasonal component, $S_t^n$ (bottom), for the Johnson and Johnson quarterly earnings series.

seasonal component is getting larger over time as well. Transforming into logarithms or even taking the nth root does not seem to make the series stationary, as there is a slight bend to the transformed curve. Suppose, however, we consider the series to be the sum of a trend component, a seasonal component, and a white noise. That is, let the observed series be expressed as
$$y_t = T_t + S_t + v_t, \tag{6.92}$$
where $T_t$ is trend and $S_t$ is the seasonal component. Suppose we allow trend to increase exponentially; that is,
$$T_t = \phi T_{t-1} + w_{t1}, \tag{6.93}$$
where the coefficient $\phi > 1$ characterizes the increase. Let the seasonal component be modeled as
$$S_t + S_{t-1} + S_{t-2} + S_{t-3} = w_{t2}, \tag{6.94}$$


which corresponds to assuming the seasonal component is expected to sum to zero over a complete period, or four quarters. To express this model in state-space form, let $x_t' = (T_t, S_t, S_{t-1}, S_{t-2})$ be the state vector, so the observation equation (6.2) can be written as
$$y_t = \begin{pmatrix} 1 & 1 & 0 & 0 \end{pmatrix}\begin{pmatrix} T_t \\ S_t \\ S_{t-1} \\ S_{t-2} \end{pmatrix} + v_t,$$
with the state equation written as
$$\begin{pmatrix} T_t \\ S_t \\ S_{t-1} \\ S_{t-2} \end{pmatrix} = \begin{pmatrix} \phi & 0 & 0 & 0 \\ 0 & -1 & -1 & -1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}\begin{pmatrix} T_{t-1} \\ S_{t-1} \\ S_{t-2} \\ S_{t-3} \end{pmatrix} + \begin{pmatrix} w_{t1} \\ w_{t2} \\ 0 \\ 0 \end{pmatrix},$$
where $R = r_{11}$ and
$$Q = \begin{pmatrix} q_{11} & 0 & 0 & 0 \\ 0 & q_{22} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.$$

The model reduces to state-space form, (6.1) and (6.2), with $p = 4$ and $q = 1$. The parameters to be estimated are $r_{11}$, the noise variance in the measurement equation, $q_{11}$ and $q_{22}$, the model variances corresponding to the trend and seasonal components, and $\phi$, the transition parameter that models the growth rate. Growth is about 3% per year, and we began with $\phi = 1.03$. The initial mean was fixed at $\mu_0 = (.5, .3, .2, .1)'$, with uncertainty modeled by the diagonal covariance matrix with $\Sigma_{0ii} = .01$, for $i = 1, \ldots, 4$. Initial state covariance values were taken as $q_{11} = .01$ and $q_{22} = .10$, corresponding to relatively low uncertainty in the trend model compared with that in the seasonal model. The measurement error covariance was started at $r_{11} = .04$. After 70 iterations of the EM algorithm, the transition parameter stabilized at $\widehat\phi = 1.035$, corresponding to exponential growth with inflation at about 3.5% per year. The measurement uncertainty was small at $\widehat r_{11} = .0086$, compared with the model uncertainties $\widehat q_{11} = .0169$ and $\widehat q_{22} = .0497$. From the initial guesses, the trend uncertainty increased and the seasonal uncertainty decreased. Figure 6.6 shows the smoothed trend estimate and the exponentially increasing seasonal components. We may also consider forecasting the Johnson & Johnson series, and the result of a 12-quarter forecast is shown in Figure 6.7 as basically an extension of the latter part of the observed data.
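For reference, a small R sketch of the state-space matrices of this structural model at the starting values given above; the object names are illustrative and this is not the text's program.

  # sketch: system matrices for the trend plus seasonal model of Example 6.10
  phi <- 1.03                                       # starting growth parameter
  Phi <- rbind(c(phi, 0,  0,  0),
               c(0,  -1, -1, -1),
               c(0,   1,  0,  0),
               c(0,   0,  1,  0))
  A      <- matrix(c(1, 1, 0, 0), 1, 4)             # observation matrix
  Q      <- diag(c(.01, .10, 0, 0))                 # q11, q22; remaining rows zero
  R      <- matrix(.04)                             # r11
  mu0    <- c(.5, .3, .2, .1)
  Sigma0 <- diag(.01, 4)

These objects would be passed to the Kalman filter/smoother and updated by the EM iterations described in §6.3.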


Figure 6.7 A 12-quarter forecast for the Johnson & Johnson quarterly earnings series. The last three years of data (quarters 72-84) are shown as points connected by a solid line. The forecasts are shown as points connected by a solid line (quarters 85-96), and dotted lines are upper and lower 95% prediction intervals.

6.6 ARMAX Models in State-Space Form

Sometimes, it is advantageous to write the state-space model in a slightly different way, as is done by numerous authors; for example, Anderson and Moore (1979) and Hannan and Deistler (1988). Here, we write the state-space model as
$$x_{t+1} = \Phi x_t + \Upsilon u_t + w_t, \quad t = 0, 1, \ldots, n, \tag{6.95}$$
$$y_t = A_tx_t + \Gamma u_t + v_t, \quad t = 1, \ldots, n, \tag{6.96}$$
where, in the state equation, $x_0 \sim N(\mu_0, \Sigma_0)$, $\Phi$ is $p \times p$, and $\Upsilon$ is $p \times r$. In the observation equation, $A_t$ is $q \times p$ and $\Gamma$ is $q \times r$. Now, $w_t$ and $v_t$ are still white noise series (both independent of $x_0$), with $\operatorname{var}(w_t) = Q$, $\operatorname{var}(v_t) = R$, but we also allow the state noise and observation noise to be correlated at time t; that is,
$$\operatorname{cov}(w_t, v_t) = E(w_tv_t') = S \tag{6.97}$$
and zero otherwise; note, $S$ is a $p \times q$ matrix. To obtain the innovations, $\varepsilon_t = y_t - A_tx_t^{t-1} - \Gamma u_t$, and the innovation variance $\Sigma_t = A_tP_t^{t-1}A_t' + R$, in this case, we need the one-step-ahead state predictions. Of course, the filtered estimates will also be of interest, and they will be needed for smoothing. Property P6.2 (the smoother) as displayed in §6.2 still holds. The following


property generates the predictor $x_{t+1}^t$ from the past predictor $x_t^{t-1}$ when the noise terms are correlated and exhibits the filter update.

Property P6.5: The Kalman Filter with Correlated State and Measurement Noise
For the state-space model specified in (6.95) and (6.96), with initial conditions $x_1^0$ and $P_1^0$, for $t = 1, \ldots, n$,

$$x_{t+1}^t = \Phi x_t^{t-1} + \Upsilon u_t + K_t^*\big(y_t - A_tx_t^{t-1} - \Gamma u_t\big), \tag{6.98}$$
$$P_{t+1}^t = [\Phi - K_t^*A_t]P_t^{t-1}[\Phi - K_t^*A_t]' + Q + K_t^*RK_t^{*\prime} - SK_t^{*\prime} - K_t^*S', \tag{6.99}$$
where the new gain matrix is given by
$$K_t^* = [\Phi P_t^{t-1}A_t' + S][A_tP_t^{t-1}A_t' + R]^{-1}. \tag{6.100}$$

The filter update, given a new observation $y_{t+1}$ and input $u_{t+1}$, is given by
$$x_{t+1}^{t+1} = x_{t+1}^t + P_{t+1}^tA_{t+1}'\big[A_{t+1}P_{t+1}^tA_{t+1}' + R\big]^{-1}\varepsilon_{t+1}, \tag{6.101}$$
$$P_{t+1}^{t+1} = P_{t+1}^t - P_{t+1}^tA_{t+1}'\big[A_{t+1}P_{t+1}^tA_{t+1}' + R\big]^{-1}A_{t+1}P_{t+1}^t. \tag{6.102}$$

The derivation of Property P6.5 is similar to the derivation of the Kalman filter in Property P6.1 (Problem 6.17). Note, (6.101) and (6.102) are identical to (6.19) and (6.20).

Consider a p-dimensional ARMAX model given by
$$y_t = \Gamma u_t + \sum_{j=1}^{m}\Phi_jy_{t-j} + \sum_{k=1}^{q}\Theta_kv_{t-k} + v_t. \tag{6.103}$$
The $\Phi$s and $\Theta$s are $p \times p$ matrices, $\Gamma$ is $p \times r$, and $v_t$ is a $p \times 1$ white noise process; in fact, (6.103) and (5.84) are identical models, but here, we have written the observations as $y_t$. We now have the following property.

Property P6.6: A State-Space Form of ARMAX
For $m \geq q$, the state-space model given by

$$x_{t+1} = \begin{bmatrix} \Phi_1 & I & 0 & \cdots & 0 \\ \Phi_2 & 0 & I & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \Phi_{m-1} & 0 & 0 & \cdots & I \\ \Phi_m & 0 & 0 & \cdots & 0 \end{bmatrix}x_t + \begin{bmatrix} \Theta_1 + \Phi_1 \\ \vdots \\ \Theta_q + \Phi_q \\ \Phi_{q+1} \\ \vdots \\ \Phi_m \end{bmatrix}v_t, \tag{6.104}$$
$$y_t = [\,I,\ 0,\ \cdots,\ 0\,]\,x_t + \Gamma u_t + v_t, \tag{6.105}$$
implies the ARMAX model (6.103). The state process, $x_t$, is $pm \times 1$, and the observation process $y_t$ is $p \times 1$. If $m < q$, set $\Phi_{m+1} = \cdots = \Phi_q = 0$, in which case $m = q$ and (6.104)–(6.105) still apply.


This form of the model is somewhat different than the form suggested in §6.1, equations (6.6)–(6.8). For example, in (6.8), setting $A_t$ equal to the $p \times p$ identity matrix (for all t) and setting $R = 0$ implies the data $y_t$ in (6.8) follow a VAR(m) process. In doing so, however, we do not make use of the ability to allow for correlated state and observation error, so a singularity is introduced into the system in the form of $R = 0$. The method in Property P6.6 avoids that problem and points out the fact that the same model can take many forms. We do not prove Property P6.6 directly, but the following example should suggest how to establish the general result.

Example 6.11 Univariate ARMA(1, 1) in State-Space Form

Consider the univariate ARMA(1,1) model $y_t = \phi y_{t-1} + \theta v_{t-1} + v_t$. Using Property P6.6, we can write the model as
$$x_{t+1} = \phi x_t + w_t \quad \text{(state eqn)}, \tag{6.106}$$
where $w_t = (\theta + \phi)v_t$, and
$$y_t = x_t + v_t \quad \text{(obs eqn)}. \tag{6.107}$$
In this case, $\operatorname{cov}(w_t, v_t) = (\theta + \phi)\operatorname{var}(v_t) = (\theta + \phi)R$, and $\operatorname{cov}(w_t, v_s) = 0$ when $s \neq t$, so Property P6.5 would apply. To verify (6.106) and (6.107) specify an ARMA(1,1) model, we have
$$
\begin{aligned}
y_t &= x_t + v_t && \text{from (6.107)} \\
    &= \phi x_{t-1} + (\theta + \phi)v_{t-1} + v_t && \text{from (6.106)} \\
    &= \phi(x_{t-1} + v_{t-1}) + \theta v_{t-1} + v_t && \\
    &= \phi y_{t-1} + \theta v_{t-1} + v_t && \text{from (6.107)}.
\end{aligned}
$$
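As an illustrative numerical check (not part of the text), simulating the state-space form (6.106)–(6.107) with the same noise sequence used in the ARMA(1,1) recursion reproduces the series exactly:

  # sketch: the state-space form generates the same series as the ARMA(1,1) recursion
  set.seed(2)
  n <- 500; phi <- .6; theta <- .4
  v <- rnorm(n)
  x <- numeric(n + 1); y <- numeric(n)
  for (t in 1:n) {
    y[t]     <- x[t] + v[t]                        # (6.107)
    x[t + 1] <- phi*x[t] + (theta + phi)*v[t]      # (6.106), w_t = (theta + phi) v_t
  }
  y2 <- numeric(n); y2[1] <- y[1]
  for (t in 2:n) y2[t] <- phi*y2[t - 1] + theta*v[t - 1] + v[t]
  max(abs(y - y2))                                 # zero: the two forms agree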

Properties P6.5 and P6.6, together, can be used to accomplish maximum likelihood estimation for ARMAX models. In this case, the likelihood would be in the innovations form given in Chapter 3, equation (3.106), or equivalently (6.62), and estimation could be accomplished using Newton–Raphson or the EM algorithm as described in §6.3.

6.7 Bootstrapping State-Space Models

Although, in §6.3, we discussed the fact that, under general conditions (which we assume to hold in this section), the MLEs of the parameters of a DLM are consistent and asymptotically normal, time series data are often of short or moderate length. Several researchers have found evidence that samples must be fairly large before asymptotic results are applicable (Dent and Min, 1978; Ansley and Newbold, 1980). Moreover, as we discussed in Example 3.31, problems occur if the parameters are near the boundary of the parameter space. In this section, we discuss an algorithm for bootstrapping state-space models; this algorithm and its justification, including the non-Gaussian case, along with numerous examples, can be found in Stoffer and Wall (1991) and in Stoffer and Wall (2004). In view of §6.6, anything we do or say here about DLMs applies equally to ARMAX models.

Using the DLM given by (6.95)–(6.97) and Property P6.5, we write the innovations form of the filter as
$$
\begin{aligned}
\varepsilon_t &= y_t - A_tx_t^{t-1} - \Gamma u_t, & (6.108)\\
\Sigma_t &= A_tP_t^{t-1}A_t' + R, & (6.109)\\
x_{t+1}^t &= \Phi x_t^{t-1} + \Upsilon u_t + K_t\varepsilon_t, & (6.110)\\
K_t &= [\Phi P_t^{t-1}A_t' + S]\Sigma_t^{-1}, & (6.111)\\
P_{t+1}^t &= \Phi P_t^{t-1}\Phi' + Q - K_t\Sigma_tK_t'. & (6.112)
\end{aligned}
$$

This form of the filter is just a rearrangement of the filter given in Property P6.5; we have dropped the * in the new form of the gain matrix.

In addition, we can rewrite the model to obtain the innovations form of the model,
$$x_{t+1}^t = \Phi x_t^{t-1} + \Upsilon u_t + K_t\varepsilon_t, \tag{6.113}$$
$$y_t = A_tx_t^{t-1} + \Gamma u_t + \varepsilon_t. \tag{6.114}$$

This form of the model is a rewriting of (6.108) and (6.110), and it accommodates the bootstrapping algorithm.

As discussed in Example 6.5, although the innovations $\varepsilon_t$ are uncorrelated, initially, $\Sigma_t$ can be different for different time points t. Thus, in a resampling procedure, we can either ignore the first few values of $\varepsilon_t$ until $\Sigma_t$ stabilizes or we can work with the standardized innovations
$$e_t = \Sigma_t^{-1/2}\varepsilon_t, \tag{6.115}$$
so we are guaranteed these innovations have, at least, the same first two moments. In (6.115), $\Sigma_t^{1/2}$ denotes the unique square root matrix of $\Sigma_t$ defined by $\Sigma_t^{1/2}\Sigma_t^{1/2} = \Sigma_t$. In what follows, we base the bootstrap procedure on the standardized innovations, but we stress the fact that, even in this case, ignoring startup values might be necessary, as noted by Stoffer and Wall (1991).

The model coefficients and the correlation structure of the model are uniquely parameterized by a $k \times 1$ parameter vector $\Theta_0$; that is, $\Phi = \Phi(\Theta_0)$, $\Upsilon = \Upsilon(\Theta_0)$, $Q = Q(\Theta_0)$, $A_t = A_t(\Theta_0)$, $\Gamma = \Gamma(\Theta_0)$, and $R = R(\Theta_0)$. Recall the innovations form of the Gaussian likelihood (ignoring a constant) is
$$
\begin{aligned}
-2\ln L_Y(\Theta) &= \sum_{t=1}^n\big[\ln|\Sigma_t(\Theta)| + \varepsilon_t(\Theta)'\Sigma_t(\Theta)^{-1}\varepsilon_t(\Theta)\big] \\
&= \sum_{t=1}^n\big[\ln|\Sigma_t(\Theta)| + e_t(\Theta)'e_t(\Theta)\big]. \tag{6.116}
\end{aligned}
$$


We stress the fact that it is not necessary for the model to be Gaussian to consider (6.116) as the criterion function to be used for parameter estimation.

Let $\widehat\Theta$ denote the MLE of $\Theta_0$, that is, $\widehat\Theta = \operatorname{argmax}_\Theta L_Y(\Theta)$, obtained by the methods discussed in §6.3. Let $\varepsilon_t(\widehat\Theta)$ and $\Sigma_t(\widehat\Theta)$ be the innovation values obtained by running the filter, (6.108)–(6.112), under $\widehat\Theta$. Once this has been done, the bootstrap procedure is accomplished by the following steps.

1. Construct the standardized innovations
$$e_t(\widehat\Theta) = \Sigma_t^{-1/2}(\widehat\Theta)\varepsilon_t(\widehat\Theta).$$

2. Sample, with replacement, n times from the set $\{e_1(\widehat\Theta), \ldots, e_n(\widehat\Theta)\}$ to obtain $\{e_1^*(\widehat\Theta), \ldots, e_n^*(\widehat\Theta)\}$, a bootstrap sample of standardized innovations.

3. Construct a bootstrap data set $\{y_1^*, \ldots, y_n^*\}$ as follows. Define the $(p+q) \times 1$ vector $\xi_t = (x_{t+1}^{t\prime}, y_t')'$. Stacking (6.113) and (6.114) results in a vector first-order equation for $\xi_t$ given by
$$\xi_t = F_t\xi_{t-1} + Gu_t + H_te_t, \tag{6.117}$$
where
$$F_t = \begin{bmatrix} \Phi & 0 \\ A_t & 0 \end{bmatrix}, \quad G = \begin{bmatrix} \Upsilon \\ \Gamma \end{bmatrix}, \quad H_t = \begin{bmatrix} K_t\Sigma_t^{1/2} \\ \Sigma_t^{1/2} \end{bmatrix}.$$
Thus, to construct the bootstrap data set, solve (6.117) using $e_t^*(\widehat\Theta)$ in place of $e_t$. The exogenous variables $u_t$ and the initial conditions of the Kalman filter remain fixed at their given values, and the parameter vector is held fixed at $\widehat\Theta$.

4. Using the bootstrap data set $\{y_t^*;\ t = 1, \ldots, n\}$, construct a likelihood, $L_{Y^*}(\Theta)$, and obtain the MLE of $\Theta$, say, $\widehat\Theta^*$.

5. Repeat steps 2 through 4, a large number, B, of times, obtaining a bootstrapped set of parameter estimates $\{\widehat\Theta_b^*;\ b = 1, \ldots, B\}$. The finite sample distribution of $\widehat\Theta - \Theta_0$ may be approximated by the distribution of $\widehat\Theta_b^* - \widehat\Theta$, $b = 1, \ldots, B$.

In the next example, we discuss the case of a linear regression model, but where the regression coefficients are stochastic and allowed to vary with time. The state-space model provides a convenient setting for the analysis of such models.

Example 6.12 Stochastic Regression

Figure 6.8 shows the interest rate recorded for three-month treasury bills (line–squares), $y_t$, and the quarterly inflation rate (dotted line–circles) in the Consumer Price Index, $z_t$, from the first quarter of 1953 through the second quarter of 1965, $n = 50$ observations. These data were analyzed by Newbold and Bos (1985, pp. 61-73).

Figure 6.8 Interest rate for three-month treasury bills (line–squares) and quarterly inflation rate (dotted line–circles) in the Consumer Price Index, 1953:1 to 1965:2.

In this analysis, the treasury bill interest rate is modeled as being linearly related to quarterly inflation as
$$y_t = \alpha + \beta_tz_t + v_t,$$
where $\alpha$ is a fixed constant, $\beta_t$ is a stochastic regression coefficient, and $v_t$ is white noise with variance $\sigma_v^2$. The stochastic regression term, which comprises the state variable, is specified by a first-order autoregression,
$$(\beta_t - b) = \phi(\beta_{t-1} - b) + w_t,$$
where $b$ is a constant, and $w_t$ is white noise with variance $\sigma_w^2$. The noise processes, $v_t$ and $w_t$, are assumed to be uncorrelated.

Using the notation of the state-space model (6.95) and (6.96), we have, in the state equation, $x_t = \beta_t$, $\Phi = \phi$, $u_t \equiv 1$, $\Upsilon = (1-\phi)b$, $Q = \sigma_w^2$, and, in the observation equation, $A_t = z_t$, $\Gamma = \alpha$, $R = \sigma_v^2$, and $S = 0$. The parameter vector is $\Theta = (\phi, \alpha, b, \sigma_w, \sigma_v)'$. The results of the Newton–Raphson estimation procedure are listed in Table 6.2. Also shown in Table 6.2 are the corresponding standard errors obtained from $B = 500$ runs of the bootstrap. These standard errors are simply


Figure 6.9 Bootstrap distribution, B = 500, of the estimator of φ.

Figure 6.10 Bootstrap distribution, B = 500, of the estimator of σw.

the standard deviations of the bootstrapped estimates, that is, the square root of $\sum_{b=1}^B(\Theta_{ib}^* - \Theta_i^*)^2/(B-1)$, where $\Theta_i$ represents the ith parameter, $i = 1, \ldots, 5$, and $\Theta_i^* = \sum_{b=1}^B\Theta_{ib}^*/B$.

The asymptotic standard errors listed in Table 6.2 are typically smaller than those obtained from the bootstrap. This result is most pronounced in the estimates of $\phi$, $\sigma_w$, and $\sigma_v$, where the bootstrapped standard errors are about 50% larger than the corresponding asymptotic values. Also, asymptotic theory prescribes the use of normal theory when dealing with the parameter estimates. The bootstrap, however, allows us to investigate the small sample distribution of the estimators and, hence, provides more insight into the data analysis.


Table 6.2 Comparison of Asymptotic Standard Errors and Bootstrapped Standard Errors (B = 500)

Parameter      MLE     Asymptotic Standard Error   Bootstrap Standard Error
φ              .841            .200                        .304
α             -.771            .645                        .645
b              .855            .278                        .277
σw             .127            .092                        .182
σv            1.131            .142                        .217

For example, Figure 6.9 shows the bootstrap distribution of the estimator of $\phi$. This distribution is highly skewed with values concentrated around .8, but with a long tail to the left. Some quantiles of the bootstrapped distribution of $\widehat\phi$ are -.09 (2.5%), .03 (5%), .16 (10%), .87 (90%), .92 (95%), .94 (97.5%), and they can be used to obtain confidence intervals. For example, a 90% confidence interval for $\phi$ would be approximated by (.03, .92). This interval is rather wide, and we will interpret this after we discuss the results of the estimation of $\sigma_w$.

Figure 6.10 shows the bootstrap distribution of $\widehat\sigma_w$. The distribution is concentrated at two locations, one at approximately $\widehat\sigma_w = .15$ and the other at $\widehat\sigma_w = 0$. The cases in which $\widehat\sigma_w \approx 0$ correspond to deterministic state dynamics. When $\sigma_w = 0$ and $|\phi| < 1$, then $\beta_t \approx b$ for large t, so the approximately 25% of the cases in which $\widehat\sigma_w \approx 0$ suggest a fixed state, or constant coefficient, model. The cases in which $\widehat\sigma_w$ is away from zero would suggest a truly stochastic regression parameter. To investigate this matter further, Figure 6.11 shows the joint bootstrapped estimates, $(\widehat\phi, \widehat\sigma_w)$, for positive values of $\widehat\phi$. The joint distribution suggests $\widehat\sigma_w > 0$ corresponds to $\widehat\phi \approx 0$. When $\phi = 0$, the state dynamics are given by $\beta_t = b + w_t$. If, in addition, $\sigma_w$ is small relative to $b$, the system is nearly deterministic; that is, $\beta_t \approx b$. Considering these results, the bootstrap analysis leads us to conclude the dynamics of the data are best described in terms of a fixed regression effect.

6.8 Dynamic Linear Models with Switching

Figure 6.11 Joint bootstrap distribution, B = 500, of the estimators of $\phi$ and $\sigma_w$. Only the values corresponding to $\phi^* \geq 0$ are shown.

The problem of modeling changes in regimes for vector-valued time series has been of interest in many different fields. In §5.4, we explored the idea that the dynamics of the system of interest might change over the course of time. In Example 5.5, we saw that pneumonia and influenza mortality rates behave differently when a flu epidemic occurs than when no epidemic occurs. As another example, some authors (for example, Hamilton, 1989, or McCulloch and Tsay, 1993) have explored the possibility the dynamics of the quarterly U.S. GNP series (say, $y_t$) analyzed in Example 3.33 might be different during expansion ($\nabla\log y_t > 0$) than during contraction ($\nabla\log y_t < 0$). In this section, we will concentrate on the method presented in Shumway and Stoffer (1991). One way of modeling change in an evolving time series is by assuming the dynamics of some underlying model changes discontinuously at certain undetermined points in time. Our starting point is the DLM given by (6.1) and (6.2), namely,

$$x_t = \Phi x_{t-1} + w_t, \tag{6.118}$$
to describe the $p \times 1$ state dynamics, and
$$y_t = A_tx_t + v_t \tag{6.119}$$
to describe the $q \times 1$ observation dynamics. Recall $w_t$ and $v_t$ are Gaussian white noise sequences with $\operatorname{var}(w_t) = Q$, $\operatorname{var}(v_t) = R$, and $\operatorname{cov}(w_t, v_s) = 0$ for all s and t.

Generalizations of (6.118) and (6.119) to include the possibility of changes occurring over time have been approached by allowing changes in the error covariances (Harrison and Stevens, 1976; Gordon and Smith, 1988, 1990) or by assigning mixture distributions to the observation errors $v_t$ (Pena and Guttman, 1988). Approximations to filtering were derived in all of the aforementioned articles. An application to monitoring renal transplants was described in Smith and West (1983) and in Gordon and Smith (1990). Changes can also be modeled in the classical regression case by allowing switches in the design matrix, as in Quandt (1972).


Switching via a stationary Markov chain with independent observations has been developed by Lindgren (1978) and Goldfeld and Quandt (1973). In the Markov chain approach, we declare the dynamics of the system at time t is generated by one of m possible regimes evolving according to a Markov chain over time. As a simple example, suppose the dynamics of a univariate time series, $y_t$, is generated by either the model (1) $y_t = \beta_1y_{t-1} + w_t$ or the model (2) $y_t = \beta_2y_{t-1} + w_t$. We will write the model as $y_t = \phi_ty_{t-1} + w_t$ such that $\Pr(\phi_t = \beta_j) = \pi_j$, $j = 1, 2$, $\pi_1 + \pi_2 = 1$, and with the Markov property
$$\Pr(\phi_t = \beta_j \mid \phi_{t-1} = \beta_i, \phi_{t-2} = \beta_{i_2}, \ldots) = \Pr(\phi_t = \beta_j \mid \phi_{t-1} = \beta_i) = \pi_{ij},$$
for $i, j = 1, 2$ (and $i_2, \ldots = 1, 2$). As previously mentioned, Markov switching for dependent data has been applied by Hamilton (1989) to detect changes between positive and negative growth periods in the economy. Applications to speech recognition have been considered by Juang and Rabiner (1985). The case in which the particular regime is unknown to the observer comes under the heading of hidden Markov models, and the techniques related to analyzing these models are summarized in Rabiner and Juang (1986). An application of the idea of switching to the tracking of multiple targets has been considered in Bar-Shalom (1978), who obtained approximations to Kalman filtering in terms of weighted averages of the innovations.

Example 6.13 Tracking Multiple Targets

The approach of Shumway and Stoffer (1991) was motivated primarily by the problem of tracking a large number of moving targets using a vector $y_t$ of sensors. In this problem, we do not know at any given point in time which target any given sensor has detected. Hence, it is the structure of the measurement matrix $A_t$ in (6.119) that is changing, and not the dynamics of the signal $x_t$ or the noises, $w_t$ or $v_t$. As an example, consider a $3 \times 1$ vector of satellite measurements $y_t = (y_{t1}, y_{t2}, y_{t3})'$ that are observations on some combination of a $3 \times 1$ vector of targets or signals, $x_t = (x_{t1}, x_{t2}, x_{t3})'$. For the measurement matrix
$$A_t = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix}$$
in the model (6.119), all sensors are observing the first target, $x_{t1}$, whereas for the measurement matrix
$$A_t = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
the first sensor, $y_{t1}$, observes the second target, $x_{t2}$; the second sensor, $y_{t2}$, observes the first target, $x_{t1}$; and the third sensor, $y_{t3}$, observes the


third target, $x_{t3}$. All possible detection configurations will define a set of possible values for $A_t$, say, $\{M_1, M_2, \ldots, M_m\}$, as a collection of plausible measurement matrices.

Example 6.14 Modeling Economic Change

As another example of the switching model presented in this section, consider the case in which the dynamics of the linear model changes suddenly over the history of a given realization. For example, Lam (1990) has given the following generalization of Hamilton's (1989) model for detecting positive and negative growth periods in the economy. Suppose the data are generated by
$$y_t = z_t + n_t, \tag{6.120}$$
where $z_t$ is an autoregressive series and $n_t$ is a random walk with a drift that switches between two values $\alpha_0$ and $\alpha_0 + \alpha_1$. That is,
$$n_t = n_{t-1} + \alpha_0 + \alpha_1S_t, \tag{6.121}$$
with $S_t = 0$ or $1$, depending on whether the system is in state 1 or state 2. For the purpose of illustration, suppose
$$z_t = \phi_1z_{t-1} + \phi_2z_{t-2} + w_t \tag{6.122}$$
is an AR(2) series with $\operatorname{var}(w_t) = \sigma_w^2$. Lam (1990) wrote (6.120) in a differenced form
$$\nabla y_t = z_t - z_{t-1} + \alpha_0 + \alpha_1S_t, \tag{6.123}$$
which we may take as the observation equation (6.119) with state vector
$$x_t = (z_t, z_{t-1}, \alpha_0, \alpha_1)' \tag{6.124}$$
and
$$M_1 = [1, -1, 1, 0] \quad \text{and} \quad M_2 = [1, -1, 1, 1] \tag{6.125}$$
determining the two possible economic conditions. The state equation, (6.118), is of the form
$$\begin{pmatrix} z_t \\ z_{t-1} \\ \alpha_0 \\ \alpha_1 \end{pmatrix} = \begin{bmatrix} \phi_1 & \phi_2 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\begin{pmatrix} z_{t-1} \\ z_{t-2} \\ \alpha_0 \\ \alpha_1 \end{pmatrix} + \begin{pmatrix} w_t \\ 0 \\ 0 \\ 0 \end{pmatrix}. \tag{6.126}$$
The observation equation, (6.119), in this case is
$$\nabla y_t = A_tx_t + v_t, \tag{6.127}$$
where $\Pr(A_t = M_1) = 1 - \Pr(A_t = M_2)$, with $M_1$ and $M_2$ given in (6.125).


To incorporate a reasonable switching structure for the measurement matrix into the DLM that is compatible with both practical situations previously described, we assume that the m possible configurations are states in a nonstationary, independent process defined by the time-varying probabilities
$$\pi_j(t) = \Pr(A_t = M_j), \tag{6.128}$$
for $j = 1, \ldots, m$ and $t = 1, 2, \ldots, n$. Important information about the current state of the measurement process is given by the filtered probabilities of being in state j, defined as the conditional probabilities
$$\pi_j(t|t) = \Pr(A_t = M_j \mid Y_t), \tag{6.129}$$
which also vary as a function of time. In (6.129), we have used the notation $Y_s = \{y_1, \ldots, y_s\}$. The filtered probabilities (6.129) give the time-varying estimates of the probability of being in state j given the data to time t.

It will be important for us to obtain estimators of the configuration probabilities, $\pi_j(t|t)$, the predicted and filtered state estimators, $x_t^{t-1}$ and $x_t^t$, and the corresponding error covariance matrices $P_t^{t-1}$ and $P_t^t$. Of course, the predictor and filter estimators will depend on the parameters, $\Theta$, of the DLM. In many situations, the parameters will be unknown and we will have to estimate them. Our focus will be on maximum likelihood estimation, but other authors have taken a Bayesian approach that assigns priors to the parameters, and then seeks posterior distributions of the model parameters; see, for example, Gordon and Smith (1990), Pena and Guttman (1988), or McCulloch and Tsay (1993).

We now establish the recursions for the filters associated with the state $x_t$ and the switching process, $A_t$. As discussed in §6.3, the filters are also an essential part of the maximum likelihood procedure. The predictors, $x_t^{t-1} = E(x_t|Y_{t-1})$, and filters, $x_t^t = E(x_t|Y_t)$, and their associated error variance–covariance matrices, $P_t^{t-1}$ and $P_t^t$, are given by
$$
\begin{aligned}
x_t^{t-1} &= \Phi x_{t-1}^{t-1}, & (6.130)\\
P_t^{t-1} &= \Phi P_{t-1}^{t-1}\Phi' + Q, & (6.131)\\
x_t^t &= x_t^{t-1} + \sum_{j=1}^m\pi_j(t|t)K_{tj}\varepsilon_{tj}, & (6.132)\\
P_t^t &= \sum_{j=1}^m\pi_j(t|t)(I - K_{tj}M_j)P_t^{t-1}, & (6.133)\\
K_{tj} &= P_t^{t-1}M_j'\Sigma_{tj}^{-1}, & (6.134)
\end{aligned}
$$
where the innovation values in (6.132) and (6.134) are
$$\varepsilon_{tj} = y_t - M_jx_t^{t-1}, \tag{6.135}$$


$$\Sigma_{tj} = M_jP_t^{t-1}M_j' + R, \tag{6.136}$$
for $j = 1, \ldots, m$.

Equations (6.130)–(6.134) exhibit the filter values as weighted linear combinations of the m innovation values, (6.135)–(6.136), corresponding to each of the possible measurement matrices. The equations are similar to the approximations introduced by Bar-Shalom and Tse (1975), by Gordon and Smith (1990), and by Pena and Guttman (1988).

To verify (6.132), let the indicator $I(A_t = M_j) = 1$ when $A_t = M_j$, and zero otherwise. Then, using (6.21),
$$
\begin{aligned}
x_t^t = E(x_t|Y_t) &= E\big[E(x_t|Y_t, A_t) \mid Y_t\big] \\
&= E\Big\{\sum_{j=1}^m E(x_t|Y_t, A_t = M_j)I(A_t = M_j) \mid Y_t\Big\} \\
&= E\Big\{\sum_{j=1}^m\big[x_t^{t-1} + K_{tj}(y_t - M_jx_t^{t-1})\big]I(A_t = M_j) \mid Y_t\Big\} \\
&= \sum_{j=1}^m\pi_j(t|t)\big[x_t^{t-1} + K_{tj}(y_t - M_jx_t^{t-1})\big],
\end{aligned}
$$
where $K_{tj}$ is given by (6.134). Equation (6.133) is derived in a similar fashion; the other relationships, (6.130), (6.131), and (6.134), follow from straightforward applications of the Kalman filter results given in Property P6.1.

Next, we derive the filters $\pi_j(t|t)$. Let $f_j(t|t-1)$ denote the conditional density of $y_t$ given the past $y_1, \ldots, y_{t-1}$, and $A_t = M_j$, for $j = 1, \ldots, m$. Then,

$$\pi_j(t|t) = \frac{\pi_j(t)f_j(t|t-1)}{\sum_{k=1}^m\pi_k(t)f_k(t|t-1)}, \tag{6.137}$$
where we assume the distribution $\pi_j(t)$, for $j = 1, \ldots, m$, has been specified before observing $y_1, \ldots, y_t$ (details follow as in Example 6.15 below). If the investigator has no reason to prefer one state over another at time t, the choice of uniform priors, $\pi_j(t) = m^{-1}$, for $j = 1, \ldots, m$, will suffice. Smoothness can be introduced by letting
$$\pi_j(t) = \sum_{i=1}^m\pi_i(t-1|t-1)\pi_{ij}, \tag{6.138}$$
where the non-negative weights $\pi_{ij}$ are chosen so $\sum_{i=1}^m\pi_{ij} = 1$. If the $A_t$ process was Markov with transition probabilities $\pi_{ij}$, then (6.138) would be the update for the filter probability, as shown in the next example.


Example 6.15 Hidden Markov Chain Model

If $\{A_t\}$ is a hidden Markov chain with stationary transition probabilities $\pi_{ij} = \Pr(A_t = M_j|A_{t-1} = M_i)$, for $i, j = 1, \ldots, m$, letting $p(\cdot)$ denote a generic probability function, we have
$$
\begin{aligned}
\pi_j(t|t) &= \frac{p(A_t = M_j, y_t, Y_{t-1})}{p(y_t, Y_{t-1})} \\
&= \frac{p(Y_{t-1})\,p(A_t = M_j \mid Y_{t-1})\,p(y_t \mid A_t = M_j, Y_{t-1})}{p(Y_{t-1})\,p(y_t \mid Y_{t-1})} \\
&= \frac{\pi_j(t|t-1)f_j(t|t-1)}{\sum_{k=1}^m\pi_k(t|t-1)f_k(t|t-1)}. \tag{6.139}
\end{aligned}
$$
In the Markov case, the conditional probabilities
$$\pi_j(t|t-1) = \Pr(A_t = M_j \mid Y_{t-1})$$
in (6.139) replace the unconditional probabilities, $\pi_j(t) = \Pr(A_t = M_j)$, in (6.137).

To evaluate (6.139), we must be able to calculate $\pi_j(t|t-1)$ and $f_j(t|t-1)$. We will discuss the calculation of $f_j(t|t-1)$ after this example. To derive $\pi_j(t|t-1)$, note,

$$
\begin{aligned}
\pi_j(t|t-1) &= \Pr(A_t = M_j \mid Y_{t-1}) \\
&= \sum_{i=1}^m\Pr(A_t = M_j, A_{t-1} = M_i \mid Y_{t-1}) \\
&= \sum_{i=1}^m\Pr(A_t = M_j \mid A_{t-1} = M_i)\Pr(A_{t-1} = M_i \mid Y_{t-1}) \\
&= \sum_{i=1}^m\pi_{ij}\,\pi_i(t-1|t-1). \tag{6.140}
\end{aligned}
$$
Expression (6.138) comes from equation (6.140), where, as previously noted, we replace $\pi_j(t|t-1)$ by $\pi_j(t)$.
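A small R sketch of one step of the probability recursions (6.139)–(6.140), using illustrative values for the transition matrix, the previous filtered probabilities, and the conditional densities:

  # sketch: predict and update the regime probabilities for one time point
  Pi      <- rbind(c(.9, .1),        # pi_ij = Pr(A_t = M_j | A_{t-1} = M_i)
                   c(.3, .7))
  pi.filt <- c(.8, .2)               # pi_i(t-1 | t-1)
  f       <- c(.05, .40)             # f_j(t | t-1) evaluated at the new y_t
  pi.pred <- as.vector(pi.filt %*% Pi)          # (6.140): pi_j(t | t-1)
  pi.filt.new <- pi.pred*f/sum(pi.pred*f)       # (6.139): pi_j(t | t)
  pi.filt.new

In the non-Markov formulation of this section, pi.pred would simply be replaced by the prior probabilities $\pi_j(t)$ of (6.138).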

The difficulty in extending the approach here to the Markov case is the dependence among the $y_t$, which makes it necessary to enumerate over all possible histories to derive the filtering equations. This problem will be evident when we derive the conditional density $f_j(t|t-1)$. Equation (6.138) has $\pi_j(t)$ as a function of the past observations, $Y_{t-1}$, which is inconsistent with our model assumption. Nevertheless, this seems to be a reasonable compromise that allows the data to modify the probabilities $\pi_j(t)$, without having to develop a highly computer-intensive technique.

As previously suggested, the computation of $f_j(t|t-1)$, without some approximations, is highly computer-intensive. To evaluate $f_j(t|t-1)$, consider the event
$$A_1 = M_{j_1}, \ldots, A_{t-1} = M_{j_{t-1}}, \tag{6.141}$$


for $j_i = 1, \ldots, m$, and $i = 1, \ldots, t-1$, which specifies a specific set of measurement matrices through the past; we will write this event as $A_{(t-1)} = M_{(\ell)}$. Because $m^{t-1}$ possible outcomes exist for $A_1, \ldots, A_{t-1}$, the index $\ell$ runs through $\ell = 1, \ldots, m^{t-1}$. Using this notation, we may write
$$
\begin{aligned}
f_j(t|t-1) &= \sum_{\ell=1}^{m^{t-1}}\Pr\{A_{(t-1)} = M_{(\ell)} \mid Y_{t-1}\}\,f\big(y_t \mid Y_{t-1}, A_t = M_j, A_{(t-1)} = M_{(\ell)}\big) \\
&\equiv \sum_{\ell=1}^{m^{t-1}}\alpha(\ell)\,N\big(y_t \mid \mu_{tj}(\ell), \Sigma_{tj}(\ell)\big), \quad j = 1, \ldots, m, \tag{6.142}
\end{aligned}
$$
where the notation $N(\cdot \mid b, B)$ represents the normal density with mean vector $b$ and variance–covariance matrix $B$. That is, $f_j(t|t-1)$ is a mixture of normals with non-negative weights $\alpha(\ell) = \Pr\{A_{(t-1)} = M_{(\ell)} \mid Y_{t-1}\}$ such that $\sum_\ell\alpha(\ell) = 1$, and with each normal distribution having mean vector
$$\mu_{tj}(\ell) = M_jx_t^{t-1}(\ell) = M_jE\big[x_t \mid Y_{t-1}, A_{(t-1)} = M_{(\ell)}\big] \tag{6.143}$$
and covariance matrix
$$\Sigma_{tj}(\ell) = M_jP_t^{t-1}(\ell)M_j' + R. \tag{6.144}$$
This result follows because the conditional distribution of $y_t$ in (6.142) is identical to the fixed measurement matrix case presented in Section 4.2. The values in (6.143) and (6.144), and hence the densities, $f_j(t|t-1)$, for $j = 1, \ldots, m$, can be obtained directly from the Kalman filter, Property P6.1, with the measurement matrices $A_{(t-1)}$ fixed at $M_{(\ell)}$.

Although $f_j(t|t-1)$ is given explicitly in (6.142), its evaluation is highly computer intensive. For example, with m = 2 states and n = 20 observations, we have to filter over $2 + 2^2 + \cdots + 2^{20}$ possible sample paths; note, $2^{20} = 1{,}048{,}576$. One remedy is to trim (remove), at each t, highly improbable sample paths; that is, remove events in (6.141) with extremely small probability of occurring, and then evaluate $f_j(t|t-1)$ as if the trimmed sample paths could not have occurred. Another alternative, as suggested by Gordon and Smith (1990) and Shumway and Stoffer (1991), is to approximate $f_j(t|t-1)$ using the closest (in the sense of Kullback–Leibler distance) normal distribution. In this case, the approximation leads to choosing the normal distribution with the same mean and variance associated with $f_j(t|t-1)$; that is, we approximate $f_j(t|t-1)$ by a normal with mean $M_jx_t^{t-1}$ and variance $\Sigma_{tj}$ given in (6.136).

To develop a procedure for maximum likelihood estimation, the joint density of the data is

f(y_1, \ldots, y_n) = \prod_{t=1}^{n} f(y_t \mid Y_{t-1})
                    = \prod_{t=1}^{n} \sum_{j=1}^{m} \Pr(A_t = M_j \mid Y_{t-1})\, f(y_t \mid A_t = M_j, Y_{t-1}),


and hence, the likelihood can be written as

\ln L_Y(\Theta) = \sum_{t=1}^{n} \ln\Big( \sum_{j=1}^{m} \pi_j(t)\, f_j(t|t-1) \Big).    (6.145)

For the hidden Markov model, πj(t) would be replaced by πj(t|t−1). In (6.145),we will use the normal approximation to fj(t|t−1). That is, henceforth, we willconsider fj(t|t−1) as the normal, N(Mjxxx

t−1t , Σtj), density, where xxxt−1

t is givenin (6.130) and Σtj is given in (6.136). We may consider maximizing (6.145)directly as a function of the parameters Θ = {µµµ0, Φ, Q, R} using a Newtonmethod, or we may consider applying the EM algorithm to the complete datalikelihood.
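As a rough illustration (not the authors' code), the approximate log likelihood (6.145) can be accumulated in R once the switching filter has produced, at each t, the one-step-ahead means and variances of y_t under each state; the arrays mu, Sig, and pr below are hypothetical filter output for univariate observations:

  # ln L_Y(Theta) = sum_t log( sum_j pi_j(t) * N(y_t; M_j x_t^{t-1}, Sigma_tj) )
  neg.loglik <- function(y, mu, Sig, pr) {   # mu[t, j], Sig[t, j], pr[t, j] from the filter
    ll <- 0
    for (t in 1:length(y))
      ll <- ll + log(sum(pr[t, ] * dnorm(y[t], mu[t, ], sqrt(Sig[t, ]))))
    -ll                                      # negative log likelihood, e.g., for optim()
  }

In practice this function would sit inside a wrapper that reruns the filter (6.130)-(6.134) at each trial value of Θ before being handed to a quasi-Newton optimizer such as optim().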

To apply the EM algorithm as in §6.3, we call x_0, x_1, \ldots, x_n, A_1, \ldots, A_n, and y_1, \ldots, y_n, the complete data, with likelihood given by

-2 \ln L_{X,A,Y}(\Theta) = \ln|\Sigma_0| + (x_0 - \mu_0)'\Sigma_0^{-1}(x_0 - \mu_0)
   + n \ln|Q| + \sum_{t=1}^{n} (x_t - \Phi x_{t-1})' Q^{-1} (x_t - \Phi x_{t-1})
   - 2 \sum_{t=1}^{n} \sum_{j=1}^{m} I(A_t = M_j) \ln \pi_j(t) + n \ln|R|
   + \sum_{t=1}^{n} \sum_{j=1}^{m} I(A_t = M_j)\,(y_t - A_t x_t)' R^{-1} (y_t - A_t x_t).    (6.146)

As discussed in §6.3, we require the minimization of the conditional expectation

Q(\Theta \mid \Theta^{(k-1)}) = E\big\{ -2 \ln L_{X,A,Y}(\Theta) \mid Y_n, \Theta^{(k-1)} \big\},    (6.147)

with respect to Θ at each iteration, k = 1, 2, \ldots. The calculation and maximization of (6.147) is similar to the case of (6.65). In particular, with

\pi_j(t|n) = E[I(A_t = M_j) \mid Y_n],    (6.148)

we obtain on iteration k,

\pi_j^{(k)}(t) = \pi_j(t|n),    (6.149)
\mu_0^{(k)} = x_0^n,    (6.150)
\Phi^{(k)} = S_{10} S_{00}^{-1},    (6.151)
Q^{(k)} = n^{-1}\big(S_{11} - S_{10} S_{00}^{-1} S_{10}'\big),    (6.152)

and

R^{(k)} = n^{-1} \sum_{t=1}^{n} \sum_{j=1}^{m} \pi_j(t|n)\big[ (y_t - M_j x_t^n)(y_t - M_j x_t^n)' + M_j P_t^n M_j' \big].    (6.153)


where S_{11}, S_{10}, S_{00} are given in (6.67)-(6.69). As before, at iteration k, the filters and the smoothers are calculated using the current values of the parameters, Θ^{(k−1)}, and Σ_0 is held fixed. Filtering is accomplished by using (6.130)-(6.134). Smoothing is derived in a similar manner to the derivation of the filter, and one is led to the smoother given in Property P6.2 and P6.3, with one exception, the initial smoother covariance, (6.55), is now

P_{n,n-1}^{n} = \sum_{j=1}^{m} \pi_j(n|n)\,(I - K_{tj} M_j)\,\Phi P_{n-1}^{n-1}.    (6.154)

Unfortunately, the computation of π_j(t|n) is excessively complicated, and requires integrating over mixtures of normal distributions. Shumway and Stoffer (1991) suggest approximating the smoother π_j(t|n) by the filter π_j(t|t), and find the approximation works well.
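To fix ideas, here is a sketch in R of the update (6.153) only, under the stated approximation π_j(t|n) ≈ π_j(t|t); the arrays xs (n × p smoothed states x_t^n), Ps (p × p × n smoother covariances P_t^n), pis (n × m probabilities), and the list M of measurement matrices are all hypothetical inputs:

  # R^(k) = n^{-1} sum_t sum_j pi_j(t|n) [ (y_t - M_j x_t^n)(y_t - M_j x_t^n)' + M_j P_t^n M_j' ]
  update.R <- function(y, xs, Ps, pis, M) {
    n <- nrow(y); q <- ncol(y); Rnew <- matrix(0, q, q)
    for (t in 1:n)
      for (j in 1:length(M)) {
        e    <- y[t, ] - M[[j]] %*% xs[t, ]                        # y_t - M_j x_t^n
        Rnew <- Rnew + pis[t, j] * (e %*% t(e) + M[[j]] %*% Ps[, , t] %*% t(M[[j]]))
      }
    Rnew / n
  }

The remaining M-step updates (6.150)-(6.152) are the usual state-space EM formulas and would be coded in the same way.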

Example 6.16 Analysis of Influenza Data

We use the results of this section to analyze the U.S. monthly pneumonia and influenza mortality data presented in §5.4, Figure 5.7. Letting y_t denote the mortality caused by pneumonia and influenza at month t, we model y_t in terms of a structural component model coupled with a hidden Markov process that determines whether a flu epidemic exists.

The model consists of three structural components. The first component, x_{t1}, is an AR(2) process chosen to represent the periodic (seasonal) component of the data,

x_{t1} = \alpha_1 x_{t-1,1} + \alpha_2 x_{t-2,1} + w_{t1},    (6.155)

where w_{t1} is white noise, with var(w_{t1}) = σ_1^2. The second component, x_{t2}, is an AR(1) process with a nonzero constant term, which is chosen to represent the sharp rise in the data during an epidemic,

x_{t2} = \beta_0 + \beta_1 x_{t-1,2} + w_{t2},    (6.156)

where w_{t2} is white noise, with var(w_{t2}) = σ_2^2. The third component, x_{t3}, is a fixed trend component given by

x_{t3} = x_{t-1,3} + w_{t3},    (6.157)

where var(w_{t3}) = 0. The case in which var(w_{t3}) > 0, which corresponds to a stochastic trend (random walk), was tried here, but the estimation became unstable, and led us to fit a fixed, rather than stochastic, trend. Thus, in the final model, the trend component satisfies ∇x_{t3} = 0; recall in Example 2.42 the data were also differenced once before fitting the model.


Table 6.3 Estimation Results for Influenza Data

Parameter   Initial Estimate   Final Estimate
α1          1.401 (.079)       1.379 (.073)
α2          −.618 (.091)       −.575 (.075)
β0          .162 (.042)        .201 (.028)
β1          .156 (.142)        –
σ1          .023 (.001)        .023 (.001)
σ2          .105 (.015)        .108 (.016)
σv          .000 (.032)        –

Estimated standard errors are shown in parentheses.

Throughout the years, periods of normal influenza mortality (state 1) are modeled as

y_t = x_{t1} + x_{t3} + v_t,    (6.158)

where the measurement error, v_t, is white noise with var(v_t) = σ_v^2. When an epidemic occurs (state 2), mortality is modeled as

y_t = x_{t1} + x_{t2} + x_{t3} + v_t.    (6.159)

The model specified in (6.155)-(6.159) can be written in the general state-space form. The state equation is

\begin{pmatrix} x_{t1} \\ x_{t-1,1} \\ x_{t2} \\ x_{t3} \end{pmatrix} =
\begin{bmatrix} \alpha_1 & \alpha_2 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & \beta_1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{pmatrix} x_{t-1,1} \\ x_{t-2,1} \\ x_{t-1,2} \\ x_{t-1,3} \end{pmatrix} +
\begin{pmatrix} 0 \\ 0 \\ \beta_0 \\ 0 \end{pmatrix} +
\begin{pmatrix} w_{t1} \\ 0 \\ w_{t2} \\ 0 \end{pmatrix}.    (6.160)

Of course, (6.160) can be written in the standard state-equation form as

x_t = \Phi x_{t-1} + \Gamma u_t + w_t,    (6.161)

where x_t = (x_{t1}, x_{t-1,1}, x_{t2}, x_{t3})', Γ = (0, 0, β_0, 0)', u_t ≡ 1, and Q is a 4 × 4 matrix with σ_1^2 as the (1,1)-element, σ_2^2 as the (3,3)-element, and the remaining elements set equal to zero. The observation equation is

y_t = A_t x_t + v_t,    (6.162)

where A_t is 1 × 4, and v_t is white noise with var(v_t) = R = σ_v^2. We assume all components of variance, w_{t1}, w_{t2}, and v_t, are uncorrelated.

As discussed in (6.158) and (6.159), A_t can take one of two possible forms

A_t = M_1 = [1, 0, 0, 1]   no epidemic,
A_t = M_2 = [1, 0, 1, 1]   epidemic,
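For concreteness, the system matrices of (6.160)-(6.162) can be set up in R as follows; the numerical values are placeholders in the spirit of Table 6.3, not part of the text:

  alpha1 <- 1.4; alpha2 <- -.6; beta0 <- .2; beta1 <- 0      # final model has beta1 = 0
  sig1 <- .023; sig2 <- .11
  Phi <- matrix(c(alpha1, alpha2,     0, 0,
                  1,           0,     0, 0,
                  0,           0, beta1, 0,
                  0,           0,     0, 1), 4, 4, byrow = TRUE)
  Gam <- matrix(c(0, 0, beta0, 0), 4, 1)                     # Gamma, with u_t = 1
  Q   <- diag(c(sig1^2, 0, sig2^2, 0))                       # state noise covariance
  M1  <- matrix(c(1, 0, 0, 1), 1, 4)                         # A_t, no epidemic
  M2  <- matrix(c(1, 0, 1, 1), 1, 4)                         # A_t, epidemic
  R   <- 0                                                   # final model: sigma_v^2 = 0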


Figure 6.12 Influenza data, y_t (dark line with squares), and the predicted probability that no epidemic occurs in month t given the past, π_1(t|t−1) (line with circles), for the ten-year period 1969-1978; 1968 is not shown.

corresponding to the two possible states of (1) no flu epidemic and (2) flu epidemic, such that Pr(A_t = M_1) = 1 − Pr(A_t = M_2). In this example, we will assume A_t is a hidden Markov chain, and hence we use the updating equations given in Example 6.15, (6.139) and (6.140), with transition probabilities π_{11} = π_{22} = .75 (and, thus, π_{12} = π_{21} = .25).

Parameter estimation was accomplished using a quasi-Newton–Raphson procedure to maximize the approximate log likelihood given in (6.145), with initial values of π_1(1|0) = π_2(1|0) = .5. Table 6.3 shows the results of the estimation procedure. On the initial fit, two estimates are not significant, namely, β_1 and σ_v. When σ_v^2 = 0, there is no measurement error, and the variability in the data is explained solely by the variance components of the state system, namely, σ_1^2 and σ_2^2. The case in which β_1 = 0 corresponds to a simple level shift during a flu epidemic. In the final model, with β_1 and σ_v^2 removed, the estimated level shift (β_0) corresponds to an increase in mortality by about .2 per 1000 during a flu epidemic. The estimates for the final model are also listed in Table 6.3.

Figure 6.12 shows a plot of the data, y_t, for the ten-year period 1969-1978, as well as the estimated approximate conditional probabilities π_1(t|t−1), that is, the predicted probability no epidemic occurs in month t given the past, y_1, \ldots, y_{t−1}.


Figure 6.13 The three filtered structural components of influenza mortality: x_{t1}^t (cyclic trace), x_{t2}^t (spiked trace), and x_{t3}^t (negative linear trace) for the ten-year period 1969-1978.

The results for the first year of the data, 1968, are not included in the figure because of initial instabilities of the filter. Of course, the estimated predicted probability a flu epidemic will occur next month is π_2(t|t−1) = 1 − π_1(t|t−1). Thus, a good estimator would have small values of π_1(t|t−1) corresponding to peaks in y_t. Except for initial values where instability exists, the estimated prediction probabilities are right on the mark. That is, the predicted probability of a flu epidemic exceeds the probability of no epidemic when indeed a flu epidemic occurred the next month.

Figure 6.13 shows the estimated filtered values (that is, filtering is done using the parameter estimates) of the three components of the model, x_{t1}^t, x_{t2}^t, and x_{t3}^t. Except for initial instability (which is not shown), x_{t1}^t represents the seasonal (cyclic) aspect of the data, x_{t2}^t represents the spikes during a flu epidemic, and x_{t3}^t represents the slow decline in flu mortality over the ten-year period 1969-1978.

One-month-ahead prediction, say, y_t^{t−1}, is obtained as follows,

y_t^{t-1} = M_1 x_t^{t-1}   if π_1(t|t−1) > π_2(t|t−1),
y_t^{t-1} = M_2 x_t^{t-1}   if π_1(t|t−1) ≤ π_2(t|t−1).
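In R, the selection between the two predictors is a one-line comparison; xpred, pi1, and pi2 below are hypothetical filter outputs at a single time t:

  M1 <- matrix(c(1, 0, 0, 1), 1, 4); M2 <- matrix(c(1, 0, 1, 1), 1, 4)
  xpred <- c(.2, .18, .05, .9)                 # x_t^{t-1}, illustrative values
  pi1 <- .8; pi2 <- 1 - pi1                    # pi_1(t|t-1), pi_2(t|t-1)
  yhat <- if (pi1 > pi2) drop(M1 %*% xpred) else drop(M2 %*% xpred)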


Figure 6.14 One-month-ahead prediction, y_t^{t−1} (line), of the number of deaths caused by pneumonia and influenza, y_t (points), for 1969-1978. The standard error of the prediction is .02 when a flu epidemic is not predicted, and .11 when a flu epidemic is predicted.

Of course, x_t^{t−1} is the estimated state prediction, obtained via the filter presented in (6.130)-(6.134) (with the addition of the constant term in the model) using the estimated parameters. The results are shown in Figure 6.14. The precision of the forecasts can be measured by the innovation variances, Σ_{t1} when no epidemic is predicted, and Σ_{t2} when an epidemic is predicted. These values become stable quickly, and when no epidemic is predicted, the estimated standard error of the prediction is approximately .02 (this is the square root of Σ_{t1} for t large); when a flu epidemic is predicted, the estimated standard error of the prediction is approximately .11.

The results of this analysis are impressive given the small number of parameters and the degree of approximation that was made to obtain a computationally simple method for fitting a complex model. In particular, as seen in Figure 6.12, the model is never fooled as to when a flu epidemic will occur. This result is particularly impressive, given that, for example, in the third year, around t = 36, it appeared as though an epidemic was about to begin, but it never was realized, and the model


predicted no flu epidemic that year. As seen in Figure 6.14, the predictions tend to underestimate mortality during the peaks, but the true values are typically within one standard error of the predicted value. Further evidence of the strength of this technique can be found in the example given in Shumway and Stoffer (1991).

6.9 Nonlinear and Non-normal State-Space Models Using Monte Carlo Methods

Most of this chapter has focused on linear dynamic models assumed to be Gaussian processes. Historically, these models were convenient because analyzing the data was a relatively simple matter. These assumptions cannot cover every situation, and it is advantageous to explore departures from these assumptions. As seen in §6.8, the solution to the nonlinear and non-Gaussian case will require computer-intensive techniques currently in vogue because of the availability of cheap and fast computers. In this section, we take a Bayesian approach to forecasting as our main objective; see West and Harrison (1997) for a detailed account of Bayesian forecasting with dynamic models. Prior to the mid-1980s, a number of approximation methods were developed to filter non-normal or nonlinear processes in an attempt to circumvent the computational complexity of the analysis of such models. For example, the extended Kalman filter and the Gaussian sum filter (Alspach and Sorensen, 1972) are two such methods described in detail in Anderson and Moore (1979). As in the previous section, these techniques typically rely on approximating the non-normal distribution by one or several Gaussian distributions or by some other parametric function.

With the advent of cheap and fast computing, a number of authors developed computer-intensive methods based on numerical integration. For example, Kitagawa (1987) proposed a numerical method based on piecewise linear approximations to the density functions for prediction, filtering, and smoothing for non-Gaussian and nonstationary state-space models. Pole and West (1988) used Gaussian quadrature techniques in a Bayesian analysis of nonlinear dynamic models; West and Harrison (1997, Chapter 13) provide a detailed explanation of these and similar methods. Markov chain Monte Carlo (MCMC) methods refer to Monte Carlo integration methods that use a Markovian updating scheme. We will describe the method in more detail later. The most common MCMC method is the Gibbs sampler, which is essentially a modification of the Metropolis algorithm (Metropolis et al., 1953) developed by Hastings (1970) in the statistical setting and by Geman and Geman (1984) in the context of image restoration. Later, Tanner and Wong (1987) used the ideas in their substitution sampling approach, and Gelfand and Smith (1990) developed the Gibbs sampler for a wide class of parametric models. This technique


was first used by Carlin et al. (1992) in the context of general nonlinear and non-Gaussian state-space models. Fruhwirth-Schnatter (1994) and Carter and Kohn (1994) built on these ideas to develop efficient Gibbs sampling schemes for more restrictive models.

If the model is linear, that is, (6.1) and (6.2) hold, but the distributions are not Gaussian, a non-Gaussian likelihood can be defined by (6.31) in §6.2, but where f_0(·), f_w(·) and f_v(·) are not normal densities. In this case, prediction and filtering can be accomplished using numerical integration techniques (e.g., Kitagawa, 1987; Pole and West, 1988) or Monte Carlo techniques (e.g., Fruhwirth-Schnatter, 1994; Carter and Kohn, 1994) to evaluate (6.32) and (6.33). Of course, the prediction and filter densities p_Θ(x_t | Y_{t−1}) and p_Θ(x_t | Y_t) will no longer be Gaussian and will not generally be of the location-scale form as in the Gaussian case. A rich class of non-normal densities is given in (6.173).

In general, the state-space model can be given by the following equations:

x_t = F_t(x_{t-1}, w_t)   and   y_t = H_t(x_t, v_t),    (6.163)

where F_t and H_t are known functions that may depend on parameters Θ, and w_t and v_t are white noise processes. The main component of the model retained by (6.163) is that the states are Markov, and the observations are conditionally independent, but we do not necessarily assume F_t and H_t are linear, or w_t and v_t are Gaussian. Of course, if F_t(x_{t−1}, w_t) = Φ_t x_{t−1} + w_t and H_t(x_t, v_t) = A_t x_t + v_t and w_t and v_t are Gaussian, we have the standard DLM (exogenous variables can be added to the model in the usual way). In the general model, (6.163), the likelihood is given by

L_{X,Y}(\Theta) = p_\Theta(x_0) \prod_{t=1}^{n} p_\Theta(x_t \mid x_{t-1})\, p_\Theta(y_t \mid x_t),    (6.164)

and the prediction and filter densities, as given by (6.32) and (6.33) in §6.2, still hold.

Because our focus is on simulation using MCMC methods, we first describe the technique in a general context.

Example 6.17 MCMC Techniques and the Gibbs Sampler

The goal of a Monte Carlo technique, of course, is to simulate a pseudo-random sample of vectors from a desired density function p_Θ(z). In Markov chain Monte Carlo, we simulate an ordered sequence of pseudo-random vectors, z_0 → z_1 → z_2 → · · ·, by specifying a starting value, z_0, and then sampling successive values from a transition density π(z_t|z_{t−1}), for t = 1, 2, \ldots. In this way, conditional on z_{t−1}, the t-th pseudo-random vector, z_t, is simulated independent of its predecessors. This technique alone does not yield a pseudo-random sample because contiguous draws are dependent on each other (that is, we obtain a first-order dependent


sequence of pseudo-random vectors). If done appropriately, the dependence between the pseudo-variates z_t and z_{t+m} decays exponentially in m, and we may regard the collection {z_{t+ℓm}; ℓ = 1, 2, \ldots}, for t and m suitably large, as a pseudo-random sample. Alternately, one may repeat the process in parallel, retaining the m-th value, on run g = 1, 2, \ldots, say, z_m^{(g)}, for large m. Under general conditions, the Markov chain converges in the sense that, eventually, the sequence of pseudo-variates appears stationary and the individual z_t are marginally distributed according to the stationary "target" density p_Θ(z). Technical details may be found in Tierney (1994).

For Gibbs sampling, suppose we have a collection {z_1, \ldots, z_k} of random vectors with complete conditional densities denoted generically by

p_\Theta(z_j \mid z_i, i \ne j) \equiv p_\Theta(z_j \mid z_1, \ldots, z_{j-1}, z_{j+1}, \ldots, z_k),

for j = 1, \ldots, k, available for sampling. Here, available means pseudo-samples may be generated by some method given the values of the appropriate conditioning random vectors. Under mild conditions, these complete conditionals uniquely determine the full joint density p_Θ(z_1, \ldots, z_k) and, consequently, all marginals, p_Θ(z_j) for j = 1, \ldots, k; details may be found in Besag (1974). The Gibbs sampler generates pseudo-samples from the joint distribution as follows. Start with an arbitrary set of starting values, say, {z_{1[0]}, \ldots, z_{k[0]}}. Draw z_{1[1]} from p_Θ(z_1|z_{2[0]}, \ldots, z_{k[0]}), then draw z_{2[1]} from p_Θ(z_2|z_{1[1]}, z_{3[0]}, \ldots, z_{k[0]}), and so on up to z_{k[1]} from p_Θ(z_k|z_{1[1]}, \ldots, z_{k−1[1]}), to complete one iteration. After ℓ such iterations, we have the collection {z_{1[ℓ]}, \ldots, z_{k[ℓ]}}. Geman and Geman (1984) showed that under mild conditions, {z_{1[ℓ]}, \ldots, z_{k[ℓ]}} converges (ℓ → ∞) in distribution to a random observation from p_Θ(z_1, \ldots, z_k). For this reason, we typically drop the subscript [ℓ] from the notation, assuming ℓ is sufficiently large for the generated sample to be thought of as a realization from the joint density; hence, we denote this first realization as {z_{1[ℓ]}^{(1)}, \ldots, z_{k[ℓ]}^{(1)}} ≡ {z_1^{(1)}, \ldots, z_k^{(1)}}. This entire process is replicated in parallel, a large number, G, of times, providing pseudo-random iid collections {z_1^{(g)}, \ldots, z_k^{(g)}}, for g = 1, \ldots, G, from the joint distribution. These

simulated values can then be used to estimate the marginal densities. In particular, if p_Θ(z_j | z_i, i ≠ j) is available in closed form, then

\hat{p}_\Theta(z_j) = G^{-1} \sum_{g=1}^{G} p_\Theta(z_j \mid z_i^{(g)}, i \ne j).    (6.165)

Approximation (6.165) is based on the fact that, for random variables x and y with joint density p(x, y), the marginal density of x is obtained as follows: p(x) = ∫ p(x, y) dy = ∫ p(x|y) p(y) dy. Because of the relatively recent appearance of Gibbs sampling methodology, several important theoretical and practical issues are under investigation. These issues


include the diagnosis of convergence, modification of the sampling order, efficient estimation, and sequential sampling schemes (as opposed to the parallel processing described above), to mention a few. At this time, the best advice can be obtained from the texts by Gelman et al. (1995) and Gilks et al. (1996), and we are certain that many more will follow.
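As a toy illustration of the mechanics (not an example from the text), the following R sketch runs the Gibbs sampler for a bivariate normal with correlation rho, for which both complete conditionals are univariate normals; G parallel runs each retain the value after ell iterations:

  gibbs.one <- function(ell = 50, rho = .8) {
    z1 <- z2 <- 0                                   # arbitrary starting values
    for (i in 1:ell) {
      z1 <- rnorm(1, rho * z2, sqrt(1 - rho^2))     # draw z1 | z2
      z2 <- rnorm(1, rho * z1, sqrt(1 - rho^2))     # draw z2 | z1
    }
    c(z1, z2)
  }
  set.seed(1)
  draws <- t(replicate(500, gibbs.one()))           # G = 500 parallel replications
  cor(draws)                                        # off-diagonal close to rho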

Finally, it may be necessary to nest rejection sampling within the Gibbs sampling procedure. The need for rejection sampling arises when we want to sample from a density, say, f(z), but f(z) is known only up to a proportionality constant, say, p(z) ∝ f(z). If a density g(z) is available, and there is a constant c for which p(z) ≤ c g(z) for all z, the rejection algorithm generates pseudo-variates from f(z) by generating a value, z*, from g(z) and accepting it as a value from f(z) with probability π(z*) = p(z*)/[c g(z*)]. This algorithm can be quite inefficient if π(·) is close to zero; in such cases, more sophisticated envelope functions may be needed. Further discussion of these matters in the case of nonlinear state-space models can be found in Carlin et al. (1992, Examples 1.2 and 3.2).
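A minimal R sketch of the rejection step, using a simple target not from the text, p(z) ∝ dnorm(z)(1 + cos z), with envelope c g(z), g standard normal and c = 2 (since 1 + cos z ≤ 2):

  rsamp <- function(n) {
    out <- numeric(n)
    for (i in 1:n) repeat {
      zstar <- rnorm(1)                             # candidate from g
      if (runif(1) < (1 + cos(zstar)) / 2) {        # accept with prob p/(c g)
        out[i] <- zstar; break
      }
    }
    out
  }
  hist(rsamp(2000), breaks = 40, freq = FALSE)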

In Example 6.17, the generic random vectors z_j can represent parameter values, such as components of Θ, state values x_t, or future observations y_{n+m}, for m ≥ 1. This will become evident in the following examples. Before discussing the general case of nonlinear and non-normal state-space models, we briefly introduce MCMC methods for the Gaussian DLM, as presented in Fruhwirth-Schnatter (1994) and Carter and Kohn (1994).

Example 6.18 Assessing Model Parameters for the Gaussian DLM

Consider the Gaussian DLM given by

x_t = \Phi_t x_{t-1} + w_t   and   y_t = a_t' x_t + v_t.    (6.166)

The observations are univariate, and the state process is p-dimensional; this DLM includes the structural models presented in §6.5. The prior on the initial state is x_0 ∼ N(μ_0, Σ_0), and we assume that w_t ∼ iid N(0, Q_t), independent of v_t ∼ iid N(0, r_t). The collection of unknown model parameters will be denoted by Θ.

To explore how we would assess the values of Θ using an MCMC technique, we focus on the problem of obtaining the posterior distribution, p(Θ | Y_n), of the parameters given the data, Y_n = {y_1, \ldots, y_n}, and a prior π(Θ). Of course, these distributions depend on "hyperparameters" that are assumed to be known. (Some authors consider the states x_t as the first level of parameters because they are unobserved. In this case, the values in Θ are regarded as the hyperparameters, and the parameters of their distributions are regarded as hyper-hyperparameters.) Denoting


the entire set of state vectors as X_n = {x_0, x_1, \ldots, x_n}, the posterior can be written as

p(\Theta \mid Y_n) = \int p(\Theta \mid X_n, Y_n)\, p(X_n, \Theta^* \mid Y_n)\, dX_n\, d\Theta^*.    (6.167)

Although the posterior, p(Θ | Y_n), may be intractable, conditioning on the states can make the problem manageable in that

p(\Theta \mid X_n, Y_n) \propto \pi(\Theta)\, p(x_0 \mid \Theta) \prod_{t=1}^{n} p(x_t \mid x_{t-1}, \Theta)\, p(y_t \mid x_t, \Theta)    (6.168)

can be easier to work with (either as members of conjugate families or using some rejection scheme); we will discuss this in more detail when we present the nonlinear, non-Gaussian case, but we will assume for the present p(Θ | X_n, Y_n) is in closed form.

Suppose we can obtain G pseudo-random draws, X_n^{(g)} ≡ (X_n, Θ*)^{(g)}, for g = 1, \ldots, G, from the joint posterior density p(X_n, Θ* | Y_n). Then (6.167) can be approximated by

\hat{p}(\Theta \mid Y_n) = G^{-1} \sum_{g=1}^{G} p(\Theta \mid X_n^{(g)}, Y_n).

A sample from p(X_n, Θ* | Y_n) is obtained using two different MCMC methods. First, the Gibbs sampler is used, for each g, as follows: sample X_{n[ℓ]} given Θ*_{[ℓ−1]} from p(X_n | Θ*_{[ℓ−1]}, Y_n), and then sample Θ*_{[ℓ]} from p(Θ | X_{n[ℓ]}, Y_n) as given by (6.168), for ℓ = 1, 2, \ldots. Stop when ℓ is sufficiently large, and retain the final values as X_n^{(g)}. This process is repeated G times.

The first step of this method requires simultaneous generation of the state vectors. Because we are dealing with a Gaussian linear model, we can rely on the existing theory of the Kalman filter to accomplish this step. This step is conditional on Θ, and we assume at this point that Θ is fixed and known. In other words, our goal is to sample the entire set of state vectors, X_n = {x_0, x_1, \ldots, x_n}, from the multivariate normal posterior density p_Θ(X_n | Y_n), where Y_n = {y_1, \ldots, y_n} represents the observations. Because of the Markov structure, we can write,

p_\Theta(X_n \mid Y_n) = p_\Theta(x_n \mid Y_n)\, p_\Theta(x_{n-1} \mid x_n, Y_{n-1}) \cdots p_\Theta(x_0 \mid x_1).    (6.169)

In view of (6.169), it is possible to sample the entire set of state vectors, X_n, by sequentially simulating the individual states backward. This process yields a simulation method that Fruhwirth-Schnatter (1994) called the forward-filtering, backward-sampling algorithm. In particular,


because the processes are Gaussian, we need only obtain the conditional means and variances, say, m_t = E_Θ(x_t | Y_t, x_{t+1}) and V_t = var_Θ(x_t | Y_t, x_{t+1}). This conditioning argument is akin to having x_{t+1} as an additional observation on state x_t. In particular, using standard multivariate normal distribution theory,

m_t = x_t^t + J_t(x_{t+1} - x_{t+1}^t),
V_t = P_t^t - J_t P_{t+1}^t J_t',    (6.170)

for t = n − 1, n − 2, \ldots, 0, where J_t is defined in (6.49). To verify (6.170), the essential part of the Gaussian density (that is, the exponent) of x_t | Y_t, x_{t+1} is

(x_{t+1} - \Phi_{t+1} x_t)' [Q_{t+1}]^{-1} (x_{t+1} - \Phi_{t+1} x_t) + (x_t - x_t^t)' [P_t^t]^{-1} (x_t - x_t^t),

and we simply complete the square; see Fruhwirth-Schnatter (1994) or West and Harrison (1997, Section 4.7). Hence, the algorithm is to first sample x_n from a N(x_n^n, P_n^n), where x_n^n and P_n^n are obtained from the Kalman filter, Property P6.1, and then sample x_t from a N(m_t, V_t), for t = n − 1, n − 2, \ldots, 0, where the conditioning value of x_{t+1} is the value previously sampled; m_t and V_t are given in (6.170).
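A schematic R version of forward-filtering, backward-sampling for a scalar DLM (x_t = φ x_{t−1} + w_t, y_t = x_t + v_t) may help fix ideas; it is a sketch under these simplifying assumptions, not the general algorithm:

  ffbs <- function(y, phi = .9, sQ = 1, sR = 1, m0 = 0, P0 = 10) {
    n <- length(y); mf <- Pf <- mp <- Pp <- numeric(n)
    for (t in 1:n) {                                    # Kalman filter (forward pass)
      mp[t] <- phi * ifelse(t == 1, m0, mf[t - 1])
      Pp[t] <- phi^2 * ifelse(t == 1, P0, Pf[t - 1]) + sQ^2
      K     <- Pp[t] / (Pp[t] + sR^2)
      mf[t] <- mp[t] + K * (y[t] - mp[t]);  Pf[t] <- (1 - K) * Pp[t]
    }
    x <- numeric(n)
    x[n] <- rnorm(1, mf[n], sqrt(Pf[n]))                # x_n ~ N(x_n^n, P_n^n)
    for (t in (n - 1):1) {                              # backward sampling, eq. (6.170)
      J <- Pf[t] * phi / Pp[t + 1]
      x[t] <- rnorm(1, mf[t] + J * (x[t + 1] - mp[t + 1]), sqrt(Pf[t] - J^2 * Pp[t + 1]))
    }
    x
  }
  set.seed(90210); xdraw <- ffbs(cumsum(rnorm(100)) + rnorm(100))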

Next, we address an MCMC approach to nonlinear and non-Gaussian state-space modeling that was first presented in Carlin et al. (1992). We consider the general model given in (6.163), but with additive errors:

x_t = F_t(x_{t-1}) + w_t   and   y_t = H_t(x_t) + v_t,    (6.171)

where F_t and H_t are given, but may also depend on unknown parameters, say, Φ_t and A_t, respectively, the collection of which will be denoted by Θ. The errors are independent white noise sequences with var(w_t) = Q_t and var(v_t) = R_t. Although time-varying variance–covariance matrices are easily incorporated in this framework, to ease the discussion we focus on the case Q_t ≡ Q and R_t ≡ R. Also, although it is not necessary, we assume the initial state condition x_0 is fixed and known; this is merely for notational convenience, so we do not have to carry along the additional terms involving x_0 throughout the discussion.

In general, the likelihood specification for the model is given by

L_{X,Y}(\Theta, Q, R) = \prod_{t=1}^{n} f_1(x_t \mid x_{t-1}, \Theta, Q)\, f_2(y_t \mid x_t, \Theta, R),    (6.172)

where it is assumed the densities f_1(·) and f_2(·) are scale mixtures of normals. Specifically, for t = 1, \ldots, n,

f_1(x_t \mid x_{t-1}, \Theta, Q) = \int f(x_t \mid x_{t-1}, \Theta, Q, \lambda_t)\, p_1(\lambda_t)\, d\lambda_t,
f_2(y_t \mid x_t, \Theta, R) = \int f(y_t \mid x_t, \Theta, R, \omega_t)\, p_2(\omega_t)\, d\omega_t,    (6.173)


where conditional on the independent sequences of nuisance parameters λ = (λ_t; t = 1, \ldots, n) and ω = (ω_t; t = 1, \ldots, n),

x_t \mid x_{t-1}, \Theta, Q, \lambda_t \sim N\big(F_t(x_{t-1}; \Theta),\, \lambda_t Q\big),
y_t \mid x_t, \Theta, R, \omega_t \sim N\big(H_t(x_t; \Theta),\, \omega_t R\big).    (6.174)

By varying p_1(λ_t) and p_2(ω_t), we can have a wide variety of non-Gaussian error densities. These densities include, for example, double exponential, logistic, and t distributions in the univariate case and a rich class of multivariate distributions; this is discussed further in Carlin et al. (1992). The key to the approach is the introduction of the nuisance parameters λ and ω and the structure (6.174), which lends itself naturally to the Gibbs sampler and allows for the analysis of this general nonlinear and non-Gaussian problem.
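For example, choosing p_1(λ_t) so that ν/λ_t ∼ χ²_ν makes the marginal state noise a t-distribution with ν degrees of freedom (the choice used in Example 6.20 below); a quick R check of this fact, under these assumptions only:

  set.seed(1)
  nu <- 10; Q <- 1
  lambda <- nu / rchisq(1e5, df = nu)          # nuisance parameter draws
  w <- rnorm(1e5, 0, sqrt(lambda * Q))         # w_t | lambda_t ~ N(0, lambda_t Q)
  qqplot(qt(ppoints(500), df = nu), quantile(w, ppoints(500)))   # nearly a straight line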

According to Example 6.17, to implement the Gibbs sampler, we must be able to sample from the following complete conditional distributions:

(i)   x_t | x_{s≠t}, λ, ω, Θ, Q, R, Y_n,   t = 1, \ldots, n,
(ii)  λ_t | λ_{s≠t}, ω, Θ, Q, R, Y_n, X_n ∼ λ_t | Θ, Q, x_t, x_{t−1},   t = 1, \ldots, n,
(iii) ω_t | ω_{s≠t}, λ, Θ, Q, R, Y_n, X_n ∼ ω_t | Θ, R, y_t, x_t,   t = 1, \ldots, n,
(iv)  Q | λ, ω, Θ, R, Y_n, X_n ∼ Q | λ, Y_n, X_n,
(v)   R | λ, ω, Θ, Q, Y_n, X_n ∼ R | ω, Y_n, X_n,
(vi)  Θ | λ, ω, Q, R, Y_n, X_n ∼ Θ | Y_n, X_n,

where X_n = {x_1, \ldots, x_n} and Y_n = {y_1, \ldots, y_n}. The main difference between this method and the linear Gaussian case is that, because of the generality, we sample the states one-at-a-time rather than simultaneously generating all of them. As discussed in Carter and Kohn (1994), if possible, it is more efficient to generate the states simultaneously as in Example 6.18.

We will discuss items (i) and (ii) above. The third item follows in a similar manner to the second, and items (iv)-(vi) will follow from standard multivariate normal distribution theory and from Wishart distribution theory because of the conditioning on λ and ω. We will discuss this matter further in the next example. First, consider the linear model, F_t(x_{t−1}) = Φ_t x_{t−1}, and H_t(x_t) = A_t x_t in (6.171). In this case, for t = 1, \ldots, n, x_t | x_{s≠t}, λ, ω, Θ, Q, R, Y_n has a p-dimensional N_p(B_t b_t, B_t) distribution, with

B_t^{-1} = Q^{-1}/\lambda_t + A_t' R^{-1} A_t/\omega_t + \Phi_{t+1}' Q^{-1} \Phi_{t+1}/\lambda_{t+1},
b_t = x_{t-1}' \Phi_t' Q^{-1}/\lambda_t + y_t' R^{-1} A_t/\omega_t + x_{t+1}' Q^{-1} \Phi_{t+1}/\lambda_{t+1},    (6.175)


where, when t = n in (6.175), terms in the sum with elements having a subscript of n + 1 are dropped (this is assumed to be the case in what follows, although we do not explicitly state it). This result follows by noting the essential part of the multivariate normal distribution (that is, the exponent) of x_t | x_{s≠t}, λ, ω, Θ, Q, R, Y_n is

(x_t - \Phi_t x_{t-1})'(\lambda_t Q)^{-1}(x_t - \Phi_t x_{t-1}) + (y_t - A_t x_t)'(\omega_t R)^{-1}(y_t - A_t x_t)
   + (x_{t+1} - \Phi_{t+1} x_t)'(\lambda_{t+1} Q)^{-1}(x_{t+1} - \Phi_{t+1} x_t),    (6.176)

which upon manipulation yields (6.175).

Example 6.19 Nonlinear Models

In the case of nonlinear models, we can use (6.175) with slight modifications. For example, consider the case in which F_t is nonlinear, but H_t is linear, so the observations are y_t = A_t x_t + v_t. Then,

x_t \mid x_{s \ne t}, \lambda, \omega, \Theta, Q, R, Y_n \propto \eta_1(x_t)\, N_p(B_{1t} b_{1t}, B_{1t}),    (6.177)

where

B_{1t}^{-1} = Q^{-1}/\lambda_t + A_t' R^{-1} A_t/\omega_t,
b_{1t} = F_t(x_{t-1})' Q^{-1}/\lambda_t + y_t' R^{-1} A_t/\omega_t,

and

\eta_1(x_t) = \exp\Big\{ -\tfrac{1}{2\lambda_{t+1}} \big(x_{t+1} - F_{t+1}(x_t)\big)' Q^{-1} \big(x_{t+1} - F_{t+1}(x_t)\big) \Big\}.

Because 0 ≤ η_1(x_t) ≤ 1, for all x_t, the distribution we want to sample from is dominated by the N_p(B_{1t} b_{1t}, B_{1t}) density. Hence, we may use rejection sampling as discussed in Example 6.17 to obtain an observation from the required density. That is, we generate a pseudo-variate from the N_p(B_{1t} b_{1t}, B_{1t}) density and accept it with probability η_1(x_t).

We proceed analogously in the case in which F_t(x_{t−1}) = Φ_t x_{t−1} is linear and H_t(x_t) is nonlinear. In this case,

x_t \mid x_{s \ne t}, \lambda, \omega, \Theta, Q, R, Y_n \propto \eta_2(x_t)\, N_p(B_{2t} b_{2t}, B_{2t}),    (6.178)

where

B_{2t}^{-1} = Q^{-1}/\lambda_t + \Phi_{t+1}' Q^{-1} \Phi_{t+1}/\lambda_{t+1},
b_{2t} = x_{t-1}' \Phi_t' Q^{-1}/\lambda_t + x_{t+1}' Q^{-1} \Phi_{t+1}/\lambda_{t+1},


and

\eta_2(x_t) = \exp\Big\{ -\tfrac{1}{2\omega_t} \big(y_t - H_t(x_t)\big)' R^{-1} \big(y_t - H_t(x_t)\big) \Big\}.

Here, we generate a pseudo-variate from the N_p(B_{2t} b_{2t}, B_{2t}) density and accept it with probability η_2(x_t).

Finally, in the case in which both F_t and H_t are nonlinear, we have

x_t \mid x_{s \ne t}, \lambda, \omega, \Theta, Q, R, Y_n \propto \eta_1(x_t)\, \eta_2(x_t)\, N_p\big(F_t(x_{t-1}), \lambda_t Q\big),    (6.179)

so we sample from a N_p(F_t(x_{t−1}), λ_t Q) density and accept it with probability η_1(x_t) η_2(x_t).

Determination of (ii), λ_t | Θ, Q, x_t, x_{t−1}, follows directly from Bayes theorem; that is, p(λ_t | Θ, Q, x_t, x_{t−1}) ∝ p_1(λ_t)\, p(x_t | λ_t, x_{t−1}, Θ, Q). By (6.173), however, we know the normalization constant is given by f_1(x_t | x_{t−1}, Θ, Q), and thus the complete conditional density for λ_t is of a known functional form.

Many examples of these techniques are given in Carlin et al. (1992), including the problem of model choice. In the next example, we consider a univariate nonlinear model in which the state noise process has a t-distribution. As noted in Meinhold and Singpurwalla (1989), using t-distributions for the error processes is a way of robustifying the Kalman filter against outliers. In this example we present a brief discussion of a detailed analysis presented in Carlin et al. (1992, Example 4.2); readers interested in more detail may find it in that article.

Example 6.20 Analysis of a Nonlinear, Non-Gaussian State-Space Model

Kitagawa (1987) considered the analysis of data generated from the following univariate nonlinear model:

x_t = F_t(x_{t-1}) + w_t   and   y_t = H_t(x_t) + v_t,   t = 1, \ldots, 100,    (6.180)

with

F_t(x_{t-1}) = \alpha x_{t-1} + \beta x_{t-1}/(1 + x_{t-1}^2) + \gamma \cos[1.2(t-1)],
H_t(x_t) = x_t^2/20,    (6.181)

where x_0 = 0, w_t are independent random variables having a central t-distribution with ν = 10 degrees of freedom and scaled so var(w_t) = σ_w^2 = 10 [we denote this generically by t(0, σ, ν)], and v_t is standard Gaussian white noise, var(v_t) = σ_v^2 = 1. The state noise and observation noise are mutually independent. Kitagawa (1987) discussed the analysis of data generated from this model with α = .5, β = 25, and γ = 8 assumed


known. We will use these values of the parameters in this example, but we will assume they are unknown. Figure 6.15 shows a typical data sequence y_t and the corresponding state process x_t.

Our goal here will be to obtain an estimate of the prediction density p(x_{101} | Y_{100}). To accomplish this, we use n = 101 and consider y_{101} as a latent variable (we will discuss this in more detail shortly). The priors on the variance components are chosen from a conjugate family, that is, σ_w^2 ∼ IG(a_0, b_0) independent of σ_v^2 ∼ IG(c_0, d_0), where IG denotes the inverse (reciprocal) gamma distribution [z has an inverse gamma distribution if 1/z has a gamma distribution; general properties can be found, for example, in Box and Tiao (1973, Section 8.5)]. Then,

\sigma_w^2 \mid \lambda, Y_n, X_n \sim IG\Big( a_0 + \tfrac{n}{2},\; \Big\{ \tfrac{1}{b_0} + \tfrac{1}{2} \sum_{t=1}^{n} [x_t - F(x_{t-1})]^2/\lambda_t \Big\}^{-1} \Big),

\sigma_v^2 \mid \omega, Y_n, X_n \sim IG\Big( c_0 + \tfrac{n}{2},\; \Big\{ \tfrac{1}{d_0} + \tfrac{1}{2} \sum_{t=1}^{n} [y_t - H(x_t)]^2/\omega_t \Big\}^{-1} \Big).    (6.182)

Next, letting ν/λ_t ∼ χ_ν^2, we get that, marginally, w_t | σ_w ∼ t(0, σ_w, ν), as required, leading to the complete conditional λ_t | σ_w, α, β, γ, Y_n, X_n, for t = 1, \ldots, n, being distributed as

IG\Big( \tfrac{\nu + 1}{2},\; 2\Big\{ \tfrac{[x_t - F(x_{t-1})]^2}{\sigma_w^2} + \nu \Big\}^{-1} \Big).    (6.183)

We take ωt ≡ 1 for t = 1, . . . , n, because the observation noise is Gaussian.
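In R, inverse gamma variates are conveniently generated as reciprocals of gamma variates. A sketch of the σ_w² draw in (6.182), reading IG(a, b) as 1/z ∼ gamma with shape a and scale b, with hypothetical inputs x = (x_0, ..., x_n), lambda, and the function F from (6.181):

  Ffun <- function(xlag, t, alpha = .5, beta = 25, gamma = 8)       # F_t of (6.181)
    alpha * xlag + beta * xlag / (1 + xlag^2) + gamma * cos(1.2 * (t - 1))
  draw.sw2 <- function(x, lambda, a0 = 3, b0 = .05) {
    n   <- length(x) - 1
    res <- x[-1] - Ffun(x[-(n + 1)], 1:n)                           # x_t - F(x_{t-1})
    1 / rgamma(1, shape = a0 + n/2, rate = 1/b0 + .5 * sum(res^2 / lambda))
  }

The draw for σ_v² in (6.182) and the λ_t draws in (6.183) follow the same pattern.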

For the states, x_t, we take a normal prior on the initial state, x_0 ∼ N(μ_0, σ_0^2), and then we use rejection sampling to conditionally generate a state value x_t, for t = 1, \ldots, n, as described in Example 6.19, equation (6.179). In this case, η_1(x_t) and η_2(x_t) are given in (6.177) and (6.178), respectively, with F_t and H_t given by (6.181), Θ = (α, β, γ)', Q = σ_w^2 and R = σ_v^2. Endpoints take some special consideration; we generate x_0 from a N(μ_0, σ_0^2) and accept it with probability η_1(x_0), and we generate x_{101} as usual and accept it with probability η_2(x_{101}). The last complete conditional depends on y_{101}, a latent data value not observed but instead generated according to its complete conditional, which is N(x_{101}^2/20, σ_v^2), because ω_{101} = 1.

The prior on Θ = (α, β, γ)' is taken to be trivariate normal with mean (μ_α, μ_β, μ_γ)' and diagonal variance–covariance matrix diag{σ_α^2, σ_β^2, σ_γ^2}. The necessary conditionals can be found using standard normal theory,


Figure 6.15 The state process, x_t (top), and the observations, y_t (bottom), for t = 1, \ldots, 100, generated from the model (6.180).

as done in (6.175). For example, the complete conditional distribution of α is of the form N(Bb, B), where

B^{-1} = \frac{1}{\sigma_\alpha^2} + \frac{1}{\sigma_w^2} \sum_{t=1}^{n} \frac{x_{t-1}^2}{\lambda_t}

and

b = \frac{\mu_\alpha}{\sigma_\alpha^2} + \frac{1}{\sigma_w^2} \sum_{t=1}^{n} \frac{x_{t-1}}{\lambda_t} \Big( x_t - \beta\frac{x_{t-1}}{1 + x_{t-1}^2} - \gamma\cos[1.2(t-1)] \Big).

The complete conditional for β has the same form, with

B^{-1} = \frac{1}{\sigma_\beta^2} + \frac{1}{\sigma_w^2} \sum_{t=1}^{n} \frac{x_{t-1}^2}{\lambda_t (1 + x_{t-1}^2)^2}

and

b = \frac{\mu_\beta}{\sigma_\beta^2} + \frac{1}{\sigma_w^2} \sum_{t=1}^{n} \frac{x_{t-1}}{\lambda_t (1 + x_{t-1}^2)} \big( x_t - \alpha x_{t-1} - \gamma\cos[1.2(t-1)] \big),


Figure 6.16 Estimated one-step-ahead prediction posterior density p(x_{101}|Y_{100}) of the state process for the nonlinear and non-normal model given by (6.180) using Gibbs sampling, G = 500.

and for γ the values are

B^{-1} = \frac{1}{\sigma_\gamma^2} + \frac{1}{\sigma_w^2} \sum_{t=1}^{n} \frac{\cos^2[1.2(t-1)]}{\lambda_t}

and

b = \frac{\mu_\gamma}{\sigma_\gamma^2} + \frac{1}{\sigma_w^2} \sum_{t=1}^{n} \frac{\cos[1.2(t-1)]}{\lambda_t} \Big( x_t - \alpha x_{t-1} - \beta\frac{x_{t-1}}{1 + x_{t-1}^2} \Big).

In this example, we put μ_0 = 0, σ_0^2 = 10, and a_0 = 3, b_0 = .05 (so the prior on σ_w^2 has mean and standard deviation equal to 10), and c_0 = 3, d_0 = .5 (so the prior on σ_v^2 has mean and standard deviation equal to one). The normal prior on Θ = (α, β, γ)' had corresponding mean vector equal to (μ_α = .5, μ_β = 25, μ_γ = 8)' and diagonal variance matrix equal to diag{σ_α^2 = .25, σ_β^2 = 10, σ_γ^2 = 4}. The Gibbs sampler ran for ℓ = 50 iterations with G = 500 parallel replications per iteration. We estimate the marginal posterior density of x_{101} as

\hat{p}(x_{101} \mid Y_{100}) = G^{-1} \sum_{g=1}^{G} N\big(x_{101} \mid [F_t(x_{t-1})]^{(g)},\, \lambda_{101}^{(g)} \sigma_w^{2(g)}\big),    (6.184)


where N(·|a, b) denotes the normal density with mean a and variance b, and

[F_t(x_{t-1})]^{(g)} = \alpha^{(g)} x_{t-1}^{(g)} + \beta^{(g)} x_{t-1}^{(g)}/(1 + x_{t-1}^{2(g)}) + \gamma^{(g)} \cos[1.2(t-1)].

The estimate, (6.184), with G = 500, is shown in Figure 6.16. Other aspects of the analysis, for example, the marginal posteriors of the elements of Θ, can be found in Carlin et al. (1992).

6.10 Stochastic Volatility

Recently, there has been considerable interest in stochastic volatility models. These models are similar to the ARCH models presented in Chapter 5, but they add a stochastic noise term to the equation for σ_t. Recall from §5.2 that a GARCH(1, 1) model for a return, which we denote here by r_t, is given by

r_t = \sigma_t \epsilon_t    (6.185)
\sigma_t^2 = \alpha_0 + \alpha_1 r_{t-1}^2 + \beta_1 \sigma_{t-1}^2,    (6.186)

where ε_t is Gaussian white noise. If we define

h_t = \log \sigma_t^2   and   y_t = \log r_t^2,

then (6.185) can be written as

y_t = h_t + \log \epsilon_t^2.    (6.187)

Equation (6.187) is considered the observation equation, and the stochastic variance h_t is considered to be an unobserved state process. Similar to (6.186), the volatility process follows, in its basic form, an autoregression,

h_t = \phi_0 + \phi_1 h_{t-1} + w_t,    (6.188)

where w_t is white Gaussian noise with variance σ_w^2.

Together, (6.187) and (6.188) make up the stochastic volatility model due to Harvey, Ruiz and Shephard (1994). If ε_t^2 had a log-normal distribution, (6.187)-(6.188) would form a Gaussian state-space model, and we could then use standard DLM results to fit the model to data. Unfortunately, y_t = log r_t^2 is rarely normal, so we typically keep the ARCH normality assumption on ε_t; in which case, log ε_t^2 is distributed as the log of a chi-squared random variable with one degree of freedom. This density is given by

f(x) = \frac{1}{\sqrt{2\pi}} \exp\Big\{ -\tfrac{1}{2}(e^x - x) \Big\},   -\infty < x < \infty,    (6.189)

and its mean and variance are −1.27 and π²/2, respectively; the density (6.189) is highly skewed with a long tail on the left (see Figure 6.18).
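A quick R sketch of (6.189), together with a simulation check of the stated mean and variance (nothing here beyond the density itself comes from the text):

  f <- function(x) exp(-.5 * (exp(x) - x)) / sqrt(2 * pi)   # density of log(chi^2_1)
  integrate(f, -Inf, Inf)$value                             # integrates to 1
  x <- log(rnorm(1e6)^2)                                    # simulated log chi^2_1 values
  c(mean(x), var(x), pi^2 / 2)                              # mean near -1.27, var near pi^2/2
  curve(f, -12, 4, n = 500)                                 # long left tail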


Various approaches to the fitting of stochastic volatility models have been examined; these methods include a wide range of assumptions on the observational noise process. A good summary of the proposed techniques, both Bayesian (via MCMC) and non-Bayesian approaches (such as quasi-maximum likelihood estimation and the EM algorithm), can be found in Jacquier et al. (1994) and Shephard (1996). Simulation methods for classical inference applied to stochastic volatility models are discussed in Danielson (1994) and Sandmann and Koopman (1998).

Kim, Shephard and Chib (1998) proposed modeling the log of a chi-squared random variable by a mixture of seven normals to approximate the first four moments of the observational error distribution; the mixture is fixed and no additional model parameters are added by using this technique. In an effort to keep matters simple, and perhaps somewhat more general (in that we allow the observational error dynamics to depend on parameters that will be fitted), our method of fitting stochastic volatility models is to retain the Gaussian state equation (6.188), but to write the observation equation, with y_t = log r_t^2, as

y_t = \alpha + h_t + \eta_t,    (6.190)

where η_t is white noise, whose distribution is a mixture of two normals, one centered at zero. In particular, we write

\eta_t = u_t z_{t0} + (1 - u_t) z_{t1},    (6.191)

where u_t is an iid Bernoulli process, Pr{u_t = 0} = π_0, Pr{u_t = 1} = π_1 (π_0 + π_1 = 1), z_{t0} ∼ iid N(0, σ_0^2), and z_{t1} ∼ iid N(μ_1, σ_1^2).

The advantage to this model is that it is easy to fit because it uses normality. In fact, the model equations (6.188) and (6.190)-(6.191) are similar to those presented in Pena and Guttman (1988), who used the idea to obtain a robust Kalman filter, and, as previously mentioned, in Kim, Shephard and Chib (1998). The material presented in §6.8 applies here, and in particular, the filtering equations for this model are

h_{t+1}^t = \phi_0 + \phi_1 h_t^{t-1} + \sum_{j=0}^{1} \pi_{tj} K_{tj} \epsilon_{tj},    (6.192)
P_{t+1}^t = \phi_1^2 P_t^{t-1} + \sigma_w^2 - \sum_{j=0}^{1} \pi_{tj} K_{tj}^2 \Sigma_{tj},    (6.193)
\epsilon_{t0} = y_t - \alpha - h_t^{t-1},    (6.194)
\epsilon_{t1} = y_t - \alpha - h_t^{t-1} - \mu_1,    (6.195)
\Sigma_{t0} = P_t^{t-1} + \sigma_0^2,    (6.196)
\Sigma_{t1} = P_t^{t-1} + \sigma_1^2,    (6.197)
K_{t0} = \phi_1 P_t^{t-1} / \Sigma_{t0},    (6.198)
K_{t1} = \phi_1 P_t^{t-1} / \Sigma_{t1}.    (6.199)


To complete the filtering, we must be able to assess the probabilities π_{t1} = Pr(u_t = 1 | y_1, \ldots, y_t), for t = 1, \ldots, n; of course, π_{t0} = 1 − π_{t1}. Let f_j(t | t−1) denote the conditional density of y_t given the past y_1, \ldots, y_{t−1}, and u_t = j (j = 0, 1). Then,

\pi_{t1} = \frac{\pi_1 f_1(t \mid t-1)}{\pi_0 f_0(t \mid t-1) + \pi_1 f_1(t \mid t-1)},    (6.200)

where we assume the distribution π_j, for j = 0, 1, has been specified a priori. If the investigator has no reason to prefer one state over another, the choice of uniform priors, π_1 = 1/2, will suffice. Unfortunately, it is computationally difficult to obtain the exact values of f_j(t | t−1); although we can give an explicit expression of f_j(t | t−1), the actual computation of the conditional density is prohibitive. A viable approximation, however, is to choose f_j(t | t−1) to be the normal density, N(h_t^{t−1} + μ_j, Σ_{tj}), for j = 0, 1 and μ_0 = 0; see §6.8 for details.

The innovations filter given in (6.192)-(6.199) can be derived from the Kalman filter by a simple conditioning argument. For example, to derive (6.192), we write

E(h_{t+1} \mid y_1, \ldots, y_t) = \sum_{j=0}^{1} E(h_{t+1} \mid y_1, \ldots, y_t, u_t = j)\,\Pr(u_t = j \mid y_1, \ldots, y_t)
  = \sum_{j=0}^{1} (\phi_0 + \phi_1 h_t^{t-1} + K_{tj}\epsilon_{tj})\, \pi_{tj}
  = \phi_0 + \phi_1 h_t^{t-1} + \sum_{j=0}^{1} \pi_{tj} K_{tj} \epsilon_{tj}.
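One time step of (6.192)-(6.200) can be written compactly in R; the following is a sketch under the normal approximation just described (with the approximate densities centered at α + h_t^{t−1} + μ_j, consistent with (6.190)), and all arguments are supplied by the user:

  sv.filter.step <- function(y, hp, Pp, phi0, phi1, alpha, mu1, s0, s1, sw, pi1 = .5) {
    e0 <- y - alpha - hp;  e1 <- e0 - mu1                            # (6.194), (6.195)
    S0 <- Pp + s0^2;       S1 <- Pp + s1^2                           # (6.196), (6.197)
    K0 <- phi1 * Pp / S0;  K1 <- phi1 * Pp / S1                      # (6.198), (6.199)
    f0 <- dnorm(y, alpha + hp, sqrt(S0)); f1 <- dnorm(y, alpha + hp + mu1, sqrt(S1))
    p1 <- pi1 * f1 / ((1 - pi1) * f0 + pi1 * f1); p0 <- 1 - p1       # (6.200)
    list(h = phi0 + phi1 * hp + p0 * K0 * e0 + p1 * K1 * e1,         # (6.192)
         P = phi1^2 * Pp + sw^2 - p0 * K0^2 * S0 - p1 * K1^2 * S1,   # (6.193)
         pi1 = p1)
  }
  # e.g., with values in the spirit of Table 6.4:
  sv.filter.step(y = -9, hp = 0, Pp = 1, phi0 = 0, phi1 = .99,
                 alpha = -9.6, mu1 = -2.3, s0 = 1.2, s1 = 2.7, sw = .1)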

Estimation of the parameters, Θ = (φ_0, φ_1, σ_0^2, μ_1, σ_1^2, σ_w^2)', is accomplished via MLE based on the likelihood given by

\ln L_Y(\Theta) = \sum_{t=1}^{n} \ln\Big( \sum_{j=0}^{1} \pi_j\, f_j(t \mid t-1) \Big),    (6.201)

where the density f_j(t | t−1) is approximated by the normal density, N(h_t^{t−1} + μ_j, σ_j^2), previously mentioned. We may consider maximizing (6.201) directly as a function of the parameters Θ using a Newton method, or we may consider applying the EM algorithm to the complete data likelihood.

Example 6.21 Analysis of the New York Stock Exchange Returns

Figure 6.17 shows the log of the squares of returns, y_t = log r_t^2, of 2000 daily observations of the NYSE previously displayed in Figure 1.4.


Figure 6.17 Graph of y_t = log r_t^2, where r_t is the daily return of the NYSE, 2000 observations.

Table 6.4 Estimation Results for the NYSE Fit

Parameter   Estimate   Estimated Standard Error
φ0          −.006      .016
φ1          .988       .007
σw          .091       .027
α           −9.607     1.266
σ0          1.220      .065
μ1          −2.292     .204
σ1          2.683      .105

Model (6.188) and (6.190)-(6.191), with π_0 and π_1 fixed at .5, was fit to the data using a quasi-Newton–Raphson method to maximize (6.201). The results are given in Table 6.4. Figure 6.18 compares the density of the log of a χ_1^2 with the fitted normal mixture; we note the data indicate a substantial amount of probability in the upper tail that the log-χ_1^2 distribution misses.

Finally, Figure 6.19 shows y_t for 800 ≤ t ≤ 1000, which includes the crash of October 19, 1987, with y_t^{t−1} = α + h_t^{t−1} superimposed on the graph; compare with Figure 5.6. Also displayed are error bounds.

It is possible to use the bootstrap procedure described in §6.7 for the stochastic volatility model, with some minor changes. The following procedure was described in Stoffer and Wall (2004). We develop a vector first-order equation, as was done in (6.117). First, using (6.194)-(6.195), and noting that

392 State-Space Models

Figure 6.18 Density of the log of a χ21 as given by (6.189) (solid line) and the

fitted normal mixture (dashed line) form the NYSE example.

yt = πt0yt + πt1yt, we may write

yt = α + ht−1t + πt0εt0 + πt1(εt1 + µ1). (6.202)

Consider the standardized innovations

etj = Σ−1/2tj εtj , j = 0, 1, (6.203)

and define the 2 × 1 vector

eeet =[

et0et1

].

Also, define the 2 × 1 vector

ξξξt =[

htt+1yt

].

Combining (6.192) and (6.202) results in a vector first-order equation for ξξξt

given byξξξt = Fξξξt−1 + Gt + Hteeet, (6.204)

where

F =[

φ1 01 0

], Gt =

[φ0

α + πt1µ1

], Ht =

[πt0Kt0Σ

1/2t0 πt1Kt1Σ

1/2t1

πt0Σ1/2t0 πt1Σ

1/2t1

].

Page 405: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

6.10: Stochastic Volatility 393

800 850 900 950 1000

−15

−10

−5

log(Squared NYSE Returns)

time

800 850 900 950 1000

−1.5

−0.5

0.5

1.5

Predicted log−Volatility

time

Figure 6.19 Two hundred observations of yt = log r2t , for 801 ≤ t ≤ 1000,

where rt is the daily return of the NYSE (top). Corresponding one-step-aheadpredicted log volatility, log σ2

t , with ±2 standard prediction errors (bottom).

Hence, the steps in bootstrapping for this case are the same as steps 1 through5 described in §5.7, but with (6.117) replaced by the following first-order equa-tion:

ξξξ∗t = F (Θ)ξξξ∗

t−1 + Gt(Θ; πt1) + Ht(Θ; πt1)eee∗t , (6.205)

where Θ = (φ0, φ1, σ20 , α, µ1, σ

21 , σ2

w)′ is the MLE of Θ, and πt1 is estimated via(6.200), replacing f1(t

∣∣ t − 1) and f0(t∣∣ t − 1) by their respective estimated

normal densities (πt0 = 1 − πt1).

Example 6.22 Analysis of the U.S. GNP Growth Rate

In Example 5.3, we fit an ARCH model to the U.S. GNP growth rate. In this example, we will fit a stochastic volatility model to the residuals from the MA(2) fit on the growth rate (see Example 3.35).

Figure 6.20 shows the log of the squared residuals, say y_t, from the MA(2) fit on the U.S. GNP series. The stochastic volatility model (6.187)-(6.191) was then fit to y_t. Table 6.5 shows the MLEs of the model


Figure 6.20 Log of the squared residuals from an MA(2) fit on GNP growth rate.

parameters along with their asymptotic SEs assuming the model is correct. Also displayed in Table 6.5 are the means and SEs of B = 500 bootstrapped samples. There is some amount of agreement between the asymptotic values and the bootstrapped values. The interest here, however, is not so much in the SEs, but in the actual sampling distribution of the estimates. For example, Figure 6.21 compares the bootstrap histogram and asymptotic normal distribution of φ_1. In this case, the bootstrap distribution exhibits positive kurtosis and skewness, which is missed by the assumption of asymptotic normality.

6.11 State-Space and ARMAX Models for Longitudinal Data Analysis

In some studies, we may observe several independent k-dimensional time series, say, y_{tℓ}, for ℓ = 1, \ldots, N. For example, a new treatment may be given to N patients with high blood pressure, and the systolic and diastolic blood pressures (SBP and DBP) are recorded at equal time intervals, for some time, using an ambulatory device. We may think of y_{tℓ} as being the bivariate, k = 2, recordings of SBP and DBP at time t for person ℓ. It is also reasonable to assume, in this example, exogenous variables may have been collected on each


Figure 6.21 Bootstrap histogram and asymptotic distribution of φ_1 for the U.S. GNP example.

Table 6.5 Estimates and Their Asymptotic and Bootstrap Standard Errors for U.S. GNP Example.

Parameter   MLE        Asymptotic SE   Bootstrap Mean†   Bootstrap SE†
φ0          .068       .274            −.010             .353
φ1          .900       .099            .864              .102
σw          .378       .208            .696              .375
α           −10.524    2.321           −10.792           .748
μ1          −2.164     .567            −1.941            .416
σ1          3.007      .377            2.891             .422
σ0          .935       .198            .692              .362

† Based on 500 bootstrapped samples.

subject to help explain the variation in blood pressure (for example, gender, race, age, activity, and so on). We might expect to encounter missing data or irregularly spaced observations in this type of experiment; these problems are easier to handle from a state-space perspective.

An extension of the ARMAX model given in (6.103) that might handle the case of cross-sectional data, y_{tℓ}, is

y_{t\ell} = \Gamma u_{t\ell} + \sum_{j=1}^{p} \Phi_j y_{t-j,\ell} + \sum_{j=1}^{q} \Theta_j w_{t-j,\ell} + w_{t\ell},    (6.206)

where, for ℓ = 1, \ldots, N, var(w_{tℓ}) = Σ_w and u_{tℓ} represents the r × 1 vector of exogenous variables. As in §6.6, Property P6.6, we can write (6.206) in terms


of a state-space model. That is, for ℓ = 1, \ldots, N,

x_{t+1,\ell} = F x_{t,\ell} + G w_{t,\ell},    (6.207)
y_{t,\ell} = [I, 0, \cdots, 0]\, x_{t\ell} + \Gamma u_{t\ell} + w_{t\ell},    (6.208)

where matrices F and G are as in (6.104), x_{t,ℓ} represents the unobserved state, and y_{t,ℓ} is the observation at time t, replication ℓ. Maximum likelihood estimation for state-space models with cross-sectional data, such as the example given here, was investigated by Goodrich and Caines (1979), and can be carried out with minor modifications to the methods described in §6.3. In particular, given data y_{t,ℓ}, t = 1, \ldots, n, ℓ = 1, \ldots, N, we can use Newton–Raphson to minimize the criterion function, which is, up to a constant term, proportional to the negative of the log likelihood function,

l(\Theta) = N^{-1} \sum_{\ell=1}^{N} \Big( \sum_{t=1}^{n} \log\big|\Sigma_{t,\ell}(\Theta)\big| + \sum_{t=1}^{n} \epsilon_{t,\ell}(\Theta)'\, \Sigma_{t,\ell}(\Theta)^{-1}\, \epsilon_{t,\ell}(\Theta) \Big),    (6.209)

where ε_{t,ℓ}(Θ) and Σ_{t,ℓ}(Θ) are the innovations and their variance–covariance matrices, respectively. For details, see Goodrich and Caines (1979).

Anderson (1978) did an extensive study of replicated ARX models, that is, the case in which q = 0 in (6.206). We can write this model using regression notation as

y_{t\ell} = B z_{t\ell} + w_{t\ell},    (6.210)

for ℓ = 1, \ldots, N and t = p + 1, \ldots, n, where

z_{t\ell} = (u_{t\ell}', y_{t-1,\ell}', \ldots, y_{t-p,\ell}')'    (6.211)

and the matrix of regression coefficients is

B = [\Gamma, \Phi_1, \Phi_2, \ldots, \Phi_p].    (6.212)

The estimate of the regression matrix B in this case is

\hat{B} = \Big( \sum_{\ell=1}^{N} \sum_{t=p+1}^{n} y_{t\ell} z_{t\ell}' \Big) \Big( \sum_{\ell=1}^{N} \sum_{t=p+1}^{n} z_{t\ell} z_{t\ell}' \Big)^{-1},    (6.213)

and an estimate of Σ_w is

\hat\Sigma_w = \frac{1}{N(n-p)} \sum_{\ell=1}^{N} \sum_{t=p+1}^{n} (y_{t\ell} - \hat{B} z_{t\ell})(y_{t\ell} - \hat{B} z_{t\ell})'.    (6.214)

Inference for B follows as in multivariate regression. That is, the large sample standard error of the ij-th element of \hat{B} is \sqrt{\hat\sigma_{jj}\, c_{ii}}, where \hat\sigma_{jj} is the j-th diagonal element of \hat\Sigma_w and c_{ii} is the i-th diagonal element of

\Big( \sum_{\ell=1}^{N} \sum_{t=p+1}^{n} z_{t\ell} z_{t\ell}' \Big)^{-1}.
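A compact R sketch of (6.213)-(6.214); the arrays y (n × k × N) and u (n × r × N) are hypothetical, and p is the ARX order:

  arx.pooled <- function(y, u, p) {
    n <- dim(y)[1]; N <- dim(y)[3]; SYZ <- SZZ <- 0
    zfun <- function(t, l) c(u[t, , l], as.vector(t(y[(t - 1):(t - p), , l])))  # z_{tl}, (6.211)
    for (l in 1:N) for (t in (p + 1):n) {
      z <- zfun(t, l)
      SYZ <- SYZ + y[t, , l] %o% z;  SZZ <- SZZ + z %o% z
    }
    B <- SYZ %*% solve(SZZ)                                    # (6.213)
    Sw <- 0
    for (l in 1:N) for (t in (p + 1):n) {
      e <- y[t, , l] - drop(B %*% zfun(t, l));  Sw <- Sw + e %o% e
    }
    list(B = B, Sigma.w = Sw / (N * (n - p)), C = solve(SZZ))  # (6.214); C supplies the c_ii
  }

Here the matrix C returned by the function supplies the diagonal elements c_ii used in the standard errors just described.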


Model (6.206) may be somewhat restrictive in its assumption that the parameters do not change over time. Because replications exist, extending the model to the case of time-varying parameters is easy. The case of time-varying parameters in (6.210) was also presented in Anderson (1978). In particular, the model is written as

y_{t\ell} = \Gamma_t u_{t\ell} + \sum_{j=1}^{p_t} \Phi_{tj}\, y_{t-j,\ell} + w_{t\ell},    (6.215)

and var(w_{tℓ}) = Σ_t, for ℓ = 1, \ldots, N. The order of the model, p_t, is also allowed to vary with time, and the equal spacing of time is not required. Of course, we can still use regression for estimation because the time-varying model can be written as n regressions, one for each point in time,

y_{t\ell} = B_t z_{t\ell} + w_{t\ell},    (6.216)

for ℓ = 1, \ldots, N, where z_{tℓ} is as in (6.211), but with p replaced by p_t, and where now,

B_t = [\Gamma_t, \Phi_{t1}, \Phi_{t2}, \ldots, \Phi_{t p_t}],    (6.217)

assuming t > p_t. The estimate of B_t, for any time t, is now given by

\hat{B}_t = \Big( \sum_{\ell=1}^{N} y_{t\ell} z_{t\ell}' \Big) \Big( \sum_{\ell=1}^{N} z_{t\ell} z_{t\ell}' \Big)^{-1},    (6.218)

and an estimate of Σ_t is

\hat\Sigma_t = \frac{1}{N - p_t - 1} \sum_{\ell=1}^{N} (y_{t\ell} - \hat{B}_t z_{t\ell})(y_{t\ell} - \hat{B}_t z_{t\ell})'.    (6.219)

Example 6.23 The Effect of Prenatal Smoking on Growth

In this example, we use data taken from an epidemiologic study at theUniversity of Pittsburgh that focused on the effects of substance useduring pregnancy. In particular, we focus on the growth of N = 318children followed from birth to six years of age. In this longitudinal study,the children were examined at birth (t = 0), and at eight months (t = 1),18 months (t = 2), 36 months (t = 3), and 72 months (t = 4) of age.At times t = 1, 2, 3, 4, a growth index, say, yt�, was calculated for eachchild = 1, . . . , 318. The growth index is essentially a standardized scorefor a child’s weight adjusting for that child’s age, gender, and height,against the national averages. At birth, y0� represents the standardizedbirthweight of child .

Figure 6.22 Average growth scores across time for four groups of children. A solid line represents children not prenatally exposed to cigarette smoke; a dashed line represents children prenatally exposed to cigarette smoke. A circle represents white children, and a cross represents black children.

We might consider that children not prenatally exposed to teratogens would follow a certain growth curve, whereas exposed children would follow another. To investigate this hypothesis, we propose the following time-varying ARX model for growth:

\[
y_{t\ell} = \gamma_{0t} + \gamma_{1t} S_\ell + \gamma_{2t} R_\ell + \gamma_{3t} S_\ell R_\ell + \sum_{j=1}^{t} \phi_{tj}\,\bigl(y_{t-j,\ell} - \widehat{y}_{t-j,\ell}\bigr) + w_{t\ell}, \tag{6.220}
\]

for t = 0, 1, 2, 3, 4, where var(w_{tℓ}) = σ_t², for ℓ = 1, ..., 318. The exogenous variables in the model are S_ℓ, the average number of cigarettes per day the mother smoked during the second trimester of pregnancy, and R_ℓ, which indicates race (0 = black, 1 = white). The model is written in terms of the innovation sequences, (y_{t-j,ℓ} − ŷ_{t-j,ℓ}), where ŷ_{t,ℓ} is the prediction of y_{t,ℓ} from the previous model. We did this to remove any effect of smoking or race on previous growth. Figure 6.22 shows the average growth scores over time for four groups: 68 black children not exposed to smoke prenatally (solid line-cross), 92 white children not exposed to smoke (solid line-circle), 83 black children exposed to smoke (dashed line-cross), and 75 white children exposed to smoke (dashed line-circle). For display purposes in Figure 6.22, smoking has been dichotomized to no exposure versus any exposure, but in the analysis, the smoking variable is in average cigarettes per day.

Page 411: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

6.12: Analysis of Longitudinal Data 399

For example, the model for birthweight, t = 0, is

\[
y_{0\ell} = \gamma_{00} + \gamma_{10} S_\ell + \gamma_{20} R_\ell + \gamma_{30} S_\ell R_\ell + w_{0\ell}.
\]

Once the model has been estimated, the predicted values are calculated as

\[
\widehat{y}_{0\ell} = \widehat\gamma_{00} + \widehat\gamma_{10} S_\ell + \widehat\gamma_{20} R_\ell + \widehat\gamma_{30} S_\ell R_\ell.
\]

Then, the model for growth at eight months, t = 1, is

\[
y_{1\ell} = \gamma_{01} + \gamma_{11} S_\ell + \gamma_{21} R_\ell + \gamma_{31} S_\ell R_\ell + \phi_{11}\bigl(y_{0,\ell} - \widehat{y}_{0,\ell}\bigr) + w_{1\ell},
\]

where (y_{0,ℓ} − ŷ_{0,ℓ}) represents birthweight with the effect of smoking and race removed. In this way, only S_ℓ represents smoking and R_ℓ represents race, because their effect on birthweight has been removed. The other cases, for t = 2, 3, 4, continue in the same way.

The following estimates are the results of the fit; we only report the final models. At birth,

\[
\widehat{y}_{0\ell} = 3.295 - .011_{(.002)} S_\ell + .215_{(.056)} R_\ell,
\]

with σ̂_0 = .472; estimated standard errors are shown in parentheses. We conclude that prenatal smoking significantly reduces birthweight, white babies are born slightly bigger, and no interaction exists between smoking and race. At eight months,

\[
\widehat{y}_{1\ell} = -.015_{(.011)} S_\ell - .335_{(.147)} R_\ell + .029_{(.012)} S_\ell R_\ell + .214_{(.127)}\bigl(y_{0,\ell} - \widehat{y}_{0,\ell}\bigr),
\]

with σ̂_1 = 1.066. The interaction term is significant, indicating that white, unexposed babies are slightly smaller than the others.

The estimated model for 18 months is

\[
\widehat{y}_{2\ell} = .340 + .278_{(.125)} R_\ell + .661_{(.056)}\bigl(y_{1,\ell} - \widehat{y}_{1,\ell}\bigr) + .357_{(.126)}\bigl(y_{0,\ell} - \widehat{y}_{0,\ell}\bigr),
\]

with σ̂_2 = 1.059. Now, the effect of prenatal smoking is gone at 18 months, and, at this age, the white kids tend to be larger. The result at 36 months (t = 3) is that prenatal smoking becomes significant again, but exposed children are slightly bigger at this age, and race is no longer significant (this result is not as unusual as it might seem; in fact, it has been hypothesized that children exposed prenatally to cigarette smoke tend to become obese as they grow older):

\[
\widehat{y}_{3\ell} = .334 + .008_{(.004)} S_\ell + .310_{(.044)}\bigl(y_{2,\ell} - \widehat{y}_{2,\ell}\bigr) + .450_{(.043)}\bigl(y_{1,\ell} - \widehat{y}_{1,\ell}\bigr) + .465_{(.098)}\bigl(y_{0,\ell} - \widehat{y}_{0,\ell}\bigr),
\]

with σ̂_3 = .817. Finally, the result for 72 months is

\[
\widehat{y}_{4\ell} = .330 + .933_{(.082)}\bigl(y_{3,\ell} - \widehat{y}_{3,\ell}\bigr) + .462_{(.063)}\bigl(y_{2,\ell} - \widehat{y}_{2,\ell}\bigr) + .484_{(.062)}\bigl(y_{1,\ell} - \widehat{y}_{1,\ell}\bigr),
\]

with σ̂_4 = 1.176. At this age, the effect of prenatal smoking and the effect of race are gone. Also, growth at eight months (t = 1) is still a predictor of growth at 72 months, but the effect of birthweight (t = 0) is gone.
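Because each stage of (6.220) is an ordinary regression on the exogenous variables and the innovations from earlier stages, the sequence of fits can be reproduced with lm(). The sketch below is schematic only; the study data are not public, so we assume a hypothetical data frame growth with columns y0, ..., y4, S, and R (our names, not the text's).

# Stepwise fits behind (6.220); each stage regresses on S, R, their
# interaction, and the innovations (residuals) from the earlier stages.
fit0 <- lm(y0 ~ S * R, data = growth)                      # birthweight, t = 0
growth$e0 <- resid(fit0)                                   # y0 - yhat0
fit1 <- lm(y1 ~ S * R + e0, data = growth)                 # eight months, t = 1
growth$e1 <- resid(fit1)
fit2 <- lm(y2 ~ S * R + e1 + e0, data = growth)            # 18 months, t = 2
growth$e2 <- resid(fit2)
fit3 <- lm(y3 ~ S * R + e2 + e1 + e0, data = growth)       # 36 months, t = 3
growth$e3 <- resid(fit3)
fit4 <- lm(y4 ~ S * R + e3 + e2 + e1 + e0, data = growth)  # 72 months, t = 4
summary(fit4)   # terms that are not significant would be dropped to arrive
                # at the final reported models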

Mixed Linear Models in State-Space Form

A widely used general mixed model for longitudinal data was introduced by Laird and Ware (1982). In this case, responses y_ℓ = {y_{t,ℓ}, t = 1, ..., n_ℓ} are obtained on N subjects, ℓ = 1, ..., N. Each response vector is modeled as

\[
y_\ell = X_\ell \beta + Z_\ell \gamma_\ell + \varepsilon_\ell, \tag{6.221}
\]

where X_ℓ is an n_ℓ × b design matrix, β is a b × 1 vector of fixed parameters, and Z_ℓ is an n_ℓ × g design matrix corresponding to the random g × 1 vector of random effects, γ_ℓ, which is assumed to be independent across subjects and distributed as γ_ℓ ~ N(0, D), where D > 0 is an arbitrary variance–covariance matrix. The within-subject errors, ε_ℓ, are independently distributed as ε_ℓ ~ N(0, Σ_ℓ); often, Σ_ℓ is of the form σ²I. A good introduction to these models can be found in many texts; for example, Diggle et al. (1994), Jones (1993), and Fahrmeir and Tutz (1994). Jones (1993) focuses on the state-space approach, and so will we.

The model, (6.221), can be written as

\[
y_\ell \sim N(X_\ell \beta,\; V_\ell), \tag{6.222}
\]

independently, for ℓ = 1, ..., N, where

\[
V_\ell = Z_\ell D Z_\ell' + \Sigma_\ell. \tag{6.223}
\]

An example of a typical covariance structure for V_ℓ is compound symmetry, wherein g = 1, Z_ℓ is a vector of ones, D = σ_γ² is a scalar, and Σ_ℓ = σ²I. In this way, V_ℓ is an n_ℓ × n_ℓ matrix given by

\[
V_\ell = \begin{pmatrix}
\sigma^2 + \sigma_\gamma^2 & \sigma_\gamma^2 & \cdots & \sigma_\gamma^2 \\
\sigma_\gamma^2 & \sigma^2 + \sigma_\gamma^2 & \cdots & \sigma_\gamma^2 \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_\gamma^2 & \sigma_\gamma^2 & \cdots & \sigma^2 + \sigma_\gamma^2
\end{pmatrix}. \tag{6.224}
\]

Another useful covariance structure is the autoregressive structure, where g = 0 (that is, no random effects exist) and

\[
V_\ell = \Sigma_\ell = \sigma^2 \begin{pmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{n_\ell-1} \\
\rho & 1 & \rho & \cdots & \rho^{n_\ell-2} \\
\vdots & \vdots & & \ddots & \vdots \\
\rho^{n_\ell-1} & \rho^{n_\ell-2} & \cdots & & 1
\end{pmatrix}, \tag{6.225}
\]

with |ρ| < 1.

For a particular subject, ℓ, the vector y_ℓ consists of observations, y_{tℓ}, taken over time t = 1, 2, ..., n_ℓ. For subject ℓ, model (6.221) states

\[
y_{t\ell} = x_{t\ell}'\beta + z_{t\ell}'\gamma_\ell + \varepsilon_{t\ell}, \tag{6.226}
\]

where x_{tℓ}' is the t-th row of X_ℓ and z_{tℓ}' is the t-th row of Z_ℓ. Using the form of the model given by (6.226), y_{tℓ} is normal with

\[
E(y_{t\ell}) = x_{t\ell}'\beta, \qquad
\mathrm{cov}(y_{t\ell}, y_{s\ell}) = z_{t\ell}' D z_{s\ell} + \sigma_{\ell,ts}, \qquad
\mathrm{cov}(y_{t\ell}, y_{sk}) = 0, \ \ \ell \neq k,
\]

where σ_{ℓ,ts} is the ts-th element of Σ_ℓ. For the example given in (6.224), we would have

\[
\mathrm{var}(y_{t\ell}) = \sigma^2 + \sigma_\gamma^2 \quad\text{and}\quad \mathrm{cov}(y_{t\ell}, y_{s\ell}) = \sigma_\gamma^2,
\]

for any t ≠ s, so the correlation between two observations on the same subject is given by ρ = σ_γ²/(σ² + σ_γ²). In the autoregressive case, (6.225), the correlation between two observations y_{tℓ} and y_{sℓ} on the same subject t − s time units apart is, of course, ρ^{|t−s|}.
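Both covariance structures are easy to construct numerically, which can help when checking a fitted model or choosing starting values. The following small R sketch builds V_ℓ under compound symmetry (6.224) and under the AR(1) structure (6.225); the parameter values are illustrative only.

n_l <- 4; sigma2 <- 1; sigma2_g <- 0.5; rho <- 0.6

# compound symmetry (6.224): V = sigma2 * I + sigma2_g * (matrix of ones)
V_cs <- diag(sigma2, n_l) + matrix(sigma2_g, n_l, n_l)

# AR(1) structure (6.225): V[t, s] = sigma2 * rho^|t - s|
V_ar <- sigma2 * rho^abs(outer(1:n_l, 1:n_l, "-"))

cov2cor(V_cs)   # constant correlation sigma2_g / (sigma2 + sigma2_g)
cov2cor(V_ar)   # correlation decays as rho^|t - s|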

The Laird–Ware model has a state-space formulation; Jones (1993) provides a detailed presentation of these and related topics. If random effects exist, that is, g ≥ 1, and Σ_ℓ = σ²I, let s_{t,ℓ} denote a g × 1 state vector with initial condition s_{0,ℓ} ~ N(0, D). Then, for each ℓ = 1, ..., N, (6.226) can be written as

\[
s_{t,\ell} = s_{t-1,\ell} + w_{t,\ell}, \tag{6.227}
\]
\[
y_{t\ell} = x_{t\ell}'\beta + z_{t\ell}'\, s_{t,\ell} + \varepsilon_{t\ell}, \tag{6.228}
\]

for t = 1, ..., n_ℓ, where w_{t,ℓ} ≡ 0, or, equivalently, w_{t,ℓ} ~ N(0, Q), where Q = 0 is the zero matrix. All other values are as defined in (6.226). The data y_{tℓ} as written in (6.227)-(6.228) have the same properties as the data written in (6.226).

If g = 0, that is, no random effects exist, and the variance–covariance structure is autoregressive, as in (6.225), the state-space model can be written as

\[
s_{t\ell} = \rho\, s_{t-1,\ell} + w_{t,\ell}, \tag{6.229}
\]
\[
y_{t\ell} = x_{t\ell}'\beta + s_{t\ell}, \tag{6.230}
\]

where, now, the autoregressive structure is entered into the data via the (scalar, in this example) state, and there is no measurement error. In this case, R = 0, which does not present a problem in running the Kalman filter, provided P_0^0 > 0. To obtain a matrix of the form given in (6.225), w_{tℓ} is white Gaussian noise with Q = σ², and the initial state satisfies s_{0,ℓ} ~ N(0, σ²/(1 − ρ²)). In this case, recall the states, s_{tℓ}, for a given subject ℓ, form a stationary AR(1) process with variance σ²/(1 − ρ²) and ACF given by ρ(h) = ρ^{|h|}.
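To make the filtering recursions concrete, here is a minimal R sketch of the Kalman filter for a single subject under (6.229)-(6.230): a scalar state with Q = σ², no measurement error (R = 0), and initial state variance σ²/(1 − ρ²). It is only an illustrative loop under those assumptions, not the general filtering routine referred to in the text.

# Negative log likelihood contribution (up to constants) of one subject.
# y: n_l observations; X: n_l x b fixed-effects design matrix.
kf_ar1 <- function(y, X, beta, rho, sigma2) {
  n <- length(y)
  s <- 0; P <- sigma2 / (1 - rho^2)          # initial state mean and variance
  nll <- 0
  for (t in 1:n) {
    sp  <- rho * s                           # predicted state
    Pp  <- rho^2 * P + sigma2                # predicted state variance
    e   <- y[t] - sum(X[t, ] * beta) - sp    # innovation
    Sig <- Pp                                # innovation variance (R = 0)
    K   <- Pp / Sig                          # gain (equals 1 when R = 0)
    s   <- sp + K * e                        # filtered state
    P   <- (1 - K) * Pp                      # filtered state variance
    nll <- nll + 0.5 * (log(Sig) + e^2 / Sig)
  }
  nll
}

Summing this quantity over subjects gives, up to constants, the criterion corresponding to (6.209), which could then be minimized numerically (for example, with optim) over β, ρ, and σ².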


In the more general case in which both random effects, g > 0, and an autoregressive error structure exist, we can combine the ideas used to get (6.227)-(6.228) and (6.229)-(6.230). In this case, the state equation would be a (g+1) × 1 process made by stacking (6.227) and (6.229), and the observation equation would be

\[
y_{t\ell} = x_{t\ell}'\beta + A_t s_{t\ell},
\]

where A_t = [z_{tℓ}', 1].

We immediately see from (6.227)-(6.228), or from (6.229)-(6.230), that the likelihood of the data is the same as the one given in (6.209), but with n set to n_ℓ. Consequently, the methods presented in §6.3 can be used to estimate the parameters of the Laird–Ware model, namely, β, and the variance components in V_ℓ, for ℓ = 1, ..., N. For simplicity, let Θ represent the vector of all of the parameters associated with the model.

In the notation of the algorithm presented in §6.3, Step 1 is to find initial estimates, Θ^{(0)}, of the parameters Θ. If the V_ℓ were known, using a weighted least squares argument (see §4.4), the least squares estimate of β in the model (6.222)-(6.223) is given by

\[
\widehat\beta = \left(\sum_{\ell=1}^{N} X_\ell' V_\ell^{-1} X_\ell\right)^{-1}\left(\sum_{\ell=1}^{N} X_\ell' V_\ell^{-1} y_\ell\right). \tag{6.231}
\]

Initial guesses for V_ℓ should reflect the variance–covariance structure of the model. We can use (6.231) with the initial values chosen for V_ℓ to obtain the initial regression coefficients, β^{(0)}.
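A direct R translation of (6.231) is short; the sketch below assumes (our layout) that the design matrices, responses, and working covariance matrices are stored subject by subject in lists X, y, and V.

# Weighted least squares estimate (6.231) given working covariances V[[l]].
gls_beta <- function(X, y, V) {
  A <- 0; b <- 0
  for (l in seq_along(y)) {
    Vinv <- solve(V[[l]])
    A <- A + t(X[[l]]) %*% Vinv %*% X[[l]]
    b <- b + t(X[[l]]) %*% Vinv %*% y[[l]]
  }
  solve(A, b)   # beta-hat
}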

To accomplish Step 2 of the algorithm, for each ℓ = 1, ..., N, run the Kalman filter (Property P6.1 with the states denoted by s_t) for t = 1, ..., n_ℓ to obtain the initial innovations and their covariance matrices. For example, if the model is of the form given in (6.227)-(6.228), run the Kalman filter with Φ = I, Q = 0, A_t = z_{tℓ}', R = [σ^{(0)}]², and initial conditions s_0^0 = 0, P_0^0 = D^{(0)} > 0. In addition, y_{tℓ} is replaced by y_{tℓ} − x_{tℓ}'β^{(0)}; this is also equivalent to running Property P6.6 with uncorrelated noises, wherein the rows of the fixed effects design matrix, X_ℓ, are the exogenous variables. The Newton–Raphson procedure (Steps 3 and 4 of the algorithm in §6.3) is performed on the criterion function given in (6.209). The following example may help in understanding the technique.

Example 6.24 Response to Medication

As a simple example of how we can use the state-space formulation of the Laird–Ware model, we analyze the S+ data set drug.mult. The data are taken from an experiment in which six subjects are given a dose of medication and then observed immediately and at weekly intervals for three weeks. The data are given in Table 6.6.


Table 6.6 Weekly Response to Medication

                        Week 0   Week 1   Week 2   Week 3
  Subject    Gender       y1       y2       y3       y4
     1         F         75.9     74.3     80.0     78.9
     2         F         78.3     75.5     79.6     79.2
     3         F         80.3     78.2     80.4     76.2
     4         M         80.7     77.2     82.0     83.8
     5         M         80.3     78.6     81.4     81.5
     6         M         80.1     81.1     81.9     86.4

We fit model (6.229)-(6.230) to these data using gender as a grouping variable. In particular, if y_ℓ is the 4 × 1 vector of observations over time for a female (ℓ = 1, 2, 3), the model is

\[
\begin{pmatrix} y_1\\ y_2\\ y_3\\ y_4 \end{pmatrix} =
\begin{pmatrix} 1 & 0\\ 1 & 0\\ 1 & 0\\ 1 & 0 \end{pmatrix}
\begin{pmatrix} \beta_1\\ \beta_2 \end{pmatrix} +
\begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ \varepsilon_4 \end{pmatrix},
\]

and for a male (ℓ = 4, 5, 6), the model is

\[
\begin{pmatrix} y_1\\ y_2\\ y_3\\ y_4 \end{pmatrix} =
\begin{pmatrix} 1 & 1\\ 1 & 1\\ 1 & 1\\ 1 & 1 \end{pmatrix}
\begin{pmatrix} \beta_1\\ \beta_2 \end{pmatrix} +
\begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ \varepsilon_4 \end{pmatrix},
\]

where the ε_t, in general, form an AR(1) process given by

\[
\varepsilon_0 = w_0/\sqrt{1-\rho^2}, \qquad \varepsilon_t = \rho\,\varepsilon_{t-1} + w_t, \quad t = 1, 2, 3, 4,
\]

where w_t is white Gaussian noise with var(w_t) = σ_w². Recall var(ε_t) = σ_ε² = σ_w²/(1 − ρ²) and ρ_ε(h) = ρ^{|h|}. A different value of ρ was selected for each gender group, say, ρ_1 for female subjects and ρ_2 for male subjects.

We initialized the estimation procedure with ρ_1^{(0)} = ρ_2^{(0)} = 0, σ_w^{(0)} = 2, which, upon using (6.231), yields β^{(0)} = (78.07, 3.18)'. The final estimates (and their estimated standard errors) were

\[
\widehat\beta_1 = 78.20\,(.56), \quad \widehat\beta_2 = 3.24\,(.89),
\]
\[
\widehat\rho_1 = -.47\,(.36), \quad \widehat\rho_2 = .07\,(.53), \quad \widehat\sigma_w = 2.17\,(.36).
\]

Because ρ̂_1 and ρ̂_2 are not significantly different from zero, this would suggest either a simple linear regression is sufficient to describe the results, or the model is not correct.


Next, we fit the compound symmetry model using (6.227)-(6.228) with g = 1. In this case, the model for a female subject is

\[
\begin{pmatrix} y_1\\ y_2\\ y_3\\ y_4 \end{pmatrix} =
\begin{pmatrix} 1 & 0\\ 1 & 0\\ 1 & 0\\ 1 & 0 \end{pmatrix}
\begin{pmatrix} \beta_1\\ \beta_2 \end{pmatrix} +
\begin{pmatrix} 1\\ 1\\ 1\\ 1 \end{pmatrix}\gamma_1 +
\begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ \varepsilon_4 \end{pmatrix},
\]

and for a male subject, the model is

\[
\begin{pmatrix} y_1\\ y_2\\ y_3\\ y_4 \end{pmatrix} =
\begin{pmatrix} 1 & 1\\ 1 & 1\\ 1 & 1\\ 1 & 1 \end{pmatrix}
\begin{pmatrix} \beta_1\\ \beta_2 \end{pmatrix} +
\begin{pmatrix} 1\\ 1\\ 1\\ 1 \end{pmatrix}\gamma_2 +
\begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ \varepsilon_4 \end{pmatrix},
\]

where γ_1 ~ N(0, σ_{γ_1}²), γ_2 ~ N(0, σ_{γ_2}²), and the ε_t, for t = 1, 2, 3, 4, are uncorrelated with variance σ_ε².

In this case, the state variable is a scalar process with D = σ_{γ_1}² for female subjects (ℓ = 1, 2, 3) and D = σ_{γ_2}² for male subjects (ℓ = 4, 5, 6). Starting the estimation process off with σ_{γ_1}^{(0)} = σ_{γ_2}^{(0)} = 1, σ_ε^{(0)} = 2, and β^{(0)} = (78, 3)', the final estimates were

\[
\widehat\beta_1 = 78.03\,(.67), \quad \widehat\beta_2 = 3.51\,(1.05),
\]
\[
\widehat\sigma_{\gamma_1} = 2.05\,(.45), \quad \widehat\sigma_{\gamma_2} = 2.51\,(.59), \quad \widehat\sigma_\varepsilon = 2.00\,(.13).
\]

This model fits the data well.
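As a rough cross-check of Example 6.24, the two covariance structures can also be fit with the nlme package rather than through the state-space formulation used in the text. The sketch below reshapes Table 6.6 to long format (the column names are ours); note that gls/lme here use a single ρ and a single random-intercept variance across the two gender groups, so the output will not reproduce the separate ρ_1, ρ_2 or σ_{γ_1}, σ_{γ_2} estimates reported above.

library(nlme)
drug <- data.frame(
  y = c(75.9, 74.3, 80.0, 78.9,  78.3, 75.5, 79.6, 79.2,  80.3, 78.2, 80.4, 76.2,
        80.7, 77.2, 82.0, 83.8,  80.3, 78.6, 81.4, 81.5,  80.1, 81.1, 81.9, 86.4),
  week    = rep(0:3, 6),
  subject = factor(rep(1:6, each = 4)),
  gender  = factor(rep(c("F", "M"), each = 12)))

# AR(1) within-subject errors, no random effects: analogue of (6.229)-(6.230)
fit_ar1 <- gls(y ~ gender, data = drug,
               correlation = corAR1(form = ~ week | subject))

# compound symmetry via a random intercept per subject: (6.227)-(6.228), g = 1
fit_cs <- lme(y ~ gender, data = drug, random = ~ 1 | subject)
summary(fit_cs)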

Problems

Section 6.1

6.1 Consider a system process given by

\[
x_t = -.9\, x_{t-2} + w_t, \quad t = 1, \ldots, n,
\]

where x_0 ~ N(0, σ_0²), x_{-1} ~ N(0, σ_1²), and w_t is Gaussian white noise with variance σ_w². The system process is observed with noise, say,

\[
y_t = x_t + v_t,
\]

where v_t is Gaussian white noise with variance σ_v². Further, suppose x_0, x_{-1}, {w_t} and {v_t} are independent.

(a) Write the system and observation equations in the form of a state-space model.


(b) Find the values of σ_0² and σ_1² that make the observations, y_t, stationary.

(c) Generate n = 100 observations with σ_w = 1, σ_v = 1 and using the values of σ_0² and σ_1² found in (b). Do a time plot of x_t and of y_t and compare the two processes. Also, compare the sample ACF and PACF of x_t and of y_t.

(d) Repeat (c), but with σ_v = 10.

6.2 Consider the state-space model presented in Example 6.3. Let x_t^{t-1} = E(x_t | y_{t-1}, ..., y_1) and let P_t^{t-1} = E(x_t − x_t^{t-1})². The innovation sequence or residuals are ε_t = y_t − y_t^{t-1}, where y_t^{t-1} = E(y_t | y_{t-1}, ..., y_1). Find cov(ε_s, ε_t) in terms of x_t^{t-1} and P_t^{t-1} for (i) s ≠ t and (ii) s = t.

Section 6.2

6.3 Simulate n = 100 observations from the following state-space model:

\[
x_t = .8\, x_{t-1} + w_t \quad\text{and}\quad y_t = x_t + v_t,
\]

where x_0 ~ N(0, 2.78), w_t ~ iid N(0, 1), and v_t ~ iid N(0, 1) are all mutually independent. Compute and plot the data, y_t, the one-step-ahead predictors, y_t^{t-1}, along with the root mean square prediction errors, E^{1/2}(y_t − y_t^{t-1})², using Figure 6.3 as a guide.

6.4 Suppose the vector z = (x', y')', where x (p × 1) and y (q × 1) are jointly distributed with mean vectors μ_x and μ_y and with covariance matrix

\[
\mathrm{cov}(z) = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}.
\]

Consider projecting x on M = sp{1, y}, say, x̂ = b + By.

(a) Show the orthogonality conditions can be written as

\[
E(x - b - By) = 0, \qquad E[(x - b - By)\, y'] = 0,
\]

leading to the solutions

\[
b = \mu_x - B\mu_y \quad\text{and}\quad B = \Sigma_{xy}\Sigma_{yy}^{-1}.
\]

(b) Prove the mean square error matrix is

\[
MSE = E[(x - b - By)\, x'] = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}.
\]


(c) How can these results be used to justify the claim that, in the absence of normality, Property P6.1 yields the best linear estimate of the state x_t given the data Y_t, namely, x_t^t, and its corresponding MSE, namely, P_t^t?

6.5 Derivation of Property P6.2 Based on the Projection Theorem. Throughout this problem, we use the notation of Property P6.2 and of the Projection Theorem given in Appendix B, where H is L². If L_{k+1} = sp{y_1, ..., y_{k+1}}, and V_{k+1} = sp{y_{k+1} − y_{k+1}^k}, for k = 0, 1, ..., n − 1, where y_{k+1}^k is the projection of y_{k+1} on L_k, then, L_{k+1} = L_k ⊕ V_{k+1}. We assume P_0^0 > 0 and R > 0.

(a) Show the projection of x_k on L_{k+1}, that is, x_k^{k+1}, is given by

\[
x_k^{k+1} = x_k^k + H_{k+1}\bigl(y_{k+1} - y_{k+1}^k\bigr),
\]

where H_{k+1} can be determined by the orthogonality property

\[
E\Bigl\{\bigl(x_k - H_{k+1}(y_{k+1} - y_{k+1}^k)\bigr)\bigl(y_{k+1} - y_{k+1}^k\bigr)'\Bigr\} = 0.
\]

Show

\[
H_{k+1} = P_k^k\, \Phi' A_{k+1}'\bigl[A_{k+1} P_{k+1}^k A_{k+1}' + R\bigr]^{-1}.
\]

(b) Define J_k = P_k^k Φ' [P_{k+1}^k]^{-1}, and show

\[
x_k^{k+1} = x_k^k + J_k\bigl(x_{k+1}^{k+1} - x_{k+1}^k\bigr).
\]

(c) Repeating the process, show

\[
x_k^{k+2} = x_k^k + J_k\bigl(x_{k+1}^{k+1} - x_{k+1}^k\bigr) + H_{k+2}\bigl(y_{k+2} - y_{k+2}^{k+1}\bigr),
\]

solving for H_{k+2}. Simplify and show

\[
x_k^{k+2} = x_k^k + J_k\bigl(x_{k+1}^{k+2} - x_{k+1}^k\bigr).
\]

(d) Using induction, conclude

\[
x_k^n = x_k^k + J_k\bigl(x_{k+1}^n - x_{k+1}^k\bigr),
\]

which yields the smoother with k = t − 1.

Section 6.3

6.6 (a) Consider the univariate state-space model given by state conditions x_0 = w_0, x_t = x_{t-1} + w_t and observations y_t = x_t + v_t, t = 1, 2, ..., where w_t and v_t are independent, Gaussian, white noise processes with var(w_t) = σ_w² and var(v_t) = σ_v². Show the data follow an IMA(1,1) model, that is, ∇y_t follows an MA(1) model.


(b) Fit the model specified in part (a) to the logarithm of the glacial varve series and compare the results to those presented in Example 3.31.

6.7 Let y_t represent the land-based global temperature series shown in Figure 6.2. The data file for this problem is HL.dat on the website.

(a) Using regression, fit a third-degree polynomial in time to y_t, that is, fit the model

\[
y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \varepsilon_t,
\]

where ε_t is white noise. Do a time plot of the data fit, ŷ_t, superimposed on the data, y_t, for t = 1, ..., 108.

(b) Write the model y_t = x_t + v_t with ∇²x_t = w_t, where w_t and v_t are independent white noise processes, in state-space form. Hint: The state will be a 2 × 1 vector, say, x_t = (x_t, x_{t-1})'. Fit the state-space model to the data, and do a time plot of the estimated filter, x_t^{t-1}, and the estimated smoother, x_t^n, superimposed on the data, y_t, for t = 1, ..., 108. Compare these results with the results of part (a).

6.8 Consider the model

\[
y_t = x_t + v_t,
\]

where v_t is Gaussian white noise with variance σ_v², x_t are independent Gaussian random variables with mean zero and var(x_t) = r_t σ_x² with x_t independent of v_t, and r_1, ..., r_n are known constants. Show that applying the EM algorithm to the problem of estimating σ_x² and σ_v² leads to updates (represented by hats)

\[
\widehat\sigma_x^2 = \frac{1}{n}\sum_{t=1}^{n}\frac{\sigma_t^2 + \mu_t^2}{r_t}
\quad\text{and}\quad
\widehat\sigma_v^2 = \frac{1}{n}\sum_{t=1}^{n}\bigl[(y_t - \mu_t)^2 + \sigma_t^2\bigr],
\]

where, based on the current estimates (represented by tildes),

\[
\mu_t = \frac{r_t\widetilde\sigma_x^2}{r_t\widetilde\sigma_x^2 + \widetilde\sigma_v^2}\, y_t
\quad\text{and}\quad
\sigma_t^2 = \frac{r_t\widetilde\sigma_x^2\,\widetilde\sigma_v^2}{r_t\widetilde\sigma_x^2 + \widetilde\sigma_v^2}.
\]

6.9 Develop the EM algorithm for the model with inputs, (6.3) and (6.4).

6.10 To explore the stability of the filter, consider a univariate state-space model. That is, for t = 1, 2, ..., the observations are y_t = x_t + v_t and the state equation is x_t = φx_{t-1} + w_t, where σ_w = σ_v = 1 and |φ| < 1. The initial state, x_0, has zero mean and variance one.

(a) Exhibit the recursion for P_t^{t-1} in Property P6.1 in terms of P_{t-1}^{t-2}.


(b) Use the result of (a) to verify P_t^{t-1} approaches a limit (t → ∞) P that is the positive solution of P² − φ²P − 1 = 0.

(c) With K = lim_{t→∞} K_t as given in Property P6.1, show |1 − K| < 1.

(d) Show, in steady-state, the one-step-ahead predictor, y_{n+1}^n = E(y_{n+1} | y_n, y_{n-1}, ...), of a future observation satisfies

\[
y_{n+1}^n = \sum_{j=0}^{\infty}\phi^j K (1-K)^{j-1} y_{n+1-j}.
\]

6.11 In §6.3, we discussed that it is possible to obtain a recursion for the gradient vector, −∂ ln L_Y(Θ)/∂Θ. Assume the model is given by (6.1) and (6.2) and A_t is a known design matrix that does not depend on Θ, in which case Property P6.1 applies. For the gradient vector, show

\[
\partial \ln L_Y(\Theta)/\partial\Theta_i =
\sum_{t=1}^{n}\left\{\varepsilon_t'\Sigma_t^{-1}\frac{\partial\varepsilon_t}{\partial\Theta_i}
- \frac{1}{2}\,\varepsilon_t'\Sigma_t^{-1}\frac{\partial\Sigma_t}{\partial\Theta_i}\Sigma_t^{-1}\varepsilon_t
+ \frac{1}{2}\,\mathrm{tr}\!\left(\Sigma_t^{-1}\frac{\partial\Sigma_t}{\partial\Theta_i}\right)\right\},
\]

where the dependence of the innovation values on Θ is understood. In addition, with the general definition ∂_i g = ∂g(Θ)/∂Θ_i, show the following recursions, for t = 2, ..., n, apply:

• ∂_i ε_t = −A_t ∂_i x_t^{t-1},

• ∂_i x_t^{t-1} = ∂_iΦ x_{t-1}^{t-2} + Φ ∂_i x_{t-1}^{t-2} + ∂_i K_{t-1} ε_{t-1} + K_{t-1} ∂_i ε_{t-1},

• ∂_i Σ_t = A_t ∂_i P_t^{t-1} A_t' + ∂_i R,

• ∂_i K_t = [∂_iΦ P_t^{t-1} A_t' + Φ ∂_i P_t^{t-1} A_t' − K_t ∂_i Σ_t] Σ_t^{-1},

• ∂_i P_t^{t-1} = ∂_iΦ P_{t-1}^{t-2} Φ' + Φ ∂_i P_{t-1}^{t-2} Φ' + Φ P_{t-1}^{t-2} ∂_iΦ' + ∂_i Q
  − ∂_i K_{t-1} Σ_t K_{t-1}' − K_{t-1} ∂_i Σ_t K_{t-1}' − K_{t-1} Σ_t ∂_i K_{t-1}',

using the fact that P_t^{t-1} = ΦP_{t-1}^{t-2}Φ' + Q − K_{t-1}Σ_t K_{t-1}'.

6.12 Continuing with the previous problem, consider the evaluation of the Hessian matrix and the numerical evaluation of the asymptotic variance–covariance matrix of the parameter estimates. The information matrix satisfies

\[
E\left\{-\frac{\partial^2 \ln L_Y(\Theta)}{\partial\Theta\,\partial\Theta'}\right\}
= E\left\{\left(\frac{\partial \ln L_Y(\Theta)}{\partial\Theta}\right)\left(\frac{\partial \ln L_Y(\Theta)}{\partial\Theta}\right)'\right\};
\]

see Anderson (1984, Section 4.4), for example. Show the (i, j)-th element of the information matrix, say, I_{ij}(Θ) = E{−∂² ln L_Y(Θ)/∂Θ_i ∂Θ_j}, is

\[
I_{ij}(\Theta) = \sum_{t=1}^{n} E\Bigl\{\partial_i\varepsilon_t'\,\Sigma_t^{-1}\,\partial_j\varepsilon_t
+ \tfrac{1}{2}\,\mathrm{tr}\bigl(\Sigma_t^{-1}\partial_i\Sigma_t\,\Sigma_t^{-1}\partial_j\Sigma_t\bigr)
+ \tfrac{1}{4}\,\mathrm{tr}\bigl(\Sigma_t^{-1}\partial_i\Sigma_t\bigr)\,\mathrm{tr}\bigl(\Sigma_t^{-1}\partial_j\Sigma_t\bigr)\Bigr\}.
\]

Consequently, an approximate Hessian matrix can be obtained from the sample by dropping the expectation, E, in the above result and using only the recursions needed to calculate the gradient vector.

Section 6.4

6.13 As an example of the way the state-space model handles the missing data problem, suppose the first-order autoregressive process

\[
x_t = \phi x_{t-1} + w_t
\]

has an observation missing at t = m, leading to the observations y_t = A_t x_t, where A_t = 1 for all t, except t = m wherein A_t = 0. Assume x_0 = 0 with variance σ_w²/(1 − φ²), where the variance of w_t is σ_w². Show the Kalman smoother estimators in this case are

\[
x_t^n = \begin{cases}
\phi y_1, & t = 0,\\[4pt]
\dfrac{\phi}{1+\phi^2}\,(y_{m-1} + y_{m+1}), & t = m,\\[4pt]
y_t, & t \neq 0, m,
\end{cases}
\]

with mean square covariances determined by

\[
P_t^n = \begin{cases}
\sigma_w^2, & t = 0,\\[4pt]
\dfrac{\sigma_w^2}{1+\phi^2}, & t = m,\\[4pt]
0, & t \neq 0, m.
\end{cases}
\]

6.14 The data set labeled ar1miss.dat is n = 100 observations generated from an AR(1) process, x_t = φx_{t-1} + w_t, with φ = .9 and σ_w = 1, where 10% of the data has been zeroed out at random. Considering the zeroed-out data to be missing data, use the results of Problem 6.13 to estimate the parameters of the model, φ and σ_w, using the EM algorithm, and then estimate the missing values.

Section 6.5

6.15 Using Example 6.10 as a guide, fit a structural model to the Federal Reserve Board Production Index data and compare it with the model fit in Example 3.43.

Section 6.6

6.16 Use Property P6.6 to complete the following exercises.


(a) Write a univariate AR(1) model, y_t = φy_{t-1} + v_t, in state-space form. Verify your answer is indeed an AR(1).

(b) Repeat (a) for an MA(1) model, y_t = v_t + θv_{t-1}.

(c) Write an IMA(1,1) model, y_t = y_{t-1} + v_t + θv_{t-1}, in state-space form.

6.17 Verify Property P6.5.

6.18 Verify Property P6.6.

Section 6.7

6.19 Repeat the bootstrap analysis of Example 6.12 on the entire three-month treasury bills and rate of inflation data set of 110 observations. Do the conclusions of Example 6.12—that the dynamics of the data is best described in terms of a fixed, rather than stochastic, regression—still hold?

Section 6.8

6.20 Argue that a switching model is reasonable in explaining the behavior of the number of sunspots (see Figure 4.31) and then fit a switching model to the sunspot data.

Section 6.9

6.21 Use the material presented in Example 6.18 to perform a Bayesian analysis of the model for the Johnson & Johnson data presented in Example 6.10.

6.22 Verify (6.169) and (6.170).

6.23 Verify (6.175) and (6.182).

Section 6.10

6.24 Fit a stochastic volatility model to the returns of one (or more) of the four financial time series available in the R datasets package as EuStockMarkets.


Section 6.11

6.25 In a small pilot study, a psychiatrist wanted to examine the effects of the drug lithium on bulimics (bulimics have continuous abnormal hunger and frequently go on eating binges). Although evidence of the effectiveness of lithium on bulimics has been shown, he was not sure if depressed subjects would respond differently than those without depression. He treated eight teenage female patients with lithium for 12 weeks; four of the subjects were diagnosed with depression, and half of the subjects received behavioral therapy. At the end of each four-week period, he recorded the number of binges each subject had during that week. The following are the results:

                       Weekly Number of Binges
  Subject   Depression   Week 0   Week 4   Week 8   Week 12
     1         No          13        3        0        0
     2         No          15        4        3        1
     3         No          16        4        3        2
     4         No          14        2        1        2
     5         Yes         10        7        4        3
     6         Yes         18        7        2        4
     7         Yes         16        6        5        4
     8         Yes         19        8        5        7

Fit a longitudinal model that addresses the concerns of the psychiatrist. Because the data are counts (number of occurrences), consider a square root transformation prior to the analysis.


Chapter 7

Statistical Methods in the Frequency Domain

7.1 Introduction

In previous chapters, we saw many applied time series problems that involved relating series to each other or to evaluating the effects of treatments or design parameters that arise when time-varying phenomena are subjected to periodic stimuli. In many cases, the nature of the physical or biological phenomena under study are best described by their Fourier components rather than by the difference equations involved in ARIMA or state-space models. The fundamental tools we use in studying periodic phenomena are the discrete Fourier transforms (DFTs) of the processes and their statistical properties. Hence, in §7.2, we review the properties of the DFT of a multivariate time series and discuss various approximations to the likelihood function based on the large-sample properties and the properties of the complex multivariate normal distribution. This enables extension of the classical techniques discussed in the following paragraphs to the multivariate time series case.

An extremely important class of problems in classical statistics develops when we are interested in relating a collection of input series to some output series. For example, in Chapter 2, we have previously considered relating temperature and various pollutant levels to daily mortality, but have not investigated the frequencies that appear to be driving the relation and have not looked at the possibility of leading or lagging effects. In Chapter 4, we isolated a definite lag structure that could be used to relate sea surface temperature to the number of new recruits. In Problem 5.11 of Chapter 5, the possible driving processes that could be used to explain inflow to Shasta Lake were hypothesized in terms of the possible inputs precipitation, cloud cover, temperature, and other variables. Identifying the combination of input factors in Figure 4.33


[Figure 7.1: six panels of mean response versus time, labeled Awake-Brush, Low-Brush, Awake-Heat, Low-Heat, Awake-Shock, and Low-Shock.]

Figure 7.1 Mean response of subjects to various combinations of periodic stimuli measured at the cortex (primary somatosensory, contralateral).

that produce the best prediction for inflow is an example of multiple regression in the frequency domain, with the models treated theoretically by considering the regression, conditional on the random input processes.

A situation somewhat different from that above would be one in which the input series are regarded as fixed and known. In this case, we have a model analogous to that occurring in analysis of variance, in which the analysis now can be performed on a frequency by frequency basis. This analysis works especially well when the inputs are dummy variables, depending on some configuration of treatment and other design effects and when effects are largely dependent on periodic stimuli. As an example, we will look at a designed experiment measuring the fMRI brain responses of a number of awake and mildly anesthetized subjects to several levels of periodic brushing, heat, and shock effects. Some limited data from this experiment have been discussed previously in Example 1.6 of Chapter 1. Figure 7.1 shows mean responses to various levels of periodic heat, brushing, and shock stimuli for subjects awake and subjects under mild anesthesia. The stimuli were periodic in nature, applied alternately for 32 seconds (16 points) and then stopped for 32 seconds. The periodic input signal comes through under all three design conditions when the subjects are awake, but is somewhat attenuated under anesthesia. The mean shock level response hardly shows on the input signal; shock levels were designed to simulate surgical incision without inflicting tissue damage. The


means in Figure 7.1 are from a single location. Actually, for each individual, some nine series were recorded at various locations in the brain. It is natural to consider testing the effects of brushing, heat, and shock under the two levels of consciousness, using a time series generalization of analysis of variance.

A generalization to random coefficient regression is also considered, paralleling the univariate approach to signal extraction and detection presented in §4.9. This method enables a treatment of multivariate ridge-type regressions and inversion problems. Also, the usual random effects analysis of variance in the frequency domain becomes a special case of the random coefficient model.

The extension of frequency domain methodology to more classical approaches to multivariate discrimination and clustering is of interest in the frequency dependent case. Many time series differ in their means and in their autocovariance functions, making the use of both the mean function and the spectral density matrices relevant. As an example of such data, consider the bivariate series consisting of the P and S components derived from several earthquakes and explosions, such as those shown in Figure 7.2, where the P and S components, representing different arrivals, have been separated from the first and second halves, respectively, of wave forms like those shown originally in Figure 1.7 of Chapter 1.

Two earthquakes and two explosions from a set of eight earthquakes and explosions are shown in Figure 7.2 and some essential differences exist that might be used to characterize the two classes of events. Also, the frequency content of the two components of the earthquakes appears to be lower than those of the explosions, and relative amplitudes of the two classes appear to differ. For example, the ratio of the S to P amplitudes in the earthquake group is much higher for this restricted subset. Spectral differences were also noticed in Chapter 4, where the explosion processes had a stronger high-frequency component relative to the low-frequency contributions. Examples like these are typical of applications in which the essential differences between multivariate time series can be expressed by the behavior of either the frequency-dependent mean value functions or the spectral matrix. In discriminant analysis, these types of differences are exploited to develop combinations of linear and quadratic classification criteria. Such functions can then be used to classify events of unknown origin, such as the Novaya Zemlya event shown in Figure 7.2, which tends to bear a visual resemblance to the explosion group.

Finally, for multivariate processes, the structure of the spectral matrix is also of great interest. We might reduce the dimension of the underlying process to a smaller set of input processes that explain most of the variability in the cross-spectral matrix as a function of frequency. Principal component analysis can be used to decompose the spectral matrix into a smaller subset of component factors that explain decreasing amounts of power. For example, the hydrological data might be explained in terms of a component process that weights heavily on precipitation and inflow and one that weights heavily on temperature and cloud cover. Perhaps these two components could explain most of the power in the spectral matrix at a given frequency. The ideas


[Figure 7.2: P and S components (left and right columns) for earthquakes EQ5 and EQ6, explosions EXP5 and EXP6, and the unknown event NZ.]

Figure 7.2 Bivariate earthquakes and explosions (40 pts/sec) compared with an event NZ (Novaya Zemlya) of unknown origin.


behind principal component analysis can also be generalized to include an optimal scaling methodology for categorical data called the spectral envelope (see Stoffer et al., 1993). In succeeding sections, we also give an introduction to dynamic Fourier analysis and to wavelet analysis.

7.2 Spectral Matrices and Likelihood Functions

We have previously argued for an approximation to the log likelihood based on the joint distribution of the DFTs in (4.116), where we used approximation as an aid in estimating parameters for certain parameterized spectra. In this chapter, we make heavy use of the fact that the sine and cosine transforms of the p × 1 vector process x_t = (x_{t1}, x_{t2}, ..., x_{tp})' with mean Ex_t = μ_t, say, with DFT¹

\[
X(\omega_k) = n^{-1/2}\sum_{t=1}^{n} x_t\, e^{-2\pi i \omega_k t} = X_c(\omega_k) - i X_s(\omega_k) \tag{7.1}
\]

and mean

\[
M(\omega_k) = n^{-1/2}\sum_{t=1}^{n} \mu_t\, e^{-2\pi i \omega_k t} = M_c(\omega_k) - i M_s(\omega_k) \tag{7.2}
\]

will be approximately uncorrelated, where we evaluate at the usual Fourier frequencies {ω_k = k/n, 0 < |ω_k| < 1/2}. By Theorem C.6, the approximate 2p × 2p covariance matrix of the cosine and sine transforms, say, X(ω_k) = (X_c(ω_k)', X_s(ω_k)')', is

\[
\Sigma(\omega_k) = \frac{1}{2}\begin{pmatrix} C(\omega_k) & -Q(\omega_k) \\ Q(\omega_k) & C(\omega_k) \end{pmatrix}, \tag{7.3}
\]

and the real and imaginary parts are jointly normal. This result implies, by the results stated in Appendix C, the density function of the vector DFT, say, X(ω_k), can be approximated as

\[
p(\omega_k) \approx |f(\omega_k)|^{-1}\exp\Bigl\{-\bigl(X(\omega_k) - M(\omega_k)\bigr)^{*} f^{-1}(\omega_k)\bigl(X(\omega_k) - M(\omega_k)\bigr)\Bigr\},
\]

where the spectral matrix is the usual

\[
f(\omega_k) = C(\omega_k) - i Q(\omega_k). \tag{7.4}
\]

¹In previous chapters, the DFT of a process x_t was denoted by d_x(ω_k). In this chapter, we will consider the Fourier transforms of many different processes and so, to avoid the overuse of subscripts and to ease the notation, we use a capital letter, e.g., X(ω_k), to denote the Fourier transform of x_t.


Certain computations that we do in the section on discriminant analysis will involve approximating the joint likelihood by the product of densities like the one given above over subsets of the frequency band 0 < ω_k < 1/2.

To use the likelihood function for estimating the spectral matrix, for example, we appeal to the limiting result implied by Theorem C.7 and again choose L frequencies in the neighborhood of some target frequency ω, say, X(ω_k ± k/n), for k = 1, ..., m and L = 2m + 1. Then, let X_ℓ, for ℓ = 1, ..., L, denote the indexed values, and note the DFTs of the mean adjusted vector process are approximately jointly normal with mean zero and complex covariance matrix f = f(ω). Then, write the log likelihood over the L sub-frequencies as

\[
\ln L(X_1, \ldots, X_L;\, f) \approx -L\ln|f| - \sum_{\ell=1}^{L}\bigl(X_\ell - M_\ell\bigr)^{*} f^{-1}\bigl(X_\ell - M_\ell\bigr), \tag{7.5}
\]

where we have suppressed the argument of f = f(ω) for ease of notation. The use of spectral approximations to the likelihood has been fairly standard, beginning with the work of Whittle (1961) and continuing in Brillinger (1981) and Hannan (1970). In this case, we assume the mean-adjusted series are available, i.e., that M_ℓ is known, so that we may take X_ℓ to be the mean-adjusted series. We may obtain the maximum likelihood estimator for f by writing the joint log likelihood of the real and imaginary parts in terms of Z_ℓ = (X_{cℓ}', X_{sℓ}')' and obtaining the maximum likelihood estimators for C and Q, the real and imaginary parts of f. Problem 7.2 shows we will obtain

\[
\widehat{f} = L^{-1}\sum_{\ell=1}^{L}\bigl(X_\ell - M_\ell\bigr)\bigl(X_\ell - M_\ell\bigr)^{*}, \tag{7.6}
\]

which is just the usual mean-adjusted estimator for the spectral matrix.
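For a mean-adjusted series, (7.6) is just the average of L neighboring periodogram matrices, which is straightforward to compute from the DFT. The following is a small base-R sketch under our own notation; x is an n × p matrix that has already been mean-adjusted (or detrended), and m controls the bandwidth L = 2m + 1.

# Smoothed spectral matrix estimate in the spirit of (7.6).
spec_matrix <- function(x, m = 2) {
  n <- nrow(x); p <- ncol(x)
  X <- mvfft(x) / sqrt(n)                  # row j+1 holds the DFT at frequency j/n
  ks <- (m + 1):(floor(n / 2) - m)         # target frequencies away from 0 and 1/2
  f <- array(0 + 0i, dim = c(p, p, length(ks)))
  for (i in seq_along(ks)) {
    k <- ks[i]
    for (j in (k - m):(k + m))             # average L = 2m + 1 periodogram matrices
      f[, , i] <- f[, , i] + outer(X[j + 1, ], Conj(X[j + 1, ]))
    f[, , i] <- f[, , i] / (2 * m + 1)
  }
  list(freq = ks / n, f = f)               # f[, , i] estimates f(freq[i])
}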

7.3 Regression for Jointly Stationary Series

In §4.8, we considered a model of the form

\[
y_t = \sum_{r=-\infty}^{\infty}\beta_{1r}\, x_{t-r,1} + v_t, \tag{7.7}
\]

where x_{t1} is a single observed input series and y_t is the observed output series, and we are interested in estimating the filter coefficients β_{1r} relating the adjacent lagged values of x_{t1} to the output series y_t. In the case of the SOI and Recruitment series, we identified the El Niño driving series as x_{t1}, the input, and y_t, the Recruitment series, as the output. In general, more than a single plausible input series may exist. For example, the hydrological data shown in Figure 4.33 suggest there may be at least five possible series driving the inflow. Hence, we may envision a q × 1 input vector of driving series,


say, x_t = (x_{t1}, x_{t2}, ..., x_{tq})', and a q × 1 vector of regression functions β_r = (β_{1r}, β_{2r}, ..., β_{qr})', which are related as

\[
y_t = \sum_{r=-\infty}^{\infty}\beta_r'\, x_{t-r} + v_t. \tag{7.8}
\]

Writing the matrix form out as

\[
y_t = \sum_{j=1}^{q}\sum_{r=-\infty}^{\infty}\beta_{jr}\, x_{t-r,j} + v_t \tag{7.9}
\]

shows the output is basically a sum of linearly filtered versions of the input processes and a stationary noise process v_t, assumed to be uncorrelated with x_t. Each filtered component in the sum over j gives the contribution of lagged values of the j-th input series to the output series. We assume the regression functions β_{jr} are fixed and unknown.

The model given by (7.8) is useful under several different scenarios, corresponding to a number of different assumptions that can be made about the components. Assuming the input and output processes are jointly stationary with zero means leads to the conventional regression analysis given in this section. The analysis depends on theory that assumes we observe the output process y_t conditional on fixed values of the input vector x_t; this is the same as the assumptions made in conventional regression analysis. Assumptions considered later involve letting the coefficient vector β_t be a random unknown signal vector that can be estimated by Bayesian arguments, using the conditional expectation given the data. The answers to this approach, given in §7.5, allow signal extraction and deconvolution problems to be handled. Assuming the inputs are fixed allows various experimental designs and analysis of variance to be done for both fixed and random effects models. Estimation of the frequency-dependent random effects variance components in the analysis of variance model is also considered in §7.5.

For the approach in this section, assume the inputs and outputs have zero means and are jointly stationary with the (q + 1) × 1 vector process (x_t', y_t)' of inputs x_t and outputs y_t assumed to have a spectral matrix of the form

\[
f(\omega) = \begin{pmatrix} f_{xx}(\omega) & f_{xy}(\omega) \\ f_{yx}(\omega) & f_{yy}(\omega) \end{pmatrix}, \tag{7.10}
\]

where f_{yx}(ω) = (f_{yx_1}(ω), f_{yx_2}(ω), ..., f_{yx_q}(ω)) is the 1 × q vector of cross-spectra relating the q inputs to the output and f_{xx}(ω) is the q × q spectral matrix of the inputs. Generally, we observe the inputs and search for the vector of regression functions β_t relating the inputs to the outputs. We assume all autocovariance functions satisfy the absolute summability conditions of the form

\[
\sum_{h=-\infty}^{\infty}|h|\,|\gamma_{jk}(h)| < \infty \tag{7.11}
\]


(j, k = 1, ..., q + 1), where γ_{jk}(h) is the autocovariance corresponding to the cross-spectrum f_{jk}(ω) in (7.10). We also need to assume a linear process of the form (C.35) as a condition for using Theorem C.7 on the joint distribution of the discrete Fourier transforms in the neighborhood of some fixed frequency.

Estimation of the Regression Function

In order to estimate the regression function β_r, the Projection Theorem (Appendix B) applied to minimizing

\[
MSE = E\Bigl[\bigl(y_t - \sum_{r=-\infty}^{\infty}\beta_r' x_{t-r}\bigr)^2\Bigr] \tag{7.12}
\]

leads to the orthogonality conditions

\[
E\Bigl[\bigl(y_t - \sum_{r=-\infty}^{\infty}\beta_r' x_{t-r}\bigr)\, x_{t-s}'\Bigr] = 0' \tag{7.13}
\]

for all s = 0, ±1, ±2, ..., where 0' denotes the 1 × q zero vector. Taking the expectations inside and substituting the definitions of the autocovariance functions leads to the normal equations

\[
\sum_{r=-\infty}^{\infty}\beta_r'\, \Gamma_{xx}(s-r) = \gamma_{yx}'(s), \tag{7.14}
\]

for s = 0, ±1, ±2, ..., where Γ_{xx}(s) denotes the q × q autocovariance matrix of the vector series x_t at lag s and γ_{yx}(s) = (γ_{yx_1}(s), ..., γ_{yx_q}(s)) is a 1 × q vector containing the lagged covariances between y_t and x_t. Again, a frequency domain approximate solution is easier in this case because the computations can be done frequency by frequency using cross-spectra that can be estimated from sample data using the DFT. In order to develop the frequency domain solution, substitute the spectral representations into the normal equations, using the same approach as used in the simple case derived in §4.8. This approach yields

\[
\int_{-1/2}^{1/2}\sum_{r=-\infty}^{\infty}\beta_r'\, e^{2\pi i\omega(s-r)}\, f_{xx}(\omega)\, d\omega = \gamma_{yx}'(s).
\]

Now, because γ_{yx}'(s) is the Fourier transform of the cross-spectral vector f_{yx}(ω) = f_{xy}^*(ω), we might write the system of equations in the frequency domain, using the uniqueness of the Fourier transform, as

\[
B'(\omega)\, f_{xx}(\omega) = f_{xy}^*(\omega), \tag{7.15}
\]

where f_{xx}(ω) is the q × q spectral matrix of the inputs and B(ω) is the q × 1 vector Fourier transform of β_t. Multiplying (7.15) on the right by f_{xx}^{-1}(ω), assuming f_{xx}(ω) is nonsingular at ω, leads to the frequency domain estimator

\[
B'(\omega) = f_{xy}^*(\omega)\, f_{xx}^{-1}(\omega). \tag{7.16}
\]


Note, (7.16) implies the regression function would take the form

\[
\beta_t = \int_{-1/2}^{1/2} B(\omega)\, e^{2\pi i\omega t}\, d\omega. \tag{7.17}
\]

As before, it is conventional to introduce the DFT as the approximate estimator for the integral (7.17) and write

\[
\beta_t^M = M^{-1}\sum_{k=0}^{M-1} B(\omega_k)\, e^{2\pi i\omega_k t}, \tag{7.18}
\]

where ω_k = k/M, M << n. The approximation was shown in Problem 4.35 to hold exactly as long as β_t = 0 for |t| ≥ M and to have a mean squared error bounded by a function of the zero-lag autocovariance and the absolute sum of the neglected coefficients.

The mean squared error (7.12) can be written using the orthogonality principle, giving

\[
MSE = \int_{-1/2}^{1/2} f_{y\cdot x}(\omega)\, d\omega, \tag{7.19}
\]

where

\[
f_{y\cdot x}(\omega) = f_{yy}(\omega) - f_{xy}^*(\omega)\, f_{xx}^{-1}(\omega)\, f_{xy}(\omega) \tag{7.20}
\]

denotes the residual or error spectrum. The resemblance of (7.20) to the usual equations in regression analysis is striking. It is useful to pursue the multiple regression analogy further by noting a squared multiple coherence can be defined as

\[
\rho^2_{y\cdot x}(\omega) = \frac{f_{xy}^*(\omega)\, f_{xx}^{-1}(\omega)\, f_{xy}(\omega)}{f_{yy}(\omega)}. \tag{7.21}
\]

This expression leads to the mean squared error in the form

\[
MSE = \int_{-1/2}^{1/2} f_{yy}(\omega)\bigl[1 - \rho^2_{y\cdot x}(\omega)\bigr]\, d\omega, \tag{7.22}
\]

and we have an interpretation of ρ²_{y·x}(ω) as the proportion of power accounted for by the lagged regression on x_t at frequency ω. If ρ²_{y·x}(ω) = 0 for all ω, we have

\[
MSE = \int_{-1/2}^{1/2} f_{yy}(\omega)\, d\omega = E[y_t^2],
\]

which is the mean squared error when no predictive power exists. As long as f_{xx}(ω) is positive definite at all frequencies, MSE ≥ 0, and we will have

\[
0 \leq \rho^2_{y\cdot x}(\omega) \leq 1 \tag{7.23}
\]

for all ω. If the multiple coherence is unity for all frequencies, the mean squared error in (7.22) is zero and the output series is perfectly predicted by a linearly


filtered combination of the inputs. Problem 7.3 shows the ordinary squared coherence between the series y_t and the linearly filtered combinations of the inputs appearing in (7.12) is exactly (7.21).

Estimation Using Sampled Data

Clearly, the matrices of spectra and cross-spectra will not ordinarily be known, so the regression computations need to be based on sampled data. We assume, therefore, the inputs x_{t1}, x_{t2}, ..., x_{tq} and output y_t series are available at the time points t = 1, 2, ..., n, as in Chapter 4. In order to develop reasonable estimates for the spectral quantities, some replication must be assumed. Often, only one replication of each of the inputs and the output will exist, so it is necessary to assume a band exists over which the spectra and cross-spectra are approximately equal to f_{xx}(ω) and f_{xy}(ω), respectively. Then, let Y(ω_k + ℓ/n) and X(ω_k + ℓ/n) be the DFTs of y_t and x_t over the band, say, at frequencies of the form

\[
\omega_k \pm \ell/n, \quad \ell = 1, \ldots, m,
\]

where L = 2m + 1 as before. Then, simply substitute the sample spectral matrix

\[
\widehat{f}_{xx}(\omega) = L^{-1}\sum_{\ell=-m}^{m} X(\omega_k + \ell/n)\, X^{*}(\omega_k + \ell/n) \tag{7.24}
\]

and the vector of sample cross-spectra

\[
\widehat{f}_{xy}(\omega) = L^{-1}\sum_{\ell=-m}^{m} X(\omega_k + \ell/n)\, \overline{Y}(\omega_k + \ell/n) \tag{7.25}
\]

for the respective terms in (7.16) to get the regression estimator \widehat{B}(ω). For the regression estimator (7.18), we may use

\[
\widehat\beta_t^M = \frac{1}{M}\sum_{k=0}^{M-1}\widehat{f}_{xy}^*(\omega_k)\, \widehat{f}_{xx}^{-1}(\omega_k)\, e^{2\pi i\omega_k t} \tag{7.26}
\]

for t = 0, ±1, ±2, ..., ±(M/2 − 1), as the estimated regression function.
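In practice, (7.16) and (7.26) amount to solving a small complex linear system at each frequency and then inverse transforming. The sketch below is ours and assumes smoothed estimates are already available on the grid ω_k = k/M, k = 0, ..., M − 1 (conjugate-symmetric in k, as produced by averaging periodogram ordinates in the spirit of (7.24)-(7.25)): fxx is a q × q × M complex array and fxy a q × M complex matrix.

# Frequency domain regression: B(omega_k) from (7.16), beta_t from (7.26).
lagged_regression <- function(fxx, fxy) {
  q <- dim(fxx)[1]; M <- dim(fxx)[3]
  B <- matrix(0 + 0i, q, M)
  for (k in 1:M)                 # B = conj(fxx^{-1} fxy), using fxx Hermitian
    B[, k] <- Conj(solve(fxx[, , k], fxy[, k]))
  # beta_t = M^{-1} sum_k B(omega_k) exp(2*pi*i*k*t/M): an inverse DFT over k
  beta <- t(Re(mvfft(t(B), inverse = TRUE))) / M
  beta   # column t+1 holds beta_t for t = 0, 1, ..., M-1; the trailing
         # columns correspond to negative lags (t = M-1 is lag -1, and so on)
}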

Tests of Hypotheses

The estimated squared multiple coherence, corresponding to the theoretical coherence (7.21), becomes

\[
\widehat\rho^2_{y\cdot x}(\omega) = \frac{\widehat{f}_{xy}^*(\omega)\, \widehat{f}_{xx}^{-1}(\omega)\, \widehat{f}_{xy}(\omega)}{\widehat{f}_{yy}(\omega)}. \tag{7.27}
\]


We may obtain a distributional result for the multiple coherence function analogous to that obtained in the univariate case by writing the multiple regression model in the frequency domain, as was done in §4.6. We obtain the statistic

\[
F_{2q,\,2(L-q)} = \frac{(L-q)}{q}\,\frac{\widehat\rho^2_{y\cdot x}(\omega)}{\bigl[1 - \widehat\rho^2_{y\cdot x}(\omega)\bigr]}, \tag{7.28}
\]

which has an F-distribution with 2q and 2(L − q) degrees of freedom under the null hypothesis that ρ²_{y·x}(ω) = 0, or equivalently, that B(ω) = 0, in the model

\[
Y(\omega_k + \ell/n) = B'(\omega)\, X(\omega_k + \ell/n) + V(\omega_k + \ell/n), \tag{7.29}
\]

where the spectral density of the error V(ω_k + ℓ/n) is f_{y·x}(ω). Problem 7.5 sketches a derivation of this result.

A second kind of hypothesis of interest is one that might be used to test whether a full model with q inputs is significantly better than some submodel with q_1 < q components. In the time domain, this hypothesis implies, for a partition of the vector of inputs into q_1 and q_2 components (q_1 + q_2 = q), say, x_t = (x_{t1}', x_{t2}')', and the similarly partitioned vector of regression functions β_t = (β_{1t}', β_{2t}')', we would be interested in testing whether β_{2t} = 0 in the partitioned regression model

\[
y_t = \sum_{r=-\infty}^{\infty}\beta_{1r}'\, x_{t-r,1} + \sum_{r=-\infty}^{\infty}\beta_{2r}'\, x_{t-r,2} + v_t. \tag{7.30}
\]

Rewriting the regression model (7.30) in the frequency domain in a form that is similar to (7.29) establishes that, under the partitions of the spectral matrix into its q_i × q_j (i, j = 1, 2) submatrices, say,

\[
f_{xx}(\omega) = \begin{pmatrix} f_{11}(\omega) & f_{12}(\omega) \\ f_{21}(\omega) & f_{22}(\omega) \end{pmatrix}, \tag{7.31}
\]

and the cross-spectral vector into its q_i × 1 (i = 1, 2) subvectors,

\[
f_{xy}(\omega) = \begin{pmatrix} f_{1y}(\omega) \\ f_{2y}(\omega) \end{pmatrix}, \tag{7.32}
\]

we may test the hypothesis β_{2t} = 0 at frequency ω by comparing the estimated residual power

\[
\widehat{f}_{y\cdot x}(\omega) = \widehat{f}_{yy}(\omega) - \widehat{f}_{xy}^*(\omega)\, \widehat{f}_{xx}^{-1}(\omega)\, \widehat{f}_{xy}(\omega) \tag{7.33}
\]

under the full model with that under the reduced model, given by

\[
\widehat{f}_{y\cdot 1}(\omega) = \widehat{f}_{yy}(\omega) - \widehat{f}_{1y}^*(\omega)\, \widehat{f}_{11}^{-1}(\omega)\, \widehat{f}_{1y}(\omega). \tag{7.34}
\]

The power due to regression can be written as

\[
SSR(\omega) = L\bigl[\widehat{f}_{y\cdot 1}(\omega) - \widehat{f}_{y\cdot x}(\omega)\bigr], \tag{7.35}
\]


Table 7.1 Analysis of Power (ANOPOW) for Testing No Contribution from the Subset x_{t2} in the Partitioned Regression Model

  Source                            Power                 Degrees of Freedom
  x_{t,q1+1}, ..., x_{t,q1+q2}      SSR(ω)  (7.35)        2q_2
  Error                             SSE(ω)  (7.36)        2(L − q_1 − q_2)
  Total                             L f̂_{y·1}(ω)          2(L − q_1)

with the usual error power given by

\[
SSE(\omega) = L\,\widehat{f}_{y\cdot x}(\omega). \tag{7.36}
\]

The test of no regression proceeds using the F-statistic

\[
F_{2q_2,\,2(L-q)} = \frac{(L-q)}{q_2}\,\frac{SSR(\omega)}{SSE(\omega)}. \tag{7.37}
\]

The distribution of this F-statistic with 2q_2 numerator degrees of freedom and 2(L − q) denominator degrees of freedom follows from an argument paralleling that given in Chapter 4 for the case of a single input. The test results can be summarized in an Analysis of Power (ANOPOW) table that parallels the usual analysis of variance (ANOVA) table. Table 7.1 shows the components of power for testing β_{2t} = 0 at a particular frequency ω. The ratio of the two components divided by their respective degrees of freedom just yields the F-statistic (7.37) used for testing whether the q_2 add significantly to the predictive power of the regression on the q_1 series.
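Given the smoothed spectral estimates at a frequency of interest, the coherence (7.27) and the overall F test (7.28) reduce to a few lines of R. The sketch below is ours; fxx is the q × q input spectral matrix, fxy the q × 1 cross-spectral vector, fyy the output spectrum, and L the number of frequencies averaged.

# Squared multiple coherence (7.27) and F statistic (7.28) at one frequency.
coherence_F <- function(fxx, fxy, fyy, L) {
  q <- length(fxy)
  rho2  <- Re(Conj(fxy) %*% solve(fxx, fxy)) / fyy        # (7.27)
  Fstat <- (L - q) / q * rho2 / (1 - rho2)                # (7.28)
  pval  <- pf(Fstat, 2 * q, 2 * (L - q), lower.tail = FALSE)
  c(rho2 = rho2, F = Fstat, p.value = pval)
}

The partitioned test (7.37) follows the same pattern: compute the residual spectra (7.33) and (7.34) from the full and reduced sets of inputs, form SSR and SSE, and refer the ratio to an F distribution with 2q_2 and 2(L − q) degrees of freedom.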

Example 7.1 Predicting Shasta Lake Inflow

We illustrate some of the preceding ideas by considering the problem of predicting the transformed inflow series shown in Figure 4.33 from some combination of the inputs. First, look for the best single input predictor using the squared coherence function (7.27). The results, exhibited in Figure 7.3, show transformed precipitation produces the most consistently high squared coherence values at all frequencies (L = 41), with the seasonal frequencies .08, .17, .25, and .33 cycles per month corresponding to 12-month, six-month, four-month, and three-month periods contributing most significantly at the α = .001 level.

Other inputs, with the exception of wind speed, also appear to be plausible contributors. In order to evaluate the other contributors, we consider partitioned tests with models including each of the other variables and precipitation tested against models including precipitation alone. Figure 7.4 shows a plot of the F-statistic (7.28) as a function of frequency for testing each of the inputs as a possible additional component. We see here some isolated significance points, particularly in the temperature series at some of the higher seasonal components, although the strong


[Figure 7.3: five panels of squared coherence versus frequency, one for each input: Temperature, Dew Point, Cloud Cover, Wind Speed, and Precipitation.]

Figure 7.3 Univariate coherence functions relating Shasta Lake inflow to various inputs (frequency scale is cycles per month).

[Figure 7.4: four panels of F-statistics versus frequency (cycles per month): Temp.-Precip., Dew Pt.-Precip., Cloud Cov.-Precip., and Wind Spd.-Precip.]

Figure 7.4 F-statistics for testing whether various inputs combined with precipitation add to the ability to predict Shasta Lake inflow.


[Figure 7.5: top panel, squared multiple coherence with the .001 significance level; lower panels, impulse response functions for temperature and precipitation plotted against lag.]

Figure 7.5 Multiple coherence between inflow and combined precipitation and temperature along with multiple impulse response functions for the regression relations.

coherence at the 12-month frequency seems to have been essentially eliminated by the incorporation of precipitation.

The additional contribution of temperature to the model seems somewhat marginal because the multiple coherence (7.27), shown in the top panel of Figure 7.5, seems only slightly better than the univariate coherence with precipitation shown in Figure 7.3. It is, however, instructive to produce the multiple regression functions, using (7.26), to see if a simple model for inflow exists that would involve some regression combination of inputs temperature and precipitation that would be useful for predicting inflow to Shasta Lake. With this in mind, denoting the possible inputs P_t for transformed precipitation and T_t for transformed temperature, the regression function has been plotted in the lower two panels of Figure 7.5. The time axes run over both positive and negative values and are centered at time t = 0. Hence, the relation with temperature seems to be instantaneous and positive and an exponentially decaying relation to precipitation exists that has been noticed previously in the analysis in Problem 4.37 of Chapter 4. The plots suggest a transfer function model


of the general form fitted to the Recruitment and SOI series in Example 5.7 of Chapter 5. We might propose fitting the inflow output, say, I_t, using the model

\[
I_t = \alpha_0 + \frac{\delta_0}{(1 - \omega_1 B)}\, P_t + \alpha_2 T_t + \eta_t,
\]

which is the transfer function model, without the temperature component, considered in that section.

7.4 Regression with Deterministic Inputs

The previous section considered the case in which the input and output series were jointly stationary, but there are many circumstances in which we might want to assume that the input functions are fixed and have a known functional form. This case arises in the analysis of data from designed experiments. For example, we may want to take a collection of earthquakes and explosions such as are shown in Figure 7.2 and test whether the mean functions are the same for either the P or S components or, perhaps, for them jointly. In certain other signal detection problems using arrays, the inputs are used as dummy variables to express lags corresponding to the arrival times of the signal at various elements, under a model corresponding to that of a plane wave from a fixed source propagating across the array. In Figure 7.1, we plotted the mean responses of the cortex as a function of various underlying design configurations corresponding to various stimuli applied to awake and mildly anesthetized subjects.

It is necessary to introduce a replicated version of the underlying model to handle even the univariate situation, and we replace (7.8) by

y_jt = Σ_{r=−∞}^{∞} β′_r z_{j,t−r} + v_jt    (7.38)

for j = 1, 2, . . . , N series, where we assume the vector of known deterministic inputs, z_jt = (z_jt1, . . . , z_jtq)′, satisfies

Σ_{t=−∞}^{∞} |t| |z_jtk| < ∞

for j = 1, . . . , N replicates of an underlying process involving k = 1, . . . , q regression functions. The model can also be treated under the assumption that the deterministic functions satisfy Grenander's conditions, as in Hannan (1970), but we do not need those conditions here and simply follow the approach in Shumway (1983, 1988).

Page 439: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

7.4: Deterministic Inputs 427

It will sometimes be convenient in what follows to represent the model in matrix notation, writing (7.38) as

y_t = Σ_{r=−∞}^{∞} z_{t−r} β_r + v_t,    (7.39)

where z_t = (z_1t, . . . , z_Nt)′ are the N × q matrices of independent inputs and y_t and v_t are the N × 1 output and error vectors. The error vector v_t = (v_1t, . . . , v_Nt)′ is assumed to be a multivariate, zero-mean, stationary, normal process with spectral matrix f_v(ω)I_N, proportional to the N × N identity matrix. That is, we assume the error series v_jt are independently and identically distributed with spectral densities f_v(ω).

Example 7.2 An Infrasonic Signal from a Nuclear Explosion

Often, we will observe a common signal, say, β_t, on an array of sensors, with the response at the j-th sensor denoted by y_jt, j = 1, . . . , N. For example, Figure 7.6 shows an infrasonic or low-frequency acoustic signal from a nuclear explosion, as observed on a small triangular array of N = 3 acoustic sensors. These signals appear at slightly different times. Because of the way signals propagate, a plane wave signal of this kind, from a given source, traveling at a given velocity, will arrive at elements in the array at predictable time delays. In the case of the infrasonic signal in Figure 7.6, the delays were approximated by computing the cross-correlation between elements and simply reading off the time delay corresponding to the maximum. For a detailed discussion of the statistical analysis of array signals, see Shumway et al. (1999).

A simple additive signal plus noise model of the form

y_jt = β_{t−τ_j} + v_jt    (7.40)

can be assumed, where τ_j, j = 1, 2, . . . , N are the time delays that determine the start point of the signal at each element of the array. The model (7.40) is written in the form (7.38) by letting z_jt = δ_{t−τ_j}, where δ_t = 1 when t = 0 and is zero otherwise. In this case, we are interested both in detecting the presence of the signal and in estimating its waveform β_t. A plausible estimator of the waveform would be the unbiased beam, say,

β̂_t = (1/N) Σ_{j=1}^{N} y_{j,t+τ_j},    (7.41)

where the time delays were measured as τ_1 = −17, τ_2 = 0, and τ_3 = 22 from the cross-correlation function. The bottom panel of Figure 7.6 shows the computed beam; the noise in the individual channels has been reduced, and the essential characteristics of the common signal are retained in the average.
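A minimal R sketch of the unbiased beam (7.41) follows; the three channels below are simulated stand-ins constructed from the model (7.40), and with the actual infrasonic recordings one would substitute the observed series and the delays read off the cross-correlations.

# Sketch: the unbiased beam (7.41), averaging channels after aligning by delay.
set.seed(2)
n   <- 2048
sig <- c(rep(0, 900), exp(-(1:300)/80) * sin(2*pi*(1:300)/40), rep(0, n - 1200))
tau <- c(-17, 0, 22)                       # delays, as read from the cross-correlations

shift <- function(x, k) {                  # returns the series x_{t-k}, zero padded
  n <- length(x)
  if (k >= 0) c(rep(0, k), x)[1:n] else c(x[(-k + 1):n], rep(0, -k))
}

# simulate y_{jt} = beta_{t - tau_j} + v_{jt}, as in (7.40)
y <- sapply(tau, function(tj) shift(sig, tj) + rnorm(n, sd = 0.5))

# beam: beta.hat_t = (1/N) sum_j y_{j, t + tau_j}
beam <- rowMeans(sapply(seq_along(tau), function(j) shift(y[, j], -tau[j])))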

Page 440: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

428 Frequency Domain Methods

Figure 7.6 Three series (y_1t, y_2t, y_3t) for a nuclear explosion detonated 25 km south of Christmas Island and the delayed average or beam (time scale in points, 10 points per second).

The above discussion and example serve to motivate a more detailed look at the estimation and detection problems in the case in which the input series z_jt are fixed and known. We consider the modifications needed for this case in the following sections.

Estimation of the Regression Relation

Because the regression model (7.38) involves fixed functions, we may parallel the usual approach using the Gauss–Markov theorem to search for linearly filtered estimators of the form

β̂_t = Σ_{j=1}^{N} Σ_{r=−∞}^{∞} h_jr y_{j,t−r},    (7.42)

where h_jt = (h_jt1, . . . , h_jtq)′ is a vector of filter coefficients, determined so the estimators are unbiased and have minimum variance. The equivalent matrix form is

β̂_t = Σ_{r=−∞}^{∞} h_r y_{t−r},    (7.43)

where h_t = (h_1t, . . . , h_Nt) is a q × N matrix of filter functions. The matrix form resembles the usual classical regression case and is more convenient for extending the Gauss–Markov theorem to lagged regression. The unbiasedness condition is considered in Problem 7.7. It can be shown (see Shumway and Dean, 1968) that h_js can be taken as the Fourier transform of

H_j(ω) = S_z^{−1}(ω) Z_j(ω),    (7.44)

where

Z_j(ω) = Σ_{t=−∞}^{∞} z_jt e^{−2πiωt}    (7.45)

is the infinite Fourier transform of z_jt. The matrix

S_z(ω) = Σ_{j=1}^{N} Z_j(ω) Z′_j(ω)    (7.46)

can be written in the form

S_z(ω) = Z*(ω) Z(ω),    (7.47)

where the N × q matrix Z(ω) is defined by Z(ω) = (Z_1(ω), . . . , Z_N(ω))′. In matrix notation, the Fourier transform of the optimal filter becomes

H(ω) = S_z^{−1}(ω) Z*(ω),    (7.48)

where H(ω) = (H_1(ω), . . . , H_N(ω)) is the q × N matrix of frequency response functions. The optimal filter then becomes the Fourier transform

h_t = ∫_{−1/2}^{1/2} H(ω) e^{2πiωt} dω.    (7.49)

If the transform is not tractable to compute, an approximation analogous to (7.26) may be used.

Example 7.3 Estimation of the Infrasonic Signal in Example 7.2

We consider the problem of producing a best linearly filtered unbiased estimator for the infrasonic signal in Example 7.2. In this case, q = 1 and (7.45) becomes

Z_j(ω) = Σ_{t=−∞}^{∞} δ_{t−τ_j} e^{−2πiωt} = e^{−2πiωτ_j}


and S_z(ω) = N. Hence, we have

H_j(ω) = (1/N) e^{2πiωτ_j}.

Using (7.49), we obtain h_jt = (1/N) δ_{t+τ_j}. Substituting in (7.42), we obtain the best linear unbiased estimator as the beam, computed as in (7.41).

Tests of Hypotheses

We consider first testing the hypothesis that the complete vector β_t is zero, i.e., that the vector signal is absent. We develop a test at each frequency ω by taking single adjacent frequencies of the form ω_k = k/n, as in the initial section. We may approximate the DFT of the observed vector in the model (7.38) using a representation of the form

Y_j(ω_k) = B′(ω_k) Z_j(ω_k) + V_j(ω_k)    (7.50)

for j = 1, . . . , N, where the error terms will be uncorrelated with common variance f(ω_k), the spectral density of the error term. The independent variables Z_j(ω_k) can either be the infinite Fourier transform, or they can be approximated by the DFT. Hence, we can obtain the matrix version of a complex regression model, written in the form

Y(ω_k) = Z(ω_k) B(ω_k) + V(ω_k),    (7.51)

where the N × q matrix Z(ω_k) has been defined previously below (7.47) and Y(ω_k) and V(ω_k) are N × 1 vectors, with the error vector V(ω_k) having mean zero and covariance matrix f(ω_k)I_N. The usual regression arguments show that the maximum likelihood estimator for the regression coefficient will be

B̂(ω_k) = S_z^{−1}(ω_k) s_zy(ω_k),    (7.52)

where S_z(ω_k) is given by (7.47) and

s_zy(ω_k) = Z*(ω_k) Y(ω_k) = Σ_{j=1}^{N} Z_j(ω_k) Y_j(ω_k).    (7.53)

Also, the maximum likelihood estimator for the error spectral matrix is proportional to

s²_{y·z}(ω_k) = Σ_{j=1}^{N} |Y_j(ω_k) − B̂′(ω_k) Z_j(ω_k)|²
             = Y*(ω_k) Y(ω_k) − Y*(ω_k) Z(ω_k) [Z*(ω_k) Z(ω_k)]^{−1} Z*(ω_k) Y(ω_k)
             = s²_y(ω_k) − s*_zy(ω_k) S_z^{−1}(ω_k) s_zy(ω_k),    (7.54)


Table 7.2 Analysis of Power (ANOPOW) for Testing No Contribution from the Independent Series at Frequency ω in the Fixed Input Case

Source        Power             Degrees of Freedom
Regression    SSR(ω)  (7.56)    2Lq
Error         SSE(ω)  (7.57)    2L(N − q)
Total         SST(ω)            2LN

where

s²_y(ω_k) = Σ_{j=1}^{N} |Y_j(ω_k)|².    (7.55)

Under the null hypothesis that the regression coefficient B(ω_k) = 0, the estimator for the error power is just s²_y(ω_k). If smoothing is needed, we may replace (7.54) and (7.55) by components smoothed over the frequencies ω_k + ℓ/n, for ℓ = −m, . . . , m and L = 2m + 1, close to ω. In that case, we obtain the regression and error spectral components as

SSR(ω) = Σ_{ℓ=−m}^{m} s*_zy(ω_k + ℓ/n) S_z^{−1}(ω_k + ℓ/n) s_zy(ω_k + ℓ/n)    (7.56)

and

SSE(ω) = Σ_{ℓ=−m}^{m} s²_{y·z}(ω_k + ℓ/n).    (7.57)

The F-statistic for testing no regression relation is

F_{2Lq, 2L(N−q)} = [(N − q)/q] · SSR(ω)/SSE(ω).    (7.58)

The analysis of power pertaining to this situation appears in Table 7.2. In the fixed regression case, we consider the partitioned hypothesis that is the analog of β_2t = 0 in (7.28), with x_t1, x_t2 replaced by z_t1, z_t2. Here, we partition S_z(ω) into q_i × q_j (i, j = 1, 2) submatrices, say,

S_z(ω_k) = | S_11(ω_k)  S_12(ω_k) |
           | S_21(ω_k)  S_22(ω_k) | ,    (7.59)

and the cross-spectral vector into its q_i × 1, for i = 1, 2, subvectors

s_zy(ω_k) = | s_1y(ω_k) |
            | s_2y(ω_k) | .    (7.60)

Here, we test the hypothesis β_2t = 0 at frequency ω by comparing the residual power (7.54) under the full model with the residual power under the reduced model, given by

s²_{y·1}(ω_k) = s²_y(ω_k) − s*_1y(ω_k) S_11^{−1}(ω_k) s_1y(ω_k).    (7.61)


Table 7.3 Analysis of Power (ANOPOW) for Testing No Contribution from the Last q_2 Inputs in the Fixed Input Case

Source        Power             Degrees of Freedom
Regression    SSR(ω)  (7.62)    2Lq_2
Error         SSE(ω)  (7.63)    2L(N − q)
Total         SST(ω)            2L(N − q_1)

Again, it is desirable to add over adjacent frequencies with roughly comparable spectra so the regression and error power components can be taken as

SSR(ω) = Σ_{ℓ=−m}^{m} [ s²_{y·1}(ω_k + ℓ/n) − s²_{y·z}(ω_k + ℓ/n) ]    (7.62)

and

SSE(ω) = Σ_{ℓ=−m}^{m} s²_{y·z}(ω_k + ℓ/n).    (7.63)

The information can again be summarized as in Table 7.3, where the ratio of mean regression and error power components leads to the F-statistic

F_{2Lq_2, 2L(N−q)} = [(N − q)/q_2] · SSR(ω)/SSE(ω).    (7.64)

We illustrate the analysis of power procedure using the infrasonic signal detection problem of Example 7.2.

Example 7.4 Detecting the Infrasonic Signal Using ANOPOW

We consider the problem of detecting the common signal for the three infrasonic series shown in Figure 7.6. The presence of the signal is obvious in the waveforms, so the test here mainly confirms the statistical significance and isolates the frequencies containing the strongest signal components. Each series contained n = 1024 points, sampled at 10 points per second. We use the model in (7.40), so Z_j(ω) = e^{−2πiωτ_j} and S_z(ω) = N as in Example 7.3, with s_zy(ω_k) given as

s_zy(ω_k) = Σ_{j=1}^{N} e^{2πiω_kτ_j} Y_j(ω_k),

using (7.46) and (7.53). The above expression can be interpreted as being proportional to the weighted mean or beam, computed in frequency, and we introduce the notation

B_w(ω_k) = (1/N) Σ_{j=1}^{N} e^{2πiω_kτ_j} Y_j(ω_k)    (7.65)


Figure 7.7 Analysis of power for the infrasound array (top panel: total power and error power) and the F-statistic compared with the F_.01(6, 12) level (bottom panel), showing detection at .033 cy/sec (10 pts/sec).

for that term. Substituting for the power components in Table 7.3 yields

s*_zy(ω_k) S_z^{−1}(ω_k) s_zy(ω_k) = N |B_w(ω_k)|²

and

s²_{y·z}(ω_k) = Σ_{j=1}^{N} |Y_j(ω_k) − B_w(ω_k)|² = Σ_{j=1}^{N} |Y_j(ω_k)|² − N |B_w(ω_k)|²

for the regression (signal) and error components, respectively. Because there are only three elements in the array and a reasonable number of points in time, it seems advisable to employ some smoothing over frequency to obtain additional degrees of freedom. In this case, L = 3, yielding 2(3) = 6 and 2(3)(3 − 1) = 12 degrees of freedom for the numerator and denominator of the F-statistic (7.58). Figure 7.7 shows the analysis of power components due to error and the total power. The power is maximum at about .0044 cycles per point, or about .044 cycles per second. The F-statistic is compared with the 1% significance level F_.01(6, 12) = 4.82 in the bottom panel and has the strongest detection at about .034 cycles per second, a result mainly because the error power is decreasing more quickly than the regression or signal power in that band. Little power of consequence appears to exist in the higher range (.3–.5 cycles per second).
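As a rough illustration of how the power components of Table 7.2 might be computed for this detection problem (a sketch, not the authors' code), one can form the frequency-domain beam (7.65) from the DFTs, take SSR(ω) = N|B_w(ω)|² and SSE(ω) = Σ_j |Y_j(ω)|² − N|B_w(ω)|², smooth both over L frequencies, and form the ratio (7.58) with q = 1. The matrix y of channels and the delays tau are assumed available, for instance from the simulation sketch following Example 7.2.

# Sketch: ANOPOW detection of a common signal on an array (Example 7.4).
# Assumes y (n x N matrix of channels) and tau (vector of delays) are available.
N     <- ncol(y); n <- nrow(y)
L     <- 3
freqs <- (1:(n/2)) / n                          # Fourier frequencies, cycles per point
k     <- 2:(n/2 + 1)                            # DFT rows matching freqs

Y  <- mvfft(y) / sqrt(n)                        # DFT of each channel (columns)
Zk <- exp(-2i * pi * outer(freqs, tau))         # Z_j(omega) = exp(-2*pi*i*omega*tau_j)

Bw  <- rowMeans(Y[k, ] * Conj(Zk))              # beam (7.65): (1/N) sum_j e^{2 pi i w tau_j} Y_j
SSR <- N * Mod(Bw)^2                            # regression (signal) power at each frequency
SSE <- rowSums(Mod(Y[k, ])^2) - SSR             # error power at each frequency

SSRs  <- stats::filter(SSR, rep(1, L), sides = 2)   # smooth over L adjacent frequencies
SSEs  <- stats::filter(SSE, rep(1, L), sides = 2)
Fstat <- (N - 1) * SSRs / SSEs                  # (7.58) with q = 1
crit  <- qf(.99, 2 * L, 2 * L * (N - 1))        # 1% significance level, e.g. F_.01(6, 12)

Plotting Fstat against freqs (or against 10*freqs, in cycles per second, for this sampling rate) would give the analog of the bottom panel of Figure 7.7.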

Although there are examples of detecting multiple regression functions of the general type considered above (see, for example, Shumway, 1983), we do not consider additional examples of partitioning in the fixed input case here. The reason is that several examples exist in the section on designed experiments that illustrate the partitioned approach.

7.5 Random Coefficient Regression

The lagged regression models considered so far have assumed the input process is either stochastic or fixed and the components of the vector of regression functions β_t are fixed and unknown parameters to be estimated. There are many cases in time series analysis in which it is more natural to regard the regression vector as an unknown stochastic signal. For example, we have studied the state-space model in Chapter 6, where the state equation can be considered as involving a random parameter vector that is essentially a multivariate autoregressive process. In §4.10, we considered estimating the univariate regression function β_t as a signal extraction problem.

In this section, we consider a random coefficient regression model of (7.39) in the equivalent form

y_t = Σ_{r=−∞}^{∞} z_{t−r} β_r + v_t,    (7.66)

where y_t = (y_1t, . . . , y_Nt)′ is the N × 1 response vector and z_t = (z_1t, . . . , z_Nt)′ are the N × q matrices containing the fixed input processes. Here, the components of the q × 1 regression vector β_t are zero-mean, uncorrelated, stationary series with common spectral matrix f_β(ω)I_q, and the error series v_t have zero means and spectral matrix f_v(ω)I_N, where I_N is the N × N identity matrix. Then, defining the N × q matrix Z(ω) = (Z_1(ω), Z_2(ω), . . . , Z_N(ω))′ of Fourier transforms of z_t, as in (7.45), it is easy to show the spectral matrix of the response vector y_t is given by

f_y(ω) = f_β(ω) Z(ω) Z*(ω) + f_v(ω) I_N.    (7.67)

The regression model with a stochastic stationary signal component is a general version of the simple additive noise model

y_t = β_t + v_t,

Page 447: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

7.5: Random Coefficients 435

considered by Wiener (1949) and Kolmogorov (1941), who derived the minimum mean squared error estimators for β_t, as in §4.10. The more general multivariate version (7.66) represents the series as a convolution of the signal vector β_t and a known set of vector input series contained in the matrix z_t. Restricting the covariance matrices of signal and noise to diagonal form is consistent with what is done in statistics using random effects models, which we consider in a later section. The problem of estimating the regression function β_t is often called deconvolution in the engineering and geophysical literature.

Estimation of the Regression Relation

The regression function β_t can be estimated by a general filter of the form (7.43), where we write that estimator in matrix form

β̂_t = Σ_{r=−∞}^{∞} h_r y_{t−r},    (7.68)

where h_t = (h_1t, . . . , h_Nt), and apply the orthogonality principle, as in §3.9. A generalization of the argument in that section (see Problem 7.8) leads to the estimator

H(ω) = [S_z(ω) + θ(ω)I_q]^{−1} Z*(ω)    (7.69)

for the Fourier transform of the minimum mean-squared error filter, where the parameter

θ(ω) = f_v(ω) / f_β(ω)    (7.70)

is the inverse of the signal-to-noise ratio. It is clear from the frequency domain version of the linear model (7.51) that the comparable version of the estimator (7.52) can be written as

B̂(ω) = [S_z(ω) + θ(ω)I_q]^{−1} s_zy(ω).    (7.71)

This version exhibits the estimator in the stochastic regressor case as the usual estimator, with a ridge correction, θ(ω), that is proportional to the inverse of the signal-to-noise ratio.

The mean-squared covariance of the estimator is shown to be

E[(B̂ − B)(B̂ − B)*] = f_v(ω) [S_z(ω) + θ(ω)I_q]^{−1},    (7.72)

which again exhibits the close connection between this case and the variance of the estimator (7.52), which can be shown to be f_v(ω) S_z^{−1}(ω).

Page 448: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

436 Frequency Domain Methods

Example 7.5 Estimating the Random Infrasonic Signal

In Example 7.4, we have already determined the components needed in (7.69) and (7.70) to obtain the estimators for the random signal. The Fourier transform of the optimum filter at series j has the form

H_j(ω) = e^{2πiωτ_j} / [N + θ(ω)]    (7.73)

with the mean-squared error given by f_v(ω)/[N + θ(ω)] from (7.72). The net effect of applying the filters will be the same as filtering the beam with the frequency response function

H_0(ω) = N / [N + θ(ω)] = N f_β(ω) / [f_v(ω) + N f_β(ω)],    (7.74)

where the last form is more convenient in cases in which portions of the signal spectrum are essentially zero.

The optimal filters h_t have frequency response functions that depend on the signal spectrum f_β(ω) and noise spectrum f_v(ω), so we will need estimators for these parameters to apply the optimal filters. Sometimes, there will be values, suggested from experience, for the signal-to-noise ratio 1/θ(ω) as a function of frequency. The analogy between the model here and the usual variance components model in statistics, however, suggests we try an approach along those lines, as in the next section.

Detection and Parameter Estimation

The analogy to the usual variance components situation suggests looking at the regression and error components of Table 7.2 under the stochastic signal assumptions. We consider the components of (7.56) and (7.57) at a single frequency ω_k. In order to estimate the spectral components f_β(ω) and f_v(ω), we reconsider the linear model (7.51) under the assumption that B(ω_k) is a random process with spectral matrix f_β(ω_k)I_q. Then, the spectral matrix of the observed process is (7.67), evaluated at frequency ω_k.

Consider first the component of the regression power, defined as

SSR(ω_k) = s*_zy(ω_k) S_z^{−1}(ω_k) s_zy(ω_k) = Y*(ω_k) Z(ω_k) S_z^{−1}(ω_k) Z*(ω_k) Y(ω_k).

A computation shows

E[SSR(ω_k)] = f_β(ω_k) tr{S_z(ω_k)} + q f_v(ω_k),

Page 449: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

7.5: Random Coefficients 437

where tr denotes the trace of a matrix. If we can find a set of frequencies of the form ω_k + ℓ/n, where the spectra and the Fourier transforms S_z(ω_k + ℓ/n) ≈ S_z(ω) are relatively constant, the expectation of the averaged values in (7.56) yields

E[SSR(ω)] = L f_β(ω) tr[S_z(ω)] + Lq f_v(ω).    (7.75)

A similar computation establishes

E[SSE(ω)] = L(N − q) f_v(ω).    (7.76)

We may obtain approximately unbiased estimators for the spectra f_v(ω) and f_β(ω) by replacing the expected power components by their observed values and solving (7.75) and (7.76).

Example 7.6 Estimating the Power Components and the RandomInfrasonic Signal

In order to provide an optimum estimator for the infrasonic signal, we need estimators for the signal and noise spectra f_β(ω) and f_v(ω) for the case considered in Example 7.5. The form of the filter is H_0(ω), given in (7.74); having q = 1 and S_z(ω) = N at all frequencies in this example simplifies the computations considerably. We may estimate

f̂_v(ω) = SSE(ω) / [L(N − 1)]    (7.77)

and

f̂_β(ω) = (LN)^{−1} [ SSR(ω) − SSE(ω)/(N − 1) ],    (7.78)

using (7.75) and (7.76) for this special case. Cases will exist in which (7.78) is negative, and the estimated signal spectrum can be set to zero for those frequencies. The estimators can be substituted into the optimal filters, say, H_0(ω) in (7.74) to apply to the beam, or to use in the filter (7.73) applied to each level.

The analysis of variance estimators can be computed using the analysis of power given in Figure 7.7, and the results of that computation, applying (7.77) and (7.78), are shown in the top panel of Figure 7.8 for a bandwidth of B = 7/2048 ≈ .0034 cycles per point, or about .03 cycles per second (Hz). Neither spectrum contains any significant power for frequencies greater than .04 cycles per point, or about .4 Hz. As expected, the signal spectral estimator is substantial over a narrow band, and this leads to an estimated filter, with estimated frequency response function H_0(ω), shown on the left-hand side of the second panel. The estimated optimal filter essentially deletes frequencies above .014 Hz and, subject to slight modification, differs little from a standard low-pass filter with that cutoff.

Page 450: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

438 Frequency Domain Methods

Figure 7.8 Estimated signal and noise spectra (top panels), filter frequency and impulse responses (middle panels), and the original and filtered beams (bottom panels).

Computing the time version with a cutoff at M = 201 points and using a taper leads to the estimated impulse response function h_0(t), as shown on the right-hand side of the middle panel. Finally, we apply the optimal filter to the beam and get the filtered beam β̂_t shown in the bottom right-hand panel. It is smoother than the beam in the bottom left-hand panel, where we have reproduced the beam shown earlier in Figure 7.6. The analysis shows the primary signal as basically a low-frequency signal with primary power at about .05 Hz or, essentially, a wave with a 20-second period.
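Continuing the array sketch given earlier (again only an illustration under the stated assumptions), the smoothed power components give the spectrum estimates (7.77)–(7.78) and the beam filter (7.74) directly; here the filter is applied in the frequency domain rather than through the 201-point time-domain filter used in the text.

# Sketch: spectrum estimates and the optimal beam filter of Example 7.6.
# Assumes N, L, SSRs, SSEs, beam and n from the earlier array sketches.
fv.hat <- SSEs / (L * (N - 1))                          # noise spectrum, (7.77)
fb.hat <- pmax((SSRs - SSEs / (N - 1)) / (L * N), 0)    # signal spectrum, (7.78), set >= 0

H0 <- N * fb.hat / (fv.hat + N * fb.hat)                # frequency response (7.74)
H0[is.na(H0)] <- 0                                      # end frequencies lost to smoothing

# apply H0 to the beam in the frequency domain (an alternative to inverting H0
# to a tapered time-domain filter, as is done in the text)
Hfull <- c(H0[1], H0, rev(H0[-length(H0)]))             # extend over all n Fourier frequencies
beam.filtered <- Re(fft(fft(beam) * Hfull, inverse = TRUE)) / n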

7.6 Analysis of Designed Experiments

An important special case (see Brillinger, 1973, 1980) of the regression model (7.50) occurs when the regression (7.39) is of the form

y_t = z β_t + v_t,    (7.79)

where z = (z_1, z_2, . . . , z_N)′ is a matrix that determines what is observed by the j-th series; i.e.,

y_jt = z′_j β_t + v_jt.    (7.80)

Page 451: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

7.6: ANOPOW 439

In this case, the matrix z of independent variables is constant, and we will have the frequency domain model

Y(ω_k) = Z B(ω_k) + V(ω_k)    (7.81)

corresponding to (7.51), where the matrix Z(ω_k) was a function of frequency ω_k. The matrix is purely real, in this case, but the equations (7.52)–(7.58) can be applied with Z(ω_k) replaced by the constant matrix Z.

Equality of Means

A typical general problem that we encounter in analyzing real data is a simple equality of means test in which there might be a collection of time series y_ijt, i = 1, . . . , I, j = 1, . . . , N_i, belonging to I possible groups, with N_i series in group i. To test equality of means, we may write the regression model in the form

y_ijt = µ_t + α_it + v_ijt,    (7.82)

where µ_t denotes the overall mean and α_it denotes the effect of the i-th group at time t, and we require that Σ_i α_it = 0 for all t. In this case, the full model can be written in the general regression notation as

y_ijt = z′_ij β_t + v_ijt

where β_t = (µ_t, α_1t, α_2t, . . . , α_{I−1,t})′

denotes the regression vector, subject to the constraint. The reduced model becomes

y_ijt = µ_t + v_ijt    (7.83)

under the assumption that the group means are equal. In the full model, there are I possible values for the I × 1 design vectors z_ij; the first component is always one for the mean, and the rest have a one in the i-th position for i = 1, . . . , I − 1 and zeros elsewhere. The vectors for the last group take the value −1 for i = 2, 3, . . . , I − 1. Under the reduced model, each z_ij is a single column of ones. The rest of the analysis follows the approach summarized in (7.52)–(7.58). In this particular case, the power components in Table 7.3 (before smoothing) simplify to

SSR(ω_k) = Σ_{i=1}^{I} Σ_{j=1}^{N_i} |Y_i·(ω_k) − Y_··(ω_k)|²    (7.84)

and

SSE(ω_k) = Σ_{i=1}^{I} Σ_{j=1}^{N_i} |Y_ij(ω_k) − Y_i·(ω_k)|²,    (7.85)

which are analogous to the usual sums of squares in the analysis of variance. Note that a dot (·) stands for a mean, taken over the appropriate subscript, so the regression power component SSR(ω_k) is basically the power in the residuals of the group means from the overall mean, and the error power component SSE(ω_k) reflects the departures of the group means from the original data values. Smoothing each component over L frequencies leads to the usual F-statistic (7.64) with 2L(I − 1) and 2L(Σ_i N_i − I) degrees of freedom at each frequency ω of interest.
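A sketch of the frequency-by-frequency equality of means test is given below. It is not the text's code; the groups here are simulated placeholders, and in the fMRI application each matrix would hold the series for one treatment group at a given brain location.

# Sketch: equality of means test across I groups, as in (7.84)-(7.85).
set.seed(3)
n      <- 128
groups <- list(matrix(rnorm(n * 5), n, 5),      # placeholder data: one n x N_i matrix
               matrix(rnorm(n * 4), n, 4),      # of series per group
               matrix(rnorm(n * 5), n, 5))
I  <- length(groups); Ni <- sapply(groups, ncol); L <- 3

Yg     <- lapply(groups, function(x) mvfft(x) / sqrt(n))   # DFTs, one column per series
Ybar.i <- sapply(Yg, rowMeans)                              # group mean DFTs Y_i.(omega)
Ybar   <- drop(Ybar.i %*% Ni) / sum(Ni)                     # overall mean DFT Y_..(omega)

SSR <- drop(Mod(Ybar.i - Ybar)^2 %*% Ni)                    # (7.84)
SSE <- Reduce(`+`, lapply(seq_len(I), function(i)
          rowSums(Mod(Yg[[i]] - Ybar.i[, i])^2)))           # (7.85)

k     <- 2:(n/2 + 1)                                        # frequencies 1/n, ..., 1/2
SSRs  <- stats::filter(SSR[k], rep(1, L), sides = 2)        # smooth over L frequencies
SSEs  <- stats::filter(SSE[k], rep(1, L), sides = 2)
Fstat <- ((sum(Ni) - I) / (I - 1)) * SSRs / SSEs            # df 2L(I-1) and 2L(sum(Ni)-I)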

Example 7.7 Means Test for the Magnetic Resonance Imaging Data

Figure 7.1 showed the mean responses of subjects to various levels of periodic stimulation while awake and while under anesthesia, as collected in a pain perception experiment of Antognini et al. (1997). Three types of periodic stimuli were presented to awake and anesthetized subjects, namely, brushing, heat, and shock. The periodicity was introduced by applying the stimuli (brushing, heat, and shocks) in on–off sequences lasting 32 seconds each, and the sampling rate was one point every two seconds. The blood oxygenation level (BOLD) signal intensity (Ogawa et al., 1990) was measured at nine locations in the brain. Areas of activation were determined using a technique first described by Bandettini et al. (1993). The specific locations of the brain where the signal was measured were Cortex 1: Primary Somatosensory, Contralateral; Cortex 2: Primary Somatosensory, Ipsilateral; Cortex 3: Secondary Somatosensory, Contralateral; Cortex 4: Secondary Somatosensory, Ipsilateral; Caudate; Thalamus 1: Contralateral; Thalamus 2: Ipsilateral; Cerebellum 1: Contralateral; and Cerebellum 2: Ipsilateral. Figure 7.1 shows the mean response of subjects at Cortex 1 for each of the six treatment combinations, 1: Awake-Brush (5 subjects), 2: Awake-Heat (4 subjects), 3: Awake-Shock (5 subjects), 4: Low-Brush (3 subjects), 5: Low-Heat (5 subjects), and 6: Low-Shock (4 subjects). The objective of this first analysis is to test equality of these six group means, paying special attention to the 64-second period band (1/64 cycles per second) expected from the periodic driving stimuli. Because a test of equality is needed at each of the nine brain locations, we took α = .001 to control for the overall error rate. Figure 7.9 shows F-statistics, computed from (7.64), with L = 3, and we see substantial signals for the four cortex locations and for the second cerebellum trace, but the effects are nonsignificant in the caudate and thalamus regions. Hence, we will retain the four cortex locations and the second cerebellum location for further analysis.

An Analysis of Variance Model

The arrangement of treatments for the fMRI data in Figure 7.1 suggests more information might be available than was obtained from the simple equality of means test. Separate effects caused by state of consciousness as well as the


Figure 7.9 Frequency-dependent equality of means tests for fMRI data at 9 brain locations; L = 3 and critical value F_.001(30, 120) = 2.26.

separate treatments brush, heat, and shock might exist. The reduced signal present in the low shock mean suggests a possible interaction between the treatments and level of consciousness. The arrangement in the classical two-way table suggests looking at the analog of the two-factor analysis of variance as a function of frequency. In this case, we would obtain a different version of the regression model (7.82) of the form

y_ijkt = µ_t + α_it + β_jt + γ_ijt + v_ijkt    (7.86)

for the k-th individual undergoing the i-th level of some factor A and the j-th level of some other factor B, i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , n_ij. The number of individuals in each cell can be different, as for the fMRI data in the next example. In the above model, we assume the response can be modeled as the sum of a mean, µ_t, a row effect (type of stimulus), α_it, a column effect (level


Table 7.4 Rows of the Design Matrix z′_j for fMRI Data. Number of Observations per Cell in Parentheses

                 Awake                    Low Anesthesia
Brush     1  1  0  1  1  0  (5)     1  1  0 −1 −1  0  (3)
Heat      1  0  1  1  0  1  (4)     1  0  1 −1  0 −1  (5)
Shock     1 −1 −1  1 −1 −1  (5)     1 −1 −1 −1  1  1  (4)

of consciousness), β_jt, and an interaction, γ_ijt, with the usual restrictions

Σ_i α_it = Σ_j β_jt = Σ_i γ_ijt = Σ_j γ_ijt = 0

required for a full rank design matrix Z in the overall regression model (7.81). If the number of observations in each cell were the same, the usual simple analogous version of the power components (7.84) and (7.85) would exist for testing various hypotheses. In the case of (7.86), we are interested in testing hypotheses obtained by dropping one set of terms at a time out of (7.86), so an A factor (testing α_it = 0), a B factor (β_jt = 0), and an interaction term (γ_ijt = 0) will appear as components in the analysis of power. Because of the unequal numbers of observations in each cell, we often put the model in the form of the regression model (7.79)–(7.81).
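One convenient way to build the rows of Table 7.4 is to let R generate the sum-to-zero coding directly; the sketch below reproduces the design rows for the 3 × 2 layout (the factor names are just labels, and the unequal cell counts would enter when the rows are replicated for the individual subjects).

# Sketch: the design rows of Table 7.4 via sum-to-zero (ANOVA) contrasts.
cells <- expand.grid(
  stimulus      = factor(c("Brush", "Heat", "Shock"), levels = c("Brush", "Heat", "Shock")),
  consciousness = factor(c("Awake", "Low"), levels = c("Awake", "Low")))

Z <- model.matrix(~ stimulus * consciousness, data = cells,
                  contrasts.arg = list(stimulus = "contr.sum", consciousness = "contr.sum"))
cbind(cells, Z)    # each row matches the corresponding entry of Table 7.4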

Example 7.8 Analysis of Power Tests for the Magnetic ResonanceImaging Data

For the fMRI data given as the means in Figure 7.1, a model of the form (7.86) is plausible and will yield more detailed information than the simple equality of means test described earlier. The results of that test, shown in Figure 7.9, were that the means were different for the four cortex locations and for the second cerebellum location. We may examine these differences further by testing whether the mean differences are because of the nature of the stimulus or the consciousness level, or perhaps due to an interaction between the two factors. Unequal numbers of observations exist in the cells that contributed the means in Figure 7.1. For the regression vector,

(µ_t, α_1t, β_1t, β_2t, γ_11t, γ_21t)′,

the rows of the design matrix are as specified in Table 7.4. Note the restrictions given above for the parameters.

The results of testing the three hypotheses are shown in Figure 7.10 for the four cortex locations and the cerebellum, the components that showed some significant differences in the means in Figure 7.9. Again, the regression power components were smoothed over L = 3 frequencies.


Figure 7.10 Analysis of power for fMRI data at five locations (cortex 1–4 and cerebellum), testing the stimulus, consciousness, and interaction effects; L = 3 and critical values F_.001(6, 120) = 4.04 for stimulus and F_.001(12, 120) = 3.02 for consciousness and interaction.

Appealing to the ANOPOW results summarized in Table 7.3 for each of the subhypotheses, we have q_2 = 1 when the stimulus effect is dropped and q_2 = 2 when either the consciousness effect or the interaction terms are dropped. Hence, 2Lq_2 = 6, 12 for the two cases, with N = Σ_ij n_ij = 26 total observations. Here, the form of the stimulus has the major effect, with the brushing, heat, and shock means substantially different at the probe frequency in four out of five cases. The level of consciousness was less significant and did not show the strong component at the signal frequency. A significant interaction occurred, however, at the ipsilateral component of the primary somatosensory cortex location. The more detailed model does separate the stimuli as having the major effect, but does not isolate which of the three might be more substantial than the other two.

Simultaneous Inference

In the previous examples involving the fMRI data, it would be helpful to focus on the components that contributed most to the rejection of the equal means hypothesis. One way to accomplish this is to develop a test for the significance of an arbitrary linear compound of the form

Ψ(ω_k) = A*(ω_k) B(ω_k),    (7.87)

where the components of the vector A(ω_k) = (A_1(ω_k), A_2(ω_k), . . . , A_q(ω_k))′ are chosen in such a way as to isolate particular linear functions of parameters in the regression vector B(ω_k) in the regression model (7.81). This argument suggests developing a test of the hypothesis Ψ(ω_k) = 0 for all possible values of the linear coefficients in the compound (7.87), as is done in the conventional analysis of variance approach (see, for example, Scheffé, 1959).

Recalling the material involving the regression models of the form (7.51), the linear compound (7.87) can be estimated by

Ψ̂(ω_k) = A*(ω_k) B̂(ω_k),    (7.88)

where B̂(ω_k) is the estimated vector of regression coefficients given by (7.52) and is independent of the error spectrum s²_{y·z}(ω_k) in (7.54). It is possible to show the maximum of the ratio

F(A) = [(N − q)/q] · |Ψ̂(ω_k) − Ψ(ω_k)|² / [s²_{y·z}(ω_k) Q(A)],    (7.89)

where

Q(A) = A*(ω_k) S_z^{−1}(ω_k) A(ω_k),    (7.90)

is bounded by a statistic that has an F-distribution with 2q and 2(N − q) degrees of freedom. Testing the hypothesis that the compound has a particular value, usually Ψ(ω_k) = 0, then proceeds naturally by comparing the statistic (7.89), evaluated at the hypothesized value, with the α level point on an F_{2q,2(N−q)} distribution. We can choose an infinite number of compounds of the form (7.87) and the test will still be valid at level α. As before, arguing that the error spectrum is relatively constant over a band enables us to smooth the numerator and denominator of (7.89) separately over L frequencies, so the distribution involving the smoothed components is F_{2Lq,2L(N−q)}.


Example 7.9 Simultaneous Inference for Magnetic ResonanceImaging Data

As an example, consider the previous tests for significance of the fMRI factors, in which we have indicated the primary effects are among the stimuli but have not investigated which of the stimuli, heat, brushing, or shock, had the most effect. To analyze this further, consider the means model (7.82) and a 6 × 1 contrast vector of the form

Ψ̂ = A*(ω_k) B̂(ω_k) = Σ_{i=1}^{6} A*_i(ω_k) Y_i·(ω_k),    (7.91)

where the means are easily shown to be the regression coefficients in this particular case. The means are ordered by columns; the first three means are the three levels of stimuli for the awake state, and the last three means are the levels for the anesthetized state. In this special case, the denominator terms are

Q = Σ_{i=1}^{6} |A_i(ω_k)|² / N_i,    (7.92)

with SSE(ω_k) available in (7.85). In order to evaluate the effect of a particular stimulus, like brushing over the two levels of consciousness, we may take A_1(ω_k) = A_4(ω_k) = 1 for the two brush levels and A_i(ω_k) = 0 otherwise. From Figure 7.11, we see that, at the first and third cortex locations, brush and heat are both significant, whereas the fourth cortex shows only brush and the second cerebellum shows only heat. Shock appears to be transmitted relatively weakly, when averaged over the awake and mildly anesthetized states.
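Under the means model, the quantities in (7.88)–(7.92) reduce to simple combinations of the group mean DFTs and the error power, so a contrast can be screened at every frequency with a few lines of code. The sketch below reuses the objects (Ybar.i, SSE, Ni, I, L, n, k) from the equality-of-means sketch given earlier and is only an illustration; for the fMRI example the I = 6 group means would be ordered as described above, and one would set A[c(1, 4)] <- 1 to enhance the brush effect.

# Sketch: simultaneous inference for a linear compound in the means model.
A       <- rep(0, I); A[1] <- 1                 # contrast coefficients (real here)
Psi.hat <- drop(Ybar.i %*% A)                   # (7.88): sum_i A_i Y_i.(omega)
Q       <- sum(A^2 / Ni)                        # (7.92)

num  <- stats::filter(Mod(Psi.hat[k])^2, rep(1, L), sides = 2)   # smoothed |Psi.hat|^2
den  <- stats::filter(SSE[k], rep(1, L), sides = 2) * Q          # smoothed error power times Q
Fpsi <- ((sum(Ni) - I) / I) * num / den         # compare with F on 2LI and 2L(sum(Ni)-I) df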

Multivariate Tests

Although it is possible to develop multivariate regression along lines analogous to the usual real-valued case, we will only look at tests involving equality of group means and spectral matrices, because these tests appear to be used most often in applications. For these results, consider the p-variate time series y_ijt = (y_ijt1, . . . , y_ijtp)′ to have arisen from observations on j = 1, . . . , N_i individuals in group i, all having mean µ_it and stationary autocovariance matrix Γ_i(h). Denote the DFTs of the group mean vectors as Y_i·(ω_k) and the p × p spectral matrices as f_i(ω_k) for the i = 1, 2, . . . , I groups. Assume the same general properties as for the vector series considered in §7.3.

In the multivariate case, we obtain the analogous versions of (7.84) and (7.85) as the between cross-power and within cross-power matrices

SPR(ω_k) = Σ_{i=1}^{I} Σ_{j=1}^{N_i} (Y_i·(ω_k) − Y_··(ω_k)) (Y_i·(ω_k) − Y_··(ω_k))*    (7.93)


Figure 7.11 Power in simultaneous linear compounds at five locations (cortex 1–4 and cerebellum), enhancing brush, heat, and shock effects; L = 3, F_.001(36, 120) = 1.80.

and

SPE(ω_k) = Σ_{i=1}^{I} Σ_{j=1}^{N_i} (Y_ij(ω_k) − Y_i·(ω_k)) (Y_ij(ω_k) − Y_i·(ω_k))*.    (7.94)

The equality of means test is rejected using the fact that the likelihood ratio test yields a monotone function of

Λ(ω_k) = |SPE(ω_k)| / |SPE(ω_k) + SPR(ω_k)|.    (7.95)

Khatri (1965) and Hannan (1970) give the approximate distribution of the statistic

χ²_{2(I−1)p} = −2 (Σ N_i − I − p − 1) log Λ(ω_k)    (7.96)

as chi-squared with 2(I − 1)p degrees of freedom when the group means are equal.

The case of I = 2 groups reduces to Hotelling's T², as has been shown by Giri (1965), where

T² = [N_1 N_2 / (N_1 + N_2)] [Y_1·(ω_k) − Y_2·(ω_k)]* f_v^{−1}(ω_k) [Y_1·(ω_k) − Y_2·(ω_k)],    (7.97)

where

f_v(ω_k) = SPE(ω_k) / (Σ_i N_i − I)    (7.98)

is the pooled error spectrum given in (7.94), with I = 2. The test statistic, in this case, is

F_{2p, 2(N_1+N_2−p−1)} = [(N_1 + N_2 − p − 1) / ((N_1 + N_2 − 2)p)] T²,    (7.99)

which was shown by Giri (1965) to have the indicated limiting F-distribution with 2p and 2(N_1 + N_2 − p − 1) degrees of freedom when the means are the same. The classical t-test for inequality of two univariate means will be just (7.98) and (7.99) with p = 1.
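For two groups, the statistics (7.97)–(7.99) can be evaluated frequency by frequency from the DFTs of the individual events. The following is a rough sketch (not the text's code) using simulated bivariate series as stand-ins for the earthquake and explosion populations.

# Sketch: two-group equality of means test (7.97)-(7.99) at each frequency, p = 2.
set.seed(4)
n  <- 256; p <- 2
x1 <- replicate(8, matrix(rnorm(n * p), n, p), simplify = FALSE)   # group 1 events
x2 <- replicate(8, matrix(rnorm(n * p), n, p), simplify = FALSE)   # group 2 events
N1 <- length(x1); N2 <- length(x2)

Y1 <- lapply(x1, function(x) mvfft(x) / sqrt(n))    # DFTs, n x p per event
Y2 <- lapply(x2, function(x) mvfft(x) / sqrt(n))

k <- 2:(n/2 + 1)                                    # frequencies 1/n, ..., 1/2
Fstat <- sapply(k, function(kk) {
  y1 <- t(sapply(Y1, function(Y) Y[kk, ]))          # N1 x p matrix of DFTs at this frequency
  y2 <- t(sapply(Y2, function(Y) Y[kk, ]))
  m1 <- colMeans(y1); m2 <- colMeans(y2)
  d1 <- sweep(y1, 2, m1); d2 <- sweep(y2, 2, m2)
  SPE <- t(d1) %*% Conj(d1) + t(d2) %*% Conj(d2)    # within cross-power, (7.94) with I = 2
  fv  <- SPE / (N1 + N2 - 2)                        # pooled error spectrum (7.98)
  d   <- m1 - m2
  T2  <- Re((N1 * N2 / (N1 + N2)) * Conj(d) %*% solve(fv, d))      # (7.97)
  drop((N1 + N2 - p - 1) / ((N1 + N2 - 2) * p) * T2)               # (7.99)
})
crit <- qf(.999, 2 * p, 2 * (N1 + N2 - p - 1))      # e.g. F_.001(4, 26) for N1 = N2 = 8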

Testing equality of the spectral matrices is also of interest, not only for discrimination and pattern recognition, as considered in the next section, but also as a test indicating whether the equality of means test, which assumes equal spectral matrices, is valid. The test evolves from the likelihood ratio criterion, which compares the single group spectral matrices

f̂_i(ω_k) = (N_i − 1)^{−1} Σ_{j=1}^{N_i} (Y_ij(ω_k) − Y_i·(ω_k)) (Y_ij(ω_k) − Y_i·(ω_k))*    (7.100)

with the pooled spectral matrix (7.98). A modification of the likelihood ratio test, which incorporates the degrees of freedom M_i = N_i − 1 and M = Σ M_i, rather than the sample sizes, into the likelihood ratio statistic, uses

L′(ω_k) = [ M^{Mp} / Π_{i=1}^{I} M_i^{M_i p} ] · [ Π_i |M_i f̂_i(ω_k)|^{M_i} / |M f̂_v(ω_k)|^{M} ].    (7.101)

Krishnaiah et al. (1976) have given the moments of L′(ω_k) and calculated 95% critical points for p = 3, 4 using a Pearson Type I approximation. For reasonably large samples involving smoothed spectral estimators, the approximation involving the first term of the usual chi-squared series will suffice, and Shumway (1982) has given

χ²_{(I−1)p²} = −2r log L′(ω_k),    (7.102)

where

1 − r = [(p + 1)(p − 1) / (6p(I − 1))] (Σ_i M_i^{−1} − M^{−1}),    (7.103)

with an approximate chi-squared distribution with (I − 1)p² degrees of freedom when the spectral matrices are equal. Introduction of smoothing over L frequencies leads to replacing M_i and M by LM_i and LM in the equations above.

Of course, it is often of great interest to use the above result for testing equality of two univariate spectra, and it is obvious from the material in Chapter 4 that

F_{2LM_1, 2LM_2} = f̂_1(ω) / f̂_2(ω)    (7.104)

will have the requisite F-distribution with 2LM_1 and 2LM_2 degrees of freedom when the spectra are smoothed over L frequencies.
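For a single component, the estimator (7.100) and the ratio (7.104) are easy to compute directly; the sketch below, which continues the two-group sketch above (Y1, Y2, n, k assumed available), is an illustration rather than the analysis reported in Example 7.10.

# Sketch: testing equality of two univariate spectra via the ratio (7.104).
L  <- 21
M1 <- length(Y1) - 1; M2 <- length(Y2) - 1           # M_i = N_i - 1

spec.group <- function(Ylist, comp = 1) {            # group spectrum (7.100), one component
  Ymat <- sapply(Ylist, function(Y) Y[k, comp])      # frequencies x events, complex
  dev  <- Ymat - rowMeans(Ymat)                      # subtract the group mean DFT
  fhat <- rowSums(Mod(dev)^2) / (ncol(Ymat) - 1)
  stats::filter(fhat, rep(1/L, L), sides = 2)        # smooth over L frequencies
}

Fratio <- spec.group(Y1) / spec.group(Y2)            # compare with F on 2LM1, 2LM2 df
crit   <- qf(c(.0005, .9995), 2 * L * M1, 2 * L * M2)  # two-sided .001 limits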

Example 7.10 Equality of Means and Spectral Matrices for Earth-quakes and Explosions

An interesting problem arises when attempting to develop a methodology for discriminating between waveforms originating from explosions and those that came from the more commonly occurring earthquakes. Figure 7.2 shows a small subset of a larger population of bivariate series consisting of two phases from each of eight earthquakes and eight explosions. If the large-sample approximations to normality hold for the DFTs of these series, it is of interest to know whether the differences between the two classes are better represented by the mean functions or by the spectral matrices. The tests described above can be applied to look at these two questions. The upper left panel of Figure 7.12 shows the test statistic (7.99), with the straight line denoting the critical level for α = .001, i.e., F_.001(4, 26) = 7.36, for equal means using L = 1, and the test statistic remains well below its critical value at all frequencies, implying that the means of the two classes of series are not significantly different. Checking Figure 7.2 shows little reason exists to suspect that either the earthquakes or explosions have a nonzero mean signal. Checking the equality of the spectra and the spectral matrices, however, leads to a different conclusion. Some smoothing (L = 21) is useful here, and univariate tests on both the P and S components using (7.104) and N_1 = N_2 = 8 lead to strong rejections of the equal spectra


Figure 7.12 Tests for equality of means, P-spectra, S-spectra, and spectral matrices for the earthquake and explosion data; p = 2, L = 21, n = 1024 points at 40 points per second.

hypotheses, with F_.001(∞, ∞) = 1.00 exceeded at almost all frequencies. The rejection seems stronger for the S component, and we might tentatively identify that component as being dominant. Testing equality of the spectral matrices using (7.102) and χ²_.001(4) = 18.47 shows a similarly strong rejection of the equality of spectral matrices. We use these results to suggest optimal discriminant functions based on spectral differences in the next section.

7.7 Discrimination and Cluster Analysis

The extension of classical pattern-recognition techniques to experimental time series is a problem of great practical interest. A series of observations indexed in time often produces a pattern that may form a basis for discriminating between different classes of events. As an example, consider Figure 7.2, which shows regional (100–2000 km) recordings of several typical Scandinavian earthquakes and mining explosions measured by stations in Scandinavia. A listing of the events is given in Kakizawa et al. (1998). The problem of discriminating between mining explosions and earthquakes is a reasonable proxy for the problem of discriminating between nuclear explosions and earthquakes. This latter problem is one of critical importance for monitoring a comprehensive test-ban treaty. Time series classification problems are not restricted to geophysical applications, but occur under many and varied circumstances in other fields. Traditionally, the detection of a signal embedded in a noise series has been analyzed in the engineering literature by statistical pattern recognition techniques (see Problems 7.13 and 7.14).

The historical approaches to the problem of discriminating among different classes of time series can be divided into two distinct categories. The optimality approach, as found in the engineering and statistics literature, makes specific Gaussian assumptions about the probability density functions of the separate groups and then develops solutions that satisfy well-defined minimum error criteria. Typically, in the time series case, we might assume the difference between classes is expressed through differences in the theoretical mean and covariance functions and use likelihood methods to develop an optimal classification function. A second class of techniques, which might be described as a feature extraction approach, proceeds more heuristically by looking at quantities that tend to be good visual discriminators for well-separated populations and have some basis in physical theory or intuition. Less attention is paid to finding functions that are approximations to some well-defined optimality criterion.

As in the case of regression, both time domain and frequency domain approaches to discrimination will exist. For relatively short univariate series, a time domain approach that follows conventional multivariate discriminant analysis, as described in conventional multivariate texts such as Anderson (1984) or Johnson and Wichern (1992), may be preferable. We might even characterize differences by the autocovariance functions generated by different ARMA or state-space models. For longer multivariate time series that can be regarded as stationary after the common mean has been subtracted, the frequency domain approach will be easier computationally because the np-dimensional vector in the time domain, represented here as x = (x′_1, x′_2, . . . , x′_n)′, with x_t = (x_t1, . . . , x_tp)′, will be reduced to separate computations made on the p-dimensional DFTs. This happens because of the approximate independence of the DFTs, X(ω_k), 0 ≤ ω_k ≤ 1, a property that we have often used in preceding chapters.

Finally, the grouping properties of measures like the discrimination information and likelihood-based statistics can be used to develop measures of disparity for clustering multivariate time series. In this section, we define a measure of disparity between two multivariate time series in terms of the spectral matrices of the two processes and then apply hierarchical clustering and partitioning techniques to identify natural groupings within the bivariate earthquake and explosion populations.


The General Discrimination Problem

The general problem of classifying a vector time series x occurs in the following way. We observe a time series x known to belong to one of g populations, denoted by Π_1, Π_2, . . . , Π_g. The general problem is to assign or classify this observation into one of the g groups in some optimal fashion. An example might be the g = 2 populations of earthquakes and explosions shown in Figure 7.2. We would like to classify the unknown event, shown as NZ in the bottom two panels, as belonging to either the earthquake (Π_1) or explosion (Π_2) population. To solve this problem, we need an optimality criterion that leads to a statistic T(x) that can be used to assign the NZ event to either the earthquake or explosion population. To measure the success of the classification, we need to evaluate errors that can be expected in the future, relating to the number of earthquakes classified as explosions (false alarms) and the number of explosions classified as earthquakes (missed signals).

The problem can be formulated by assuming the observed series x has a probability density p_i(x) when the observed series is from population Π_i, for i = 1, . . . , g. Then, partition the space spanned by the np-dimensional process x into g mutually exclusive regions R_1, R_2, . . . , R_g such that, if x falls in R_i, we assign x to population Π_i. The misclassification probability is defined as the probability of classifying the observation into population Π_j when it belongs to Π_i, for j ≠ i, and would be given by the expression

P(j|i) = ∫_{R_j} p_i(x) dx.    (7.105)

The overall total error probability depends also on the prior probabilities, say, π_1, π_2, . . . , π_g, of belonging to one of the g groups. For example, the probability that an observation x originates from Π_i and is then classified into Π_j is obviously π_i P(j|i), and the total error probability becomes

P_e = Σ_{i=1}^{g} π_i Σ_{j≠i} P(j|i).    (7.106)

Although costs have not been incorporated into (7.106), it is easy to do so by multiplying P(j|i) by C(j|i), the cost of assigning a series from population Π_i to Π_j. The overall error P_e is minimized by classifying x into Π_i if

p_i(x) / p_j(x) > π_j / π_i    (7.107)

for all j ≠ i (see, for example, Anderson, 1984). A quantity of interest, from the Bayesian perspective, is the posterior probability an observation belongs to population Π_i, conditional on observing x, say,

P(Π_i | x) = π_i p_i(x) / Σ_j π_j p_j(x).    (7.108)


The procedure that classifies x into the population Π_i for which the posterior probability is largest is equivalent to that implied by using the criterion (7.107). The posterior probabilities give an intuitive idea of the relative odds of belonging to each of the plausible populations.

Many situations occur, such as in the classification of earthquakes and explosions, in which there are only g = 2 populations of interest. For two populations, the Neyman–Pearson lemma implies, in the absence of prior probabilities, classifying an observation into Π_1 when

p_1(x) / p_2(x) > K    (7.109)

minimizes each of the error probabilities for a fixed value of the other. The rule is identical to the Bayes rule (7.107) when K = π_2/π_1.

The theory given above takes a simple form when the vector x has a p-variate normal distribution with mean vectors µ_j and covariance matrices Σ_j under Π_j for j = 1, 2, . . . , g. In this case, simply use

p_j(x) = (2π)^{−p/2} |Σ_j|^{−1/2} exp{ −(1/2)(x − µ_j)′ Σ_j^{−1} (x − µ_j) }.    (7.110)

The classification functions are conveniently expressed by quantities that are proportional to the logarithms of the densities, say,

g_j(x) = −(1/2) ln|Σ_j| − (1/2) x′ Σ_j^{−1} x + µ_j′ Σ_j^{−1} x − (1/2) µ_j′ Σ_j^{−1} µ_j + ln π_j.    (7.111)

In expressions involving the log likelihood, we will generally ignore terms involving the constant − ln 2π. For this case, we may assign an observation x to population Π_i whenever

g_i(x) > g_j(x)    (7.112)

for j ≠ i, j = 1, . . . , g, and the posterior probability (7.108) has the form

P(Π_i | x) = exp{g_i(x)} / Σ_j exp{g_j(x)}.

A common situation occurring in applications involves classification for g = 2 groups under the assumption of multivariate normality and equal covariance matrices; i.e., Σ_1 = Σ_2 = Σ. Then, the criterion (7.112) can be expressed in terms of the linear discriminant function

d_l(x) = g_1(x) − g_2(x) = (µ_1 − µ_2)′ Σ^{−1} x − (1/2)(µ_1 − µ_2)′ Σ^{−1} (µ_1 + µ_2) + ln(π_1/π_2),    (7.113)

where we classify into Π_1 or Π_2 according to whether d_l(x) ≥ 0 or d_l(x) < 0. The linear discriminant function is clearly a combination of normal variables and, for the case π_1 = π_2 = .5, will have mean D²/2 under Π_1 and mean −D²/2 under Π_2, with variances given by D² under both hypotheses, where

D2 = (µµµ1 − µµµ2)′Σ−1(µµµ1 − µµµ2) (7.114)

is the Mahalanobis distance between the mean vectors µµµ1 and µµµ2. In this case,the two misclassification probabilities (7.1) are

P (1|2) = P (2|1)

= Φ(

−D

2

), (7.115)

and the performance is directly related to the Mahalanobis distance (7.114).

For the case in which the covariance matrices cannot be assumed to be the same, the discriminant function takes a different form, with the difference $g_1(\boldsymbol{x}) - g_2(\boldsymbol{x})$ taking the form

$$d_q(\boldsymbol{x}) = -\frac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|} - \frac{1}{2}\boldsymbol{x}'(\Sigma_1^{-1} - \Sigma_2^{-1})\boldsymbol{x} + (\boldsymbol{\mu}_1'\Sigma_1^{-1} - \boldsymbol{\mu}_2'\Sigma_2^{-1})\boldsymbol{x} + \ln\frac{\pi_1}{\pi_2} \tag{7.116}$$

for $g = 2$ groups. This discriminant function differs from the equal covariance case in the linear term and in a nonlinear quadratic term involving the differing covariance matrices. The distribution theory is not tractable for the quadratic case, so no convenient expression like (7.115) is available for the error probabilities of the quadratic discriminant function.

A difficulty in applying the above theory to real data is that the group mean vectors $\boldsymbol{\mu}_j$ and covariance matrices $\Sigma_j$ are seldom known. Some engineering problems, such as the detection of a signal in white noise, assume the means and covariance parameters are known exactly, and this can lead to an optimal solution (see Problems 7.14 and 7.15). In the classical multivariate situation, it is possible to collect a sample of $N_i$ training vectors from group $\Pi_i$, say, $\boldsymbol{x}_{ij}$, for $j = 1, \ldots, N_i$, and use them to estimate the mean vectors and covariance matrices for each of the groups $i = 1, 2, \ldots, g$; i.e., simply choose $\bar{\boldsymbol{x}}_{i\cdot}$ and

$$S_i = (N_i - 1)^{-1}\sum_{j=1}^{N_i} (\boldsymbol{x}_{ij} - \bar{\boldsymbol{x}}_{i\cdot})(\boldsymbol{x}_{ij} - \bar{\boldsymbol{x}}_{i\cdot})' \tag{7.117}$$

as the estimators for $\boldsymbol{\mu}_i$ and $\Sigma_i$, respectively. In the case in which the covariance matrices are assumed to be equal, simply use the pooled estimator

$$S = \left(\sum_i N_i - g\right)^{-1}\sum_i (N_i - 1)S_i. \tag{7.118}$$

For the case of a linear discriminant function, we may use

$$\hat{g}_i(\boldsymbol{x}) = \bar{\boldsymbol{x}}_{i\cdot}'S^{-1}\boldsymbol{x} - \frac{1}{2}\bar{\boldsymbol{x}}_{i\cdot}'S^{-1}\bar{\boldsymbol{x}}_{i\cdot} + \ln\pi_i \tag{7.119}$$


as a simple estimator for $g_i(\boldsymbol{x})$. For large samples, $\bar{\boldsymbol{x}}_{i\cdot}$ and $S$ converge to $\boldsymbol{\mu}_i$ and $\Sigma$ in probability, so $\hat{g}_i(\boldsymbol{x})$ converges in distribution to $g_i(\boldsymbol{x})$ in that case. The procedure works reasonably well for the case in which $N_i$, $i = 1, \ldots, g$, are large relative to the length of the series $n$, a case that is relatively rare in time series analysis. For this reason, we will resort to using spectral approximations for the case in which data are given as long time series.

The performance of sample discriminant functions can be evaluated in several different ways. If the population parameters are known, (7.114) and (7.115) can be evaluated directly. If the parameters are estimated, the estimated Mahalanobis distance $\hat{D}^2$ can be substituted for the theoretical value in very large samples. Another approach is to calculate the apparent error rates using the result of applying the classification procedure to the training samples. If $n_{ij}$ denotes the number of observations from population $\Pi_j$ classified into $\Pi_i$, the sample error rates can be estimated by the ratio

$$\hat{P}(i|j) = \frac{n_{ij}}{\sum_i n_{ij}} \tag{7.120}$$

for $i \neq j$. If the training samples are not large, this procedure may be biased and a resampling option like cross-validation or the bootstrap can be employed. A simple version of cross-validation is the jackknife procedure proposed by Lachenbruch and Mickey (1968), which holds out the observation to be classified, deriving the classification function from the remaining observations. Repeating this procedure for each of the members of the training sample and computing (7.120) for the holdout samples leads to better estimators of the error rates.
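To fix ideas, the following is a minimal R sketch (not from the text) of the holdout error-rate calculation in (7.120); the feature matrix, group labels, and priors are illustrative placeholders, and MASS's lda() with CV = TRUE implements the Lachenbruch and Mickey leave-one-out procedure.

  library(MASS)
  set.seed(1)
  feat <- rbind(matrix(rnorm(16, mean = 0), 8, 2),   # simulated group 1 features
                matrix(rnorm(16, mean = 2), 8, 2))   # simulated group 2 features
  grp  <- factor(rep(c("Pi1", "Pi2"), each = 8))
  # lda() with CV = TRUE holds out each observation in turn (jackknife)
  fit <- lda(feat, grouping = grp, prior = c(.5, .5), CV = TRUE)
  # cross-validated confusion matrix and estimated error rates, as in (7.120)
  tab <- table(predicted = fit$class, truth = grp)
  tab
  1 - diag(prop.table(tab, margin = 2))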

Example 7.11 Discriminant Analysis Using Amplitudes from Earthquakes and Explosions

We can give a simple example of applying the above procedures to the logarithms of the amplitudes of the separate P and S components of the original earthquake and explosion traces. The logarithms (base 10) of the maximum peak-to-peak amplitudes of the P and S components, denoted by $\log_{10} P$ and $\log_{10} S$, can be considered as two-dimensional feature vectors, say, $\boldsymbol{x} = (x_1, x_2)' = (\log_{10} P, \log_{10} S)'$, from a bivariate normal population with differing means and covariances. The original data, from Kakizawa et al. (1998), are shown in Table 7.5 and in the left-hand panel of Figure 7.13. The table includes the Novaya Zemlya (NZ) event of unknown origin. The tendency of the earthquakes to have higher values for $\log_{10} S$, relative to $\log_{10} P$, has been noted by many, and the use of the logarithm of the ratio, i.e., $\log_{10} P - \log_{10} S$, in some references (see Lay, 1997, pp. 40-41) is a tacit indicator that a linear function of the two parameters will be a useful discriminant.


Figure 7.13 Classification of earthquakes (*) and explosions (+) using the magnitude features (left panel, $\log_{10} P$ versus $\log_{10} S$) and the Kullback-Leibler and Chernoff(.3) disparity measures (right panel).

The sample means $\bar{\boldsymbol{x}}_{1\cdot} = (4.25, 4.95)'$ and $\bar{\boldsymbol{x}}_{2\cdot} = (4.64, 4.73)'$, and covariance matrices

$$S_1 = \begin{pmatrix} .3096 & .3954 \\ .3954 & .5378 \end{pmatrix} \quad\text{and}\quad S_2 = \begin{pmatrix} .0954 & .0804 \\ .0804 & .1070 \end{pmatrix}$$

are immediate from (7.117), with the pooled covariance matrix given by

$$S = \begin{pmatrix} .2025 & .2379 \\ .2379 & .3238 \end{pmatrix}$$

from (7.118). Although the covariance matrices are not equal, we try the linear discriminant function anyway, which yields (with equal prior probabilities $\pi_1 = \pi_2 = .5$) the sample discriminant functions

$$\hat{g}_1(\boldsymbol{x}) = 22.12\,x_1 - .98\,x_2 - 45.23$$

and

$$\hat{g}_2(\boldsymbol{x}) = 42.61\,x_1 - 16.8\,x_2 - 59.80$$


Table 7.5 Logarithms of Maximum Peak-to-Peak Amplitudes from P and S Components for Eight Earthquakes and Eight Explosions

  EQ   log10 P   log10 S     EXP   log10 P   log10 S
   1     3.91      4.67        1     4.55      4.88
   2     4.78      5.71        2     4.74      4.43
   3     3.98      4.86        3     4.90      5.09
   4     3.76      4.14        4     4.60      4.86
   5     3.80      4.14        5     4.81      4.76
   6     4.88      5.56        6     4.36      4.55
   7     5.06      6.03        7     5.04      5.06
   8     3.80      4.45        8     4.08      4.14

  NZ     3.18      3.27

from (7.119), with the estimated linear discriminant function (7.113) as

$$\hat{d}_l(\boldsymbol{x}) = -20.49\,x_1 + 15.82\,x_2 + 14.57,$$

indicating $\log_{10} S - \log_{10} P = x_2 - x_1$ is not far from the optimal linear discriminant function. The jackknifed posterior probabilities of being an earthquake for the earthquake group ranged from .791 to 1.000, whereas the explosion probabilities for the explosion group ranged from .814 to .998, except for the first explosion, which was classified as an earthquake with a posterior probability of .949. Hence, $n_{12} = 1$ for this particular example. The unknown event, NZ, was classified as an earthquake, with posterior probability .753. Components of the vector for the unknown event NZ were well outside the range of the values spanned by the training set, so the classification here is somewhat suspect. The quadratic discriminant might be more appropriate here, given the observed differences in the two covariance matrices. Applying the sample version of (7.116) leads to essentially the same results, namely, the misclassification of the first explosion as an earthquake, with a posterior probability of .807, and the classification of the unknown NZ event into the earthquake group.
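The following is a rough R sketch (not part of the original example) of the calculation behind (7.113) using the Table 7.5 values; because the means and covariances above are rounded, the coefficients come out close to, but not exactly equal to, those quoted.

  # Table 7.5 feature vectors (log10 P, log10 S)
  eq <- cbind(P = c(3.91, 4.78, 3.98, 3.76, 3.80, 4.88, 5.06, 3.80),
              S = c(4.67, 5.71, 4.86, 4.14, 4.14, 5.56, 6.03, 4.45))
  ex <- cbind(P = c(4.55, 4.74, 4.90, 4.60, 4.81, 4.36, 5.04, 4.08),
              S = c(4.88, 4.43, 5.09, 4.86, 4.76, 4.55, 5.06, 4.14))
  m1 <- colMeans(eq);  m2 <- colMeans(ex)
  S  <- ((nrow(eq) - 1) * cov(eq) + (nrow(ex) - 1) * cov(ex)) /
        (nrow(eq) + nrow(ex) - 2)                 # pooled estimator (7.118)
  slope <- solve(S, m1 - m2)                      # Sigma^{-1}(mu1 - mu2)
  const <- -.5 * sum(slope * (m1 + m2))           # ln(pi1/pi2) = 0 for equal priors
  dl <- function(x) sum(slope * x) + const        # classify to EQ if dl(x) >= 0
  dl(c(3.18, 3.27))                               # NZ event; positive => earthquake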

Frequency Domain Discrimination

The feature extraction approach often works well for discriminating between classes of univariate or multivariate series when there is a simple low-dimensional vector that seems to capture the essence of the differences between the classes. It still seems sensible, however, to develop optimal methods for classification that exploit the differences between the multivariate means and covariance matrices in the time series case. Such methods can be based on the Whittle approximation to the log likelihood given in §7.2. In this case, the vector DFTs, say, $\boldsymbol{X}(\omega_k)$, are assumed to be approximately normal, with means $\boldsymbol{M}_j(\omega_k)$ and spectral matrices $f_j(\omega_k)$ for population $\Pi_j$ at frequencies


$\omega_k = k/n$, for $k = 0, 1, \ldots, [n/2]$, and are approximately uncorrelated at different frequencies, say, $\omega_k$ and $\omega_\ell$ for $k \neq \ell$. Then, writing the complex normal densities as in §7.2 leads to a criterion similar to (7.111); namely,

$$g_j(\boldsymbol{X}) = \ln\pi_j - \sum_{0 < \omega_k < 1/2} \left[ \ln|f_j(\omega_k)| + \boldsymbol{X}^*(\omega_k)f_j^{-1}(\omega_k)\boldsymbol{X}(\omega_k) - 2\boldsymbol{M}_j^*(\omega_k)f_j^{-1}(\omega_k)\boldsymbol{X}(\omega_k) + \boldsymbol{M}_j^*(\omega_k)f_j^{-1}(\omega_k)\boldsymbol{M}_j(\omega_k) \right], \tag{7.121}$$

where the sum goes over frequencies for which $|f_j(\omega_k)| \neq 0$. The periodicity of the spectral density matrix and the DFT allows adding over $0 < \omega_k < 1/2$. The classification rule is as in (7.112).

In the time series case, it is more likely the discriminant analysis involves assuming the covariance matrices are different and the means are equal. For example, the tests, shown in Figure 7.12, imply, for the earthquakes and explosions, the primary differences are in the bivariate spectral matrices and the means are essentially the same. For this case, it will be convenient to write the Whittle approximation to the log likelihood in the form

$$\ln p_j(\boldsymbol{X}) = \sum_{0 < \omega_k < 1/2} \left[ -\ln|f_j(\omega_k)| - \boldsymbol{X}^*(\omega_k)f_j^{-1}(\omega_k)\boldsymbol{X}(\omega_k) \right], \tag{7.122}$$

where we have omitted the prior probabilities from the equation. The quadratic detector in this case can be written in the form

$$\ln p_j(\boldsymbol{X}) = \sum_{0 < \omega_k < 1/2} \left[ -\ln|f_j(\omega_k)| - \mathrm{tr}\left\{ I(\omega_k)f_j^{-1}(\omega_k) \right\} \right], \tag{7.123}$$

where

$$I(\omega_k) = \boldsymbol{X}(\omega_k)\boldsymbol{X}^*(\omega_k) \tag{7.124}$$

denotes the periodogram matrix. For equal prior probabilities, we may assign an observation $\boldsymbol{x}$ into population $\Pi_i$ whenever

$$\ln p_i(\boldsymbol{X}) > \ln p_j(\boldsymbol{X}) \tag{7.125}$$

for $j \neq i$, $j = 1, 2, \ldots, g$.

Numerous authors have considered various versions of discriminant analysis in the frequency domain. Shumway and Unger (1974) considered (7.121) for $p = 1$ and equal covariance matrices, so the criterion reduces to a simple linear one. They apply the criterion to discriminating between earthquakes and explosions using teleseismic P wave data in which the means over the two groups might be considered as fixed. Alagon (1989) and Dargahi-Noubary and Laycock (1981) considered discriminant functions of the form (7.121) in the univariate case when the means are zero and the spectra for the two groups are different. Taniguchi et al. (1994) adopted (7.122) as a criterion and discussed


its non-Gaussian robustness. Shumway (1982) reviews general discriminant functions in both the univariate and multivariate time series cases.
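As a concrete illustration, the following R sketch (not from the text) evaluates the quadratic detector (7.123) for each group and applies rule (7.125); the list objects holding the periodogram and group spectral matrices are placeholders introduced for the illustration.

  # I.per and each element of f.groups: lists of p x p matrices over 0 < omega_k < 1/2
  whittle.score <- function(I.per, f.group) {
    sum(mapply(function(Ik, fk) {
      ev <- eigen(fk, symmetric = TRUE, only.values = TRUE)$values  # fk is Hermitian
      -sum(log(ev)) - Re(sum(diag(Ik %*% solve(fk))))               # summand in (7.123)
    }, I.per, f.group))
  }
  classify <- function(I.per, f.groups)        # f.groups: list of g group spectra
    which.max(sapply(f.groups, function(fg) whittle.score(I.per, fg)))   # rule (7.125)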

Measures of Disparity

Before proceeding to examples of discriminant and cluster analysis, it is useful to consider the relation to the Kullback-Leibler (K-L) discrimination information, as defined in Problem 2.4 of Chapter 2. Using the spectral approximation and noting the periodogram matrix has the approximate expectation

$$E_j I(\omega_k) = f_j(\omega_k)$$

under the assumption that the data come from population $\Pi_j$, and approximating the ratio of the densities by

$$\ln\frac{p_1(\boldsymbol{X})}{p_2(\boldsymbol{X})} = \sum_{0 < \omega_k < 1/2} \left[ -\ln\frac{|f_1(\omega_k)|}{|f_2(\omega_k)|} - \mathrm{tr}\left\{ \left( f_2^{-1}(\omega_k) - f_1^{-1}(\omega_k) \right) I(\omega_k) \right\} \right],$$

we may write the approximate discrimination information as

$$I(f_1; f_2) = \frac{1}{n}\,E_1 \ln\frac{p_1(\boldsymbol{X})}{p_2(\boldsymbol{X})} = \frac{1}{n}\sum_{0 < \omega_k < 1/2} \left[ \mathrm{tr}\left\{ f_1(\omega_k)f_2^{-1}(\omega_k) \right\} - \ln\frac{|f_1(\omega_k)|}{|f_2(\omega_k)|} - p \right]. \tag{7.126}$$

The approximation may be carefully justified by noting the multivariate normal time series $\boldsymbol{x} = (\boldsymbol{x}_1', \boldsymbol{x}_2', \ldots, \boldsymbol{x}_n')'$ with zero means and $np \times np$ stationary covariance matrices $\Gamma_1$ and $\Gamma_2$ will have $p \times p$ blocks of $n \times n$ matrices, with elements of the form $\gamma_{ij}^{(\ell)}(s - t)$, $s, t = 1, \ldots, n$, $i, j = 1, \ldots, p$, for population $\Pi_\ell$, $\ell = 1, 2$. The discrimination information, under these conditions, becomes

$$I(1; 2 : \boldsymbol{x}) = \frac{1}{n}\,E_1 \ln\frac{p_1(\boldsymbol{x})}{p_2(\boldsymbol{x})} = \frac{1}{2n}\left[ \mathrm{tr}\left\{ \Gamma_1\Gamma_2^{-1} \right\} - \ln\frac{|\Gamma_1|}{|\Gamma_2|} - np \right]. \tag{7.127}$$

The limiting result

$$\lim_{n\to\infty} I(1; 2 : \boldsymbol{x}) = \frac{1}{2}\int_{-1/2}^{1/2} \left[ \mathrm{tr}\left\{ f_1(\omega)f_2^{-1}(\omega) \right\} - \ln\frac{|f_1(\omega)|}{|f_2(\omega)|} - p \right] d\omega$$

has been shown, in various forms, by Pinsker (1964), Hannan (1970), and Kazakos and Papantoni-Kazakos (1980). The discrete version of (7.126) is just the approximation to the integral of the limiting form. The K-L measure of disparity is not a true distance, but it can be shown that $I(1; 2) \geq 0$, with equality if and only if $f_1(\omega) = f_2(\omega)$ almost everywhere. This result makes it potentially suitable as a measure of disparity between the two densities.


A connection exists, of course, between the discrimination information number, which is just the expectation of the likelihood criterion, and the likelihood itself. For example, we may measure the disparity between the sample and the process defined by the theoretical spectrum $f_j(\omega_k)$ corresponding to population $\Pi_j$ in the sense of Kullback (1978), as $I(\hat{f}; f_j)$, where

$$\hat{f}(\omega_k) = L^{-1}\sum_{\ell=-m}^{m} I(\omega_k + \ell/n) \tag{7.128}$$

denotes the smoothed spectral matrix. The likelihood ratio criterion can be thought of as measuring the disparity between the periodogram and the theoretical spectrum for each of the populations. To make the discrimination information finite, we replace the periodogram implied by the log likelihood by the sample spectrum. In this case, the classification procedure can be regarded as finding the population closest, in the sense of minimizing disparity between the sample and theoretical spectral matrices. The classification in this case proceeds by simply choosing the population $\Pi_j$ that minimizes $I(\hat{f}; f_j)$, i.e., assigning $\boldsymbol{x}$ to population $\Pi_i$ whenever

$$I(\hat{f}; f_i) < I(\hat{f}; f_j) \tag{7.129}$$

for $j \neq i$, $j = 1, 2, \ldots, g$.

Kakizawa et al. (1998) proposed using the Chernoff (CH) information measure (Chernoff, 1952; Renyi, 1961), defined as

$$B_\alpha(1; 2) = -\ln E_2\left\{ \left( \frac{p_2(\boldsymbol{x})}{p_1(\boldsymbol{x})} \right)^{\alpha} \right\}, \tag{7.130}$$

where the measure is indexed by a regularizing parameter $\alpha$, for $0 < \alpha < 1$. When $\alpha = .5$, the Chernoff measure is the symmetric divergence proposed by Bhattacharya (1943). For the multivariate normal case,

$$B_\alpha(1; 2 : \boldsymbol{x}) = \frac{1}{n}\left[ \ln\frac{|\alpha\Gamma_1 + (1-\alpha)\Gamma_2|}{|\Gamma_2|} - \alpha\ln\frac{|\Gamma_1|}{|\Gamma_2|} \right]. \tag{7.131}$$

The large sample spectral approximation to the Chernoff information measure is analogous to that for the discrimination information, namely,

$$B_\alpha(f_1; f_2) = \frac{1}{2n}\sum_{0 < \omega_k < 1/2} \left[ \ln\frac{|\alpha f_1(\omega_k) + (1-\alpha)f_2(\omega_k)|}{|f_2(\omega_k)|} - \alpha\ln\frac{|f_1(\omega_k)|}{|f_2(\omega_k)|} \right]. \tag{7.132}$$

The Chernoff measure, when divided by $\alpha(1-\alpha)$, behaves like the discrimination information in the limit in the sense that it converges to $I(1; 2 : \boldsymbol{x})$ for $\alpha \to 0$ and to $I(2; 1 : \boldsymbol{x})$ for $\alpha \to 1$. Hence, near the boundaries of the parameter $\alpha$, it tends to behave like discrimination information and for other values


represents a compromise between the two information measures. The classification rule for the Chernoff measure reduces to assigning $\boldsymbol{x}$ to population $\Pi_i$ whenever

$$B_\alpha(\hat{f}; f_i) < B_\alpha(\hat{f}; f_j) \tag{7.133}$$

for $j \neq i$, $j = 1, 2, \ldots, g$.

Although the classification rules above are well defined if the group spectral matrices are known, this will not be the case in general. If there are $g$ training samples, $\boldsymbol{x}_{ij}$, $j = 1, \ldots, N_i$, $i = 1, \ldots, g$, with $N_i$ vector observations available in each group, the natural estimator for the spectral matrix of group $i$ is just the single-group spectral matrix (7.100), namely, with $\boldsymbol{X}_{ij}(\omega_k)$ denoting the vector DFTs,

$$\hat{f}_i(\omega_k) = \frac{1}{N_i - 1}\sum_{j=1}^{N_i} \left( \boldsymbol{X}_{ij}(\omega_k) - \bar{\boldsymbol{X}}_{i\cdot}(\omega_k) \right)\left( \boldsymbol{X}_{ij}(\omega_k) - \bar{\boldsymbol{X}}_{i\cdot}(\omega_k) \right)^*. \tag{7.134}$$

A second consideration is the choice of the regularization parameter $\alpha$ for the Chernoff criterion, (7.132). For the case of $g = 2$ groups, it should be chosen to maximize the disparity between the two group spectra, as defined in (7.132). Kakizawa et al. (1998) simply plot (7.132) as a function of $\alpha$, using the estimated group spectra in (7.134), choosing the value that gives the maximum disparity between the two groups.
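The following is a minimal R sketch (not from the text) of the disparities (7.126) and (7.132) between two estimated spectral matrices, each supplied as a list of p x p Hermitian matrices over 0 < omega_k < 1/2, together with the plot-and-pick strategy for alpha; the object names f1, f2, and n are assumptions made for the illustration.

  ldet <- function(f)      # log determinant of a Hermitian spectral matrix
    sum(log(eigen(f, symmetric = TRUE, only.values = TRUE)$values))
  kl.disp <- function(f1, f2, n) {              # discrimination information (7.126)
    p <- nrow(f1[[1]])
    sum(mapply(function(a, b)
      Re(sum(diag(a %*% solve(b)))) - (ldet(a) - ldet(b)) - p, f1, f2)) / n
  }
  ch.disp <- function(f1, f2, alpha, n) {       # Chernoff disparity (7.132)
    sum(mapply(function(a, b) {
      m <- alpha * a + (1 - alpha) * b
      (ldet(m) - ldet(b)) - alpha * (ldet(a) - ldet(b))
    }, f1, f2)) / (2 * n)
  }
  alphas <- seq(.05, .95, by = .05)             # evaluate B_alpha over a grid of alpha
  # disp <- sapply(alphas, function(a) ch.disp(f1, f2, a, n)); alphas[which.max(disp)]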

Example 7.12 Discriminant Analysis for Earthquakes and Explosions

The simplest approaches to discriminating between the earthquake and explosion groups have been based either on the relative amplitudes of the P and S phases, as in Figure 7.4, or on relative power components in various frequency bands. Considerable effort has been expended on using various spectral ratios involving the bivariate P and S phases as discrimination features. Kakizawa et al. (1998) mention a number of measures that have been used in the seismological literature as features. These features include ratios of power for the two phases and ratios of power components in high- and low-frequency bands. The use of such features of the spectrum suggests that an optimal procedure based on discriminating between the spectral matrices of two stationary processes would be reasonable. The fact that the hypothesis that the spectral matrices were equal, tested in Example 7.10, was soundly rejected also suggests the use of a discriminant function based on spectral differences. Recall the sampling rate is 40 points per second, leading to a folding frequency of 20 Hz. To avoid numerical problems, we used a broad band (2 Hz, L = 51) and the criteria (7.126) and (7.132), summed over the interval from 0 to 8 Hz, where the spectra were both positive. Narrowing the bandwidth and summing over a broader interval did not substantially change the results.


Table 7.6 Discriminant Scores $I = I(\hat{f}; \hat{f}_1) - I(\hat{f}; \hat{f}_2)$ and $B = B_{.3}(\hat{f}; \hat{f}_1) - B_{.3}(\hat{f}; \hat{f}_2)$ for Earthquakes and Explosions

  EQ      I       B      EXP      I       B
   1     8.51    .54       1     .29    -.25
   2      .81    .50       2   -2.55    -.75
   3    30.80   1.04       3   -1.82    -.61
   4     2.73    .10       4   -1.89    -.44
   5     7.69    .11       5   -1.16    -.45
   6    21.50    .79       6   -2.12    -.61
   7    20.31    .85       7   -2.10    -.59
   8    15.54    .70       8     .93    -.21

The maximum value of the estimated Chernoff disparity $B_\alpha(\hat{f}_1; \hat{f}_2)$ occurs for $\alpha = .3$, and we use that value in the discriminant criterion (7.132). Discriminant scores using the holdout classification functions are shown in Table 7.6 for both criteria. We note the generally good performance of the Chernoff measure, which separates the two populations well and makes no errors; the discrimination information misclassified explosions one and eight as earthquakes. The values for the two sets of scores are plotted in the right-hand panel of Figure 7.13; the discrimination information scores for the earthquakes have larger variances than do those for the explosions (the standard deviations were 9.34 and 1.25, respectively). The Chernoff discriminant scores are distributed on either side of the decision point 0, with means .58 and -.48 for the earthquake and explosion groups, respectively; the standard deviations of the two samples were .34 and .20. The NZ event was also classified using the average spectral matrices of the eight earthquakes and explosions, giving the value -.49 for the discrimination information and -.31 for the Chernoff measure, putting the event in the explosion population by this criterion. Previously, in Example 7.11, the extracted log amplitudes classified this event in the earthquake group. The Russians have asserted no mine blasting or nuclear testing occurred in the area in question, so the event remains somewhat of a mystery. The fact that it was relatively removed geographically from the test set may also have introduced some uncertainties into the procedure.

Cluster Analysis

For the purpose of clustering, it may be more useful to consider symmetric disparity measures, and we introduce the J-divergence measure

$$J(f_1; f_2) = I(f_1; f_2) + I(f_2; f_1) \tag{7.135}$$


and the symmetric Chernoff number

$$JB_\alpha(f_1; f_2) = B_\alpha(f_1; f_2) + B_\alpha(f_2; f_1) \tag{7.136}$$

for that purpose. In this case, we define the disparity between the sample spectral matrix of a single vector, $\boldsymbol{x}$, and the population $\Pi_j$ as

$$J(\hat{f}; f_j) = I(\hat{f}; f_j) + I(f_j; \hat{f}) \tag{7.137}$$

and

$$JB_\alpha(\hat{f}; f_j) = B_\alpha(\hat{f}; f_j) + B_\alpha(f_j; \hat{f}), \tag{7.138}$$

respectively, and use these as quasi-distances between the vector and population $\Pi_j$.

The measures of disparity can be used to cluster multivariate time series. The symmetric measures of disparity, as defined above, ensure that the disparity between $f_i$ and $f_j$ is the same as the disparity between $f_j$ and $f_i$. Hence, we will consider the symmetric forms (7.137) and (7.138) as quasi-distances for the purpose of defining a distance matrix for input into one of the standard clustering procedures (see Johnson and Wichern, 1992). In general, we may consider either hierarchical or partitioned clustering methods using the quasi-distance matrix as an input.

For purposes of illustration, we may use the symmetric divergence (7.137), which implies the quasi-distance between sample series with estimated spectral matrices $\hat{f}_i$ and $\hat{f}_j$ would be (7.137); i.e.,

$$J(\hat{f}_i; \hat{f}_j) = \frac{1}{n}\sum_{0 < \omega_k < 1/2} \left[ \mathrm{tr}\left\{ \hat{f}_i(\omega_k)\hat{f}_j^{-1}(\omega_k) \right\} + \mathrm{tr}\left\{ \hat{f}_j(\omega_k)\hat{f}_i^{-1}(\omega_k) \right\} - 2p \right], \tag{7.139}$$

for $i \neq j$. We can also use the comparable form for the Chernoff divergence, but we may not want to make an assumption for the regularization parameter $\alpha$.

For hierarchical clustering, we begin by clustering the two members of the population that minimize the disparity measure (7.139). Then, these two items form a cluster, and we can compute distances between unclustered items as before. The distance between unclustered items and a current cluster is defined here as the average of the distances to elements in the cluster. Again, we combine objects that are closest together. We may also compute the distance between the unclustered items and clustered items as the closest distance, rather than the average. Once a series is in a cluster, it stays there. At each stage, we have a fixed number of clusters, depending on the merging stage.
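As an illustration, the following R sketch (not from the text) builds the quasi-distance matrix from (7.139) for a collection of estimated spectral matrices and passes it to hclust(); the list structure fhat and the series length n are assumptions made for the illustration.

  # fhat[[i]] is a list of p x p spectral matrices for series i over 0 < omega_k < 1/2
  j.div <- function(fi, fj, n) {
    p <- nrow(fi[[1]])
    sum(mapply(function(a, b)
      Re(sum(diag(a %*% solve(b)))) + Re(sum(diag(b %*% solve(a)))) - 2 * p,
      fi, fj)) / n
  }
  cluster.spec <- function(fhat, n, method = "average") {
    g <- length(fhat)
    D <- matrix(0, g, g)
    for (i in 1:(g - 1)) for (j in (i + 1):g)
      D[i, j] <- D[j, i] <- j.div(fhat[[i]], fhat[[j]], n)
    hclust(as.dist(D), method = method)      # average linkage, as described above
  }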

Alternatively, we may think of clustering as a partitioning of the sample into a prespecified number of groups. MacQueen (1967) has proposed this using k-means clustering, using the Mahalanobis distance between an observation and the group mean vectors. At each stage, a reassignment of an observation into its closest affinity group is possible. To see how this procedure applies


Table 7.7 Clustering Results for Earthquakes and Explosions

  Beginning        Cluster 1          Cluster 2               Cluster 3
  Two Groups:
    Random         EQ 1,2,3,6,7,8     EX 1-8, EQ 4,5, NZ
    Hierarchical   EQ 1-8             EX 1-8, NZ
  Three Groups:
    Random         EQ 1,2,3,6,7,8     EX 1-7, EQ 4, NZ        EQ 5, EX 8
    Hierarchical   EQ 1,2,3,6,7,8     EX 1-7, NZ              EQ 4,5, EX 8

in the current context, consider a preliminary partition into a fixed number of groups and define the disparity between the spectral matrix of the observation, say, $\hat{f}$, and the average spectral matrix of the group, say, $\hat{f}_i$, as $J(\hat{f}; \hat{f}_i)$, where the group spectral matrix can be estimated by (7.134). At any pass, a single series is reassigned to the group for which its disparity is minimized. The reassignment procedure is repeated until all observations stay in their current groups. Of course, the number of groups must be specified for each repetition of the partitioning algorithm, and a starting partition must be chosen. This assignment can either be random or chosen from a preliminary hierarchical clustering, as described above.

Example 7.13 Cluster Analysis for Earthquakes and Explosions

It is instructive to try the clustering procedure on the population of known earthquakes and explosions. Table 7.7 shows the results of applying partitioned clustering under the assumption that either two or three groups are appropriate. Two groups would be simple, assuming the vectors classified naturally into the earthquake and explosion classes, whereas three groups would imply possible outliers from the two primary groups. The starting partitions were defined either by randomly assigning observations to groups or by using the result of the hierarchical clustering procedure. The two-group partition with the hierarchical start configuration tends to produce a final partition that agrees closely with the known configuration, assuming the NZ event is an explosion. The random starting partition puts two of the earthquakes into the explosion group. For the three-group partitions, one or two earthquakes and the last explosion join the third cluster that we have designated as the outlying group.


7.8 Principal Components and Factor Analysis

In this section, we introduce the related topics of spectral domain principal components analysis and factor analysis for time series. The topics of principal components and canonical analysis in the frequency domain are rigorously presented in Brillinger (1981, Chapters 9 and 10), and many of the details concerning these concepts can be found there.

The techniques presented here are related to each other in that they focus on extracting pertinent information from spectral matrices. This information is important because dealing directly with a high-dimensional spectral matrix $f(\omega)$ itself is somewhat cumbersome because it is a function into the set of complex, nonnegative-definite, Hermitian matrices. We can view these techniques as easily understood, parsimonious tools for exploring the behavior of vector-valued time series in the frequency domain with minimal loss of information. Because our focus is on spectral matrices, we assume for convenience that the time series of interest have zero means; the techniques are easily adjusted in the case of nonzero means.

In this and subsequent sections, it will be convenient to work occasionally with complex-valued time series. A $p \times 1$ complex-valued time series can be represented as $\boldsymbol{x}_t = \boldsymbol{x}_{1t} - i\boldsymbol{x}_{2t}$, where $\boldsymbol{x}_{1t}$ is the real part and $\boldsymbol{x}_{2t}$ is the imaginary part of $\boldsymbol{x}_t$. The process is said to be stationary if $E(\boldsymbol{x}_t)$ and $E(\boldsymbol{x}_{t+h}\boldsymbol{x}_t^*)$ exist and are independent of time $t$. The $p \times p$ autocovariance function,

$$\Gamma_{xx}(h) = E(\boldsymbol{x}_{t+h}\boldsymbol{x}_t^*) - E(\boldsymbol{x}_{t+h})E(\boldsymbol{x}_t^*),$$

of $\boldsymbol{x}_t$ satisfies conditions similar to those of the real-valued case. Writing $\Gamma_{xx}(h) = \{\gamma_{ij}(h)\}$, for $i, j = 1, \ldots, p$, we have (i) $\gamma_{ii}(0) \geq 0$ is real, (ii) $|\gamma_{ij}(h)|^2 \leq \gamma_{ii}(0)\gamma_{jj}(0)$ for all integers $h$, and (iii) $\Gamma_{xx}(h)$ is Hermitian in the sense that $\Gamma_{xx}(-h) = \Gamma_{xx}(h)^*$. The spectral theory of complex-valued vector time series is analogous to the real-valued case. For example, $\Gamma_{xx}(h)$ is a nonnegative-definite function on the integers, and if $\sum_h \|\Gamma_{xx}(h)\| < \infty$, the spectral density matrix of the complex series $\boldsymbol{x}_t$ is given by

$$f_{xx}(\omega) = \sum_{h=-\infty}^{\infty} \Gamma_{xx}(h)\exp(-2\pi i h\omega).$$

Principal Components

Classical principal component analysis (PCA) is concerned with explaining the variance-covariance structure among $p$ variables, $\boldsymbol{x} = (x_1, \ldots, x_p)'$, through a few linear combinations of the components of $\boldsymbol{x}$. Suppose we wish to find a linear combination

$$y = \boldsymbol{c}'\boldsymbol{x} = c_1 x_1 + \cdots + c_p x_p \tag{7.140}$$

of the components of $\boldsymbol{x}$ such that $\mathrm{var}(y)$ is as large as possible. Because $\mathrm{var}(y)$ can be increased by simply multiplying $\boldsymbol{c}$ by a constant, it is common to restrict


$\boldsymbol{c}$ to be of unit length; that is, $\boldsymbol{c}'\boldsymbol{c} = 1$. Noting that $\mathrm{var}(y) = \boldsymbol{c}'\Sigma_{xx}\boldsymbol{c}$, where $\Sigma_{xx}$ is the $p \times p$ variance-covariance matrix of $\boldsymbol{x}$, another way of stating the problem is to find $\boldsymbol{c}$ such that

$$\max_{\boldsymbol{c} \neq \boldsymbol{0}} \frac{\boldsymbol{c}'\Sigma_{xx}\boldsymbol{c}}{\boldsymbol{c}'\boldsymbol{c}}. \tag{7.141}$$

Denote the eigenvalue-eigenvector pairs of $\Sigma_{xx}$ by $\{(\lambda_1, \boldsymbol{e}_1), \ldots, (\lambda_p, \boldsymbol{e}_p)\}$, where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$, and the eigenvectors are of unit length. The solution to (7.141) is to choose $\boldsymbol{c} = \boldsymbol{e}_1$, in which case the linear combination $y_1 = \boldsymbol{e}_1'\boldsymbol{x}$ has maximum variance, $\mathrm{var}(y_1) = \lambda_1$. In other words,

$$\max_{\boldsymbol{c} \neq \boldsymbol{0}} \frac{\boldsymbol{c}'\Sigma_{xx}\boldsymbol{c}}{\boldsymbol{c}'\boldsymbol{c}} = \frac{\boldsymbol{e}_1'\Sigma_{xx}\boldsymbol{e}_1}{\boldsymbol{e}_1'\boldsymbol{e}_1} = \lambda_1. \tag{7.142}$$

The linear combination, $y_1 = \boldsymbol{e}_1'\boldsymbol{x}$, is called the first principal component. Because the eigenvalues of $\Sigma_{xx}$ are not necessarily unique, the first principal component is not necessarily unique.

The second principal component is defined to be the linear combination $y_2 = \boldsymbol{c}'\boldsymbol{x}$ that maximizes $\mathrm{var}(y_2)$ subject to $\boldsymbol{c}'\boldsymbol{c} = 1$ and such that $\mathrm{cov}(y_1, y_2) = 0$. The solution is to choose $\boldsymbol{c} = \boldsymbol{e}_2$, in which case $\mathrm{var}(y_2) = \lambda_2$. In general, the $k$-th principal component, for $k = 1, 2, \ldots, p$, is the linear combination $y_k = \boldsymbol{c}'\boldsymbol{x}$ that maximizes $\mathrm{var}(y_k)$ subject to $\boldsymbol{c}'\boldsymbol{c} = 1$ and such that $\mathrm{cov}(y_k, y_j) = 0$, for $j = 1, 2, \ldots, k-1$. The solution is to choose $\boldsymbol{c} = \boldsymbol{e}_k$, in which case $\mathrm{var}(y_k) = \lambda_k$.

One measure of the importance of a principal component is to assess the proportion of the total variance attributed to that principal component. The total variance of $\boldsymbol{x}$ is defined to be the sum of the variances of the individual components; that is, $\mathrm{var}(x_1) + \cdots + \mathrm{var}(x_p) = \sigma_{11} + \cdots + \sigma_{pp}$, where $\sigma_{jj}$ is the $j$-th diagonal element of $\Sigma_{xx}$. This sum is also denoted as $\mathrm{tr}(\Sigma_{xx})$, or the trace of $\Sigma_{xx}$. Because $\mathrm{tr}(\Sigma_{xx}) = \lambda_1 + \cdots + \lambda_p$, the proportion of the total variance attributed to the $k$-th principal component is given simply by $\mathrm{var}(y_k)\big/\mathrm{tr}(\Sigma_{xx}) = \lambda_k\big/\sum_{j=1}^{p}\lambda_j$.

Given a random sample $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$, the sample principal components are defined as above, but with $\Sigma_{xx}$ replaced by the sample variance-covariance matrix, $S_{xx} = (n-1)^{-1}\sum_{i=1}^{n}(\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})'$. Further details can be found in the introduction to classical principal component analysis in Johnson and Wichern (1992, Chapter 9).
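For readers working in R, a small sketch (not from the text) of the sample calculation is the following; the data matrix X is simulated purely for illustration.

  set.seed(1)
  X   <- matrix(rnorm(200), 50, 4) %*% matrix(runif(16), 4, 4)   # illustrative n x p data
  Sxx <- var(X)                               # sample variance-covariance matrix
  ed  <- eigen(Sxx, symmetric = TRUE)
  Y   <- scale(X, center = TRUE, scale = FALSE) %*% ed$vectors   # sample principal components
  ed$values / sum(ed$values)                  # proportions of the total variance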

For the case of time series, suppose we have a zero mean, $p \times 1$, stationary vector process $\boldsymbol{x}_t$ that has a $p \times p$ spectral density matrix given by $f_{xx}(\omega)$. Recall $f_{xx}(\omega)$ is a complex-valued, nonnegative-definite, Hermitian matrix. Using the analogy of classical principal components, and in particular (7.140) and (7.141), suppose, for a fixed value of $\omega$, we want to find a complex-valued univariate process $y_t(\omega) = \boldsymbol{c}(\omega)^*\boldsymbol{x}_t$, where $\boldsymbol{c}(\omega)$ is complex, such that the spectral density of $y_t(\omega)$ is maximized at frequency $\omega$, and $\boldsymbol{c}(\omega)$ is of unit length, $\boldsymbol{c}(\omega)^*\boldsymbol{c}(\omega) = 1$. Because, at frequency $\omega$, the spectral density of $y_t(\omega)$


is $f_y(\omega) = \boldsymbol{c}(\omega)^* f_{xx}(\omega)\boldsymbol{c}(\omega)$, the problem can be restated as: find the complex vector $\boldsymbol{c}(\omega)$ such that

$$\max_{\boldsymbol{c}(\omega) \neq \boldsymbol{0}} \frac{\boldsymbol{c}(\omega)^* f_{xx}(\omega)\boldsymbol{c}(\omega)}{\boldsymbol{c}(\omega)^*\boldsymbol{c}(\omega)}. \tag{7.143}$$

Let $\{(\lambda_1(\omega), \boldsymbol{e}_1(\omega)), \ldots, (\lambda_p(\omega), \boldsymbol{e}_p(\omega))\}$ denote the eigenvalue-eigenvector pairs of $f_{xx}(\omega)$, where $\lambda_1(\omega) \geq \lambda_2(\omega) \geq \cdots \geq \lambda_p(\omega) \geq 0$, and the eigenvectors are of unit length. We note that the eigenvalues of a Hermitian matrix are real. The solution to (7.143) is to choose $\boldsymbol{c}(\omega) = \boldsymbol{e}_1(\omega)$, in which case the desired linear combination is $y_t(\omega) = \boldsymbol{e}_1(\omega)^*\boldsymbol{x}_t$. For this choice,

$$\max_{\boldsymbol{c}(\omega) \neq \boldsymbol{0}} \frac{\boldsymbol{c}(\omega)^* f_{xx}(\omega)\boldsymbol{c}(\omega)}{\boldsymbol{c}(\omega)^*\boldsymbol{c}(\omega)} = \frac{\boldsymbol{e}_1(\omega)^* f_{xx}(\omega)\boldsymbol{e}_1(\omega)}{\boldsymbol{e}_1(\omega)^*\boldsymbol{e}_1(\omega)} = \lambda_1(\omega). \tag{7.144}$$

This process may be repeated for any frequency $\omega$, and the complex-valued process, $y_{t1}(\omega) = \boldsymbol{e}_1(\omega)^*\boldsymbol{x}_t$, is called the first principal component at frequency $\omega$. The $k$-th principal component at frequency $\omega$, for $k = 1, 2, \ldots, p$, is the complex-valued time series $y_{tk}(\omega) = \boldsymbol{e}_k(\omega)^*\boldsymbol{x}_t$, in analogy to the classical case. In this case, the spectral density of $y_{tk}(\omega)$ at frequency $\omega$ is $f_{y_k}(\omega) = \boldsymbol{e}_k(\omega)^* f_{xx}(\omega)\boldsymbol{e}_k(\omega) = \lambda_k(\omega)$.

The previous development of spectral domain principal components is related to the spectral envelope methodology first discussed in Stoffer et al. (1993). We will present the spectral envelope in the next section, where we motivate the use of principal components as it is presented above. Another way to motivate the use of principal components in the frequency domain was given in Brillinger (1981, Chapter 9). Although this technique leads to the same analysis, the motivation may be more satisfactory to the reader at this point. In this case, we suppose we have a stationary, $p$-dimensional, vector-valued process $\boldsymbol{x}_t$

Specifically, we suppose we want to approximate a mean-zero, stationary,vector-valued time series, xxxt, with spectral matrix fxx(ω), by a univariateprocess yt defined by

yt =∞∑

j=−∞ccc∗t−jxxxj , (7.145)

where {cccj} is a p × 1 vector-valued filter, such that {cccj} is absolutely sum-mable; that is,

∑∞j=−∞ |cccj | < ∞. The approximation is accomplished so the

reconstruction of xxxt from yt, say,

xxxt =∞∑

j=−∞bbbt−jyj , (7.146)

where {bbbj} is an absolutely summable p×1 filter, is such that the mean squareapproximation error

E{(xxxt − xxxt)∗(xxxt − xxxt)} (7.147)


is minimized.

Let $\boldsymbol{b}(\omega)$ and $\boldsymbol{c}(\omega)$ be the transforms of $\{\boldsymbol{b}_j\}$ and $\{\boldsymbol{c}_j\}$, respectively. For example,

$$\boldsymbol{c}(\omega) = \sum_{j=-\infty}^{\infty} \boldsymbol{c}_j\exp(-2\pi i j\omega), \tag{7.148}$$

and, consequently,

$$\boldsymbol{c}_j = \int_{-1/2}^{1/2} \boldsymbol{c}(\omega)\exp(2\pi i j\omega)\,d\omega. \tag{7.149}$$

Brillinger (1981, Theorem 9.3.1) shows the solution to the problem is to choose $\boldsymbol{c}(\omega)$ to satisfy (7.143) and to set $\boldsymbol{b}(\omega) = \boldsymbol{c}(\omega)$. This is precisely the previous problem, with the solution given by (7.144). That is, we choose $\boldsymbol{c}(\omega) = \boldsymbol{e}_1(\omega)$ and $\boldsymbol{b}(\omega) = \boldsymbol{e}_1(\omega)$; the filter values can be obtained via the inversion formula given by (7.149). Using these results, in view of (7.145), we may form the first principal component series, say $y_{t1}$.

This technique may be extended by requesting another series, say, $y_{t2}$, for approximating $\boldsymbol{x}_t$ with respect to minimum mean square error, but where the coherency between $y_{t2}$ and $y_{t1}$ is zero. In this case, we choose $\boldsymbol{c}(\omega) = \boldsymbol{e}_2(\omega)$. Continuing this way, we can obtain the first $q \leq p$ principal component series, say, $\boldsymbol{y}_t = (y_{t1}, \ldots, y_{tq})'$, having spectral density $f_q(\omega) = \mathrm{diag}\{\lambda_1(\omega), \ldots, \lambda_q(\omega)\}$. The series $y_{tk}$ is the $k$-th principal component series.

As in the classical case, given observations, $\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n$, from the process $\boldsymbol{x}_t$, we can form an estimate $\hat{f}_{xx}(\omega)$ of $f_{xx}(\omega)$ and define the sample principal component series by replacing $f_{xx}(\omega)$ with $\hat{f}_{xx}(\omega)$ in the previous discussion. Precise details pertaining to the asymptotic ($n \to \infty$) behavior of the principal component series and their spectra can be found in Brillinger (1981, Chapter 9). To give a basic idea of what we can expect, we focus on the first principal component series and on the spectral estimator obtained by smoothing the periodogram matrix, $I_n(\omega_j)$; that is,

$$\hat{f}_{xx}(\omega_j) = \sum_{\ell=-m}^{m} h_\ell\,I_n(\omega_j + \ell/n), \tag{7.150}$$

where $L = 2m + 1$ is odd and the weights are chosen so $h_\ell = h_{-\ell}$ are positive and $\sum_\ell h_\ell = 1$. Under the conditions for which $\hat{f}_{xx}(\omega_j)$ is a well-behaved estimator of $f_{xx}(\omega_j)$, and for which the largest eigenvalue of $f_{xx}(\omega_j)$ is unique,

$$\left\{ \eta_n\left[\hat\lambda_1(\omega_j) - \lambda_1(\omega_j)\right]\big/\lambda_1(\omega_j);\ \eta_n\left[\hat{\boldsymbol{e}}_1(\omega_j) - \boldsymbol{e}_1(\omega_j)\right];\ j = 1, \ldots, J \right\} \tag{7.151}$$

converges ($n \to \infty$) jointly in distribution to independent, zero-mean normal distributions, the first of which is standard normal. In (7.151), $\eta_n^{-2} = \sum_{\ell=-m}^{m} h_\ell^2$, noting we must have $L \to \infty$ and $\eta_n \to \infty$, but $L/n \to 0$ as $n \to \infty$. The asymptotic variance-covariance matrix of $\hat{\boldsymbol{e}}_1(\omega)$, say, $\Sigma_{e_1}(\omega)$, is


Figure 7.14 The individual periodograms of $x_{tk}$, for $k = 1, \ldots, 8$, in Example 7.14.

given by

$$\Sigma_{e_1}(\omega) = \eta_n^{-2}\,\lambda_1(\omega)\sum_{\ell=2}^{p} \lambda_\ell(\omega)\{\lambda_1(\omega) - \lambda_\ell(\omega)\}^{-2}\boldsymbol{e}_\ell(\omega)\boldsymbol{e}_\ell^*(\omega). \tag{7.152}$$

The distribution of $\hat{\boldsymbol{e}}_1(\omega)$ depends on the other latent roots and vectors of $f_{xx}(\omega)$. Writing $\boldsymbol{e}_1(\omega) = (e_{11}(\omega), e_{12}(\omega), \ldots, e_{1p}(\omega))'$, we may use this result to form confidence regions for the components of $\boldsymbol{e}_1$ by approximating the distribution of

$$\frac{2\,|\hat{e}_{1,j}(\omega) - e_{1,j}(\omega)|^2}{s_j^2(\omega)}, \tag{7.153}$$

for $j = 1, \ldots, p$, by a $\chi^2$ distribution with two degrees of freedom. In (7.153), $s_j^2(\omega)$ is the $j$-th diagonal element of $\hat\Sigma_{e_1}(\omega)$, the estimate of $\Sigma_{e_1}(\omega)$. We can use (7.153) to check whether the value of zero is in the confidence region by comparing $2|\hat{e}_{1,j}(\omega)|^2/s_j^2(\omega)$ with $\chi_2^2(1-\alpha)$, the $1-\alpha$ upper tail cutoff of the $\chi_2^2$ distribution.
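For computation, the following is a rough R sketch (not from the text) of the quantities just described: it smooths the periodogram matrices as in (7.150), here with simple flat weights rather than the specific weights used in the examples, and extracts the largest eigenvalue and its eigenvector at each Fourier frequency; the data matrix x is an assumption for the illustration.

  spec.pc <- function(x, m = 2) {
    x <- scale(as.matrix(x), center = TRUE, scale = FALSE)
    n <- nrow(x)
    X <- mvfft(x) / sqrt(n)                                    # DFTs of the p series
    Iper <- lapply(1:n, function(k) X[k, ] %o% Conj(X[k, ]))   # periodogram matrices
    h <- rep(1 / (2 * m + 1), 2 * m + 1)                       # flat weights, sum to one
    lapply(1:floor(n / 2), function(k) {
      idx  <- ((k + (-m:m)) %% n) + 1                          # neighbors of omega_k = k/n
      fhat <- Reduce(`+`, Map(`*`, h, Iper[idx]))              # smoothed estimate, cf. (7.150)
      eg   <- eigen(fhat, symmetric = TRUE)
      list(freq = k / n, lambda1 = eg$values[1], e1 = eg$vectors[, 1])
    })
  }
  # e.g., with a 128 x 8 data matrix x, spec.pc(x)[[4]] would give lambda-hat_1 and
  # e-hat_1 near the stimulus frequency 4/128 considered in Example 7.14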


Figure 7.15 The estimated spectral density, $\hat\lambda_1(j/128)$, of the first principal component series in Example 7.14.

Example 7.14 Principal Component Analysis of the fMRI Data

Recall Example 1.6 of Chapter 1, where the vector time series $\boldsymbol{x}_t = (x_{t1}, \ldots, x_{t8})'$, $t = 1, \ldots, 128$, represents consecutive measures of average blood oxygenation level dependent (BOLD) signal intensity, which measures areas of activation in the brain. Recall subjects were given a non-painful brush on the hand, and the stimulus was applied for 32 seconds and then stopped for 32 seconds; thus, the signal period is 64 seconds (the sampling rate was one observation every two seconds for 256 seconds). The series $x_{tk}$ for $k = 1, 2, 3, 4$ represent locations in cortex, series $x_{t5}$ and $x_{t6}$ represent locations in the thalamus, and $x_{t7}$ and $x_{t8}$ represent locations in the cerebellum.

As is evident from Figure 1.6 in Chapter 1, different areas of the brain are responding differently, and a principal component analysis may help in indicating which locations are responding with the most spectral power, and which locations do not contribute to the spectral power at the stimulus signal period. In this analysis, we will focus primarily on the signal period of 64 seconds, which translates to four cycles in 256 seconds, or $\omega = 4/128$ cycles per time point. In addition, all calculations were performed using the standardized series; that is, we used $x_{tk}/s_k$, for $k = 1, \ldots, 8$, where $s_k$ is the sample standard deviation of the $k$-th series, in the computations.

Figure 7.14 shows individual periodograms of the series $x_{tk}$ for $k = 1, \ldots, 8$. As was evident from Figure 1.6, a strong response to the brush stimulus occurred in areas of the cortex. To estimate the spectral density of $\boldsymbol{x}_t$, we used (7.150) with $L = 5$ and $\{h_0 = 3/9,\ h_{\pm 1} = 2/9,\ h_{\pm 2} = 1/9\}$. Calling the estimated spectrum $\hat{f}_{xx}(j/128)$, for $j = 0, 1, \ldots, 64$, we can


obtain the estimated spectrum of the first principal component series $y_{t1}$ by calculating the largest eigenvalue, $\hat\lambda_1(j/128)$, of $\hat{f}_{xx}(j/128)$ for each $j = 0, 1, \ldots, 64$. The result, $\hat\lambda_1(j/128)$, is shown in Figure 7.15. As expected, there is a large peak at the stimulus frequency 4/128, wherein $\hat\lambda_1(4/128) = 0.339$. The total power at the stimulus frequency is $\mathrm{tr}\big(\hat{f}_{xx}(4/128)\big) = 0.353$, so the proportion of the power at frequency 4/128 attributed to the first principal component series is 0.339/0.353 = 96%. Because the first principal component explains nearly all of the total power at the stimulus frequency, there is no need to explore the other principal component series at this frequency.

The estimated first principal component series at frequency 4/128 is given by $\hat{y}_{t1}(4/128) = \hat{\boldsymbol{e}}_1^*(4/128)\boldsymbol{x}_t$, and the components of $\hat{\boldsymbol{e}}_1(4/128)$ can give insight as to which locations of the brain are responding to the brush stimulus. Table 7.8 shows the magnitudes of $\hat{\boldsymbol{e}}_1(4/128)$. In addition, an approximate 99% confidence interval was obtained for each component using (7.153). As expected, the analysis indicates that location 7 is not contributing to the power at this frequency, but surprisingly, the analysis suggests location 6 is responding to the stimulus.

Table 7.8 Magnitudes of the PC Vector at the Stimulus Frequency in Example 7.14

  Location                      1      2      3      4      5      6      7      8
  $|\hat{\boldsymbol{e}}_1(4/128)|$   0.46   0.40   0.45   0.40   0.28   0.15   0.09*  0.39

  *The value of zero is in an approximate 99% confidence region for this component.

Factor Analysis

Classical factor analysis is similar to classical principal component analysis. Suppose $\boldsymbol{x}$ is a mean-zero, $p \times 1$, random vector with variance-covariance matrix $\Sigma_{xx}$. The factor model proposes that $\boldsymbol{x}$ is dependent on a few unobserved common factors, $z_1, \ldots, z_q$, plus error. In this model, one hopes that $q$ will be much smaller than $p$. The factor model is given by

$$\boldsymbol{x} = B\boldsymbol{z} + \boldsymbol{\epsilon}, \tag{7.154}$$

where $B$ is a $p \times q$ matrix of factor loadings, $\boldsymbol{z} = (z_1, \ldots, z_q)'$ is a random $q \times 1$ vector of factors such that $E(\boldsymbol{z}) = \boldsymbol{0}$ and $E(\boldsymbol{z}\boldsymbol{z}') = I_q$, the $q \times q$ identity matrix. The $p \times 1$ unobserved error vector $\boldsymbol{\epsilon}$ is assumed to be independent of the factors, with zero mean and diagonal variance-covariance matrix


$D = \mathrm{diag}\{\delta_1^2, \ldots, \delta_p^2\}$. Note, (7.154) differs from the multivariate regression model in §5.7 because the factors, $\boldsymbol{z}$, are unobserved. Equivalently, the factor model, (7.154), can be written in terms of the covariance structure of $\boldsymbol{x}$,

$$\Sigma_{xx} = BB' + D; \tag{7.155}$$

i.e., the variance-covariance matrix of $\boldsymbol{x}$ is the sum of a symmetric, nonnegative-definite rank $q \leq p$ matrix and a nonnegative-definite diagonal matrix. If $q = p$, then $\Sigma_{xx}$ can be reproduced exactly as $BB'$, using the fact that $\Sigma_{xx} = \lambda_1\boldsymbol{e}_1\boldsymbol{e}_1' + \cdots + \lambda_p\boldsymbol{e}_p\boldsymbol{e}_p'$, where $(\lambda_i, \boldsymbol{e}_i)$ are the eigenvalue-eigenvector pairs of $\Sigma_{xx}$. As previously indicated, however, we hope $q$ will be much smaller than $p$. Unfortunately, most covariance matrices cannot be factored as (7.155) when $q$ is much smaller than $p$.

To motivate factor analysis, suppose the components of $\boldsymbol{x}$ can be grouped into meaningful groups. Within each group, the components are highly correlated, but the correlation between variables that are not in the same group is small. A group is supposedly formed by a single construct, represented as an unobservable factor, responsible for the high correlations within a group. For example, a person competing in a decathlon performs $p = 10$ athletic events, and we may represent the outcome of the decathlon as a $10 \times 1$ vector of scores. The events in a decathlon involve running, jumping, or throwing, and it is conceivable the $10 \times 1$ vector of scores might be able to be factored into $q = 4$ factors, (1) arm strength, (2) leg strength, (3) running speed, and (4) running endurance. The model (7.154) specifies that $\mathrm{cov}(\boldsymbol{x}, \boldsymbol{z}) = B$, or $\mathrm{cov}(x_i, z_j) = b_{ij}$, where $b_{ij}$ is the $ij$-th component of the factor loading matrix $B$, for $i = 1, \ldots, p$ and $j = 1, \ldots, q$. Thus, the elements of $B$ are used to identify which hypothetical factors the components of $\boldsymbol{x}$ belong to, or load on.

At this point, some ambiguity is still associated with the factor model. Let $Q$ be a $q \times q$ orthogonal matrix; that is, $Q'Q = QQ' = I_q$. Let $B_* = BQ$ and $\boldsymbol{z}_* = Q'\boldsymbol{z}$, so (7.154) can be written as

$$\boldsymbol{x} = B\boldsymbol{z} + \boldsymbol{\epsilon} = BQQ'\boldsymbol{z} + \boldsymbol{\epsilon} = B_*\boldsymbol{z}_* + \boldsymbol{\epsilon}. \tag{7.156}$$

The model in terms of $B_*$ and $\boldsymbol{z}_*$ fulfills all of the factor model requirements, for example, $\mathrm{cov}(\boldsymbol{z}_*) = Q'\mathrm{cov}(\boldsymbol{z})Q = QQ' = I_q$, so

$$\Sigma_{xx} = B_*\mathrm{cov}(\boldsymbol{z}_*)B_*' + D = BQQ'B' + D = BB' + D. \tag{7.157}$$

Hence, on the basis of observations on $\boldsymbol{x}$, we cannot distinguish between the loadings $B$ and the rotated loadings $B_* = BQ$. Typically, $Q$ is chosen so the matrix $B$ is easy to interpret, and this is the basis of what is called factor rotation.

Given a sample $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$, a number of methods are used to estimate the parameters of the factor model, and we discuss two of them here. The first method is the principal component method. Let $S_{xx}$ denote the sample variance-covariance matrix, and let $(\hat\lambda_i, \hat{\boldsymbol{e}}_i)$ be the eigenvalue-eigenvector pairs


of $S_{xx}$. The $p \times q$ matrix of estimated factor loadings is found by setting

$$\hat{B} = \left[\,\hat\lambda_1^{1/2}\hat{\boldsymbol{e}}_1 \;\Big|\; \hat\lambda_2^{1/2}\hat{\boldsymbol{e}}_2 \;\Big|\; \cdots \;\Big|\; \hat\lambda_q^{1/2}\hat{\boldsymbol{e}}_q\,\right]. \tag{7.158}$$

The argument here is that if $q$ factors exist, then

$$S_{xx} \approx \hat\lambda_1\hat{\boldsymbol{e}}_1\hat{\boldsymbol{e}}_1' + \cdots + \hat\lambda_q\hat{\boldsymbol{e}}_q\hat{\boldsymbol{e}}_q' = \hat{B}\hat{B}', \tag{7.159}$$

because the remaining eigenvalues, $\hat\lambda_{q+1}, \ldots, \hat\lambda_p$, will be negligible. The estimated diagonal matrix of error variances is then obtained by setting $\hat{D} = \mathrm{diag}\{\hat\delta_1^2, \ldots, \hat\delta_p^2\}$, where $\hat\delta_j^2$ is the $j$-th diagonal element of $S_{xx} - \hat{B}\hat{B}'$.
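In R, the principal component estimates (7.158)-(7.159) amount to a few lines; the sketch below (not from the text) takes a sample covariance matrix Sxx and a hypothesized number of factors q, both placeholders for whatever the application supplies.

  pc.factor <- function(Sxx, q) {
    ed <- eigen(Sxx, symmetric = TRUE)
    B  <- ed$vectors[, 1:q, drop = FALSE] %*% diag(sqrt(ed$values[1:q]), q)   # (7.158)
    D  <- diag(diag(Sxx - B %*% t(B)))        # error variances from the diagonal
    list(B = B, D = D, resid = Sxx - (B %*% t(B) + D))
  }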

The second method, which can give answers that are considerably different from the principal component method, is maximum likelihood. Upon the further assumption that, in (7.154), $\boldsymbol{z}$ and $\boldsymbol{\epsilon}$ are multivariate normal, the log likelihood of $B$ and $D$, ignoring a constant, is

$$-2\ln L(B, D) = n\ln|\Sigma_{xx}| + \sum_{j=1}^{n} \boldsymbol{x}_j'\Sigma_{xx}^{-1}\boldsymbol{x}_j. \tag{7.160}$$

The likelihood depends on $B$ and $D$ through (7.155), $\Sigma_{xx} = BB' + D$. As discussed in (7.156)-(7.157), the likelihood is not well defined because $B$ can be rotated. Typically, restricting $B'D^{-1}B$ to be a diagonal matrix is a computationally convenient uniqueness condition. The actual maximization of the likelihood is accomplished using numerical methods.

One obvious method of performing maximum likelihood for the Gaussian factor model is the EM algorithm. For example, suppose the factor vector $\boldsymbol{z}$ is known. Then, the factor model is simply the multivariate regression model given in §5.7; that is, write $X' = [\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n]$ and $Z' = [\boldsymbol{z}_1, \boldsymbol{z}_2, \ldots, \boldsymbol{z}_n]$, and note that $X$ is $n \times p$ and $Z$ is $n \times q$. Then, the MLE of $B$ is

$$\hat{B} = X'Z(Z'Z)^{-1} = \left( n^{-1}\sum_{j=1}^{n} \boldsymbol{x}_j\boldsymbol{z}_j' \right)\left( n^{-1}\sum_{j=1}^{n} \boldsymbol{z}_j\boldsymbol{z}_j' \right)^{-1} \stackrel{\mathrm{def}}{=} C_{xz}C_{zz}^{-1}, \tag{7.161}$$

and the MLE of $D$ is

$$\hat{D} = \mathrm{diag}\left\{ n^{-1}\sum_{j=1}^{n} \left(\boldsymbol{x}_j - \hat{B}\boldsymbol{z}_j\right)\left(\boldsymbol{x}_j - \hat{B}\boldsymbol{z}_j\right)' \right\}; \tag{7.162}$$

that is, only the diagonal elements of the right-hand side of (7.162) are used. The bracketed quantity in (7.162) reduces to

$$C_{xx} - C_{xz}C_{zz}^{-1}C_{xz}', \tag{7.163}$$

where $C_{xx} = n^{-1}\sum_{j=1}^{n} \boldsymbol{x}_j\boldsymbol{x}_j'$.


Based on the derivation of the EM algorithm for the state-space model, (4.66)-(4.75), we conclude that, to employ the EM algorithm here, given the current parameter estimates, in $C_{xz}$ we replace $\boldsymbol{x}_j\boldsymbol{z}_j'$ by $\boldsymbol{x}_j\hat{\boldsymbol{z}}_j'$, where $\hat{\boldsymbol{z}}_j = E(\boldsymbol{z}_j \mid \boldsymbol{x}_j)$, and in $C_{zz}$ we replace $\boldsymbol{z}_j\boldsymbol{z}_j'$ by $P_z + \hat{\boldsymbol{z}}_j\hat{\boldsymbol{z}}_j'$, where $P_z = \mathrm{var}(\boldsymbol{z}_j \mid \boldsymbol{x}_j)$. Using the fact that the $(p+q) \times 1$ vector $(\boldsymbol{x}_j', \boldsymbol{z}_j')'$ is multivariate normal with mean zero and variance-covariance matrix given by

$$\begin{pmatrix} BB' + D & B \\ B' & I_q \end{pmatrix}, \tag{7.164}$$

we have

$$\hat{\boldsymbol{z}}_j \equiv E(\boldsymbol{z}_j \mid \boldsymbol{x}_j) = B'(BB' + D)^{-1}\boldsymbol{x}_j \tag{7.165}$$

and

$$P_z \equiv \mathrm{var}(\boldsymbol{z}_j \mid \boldsymbol{x}_j) = I_q - B'(BB' + D)^{-1}B. \tag{7.166}$$
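A rough R sketch (not from the text) of one EM pass based on (7.161)-(7.166) follows; X is an n x p data matrix with rows x_j', and B and D (a p x p diagonal matrix) hold the current parameter values. All names are illustrative.

  em.step <- function(X, B, D) {
    n   <- nrow(X); q <- ncol(B)
    Sig <- B %*% t(B) + D                      # var(x_j), as in (7.164)
    W   <- t(B) %*% solve(Sig)                 # B'(BB' + D)^{-1}
    Z   <- X %*% t(W)                          # rows are E(z_j | x_j)', from (7.165)
    Pz  <- diag(q) - W %*% B                   # var(z_j | x_j), from (7.166)
    Cxz <- crossprod(X, Z) / n                 # E-step replacement in C_xz
    Czz <- Pz + crossprod(Z) / n               # E-step replacement in C_zz
    Bnew <- Cxz %*% solve(Czz)                 # M-step, (7.161)
    Cxx  <- crossprod(X) / n
    Dnew <- diag(diag(Cxx - Cxz %*% solve(Czz, t(Cxz))))   # M-step, (7.162)-(7.163)
    list(B = Bnew, D = Dnew)
  }
  # iterate em.step() until the parameter values (or the likelihood) stabilize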

For time series, suppose $\boldsymbol{x}_t$ is a stationary $p \times 1$ process with $p \times p$ spectral matrix $f_{xx}(\omega)$. Analogous to the classical model displayed in (7.155), we may postulate that, at a given frequency of interest $\omega$, the spectral matrix of $\boldsymbol{x}_t$ satisfies

$$f_{xx}(\omega) = B(\omega)B(\omega)^* + D(\omega), \tag{7.167}$$

where $B(\omega)$ is a complex-valued $p \times q$ matrix with $\mathrm{rank}\big(B(\omega)\big) = q \leq p$ and $D(\omega)$ is a real, nonnegative-definite, diagonal matrix. Typically, we expect $q$ will be much smaller than $p$.

As an example of a model that gives rise to (7.167), let $\boldsymbol{x}_t = (x_{t1}, \ldots, x_{tp})'$, and suppose

$$x_{tj} = c_j s_{t-\tau_j} + \epsilon_{tj}, \qquad j = 1, \ldots, p, \tag{7.168}$$

where $c_j \geq 0$ are individual amplitudes and $s_t$ is a common unobserved signal (factor) with spectral density $f_{ss}(\omega)$. The values $\tau_j$ are the individual phase shifts. Assume $s_t$ is independent of $\boldsymbol{\epsilon}_t = (\epsilon_{t1}, \ldots, \epsilon_{tp})'$ and the spectral matrix of $\boldsymbol{\epsilon}_t$, $D_{\epsilon\epsilon}(\omega)$, is diagonal. The DFT of $x_{tj}$ is given by

$$X_j(\omega) = n^{-1/2}\sum_{t=1}^{n} x_{tj}\exp(-2\pi i t\omega)$$

and, in terms of the model (7.168),

$$X_j(\omega) = a_j(\omega)X_s(\omega) + X_{\epsilon j}(\omega), \tag{7.169}$$

where $a_j(\omega) = c_j\exp(-2\pi i\tau_j\omega)$, and $X_s(\omega)$ and $X_{\epsilon j}(\omega)$ are the respective DFTs of the signal $s_t$ and the noise $\epsilon_{tj}$. Stacking the individual elements of (7.169), we obtain a complex version of the classical factor model with one factor,

$$\begin{pmatrix} X_1(\omega) \\ \vdots \\ X_p(\omega) \end{pmatrix} = \begin{pmatrix} a_1(\omega) \\ \vdots \\ a_p(\omega) \end{pmatrix} X_s(\omega) + \begin{pmatrix} X_{\epsilon 1}(\omega) \\ \vdots \\ X_{\epsilon p}(\omega) \end{pmatrix},$$


or more succinctly,

$$\boldsymbol{X}(\omega) = \boldsymbol{a}(\omega)X_s(\omega) + \boldsymbol{X}_\epsilon(\omega). \tag{7.170}$$

From (7.170), we can identify the spectral components of the model; that is,

$$f_{xx}(\omega) = \boldsymbol{b}(\omega)\boldsymbol{b}(\omega)^* + D_{\epsilon\epsilon}(\omega), \tag{7.171}$$

where $\boldsymbol{b}(\omega)$ is a $p \times 1$ complex-valued vector such that $\boldsymbol{b}(\omega)\boldsymbol{b}(\omega)^* = \boldsymbol{a}(\omega)f_{ss}(\omega)\boldsymbol{a}(\omega)^*$. Model (7.171) could be considered the one-factor model for time series. This model can be extended to more than one factor by adding other independent signals into the original model (7.168). More details regarding this and related models can be found in Stoffer (1999).

Example 7.15 Single Factor Analysis of the fMRI Data

The fMRI data analyzed in Example 7.14 are well suited for a single factor analysis using the model (7.168), or, equivalently, the complex-valued, single factor model (7.170). In terms of (7.168), we can think of the signal $s_t$ as representing the brush stimulus signal. As before, the frequency of interest is $\omega = 4/128$, which corresponds to a period of 32 time points, or 64 seconds.

A simple way to estimate the components $\boldsymbol{b}(\omega)$ and $D_{\epsilon\epsilon}(\omega)$, as specified in (7.171), is to use the principal components method. Let $\hat{f}_{xx}(\omega)$ denote the estimate of the spectral density of $\boldsymbol{x}_t = (x_{t1}, \ldots, x_{t8})'$ obtained in Example 7.14. Then, analogous to (7.158) and (7.159), we set

$$\hat{\boldsymbol{b}}(\omega) = \sqrt{\hat\lambda_1(\omega)}\;\hat{\boldsymbol{e}}_1(\omega),$$

where $\big(\hat\lambda_1(\omega), \hat{\boldsymbol{e}}_1(\omega)\big)$ is the first eigenvalue-eigenvector pair of $\hat{f}_{xx}(\omega)$. The diagonal elements of $\hat{D}_{\epsilon\epsilon}(\omega)$ are obtained from the diagonal elements of $\hat{f}_{xx}(\omega) - \hat{\boldsymbol{b}}(\omega)\hat{\boldsymbol{b}}(\omega)^*$. The appropriateness of the model can be assessed by checking whether the elements of the residual matrix, $\hat{f}_{xx}(\omega) - [\hat{\boldsymbol{b}}(\omega)\hat{\boldsymbol{b}}(\omega)^* + \hat{D}_{\epsilon\epsilon}(\omega)]$, are negligible in magnitude.

Concentrating on the stimulus frequency, recall $\hat\lambda_1(4/128) = 0.339$. The magnitudes of $\hat{\boldsymbol{e}}_1(4/128)$ are displayed in Table 7.8, indicating all locations load on the stimulus factor except for location 7, and location 6 could be considered borderline. The diagonal elements of $\hat{f}_{xx}(\omega) - \hat{\boldsymbol{b}}(\omega)\hat{\boldsymbol{b}}(\omega)^*$ yield

$$\hat{D}_{\epsilon\epsilon}(4/128) = 0.001 \times \mathrm{diag}\{0.27,\ 1.06,\ 0.45,\ 1.26,\ 1.64,\ 4.22,\ 4.38,\ 1.08\}.$$


The magnitudes of the elements of the residual matrix at $\omega = 4/128$ are

$$0.001 \times \begin{pmatrix}
0.00 & 0.19 & 0.14 & 0.19 & 0.49 & 0.49 & 0.65 & 0.46 \\
0.19 & 0.00 & 0.49 & 0.86 & 0.71 & 1.11 & 1.80 & 0.58 \\
0.14 & 0.49 & 0.00 & 0.62 & 0.67 & 0.65 & 0.39 & 0.22 \\
0.19 & 0.86 & 0.62 & 0.00 & 1.02 & 1.33 & 1.16 & 0.14 \\
0.49 & 0.71 & 0.67 & 1.02 & 0.00 & 0.85 & 1.11 & 0.57 \\
0.49 & 1.11 & 0.65 & 1.33 & 0.85 & 0.00 & 1.81 & 1.36 \\
0.65 & 1.80 & 0.39 & 1.16 & 1.11 & 1.81 & 0.00 & 1.57 \\
0.46 & 0.58 & 0.22 & 0.14 & 0.57 & 1.36 & 1.57 & 0.00
\end{pmatrix},$$

indicating the model fit is good.

A number of authors have considered factor analysis in the spectral domain, for example, Priestley et al. (1974), Priestley and Subba Rao (1975), Geweke (1977), and Geweke and Singleton (1981), to mention a few. An obvious extension of the simple model (7.168) is the factor model

$$\boldsymbol{x}_t = \sum_{j=-\infty}^{\infty} \Lambda_j\boldsymbol{s}_{t-j} + \boldsymbol{\epsilon}_t, \tag{7.172}$$

where $\{\Lambda_j\}$ is a real-valued $p \times q$ filter, $\boldsymbol{s}_t$ is a $q \times 1$ stationary, unobserved signal with independent components, and $\boldsymbol{\epsilon}_t$ is white noise. We assume the signal and noise processes are independent, $\boldsymbol{s}_t$ has $q \times q$ real, diagonal spectral matrix $f_{ss}(\omega) = \mathrm{diag}\{f_{s1}(\omega), \ldots, f_{sq}(\omega)\}$, and $\boldsymbol{\epsilon}_t$ has a real, diagonal, $p \times p$ spectral matrix given by $D_{\epsilon\epsilon}(\omega) = \mathrm{diag}\{f_{\epsilon 1}(\omega), \ldots, f_{\epsilon p}(\omega)\}$. If, in addition, $\sum \|\Lambda_j\| < \infty$, the spectral matrix of $\boldsymbol{x}_t$ can be written as

$$f_{xx}(\omega) = \Lambda(\omega)f_{ss}(\omega)\Lambda(\omega)^* + D_{\epsilon\epsilon}(\omega) = B(\omega)B(\omega)^* + D_{\epsilon\epsilon}(\omega), \tag{7.173}$$

where

$$\Lambda(\omega) = \sum_{t=-\infty}^{\infty} \Lambda_t\exp(-2\pi i t\omega) \tag{7.174}$$

and $B(\omega) = \Lambda(\omega)f_{ss}^{1/2}(\omega)$. Thus, by (7.173), the model (7.172) is seen to satisfy the basic requirement of the spectral domain factor analysis model; that is, the $p \times p$ spectral density matrix of the process of interest, $f_{xx}(\omega)$, is the sum of a rank $q \leq p$ matrix, $B(\omega)B(\omega)^*$, and a real, diagonal matrix, $D_{\epsilon\epsilon}(\omega)$. For the purpose of identifiability, we set $f_{ss}(\omega) = I_q$ for all $\omega$, in which case $B(\omega) = \Lambda(\omega)$. As in the classical case [see (7.157)], the model is specified only up to rotations; for details, see Bloomfield and Davis (1994).

Parameter estimation for the model (7.172), or equivalently (7.173), can be accomplished using the principal component method. Let $\hat{f}_{xx}(\omega)$ be an estimate of $f_{xx}(\omega)$, and let $\big(\hat\lambda_j(\omega), \hat{\boldsymbol{e}}_j(\omega)\big)$, for $j = 1, \ldots, p$, be the eigenvalue-eigenvector pairs, in the usual order, of $\hat{f}_{xx}(\omega)$. Then, as in the classical case, the $p \times q$ matrix $B$ is estimated by

$$\hat{B}(\omega) = \left[\,\hat\lambda_1(\omega)^{1/2}\hat{\boldsymbol{e}}_1(\omega) \;\Big|\; \hat\lambda_2(\omega)^{1/2}\hat{\boldsymbol{e}}_2(\omega) \;\Big|\; \cdots \;\Big|\; \hat\lambda_q(\omega)^{1/2}\hat{\boldsymbol{e}}_q(\omega)\,\right]. \tag{7.175}$$


The estimated diagonal spectral density matrix of errors is then obtained by setting $\hat{D}_{\epsilon\epsilon}(\omega) = \mathrm{diag}\{\hat{f}_{\epsilon 1}(\omega), \ldots, \hat{f}_{\epsilon p}(\omega)\}$, where $\hat{f}_{\epsilon j}(\omega)$ is the $j$-th diagonal element of $\hat{f}_{xx}(\omega) - \hat{B}(\omega)\hat{B}(\omega)^*$.
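The frequency-domain analogue of the earlier pc.factor() sketch is equally short; in the R sketch below (not from the text), fhat stands for an estimated p x p spectral matrix at the frequency of interest and q for the hypothesized number of factors.

  spec.factor <- function(fhat, q) {
    ed <- eigen(fhat, symmetric = TRUE)
    B  <- ed$vectors[, 1:q, drop = FALSE] %*% diag(sqrt(ed$values[1:q]), q)   # (7.175)
    D  <- diag(Re(diag(fhat - B %*% Conj(t(B)))), nrow(fhat))                 # error spectra
    list(B = B, D = D, resid = fhat - (B %*% Conj(t(B)) + D))
  }
  # with q = 1 this corresponds to the single-factor estimates used in Example 7.15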

Alternatively, we can estimate the parameters by approximate likelihood methods. As in (7.170), let $\boldsymbol{X}(\omega_j)$ denote the DFT of the data $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$ at frequency $\omega_j = j/n$. Similarly, let $\boldsymbol{X}_s(\omega_j)$ and $\boldsymbol{X}_\epsilon(\omega_j)$ be the DFTs of the signal and of the noise processes, respectively. Then, under certain conditions (see Pawitan and Shumway, 1989), for $\ell = 0, \pm 1, \ldots, \pm m$,

$$\boldsymbol{X}(\omega_j + \ell/n) = \Lambda(\omega_j)\boldsymbol{X}_s(\omega_j + \ell/n) + \boldsymbol{X}_\epsilon(\omega_j + \ell/n) + o_{as}(n^{-\alpha}), \tag{7.176}$$

where $\Lambda(\omega_j)$ is given by (7.174) and $o_{as}(n^{-\alpha}) \to 0$ almost surely for some $0 \leq \alpha < 1/2$ as $n \to \infty$. In (7.176), the $\boldsymbol{X}(\omega_j + \ell/n)$ are the DFTs of the data at the $L$ odd frequencies $\{\omega_j + \ell/n;\ \ell = 0, \pm 1, \ldots, \pm m\}$ surrounding the central frequency of interest $\omega_j = j/n$.

Under appropriate conditions, $\{\boldsymbol{X}(\omega_j + \ell/n);\ \ell = 0, \pm 1, \ldots, \pm m\}$ in (7.176) are approximately ($n \to \infty$) independent, complex Gaussian random vectors with variance-covariance matrix $f_{xx}(\omega_j)$. The approximate likelihood is given by

$$-2\ln L\big(B(\omega_j), D_{\epsilon\epsilon}(\omega_j)\big) = n\ln\big|f_{xx}(\omega_j)\big| + \sum_{\ell=-m}^{m} \boldsymbol{X}^*(\omega_j + \ell/n)\,f_{xx}^{-1}(\omega_j)\,\boldsymbol{X}(\omega_j + \ell/n), \tag{7.177}$$

with the constraint $f_{xx}(\omega_j) = B(\omega_j)B(\omega_j)^* + D_{\epsilon\epsilon}(\omega_j)$. As in the classical case, we can use various numerical methods to maximize $L\big(B(\omega_j), D_{\epsilon\epsilon}(\omega_j)\big)$ at every frequency, $\omega_j$, of interest. For example, the EM algorithm discussed for the classical case, (7.161)-(7.166), can easily be extended to this case.

Assuming $f_{ss}(\omega) = I_q$, the estimate of $B(\omega_j)$ is also the estimate of $\Lambda(\omega_j)$. Calling this estimate $\hat\Lambda(\omega_j)$, the time domain filter can be estimated by

$$\hat\Lambda_t^M = M^{-1}\sum_{j=0}^{M-1} \hat\Lambda(\omega_j)\exp(2\pi i j t/n), \tag{7.178}$$

for some $0 < M \leq n$, which is the discrete and finite version of the inversion formula given by

$$\Lambda_t = \int_{-1/2}^{1/2} \Lambda(\omega)\exp(2\pi i\omega t)\,d\omega. \tag{7.179}$$

Note that we have used this approximation earlier in Chapter 4, (4.135), for estimating the time response of a frequency response function defined over a finite number of frequencies.
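A small R sketch (not from the text) of the finite inversion (7.178) is given below; Lam is assumed to be a list of the estimated Lambda-hat(omega_j) matrices at omega_j = j/n for j = 0, ..., M-1.

  lam.t <- function(Lam, t, n) {
    M <- length(Lam)
    Reduce(`+`, lapply(0:(M - 1), function(j)
      Lam[[j + 1]] * exp(2i * pi * j * t / n))) / M
  }
  # e.g., Re(lam.t(Lam, t = 1, n = 160)) approximates the lag-one filter coefficients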


Figure 7.16 The seasonally adjusted, quarterly growth rate (as percentages) of five macroeconomic series, Unemployment, GNP, Consumption, Government Investment, and Private Investment in the U.S. between 1948 and 1988, n = 160 values.

Example 7.16 Government Spending, Private Investment, and Unemployment in the U.S.

Figure 7.16 shows the seasonally adjusted, quarterly growth rate (as percentages) of five macroeconomic series, Unemployment, GNP, Consumption, Government Investment, and Private Investment in the U.S. between 1948 and 1988, n = 160 values. These data are analyzed in the time domain by Young and Pedregal (1998), who were investigating how government spending and private capital investment influenced the rate of unemployment.

Spectral estimation was performed on the detrended, standardized, and tapered (using a cosine bell) growth rate values. Then, as described in (7.150), a set of L = 11 triangular weights, {h_0 = 6/36, h_{±1} = 5/36, h_{±2} = 4/36, h_{±3} = 3/36, h_{±4} = 2/36, h_{±5} = 1/36}, was used to smooth the 5 × 5 periodogram matrices. Figure 7.17 shows the individual estimated spectra (scaled by 1000) of each series in terms of the number of cycles. We focus on three interesting frequencies. First, we note the lack of spectral power near 40 cycles (ω = 40/160 = 1/4; one cycle every four quarters, or one year), indicating the data have been seasonally adjusted. In addition, because of the seasonal adjustment, some spectral power appears near the seasonal frequency; this is a distortion apparently caused by the method of seasonally adjusting the data. Next, we note spectral power appears near 10 cycles (ω = 10/160 = 1/16; one cycle every four years) in Unemployment, GNP, Consumption, and, to a lesser degree, in Private Investment. Finally, spectral power appears near five cycles (ω = 5/160 = 1/32; one cycle every eight years) in Government Investment, and perhaps to lesser degrees in Unemployment, GNP, and Consumption.

Figure 7.17 The individual estimated spectra (scaled by 1000) of each series shown in Figure 7.16 in terms of the number of cycles in 160 quarters.

Figure 7.18 shows the coherences among the various series. At the frequencies of interest (ω = 5/160 and 10/160), GNP, Unemployment, Consumption, and Private Investment are pairwise coherent (except for Unemployment and Private Investment). Government Investment is either not coherent or minimally coherent with the other series.

Figure 7.19 shows λ_1(ω) and λ_2(ω), the first and second eigenvalues of the estimated spectral matrix $\hat f_{xx}(\omega)$. These eigenvalues suggest the first factor is identified by the frequency of one cycle every four years, whereas the second factor is identified by the frequency of one cycle every eight years. The moduli of the corresponding eigenvectors at the frequencies of interest, e_1(10/160) and e_2(5/160), are shown in Table 7.9. These values confirm Unemployment, GNP, Consumption, and Private Investment load on the first factor, and Government Investment loads on the second factor. The remainder of the details involving the factor analysis of these data is left as an exercise.


Figure 7.18 The squared coherencies between the various series displayed in Figure 7.16 in terms of the number of cycles in 160 quarters.

Table 7.9 Magnitudes of the Eigenvectors in Example 7.16

Series           Unemp   GNP    Cons   G. Inv.   P. Inv.
|e_1(10/160)|     0.51   0.51   0.57    0.05      0.41
|e_2(5/160)|      0.17   0.03   0.39    0.87      0.27

Figure 7.19 The first, λ_1(ω), and second, λ_2(ω), eigenvalues (scaled by 1000) of the estimated spectral matrix f_xx(ω) in terms of the number of cycles in 160 quarters.

7.9 The Spectral Envelope

The concept of spectral envelope for the spectral analysis and scaling of categorical time series was first introduced in Stoffer et al. (1993). Since then, the idea has been extended in various directions (not only restricted to categorical time series), and we will explore these problems as well. First, we give a brief introduction to the concept of scaling time series.

The spectral envelope was motivated by collaborations with researchers who collected categorical-valued time series with an interest in the cyclic behavior of the data. For example, Table 7.10 shows the per-minute sleep state of an infant taken from a study on the effects of prenatal exposure to alcohol. Details can be found in Stoffer et al. (1988), but, briefly, an electroencephalographic (EEG) sleep recording of approximately two hours is obtained on a full-term infant 24 to 36 hours after birth, and the recording is scored by a pediatric neurologist for sleep state. Sleep state is categorized, per minute, into one of six possible states: qt: quiet sleep - trace alternant, qh: quiet sleep - high voltage, tr: transitional sleep, al: active sleep - low voltage, ah: active sleep - high voltage, and aw: awake. This particular infant was never awake during the study.

It is not difficult to notice a pattern in the data if we concentrate on active vs. quiet sleep (that is, focus on the first letter). But, it would be difficult to try to assess patterns in a longer sequence, or if more categories were present, without some graphical aid. One simple method would be to scale the data, that is, assign numerical values to the categories and then draw a time plot of the scaled series. Because the states have an order, one obvious scaling is

1 = qt, 2 = qh, 3 = tr, 4 = al, 5 = ah, 6 = aw,   (7.180)

and Figure 7.20 shows the time plot using this scaling. Another interesting scaling might be to combine the quiet states and the active states:

1 = qt, 1 = qh, 2 = tr, 3 = al, 3 = ah, 4 = aw.   (7.181)

Figure 7.20 Time plot of the EEG sleep state data in Table 7.10 using the scaling in (7.180).

The time plot using (7.181) would be similar to Figure 7.20 as far as the cyclic (in and out of quiet sleep) behavior of this infant's sleep pattern is concerned. Figure 7.21 shows the periodogram of the sleep data using the scaling in (7.180). A large peak exists at the frequency corresponding to one cycle every 60 minutes. As we might imagine, the general appearance of the periodogram using the scaling (7.181) (not shown) is similar to Figure 7.21. Most of us would feel comfortable with this analysis even though we made an arbitrary and ad hoc choice about the particular scaling. It is evident from the data (without any scaling) that if the interest is in infant sleep cycling, this particular sleep study indicates an infant cycles between active and quiet sleep at a rate of about one cycle per hour.

Table 7.10 Infant EEG Sleep-States (per minute)
(read down and across)

ah qt qt al tr qt al ah
ah qt qt ah tr qt al ah
ah qt tr ah tr qt al ah
ah qt al ah qh qt al ah
ah qt al ah qh qt al ah
ah tr al ah qt qt al ah
ah qt al ah qt qt al ah
ah qt al ah qt qt al ah
tr qt tr tr qt qt al tr
ah qt ah tr qt tr al
tr qt al ah qt al al
ah qt al ah qt al al
ah qt al ah qt al al
qh qt al ah qt al ah

Figure 7.21 Periodogram of the EEG sleep state data in Figure 7.20 based on the scaling in (7.180). The peak corresponds to a frequency of approximately one cycle every 60 minutes.
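A minimal R sketch of this analysis follows (not code from the text); it assumes the per-minute states of Table 7.10 are stored in a character vector named sleep (a name of our choosing) with values in {"qt","qh","tr","al","ah","aw"}.

# Apply the scaling (7.180) and compute the time plot and raw periodogram.
scale1 <- c(qt = 1, qh = 2, tr = 3, al = 4, ah = 5, aw = 6)
x <- as.numeric(scale1[sleep])           # scaled series x_t(beta) under (7.180)
plot.ts(x)                               # time plot, as in Figure 7.20
spec.pgram(x, taper = 0, log = "no")     # peak near one cycle every 60 minutes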

The intuition used in the previous example is lost when we consider a long DNA sequence. Briefly, a DNA strand can be viewed as a long string of linked nucleotides. Each nucleotide is composed of a nitrogenous base, a five carbon sugar, and a phosphate group. There are four different bases, and they can be grouped by size; the pyrimidines, thymine (T) and cytosine (C), and the purines, adenine (A) and guanine (G). The nucleotides are linked together by a backbone of alternating sugar and phosphate groups with the 5′ carbon of one sugar linked to the 3′ carbon of the next, giving the string direction. DNA molecules occur naturally as a double helix composed of polynucleotide strands with the bases facing inwards. The two strands are complementary, so it is sufficient to represent a DNA molecule by a sequence of bases on a single strand. Thus, a strand of DNA can be represented as a sequence of letters, termed base pairs (bp), from the finite alphabet {A, C, G, T}. The order of the nucleotides contains the genetic information specific to the organism. Expression of information stored in these molecules is a complex multistage process. One important task is to translate the information stored in the protein-coding sequences (CDS) of the DNA. A common problem in analyzing long DNA sequence data is in identifying CDS dispersed throughout the sequence and separated by regions of noncoding (which makes up most of the DNA). Table 7.11 shows part of the Epstein–Barr virus (EBV) DNA sequence. The entire EBV DNA sequence consists of approximately 172,000 bp.

Table 7.11 Part of the Epstein–Barr Virus DNA Sequence
(read across and down)

AGAATTCGTC TTGCTCTATT CACCCTTACT TTTCTTCTTG CCCGTTCTCT TTCTTAGTAT
GAATCCAGTA TGCCTGCCTG TAATTGTTGC GCCCTACCTC TTTTGGCTGG CGGCTATTGC
CGCCTCGTGT TTCACGGCCT CAGTTAGTAC CGTTGTGACC GCCACCGGCT TGGCCCTCTC
ACTTCTACTC TTGGCAGCAG TGGCCAGCTC ATATGCCGCT GCACAAAGGA AACTGCTGAC
ACCGGTGACA GTGCTTACTG CGGTTGTCAC TTGTGAGTAC ACACGCACCA TTTACAATGC
ATGATGTTCG TGAGATTGAT CTGTCTCTAA CAGTTCACTT CCTCTGCTTT TCTCCTCAGT
CTTTGCAATT TGCCTAACAT GGAGGATTGA GGACCCACCT TTTAATTCTC TTCTGTTTGC
ATTGCTGGCC GCAGCTGGCG GACTACAAGG CATTTACGGT TAGTGTGCCT CTGTTATGAA
ATGCAGGTTT GACTTCATAT GTATGCCTTG GCATGACGTC AACTTTACTT TTATTTCAGT
TCTGGTGATG CTTGTGCTCC TGATACTAGC GTACAGAAGG AGATGGCGCC GTTTGACTGT
TTGTGGCGGC ATCATGTTTT TGGCATGTGT ACTTGTCCTC ATCGTCGACG CTGTTTTGCA
GCTGAGTCCC CTCCTTGGAG CTGTAACTGT GGTTTCCATG ACGCTGCTGC TACTGGCTTT
CGTCCTCTGG CTCTCTTCGC CAGGGGGCCT AGGTACTCTT GGTGCAGCCC TTTTAACATT
GGCAGCAGGT AAGCCACACG TGTGACATTG CTTGCCTTTT TGCCACATGT TTTCTGGACA
CAGGACTAAC CATGCCATCT CTGATTATAG CTCTGGCACT GCTAGCGTCA CTGATTTTGG
GCACACTTAA CTTGACTACA ATGTTCCTTC TCATGCTCCT ATGGACACTT GGTAAGTTTT
CCCTTCCTTT AACTCATTAC TTGTTCTTTT GTAATCGCAG CTCTAACTTG GCATCTCTTT
TACAGTGGTT CTCCTGATTT GCTCTTCGTG CTCTTCATGT CCACTGAGCA AGATCCTTCT

We could try scaling according to the purine–pyrimidine alphabet, that is, A = G = 0 and C = T = 1, but this is not necessarily of interest for every CDS of EBV. Numerous possible alphabets of interest exist. For example, we might focus on the strong–weak hydrogen-bonding alphabet C = G = 0 and A = T = 1. Although model calculations as well as experimental data strongly agree that some kind of periodic signal exists in certain DNA sequences, a large disagreement about the exact type of periodicity exists. In addition, a disagreement exists about which nucleotide alphabets are involved in the signals.

If we consider the naive approach of arbitrarily assigning numerical values (scales) to the categories and then proceeding with a spectral analysis, the result will depend on the particular assignment of numerical values. For example, consider the artificial sequence ACGTACGTACGT. . . . Then, setting A = G = 0 and C = T = 1 yields the numerical sequence 010101010101. . . , or one cycle every two base pairs. Another interesting scaling is A = 1, C = 2, G = 3, and T = 4, which results in the sequence 123412341234. . . , or one cycle every four bp. In this example, both scalings (that is, {A, C, G, T} = {0, 1, 0, 1} and {A, C, G, T} = {1, 2, 3, 4}) of the nucleotides are interesting and bring out different properties of the sequence. Hence, we do not want to focus on only one scaling. Instead, the focus should be on finding all possible scalings that bring out all of the interesting features in the data. Rather than choose values arbitrarily, the spectral envelope approach selects scales that help emphasize any periodic feature that exists in a categorical time series of virtually any length in a quick and automated fashion. In addition, the technique can help in determining whether a sequence is merely a random assignment of categories.
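A tiny base-R illustration of the two scalings of the artificial sequence ACGTACGT. . . just described (not from the text) shows the two periodograms peaking at different cycles:

x  <- rep(c("A", "C", "G", "T"), 25)
s1 <- as.numeric(c(A = 0, C = 1, G = 0, T = 1)[x])   # {A,C,G,T} = {0,1,0,1}: one cycle every 2 bp
s2 <- as.numeric(c(A = 1, C = 2, G = 3, T = 4)[x])   # {A,C,G,T} = {1,2,3,4}: one cycle every 4 bp
spec.pgram(s1, taper = 0, log = "no")                # power concentrated at frequency 1/2
spec.pgram(s2, taper = 0, log = "no")                # power concentrated at frequency 1/4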


The Spectral Envelope for Categorical Time Series

As a general description, the spectral envelope is a frequency-based, principal components technique applied to a multivariate time series. First, we will focus on the basic concept and its use in the analysis of categorical time series. Technical details can be found in Stoffer et al. (1993).

Briefly, in establishing the spectral envelope for categorical time series, the basic question of how to efficiently discover periodic components in categorical time series was addressed. This was accomplished via nonparametric spectral analysis as follows. Let x_t, t = 0, ±1, ±2, . . . , be a categorical-valued time series with finite state-space C = {c_1, c_2, . . . , c_k}. Assume x_t is stationary and p_j = Pr{x_t = c_j} > 0 for j = 1, 2, . . . , k. For β = (β_1, β_2, . . . , β_k)′ ∈ R^k, denote by x_t(β) the real-valued stationary time series corresponding to the scaling that assigns the category c_j the numerical value β_j, j = 1, 2, . . . , k. The spectral density of x_t(β) will be denoted by f_xx(ω; β). The goal is to find scalings β, so the spectral density is in some sense interesting, and to summarize the spectral information by what is called the spectral envelope.

In particular, β is chosen to maximize the power at each frequency, ω, of interest, relative to the total power σ²(β) = var{x_t(β)}. That is, we choose β(ω), at each ω of interest, so
$$
\lambda(\omega) = \max_{\beta}\left\{\frac{f_{xx}(\omega;\beta)}{\sigma^2(\beta)}\right\}, \tag{7.182}
$$
over all β not proportional to 1_k, the k × 1 vector of ones. Note, λ(ω) is not defined if β = a1_k for a ∈ R because such a scaling corresponds to assigning each category the same value a; in this case, f_xx(ω; β) ≡ 0 and σ²(β) = 0. The optimality criterion λ(ω) possesses the desirable property of being invariant under location and scale changes of β.

As in most scaling problems for categorical data, it is useful to represent the categories in terms of the unit vectors u_1, u_2, . . . , u_k, where u_j represents the k × 1 vector with a one in the j-th row, and zeros elsewhere. We then define a k-dimensional stationary time series y_t by y_t = u_j when x_t = c_j. The time series x_t(β) can be obtained from the y_t time series by the relationship x_t(β) = β′y_t. Assume the vector process y_t has a continuous spectral density denoted by f_yy(ω). For each ω, f_yy(ω) is, of course, a k × k complex-valued Hermitian matrix. The relationship x_t(β) = β′y_t implies f_xx(ω; β) = β′f_yy(ω)β = β′f_yy^re(ω)β, where f_yy^re(ω) denotes the real part² of f_yy(ω). The imaginary part disappears from the expression because it is skew-symmetric, that is, f_yy^im(ω)′ = −f_yy^im(ω). The optimality criterion can thus be expressed as
$$
\lambda(\omega) = \max_{\beta}\left\{\frac{\beta' f^{re}_{yy}(\omega)\,\beta}{\beta' V \beta}\right\}, \tag{7.183}
$$

²In this section, it is more convenient to write complex values in the form z = z^re + i z^im, which represents a change from the notation used previously.


where V is the variance–covariance matrix of y_t. The resulting scaling β(ω) is called the optimal scaling.

The y_t process is a multivariate point process, and any particular component of y_t is the individual point process for the corresponding state (for example, the first component of y_t indicates whether the process is in state c_1 at time t). For any fixed t, y_t represents a single observation from a simple multinomial sampling scheme. It readily follows that V = D − pp′, where p = (p_1, . . . , p_k)′, and D is the k × k diagonal matrix D = diag{p_1, . . . , p_k}. Because, by assumption, p_j > 0 for j = 1, 2, . . . , k, it follows that rank(V) = k − 1 with the null space of V being spanned by 1_k. For any k × (k − 1) full rank matrix Q whose columns are linearly independent of 1_k, Q′VQ is a (k − 1) × (k − 1) positive-definite symmetric matrix.

With the matrix Q as previously defined, define λ(ω) to be the largest eigenvalue of the determinantal equation
$$
|Q' f^{re}_{yy}(\omega) Q - \lambda(\omega)\, Q' V Q| = 0,
$$
and let b(ω) ∈ R^{k−1} be any corresponding eigenvector, that is,
$$
Q' f^{re}_{yy}(\omega) Q\, b(\omega) = \lambda(\omega)\, Q' V Q\, b(\omega).
$$
The eigenvalue λ(ω) ≥ 0 does not depend on the choice of Q. Although the eigenvector b(ω) depends on the particular choice of Q, the equivalence class of scalings associated with β(ω) = Qb(ω) does not depend on Q. A convenient choice of Q is Q = [I_{k−1} | 0]′, where I_{k−1} is the (k − 1) × (k − 1) identity matrix and 0 is the (k − 1) × 1 vector of zeros. For this choice, Q′f_yy^re(ω)Q and Q′VQ are the upper (k − 1) × (k − 1) blocks of f_yy^re(ω) and V, respectively. This choice corresponds to setting the last component of β(ω) to zero.

The value λ(ω) itself has a useful interpretation; specifically, λ(ω)dω represents the largest proportion of the total power that can be attributed to the frequencies (ω, ω + dω) for any particular scaled process x_t(β), with the maximum being achieved by the scaling β(ω). Because of its central role, λ(ω) is defined to be the spectral envelope of a stationary categorical time series.

The name spectral envelope is appropriate since λ(ω) envelopes the standardized spectrum of any scaled process. That is, given any β normalized so that x_t(β) has total power one, f_xx(ω; β) ≤ λ(ω) with equality if and only if β is proportional to β(ω).

Given observations x_t, for t = 1, . . . , n, on a categorical time series, we form the multinomial point process y_t, for t = 1, . . . , n. Then, the theory for estimating the spectral density of a multivariate, real-valued time series can be applied to estimating f_yy(ω), the k × k spectral density of y_t. Given an estimate $\hat f_{yy}(\omega)$ of f_yy(ω), estimates $\hat\lambda(\omega)$ and $\hat\beta(\omega)$ of the spectral envelope, λ(ω), and the corresponding scalings, β(ω), can then be obtained. Details on estimation and inference for the sample spectral envelope and the optimal scalings can be found in Stoffer et al. (1993), but the main result of that paper is as follows:


If $\hat f_{yy}(\omega)$ is a consistent spectral estimator and if for each j = 1, . . . , J, the largest root of f_yy^re(ω_j) is distinct, then
$$
\big\{\eta_n[\hat\lambda(\omega_j) - \lambda(\omega_j)]/\lambda(\omega_j),\ \eta_n[\hat\beta(\omega_j) - \beta(\omega_j)];\ j = 1, \ldots, J\big\} \tag{7.184}
$$
converges (n → ∞) jointly in distribution to independent zero-mean normal distributions, the first of which is standard normal; the asymptotic covariance structure of $\hat\beta(\omega_j)$ is discussed in Stoffer et al. (1993). Result (7.184) is similar to (7.151), but in this case, β(ω) and $\hat\beta(\omega)$ are real. The term η_n is the same as in (7.151), and its value depends on the type of estimator being used. Based on these results, asymptotic normal confidence intervals and tests for λ(ω) can be readily constructed. Similarly, for β(ω), asymptotic confidence ellipsoids and chi-square tests can be constructed; details can be found in Stoffer et al. (1993, Theorems 3.1–3.3).

Peak searching for the smoothed spectral envelope estimate can be aided using the following approximations. Using a first-order Taylor expansion, we have
$$
\log\hat\lambda(\omega) \approx \log\lambda(\omega) + \frac{\hat\lambda(\omega) - \lambda(\omega)}{\lambda(\omega)}, \tag{7.185}
$$
so $\eta_n[\log\hat\lambda(\omega) - \log\lambda(\omega)]$ is approximately standard normal. It follows that $E[\log\hat\lambda(\omega)] \approx \log\lambda(\omega)$ and $\mathrm{var}[\log\hat\lambda(\omega)] \approx \eta_n^{-2}$. If no signal is present in a sequence of length n, we expect λ(j/n) ≈ 2/n for 1 < j < n/2, and hence approximately (1 − α) × 100% of the time, $\log\hat\lambda(\omega)$ will be less than log(2/n) + (z_α/η_n), where z_α is the (1 − α) upper tail cutoff of the standard normal distribution. Exponentiating, the α critical value for $\hat\lambda(\omega)$ becomes (2/n) exp(z_α/η_n). Useful values of z_α are z_.001 = 3.09, z_.0001 = 3.71, and z_.00001 = 4.26, and from our experience, thresholding at these levels works well.
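The threshold just described is a one-line calculation; the following small R helper (ours, not from the text) evaluates it for a given sample size and a given η_n from the estimation step.

# Approximate alpha-level threshold (2/n) exp(z_alpha / eta_n) for the
# smoothed spectral envelope; n and eta are supplied by the user.
env.threshold <- function(n, eta, alpha = 0.0001) (2 / n) * exp(qnorm(1 - alpha) / eta)
# e.g., env.threshold(n = 1000, eta = 10)  # hypothetical n and eta, for illustration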

Example 7.17 Spectral Analysis of DNA Sequences

We give explicit instructions for the calculations involved in estimating the spectral envelope of a DNA sequence, x_t, for t = 1, . . . , n, using the nucleotide alphabet; an R sketch of these steps is given at the end of the example.

• In this example, we hold the scale for T fixed at zero. In this case, we form the 3 × 1 data vectors y_t:

  y_t = (1, 0, 0)′ if x_t = A;   y_t = (0, 1, 0)′ if x_t = C;
  y_t = (0, 0, 1)′ if x_t = G;   y_t = (0, 0, 0)′ if x_t = T.

  The scaling vector is β = (β_1, β_2, β_3)′, and the scaled process is x_t(β) = β′y_t.

• Calculate the DFT of the data
$$
Y(j/n) = n^{-1/2} \sum_{t=1}^{n} y_t \exp(-2\pi i t j/n).
$$


  Note Y(j/n) is a 3 × 1 complex-valued vector. Calculate the periodogram, I(j/n) = Y(j/n)Y*(j/n), for j = 1, . . . , [n/2], and retain only the real part, say, I^re(j/n).

• Smooth the I^re(j/n) to obtain an estimate of f_yy^re(j/n). For example, using (7.150) with L = 3 and triangular weighting, we would calculate
$$
\hat f^{re}_{yy}(j/n) = \tfrac{1}{4}\, I^{re}\!\left(\tfrac{j-1}{n}\right) + \tfrac{1}{2}\, I^{re}\!\left(\tfrac{j}{n}\right) + \tfrac{1}{4}\, I^{re}\!\left(\tfrac{j+1}{n}\right).
$$

• Calculate the 3 × 3 sample variance–covariance matrix,
$$
S_{yy} = n^{-1} \sum_{t=1}^{n} (y_t - \bar y)(y_t - \bar y)',
$$
where $\bar y = n^{-1}\sum_{t=1}^{n} y_t$ is the sample mean of the data.

• For each ω_j = j/n, j = 0, 1, . . . , [n/2], determine the largest eigenvalue and the corresponding eigenvector of the matrix
$$
2 n^{-1}\, S_{yy}^{-1/2}\, \hat f^{re}_{yy}(\omega_j)\, S_{yy}^{-1/2}.
$$
Note, $S_{yy}^{1/2}$ is the unique square root matrix of S_yy.

• The sample spectral envelope $\hat\lambda(\omega_j)$ is the eigenvalue obtained in the previous step. If $\hat b(\omega_j)$ denotes the eigenvector obtained in the previous step, the optimal sample scaling is $\hat\beta(\omega_j) = S_{yy}^{-1/2}\hat b(\omega_j)$; this will result in three values, the value corresponding to the fourth category, T, being held fixed at zero.
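The steps above can be sketched in R as follows. This is not code from the text; xt is assumed to be a character vector of bases in {"A","C","G","T"}, and the Cholesky factor of S_yy is used in place of its symmetric square root (this yields the same sample spectral envelope).

spec.env.dna <- function(xt) {
  n   <- length(xt)
  Y   <- cbind(A = xt == "A", C = xt == "C", G = xt == "G") * 1   # T held at zero
  Yf  <- mvfft(Y) / sqrt(n)                       # DFTs Y(j/n), row 1 is frequency 0
  nf  <- floor(n / 2)
  Ire <- array(0, dim = c(3, 3, nf))              # real part of the periodogram matrices
  for (j in 1:nf) Ire[, , j] <- Re(Yf[j + 1, ] %o% Conj(Yf[j + 1, ]))
  h   <- c(1, 2, 1) / 4                           # triangular weights for L = 3
  S   <- var(Y)
  Sih <- solve(chol(S))                           # inverse Cholesky factor of Syy
  lam <- rep(NA, nf)
  for (j in 2:(nf - 1)) {
    fre    <- h[1] * Ire[, , j - 1] + h[2] * Ire[, , j] + h[3] * Ire[, , j + 1]
    lam[j] <- eigen((2 / n) * t(Sih) %*% fre %*% Sih, symmetric = TRUE,
                    only.values = TRUE)$values[1]
  }
  list(freq = (1:nf) / n, envelope = lam)
}

If the optimal scalings are also wanted, the eigenvectors from this Cholesky-based version would have to be back-transformed, since the text's scaling uses the symmetric square root of S_yy.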

Figure 7.22 Smoothed sample spectral envelope of the BNRF1 gene from the Epstein–Barr virus.

Figure 7.23 Smoothed sample spectral envelope of the BNRF1 gene from the Epstein–Barr virus: (a) first 1000 bp, (b) second 1000 bp, (c) third 1000 bp, and (d) last 954 bp.

Example 7.18 Dynamic Analysis of the Gene Labeled BNRF1 of the Epstein–Barr Virus

In this example, we focus on a dynamic (or sliding-window) analysis of the gene labeled BNRF1 (bp 1736-5689) of Epstein–Barr. Figure 7.22 shows the spectral envelope, using (7.150) with L = 11 and h_0 = 6/36, h_1 = 5/36, . . . , h_5 = 1/36, of the entire coding sequence (3954 bp long). The figure shows a strong signal at frequency 1/3; the corresponding optimal scaling was A = 0.04, C = 0.71, G = 0.70, T = 0, which indicates the signal is in the strong–weak bonding alphabet, S = {C, G} and W = {A, T}.

Figure 7.23 shows the result of computing the spectral envelope over three nonoverlapping 1000-bp windows and one window of 954 bp, across the CDS, namely, the first, second, third, and fourth quarters of BNRF1.

An approximate 0.0001 significance threshold is .69%. The first three quarters contain the signal at the frequency 1/3 (Figure 7.23a-c); the corresponding sample optimal scalings for the first three windows were (a) A = 0.06, C = 0.69, G = 0.72, T = 0; (b) A = 0.09, C = 0.70, G = 0.71, T = 0; (c) A = 0.18, C = 0.59, G = 0.77, T = 0. The first two windows are consistent with the overall analysis. The third section, however, shows some minor departure from the strong–weak bonding alphabet. The most interesting outcome is that the fourth window shows that no signal is present. This leads to the conjecture that the fourth quarter of BNRF1 of Epstein–Barr is actually noncoding.

The Spectral Envelope for Real-Valued Time Series

The concept of the spectral envelope for categorical time series was extended to real-valued time series, {x_t; t = 0, ±1, ±2, . . .}, in McDougall et al. (1997). The process x_t can be vector-valued, but here we will concentrate on the univariate case. Further details can be found in McDougall et al. (1997). The concept is similar to projection pursuit (Friedman and Stuetzle, 1981). Let G denote a k-dimensional vector space of continuous real-valued transformations with {g_1, . . . , g_k} being a set of basis functions satisfying E[g_i(x_t)²] < ∞, i = 1, . . . , k. Analogous to the categorical time series case, define the scaled time series with respect to the set G to be the real-valued process
$$
x_t(\beta) = \beta' y_t = \beta_1 g_1(x_t) + \cdots + \beta_k g_k(x_t)
$$
obtained from the vector process
$$
y_t = \big(g_1(x_t), \ldots, g_k(x_t)\big)',
$$
where β = (β_1, . . . , β_k)′ ∈ R^k. If the vector process, y_t, is assumed to have a continuous spectral density, say, f_yy(ω), then x_t(β) will have a continuous spectral density f_xx(ω; β) for all β ≠ 0. Noting, f_xx(ω; β) = β′f_yy(ω)β = β′f_yy^re(ω)β, and σ²(β) = var[x_t(β)] = β′Vβ, where V = var(y_t) is assumed to be positive definite, the optimality criterion
$$
\lambda(\omega) = \sup_{\beta \ne 0}\left\{\frac{\beta' f^{re}_{yy}(\omega)\,\beta}{\beta' V \beta}\right\}, \tag{7.186}
$$
is well defined and represents the largest proportion of the total power that can be attributed to the frequency ω for any particular scaled process x_t(β). This interpretation of λ(ω) is consistent with the notion of the spectral envelope introduced in the previous section and provides the following working definition: The spectral envelope of a time series with respect to the space G is defined to be λ(ω).

The solution to this problem, as in the categorical case, is attained by finding the largest scalar λ(ω) such that
$$
f^{re}_{yy}(\omega)\,\beta(\omega) = \lambda(\omega)\, V \beta(\omega) \tag{7.187}
$$
for β(ω) ≠ 0. That is, λ(ω) is the largest eigenvalue of f_yy^re(ω) in the metric of V, and the optimal scaling, β(ω), is the corresponding eigenvector.

If x_t is a categorical time series taking values in the finite state-space S = {c_1, c_2, . . . , c_k}, where c_j represents a particular category, an appropriate choice for G is the set of indicator functions g_j(x_t) = I(x_t = c_j). Hence, this is a natural generalization of the categorical case. In the categorical case, G does not consist of linearly independent g's, but it was easy to overcome this problem by reducing the dimension by one. In the vector-valued case, x_t = (x_{1t}, . . . , x_{pt})′, we consider G to be the class of transformations from R^p into R such that the spectral density of g(x_t) exists. One class of transformations of interest is linear combinations of x_t. In Tiao et al. (1993), for example, linear transformations of this type are used in a time domain approach to investigate contemporaneous relationships among the components of multivariate time series. Estimation and inference for the real-valued case are analogous to the methods described in the previous section for the categorical case. We focus on two examples here; numerous other examples can be found in McDougall et al. (1997).

Example 7.19 Residual Analysis

A relevant situation may be when x_t is the residual process obtained from some modeling procedure. If the fitted model is appropriate, the residuals should exhibit properties similar to an iid sequence. Departures of the data from the fitted model may suggest model misspecification, non-Gaussian data, or the existence of a nonlinear structure, and the spectral envelope would provide a simple diagnostic tool to aid in a residual analysis.

The series considered here is the quarterly U.S. real GNP which was analyzed in Chapter 3, Examples (3.35) and (3.36). Recall an MA(2) model was fit to the growth rate, and the residuals from this fit are plotted in Figure 3.16. As discussed in Example (3.36), the residuals from the model fit appear to be uncorrelated; there appear to be one or two outliers, but their magnitudes are not that extreme. In addition, the standard residual analyses showed no obvious structure among the residuals.

Although the MA(2) model appears to be appropriate, Tiao and Tsay (1994) investigated the possibility of nonlinearities in GNP growth rate. Their overall conclusion was that there is subtle nonlinear behavior in the data because the economy behaves differently during expansion periods than during recession periods.

Figure 7.24 Spectral envelope with respect to G = {x, |x|, x²} of the residuals from an MA(2) fit to the U.S. GNP growth rate data.

The spectral envelope, used as a diagnostic tool on the residuals, clearly indicates the MA(2) model is not adequate, and that further analysis is warranted. Here, the generating set G = {x, |x|, x²}—which seems natural for a residual analysis—was used to estimate the spectral envelope for the residuals from the MA(2) fit, and the result is plotted in Figure 7.24. A smoothed periodogram estimate was obtained using L = 21 and triangular weighting, h_0 = 11/121, h_{±1} = 10/121, . . . , h_{±10} = 1/121 in (7.150). Clearly, the residuals are not iid, and considerable power is present at the low frequencies. The presence of spectral power at very low frequencies in detrended economic series has been frequently reported and is typically associated with long-range dependence. In fact, our choice of G was partly influenced by the work of Ding et al. (1993) who applied transformations of the form |x_t|^d, for d ∈ (0, 3], to the S&P 500 stock market series. The estimated optimal transformation at the first nonzero frequency, ω = 0.006, was $\hat\beta(0.006) = (1, 20, -2916)'$, which leads to the transformation
$$
y = x + 20|x| - 2916\,x^2. \tag{7.188}
$$
This transformation is plotted in Figure 7.25. The transformation given in (7.188) is basically the absolute value (with some slight curvature and asymmetry) for most of the residual values, but the effect of extreme-valued residuals (outliers) is dampened.

Figure 7.25 Estimated optimal transformation, (7.188), for the GNP residuals at ω = 0.006.
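A sketch of the real-valued spectral envelope calculation for this example follows; the function name and the residual object are ours (hypothetical, not from the text), and the triangular weights follow (7.150) with L = 21. The Cholesky factor of V is used in place of its symmetric square root, which leaves the envelope unchanged.

spec.env.real <- function(x, L = 21) {
  n  <- length(x); m <- (L - 1) / 2
  Y  <- cbind(x, abs(x), x^2)                       # y_t = (g1(x_t), g2(x_t), g3(x_t))
  Yf <- mvfft(scale(Y, scale = FALSE)) / sqrt(n)    # DFTs of the centered basis series
  nf <- floor(n / 2)
  V  <- var(Y); Vih <- solve(chol(V))               # inverse Cholesky factor of V
  h  <- (m + 1 - abs(-m:m)); h <- h / sum(h)        # triangular weights, center 11/121
  lam <- rep(NA, nf)
  for (j in (m + 1):(nf - m)) {
    fre <- matrix(0, 3, 3)
    for (l in -m:m)
      fre <- fre + h[l + m + 1] * Re(Yf[j + l + 1, ] %o% Conj(Yf[j + l + 1, ]))
    lam[j] <- eigen((2 / n) * t(Vih) %*% fre %*% Vih, symmetric = TRUE,
                    only.values = TRUE)$values[1]
  }
  list(freq = (1:nf) / n, envelope = lam)
}
# e.g., spec.env.real(resid(ma2.fit)), where ma2.fit is the MA(2) fit (hypothetical name)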

Example 7.20 Optimal Transformations

In this example, we consider a contrived data set, in which we know the optimal transformation, say, g_0, and we determine whether the technology can find the transformation when g_0 is not in G. The data, x_t, are generated by the nonlinear model
$$
x_t = \exp\{3\sin(2\pi t\,\omega_0) + \varepsilon_t\}, \quad t = 1, \ldots, 512, \tag{7.189}
$$
where ω_0 = 51/512 and ε_t is white Gaussian noise with a variance of 16. This example is adapted from Breiman and Friedman (1985), where the ACE algorithm is introduced. The optimal transformation in this case is g_0(x_t) = ln(x_t), wherein the data are generated from a sinusoid plus noise. Of the 512 generated data, about 98% were less than 4000. Occasionally, the data values were extremely large (the data exceeded 100,000 about four times). The periodogram, in decibels [10 log_10 X(ω_j)], of the standardized and tapered (by a cosine bell) data is shown in Figure 7.26 and provides no evidence of any dominant frequency, including ω_0.

Figure 7.26 Periodogram, in decibels, of the data generated from (7.189) after tapering by a cosine bell.

Figure 7.27 Spectral envelope with respect to G = {x, √x, ∛x} of data generated from (7.189).

Figure 7.28 Log transformation, y = ln(x) (solid line), the estimated optimal transformation at ω_0 as given in (7.190) (dashed line), and the estimated optimal transformation at ω_0 using the inappropriate basis {x, x², x³} (short-dashed line).

Figure 7.29 Spectral envelope with respect to G = {x, x², x³}.
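The simulated series and the periodogram of Figure 7.26 can be reproduced with a few lines of base R (the seed is arbitrary and ours):

set.seed(101)
n  <- 512; w0 <- 51 / 512
x  <- exp(3 * sin(2 * pi * (1:n) * w0) + rnorm(n, sd = 4))       # var(eps) = 16
spec.pgram(as.vector(scale(x)), taper = 0.5, log = "dB")         # full cosine bell taper
# a basis such as {x, sqrt(x), x^(1/3)} would then be used for the spectral envelope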

In contrast, the sample spectral envelope (Figure 7.27) computed with respect to G = {x, √x, ∛x} has no difficulty in isolating ω_0. No smoothing was used here; so, based on Stoffer et al. (1993, Theorem 3.2), an approximate 0.0001 null significance threshold for the spectral envelope is 4.84% (the null hypothesis being that x_t is iid).

Figure 7.28 compares the estimated optimal transformation with respect to G with the log transformation for values less than 4000. The estimated transformation at ω_0 is given by
$$
y = -0.6 + 0.0003\,x - 0.3638\,\sqrt{x} + 1.9304\,\sqrt[3]{x}; \tag{7.190}
$$
that is, $\hat\beta(\omega_0) = (0.0003, -0.3638, 1.9304)'$ after rescaling so (7.190) can be compared directly with y = ln(x).

Finally, it is worth mentioning the result obtained when the rather inappropriate basis, {x, x², x³}, was used. Surprisingly, the spectral envelope in this case (Figure 7.29) looks similar to that of Figure 7.27. Also, the resulting estimated optimal transformation at ω_0 is close to the log transformation. In fact, as seen in Figure 7.28, it looks like what we would imagine as a linear approximation to y = ln(x) within the range of most of the data.


Problems

Section 7.2

7.1 Consider the complex Gaussian distribution for the random variable X = X_c − iX_s, as defined in (7.1)–(7.3), where the argument ω_k has been suppressed. Now, the 2p × 1 real random variable Z = (X_c′, X_s′)′ has a multivariate normal distribution with density
$$
p(Z) = (2\pi)^{-p}\,|\Sigma|^{-1/2}\exp\left\{-\tfrac{1}{2}(Z - \mu)'\Sigma^{-1}(Z - \mu)\right\},
$$
where μ = (M_c′, M_s′)′ is the mean vector. Prove
$$
|\Sigma| = \left(\tfrac{1}{2}\right)^{2p} |C - iQ|^2,
$$
using the result that the eigenvectors and eigenvalues of Σ occur in pairs, i.e., (v_c′, v_s′)′ and (v_s′, −v_c′)′, where v_c − iv_s denotes the eigenvector of f_xx. Show that
$$
\tfrac{1}{2}(Z - \mu)'\Sigma^{-1}(Z - \mu) = (X - M)^* f^{-1}(X - M)
$$
so p(X) = p(Z) and we can identify the density of the complex multivariate normal variable X with that of the real multivariate normal Z.

7.2 Prove $\hat f$ in (7.6) maximizes the log likelihood (7.5) by minimizing the negative of the log likelihood
$$
L\ln|f| + L\,\mathrm{tr}\{\hat f f^{-1}\}
$$
in the form
$$
L\sum_{i}\big(\lambda_i - \ln\lambda_i - 1\big) + Lp + L\ln|\hat f|,
$$
where the λ_i values correspond to the eigenvalues in a simultaneous diagonalization of the matrices f and $\hat f$; i.e., there exists a matrix P such that $P^* f P = I$ and $P^* \hat f P = \mathrm{diag}(\lambda_1, \ldots, \lambda_p) = \Lambda$. Note, λ_i − ln λ_i − 1 ≥ 0 with equality if and only if λ_i = 1, implying Λ = I maximizes the log likelihood and $f = \hat f$ is the maximizing value.

Section 7.3

7.3 Verify (7.19) and (7.20) for the mean-squared prediction error MSE in (7.12). Use the orthogonality principle, which implies
$$
MSE = E\left[\left(y_t - \sum_{r=-\infty}^{\infty} \beta_r' x_{t-r}\right) y_t\right]
$$
and gives a set of equations involving the autocovariance functions. Then, use the spectral representations and Fourier transform results to get the final result.

7.4 Consider the predicted series
$$
\hat y_t = \sum_{r=-\infty}^{\infty} \beta_r' x_{t-r},
$$
where β_r satisfies (7.14). Show the ordinary coherence between y_t and $\hat y_t$ is exactly the multiple coherence (7.21).

7.5 Consider the complex regression model (7.29) in the form
$$
Y = XB + V,
$$
where Y = (Y_1, Y_2, . . . , Y_L)′ denotes the observed DFTs after they have been re-indexed and X = (X_1, X_2, . . . , X_L)′ is a matrix containing the reindexed input vectors. The model is a complex regression model with Y = Y_c − iY_s, X = X_c − iX_s, B = B_c − iB_s, and V = V_c − iV_s denoting the representation in terms of the usual cosine and sine transforms. Show the partitioned real regression model involving the 2L × 1 vector of cosine and sine transforms, say,
$$
\begin{pmatrix} Y_c \\ Y_s \end{pmatrix} = \begin{pmatrix} X_c & -X_s \\ X_s & X_c \end{pmatrix}\begin{pmatrix} B_c \\ B_s \end{pmatrix} + \begin{pmatrix} V_c \\ V_s \end{pmatrix},
$$
is isomorphic to the complex regression model in the sense that the real and imaginary parts of the complex model appear as components of the vectors in the real regression model. Use the usual regression theory to verify (7.28) holds. For example, writing the real regression model as y = xb + v, the isomorphism would imply
$$
L\big(f_{yy} - f_{xy}^*\, f_{xx}^{-1} f_{xy}\big) = Y^*Y - Y^*X(X^*X)^{-1}X^*Y = y'y - y'x(x'x)^{-1}x'y.
$$

Section 7.4

7.6 Consider estimating the function
$$
\psi_t = \sum_{r=-\infty}^{\infty} a_r'\,\beta_{t-r}
$$
by a linear filter estimator of the form
$$
\hat\psi_t = \sum_{r=-\infty}^{\infty} a_r'\,\hat\beta_{t-r},
$$
where $\hat\beta_t$ is defined by (7.43). Show a sufficient condition for $\hat\psi_t$ to be an unbiased estimator, i.e., $E\hat\psi_t = \psi_t$, is
$$
H(\omega)Z(\omega) = I
$$
for all ω. Similarly, show any other unbiased estimator satisfying the above condition has minimum variance (see Shumway and Dean, 1968), so the estimator given is a best linear unbiased (BLUE) estimator.

7.7 Consider a linear model with mean value function μ_t and a signal α_t delayed by an amount τ_j on each sensor, i.e.,
$$
y_{jt} = \mu_t + \alpha_{t-\tau_j} + v_{jt}.
$$
Show the estimators (7.43) for the mean and the signal are the Fourier transforms of
$$
\hat M(\omega) = \frac{Y_\cdot(\omega) - \phi(\omega)B_w(\omega)}{1 - |\phi(\omega)|^2}
\quad\text{and}\quad
\hat A(\omega) = \frac{B_w(\omega) - \phi(\omega)Y_\cdot(\omega)}{1 - |\phi(\omega)|^2},
$$
where
$$
\phi(\omega) = \frac{1}{N}\sum_{j=1}^{N} e^{2\pi i \omega \tau_j}
$$
and B_w(ω) is defined in (7.65).

Section 7.5

7.8 Consider the estimator (7.68) as applied in the context of the random coefficient model (7.66). Prove the filter coefficients for the minimum mean square estimator can be determined from (7.69) and the mean square covariance is given by (7.72).

7.9 For the random coefficient model, verify the expected mean square of the regression power component is
$$
E[SSR(\omega_k)] = E[Y^*(\omega_k)Z(\omega_k)S_z^{-1}(\omega_k)Z^*(\omega_k)Y(\omega_k)]
 = L f_\beta(\omega_k)\,\mathrm{tr}\{S_z(\omega_k)\} + Lq f_v(\omega_k).
$$
Recall, the underlying frequency domain model is
$$
Y(\omega_k) = Z(\omega_k)B(\omega_k) + V(\omega_k),
$$
where B(ω_k) has spectrum f_β(ω_k)I_q and V(ω_k) has spectrum f_v(ω_k)I_N and the two processes are uncorrelated.

Section 7.6

7.10 Suppose we have I = 2 groups and the models
$$
y_{1jt} = \mu_t + \alpha_{1t} + v_{1jt}
$$
for the j = 1, . . . , N observations in group 1 and
$$
y_{2jt} = \mu_t + \alpha_{2t} + v_{2jt}
$$
for the j = 1, . . . , N observations in group 2, with α_{1t} + α_{2t} = 0. Suppose we want to test equality of the two group means; i.e.,
$$
y_{ijt} = \mu_t + v_{ijt}, \quad i = 1, 2.
$$
Derive the residual and error power components corresponding to (7.84) and (7.85) for this particular case.

7.11 Verify the forms of the linear compounds involving the mean given in (7.91) and (7.92), using (7.89) and (7.90).

7.12 Show the ratio of the two smoothed spectra in (7.104) has the indicated F-distribution when f_1(ω) = f_2(ω). When the spectra are not equal, show the variable is proportional to an F-distribution, where the proportionality constant depends on the ratio of the spectra.

Section 7.7

7.13 The problem of detecting a signal in noise can be considered using the model
$$
x_t = s_t + w_t, \quad t = 1, \ldots, n,
$$
for p_1(x) when a signal is present and the model
$$
x_t = w_t, \quad t = 1, \ldots, n,
$$
for p_2(x) when no signal is present. Under multivariate normality, we might specialize even further by assuming the vector w = (w_1, . . . , w_n)′ has a multivariate normal distribution with mean 0 and covariance matrix Σ = σ_w² I_n, corresponding to white noise. Assuming the signal vector s = (s_1, . . . , s_n)′ is fixed and known, show the discriminant function (7.113) becomes the matched filter
$$
\frac{1}{\sigma_w^2}\sum_{t=1}^{n} s_t x_t - \frac{1}{2}\left(\frac{S}{N}\right) + \ln\frac{\pi_1}{\pi_2},
$$
where
$$
\left(\frac{S}{N}\right) = \frac{\sum_{t=1}^{n} s_t^2}{\sigma_w^2}
$$
denotes the signal-to-noise ratio. Give the decision criterion if the prior probabilities are assumed to be the same. Express the false alarm and missed signal probabilities in terms of the normal cdf and the signal-to-noise ratio.

7.14 Assume the same additive signal plus noise representations as in the previous problem, except the signal is now a random process with a zero mean and covariance matrix σ_s² I. Derive the comparable version of (7.116) as a quadratic detector, and characterize its performance under both hypotheses in terms of constant multiples of the chi-squared distribution.

Section 7.8

7.15 The data set ch5fmri.dat contains data from other stimulus conditions in the fMRI experiment, as discussed in Example 7.14 (one location—Caudate—was left out of the analysis for brevity). Perform principal component analyses on the stimulus conditions (i) awake-heat and (ii) awake-shock, and compare your results to the results of Example 7.14.

7.16 For this problem, consider the first three earthquake series listed in eq+exp.dat.

(a) Estimate and compare the spectral density of the P component and then of the S component for each individual earthquake.

(b) Estimate and compare the squared coherency between the P and S components of each individual earthquake. Comment on the strength of the coherence.

(c) Let x_{ti} be the P component of earthquake i = 1, 2, 3, and let x_t = (x_{t1}, x_{t2}, x_{t3})′ be the 3 × 1 vector of P components. Estimate the spectral density, λ_1(ω), of the first principal component series of x_t. Compare this to the corresponding spectra calculated in (a).

(d) Analogous to part (c), let y_t denote the 3 × 1 vector series of S components of the first three earthquakes. Repeat the analysis of part (c) on y_t.


7.17 In the factor analysis model (7.155), let p = 3, q = 1, and
$$
\Sigma_{xx} = \begin{bmatrix} 1 & .4 & .9 \\ .4 & 1 & .7 \\ .9 & .7 & 1 \end{bmatrix}.
$$
Show there is a unique choice for B and D, but δ_3² < 0, so the choice is not valid.

7.18 Extend the EM algorithm for classical factor analysis, (7.161)–(7.166), to the time series case of maximizing ln L(B(ω_j), D_εε(ω_j)) in (7.177). Then, for the data used in Example 7.16, find the approximate maximum likelihood estimates of B(ω_j) and D_εε(ω_j), and, consequently, Λ_t.

Section 7.9

7.19 Verify, as stated in (7.182), the imaginary part of a k × k spectral matrix, f^im(ω), is skew symmetric, and then show β′f_yy^im(ω)β = 0 for a real k × 1 vector, β.

7.20 Repeat the analysis of Example 7.18 on BNRF1 of herpesvirus saimiri (the data file is bnrf1hvs.dat), and compare the results with the results obtained for Epstein–Barr.

7.21 For the NYSE returns, say, r_t, analyzed in Chapter 5, Example 5.4:

(a) Estimate the spectrum of the r_t. Does the spectral estimate appear to support the hypothesis that the returns are white?

(b) Examine the possibility of spectral power near the zero frequency for a transformation of the returns, say, g(r_t), using the spectral envelope with Example 7.19 as your guide. Compare the optimal transformation near or at the zero frequency with the usual transformation y_t = r_t².


Appendix A
Large Sample Theory

A.1 Convergence Modes

The study of the optimality properties of various estimators (such as the sample autocorrelation function) depends, in part, on being able to assess the large-sample behavior of these estimators. We summarize briefly here the kinds of convergence useful in this setting, namely, mean square convergence, convergence in probability, and convergence in distribution.

We consider first a particular class of random variables that plays an important role in the study of second-order time series, namely, the class of random variables belonging to the space L², satisfying E|x|² < ∞. In proving certain properties of the class L² we will often use, for random variables x, y ∈ L², the Cauchy–Schwarz inequality,
$$
|E(xy)|^2 \le E(|x|^2)E(|y|^2), \tag{A.1}
$$
and the Tchebycheff inequality,
$$
P\{|x| \ge a\} \le \frac{E(|x|^2)}{a^2}, \tag{A.2}
$$
for a > 0.

Next, we investigate the properties of mean square convergence of random variables in L².

Definition A.1 A sequence of L² random variables {x_n} is said to converge in mean square to a random variable x ∈ L², denoted by
$$
x_n \xrightarrow{ms} x, \tag{A.3}
$$
if and only if
$$
E|x_n - x|^2 \to 0 \tag{A.4}
$$
as n → ∞.


Example A.1 Mean Square Convergence of the Sample Mean

Consider the white noise sequence w_t and the signal plus noise series
$$
x_t = \mu + w_t.
$$
Then, because
$$
E|\bar x_n - \mu|^2 = \frac{\sigma_w^2}{n} \to 0
$$
as n → ∞, where $\bar x_n = n^{-1}\sum_{t=1}^{n} x_t$ is the sample mean, we have $\bar x_n \xrightarrow{ms} \mu$.

We summarize some of the properties of mean square convergence as follows. If $x_n \xrightarrow{ms} x$ and $y_n \xrightarrow{ms} y$, then, as n → ∞,

(i) E(x_n) → E(x);   (A.5)
(ii) E(|x_n|²) → E(|x|²);   (A.6)
(iii) E(x_n y_n) → E(xy).   (A.7)

We also note the L² completeness theorem known as the Riesz–Fisher Theorem.

Theorem A.1 Let {x_n} be a sequence in L². Then, there exists an x in L² such that $x_n \xrightarrow{ms} x$ if and only if
$$
E|x_n - x_m|^2 \to 0 \tag{A.8}
$$
for m, n → ∞.

Often the condition of Theorem A.1 is easier to verify to establish that a mean square limit x exists without knowing what it is. Sequences that satisfy (A.8) are said to be Cauchy sequences in L² and (A.8) is also known as the Cauchy criterion for L².

Example A.2 Time Invariant Linear Filter

As an important example of the use of the Riesz–Fisher Theorem A.1 and the properties (i), (ii), and (iii) of mean square convergent series given in (A.5)–(A.7), a time-invariant linear filter is defined as a convolution of the form
$$
y_t = \sum_{j=-\infty}^{\infty} a_j x_{t-j} \tag{A.9}
$$
for each t = 0, ±1, ±2, . . . , where x_t is a weakly stationary input series with mean μ_x and autocovariance function γ_x(h), and a_j, for j = 0, ±1, ±2, . . . , are constants satisfying
$$
\sum_{j=-\infty}^{\infty} |a_j| < \infty. \tag{A.10}
$$


The output series y_t defines a filtering or smoothing of the input series that changes the character of the time series in a predictable way. We need to know the conditions under which the outputs y_t in (A.9) and the linear process (1.31) exist.

Considering the sequence
$$
y_t^n = \sum_{j=-n}^{n} a_j x_{t-j}, \tag{A.11}
$$
n = 1, 2, . . . , we need to show first that y_t^n has a mean square limit. By Theorem A.1, it is enough to show that
$$
E|y_t^n - y_t^m|^2 \to 0
$$
as m, n → ∞. For n > m > 0,
$$
\begin{aligned}
E|y_t^n - y_t^m|^2 &= E\Big|\sum_{m<|j|\le n} a_j x_{t-j}\Big|^2 \\
&= \sum_{m<|j|\le n}\ \sum_{m<|k|\le n} a_j a_k\, E(x_{t-j} x_{t-k}) \\
&\le \sum_{m<|j|\le n}\ \sum_{m<|k|\le n} |a_j||a_k|\,|E(x_{t-j} x_{t-k})| \\
&\le \sum_{m<|j|\le n}\ \sum_{m<|k|\le n} |a_j||a_k|\,(E|x_{t-j}|^2)^{1/2}(E|x_{t-k}|^2)^{1/2} \\
&= \gamma_x(0)\Big(\sum_{m<|j|\le n} |a_j|\Big)^2 \to 0
\end{aligned}
$$
as m, n → ∞, because γ_x(0) is a constant and {a_j} is absolutely summable (the second inequality follows from the Cauchy–Schwarz inequality).

Although we know that the sequence {y_t^n} given by (A.11) converges in mean square, we have not established its mean square limit. It should be obvious, however, that $y_t^n \xrightarrow{ms} y_t$ as n → ∞, where y_t is given by (A.9).¹ Finally, we may use (A.5) and (A.7) to establish the mean, μ_y, and autocovariance function, γ_y(h), of y_t. In particular we have,
$$
\mu_y = \mu_x \sum_{j=-\infty}^{\infty} a_j, \tag{A.12}
$$

¹If S denotes the mean square limit of y_t^n, then using Fatou's Lemma, E|S − y_t|² = E lim inf_{n→∞} |S − y_t^n|² ≤ lim inf_{n→∞} E|S − y_t^n|² = 0, which establishes that y_t is the mean square limit of y_t^n.


and
$$
\begin{aligned}
\gamma_y(h) &= E\sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} a_j(x_{t+h-j} - \mu_x)\,a_k(x_{t-k} - \mu_x) \\
&= \sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} a_j\,\gamma_x(h - j + k)\,a_k. \tag{A.13}
\end{aligned}
$$
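A small numerical illustration (ours, not from the text) of the time-invariant linear filter (A.9) with a finite set of absolutely summable weights:

set.seed(1)
x <- arima.sim(list(ar = 0.9), n = 500)   # a weakly stationary input series
a <- rep(1 / 5, 5)                        # five equal weights, sum of |a_j| finite
y <- stats::filter(x, a, sides = 2)       # y_t = sum_j a_j x_{t-j}; NA at the ends
c(mean(x), mean(y, na.rm = TRUE))         # sample means consistent with (A.12)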

A second important kind of convergence is convergence in probability.

Definition A.2 The sequence {x_n}, for n = 1, 2, . . . , converges in probability to a random variable x, denoted by
$$
x_n \xrightarrow{p} x, \tag{A.14}
$$
if and only if
$$
P\{|x_n - x| > \varepsilon\} \to 0 \tag{A.15}
$$
for all ε > 0, as n → ∞.

An immediate consequence of the Tchebycheff inequality, (A.2), is that
$$
P\{|x_n - x| \ge \varepsilon\} \le \frac{E(|x_n - x|^2)}{\varepsilon^2},
$$
so convergence in mean square implies convergence in probability, i.e.,
$$
x_n \xrightarrow{ms} x \ \Rightarrow\ x_n \xrightarrow{p} x. \tag{A.16}
$$
This result implies, for example, that the filter (A.9) exists as a limit in probability because it converges in mean square [it is also easily established that (A.9) exists with probability one]. We mention, at this point, the useful Weak Law of Large Numbers which states that, for an independent identically distributed sequence x_n of random variables with mean μ, we have
$$
\bar x_n \xrightarrow{p} \mu \tag{A.17}
$$
as n → ∞, where $\bar x_n = n^{-1}\sum_{t=1}^{n} x_t$ is the usual sample mean.

We also will make use of the following concepts.

Definition A.3 For order in probability we write
$$
x_n = o_p(a_n) \tag{A.18}
$$
if and only if
$$
\frac{x_n}{a_n} \xrightarrow{p} 0. \tag{A.19}
$$
The term boundedness in probability, written x_n = O_p(a_n), means that for every ε > 0, there exists a δ(ε) > 0 such that
$$
P\left\{\left|\frac{x_n}{a_n}\right| > \delta(\varepsilon)\right\} \le \varepsilon \tag{A.20}
$$
for all n.


Under this convention, e.g., the notation for $x_n \xrightarrow{p} x$ becomes x_n − x = o_p(1). The definitions can be compared with their nonrandom counterparts, namely, for a fixed sequence, x_n = o(1) if x_n → 0 and x_n = O(1) if x_n, for n = 1, 2, . . . , is bounded. Some handy properties of o_p(·) and O_p(·) are as follows.

(i) If x_n = o_p(a_n) and y_n = o_p(b_n), then x_n y_n = o_p(a_n b_n) and x_n + y_n = o_p(max(a_n, b_n)).

(ii) If x_n = o_p(a_n) and y_n = O_p(b_n), then x_n y_n = o_p(a_n b_n).

(iii) Statement (i) is true if O_p(·) replaces o_p(·).

Example A.3 Convergence and Order in Probability for the Sample Mean

For the sample mean, $\bar x_n$, of iid random variables with mean μ and variance σ², by the Tchebycheff inequality,
$$
P\{|\bar x_n - \mu| > \varepsilon\} \le \frac{E[(\bar x_n - \mu)^2]}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0,
$$
as n → ∞. It follows that $\bar x_n \xrightarrow{p} \mu$, or $\bar x_n - \mu = o_p(1)$. To find the rate, it follows that, for δ(ε) > 0,
$$
P\{\sqrt{n}\,|\bar x_n - \mu| > \delta(\varepsilon)\} \le \frac{\sigma^2/n}{\delta^2(\varepsilon)/n} = \frac{\sigma^2}{\delta^2(\varepsilon)}
$$
by Tchebycheff's inequality, so taking ε = σ²/δ²(ε) shows that δ(ε) = σ/√ε does the job and
$$
\bar x_n - \mu = O_p(n^{-1/2}).
$$

For k × 1 random vectors x_n, convergence in probability, written $x_n \xrightarrow{p} x$ or x_n − x = o_p(1), is defined as element-by-element convergence in probability, or equivalently, as convergence in terms of the Euclidean distance
$$
\|x_n - x\| \xrightarrow{p} 0, \tag{A.21}
$$
where $\|a\|^2 = \sum_j a_j^2$ for any vector a. In this context, we note the result that if $x_n \xrightarrow{p} x$ and g(x_n) is a continuous mapping,
$$
g(x_n) \xrightarrow{p} g(x). \tag{A.22}
$$


Furthermore, if x_n − a = O_p(δ_n) with δ_n → 0 and g(·) is a function with first derivatives continuous in a neighborhood of a = (a_1, a_2, . . . , a_k)′, we have the Taylor series expansion in probability
$$
g(x_n) = g(a) + \frac{\partial g(x)}{\partial x}\bigg|_{x=a}'\,(x_n - a) + O_p(\delta_n), \tag{A.23}
$$
where
$$
\frac{\partial g(x)}{\partial x}\bigg|_{x=a} = \left(\frac{\partial g(x)}{\partial x_1}\bigg|_{x=a}, \ldots, \frac{\partial g(x)}{\partial x_k}\bigg|_{x=a}\right)'
$$
denotes the vector of partial derivatives with respect to x_1, x_2, . . . , x_k, evaluated at a. This result remains true if O_p(δ_n) is replaced everywhere by o_p(δ_n).

Example A.4 Expansion for the Logarithm of the Sample Mean

With the same conditions as Example A.3, consider $g(\bar x_n) = \log \bar x_n$, which has a derivative at μ, for μ > 0. Then, because $\bar x_n - \mu = O_p(n^{-1/2})$ from Example A.3, the conditions for the Taylor expansion in probability, (A.23), are satisfied and we have
$$
\log \bar x_n = \log \mu + \mu^{-1}(\bar x_n - \mu) + O_p(n^{-1/2}).
$$
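A quick numerical check of this expansion (ours, not from the text), using exponential draws with μ = 5:

set.seed(2)
xbar <- mean(rexp(400, rate = 1 / 5))    # sample mean, true mu = 5
c(log(xbar), log(5) + (xbar - 5) / 5)    # the difference is the O_p(n^{-1/2}) remainder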

The large sample distributions of sample mean and sample autocorrelation functions defined earlier can be developed using the notion of convergence in distribution.

Definition A.4 A sequence of k × 1 random vectors {x_n} is said to converge in distribution, written
$$
x_n \xrightarrow{d} x \tag{A.24}
$$
if and only if
$$
F_n(x) \to F(x) \tag{A.25}
$$
at the continuity points of distribution function F(·).

Example A.5 Convergence in Distribution

Consider a sequence {x_n} of random variables, where x_n is normal with mean zero and variance 1/n. Now, using the normal cdf (1.10), we have F_n(x) = Φ(√n x), so
$$
F_n(x) \to \begin{cases} 0, & x < 0, \\ 1/2, & x = 0, \\ 1, & x > 0, \end{cases}
$$
and we may take
$$
F(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0, \end{cases}
$$
because the point where the two functions differ is not a continuity point of F(x).


The distribution function relates uniquely to the characteristic function through the Fourier transform, defined as a function with vector argument λ = (λ_1, λ_2, . . . , λ_k)′, say
$$
\phi(\lambda) = E(\exp\{i\lambda' x\}) = \int \exp\{i\lambda' x\}\, dF(x). \tag{A.26}
$$
Hence, for a sequence {x_n} we may characterize convergence in distribution of F_n(·) in terms of convergence of the sequence of characteristic functions φ_n(·), i.e.,
$$
\phi_n(\lambda) \to \phi(\lambda) \ \Leftrightarrow\ F_n(x) \xrightarrow{d} F(x), \tag{A.27}
$$
where ⇔ means that the implication goes both directions. In this connection, the Cramér–Wold device says that for every c = (c_1, c_2, . . . , c_k)′,
$$
c' x_n \xrightarrow{d} c' x \ \Leftrightarrow\ x_n \xrightarrow{d} x. \tag{A.28}
$$
Also, convergence in probability implies convergence in distribution, namely,
$$
x_n \xrightarrow{p} x \ \Rightarrow\ x_n \xrightarrow{d} x, \tag{A.29}
$$
but the converse is only true when $x_n \xrightarrow{d} c$, where c is a constant vector. If $x_n \xrightarrow{d} x$ and $y_n \xrightarrow{d} c$ are two sequences of random vectors and c is a constant vector,
$$
x_n + y_n \xrightarrow{d} x + c \tag{A.30}
$$
and
$$
y_n' x_n \xrightarrow{d} c' x. \tag{A.31}
$$
For a continuous mapping h(x),
$$
x_n \xrightarrow{d} x \ \Rightarrow\ h(x_n) \xrightarrow{d} h(x). \tag{A.32}
$$

A number of results in time series depend on making a series of approximations to prove convergence in distribution. For example, we have that if $x_n \xrightarrow{d} x$ and x_n can be approximated by the sequence y_n in the sense that
$$
y_n - x_n = o_p(1), \tag{A.33}
$$
then we have that $y_n \xrightarrow{d} x$, so the approximating sequence y_n has the same limiting distribution as x. We present the following Basic Approximation Theorem (BAT) that will be used later to derive asymptotic distributions for the sample mean and ACF.

Page 520: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

508 Appendix A

Theorem A.2 Let xxxn for n = 1, 2, . . . , and yyymn for m = 1, 2, . . . , be randomk × 1 vectors such that

(i) yyymnd→ yyym as n → ∞ for each m;

(ii) yyymd→ yyy as m → ∞;

(iii) limm→∞ lim supn→∞ P{|xxxn − yyymn| > ε} = 0 for every ε > 0.

Then, xxxnd→ yyy.

As a practical matter, condition (iii) is implied by the Tchebycheff inequalityif

(iii′) E{|xxxn −yyymn|2} → 0 (A.34)

as m, n → ∞, and (iii′) is often much easier to establish than (iii).The theorem allows approximation of the underlying sequence in two steps,

through the intermediary sequence yyymn, depending on two arguments. In thetime series case, n is generally the sample length and m is generally the numberof terms in an approximation to the linear process of the form (A.11).

Proof. The proof of the theorem is a simple exercise in using the characteristicfunctions and appealing to (A.27). We need to show

|φxxxn − φyyy | → 0,

where we use the shorthand notation φ ≡ φ(λλλ) for ease. First,

|φxxxn − φyyy | ≤ |φxxxn− φyyymn

| + |φyyymn− φyyym

| + |φyyym− φyyy |. (A.35)

By the condition (ii) and (A.27), the last term converges to zero, and bycondition (i) and (A.27), the second term converges to zero and we only needconsider the first term in (A.35). Now, write∣∣φxxxn

− φyyymn

∣∣ =∣∣∣E(eiλλλ′xxxn − eiλλλ′yyymn)

∣∣∣≤ E

∣∣∣eiλλλ′xxxn(1 − eiλλλ′(yyymn−xxxn))∣∣∣

= E∣∣∣1 − eiλλλ′(yyymn−xxxn)

∣∣∣= E

{∣∣∣1 − eiλλλ′(yyymn−xxxn)∣∣∣ I{|yyymn − xxxn| < δ}

}+ E

{∣∣∣1 − eiλλλ′(yyymn−xxxn)∣∣∣ I{|yyymn − xxxn| ≥ δ}

},

where δ > 0 and I{A} denotes the indicator function of the set A. Then, givenλλλ and ε > 0, choose δ(ε) > 0 such that∣∣∣1 − eiλλλ′(yyymn−xxxn)

∣∣∣ < ε

Page 521: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Large Sample Theory 509

if |yyymn−xxxn| < δ, and the first term is less than ε, an arbitrarily small constant.For the second term, note that∣∣∣1 − eiλλλ′(yyymn−xxxn)

∣∣∣ ≤ 2

and we have

E{∣∣∣1 − eiλλλ′

(yyymn−xxxn)∣∣∣ I{|yyymn − xxxn| ≥ δ}

}≤ 2P

{|yyymn − xxxn| ≥ δ},

which converges to zero as n → ∞ by property (iii).

A.2 Central Limit Theorems

We will generally be concerned with the large-sample properties of estimatorsthat turn out to be normally distributed as n → ∞.

Definition A.5 A sequence of random variables {xn} is said to be asymp-totically normal with mean µn and variance σ2

n if, as n → ∞,

σ−1n (xn − µn) d→ z,

where z has the standard normal distribution. We shall abbreviate this as

xn ∼ AN(µn, σ2n), (A.36)

where ∼ will denote is distributed as.

We state the important Central Limit Theorem, as follows.

Theorem A.3 Let x1, . . . , xn be independent and identically distributed withmean µ and variance σ2. If xn = (x1 + · · · + xn)/n denotes the sample mean,then

xn ∼ AN(µ, σ2/n). (A.37)

Often, we will be concerned with a sequence of k × 1 vectors {xxxn}. Thefollowing definition is motivated by the Cramer–Wold device considered earlier.

Definition A.6 We define asymptotic normality for the vector case as

xxxn ∼ AN(µµµn, Σn) (A.38)

if and only ifccc′xxxn ∼ AN(ccc′µµµn, ccc′Σnccc) (A.39)

for all ccc and Σn is positive definite.

Page 522: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

510 Appendix A

In order to begin to consider what happens for dependent data in the lim-iting case, it is necessary to define, first of all, a particular kind of dependenceknown as M-dependence. We say that a time series xt is M-dependent if theset of values xs, s ≤ t is independent of the set of values xs, s ≥ t + M + 1,so time points separated by more than M units are independent. A centrallimit theorem for such dependent processes, used in conjunction with the BasicApproximation Theorem, will allow us to develop large-sample distributionalresults for the sample mean x and the sample ACF ρx(h) in the stationarycase.

In the arguments that follow, we often make use of the formula for thevariance of xn in the stationary case, namely,

var xn = n−1(n−1)∑

u=−(n−1)

(1 − |u|

n

)γ(u). (A.40)

To prove the above formula, letting u = s − t and v = t in

n2E[(xn − µ)2] =n∑

s=1

n∑t=1

E[(xs − µ)(xt − µ)]

=n∑

s=1

n∑t=1

γ(s − t)

=−1∑

u=−(n−1)

n∑v=−(u−1)

γ(u) +n−1∑u=0

n−u∑v=1

γ(u)

=−1∑

u=−(n−1)

(n + u)γ(u) +n−1∑u=0

(n − u)γ(u)

=(n−1)∑

u=−(n−1)

(n − |u|)γ(u)

gives the required result. We shall also use the fact that, for

∞∑u=−∞

|γ(u)| < ∞,

Page 523: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Large Sample Theory 511

we would have, by dominated convergence,2

n var xn →∞∑

u=−∞γ(u), (A.41)

because |(1 − |u|/n)γ(u)| ≤ |γ(u)| and (1 − |u|/n)γ(u) → γ(u). We may nowstate the M-Dependent Central Limit Theorem as follows.

Theorem A.4 If xt is a strictly stationary M-dependent sequence of randomvariables with mean zero and autocovariance function γ(·) and if

VM =M∑

u=−M

γ(u), (A.42)

where VM �= 0,xn ∼ AN(0, VM/n). (A.43)

Proof. To prove the theorem, using Theorem A.2, the Basic ApproximationTheorem, we may construct a sequence of variables ymn approximating

n1/2xn = n−1/2n∑

t=1

xt

in the dependent case and then simply verify conditions (i), (ii), and (iii) ofTheorem A.2. For m > 2M , we may first consider the approximation

ymn = n−1/2[(x1 + · · · + xm−M ) + (xm+1 + · · · + x2m−M )+ (x2m+1 + · · · + x3m−M ) + · · · + (x(r−1)m+1 + · · · + xrm−M )]

= n−1/2(z1 + z2 + · · · + zr),

where r = [n/m], with [n/m] denoting the greatest integer less than or equalto n/m. This approximation contains only part of n1/2xn, but the randomvariables z1, z2, . . . , zr are independent because they are separated by morethan M time points, e.g., m + 1 − (m − M) = M + 1 points separate z1 andz2. Because of strict stationarity, z1, z2, . . . , zr are identically distributed withzero means and variances

Sm−M =∑

|u|≤M

(m − M − |u|)γ(u)

by a computation similar to that producing (A.40). We now verify the condi-tions of the Basic Approximation Theorem hold.

2Dominated convergence technically relates to convergent sequences (with respect to asigma-additive measure µ) of measurable functions fn → f bounded by an integrable func-tion g,

∫g dµ < ∞. For such a sequence,∫

fn dµ →∫

f dµ.

For the case in point, take fn(u) = (1 − |u|/n)γ(u) for |u| < n and as zero for |u| ≥ n. Takeµ(u) = 1, u = ±1, ±2, . . . to be counting measure.

Page 524: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

512 Appendix A

(i): Applying the Central Limit Theorem to the sum ymn gives

ymn = n−1/2r∑

i=1

zi = (n/r)−1/2r−1/2r∑

i=1

zi.

Because (n/r)−1/2 → m1/2 and

r−1/2r∑

i=1

zid→ N(0, Sm−M ),

it follows from (A.31) that

ymnd→ ym ∼ N(0, Sm−M/m).

as n → ∞, for a fixed m.

(ii): Note that as m → ∞, Sm−M/m → VM using dominated convergence,where VM is defined in (A.42). Hence, the characteristic function of ym,say,

φm(λ) = exp{

−12λ2 Sm−M

m

}→ exp

{−1

2λ2 VM

},

as m → ∞, which is the characteristic function of a random variabley ∼ N(0, VM ) and the result follows because of (A.27).

(iii): To verify the last condition of the BAT theorem,

n1/2xn − ymn = n−1/2[(xm−M+1 + · · · + xm)+ (x2m−M+1 + · · · + x2m)+ (x(r−1)m−M+1 + · · · + x(r−1)m)...

+ (xrm−M+1 + · · · + xn)]= n−1/2(w1 + w2 + · · · + wr),

so the error is expressed as a scaled sum of iid variables with varianceSM for the first r − 1 variables and

var(wr) =∑

|u|≤m−M

(n − [n/m]m + M − |u|

)γ(u)

≤∑|u|≤m−M (m + M − |u|)γ(u).

Hence,var [n1/2x − ymn] = n−1[(r − 1)SM + var wr],

which converges to m−1SM as n → ∞. Because m−1SM → 0 as m → ∞,the condition of (iii) holds by the Tchebycheff inequality.

Page 525: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Large Sample Theory 513

A.3 The Mean and Autocorrelation Functions

The background material in the previous two sections can be used to developthe asymptotic properties of the sample mean and ACF used to evaluate sta-tistical significance. In particular, we are interested in verifying Property P1.1.

We begin with the distribution of the sample mean xn, noting that (A.41)suggests a form for the limiting variance. In all of the asymptotics, we will usethe assumption that xt is a linear process, as defined in Definition 1.12, butwith the added condition that {wt} is iid. That is, throughout this section weassume

xt = µx +∞∑

j=−∞ψjwt−j (A.44)

where wt ∼ iid(0, σ2w), and the coefficients satisfy

∞∑j=−∞

|ψj | < ∞. (A.45)

Before proceeding further, we should note that the exact sampling dis-tribution of xn is available if the distribution of the underlying vector xxx =(x1, x2, . . . , xn)′ is multivariate normal. Then, xn is just a linear combinationof jointly normal variables that will have the normal distribution

xn ∼ N

⎛⎝µx, n−1∑

|u|<n

(1 − |u|

n

)γx(u)

⎞⎠ , (A.46)

by (A.40). In the case where xt are not jointly normally distributed, we havethe following theorem.

Theorem A.5 If xt is a linear process of the form (A.44) and∑

j ψj �= 0,then

xn ∼ AN(µx, n−1V ), (A.47)

where

V =∞∑

h=−∞γx(h) = σ2

w

( ∞∑j=−∞

ψj

)2

(A.48)

and γx(·) is the autocovariance function of xt.

Proof. To prove the above, we can again use the Basic Approximation The-orem A.2 by first defining the strictly stationary 2m-dependent linear processwith finite limits

xmt =

m∑j=−m

ψjwt−j

Page 526: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

514 Appendix A

as an approximation to xt to use in the approximating mean

xn,m = n−1n∑

t=1

xmt .

Then, takeymn = n1/2(xn,m − µx)

as an approximation to n1/2(xn − µx).

(i): Applying Theorem A.4, we have

ymnd→ ym ∼ N(0, Vm),

as n → ∞, where

Vm =2m∑

h=−2m

γx(h) = σ2w

( m∑j=−m

ψj

)2

.

To verify the above, we note that for the general linear process withinfinite limits, (1.33) implies that

∞∑h=−∞

γx(h) = σ2w

∞∑h=−∞

∞∑j=−∞

ψj+hψj = σ2w

( ∞∑j=−∞

ψj

)2

,

so taking the special case ψj = 0, for |j| > m, we obtain Vm.

(ii): Because Vm → V in (A.48) as m → ∞, we may use the same character-istic function argument as under (ii) in the proof of Theorem A.4 to notethat

ymd→ y ∼ N(0, V ),

where V is given by (A.48).

(iii): Finally,

var[n1/2(xn − µx) − ymn

]= n var

⎡⎣n−1n∑

t=1

∑|j|>m

ψjwt−j

⎤⎦= σ2

w

⎛⎝ ∑|j|>m

ψj

⎞⎠2

→ 0

as m → ∞.

Page 527: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Large Sample Theory 515

In order to develop the sampling distribution of the sample autocovariancefunction, γx(h), and the sample autocorrelation function, ρx(h), we need todevelop some idea as to the mean and variance of γx(h) under some reasonableassumptions. These computations for γx(h) are messy, and we consider acomparable quantity

γx(h) = n−1n∑

t=1

(xt+h − µx)(xt − µx) (A.49)

as an approximation. By Problem 1.29,

n1/2[γx(h) − γx(h)] = op(1),

so that limiting distributional results proved for n1/2γx(h) will hold for n1/2γx(h)by (A.33).

We begin by proving formulas for the variance and for the limiting varianceof γx(h) under the assumptions that xt is a linear process of the form (A.44),satisfying (A.45) with the white noise variates wt having variance σ2

w as before,but also required to have fourth moments satisfying

E(w4t ) = ησ4

w < ∞, (A.50)

where η is some constant. We seek results comparable with (A.40) and (A.41)for γx(h). To ease the notation, we will henceforth drop the subscript x fromthe notation.

Using (A.49), E[γ(h)] = γ(h). Under the above assumptions, we show nowthat, for p, q = 0, 1, 2, . . .,

cov [γ(p), γ(q)] = n−1(n−1)∑

u=−(n−1)

(1 − |u|

n

)Vu, (A.51)

where

Vu = γ(u)γ(u + p − q) + γ(u + p)γ(u − q)

+ (η − 3)σ4w

∑i

ψi+u+qψi+uψi+pψi. (A.52)

The absolute summability of the ψj can then be shown to imply the absolutesummability of the Vu.3 Thus, the dominated convergence theorem implies

n cov [γ(p), γ(q)] →∞∑

u=−∞Vu

= (η − 3)γ(p)γ(q) (A.53)

+∞∑

u=−∞

[γ(u)γ(u + p − q) + γ(u + p)γ(u − q)

].

3Note:∑∞

j=−∞ |aj | < ∞ and∑∞

j=−∞ |bj| < ∞ implies∑∞

j=−∞ |ajbj | < ∞.

Page 528: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

516 Appendix A

To verify (A.51) is somewhat tedious, so we only go partially through thecalculations, leaving the repetitive details to the reader. First, rewrite (A.44)as

xt = µ +∞∑

i=−∞ψt−iwi,

so that

E[γ(p)γ(q)] = n−2∑s,t

∑i,j,k,�

ψs+p−iψs−jψt+q−k ψt−�E(wiwjwkw�).

Then, evaluate, using the easily verified properties of the wt series

E(wiwjwkw�) =

⎧⎨⎩ ησ4w if i = j = k =

σ4w if i = j �= k =

0 if i �= j, i �= k and i �= .

To apply the rules, we break the sum over the subscripts i, j, k, into fourterms, namely,∑

i,j,k,�

=∑

i=j=k=�

+∑

i=j =k=�

+∑

i=k =j=�

+∑

i=�=j=k

= S1 + S2 + S3 + S4.

Now,

S1 = ησ4w

∑i

ψs+p−iψs−iψt+q−iψt−i

= ησ4w

∑i

ψi+s−t+pψi+s−tψi+qψi,

where we have let i′ = t − i to get the final form. For the second term,

S2 =∑

i=j =k=�

ψs+p−iψs−jψt+q−kψt−�E(wiwjwkw�)

=∑i=k

ψs+p−iψs−iψt+q−kψt−kE(w2i )E(w2

k).

Then, using the fact that ∑i=k

=∑i,k

−∑i=k

,

we have

S2 = σ4w

∑i,k

ψs+p−iψs−iψt+q−kψt−k − σ4w

∑i

ψs+p−iψs−iψt+q−iψt−i

= γ(p)γ(q) − σ4w

∑i

ψi+s−t+pψi+s−tψi+qψi,

Page 529: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Large Sample Theory 517

letting i′ = s − i, k′ = t − k in the first term and i′ = s − i in the second term.Repeating the argument for S3 and S4 and substituting into the covarianceexpression yields

E[γ(p)γ(q)] = n−2∑s,t

[γ(p)γ(q) + γ(s − t)γ(s − t + p − q)

+ γ(s − t + p)γ(s − t − q)

+ (η − 3)σ4w

∑i

ψi+s−t+pψi+s−tψi+qψi

].

Then, letting u = s − t and subtracting E[γ(p)]E[γ(q)] = γ(p)γ(q) from thesummation leads to the result (A.52). Summing (A.52) over u and applyingdominated convergence leads to (A.53).

The above results for the variances and covariances of the approximatingstatistics γ(·) enable proving the following central limit theorem for the auto-covariance functions γ(·).

Theorem A.6 If xt is a stationary linear process of the form (A.44) satisfyingthe fourth moment condition (A.50), then, for fixed K,⎛⎜⎜⎝

γ(0)γ(1)

...γ(K)

⎞⎟⎟⎠ ∼ AN

⎡⎢⎢⎣⎛⎜⎜⎝

γ(0)γ(1)

...γ(K)

⎞⎟⎟⎠ , n−1V

⎤⎥⎥⎦ ,

where V is the matrix with elements given by

vpq = (η − 3)γ(p)γ(q)

+∞∑

u=−∞

[γ(u)γ(u − p + q) + γ(u + q)γ(u − p)

]. (A.54)

Proof. It suffices to show the result for the approximate autocovariance (A.49)for γ(·) by the remark given below it (see also Problem 1.29). First, define thestrictly stationary (2m + K)-dependent (K + 1) × 1 vector

yyymt =

⎛⎜⎜⎜⎝(xm

t − µ)2

(xmt+1 − µ)(xm

t − µ)...

(xmt+K − µ)(xm

t − µ)

⎞⎟⎟⎟⎠ ,

where

xmt = µ +

m∑j=−m

ψjwt−j

Page 530: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

518 Appendix A

is the usual approximation. The sample mean of the above vector is

yyymn = n−1n∑

t=1

yyymt =

⎛⎜⎜⎝γmn(0)γmn(1)

...γmn(K)

⎞⎟⎟⎠ ,

where

γmn(h) = n−1n∑

t=1

(xmt+h − µ)(xm

t − µ)

denotes the sample autocovariance of the approximating series. Also,

Eyyymt =

⎛⎜⎜⎝γm(0)γm(1)

...γm(K)

⎞⎟⎟⎠ ,

where γm(h) is the theoretical covariance function of the series xmt . Then,

consider the vectoryyymn = n1/2[yyymn − E(yyymn)]

as an approximation to

yyyn = n1/2

⎡⎢⎢⎣⎛⎜⎜⎝

γ(0)γ(1)

...γ(K)

⎞⎟⎟⎠−

⎛⎜⎜⎝γ(0)γ(1)

...γ(K)

⎞⎟⎟⎠⎤⎥⎥⎦ ,

where E(yyymn) is the same as E(yyymt ) given above. The elements of the vector

approximation yyymn are clearly n1/2(γmn(h) − γm(h)). Note that the elementsof yyyn are based on the linear process xt, whereas the elements of yyymn arebased on the m-dependent linear process xm

t . To obtain a limiting distributionfor yyyn, we apply the Basic Approximation Theorem A.2 using yyymn as ourapproximation. We now verify (i), (ii), and (iii) of Theorem A.2.

(i): First, let ccc be a (K+1)×1 vector of constants, and apply the central limittheorem to the (2m+K)−dependent series ccc′yyymn using the Cramer–Wolddevice (A.28). We obtain

ccc′yyymn = n1/2ccc′[yyymn − E(yyymn)] d→ ccc′yyym ∼ N(0, ccc′Vmccc),

as n → ∞, where Vm is a matrix containing the finite analogs of theelements vpq defined in (A.54).

(ii): Note that, since Vm → V as m → ∞, it follows that

ccc′yyymd→ ccc′yyy ∼ N(0, ccc′V ccc),

so, by the Cramer–Wold device, the limiting (K + 1) × 1 multivariatenormal variable is N(000, V ).

Page 531: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Large Sample Theory 519

(iii): To show condition (iii) of the Basic Approximation Theorem, we canfocus on the element-by-element components of

P{|yyyn − yyymn| > ε

}.

For example, using the Tchebycheff inequality, the h-th element of theprobability statement can be bounded by

nε−2var (γ(h) − γm(h))

= ε−2 {n var γ(h) + n var γm(h) − 2n cov[γ(h), γm(h)]} .

Using the results that led to (A.53), we see that the preceding expressionapproaches

(vhh + vhh − 2vhh)/ε2 = 0,

as m, n → ∞.

To obtain a result comparable to Theorem A.6 for the autocorrelation func-tion ACF, we note the following theorem.

Theorem A.7 If xt is a stationary linear process of the form (1.31) satisfyingthe fourth moment condition (A.50), then for fixed K,⎛⎜⎝ ρ(1)

...ρ(K)

⎞⎟⎠ ∼ AN

⎡⎢⎣⎛⎜⎝ ρ(1)

...ρ(K)

⎞⎟⎠ , n−1W

⎤⎥⎦ ,

where W is the matrix with elements given by

wpq =∞∑

u=−∞

[ρ(u + p)ρ(u + q) + ρ(u − p)ρ(u + q) + 2ρ(p)ρ(q)ρ2(u)

− 2ρ(p)ρ(u)ρ(u + q) − 2ρ(q)ρ(u)ρ(u + p)]

=∞∑

u=1

[ρ(u + p) + ρ(u − p) − 2ρ(p)ρ(u)]

× [ρ(u + q) + ρ(u − q) − 2ρ(q)ρ(u)], (A.55)

where the last form is more convenient.

Proof. To prove the theorem, we use the delta method4 for the limitingdistribution of a function of the form

ggg(x0, x1, . . . , xK) = (x1/x0, . . . , xK/x0)′,

4The delta method states that if a k-dimensional vector sequence xxxn ∼ AN(µµµ, a2nΣ),

with an → 0, and ggg(xxx) is an r × 1 continuously differentiable vector function of xxx, thenggg(xxxn) ∼ AN(ggg(µµµ), a2

nDΣD′) where D is the r × k matrix with elements dij = ∂gi(xxx)∂xj

∣∣µµµ

.

Page 532: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

520 Appendix A

where xh = γ(h), for h = 0, 1, . . . , K. Hence, using the delta method andTheorem A.6,

ggg (γ(0), γ(1), . . . , γ(K)) = (ρ(1), . . . , ρ(K))′

is asymptotically normal with mean vector (ρ(1), . . . , ρ(K))′ and covariancematrix

n−1W = n−1DV D′,

where V is defined by (A.54) and D is the (K + 1) × K matrix of partialderivatives

D =1x2

0

⎛⎜⎜⎝−x1 x0 0 . . . 0−x2 0 x0 . . . 0

......

.... . .

...−xK 0 0 . . . x0,

⎞⎟⎟⎠Substituting γ(h) for xh, we note that D can be written as the patternedmatrix

D =1

γ(0)( −ρρρ IK ) ,

where ρρρ = (ρ(1), ρ(2), . . . , ρ(K))′ is the K × 1 matrix of autocorrelations andIK is the K × K identity matrix. Then, it follows from writing the matrix Vin the partitioned form

V =(

v00 vvv′1

vvv1 V22

)that

W = γ−2(0)[v00ρρρρρρ

′ − ρρρvvv′1 − vvv1ρρρ

′ + V22],

where vvv1 = (v10, v20, . . . , vK0)′ and V22 = {vpq; p, q = 1, . . . , K}. Hence,

wpq = γ−2(0)[vpq − ρ(p)v0q − ρ(q)vp0 + ρ(p)ρ(q)v00

]=

∞∑u=−∞

[ρ(u)ρ(u − p + q) + ρ(u − p)ρ(u + q) + 2ρ(p)ρ(q)ρ2(u)

− 2ρ(p)ρ(u)ρ(u + q) − 2ρ(q)ρ(u)ρ(u − p)].

Interchanging the summations, we get the wpq specified in the statement ofthe theorem, finishing the proof.

Specializing the theorem to the case of interest in this chapter, we notethat if {xt} is iid with finite fourth moment, then wpq = 1 for p = q and iszero otherwise. In this case, for h = 1, . . . , K, the ρ(h) are asymptoticallyindependent and jointly normal with

ρ(h) ∼ AN(0, n−1). (A.56)

This justifies the use of (1.38) and the discussion below it as a method fortesting whether a series is white noise.

Page 533: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Large Sample Theory 521

For the cross-correlation, it has been noted that the same kind of approxi-mation holds and we quote the following theorem for the bivariate case, whichcan be proved using similar arguments (see Brockwell and Davis, 1991, p. 410).

Theorem A.8 If

xt =∞∑

j=−∞αjwt−j,1

and

yt =∞∑

j=−∞βjwt−j,2

are two linear processes of the form with absolutely summable coefficients andthe two white noise sequences are iid and independent of each other with vari-ances σ2

1 and σ22, then for h ≥ 0,

ρxy(h) ∼ AN

(ρxy(h), n−1

∑j

ρx(j)ρy(j))

(A.57)

and the joint distribution of (ρxy(h), ρxy(k))′ is asymptotically normal withmean vector zero and

cov (ρxy(h), ρxy(k)) = n−1∑

j

ρx(j)ρy(j + k − h). (A.58)

Again, specializing to the case of interest in this chapter, as long as at leastone of the two series is white (iid) noise, we obtain

ρxy(h) ∼ AN(0, n−1), (A.59)

which justifies Property P1.2.

Page 534: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Appendix BTime Domain Theory

B.1 Hilbert Spaces and the ProjectionTheorem

Most of the material on mean square estimation and regression can be em-bedded in a more general setting involving an inner product space that isalso complete (that is, satisfies the Cauchy condition). Two examples of innerproducts are E(xy∗), where the elements are random variables, and

∑xiy

∗i ,

where the elements are sequences. These examples include the possibility ofcomplex elements, in which case, ∗ denotes the conjugation. We denote aninner product, in general, by the notation 〈x, y〉. Now, define an inner productspace by its properties, namely,

(i) 〈x, y〉 = 〈y, x〉∗

(ii) 〈x + y, z〉 = 〈x, z〉 + 〈y, z〉(iii) 〈αx, y〉 = α〈x, y〉(iv) 〈x, x〉 = ‖x‖2 ≥ 0

(v) 〈x, x〉 = 0 iff x = 0.

We introduced the notation ‖ ·‖ for the norm or distance in property (iv). Thenorm satisfies the triangle inequality

‖x + y‖ ≤ ‖x‖ + ‖y‖ (B.1)

and the Cauchy–Schwarz inequality

|〈x, y〉|2 ≤ ‖x‖2‖y‖2, (B.2)

which we have seen before for random variables in (A.36). Now, a Hilbert space,H, is defined as an inner product space with the Cauchy property. In otherwords, H is a complete inner product space. This means that every Cauchysequence converges in norm; that is, xn → x ∈ H if an only if ‖xn − xm‖ → 0

522

Page 535: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Time Domain Theory 523

as m, n → ∞. This is just the L2 completeness Theorem A.1 for randomvariables.

For a broad overview of Hilbert space techniques that are useful in statisti-cal inference and in probability, see Small and McLeish (1994). Also, Brockwelland Davis (1991, Chapter 2) is a nice summary of Hilbert space techniques thatare useful in time series analysis. In our discussions, we mainly use the pro-jection theorem (Theorem B.1) and the associated orthogonality principle asa means for solving various kinds of linear estimation problems.

Theorem B.1 Let M be a closed subspace of the Hilbert space H and let y bean element in H. Then, y can be uniquely represented as

y = y + z, (B.3)

where y belongs to M and z is orthogonal to M; that is, 〈z, w〉 = 0 for all win M. Furthermore, the point y is closest to y in the sense that, for any w inM, ‖y − w‖ ≥ ‖y − y‖, where equality holds if and only if w = y.

We note that (B.3) and the statement following it yield the orthogonalityproperty

〈y − y, w〉 = 0 (B.4)

for any w belonging to M, which can sometimes be used easily to find anexpression for the projection. The norm of the error can be written as

‖y − y‖2 = 〈y − y, y − y〉= 〈y − y, y〉 − 〈y − y, y〉= 〈y − y, y〉 (B.5)

because of orthogonality.Using the notation of Theorem B.1, we call the mapping PMy = y, for

y ∈ H, the projection mapping of H onto M. In addition, the closed span of afinite set {x1, . . . , xn} of elements in a Hilbert space, H, is defined to be the setof all linear combinations w = a1x1 + · · · + anxn, where a1, . . . , an are scalars.This subspace of H is denoted by M = sp{x1, . . . , xn}. By the projectiontheorem, the projection of y ∈ H onto M = sp{x1, . . . , xn} is unique andgiven by

PMy = a1x1 + · · · + anxn,

where {a1, . . . , an} are found using the orthogonality principle

〈y − PMy, xj〉 = 0 j = 1, . . . , n.

Evidently, {a1, . . . , an} can be obtained by solvingn∑

i=1

ai〈xi, xj〉 = 〈y, xj〉 j = 1, . . . , n. (B.6)

When the elements of H are vectors, this problem is the linear regressionproblem.

Page 536: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

524 Appendix B

Example B.1 Linear Regression Analysis

For the regression model introduced in §2.2, we want to find the regressioncoefficients βi that minimize the residual sum of squares. Consider thevectors yyy = (y1, . . . , yn)′ and zzzi = (z1i, . . . , zni)′, for i = 1, . . . , q and theinner product

〈zzzi, yyy〉 =n∑

t=1

ztiyt = zzz′i yyy.

We solve the problem of finding a projection of the observed yyy on thelinear space spanned by β1zzz1 + · · ·+βqzzzq, that is, linear combinations ofthe zzzi. The orthogonality principle gives

〈yyy −q∑

i=1

βizzzi, zzzj〉 = 0

for j = 1, . . . , q. Writing the orthogonality condition, as in (B.6), invector form gives

yyy′zzzj =q∑

i=1

βizzz′izzzj j = 1, . . . , q, (B.7)

which can be written in the usual matrix form by letting Z = (zzz1, . . . , zzzq),which is assumed to be full rank. That is, (B.7) can be written as

yyy′Z = βββ′(Z ′Z), (B.8)

where βββ = (β1, . . . , βq)′. Transposing both sides of (B.8) provides thesolution for the coefficients,

βββ = (Z ′Z)−1Z ′yyy.

The mean square error in this case would be

‖yyy −q∑

i=1

βizzzi‖2 = 〈yyy −q∑

i=1

βizzzi , yyy〉

= 〈yyy, yyy〉 −q∑

i=1

βi〈zzzi , yyy〉

= yyy′yyy − βββ′Z ′yyy,

which is in agreement with §2.2.

The extra generality in the above approach hardly seems necessary in thefinite dimensional case, where differentiation works perfectly well. It is conve-nient, however, in many cases to regard the elements of H as infinite dimen-sional, so that the orthogonality principle becomes of use. For example, the

Page 537: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Time Domain Theory 525

projection of the process {xt; t = 0±1,±2, . . .} on the linear manifold spannedby all filtered convolutions of the form

xt =∞∑

k=−∞akxt−k

would be in this form.There are some useful results, that we state without proof, pertaining to

projection mappings.

Theorem B.2 Under the established notation and conditions:

(i) PM(ax + by) = aPMx + bPMy, for x, y ∈ H, where a and b are scalars.

(ii) If ||yn − y|| → 0, then PMyn → PMy, as n → ∞.

(iii) w ∈ M if and only if PMw = w. Consequently, a projection mappingcan be characterized by the property that P 2

M = PM, in the sense that,for any y ∈ H, PM(PMy) = PMy.

(iv) Let M1 and M2 be closed subspaces of H. Then, M1 ⊆ M2 if and onlyif PM1(PM2y) = PM1y for all y ∈ H.

(v) Let M be a closed subspace of H and let M⊥ denote the orthogonalcomplement of M. Then, M⊥ is also a closed subspace of H, and forany y ∈ H, y = PMy + PM⊥y.

Part (iii) of Theorem B.2 leads to the well-known result, often used inlinear models, that a square matrix M is a projection matrix if and only if it issymmetric and idempotent (that is, M2 = M). For example, using notation ofExample B.1 for linear regression, the projection of yyy onto sp{zzz1, . . . , zzzq}, thespace generated by the columns of Z, is PZ(yyy) = Zβββ = Z(Z ′Z)−1Z ′yyy. Thematrix M = Z(Z ′Z)−1Z ′ is an n × n, symmetric and idempotent matrix ofrank q (which is the dimension of the space that M projects yyy onto). Parts(iv) and (v) of Theorem B.2 are useful for establishing recursive solutions forestimation and prediction.

By imposing extra structure, conditional expectation can be defined as aprojection mapping for random variables in L2 with the equivalence relationthat, for x, y ∈ L2, x = y if Pr(x = y) = 1. In particular, for y ∈ L2, if M is aclosed subspace of L2 containing 1, the conditional expectation of y given Mis defined to be the projection of y onto M, namely, EMy = PMy. This meansthat conditional expectation, EM, must satisfy the orthogonality principle ofthe Projection Theorem and that the results of Theorem B.2 remain valid(the most widely used tool in this case is item (iv) of the theorem). If we letM(x) denote the closed subspace of all random variables in L2 that can bewritten as a (measurable) function of x, then we may define, for x, y ∈ L2,the conditional expectation of y given x as E(y|x) = EM(x)y. This idea may

Page 538: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

526 Appendix B

be generalized in an obvious way to define the conditional expectation of ygiven xxx = (x1, . . . , xn); that is E(y|xxx) = EM(xxx)y. Of particular interest tous is the following result which states that, in the Gaussian case, conditionalexpectation and linear prediction are equivalent.

Theorem B.3 Under the established notation and conditions, if (y, x1, . . . , xn)is multivariate normal, then

E(y∣∣ x1, . . . , xn) = Psp{1,x1,...,xn}y.

Proof. First, by the projection theorem, the conditional expectation of y givenxxx = {x1, . . . , xn} is the unique element EM(xxx)y that satisfies the orthogonalityprinciple,

E{(

y − EM(xxx)y)w}

= 0 for all w ∈ M(xxx).

We will show that y = Psp{1,x1,...,xn}y is that element. In fact, by the projectiontheorem, y satisfies

〈y − y, xi〉 = 0 for i = 0, 1, . . . , n,

where we have set x0 = 1. But 〈y − y, xi〉 = cov(y − y, xi) = 0, implying thaty − y and (x1, . . . , xn) are independent because the vector (y − y, x1, . . . , xn)′

is multivariate normal. Thus, if w ∈ M(xxx), then w and y − y are independentand, hence, 〈y − y, w〉 = E{(y − y)w} = E(y − y)E(w) = 0, recalling that0 = 〈y − y, 1〉 = E(y − y).

In the Gaussian case, conditional expectation has an explicit form. Let yyy =(y1, . . . , ym)′, xxx = (x1, . . . , xn)′, and suppose the (m + n) × 1 vector (yyy′, xxx′)′ ismultivariate normal. Then

E(yyy∣∣ xxx) = µµµy + ΣyxΣ−1

xx (xxx − µµµx) (B.9)

var(yyy∣∣ xxx) = Σyy − ΣyxΣ−1

xx Σxy, (B.10)

where µµµy = E(yyy) is m × 1, Σyy = var(yyy) is m × m, µµµx = E(xxx) is an n × 1vector, Σyx = Σ′

xy = cov(yyy,xxx) is m × n, and Σxx = var(xxx) is an n × n matrix,assumed to be nonsingular.

B.2 Causal Conditions for ARMA Models

In this section, we prove Property P3.1 of §3.2 pertaining to the causality ofARMA models. The proof of Property P3.2, which pertains to invertibility ofARMA models, is similar.

Proof of Property P3.1. Suppose first that the roots of φ(z), say, z1, . . . , zp,lie outside the unit circle. We write the roots in the following order, 1 < |z1| ≤|z2| ≤ · · · ≤ |zp|, noting that z1, . . . , zp are not necessarily unique, and put

Page 539: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Time Domain Theory 527

|z1| = 1 + ε, for some ε > 0. Thus, φ(z) �= 0 as long as |z| < |z1| = 1 + ε and,hence, φ−1(z) exists and has a power series expansion,

1φ(z)

=∞∑

j=0

ajzj , |z| < 1 + ε.

Now, choose a value δ such that 0 < δ < ε, and set z = 1 + δ, which is insidethe radius of convergence. It then follows that

φ−1(1 + δ) =∞∑

j=0

aj(1 + δ)j < ∞. (B.11)

Thus, we can bound each of the terms in the sum in (B.11) by constant, say,|aj(1 + δ)j | < c, for c > 0. In turn, |aj | < c(1 + δ)−j , from which it followsthat ∞∑

j=0

|aj | < ∞. (B.12)

Hence, φ−1(B) exists and we may apply it to both sides of the ARMA model,φ(B)xt = θ(B)wt, to obtain

xt = φ−1(B)φ(B)xt = φ−1(B)θ(B)wt.

Thus, putting ψ(B) = φ−1(B)θ(B), we have

xt = ψ(B)wt =∞∑

j=0

ψjwt−j ,

where the ψ-weights, which are absolutely summable, can be evaluated byψ(z) = φ−1(z)θ(z), for |z| ≤ 1.

Now, suppose xt is a causal process; that is, it has the representation

xt =∞∑

j=0

ψjwt−j ,

∞∑j=0

|ψj | < ∞.

In this case, we writext = ψ(B)wt,

and premultiplying by φ(B) yields

φ(B)xt = φ(B)ψ(B)wt. (B.13)

In addition to (B.13), the model is ARMA, and can be written as

φ(B)xt = θ(B)wt. (B.14)

Page 540: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

528 Appendix B

From (B.13) and (B.14), we see that

φ(B)ψ(B)wt = θ(B)wt. (B.15)

Now, let

a(z) = φ(z)ψ(z) =∞∑

j=0

ajzj |z| ≤ 1

and, hence, we can write (B.15) as

∞∑j=0

ajwt−j =q∑

j=0

θjwt−j . (B.16)

Next, multiply both sides of (B.16) by wt−h, for h = 0, 1, 2, . . . , and takeexpectation. In doing this, we obtain

ah = θh, h = 0, 1, . . . , q

ah = 0, h > q. (B.17)

From (B.17), we conclude that

φ(z)ψ(z) = a(z) = θ(z), |z| ≤ 1. (B.18)

If there is a complex number in the unit circle, say z0, for which φ(z0) = 0,then by (B.18), θ(z0) = 0. But, if there is such a z0, then φ(z) and θ(z) havea common factor which is not allowed. Thus, we may write ψ(z) = θ(z)/φ(z).In addition, by hypothesis, we have that |ψ(z)| < ∞ for |z| ≤ 1, and hence

|ψ(z)| =∣∣∣∣ θ(z)φ(z)

∣∣∣∣ < ∞, for |z| ≤ 1. (B.19)

Finally, (B.19) implies φ(z) �= 0 for |z| ≤ 1; that is, the roots of φ(z) lie outsidethe unit circle.

B.3 Large Sample Distribution of the AR(p)Conditional Least Squares Estimators

In §3.6 we discussed the conditional least squares procedure for estimating theparameters φ1, φ2, . . . , φp and σ2

w in the AR(p) model

xt =p∑

k=1

φkxt−k + wt,

where we assume µ = 0, for convenience. Write the model as

xt = φφφ′xxxt−1 + wt, (B.20)

Page 541: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Time Domain Theory 529

where xxxt−1 = (xt−1, xt−2, . . . , xt−p)′ is a p × 1 vector of lagged values, andφφφ = (φ1, φ2, . . . , φp)′ is the p × 1 vector of regression coefficients. Assumingobservations are available at x1, . . . , xn, the conditional least squares procedureis to minimize

Sc(φφφ) =n∑

t=p+1

(xt − φφφ′xxxt−1

)2with respect to φφφ. The solution is

φφφ =

(n∑

t=p+1

xxxt−1xxx′t−1

)−1 n∑t=p+1

xxxt−1xt (B.21)

for the regression vector φφφ; the conditional least squares estimate of σ2w is

σ2w =

1n − p

n∑t=p+1

(xt − φφφ

′xxxt−1

)2. (B.22)

As pointed out following (3.104), Yule–Walker estimators and least squaresestimators are approximately the same in that the estimators differ only byinclusion or exclusion of terms involving the endpoints of the data. Hence, it iseasy to show the asymptotic equivalence of the two estimators; this is why, forAR(p) models, (3.93) and (3.118), are equivalent. Details on the asymptoticequivalence can be found in Brockwell and Davis (1991, Chapter 8).

Here, we use the same approach as in Appendix A, replacing the lowerlimits of the sums in (B.21) and (B.22) by one and noting the asymptoticequivalence of the estimators

φφφ =

(n∑

t=1

xxxt−1xxx′t−1

)−1 n∑t=1

xxxt−1xt (B.23)

and

σ2w =

1n

n∑t=1

(xt − φφφ

′xxxt−1

)2(B.24)

to those two estimators. In (B.23) and (B.24), we are acting as if we are ableto observe x1−p, . . . , x0 in addition to x1, . . . , xn. The asymptotic equivalenceis then seen by arguing that for n sufficiently large, it makes no differencewhether or not we observe x1−p, . . . , x0. In the case of (B.23) and (B.24), weobtain the following theorem.

Theorem B.4 Let xt be a causal AR(p) series with white (iid) noise wt sat-isfying E(w4

t ) = ησ4w. Then,

φφφ ∼ AN(

φφφ, n−1σ2wΓ−1

p

), (B.25)

Page 542: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

530 Appendix B

where Γp = {γ(i − j)}pi,j=1 is the p × p autocovariance matrix of the vector

xxxt−1. We also have, as n → ∞,

n−1n∑

t=1

xxxt−1xxx′t−1

p→ Γp (B.26)

andσ2

wp→ σ2

w. (B.27)

Proof. First, (B.26) follows from the fact that E(xxxt−1xxx′t−1) = Γp, recalling

that from Theorem 1.6, second-order sample moments converge in probabilityto their population moments for linear processes in which wt has a finite fourthmoment. To show (B.25), we can write

φφφ =

(n∑

t=1

xxxt−1xxx′t−1

)−1 n∑t=1

xxxt−1(xxx′t−1φφφ + wt)

= φφφ +

(n∑

t=1

xxxt−1xxx′t−1

)−1 n∑t=1

xxxt−1wt,

so that

n1/2(φφφ − φφφ) =

(n−1

n∑t=1

xxxt−1xxx′t−1

)−1

n−1/2n∑

t=1

xxxt−1wt

=

(n−1

n∑t=1

xxxt−1xxx′t−1

)−1

n−1/2n∑

t=1

uuut,

where uuut = xxxt−1wt. We use the fact that wt and xxxt−1 are independent to writeEuuut = E(xxxt−1)E(wt) = 000, because the errors have zero means. Also,

Euuutuuu′t = Exxxt−1wtwtxxx

′t−1

= Exxxt−1xxx′t−1Ew2

t

= σ2wΓp.

In addition, we have, for h > 0,

Euuut+huuu′t = Exxxt+h−1wt+hwtxxx

′t−1

= Exxxt+h−1wtxxx′t−1Ewt+h

= 0.

A similar computation works for h < 0.Next, consider the mean square convergent approximation

xmt =

m∑j=0

ψjwt−j

Page 543: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Time Domain Theory 531

for xt, and define the (m+p)-dependent process uuumt = wt(xm

t−1, xmt−2, . . . , x

mt−p)

′.Note that we need only look at a central limit theorem for the sum

ynm = n−1/2n∑

t=1

λλλ′uuumt ,

for arbitrary vectors λλλ = (λ1, . . . , λp)′, where ynm is used as an approximationto

Sn = n−1/2n∑

t=1

λλλ′uuut.

First, apply the m-dependent central limit theorem to ynm as n → ∞ for fixedm to establish (i) of Theorem A.4. This result shows ynm

d→ ym, where ym isasymptotically normal with covariance λλλ′Γ(m)

p λλλ, where Γ(m)p is the covariance

matrix of uuumt . Then, we have Γ(m)

p → Γp, so that ym converges in distributionto a normal random variable with mean zero and variance λλλ′Γpλλλ and we haveverified part (ii) of Theorem A.4. We verify part (iii) of Theorem A.4 by notingthat

E[(Sn − ynm)2] = n−1n∑

t=1

λλλ′E[(uuut − uuumt )(uuut − uuum

t )′]λλλ

clearly converges to zero as n, m → ∞ because

xt − xmt =

∞∑j=m+1

ψjwt−j

form the components of uuut − uuumt .

Now, the form for√

n(φφφ − φφφ) contains the premultiplying matrix(n−1

n∑t=1

xxxt−1xxx′t−1

)−1p→ Γ−1

p ,

because (A.22) can be applied to the function that defines the inverse of thematrix. Then, applying (A.31), shows that

n1/2(φφφ − φφφ

)d→ N

(0, σ2

wΓ−1p ΓpΓ−1

p

),

so we may regard it as being multivariate normal with mean zero and covari-ance matrix σ2

wΓ−1p .

To investigate σ2w, note

σ2w = n−1

n∑t=1

(xt − φφφ

′xxxt−1

)2

= n−1n∑

t=1

x2t − n−1

n∑t=1

xxx′t−1xt

(n−1

n∑t=1

xxxt−1xxx′t−1

)−1

n−1n∑

t=1

xxxt−1xt

Page 544: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

532 Appendix B

p→ γ(0) − γγγ′pΓ

−1p γγγp

= σ2w,

and we have that the sample estimator converges in probability to σ2w, which

is written in the form of (3.59).

The arguments above imply that, for sufficiently large n, we may considerthe estimator φφφ in (B.21) as being approximately multivariate normal withmean φφφ and variance–covariance matrix σ2

wΓ−1p /n. Inferences about the para-

meter φφφ are obtained by replacing the σ2w and Γp by their estimates given by

(B.22) and

Γp = n−1n∑

t=p+1

xxxt−1xxx′t−1,

respectively. In the case of a nonzero mean, the data xt are replaced by xt − xin the estimates and the results of Theorem B.4 remain valid.

B.4 The Wold Decomposition

The ARMA approach to modeling time series is generally implied by the as-sumption that the dependence between adjacent values in time is best ex-plained in terms of a regression of the current values on the past values. Thisassumption is partially justified, in theory, by the Wold decomposition.

In this section we assume that {xt; t = 0,±1,±2, . . .} is a stationary,mean-zero process. Using the notation of §B.1, we define

Mn = sp{xt, −∞ < t ≤ n}, with M−∞ =∞⋂

n=−∞Mn,

andσ2 = E (xn+1 − PMn

xn+1)2.

Next, we say that {vt; t = 0,±1,±2, . . .} is a deterministic process if andonly if σ2 = 0. That is, a deterministic process is one in which its future isperfectly predictable from its past; a simple example is vt = cos(.2πt). We arenow ready to present the decomposition.

Theorem B.5 (The Wold Decomposition) Under the conditions and no-tation of this section, if σ2 > 0, then xt can be expressed as

xt =∞∑

j=0

ψjwt−j + vt

where

Page 545: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Time Domain Theory 533

• ∑∞j=0 ψ2

j < ∞ (ψ0 = 1)

• {wt} is white noise with variance σ2

• wt ∈ Mt

• E(wtvt) = 0 for all s, t = 0,±1,±2, . . . .

• vt ∈ M−∞

• {vt} is deterministic.

The proof of the decomposition follows from the theory of §B.1 by definingthe unique sequences:

(i) wt = xt − PMt−1xt,

(ii) ψj = σ−2 < xt, wt−j >= σ−2E(xtwt−j), and

(iii) vt = xt −∑∞j=0 ψjwt−j .

Although every stationary process can be represented by the Wold decom-position, it does not mean that the decomposition is the best way to describethe process. In addition, there may be some dependence structure among the{wt}; we are only guaranteed that the sequence is an uncorrelated sequence.The theorem, in its generality, falls short of our needs because we would preferthe noise process, {wt}, to be white independent noise. But, the decomposi-tion does give us the confidence that we will not be completely off the markby fitting ARMA models to time series data.

Page 546: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Appendix CSpectral Domain Theory

C.1 Spectral Representation Theorem

In this section, we present a spectral representation for the process xt itself,which allows us to think of a stationary process as a random sum of sinesand cosines as described in (4.4). In addition, we present results that justifyrepresenting the autocovariance function γx(h) of the weakly stationary processxt in terms of a non-negative spectral density function. The spectral densityfunction essentially measures the variance or power in a particular kind ofperiodic oscillation in the function. We denote this spectral density of variancefunction by f(ω), where the variance is measured as a function of the frequencyof oscillation ω, measured in cycles per unit time.

First, we consider developing a representation for the autocovariance func-tion of a stationary, possibly complex, series xt with zero mean and autocovari-ance function γx(h) = E(xt+hx∗

t ). We prove the representation for arbitrarynon-negative definite functions γ(h) and then simply note the autocovariancefunction is Hermitian non-negative definite, because, for any set of complexconstants, at, t = 0 ± 1,±2, . . ., we may write, for any finite subset,

E

∣∣∣∣∣n∑

s=1

a∗sxs

∣∣∣∣∣2

=n∑

s=1

n∑t=1

a∗sγ(s − t)at ≥ 0.

The representation is stated in terms of non-negative definite functions anda spectral distribution function F (ω) that is monotone nondecreasing, andcontinuous from the right, taking the values F (−1/2) = 0 and F (1/2) = σ2 =γx(0) at ω = −1/2 and 1/2, respectively.

Theorem C.1 A function γ(h), for h = 0 ± 1,±2, . . . is Hermitian non-negative definite if and only if it can be expressed as

γ(h) =∫ 1/2

−1/2exp{2πiωh}dF (ω) (C.1)

534

Page 547: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Spectral Domain Theory 535

where F (·) is monotone non-decreasing. The function F (·) is right continuous,bounded in [−1/2, 1/2], and uniquely determined by the conditions F (−1/2) =0, F (1/2) = γ(0).

Proof. To prove the result, note first if γ(h) has the representation above,

n∑s=1

n∑t=1

a∗sγ(s − t)at =

∫ 1/2

−1/2a∗

sγ(s − t)ate2πiω(s−t)dF (ω)

=∫ 1/2

−1/2

∣∣∣∣∣n∑

s=1

ase−2πiωs

∣∣∣∣∣2

dF (ω)

≥ 0

and γ(h) is non-negative definite. Conversely, suppose γ(h) is a non-negativedefinite function, and define the non-negative function

fn(ω) = n−1n∑

s=1

n∑t=1

e−2πiωsγ(s − t)e2πiωt

= n−1(n−1)∑

u=−(n−1)

(n − |u|)e−2πiωuγ(u) (C.2)

≥ 0.

Now, let Fn(ω) be the distribution function corresponding to fn(ω)I(−1/2,1/2],where I(·) denotes the indicator function of the interval in the subscript. Notethat Fn(ω) = 0, ω ≤ −1/2 and Fn(ω) = Fn(1/2) for ω ≥ 1/2. Then,∫ 1/2

−1/2e2πiωudFn(ω) =

∫ 1/2

−1/2e2πiωufn(ω) dω

={

(1 − |u|/n)γ(u), |u| < n0, elsewhere.

We also have

Fn(1/2) =∫ 1/2

−1/2fn(ω) dω

=∫ 1/2

−1/2

∑|u|<n

(1 − |u|/n)γ(u)e−2πiωudω

= γ(0).

Now, by Helly’s first convergence theorem (Bhat, 1985, p. 157), there exists asubsequence Fnk

converging to F , and by the Helly-Bray Lemma (see Bhat,

Page 548: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

536 Appendix C

p. 157), this implies∫ 1/2

−1/2e2πiωudFnk

(ω) →∫ 1/2

−1/2e2πiωudF (ω)

and, from the right-hand side of the earlier equation,

(1 − |u|/nk)γ(u) → γ(u)

as nk → ∞, and the required result follows.

Next, present the version of the Spectral Representation Theorem in termsof a mean-zero, stationary process, xt. We refer the reader to Hannan (1970,§2.3) for details. This version allows us to think of a stationary process asbeing generated (approximately) by a random sum of sines and cosines suchas described in (4.4).

Theorem C.2 If xt is a mean-zero stationary process, with spectral distribu-tion F (ω) as given in Theorem C.1, then there exists a complex-valued stochas-tic process z(ω), on the interval ω ∈ [−1/2, 1/2], having stationary uncorrelatedincrements, such that xt can be written as the stochastic integral

xt =∫ 1/2

−1/2exp(−2πitω)dz(ω)

where, for −1/2 ≤ ω1 ≤ ω2 ≤ 1/2,

var {z(ω2) − z(ω1)} = F (ω2) − F (ω1).

An uncorrelated increment process such as z(ω) is a mean-zero, finite vari-ance, continuous-time stochastic process for which events occurring in non-overlapping intervals are uncorrelated. The integral in this representation is astochastic integral. To understand its meaning, let ω0, ω1, . . . , ωn be a partitionof the interval [−1/2, 1/2]. Define

In =n∑

j=1

exp(−2πitωj)[z(ωj) − z(ωj−1)].

Then, assuming it exists, I =∫ 1/2

−1/2 exp(−2πitωj)dz(ω) is defined to be themean square limit of In as n → ∞. Theorem C.2 allows us to think of astationary process, approximately, as the random superposition of sines andcosines.

In general, the spectral distribution function can be a mixture of discreteand continuous distributions. The special case of greatest interest is the ab-solutely continuous case, namely, when dF (ω) = f(ω)dω, and the resulting

Page 549: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Spectral Domain Theory 537

function is the spectral density considered in §4.3. What made the proof ofTheorem C.1 difficult was that, after we defined

fn(ω) =(n−1)∑

h=−(n−1)

(1 − |h|

n

)γ(h)e−2πiωh

in (C.2), we could not simply allow n → ∞ because γ(h) may not be absolutelysummable. If, however, γ(h) is absolutely summable we may define f(ω) =limn→∞ fn(ω), and we have the following result.

Theorem C.3 If γ(h) is the autocovariance function of a stationary process,xt, with

∞∑h=−∞

|γ(h)| < ∞, (C.3)

then the spectral density of xt is given by

f(ω) =∞∑

h=−∞γ(h)e−2πiωh. (C.4)

We may extend the representation to the vector case xxxt = (xt1, . . . , xtp)′

by considering linear combinations of the form

yt =p∑

j=1

a∗jxtj ,

which will be stationary with autocovariance functions of the form

γy(h) =p∑

j=1

p∑k=1

a∗jγjk(h)ak,

where γjk(h) is the usual cross-covariance function between xtj and xtk. Todevelop the spectral representation of γjk(h) from the representations of theunivariate series, consider the linear combinations

yt1 = xtj + xtk

andyt2 = xtj + ixtk,

which are both stationary series with respective representations

γy1(h) = γj(h) + γjk(h) + γkj(h) + γk(h)

=∫ 1/2

−1/2e2πiωhdG1(ω)

Page 550: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

538 Appendix C

and

γy2(h) = γj(h) + iγkj(h) − iγjk(h) + γk(h)

=∫ 1/2

−1/2e2πiωhdG2(ω).

Introducing the spectral representations for γj(h) and γk(h) yields

γjk(h) =∫ 1/2

−1/2e2πiωhdFjk(ω),

withFjk(ω) =

12[G1(ω) + iG2(ω) − (1 + i)(Fj(ω) + Fk(ω)].

Now, under the summability condition∞∑

h=−∞|γjk(h)| < ∞,

we have the representation

γjk(h) =∫ 1/2

−1/2e2πiωhfjk(ω)dω,

where the cross-spectral density function has the inverse Fourier representation

fjk(ω) =∞∑

h=−∞γjk(h)e−2πiωh.

The cross-covariance function satisfies γjk(h) = γkj(−h), which implies fjk(ω) =fkj(−ω) using the above representation.

Then, defining the autocovariance function of the general vector process xxxt

as the p × p matrix

Γ(h) = E[(xxxt+h − µµµx)(xxxt − µµµx)′],

and the p × p spectral matrix as f(ω) = {fjk(ω), j, k = 1, . . . , p}, we have therepresentation in matrix form, written as

Γ(h) =∫ 1/2

−1/2e2πiωhf(ω) dω, (C.5)

and the inverse result

f(ω) =∞∑

h=−∞Γ(h)e−2πiωh. (C.6)

which appears as Property P4.3 in §4.6. Theorem C.2 can also be extended tothe multivariate case.

Page 551: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Spectral Domain Theory 539

C.2 Large Sample Distribution of the DFT andSmoothed Periodogram

We have previously introduced the DFT, for the stationary zero-mean processxt, observed at t = 1, . . . , n as

d(ω) = n−1/2n∑

t=1

xt e−2πiωt, (C.7)

as the result of matching sines and cosines of frequency ω against the seriesxt. We will suppose now that xt has an absolutely continuous spectrum f(ω)corresponding to the absolutely summable autocovariance function γ(h). Ourpurpose in this section is to examine the statistical properties of the complexrandom variables d(ωk), for ωk = k/n, k = 0, 1, . . . , n − 1 in providing a basisfor the estimation of f(ω). To develop the statistical properties, we examinethe behavior of

Sn(ω, ω) = E{

|d(ω)|2}

= n−1E

[ n∑s=1

xs e−2πiωsn∑

t=1

xte2πiωt

]

= n−1n∑

s=1

n∑t=1

e−2πiωse2πiωtγ(s − t)

=(n−1)∑

u=−(n−1)

(1 − |u|/n)γ(u)e−2πiωu, (C.8)

where we have let u = s − t. Using dominated convergence,

Sn(ω, ω) →∞∑

u=−∞γ(u)e−2πiωu = f(ω),

as n → ∞, making the large sample variance of the Fourier transform equal tothe spectrum evaluated at ω. We have already seen this result in Theorem C.3.For exact bounds it is also convenient to add an absolute summability assump-tion for the autocovariance function, namely,

θ =∞∑

h=−∞|h||γ(h)| < ∞. (C.9)

Page 552: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

540 Appendix C

Example C.1 Condition (C.9) Verified for an AR(1)

We may write the condition for an AR(1) series. xt = φxt−1 + wt, as

θ =σ2

w

1 − φ2

∞∑h=−∞

|h|φ|h|

being finite. Note the condition is equivalent to summability of

∞∑h=1

hφh = φ

∞∑h=1

hφh−1

= φ∂

∂φ

∞∑h=1

φh

(1 − φ)2,

and hence,

θ =2σ2

(1 − φ)3(1 + φ)< ∞.

To elaborate further, we derive two approximation lemmas.

Lemma C.1 For Sn(ω, ω) as defined in (C.8) and θ in (C.9) finite, we have

|Sn(ω, ω) − f(ω)| ≤ θ

n(C.10)

orSn(ω, ω) = f(ω) + O(n−1). (C.11)

Proof. To prove the lemma, write

n|Sn(ω, ω) − fx(ω)| =

∣∣∣∣∣∣∑

|u|<n

(n − |u|)γ(u)e−2πiωu − n

∞∑u=−∞

γ(u)e−2πiωu

∣∣∣∣∣∣=

∣∣∣∣∣∣−n∑

|u|≥n

γ(u)e−2πiωu −∑

|u|<n

|u|γ(u)e−2πiωu

∣∣∣∣∣∣≤

∑|u|≥n

|u||γ(u)| +∑

|u|<n

|u||γ(u)|

= θ,

which establishes the lemma.

Page 553: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Spectral Domain Theory 541

Lemma C.2 For ωk = k/n, ω� = /n, ωk − ω� �= 0,±1,±2,±3, . . ., and θ in(C.9), we have

|Sn(ωk, ω�)| ≤ θ

n= O(n−1), (C.12)

whereSn(ωk, ω�) = E{d(ωk)d(ω�)}. (C.13)

Proof. Write

n|Sn(ωk, ω�)| =−1∑

u=−(n−1)

γ(u)n∑

v=−(u−1)

e−2πi(ωk−ω�)ve−2πiωku

+n−1∑u=0

γ(u)n−u∑v=1

e−2πi(ωk−ω�)ve−2πiωku.

Now, for the first term, with u < 0,

n∑v=−(u−1)

e−2πi(ωk−ω�)v =( n∑

v=1

−−u∑v=1

)e−2πi(ωk−ω�)v

= 0 −−u∑v=1

e−2πi(ωk−ω�)v.

For the second term with u ≥ 0,

n−u∑v=1

e−2πi(ωk−ω�)v =( n∑

v=1

−n∑

v=n−u+1

)e−2πi(ωk−ω�)v

= 0 −n∑

v=n−u+1

e−2πi(ωk−ω�)v.

Consequently,

n|Sn(ωk, ω�)| =

∣∣∣∣∣∣−−1∑

u=−(n−1)

γ(u)−u∑v=1

e−2πi(ωk−ω�)ve−2πiωku

−n−1∑u=1

γ(u)n∑

v=n−u+1

e−2πi(ωk−ω�)ve−2πiωku

∣∣∣∣∣≤

0∑u=−(n−1)

(−u)|γ(u)| +n−1∑u=1

u|γ(u)|

=(n−1)∑

u=−(n−1)

|u| |γ(u)|.

Page 554: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

542 Appendix C

Hence, we have

Sn(ωk, ω�) ≤ θ

n,

and the asserted relations of the lemma follow.

Because the DFTs are approximately uncorrelated, say, of order 1/n, whenthe frequencies are of the form ωk = k/n, we shall compute at those frequencies.The behavior of f(ω) at neighboring frequencies, however, will often be ofinterest and we shall use Lemma C.3 below to handle such cases

Lemma C.3 For |ωk − ω| ≤ L/2n and θ in (C.9), we have

|f(ωk) − f(ω)| ≤ πθL

n(C.14)

orf(ωk) − f(ω) = O(L/n). (C.15)

Proof. To prove Lemma C.3, we write the difference

|f(ωk) − f(ω)| =

∣∣∣∣∣∞∑

h=−∞γ(h)

(e−2πiωkh − e−2πiωh)

∣∣∣∣∣≤

∞∑h=−∞

|γ(h)||e−πi(ωk−ω)h − eπi(ωk−ω)h|

= 2∞∑

h=−∞|γ(h)|| sin[π(ωk − ω)h]|

≤ 2π|ωk − ω|∞∑

h=−∞|h||γ(h)|

≤ πθL

n

because | sin x| ≤ |x|.The main use of the properties described by Lemmas C.1 and C.2 is in

identifying the covariance structure of the DFT, say,

d(ωk) = n−1/2n∑

t=1

xt e−2πiωkt

= dc(ωk) − ids(ωk),

where

dc(ωk) = n−1/2n∑

t=1

xt cos(2πωkt)

and

ds(ωk) = n−1/2n∑

t=1

xt sin(2πωkt)

Page 555: Springer Texts in Statistics - UNAMsistemas.fciencias.unam.mx/~ediaz/Cursos/Estadistica3/Libros/Time... · Springer Texts in Statistics Alfred: Elements of Statistics for the Life

Spectral Domain Theory 543

are the cosine and sine transforms, respectively, of the observed series, de-fined previously in (4.24) and (4.25). For example, assuming zero means forconvenience, we will have

E[dc(ωk)dc(ω�)] =14n−1

n∑s=1

n∑t=1

γ(s − t)(e2πiωks + e−2πiωks

)×(e2πiω�t + e−2πiω�t

)=

14[Sn(−ωk, ωl) + Sn(ωk, ω�) + Sn(ω�, ωk)

+Sn(ωk,−ω�)].

Lemmas C.1 and C.2 imply, for k = ,

E[dc(ωk)dc(ω�)] =14[O(n−1) + f(ωk) + O(n−1)

+f(ωk) + O(n−1) + O(n−1)]

=12f(ωk) + O(n−1). (C.16)

For k �= , all terms are O(n−1). Hence, we have

E[dc(ωk)dc(ω�)] ={ 1

2f(ωk) + O(n−1), k = O(n−1), k �= .

(C.17)

A similar argument gives
$$
E[d_s(\omega_k)\,d_s(\omega_\ell)] =
\begin{cases}
  \tfrac{1}{2} f(\omega_k) + O(n^{-1}), & k = \ell, \\
  O(n^{-1}), & k \ne \ell,
\end{cases} \qquad (C.18)
$$
and we also have $E[d_s(\omega_k)\,d_c(\omega_\ell)] = O(n^{-1})$ for all $k, \ell$. We may summarize the results of Lemmas C.1–C.3 as follows.

Theorem C.4 For a stationary mean-zero process with autocovariance function satisfying (C.9), and for frequencies $\omega_{k:n}$, with $|\omega_{k:n} - \omega| < 1/n$, close to some target frequency $\omega$, the cosine and sine transforms (4.24) and (4.25) are approximately uncorrelated, with variances equal to $\tfrac{1}{2}f(\omega)$, and the error in the approximation can be uniformly bounded by $\pi\theta L/n$.
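As an aside, Theorem C.4 is easy to illustrate by simulation. The following small R experiment is not part of the original text; the AR(1) coefficient, sample size, and target frequency are arbitrary choices, and the spectral density formula assumes $\sigma_w^2 = 1$.

set.seed(1)
n <- 256; phi <- .6; nrep <- 2000
k <- round(.2*n)                                  # Fourier frequency k/n near omega = .2
f <- 1/(1 - 2*phi*cos(2*pi*k/n) + phi^2)          # AR(1) spectral density with sigma_w^2 = 1
dc <- ds <- numeric(nrep)
for (r in 1:nrep) {
  x <- arima.sim(list(ar = phi), n = n)
  dc[r] <- sum(x*cos(2*pi*k/n*(1:n)))/sqrt(n)     # cosine transform (4.24)
  ds[r] <- sum(x*sin(2*pi*k/n*(1:n)))/sqrt(n)     # sine transform (4.25)
}
round(c(var(dc), var(ds), f/2), 3)                # both variances are near f(omega)/2
round(cor(dc, ds), 3)                             # nearly uncorrelated

The empirical variances settle near $f(\omega)/2$, and the correlation between the two transforms is near zero, as the theorem indicates.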

Now, consider estimating the spectrum in a neighborhood of some target frequency $\omega$, using the periodogram estimator
$$
I(\omega_{k:n}) = |d(\omega_{k:n})|^2 = d_c^2(\omega_{k:n}) + d_s^2(\omega_{k:n}),
$$
where we take $|\omega_{k:n} - \omega| \le n^{-1}$ for each $n$. In case the series $x_t$ is Gaussian with zero mean,
$$
\begin{pmatrix} d_c(\omega_{k:n}) \\ d_s(\omega_{k:n}) \end{pmatrix}
\;\overset{d}{\longrightarrow}\;
N\left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix},\;
\frac{1}{2}\begin{pmatrix} f(\omega) & 0 \\ 0 & f(\omega) \end{pmatrix} \right\},
$$
and we have that
$$
\frac{2\,I(\omega_{k:n})}{f(\omega)} \;\overset{d}{\longrightarrow}\; \chi^2_2,
$$
where $\chi^2_\nu$ denotes a chi-squared random variable with $\nu$ degrees of freedom, as usual. Unfortunately, the distribution does not become more concentrated as $n \to \infty$, because the variance of the periodogram estimator does not go to zero.
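A brief R experiment (again, not part of the original text) makes the lack of concentration visible; for Gaussian white noise, $f(\omega) \equiv \sigma_w^2 = 1$, and the periodogram ordinate near a fixed frequency keeps mean and variance near one no matter how large $n$ becomes.

set.seed(1)
per.at <- function(x, k) Mod(fft(x)[k + 1])^2/length(x)    # I(k/n) computed via the FFT
for (n in c(128, 512, 2048)) {
  k <- round(.2*n)                                         # ordinate near omega = .2
  I <- replicate(1000, per.at(rnorm(n), k))
  print(round(c(n = n, mean = mean(I), var = var(I)), 2))  # stays near (1, 1) for every n
}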

We develop a fix for the deficiencies mentioned above by considering the average of the periodogram over a set of frequencies in the neighborhood of $\omega$. For example, we can always find a set of $L = 2m+1$ frequencies of the form $\{\omega_{j:n} + k/n;\ k = 0, \pm 1, \pm 2, \ldots, \pm m\}$, for which
$$
f(\omega_{j:n} + k/n) = f(\omega) + O(Ln^{-1})
$$
by Lemma C.3. As $n$ increases, the values of the separate frequencies change.

Now, we can consider the smoothed periodogram estimator, $\widehat f(\omega)$, given in (4.56); this case includes the averaged periodogram, $\bar f(\omega)$. First, we note that (C.9), $\theta = \sum_{h=-\infty}^{\infty} |h|\,|\gamma(h)| < \infty$, is a crucial condition in the estimation of spectra. In investigating local averages of the periodogram, we will require a condition on the rate of (C.9), namely

$$
\sum_{h=-n}^{n} |h|\,|\gamma(h)| = O(n^{1/2}). \qquad (C.19)
$$

One can show that a sufficient condition for (C.19) is that the time series is the linear process given by
$$
x_t = \sum_{j=-\infty}^{\infty} \psi_j\, w_{t-j},
\qquad \sum_{j=0}^{\infty} \sqrt{j}\,|\psi_j| < \infty, \qquad (C.20)
$$
where $w_t \sim \mathrm{iid}(0, \sigma_w^2)$ and $w_t$ has a finite fourth moment,
$$
E(w_t^4) = \eta\,\sigma_w^4 < \infty.
$$
We leave it to the reader (Problem 4.40) to show that (C.20) implies (C.19) under the condition that $w_t \sim \mathrm{wn}(0, \sigma_w^2)$. We now state the following lemma.

Lemma C.4 Suppose $x_t$ is the linear process given by (C.20), and let $I(\omega_j)$ be the periodogram of the data $\{x_1, \ldots, x_n\}$. Then
$$
\mathrm{cov}\bigl(I(\omega_j), I(\omega_k)\bigr) =
\begin{cases}
  2 f^2(\omega_j) + o(1), & \omega_j = \omega_k = 0 \text{ or } 1/2, \\
  f^2(\omega_j) + o(1),   & \omega_j = \omega_k \ne 0, 1/2, \\
  O(n^{-1}),              & \omega_j \ne \omega_k.
\end{cases}
$$


The proof of Lemma C.4 is straightforward but tedious, and details may be found in Fuller (1976, Theorem 7.2.1) or in Brockwell and Davis (1991, Theorem 10.3.2). For demonstration purposes, we present the proof of the lemma for the pure white noise case; that is, $x_t = w_t$, in which case $f(\omega) \equiv \sigma_w^2$. By definition, the periodogram in this case is
$$
I(\omega_j) = n^{-1} \sum_{s=1}^{n}\sum_{t=1}^{n} w_s w_t\, e^{2\pi i\omega_j(t-s)},
$$
where $\omega_j = j/n$, and hence
$$
E\{I(\omega_j) I(\omega_k)\}
 = n^{-2} \sum_{s=1}^{n}\sum_{t=1}^{n}\sum_{u=1}^{n}\sum_{v=1}^{n}
   E(w_s w_t w_u w_v)\, e^{2\pi i\omega_j(t-s)}\, e^{2\pi i\omega_k(u-v)}.
$$
Now, when all four subscripts match, $E(w_s w_t w_u w_v) = \eta\sigma_w^4$; when the subscripts match in pairs (but not all four), $E(w_s w_t w_u w_v) = \sigma_w^4$; otherwise, $E(w_s w_t w_u w_v) = 0$. Thus,
$$
E\{I(\omega_j) I(\omega_k)\}
 = n^{-1}(\eta - 3)\sigma_w^4
   + \sigma_w^4\Bigl(1 + n^{-2}\bigl[A(\omega_j + \omega_k) + A(\omega_k - \omega_j)\bigr]\Bigr),
$$
where
$$
A(u) = \Bigl|\sum_{t=1}^{n} e^{2\pi i u t}\Bigr|^2.
$$
Noting that $E\,I(\omega_j) = n^{-1}\sum_{t=1}^{n} E(w_t^2) = \sigma_w^2$, we have
$$
\mathrm{cov}\{I(\omega_j), I(\omega_k)\}
 = E\{I(\omega_j) I(\omega_k)\} - \sigma_w^4
 = n^{-1}(\eta - 3)\sigma_w^4
   + n^{-2}\sigma_w^4\bigl[A(\omega_j + \omega_k) + A(\omega_k - \omega_j)\bigr].
$$
Thus, because $A(0) = n^2$ while $A(u) = 0$ when $u$ is a nonzero noninteger multiple of $1/n$, we conclude that
$$
\begin{aligned}
\mathrm{var}\{I(\omega_j)\} &= n^{-1}(\eta - 3)\sigma_w^4 + \sigma_w^4
   \quad\text{for } \omega_j \ne 0, 1/2, \\
\mathrm{var}\{I(\omega_j)\} &= n^{-1}(\eta - 3)\sigma_w^4 + 2\sigma_w^4
   \quad\text{for } \omega_j = 0, 1/2, \\
\mathrm{cov}\{I(\omega_j), I(\omega_k)\} &= n^{-1}(\eta - 3)\sigma_w^4
   \quad\text{for } \omega_j \ne \omega_k,
\end{aligned}
$$
which establishes the result in this case. We also note that if $w_t$ is Gaussian, then $\eta = 3$ and the periodogram ordinates are independent. Using Lemma C.4, we may establish the following fundamental result.

Theorem C.5 Suppose $x_t$ is the linear process given by (C.20). Then, with $\widehat f(\omega)$ defined in (4.56) and corresponding conditions on the weights $h_k$, we have, as $n \to \infty$,

(i) $E\bigl(\widehat f(\omega)\bigr) \to f(\omega)$;

(ii) $\bigl(\sum_{k=-m}^{m} h_k^2\bigr)^{-1} \mathrm{cov}\bigl(\widehat f(\omega), \widehat f(\lambda)\bigr) \to f^2(\omega)$ for $\omega = \lambda \ne 0, 1/2$.

In (ii), replace $f^2(\omega)$ by $0$ if $\omega \ne \lambda$, and by $2f^2(\omega)$ if $\omega = \lambda = 0$ or $1/2$.

Proof. (i): First, recall (4.29),
$$
E[I(\omega_{j:n})]
 = \sum_{h=-(n-1)}^{n-1} \Bigl(\frac{n-|h|}{n}\Bigr)\gamma(h)\, e^{-2\pi i\omega_{j:n} h}
 \;\overset{\mathrm{def}}{=}\; f_n(\omega_{j:n}).
$$
But since $f_n(\omega_{j:n}) \to f(\omega)$ uniformly, and $|f(\omega_{j:n}) - f(\omega_{j:n} + k/n)| \to 0$ by the continuity of $f$, we have
$$
E\,\widehat f(\omega)
 = \sum_{k=-m}^{m} h_k\, E\,I(\omega_{j:n} + k/n)
 = \sum_{k=-m}^{m} h_k\, f_n(\omega_{j:n} + k/n)
 = \sum_{k=-m}^{m} h_k\,\bigl[f(\omega) + o(1)\bigr] \to f(\omega),
$$
because $\sum_{k=-m}^{m} h_k = 1$.

(ii): First, suppose we have $\omega_{j:n} \to \omega_1$ and $\omega_{\ell:n} \to \omega_2$, with $\omega_1 \ne \omega_2$. Then, for $n$ large enough to separate the bands, using Lemma C.4, we have
$$
\begin{aligned}
\Bigl|\mathrm{cov}\bigl(\widehat f(\omega_1), \widehat f(\omega_2)\bigr)\Bigr|
  &= \Bigl|\sum_{|k|\le m}\sum_{|r|\le m} h_k h_r\,
      \mathrm{cov}\bigl[I(\omega_{j:n} + k/n),\, I(\omega_{\ell:n} + r/n)\bigr]\Bigr| \\
  &= \Bigl|\sum_{|k|\le m}\sum_{|r|\le m} h_k h_r\, O(n^{-1})\Bigr| \\
  &\le \frac{c}{n}\Bigl(\sum_{|k|\le m} h_k\Bigr)^{2}
     \qquad\text{(where $c$ is a constant)} \\
  &\le \frac{cL}{n}\Bigl(\sum_{|k|\le m} h_k^2\Bigr),
\end{aligned}
$$
which establishes (ii) for the case of different frequencies; the last inequality is the Cauchy–Schwarz inequality. The case of the same frequencies, that is, $\mathrm{var}[\widehat f(\omega_1)]$, is established in a similar manner to the above arguments.
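A small R check of Theorem C.5 for the averaged periodogram (not part of the original text): with $L = 2m+1$ equal weights $h_k = 1/L$, we have $\sum_k h_k^2 = 1/L$, so the variance of the estimator should shrink by roughly that factor while its mean stays near $f(\omega)$. Gaussian white noise, for which $f(\omega) \equiv 1$, is used for simplicity.

set.seed(1)
n <- 512; m <- 4; L <- 2*m + 1
k0 <- round(.2*n)                                # center of the frequency band
fhat <- replicate(1000, {
  I <- Mod(fft(rnorm(n)))^2/n                    # periodogram at all Fourier frequencies
  mean(I[(k0 - m):(k0 + m) + 1])                 # equal-weight average over L ordinates
})
round(c(mean(fhat), var(fhat), 1/L), 3)          # mean near 1, variance near 1/L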

Theorem C.5 justifies the distributional properties used throughout §4.5 and in Chapter 7. We may extend the results of this section to vector series of the form $\boldsymbol{x}_t = (x_{t1}, \ldots, x_{tp})'$, when the cross-spectrum is given by
$$
f_{ij}(\omega) = \sum_{h=-\infty}^{\infty} \gamma_{ij}(h)\, e^{-2\pi i\omega h}
             = c_{ij}(\omega) - i\,q_{ij}(\omega), \qquad (C.21)
$$
where
$$
c_{ij}(\omega) = \sum_{h=-\infty}^{\infty} \gamma_{ij}(h) \cos(2\pi\omega h) \qquad (C.22)
$$
and
$$
q_{ij}(\omega) = \sum_{h=-\infty}^{\infty} \gamma_{ij}(h) \sin(2\pi\omega h) \qquad (C.23)
$$
denote the cospectrum and quadspectrum, respectively. We denote the DFT of the series $x_{tj}$ by
$$
d_j(\omega_k) = n^{-1/2} \sum_{t=1}^{n} x_{tj}\, e^{-2\pi i\omega_k t}
             = d_{cj}(\omega_k) - i\,d_{sj}(\omega_k),
$$
where $d_{cj}$ and $d_{sj}$ are the cosine and sine transforms of $x_{tj}$, for $j = 1, 2, \ldots, p$. We bound the covariance structure as before and summarize the results as follows.

Theorem C.6 The covariance structure of the multivariate cosine and sine transforms, subject to
$$
\theta_{ij} = \sum_{h=-\infty}^{\infty} |h|\,|\gamma_{ij}(h)| < \infty, \qquad (C.24)
$$
is given by
$$
E[d_{ci}(\omega_k)\, d_{cj}(\omega_\ell)] =
\begin{cases}
  \tfrac{1}{2} c_{ij}(\omega_k) + O(n^{-1}), & k = \ell, \\
  O(n^{-1}), & k \ne \ell,
\end{cases} \qquad (C.25)
$$
$$
E[d_{ci}(\omega_k)\, d_{sj}(\omega_\ell)] =
\begin{cases}
  -\tfrac{1}{2} q_{ij}(\omega_k) + O(n^{-1}), & k = \ell, \\
  O(n^{-1}), & k \ne \ell,
\end{cases} \qquad (C.26)
$$
$$
E[d_{si}(\omega_k)\, d_{cj}(\omega_\ell)] =
\begin{cases}
  \tfrac{1}{2} q_{ij}(\omega_k) + O(n^{-1}), & k = \ell, \\
  O(n^{-1}), & k \ne \ell,
\end{cases} \qquad (C.27)
$$
$$
E[d_{si}(\omega_k)\, d_{sj}(\omega_\ell)] =
\begin{cases}
  \tfrac{1}{2} c_{ij}(\omega_k) + O(n^{-1}), & k = \ell, \\
  O(n^{-1}), & k \ne \ell.
\end{cases} \qquad (C.28)
$$

Proof. We define
$$
S^{ij}_n(\omega_k, \omega_\ell)
 = n^{-1}\sum_{s=1}^{n}\sum_{t=1}^{n} \gamma_{ij}(s-t)\,
   e^{-2\pi i\omega_k s}\, e^{2\pi i\omega_\ell t}. \qquad (C.29)
$$
Then, we may verify the theorem with manipulations like
$$
\begin{aligned}
E[d_{ci}(\omega_k)\, d_{sj}(\omega_k)]
  &= \frac{1}{4i}\, n^{-1}\sum_{s=1}^{n}\sum_{t=1}^{n} \gamma_{ij}(s-t)
     \bigl(e^{2\pi i\omega_k s} + e^{-2\pi i\omega_k s}\bigr)
     \bigl(e^{2\pi i\omega_k t} - e^{-2\pi i\omega_k t}\bigr) \\
  &= \frac{1}{4i}\bigl[S^{ij}_n(-\omega_k, \omega_k) + S^{ij}_n(\omega_k, \omega_k)
       - S^{ij}_n(-\omega_k, -\omega_k) - S^{ij}_n(\omega_k, -\omega_k)\bigr] \\
  &= \frac{1}{4i}\bigl[c_{ij}(\omega_k) - i\,q_{ij}(\omega_k)
       - \bigl(c_{ij}(\omega_k) + i\,q_{ij}(\omega_k)\bigr) + O(n^{-1})\bigr] \\
  &= -\tfrac{1}{2}\, q_{ij}(\omega_k) + O(n^{-1}),
\end{aligned}
$$
where we have used the fact that the properties given in Lemmas C.1–C.3 can be verified for the cross-spectral density functions $f_{ij}(\omega)$, $i, j = 1, \ldots, p$.

Now, if the underlying multivariate time series $\boldsymbol{x}_t$ is a normal process, it is clear that the DFTs will be jointly normal, and we may define the vector DFT, $\boldsymbol{d}(\omega_k) = (d_1(\omega_k), \ldots, d_p(\omega_k))'$, as
$$
\boldsymbol{d}(\omega_k) = n^{-1/2} \sum_{t=1}^{n} \boldsymbol{x}_t\, e^{-2\pi i\omega_k t}
 = \boldsymbol{d}_c(\omega_k) - i\,\boldsymbol{d}_s(\omega_k), \qquad (C.30)
$$
where
$$
\boldsymbol{d}_c(\omega_k) = n^{-1/2} \sum_{t=1}^{n} \boldsymbol{x}_t \cos(2\pi\omega_k t) \qquad (C.31)
$$
and
$$
\boldsymbol{d}_s(\omega_k) = n^{-1/2} \sum_{t=1}^{n} \boldsymbol{x}_t \sin(2\pi\omega_k t) \qquad (C.32)
$$
are the cosine and sine transforms, respectively, of the observed vector series $\boldsymbol{x}_t$. Then, constructing the vector of real and imaginary parts $(\boldsymbol{d}_c'(\omega_k), \boldsymbol{d}_s'(\omega_k))'$, we may note that it has mean zero and $2p \times 2p$ covariance matrix
$$
\Sigma(\omega_k) = \frac{1}{2}
\begin{pmatrix} C(\omega_k) & -Q(\omega_k) \\ Q(\omega_k) & C(\omega_k) \end{pmatrix} \qquad (C.33)
$$
to order $n^{-1}$, as long as $\omega_k - \omega = O(n^{-1})$. We have introduced the $p \times p$ matrices $C(\omega_k) = \{c_{ij}(\omega_k)\}$ and $Q(\omega_k) = \{q_{ij}(\omega_k)\}$. The complex random vector $\boldsymbol{d}(\omega_k)$ has covariance
$$
\begin{aligned}
S(\omega_k) &= E[\boldsymbol{d}(\omega_k)\, \boldsymbol{d}^{*}(\omega_k)] \\
 &= E\bigl[\bigl(\boldsymbol{d}_c(\omega_k) - i\,\boldsymbol{d}_s(\omega_k)\bigr)
          \bigl(\boldsymbol{d}_c(\omega_k) - i\,\boldsymbol{d}_s(\omega_k)\bigr)^{*}\bigr] \\
 &= E[\boldsymbol{d}_c(\omega_k)\boldsymbol{d}_c(\omega_k)'] + E[\boldsymbol{d}_s(\omega_k)\boldsymbol{d}_s(\omega_k)']
    - i\bigl(E[\boldsymbol{d}_s(\omega_k)\boldsymbol{d}_c(\omega_k)'] - E[\boldsymbol{d}_c(\omega_k)\boldsymbol{d}_s(\omega_k)']\bigr) \\
 &= C(\omega_k) - i\,Q(\omega_k). \qquad (C.34)
\end{aligned}
$$
If the process $\boldsymbol{x}_t$ has a multivariate normal distribution, the complex vector $\boldsymbol{d}(\omega_k)$ has approximately the complex multivariate normal distribution with mean zero and covariance matrix $S(\omega_k) = C(\omega_k) - i\,Q(\omega_k)$ if the real and imaginary parts have the covariance structure specified above. In the next section, we work further with this distribution and show how it adapts to the real case. If we wish to estimate the spectral matrix $S(\omega)$, it is natural to take a band of frequencies of the form $\omega_{k:n} + \ell/n$, for $\ell = -m, \ldots, m$ as before, so that the estimator becomes (4.87) of §4.6. A discussion of further properties of the multivariate complex normal distribution is deferred.

It is also of interest to develop a large sample theory for cases in which the underlying distribution is not necessarily normal. If $x_t$ is not necessarily a normal process, some additional conditions are needed to obtain asymptotic normality. In particular, introduce the notion of a generalized linear process
$$
\boldsymbol{y}_t = \sum_{r=-\infty}^{\infty} A_r\, \boldsymbol{w}_{t-r}, \qquad (C.35)
$$
where $\boldsymbol{w}_t$ is a $p \times 1$ vector white noise process with $p \times p$ covariance matrix $E[\boldsymbol{w}_t \boldsymbol{w}_t'] = G$, and the $p \times p$ matrices of filter coefficients $A_t$ satisfy
$$
\sum_{t=-\infty}^{\infty} \mathrm{tr}\{A_t A_t'\}
 = \sum_{t=-\infty}^{\infty} \|A_t\|^2 < \infty. \qquad (C.36)
$$
In particular, stable vector ARMA processes satisfy these conditions. For generalized linear processes, we state the following general result from Hannan (1970, p. 224).

Theorem C.7 If $\boldsymbol{x}_t$ is generated by a generalized linear process with a continuous spectrum that is not zero at $\omega$, and $\omega_{k:n} + \ell/n$ are a set of frequencies within $L/n$ of $\omega$, then the joint density of the cosine and sine transforms (C.31) and (C.32) converges to that of $L$ independent $2p \times 1$ normal vectors with covariance matrix $\Sigma(\omega)$ having the structure given in (C.33). At $\omega = 0$ or $\omega = 1/2$, the distribution is real, with covariance matrix $2\Sigma(\omega)$.

The above result provides the basis for inference involving the Fourier transforms of stationary series because it justifies approximations to the likelihood function based on multivariate normal theory. We make extensive use of this result in Chapter 7, but will still need a simple form to justify the distributional result for the sample coherence given in (4.93). The next section gives an elementary introduction to the complex normal distribution.

C.3 The Complex Multivariate Normal Distribution

The multivariate normal distribution will be the fundamental tool for expressing the likelihood function and determining approximate maximum likelihood estimators and their large sample probability distributions. A detailed treatment of the multivariate normal distribution can be found in standard texts such as Anderson (1984). We will use the multivariate normal distribution of the $p \times 1$ vector $\boldsymbol{x} = (x_1, x_2, \ldots, x_p)'$, as defined by its density function
$$
p(\boldsymbol{x}) = (2\pi)^{-p/2} |\Sigma|^{-1/2}
 \exp\Bigl\{-\tfrac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})' \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu})\Bigr\}, \qquad (C.37)
$$
which can be shown to have mean vector $E[\boldsymbol{x}] = \boldsymbol{\mu} = (\mu_1, \ldots, \mu_p)'$ and covariance matrix
$$
\Sigma = E[(\boldsymbol{x} - \boldsymbol{\mu})(\boldsymbol{x} - \boldsymbol{\mu})']. \qquad (C.38)
$$
We use the notation $\boldsymbol{x} \sim N_p\{\boldsymbol{\mu}, \Sigma\}$ for densities of the form (C.37) and note that linearly transformed multivariate normal variables of the form $\boldsymbol{y} = A\boldsymbol{x}$, with $A$ a $q \times p$ matrix, $q \le p$, will also be multivariate normal, with distribution
$$
\boldsymbol{y} \sim N_q\{A\boldsymbol{\mu},\, A\Sigma A'\}. \qquad (C.39)
$$
Often, the partitioned multivariate normal, based on the vector $\boldsymbol{x} = (\boldsymbol{x}_1', \boldsymbol{x}_2')'$, split into $p_1 \times 1$ and $p_2 \times 1$ components $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$, respectively, will be used, where $p = p_1 + p_2$. If the mean vector $\boldsymbol{\mu} = (\boldsymbol{\mu}_1', \boldsymbol{\mu}_2')'$ and covariance matrix
$$
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \qquad (C.40)
$$
are also compatibly partitioned, the marginal distribution of any subset of components is multivariate normal, say,
$$
\boldsymbol{x}_1 \sim N_{p_1}\{\boldsymbol{\mu}_1, \Sigma_{11}\},
$$
and the conditional distribution of $\boldsymbol{x}_2$ given $\boldsymbol{x}_1$ is normal, with conditional mean
$$
E[\boldsymbol{x}_2 \mid \boldsymbol{x}_1]
 = \boldsymbol{\mu}_2 + \Sigma_{21}\Sigma_{11}^{-1}(\boldsymbol{x}_1 - \boldsymbol{\mu}_1) \qquad (C.41)
$$
and conditional covariance
$$
\mathrm{cov}[\boldsymbol{x}_2 \mid \boldsymbol{x}_1] = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}. \qquad (C.42)
$$
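The two formulas are easy to evaluate numerically. The following R lines are not from the original text and use an arbitrary covariance matrix, with $p_1 = 1$ and $p_2 = 2$.

Sigma <- matrix(c(4, 1, 2,
                  1, 3, 1,
                  2, 1, 5), 3, 3)                # symmetric, positive definite
mu <- c(1, 0, -1)
i1 <- 1; i2 <- 2:3                               # x1 = first component, x2 = last two
x1 <- 2                                          # an observed value of x1
S11 <- Sigma[i1, i1]; S12 <- Sigma[i1, i2]
S21 <- Sigma[i2, i1]; S22 <- Sigma[i2, i2]
mu2.1  <- mu[i2] + S21 %*% solve(S11) %*% (x1 - mu[i1])   # conditional mean (C.41)
Sig2.1 <- S22 - S21 %*% solve(S11) %*% S12                # conditional covariance (C.42)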


In the previous section, the real and imaginary parts of the DFT had a partitioned covariance matrix as given in (C.33), and we use this result to say that the complex $p \times 1$ vector
$$
\boldsymbol{z} = \boldsymbol{x}_1 - i\boldsymbol{x}_2 \qquad (C.43)
$$
has a complex multivariate normal distribution, with mean vector $\boldsymbol{\mu}_z = \boldsymbol{\mu}_1 - i\boldsymbol{\mu}_2$ and $p \times p$ covariance matrix
$$
\Sigma_z = C - iQ, \qquad (C.44)
$$
if the real $2p \times 1$ vector $\boldsymbol{x} = (\boldsymbol{x}_1', \boldsymbol{x}_2')'$ has a real multivariate normal distribution with mean vector $\boldsymbol{\mu} = (\boldsymbol{\mu}_1', \boldsymbol{\mu}_2')'$ and covariance matrix
$$
\Sigma = \frac{1}{2}\begin{pmatrix} C & -Q \\ Q & C \end{pmatrix}. \qquad (C.45)
$$
The restrictions $C' = C$ and $Q' = -Q$ are necessary for the matrix $\Sigma$ to be a covariance matrix, and these conditions then imply that $\Sigma_z = \Sigma_z^{*}$ is Hermitian. The probability density function of the complex multivariate normal vector $\boldsymbol{z}$ can be expressed in the concise form
$$
p_{\boldsymbol{z}}(\boldsymbol{z}) = \pi^{-p} |\Sigma_z|^{-1}
 \exp\bigl\{-(\boldsymbol{z} - \boldsymbol{\mu}_z)^{*} \Sigma_z^{-1} (\boldsymbol{z} - \boldsymbol{\mu}_z)\bigr\}, \qquad (C.46)
$$
and this is the form that we will often use in the likelihood. The result follows from showing that $p_{\boldsymbol{x}}(\boldsymbol{x}_1, \boldsymbol{x}_2) = p_{\boldsymbol{z}}(\boldsymbol{z})$ exactly, using the fact that the quadratic and Hermitian forms in the exponent are equal and that $|\Sigma_x| = 2^{-2p}|\Sigma_z|^2$. The determinant relation follows from the fact that, if $\lambda_1, \lambda_2, \ldots, \lambda_p$ denote the (real) eigenvalues of $\Sigma_z$, then $\Sigma_x$ has the repeated eigenvalues $\lambda_1/2, \ldots, \lambda_p/2$, corresponding to eigenvectors of the form $(\alpha_1', \alpha_2')'$, and the same set corresponding to $(\alpha_2', -\alpha_1')'$. Hence,
$$
|\Sigma_x| = \prod_{i=1}^{p} (\lambda_i/2)^2 = 2^{-2p}\,|\Sigma_z|^2,
$$
which, combined with the normalizing constants $(2\pi)^{-p}$ and $\pi^{-p}$, gives $p_{\boldsymbol{x}} = p_{\boldsymbol{z}}$ exactly. For further material relating to the complex multivariate normal distribution, see Goodman (1963), Giri (1965), or Khatri (1965).
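The determinant relation can be checked numerically. The R fragment below is not from the original text and uses arbitrary matrices $C$ and $Q$; it illustrates that the eigenvalues of $\Sigma_x$ are those of $\Sigma_z$ halved and repeated, so that $|\Sigma_x| = 2^{-2p}|\Sigma_z|^2$.

p <- 2
C <- matrix(c(3, 1, 1, 2), 2, 2)                 # symmetric (cospectrum) part
Q <- matrix(c(0, .5, -.5, 0), 2, 2)              # skew-symmetric (quadspectrum) part
Sigma.z <- C - 1i*Q                              # Hermitian p x p complex covariance
Sigma.x <- .5*rbind(cbind(C, -Q), cbind(Q, C))   # real 2p x 2p covariance as in (C.45)
lam <- Re(eigen(Sigma.z)$values)                 # eigenvalues of Sigma.z (real)
sort(eigen(Sigma.x)$values)                      # the same values halved, each twice
sort(rep(lam/2, 2))
c(det(Sigma.x), prod(lam)^2/2^(2*p))             # the two determinants agree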

Example C.2 A Bivariate Complex Normal Distribution

Consider the joint distribution of the complex random variables $u_1 = x_1 - ix_2$ and $u_2 = y_1 - iy_2$, where the partitioned vector $(x_1, x_2, y_1, y_2)'$ has a real multivariate normal distribution with mean $(0, 0, 0, 0)'$ and covariance matrix
$$
\Sigma = \frac{1}{2}
\begin{pmatrix}
 c_{xx} & 0      & c_{xy}  & -q_{xy} \\
 0      & c_{xx} & q_{xy}  & c_{xy}  \\
 c_{xy} & q_{xy} & c_{yy}  & 0       \\
 -q_{xy}& c_{xy} & 0       & c_{yy}
\end{pmatrix}. \qquad (C.47)
$$
Now, consider the conditional distribution of $\boldsymbol{y} = (y_1, y_2)'$, given $\boldsymbol{x} = (x_1, x_2)'$. Using (C.41), we obtain
$$
E(\boldsymbol{y} \mid \boldsymbol{x})
 = \begin{pmatrix} x_1 & -x_2 \\ x_2 & x_1 \end{pmatrix}
   \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}, \qquad (C.48)
$$
where
$$
(b_1, b_2) = \Bigl(\frac{c_{yx}}{c_{xx}},\ \frac{q_{yx}}{c_{xx}}\Bigr). \qquad (C.49)
$$
It is natural to identify the cross-spectrum
$$
f_{xy} = c_{xy} - i\,q_{xy}, \qquad (C.50)
$$
so that the complex variable identified with the pair is just
$$
b = b_1 - i b_2 = \frac{c_{yx} - i\,q_{yx}}{c_{xx}} = \frac{f_{yx}}{f_{xx}},
$$
and we identify it as the complex regression coefficient. The conditional covariance follows from (C.42) and simplifies to
$$
\mathrm{cov}(\boldsymbol{y} \mid \boldsymbol{x}) = \tfrac{1}{2} f_{y\cdot x}\, I_2, \qquad (C.51)
$$
where $I_2$ denotes the $2 \times 2$ identity matrix and
$$
f_{y\cdot x} = c_{yy} - \frac{c_{xy}^2 + q_{xy}^2}{c_{xx}}
            = f_{yy} - \frac{|f_{xy}|^2}{f_{xx}}. \qquad (C.52)
$$

Example C.2 leads to an approach for justifying the distributional results for the sample coherence given in (4.93). That equation suggests that the result can be derived using the regression results that lead to the F-statistics in §2.2. Suppose that we consider $L$ values of the sine and cosine transforms of the input $x_t$ and output $y_t$, which we will denote by $d_{x,c}(\omega_k + \ell/n)$, $d_{x,s}(\omega_k + \ell/n)$, $d_{y,c}(\omega_k + \ell/n)$, $d_{y,s}(\omega_k + \ell/n)$, sampled at $L = 2m+1$ frequencies, $\ell = -m, \ldots, m$, in the neighborhood of some target frequency $\omega$. Suppose these cosine and sine transforms are re-indexed and denoted by $d_{x,cj}, d_{x,sj}, d_{y,cj}, d_{y,sj}$, for $j = 1, 2, \ldots, L$, producing $2L$ real random variables with a large sample normal distribution having limiting covariance matrices of the form (C.47) for each $j$. Then, the conditional normal distribution of the $2 \times 1$ vector $(d_{y,cj}, d_{y,sj})'$ given $(d_{x,cj}, d_{x,sj})'$, derived in Example C.2, shows that we may write, approximately, the regression model

$$
\begin{pmatrix} d_{y,cj} \\ d_{y,sj} \end{pmatrix}
 = \begin{pmatrix} d_{x,cj} & -d_{x,sj} \\ d_{x,sj} & d_{x,cj} \end{pmatrix}
   \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
 + \begin{pmatrix} V_{cj} \\ V_{sj} \end{pmatrix},
$$
where $V_{cj}, V_{sj}$ are approximately uncorrelated, with approximate variances
$$
E[V_{cj}^2] = E[V_{sj}^2] = \tfrac{1}{2} f_{y\cdot x}.
$$

Now, construct the $L \times 1$ vectors $\boldsymbol{y}_c = (d_{y,c1}, \ldots, d_{y,cL})'$, $\boldsymbol{y}_s = (d_{y,s1}, \ldots, d_{y,sL})'$, $\boldsymbol{x}_c = (d_{x,c1}, \ldots, d_{x,cL})'$, and $\boldsymbol{x}_s = (d_{x,s1}, \ldots, d_{x,sL})'$, and rewrite the regression model, stacked into $2L$ observations, as
$$
\begin{pmatrix} \boldsymbol{y}_c \\ \boldsymbol{y}_s \end{pmatrix}
 = \begin{pmatrix} \boldsymbol{x}_c & -\boldsymbol{x}_s \\ \boldsymbol{x}_s & \boldsymbol{x}_c \end{pmatrix}
   \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
 + \begin{pmatrix} \boldsymbol{v}_c \\ \boldsymbol{v}_s \end{pmatrix},
$$
where $\boldsymbol{v}_c$ and $\boldsymbol{v}_s$ are the error stacks. Finally, write the overall model as the regression model of Chapter 2, namely,
$$
\boldsymbol{y} = Z\boldsymbol{b} + \boldsymbol{v},
$$

making the obvious identifications in the previous equation. Conditional on $Z$, the model becomes exactly the regression model considered in Chapter 2, with $q = 2$ regression coefficients and $2L$ observations in the observation vector $\boldsymbol{y}$. To test the hypothesis of no regression for that model, we use an F-statistic that depends on the difference between the residual sum of squares for the full model, say,
$$
\mathrm{RSS} = \boldsymbol{y}'\boldsymbol{y} - \boldsymbol{y}'Z(Z'Z)^{-1}Z'\boldsymbol{y}, \qquad (C.53)
$$
and the residual sum of squares for the reduced model, $\mathrm{RSS}_0 = \boldsymbol{y}'\boldsymbol{y}$. Then,
$$
F_{2,\,2L-2} = (L-1)\,\frac{\mathrm{RSS}_0 - \mathrm{RSS}}{\mathrm{RSS}} \qquad (C.54)
$$

has the F-distribution with 2 and $2L-2$ degrees of freedom. Also, it follows by substitution for $\boldsymbol{y}$ that
$$
\mathrm{RSS}_0 = \boldsymbol{y}'\boldsymbol{y}
 = \boldsymbol{y}_c'\boldsymbol{y}_c + \boldsymbol{y}_s'\boldsymbol{y}_s
 = \sum_{j=1}^{L} \bigl(d_{y,cj}^2 + d_{y,sj}^2\bigr)
 = L\,\widehat f_y(\omega),
$$
which is just the sample spectrum of the output series. Similarly,
$$
Z'Z = \begin{pmatrix} L\widehat f_x & 0 \\ 0 & L\widehat f_x \end{pmatrix}
$$


and
$$
Z'\boldsymbol{y}
 = \begin{pmatrix} \boldsymbol{x}_c'\boldsymbol{y}_c + \boldsymbol{x}_s'\boldsymbol{y}_s \\
                   \boldsymbol{x}_c'\boldsymbol{y}_s - \boldsymbol{x}_s'\boldsymbol{y}_c \end{pmatrix}
 = \begin{pmatrix} \sum_{j=1}^{L} (d_{x,cj}\, d_{y,cj} + d_{x,sj}\, d_{y,sj}) \\
                   \sum_{j=1}^{L} (d_{x,cj}\, d_{y,sj} - d_{x,sj}\, d_{y,cj}) \end{pmatrix}
 = \begin{pmatrix} L\,\widehat c_{yx} \\ L\,\widehat q_{yx} \end{pmatrix}
$$
together imply that
$$
\boldsymbol{y}'Z(Z'Z)^{-1}Z'\boldsymbol{y} = L\,|\widehat f_{xy}|^2 \big/ \widehat f_x.
$$

Substituting into (C.54) gives
$$
F_{2,\,2L-2} = (L-1)\,
 \frac{|\widehat f_{xy}|^2 / \widehat f_x}
      {\widehat f_y - |\widehat f_{xy}|^2 / \widehat f_x},
$$
which converts directly into the F-statistic (4.93), using the sample coherence defined in (4.92).
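In R, essentially the same F-statistic can be computed from smoothed spectral estimates. The sketch below is not from the original text; the series are artificial, spec.pgram() returns the squared coherence directly for a bivariate series, and with spans = 11 the value L = 11 is only a rough stand-in for the number of ordinates in each band.

set.seed(1)
n <- 1024
x <- arima.sim(list(ar = .5), n = n)             # input series
y <- .8*x + rnorm(n)                             # output related to the input
sr <- spec.pgram(cbind(x, y), spans = 11, taper = 0, plot = FALSE)
L <- 11
Fstat <- (L - 1)*sr$coh/(1 - sr$coh)             # F with 2 and 2L-2 df under no coherence
crit  <- qf(.999, 2, 2*L - 2)                    # approximate null critical value
plot(sr$freq, Fstat, type = "l", xlab = "frequency", ylab = "F-statistic")
abline(h = crit, lty = 2)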


References

Akaike, H. (1969). Fitting autoregressive models for prediction. Ann. Inst.Stat. Math., 21, 243-247.

Akaike, H. (1973). Information theory and an extension of the maximumlikelihood principal. In 2nd Int. Symp. Inform. Theory, 267-281. B.N.Petrov and F. Csake, eds. Budapest: Akademia Kiado.

Akaike, H. (1974). A new look at statistical model identification. IEEE Trans.Automat. Contr., AC-19, 716-723.

Alagon, J. (1989). Spectral discrimination for two groups of time series. J.Time Series Anal., 10, 203-214.

Alspach, D.L. and H.W. Sorensen (1972). Nonlinear Bayesian estimation usingGaussian sum approximations. IEEE Trans. Automat. Contr., AC-17,439-447.

Anderson, B.D.O. and J.B. Moore (1979). Optimal Filtering. EnglewoodCliffs, NJ: Prentice-Hall.

Anderson, T.W. (1978). Estimation for autoregressive moving average modelsin the time and frequency domain. Ann. Stat., 5, 842-865.

Anderson, T.W. (1984). An Introduction to Multivariate Statistical Analysis,2nd ed. New York: Wiley.

Ansley, C.F. and P. Newbold (1980). Finite sample properties of estimatorsfor autoregressive moving average processes. J. Econ., 13, 159-183.

Antognini, J.F., M.H. Buonocore, E.A. Disbrow, and E. Carstens (1997).Isoflurane anesthesia blunts cerebral responses to noxious and innocuousstimuli: a fMRI study. Life Sci., 61, PL349-PL354.

Bandettini, A., A. Jesmanowicz, E.C. Wong, and J.S. Hyde (1993). Processingstrategies for time-course data sets in functional MRI of the human brain.Magnetic Res. Med., 30, 161-173.

Bar-Shalom, Y. (1978). Tracking methods in a multi-target environment.IEEE Trans. Automat. Contr., AC-23, 618-626.

Bar-Shalom, Y. and E. Tse (1975). Tracking in a cluttered environment withprobabilistic data association. Automatica, 11, 4451-4460.



Bazza, M., R.H. Shumway, and D.R. Nielsen (1988). Two-dimensional spectralanalysis of soil surface temperatures. Hilgardia, 56, 1-28.

Bedrick, E.J. and C.-L. Tsai (1994). Model selection for multivariate regressionin small samples. Biometrics, 50, 226-231.

Beran, J. (1994). Statistics for Long Memory Processes. New York: Chapmanand Hall.

Berk, K.N. (1974). Consistent autoregressive spectral estimates. Ann. Stat.,2, 489-502.

Besag, J. (1974). Spatial interaction and the statistical analysis of latticesystems (with discussion). J. R. Stat. Soc. B, 36, 192-236.

Bhat, R.R. (1985). Modern Probability Theory, 2nd ed. New York: Wiley.

Bhattacharya, A. (1943). On a measure of divergence between two statistical populations. Bull. Calcutta Math. Soc., 35, 99-109.

Blackman, R.B. and J.W. Tukey (1959). The Measurement of Power Spectra from the Point of View of Communications Engineering. New York: Dover.

Blight, B.J.N. (1974). Recursive solutions for the estimation of a stochasticparameter J. Am. Stat. Assoc., 69, 477-481

Bloomfield, P. (1976). Fourier Analysis of Time Series: An Introduction. NewYork: Wiley.

Bloomfield, P. (2000). Fourier Analysis of Time Series: An Introduction, 2nded. New York: Wiley.

Bloomfield, P. and J.M. Davis (1994). Orthogonal rotation of complex princi-pal components. Int. J. Climatol., 14, 759-775.

Bogart, B. P., M. J. R. Healy, and J.W. Tukey (1962). The Quefrency Analy-sis of Time Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum and Saphe Cracking. In Proc. of the Symposium on Time SeriesAnalysis, pp. 209-243, Brown University, Providence, USA.

Bollerslev, T. (1986). Generalized autoregressive conditional heteroscedastic-ity. J. Econ., 31, 307- 327.

Box, G.E.P. and G.M. Jenkins (1970). Time Series Analysis, Forecasting, andControl. Oakland, CA: Holden-Day.

Box, G.E.P., G.M. Jenkins and G.C. Reinsel (1994). Time Series Analysis,Forecasting, and Control, 3rd ed. Englewood Cliffs, NJ: Prentice Hall.

Box, G.E.P. and D.A. Pierce (1970). Distributions of residual autocorrelationsin autoregressive integrated moving average models. J. Am. Stat. Assoc.,72, 397-402.

Box, G.E.P. and G.C. Tiao (1973). Bayesian Inference in Statistical Analysis.New York: Wiley.

Breiman, L. and J. Friedman (1985). Estimating optimal transformations for multiple regression and correlation (with discussion). J. Am. Stat. Assoc., 80, 580-619.

Brillinger, D.R. (1973). The analysis of time series collected in an experimentaldesign. In Multivariate Analysis-III., pp. 241-256. P.R. Krishnaiah ed.New York: Academic Press.

Brillinger, D.R. (1980). Analysis of variance and problems under time seriesmodels. In Handbook of Statistics, Vol I, pp. 237-278. P.R. Krishnaiahand D.R. Brillinger, eds. Amsterdam: North Holland.

Brillinger, D.R. (1981). Time Series: Data Analysis and Theory, 2nd ed. SanFrancisco: Holden-Day. Republished in 2001 by the Society for Industrialand Applied Mathematics, Philadelphia.

Brockwell, P.J. and R.A. Davis (1991). Time Series: Theory and Methods,2nd ed. New York: Springer-Verlag.

Bruce, A. and H-Y. Gao (1996). Applied Wavelet Analysis with S-PLUS. NewYork: Springer-Verlag.

Caines, P.E. (1988). Linear Stochastic Systems. New York: Wiley.

Carlin, B.P., N.G. Polson, and D.S. Stoffer (1992). A Monte Carlo approachto nonnormal and nonlinear state-space modeling. J. Am. Stat. Assoc.,87, 493-500.

Carter, C. K. and R. Kohn (1994). On Gibbs sampling for state space models.Biometrika, 81, 541-553.

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hy-pothesis based on the sum of the observations. Ann. Math. Stat., 25,573-578.

Cleveland, W.S. (1979). Robust locally weighted regression and smoothingscatterplots. J. Am. Stat. Assoc., 74, 829-836.

Cochrane, D. and G.H. Orcutt (1949). Applications of least squares regressionto relationships containing autocorrelated errors. J. Am. Stat. Assoc.,44, 32-61.

Cooley, J.W. and J.W. Tukey (1965). An algorithm for the machine compu-tation of complex Fourier series. Math. Comput., 19, 297-301.

Cressie, N.A.C. (1993). Statistics for Spatial Data. New York: Wiley.

Dahlhaus, R. (1989). Efficient parameter estimation for self-similar processes.Ann. Stat., 17, 1749-1766.

Dargahi-Noubary, G.R. and P.J. Laycock (1981). Spectral ratio discriminantsand information theory. J. Time Series Anal., 16, 201-219.

Danielson, J. (1994). Stochastic volatility in asset prices: Estimation withsimulated maximum likelihood. J. Econometrics, 61, 375-400.

Daubechies, I. (1992). Ten Lectures on Wavelets. Philadelphia: CBMS-NSFRegional Conference Series in Applied Mathematics.

Davies, N., C.M. Triggs, and P. Newbold (1977). Significance levels of the Box-Pierce portmanteau statistic in finite samples. Biometrika, 64, 517-522.

Dent, W. and A.-S. Min. (1978). A Monte Carlo study of autoregressive-integrated-moving average processes. J. Econ., 7, 23-55.

Dempster, A.P., N.M. Laird and D.B. Rubin (1977). Maximum likelihood fromincomplete data via the EM algorithm. J. R. Stat. Soc. B, 39, 1-38.

Diggle, P.J., K.-Y. Liang, and S.L. Zeger (1994). The Analysis of LongitudinalData. Oxford: Clarendon Press.

Ding, Z., C.W.J. Granger, and R.F. Engle (1993). A long memory property ofstock market returns and a new model. J. Empirical Finance, 1, 83-106.

Donoho, D.L. and I.M. Johnstone (1994). Ideal spatial adaptation by waveletshrinkage. Biometrika, 81, 425-455.

Donoho, D.L. and I.M. Johnstone (1995). Adapting to unknown smoothnessvia wavelet shrinkage. J. of Am. Stat. Assoc., 90, 1200-1224.

Durbin, J. (1960). Estimation of parameters in time series regression models.J. R. Stat. Soc. B, 22, 139-153.

Durbin, J. and S.J. Koopman (2001). Time Series Analysis by State SpaceMethods Oxford: Oxford University Press.

Efron, B. and R. Tibshirani (1994). An Introduction to the Bootstrap. NewYork: Chapman and Hall.

Engle, R.F. (1982). Autoregressive conditional heteroscedasticity with esti-mates of the variance of United Kingdom inflation. Econometrica, 50,987-1007.

Engle, R.F., D. Nelson, and T. Bollerslev (1994). ARCH Models. In Handbookof Econometrics, Vol IV, pp. 2959-3038. R. Engle and D. McFadden, eds.Amsterdam: North Holland.

Fahrmeir, L. and G. Tutz (1994). Multivariate Statistical Modelling Based onGeneralized Linear Models. New York: Springer-Verlag.

Fox, R. and M.S. Taqqu (1986). Large sample properties of parameter esti-mates for strongly dependent stationary Gaussian time series. Ann. Stat.,14, 517-532.

Friedman, J.H. (1984). A Variable Span Smoother. Tech. Rep. No. 5, Lab.for Computational Statistics, Dept. Statistics, Stanford Univ., California.

Friedman, J.H. and W. Stuetzle. (1981). Projection pursuit regression. J.Am. Stat. Assoc., 76, 817-823.

Fruhwirth-Schnatter, S. (1994). Data Augmentation and Dynamic LinearModels. J. Time Series Anal., 15, 183–202.

Fuller, W.A. (1976). Introduction to Statistical Time Series. New York: Wiley.

Fuller, W.A. (1995). Introduction to Statistical Time Series, 2nd ed. NewYork: Wiley.


Gabr, M.M. and T. Subba-Rao (1981). The estimation and prediction of subsetbilinear time series models with applications. J. Time Series Anal., 2,155-171.

Gelfand, A.E. and A.F.M. Smith (1990). Sampling-based approaches to cal-culating marginal densities. J. Am. Stat. Assoc., 85, 398-409.

Gelman, A., J. Carlin, H. Stern, and D. Rubin (1995). Bayesian Data Analysis.London: Chapman and Hall.

Geman, S. and D. Geman (1984). Stochastic relaxation, Gibbs distributions,and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Ma-chine Intell., 6, 721-741.

Geweke, J.F. (1977). The dynamic factor analysis of economic time seriesmodels. In Latent Variables in Socio-Economic Models, pp 365-383. D.Aigner and A. Goldberger, eds. Amsterdam: North Holland.

Geweke, J.F. and K.J. Singleton (1981). Latent variable models for time se-ries: A frequency domain approach with an application to the PermanentIncome Hypothesis. J. Econ., 17, 287-304.

Geweke, J.F. and S. Porter-Hudak (1983). The estimation and application oflong-memory time series models. J. Time Series Anal., 4, 221-238.

Gilks, W.R., S. Richardson, and D.J. Spiegelhalter (eds.) (1996). MarkovChain Monte Carlo in Practice. London: Chapman and Hall.

Giri, N. (1965). On complex analogues of T 2 and R2 tests. Ann. Math. Stat.,36, 664-670.

Goldfeld, S.M. and R.E. Quandt (1973). A Markov model for switching re-gressions. J. Econ., 1, 3-16.

Goodman, N.R. (1963). Statistical analysis based on a certain multivariatecomplex Gaussian distribution. Ann. Math. Stat., 34, 152-177.

Goodrich, R.L. and P.E. Caines (1979). Linear system identification fromnonstationary cross-sectional data. IEEE Trans. Automat. Contr., AC-24, 403-411.

Gordon, K. and A.F.M. Smith (1990). Modeling and monitoring biomedicaltime series. J. Am. Stat. Assoc., 85, 328-337.

Gourieroux, C. (1997). ARCH Models and Financial Applications. New York:Springer-Verlag.

Granger, C.W. and R. Joyeux (1980). An introduction to long-memory timeseries models and fractional differencing. J. Time Series Anal., 1, 15-29.

Grether, D.M. and M. Nerlove (1970). Some properties of optimal seasonaladjustment. Econometrica, 38, 682-703.

Gupta N.K. and R.K. Mehra (1974). Computational aspects of maximumlikelihood estimation and reduction in sensitivity function calculations.IEEE Trans. Automat. Contr., AC-19, 774-783.


Hamilton, J.D. (1989). A new approach to the economic analysis of nonsta-tionary time series and the business cycle. Econometrica, 57, 357-384.

Hannan, E.J. (1970). Multiple Time Series. New York: Wiley.

Hannan, E.J. (1973). The asymptotic theory of linear time series models. J.Appl. Prob., 10, 130-145.

Hannan, E.J. and M. Deistler (1988). The Statistical Theory of Linear Systems.New York: Wiley.

Hansen, J. and S. Lebedeff (1987). Global trends of measured surface airtemperature. J. Geophys. Res., 92, 1345-1372.

Hansen, J. and S. Lebedeff (1988). Global surface air temperatures: Updatethrough 1987. J. Geophys. Lett., 15, 323-326.

Harrison, P.J. and C.F. Stevens (1976). Bayesian forecasting (with discussion).J. R. Stat. Soc. B, 38, 205-247.

Harvey, A.C. and P.H.J. Todd (1983). Forecasting economic time series withstructural and Box-Jenkins models: A case study. J. Bus. Econ. Stat., 1,299-307.

Harvey, A.C. and R.G. Pierse (1984). Estimating missing observations in eco-nomic time series. J. Am. Stat. Assoc., 79, 125-131.

Harvey, A.C. (1991). Forecasting, Structural Time Series Models and theKalman Filter. Cambridge: Cambridge University Press.

Harvey, A.C. (1993). Time Series Models. Cambridge, MA: MIT Press.

Harvey A.C., E. Ruiz and N. Shephard (1994). Multivariate stochastic volatil-ity models. Rev. Economic Studies, 61, 247-264.

Hastings, W.K. (1970). Monte Carlo sampling methods using Markov chainsand their applications. Biometrika, 57, 97-109.

Hosking, J.R.M. (1981). Fractional differencing. Biometrika, 68, 165-176.

Hurst, H. (1951). Long term storage capacity of reservoirs. Trans. Am. Soc.Civil Eng., 116, 778-808.

Hurvich, C.M. and S. Zeger (1987). Frequency domain bootstrap methods fortime series. Tech. Report 87-115, Department of Statistics and OperationsResearch, Stern School of Business, New York University.

Hurvich, C.M and C.-L. Tsai (1989). Regression and time series model selec-tion in small samples. Biometrika, 76, 297-307.

Hurvich, C.M. and K.I. Beltrao (1993). Asymptotics for the low-requencyoridnates of the periodogram for a long-memory time series. J. TimeSeries Anal., 14, 455-472.

Hurvich, C.M., R. Deo and J. Brodsky (1998). The mean squared error ofGeweke and Porter-Hudak’s estimator of the memory parameter of a long-memory time series. J. Time Series Anal., 19, 19-46.


Jacquier, E., N.G. Polson, and P.E. Rossi (1994). Bayesian analysis of sto-chastic volatility models. J. Bus. Econ. Stat., 12, 371-417.

Jazwinski, A.H. (1970). Stochastic Processes and Filtering Theory. New York:Academic Press.

Jenkins, G.M. and D.G. Watts. (1968). Spectral Analysis and Its Applications.San Francisco: Holden-Day.

Johnson, R.A. and D.W. Wichern (1992). Applied Multivariate StatisticalAnalysis, 3rd ed.. Englewood Cliffs, NJ: Prentice-Hall.

Jones, P.D. (1994). Hemispheric surface air temperature variations: A reanaly-sis and an update to 1993. J. Clim., 7, 1794-1802.

Jones, R.H. (1980). Maximum likelihood fitting of ARMA models to timeseries with missing observations. Technometrics, 22, 389-395.

Jones, R.H. (1984). Fitting multivariate models to unequally spaced data.In Time Series Analysis of Irregularly Observed Data, pp. 158-188. E.Parzen, ed. Lecture Notes in Statistics, 25, New York: Springer-Verlag.

Jones, R.H. (1993). Longitudinal Data With Serial Correlation : A State-SpaceApproach. London: Chapman and Hall.

Journel, A.G. and C.H. Huijbregts (1978). Mining Geostatistics. New York:Academic Press.

Juang, B.H. and L.R. Rabiner (1985). Mixture autoregressive hidden Markovmodels for speech signals, IEEE Trans. Acoust., Speech, Signal Process.,ASSP-33, 1404-1413.

Kakizawa, Y., R. H. Shumway, and M. Taniguchi (1998). Discrimination andclustering for multivariate time series. J. Am. Stat. Assoc., 93, 328-340.

Kalman, R.E. (1960). A new approach to linear filtering and prediction prob-lems. Trans ASME J. Basic Eng., 82, 35-45.

Kalman, R.E. and R.S. Bucy (1961). New results in filtering and predictiontheory. Trans. ASME J. Basic Eng., 83, 95-108.

Kay, S.M. (1988). Modern Spectral Analysis: Theory and Applications. Engle-wood Cliffs, NJ: Prentice-Hall.

Kazakos, D. and P. Papantoni-Kazakos (1980). Spectral distance measuringbetween Gaussian processes. IEEE Trans. Automat. Contr., AC-25, 950-959.

Khatri, C.G. (1965). Classical statistical analysis based on a certain multivari-ate complex Gaussian distribution. Ann. Math. Stat., 36, 115-119.

Kim S., N. Shephard and S. Chib (1998). Stochastic volatility: likelihoodinference and comparison with ARCH models. Rev. Economic Studies,65, p.361-393.

Kitagawa, G. (1987). Non-Gaussian state-space modeling of nonstationarytime series (with discussion). J. Am. Stat. Assoc., 82, 1032-1041, (C/R:p1041-1063; C/R: V83 p1231).


Kitagawa, G. and W. Gersch (1984). A smoothness priors modeling of timeseries with trend and seasonality. J. Am. Stat. Assoc., 79, 378-389.

Kitagawa, G. and W. Gersch (1996). Smoothness Priors Analysis of TimeSeries. New York: Springer-Verlag.

Kolmogorov, A.N. (1941). Interpolation and extrapolation von stationarenzufalligen folgen. Bull. Acad. Sci. U.R.S.S., 5, 3-14.

Krishnaiah, P.R., J.C. Lee, and T.C. Chang (1976). The distribution of likeli-hood ratio statistics for tests of certain covariance structures of complexmultivariate normal populations. Biometrika, 63, 543-549.

Kullback, S. and R.A. Leibler (1951). On information and sufficiency. Ann.Math. Stat., 22, 79-86.

Kullback, S. (1958). Information Theory and Statistics. Gloucester, MA: PeterSmith.

Lachenbruch, P.A. and M.R. Mickey (1968). Estimation of error rates in dis-criminant analysis. Technometrices, 10, 1-11.

Laird, N. and J. Ware (1982). Random-effects models for longitudinal data.Biometrics, 38, 963-974.

Lam, P.S. (1990). The Hamilton model with a general autoregressive com-ponent: Estimation and comparison with other modles of economic timeseries. J. Monetary Econ., 26, 409-432.

Lay, T. (1997). Research required to support comprehensive nuclear test bantreaty monitoring. National Research Council Report, National AcademyPress, 2101 Constitution Ave., Washington, DC 20055.

Levinson, N. (1947). The Wiener (root mean square) error criterion in filterdesign and prediction. J. Math. Phys., 25, 262-278.

Lindgren, G. (1978). Markov regime models for mixed distributions andswitching regressions. Scand. J. Stat., 5, 81-91.

Liu, L.M. (1991). Dynamic relationship analysis of U.S. gasoline and crude oilprices. J. Forecast., 10, 521-547.

Ljung, G.M. and G.E.P. Box (1978). On a measure of lack of fit in time seriesmodels. Biometrika, 65, 297-303.

Lutkepohl, H. (1985). Comparison of criteria for estimating the order of avector autoregressive process. J. Time Series Anal., 6, 35-52.

Lutkepohl, H. (1993). Introduction to Multiple Time Series Analysis, 2nd ed.Berlin: Springer-Verlag.

McQuarrie, A.D.R. and R.H. Shumway (1994). ASTSA for Windows.

McQuarrie, A.D.R. and C-L. Tsai (1998). Regression and Time Series ModelSelection, Singapore: World Scientific.

Mallows, C.L. (1973). Some comments on Cp. Technometrics, 15, 661-675.


McBratney, A.B. and R. Webster (1981). Detection of ridge and furrow patternby spectral analysis of crop yield. Int. Stat. Rev., 49, 45-52.

McCulloch, R.E. and R.S. Tsay (1993). Bayesian inference and predictionfor mean and variance shifts in autoregressive time series. J. Am. Stat.Assoc., 88, 968-978.

McDougall, A. J., D.S. Stoffer and D.E. Tyler (1997). Optimal transformationsand the spectral envelope for real-valued time series. J. Stat. Plan. Infer.,57, 195-214.

McLeod A.I. (1978). On the distribution of residual autocorrelations in Box-Jenkins models. J. R. Stat. Soc. B, 40, 296-302.

McLeod, A.I. and K.W. Hipel (1978). Preservation of the rescaled adustedrange, I. A reassessment of the Hurst phenomenon. Water Resour. Res.,14, 491-508.

Meinhold, R.J. and N.D. Singpurwalla (1983). Understanding the Kalmanfilter. Am. Stat., 37, 123-127.

Meinhold, R.J. and N.D. Singpurwalla (1989). Robustification of Kalman filtermodels. J. Am. Stat. Assoc., 84, 479-486.

Meng X.L. and Rubin, D.B. (1991). Using EM to obtain asymptotic variance–covariance matrices: The SEM algorithm. J. Am. Stat. Assoc., 86,899-909.

Metropolis N., A.W. Rosenbluth, M.N. Rosenbluth, A. H. Teller, and E. Teller(1953). Equations of state calculations by fast computing machines. J.Chem. Phys., 21, 1087-1091.

Mickens, R.E. (1987). Difference Equations. New York: Van Nostrand Rein-hold.

Newbold, P. and T. Bos (1985). Stochastic Parameter Regression Models.Beverly Hills: Sage.

Ogawa, S., T.M. Lee, A. Nayak and P. Glynn (1990). Oxygenation-sensititivecontrast in magnetic resonance image of rodent brain at high magneticfields. Magn. Reson. Med., 14, 68-78.

Palma, W. and N.H. Chan (1997). Estimation and forecasting of long-memorytime series with missing values. J. Forecast., 16, 395-410.

Paparoditis, E. and Politis, D.N. (1999). The local bootstrap for periodogramstatistics. J. Time Series Anal., 20, 193-222.

Parker, D.E., P.D. Jones, A. Bevan and C.K. Folland (1994). Interdecadalchanges of surface temperature since the late 19th century. J. GeophysicalResearch, 90, 14373-14399.

Parker, D.E., C.K. Folland and M. Jackson (1995). Marine surface temper-ature: observed variations and data requirements. Climatic Change, 31,559-60


Parzen, E. (1961). Mathematical considerations in the estimation of spectra.Technometrics, 3, 167-190.

Parzen, E. (1983). Autoregressive spectral estimation. In Time Series in theFrequency Domain, Handbook of Statistics, Vol. 3, pp. 211-243. D.R.Brillinger and P.R. Krishnaiah eds. Amsterdam: North Holland.

Pawitan, Y. and R.H. Shumway (1989). Spectral estimation and deconvolutionfor a linear time series model. J. Time Series Anal., 10, 115-129.

Pena, D. and I. Guttman (1988). A Bayesian approach to robustifying theKalman filter. In Bayesian Analysis of Time Series and Dynamic LinearModels, pp. 227-254. J.C. Spall, ed. New York: Marcel Dekker.

Percival, D.B. and A.T. Walden (1993). Spectral Analysis for Physical Appli-cations: Multitaper and Conventional Univariate Techniques Cambridge:Cambridge University Press.

Percival, D.B. and A.T. Walden (2000). Wavelet Methods for Time SeriesAnalysis. Cambridge: Cambridge University Press.

Pinsker, M.S. (1964). Information and Information Stability of Random Vari-ables and Processes, San Francisco: Holden Day.

Pole, P.J. and M. West (1988). Nonnormal and nonlinear dynamic Bayesianmodeling. In Bayesian Analysis of Time Series and Dynamic Linear Mod-els, pp. 167-198. J.C. Spall, ed. New York: Marcel Dekker.

Press, W.H., S.A. Teukolsky, W. T. Vetterling, and B.P. Flannery (1993).Numerical Recipes in C: The Art of Scientific Computing, 2nd ed. Cam-bridge: Cambridge University Press.

Priestley, M.B., T. Subba-Rao and H. Tong (1974). Applications of principalcomponents analysis and factor analysis in the identification of multi-variable systems. IEEE Trans. Automat. Contr., AC-19, 730-734.

Priestley, M.B. and T. Subba-Rao (1975). The estimation of factor scoresand Kalman filtering for discrete parameter stationary processes. Int. J.Contr., 21, 971-975.

Priestley, M.B. (1981). Spectral Analysis and Time Series. Vol. 1: UnivariateSeries; Vol 2: Multivariate Series, Prediction and Control. New York:Academic Press.

Priestley, M.B. (1988). Nonlinear and Nonstationary Time Series Analysis.London: Academic Press.

Quandt, R.E. (1972). A new approach to estimating switching regressions. J.Am. Stat. Assoc., 67, 306-310.

Rabiner, L.R. and B.H. Juang (1986). An introduction to hidden Markovmodels, IEEE Acoust., Speech, Signal Process., ASSP-34, 4-16.

Rao, C.R. (1973). Linear Statistical Inference and Its Applications. New York:Wiley.


Rauch, H.E., F. Tung, and C.T. Striebel (1965). Maximum likelihood estima-tion of linear dynamic systems. J. AIAA, 3, 1445-1450.

Reinsel, G.C. (1997). Elements of Multivariate Time Series Analysis, 2nd ed.New York: Springer-Verlag.

Renyi, A. (1961). On measures of entropy and information. In Proceedings of4th Berkeley Symp. Math. Stat. and Probability, pp. 547-561, Berkeley:Univ. of California Press.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14,465-471.

Robinson, P.M. (1995). Gaussian semiparametric estimation of long rangedependence. Ann. Stat., 23, 1630-1661.

Rosenblatt, M. (1956). A central limit theorem and a strong mixing condition.Proc. Nat. Acad. Sci., 42, 43-47.

Royston, P. (1982). An extension of Shapiro and Wilk’s W test for normalityto large samples. Applied Statistics, 31, 115-124.

Sandmann, G. and S.J. Koopman (1998). Estimation of stochastic volatilitymodels via Monte Carlo maximum likelihood. J. Econometrics, 87 , 271-301.

Sargan, J.D. (1964). Wages and prices in the United Kingdom: A study ineconometric methodology. In Econometric Analysis for National Eco-nomic Planning, eds. P. E. Hart, G. Mills and J. K. Whitaker. Lon-don: Butterworths. reprinted in Quantitative Economics and Economet-ric Analysis, pp. 275-314, eds. K. F. Wallis and D. F. Hendry (1984).Oxford: Basil Blackwell.

Scheffe, H. (1959). The Analysis of Variance. New York: Wiley.

Schuster, A. (1898). On the investigation of hidden periodicities with applica-tion to a supposed 26 day period of meteorological phenomena. TerrestrialMagnetism, III, 11-41.

Schuster, A. (1906). On the periodicities of sunspots. Phil. Trans. R. Soc.,Ser. A, 206, 69-100.

Schwarz, F. (1978). Estimating the dimension of a model. Ann. Stat., 6,461-464.

Schweppe, F.C. (1965). Evaluation of likelihood functions for Gaussian signals.IEEE Trans. Inform. Theory, IT-4, 294-305.

Seber, G.A.G. (1977). Linear Regression Analysis. New York: Wiley.

Shephard, N. (1996). Statistical aspects of ARCH and stochastic volatility.In Time Series Models in Econometrics, Finance and Other Fields , pp1-100. D.R. Cox, D.V. Hinkley, and O.E. Barndorff-Nielson eds. London:Chapman and Hall.

Shumway, R.H. and W.C. Dean (1968). Best linear unbiased estimation formultivariate stationary processes. Technometrics, 10, 523-534.


Shumway, R.H. (1970). Applied regression and analysis of variance for sta-tionary time series. J. Am. Stat. Assoc., 65, 1527-1546.

Shumway, R.H. (1971). On detecting a signal in N stationarily correlated noiseseries. Technometrics, 10, 523-534.

Shumway, R.H. and A.N. Unger (1974). Linear discriminant functions forstationary time series. J. Am. Stat. Assoc., 69, 948-956.

Shumway, R.H. (1982). Discriminant analysis for time series. In Classification,Pattern Recognition and Reduction of Dimensionality, Handbook of Statis-tics Vol. 2, pp. 1-46. P.R. Krishnaiah and L.N. Kanal, eds. Amsterdam:North Holland.

Shumway, R.H. and D.S. Stoffer (1982). An approach to time series smoothingand forecasting using the EM algorithm. J. Time Series Anal., 3, 253-264.

Shumway, R.H. (1983). Replicated time series regression: An approach tosignal estimation and detection. In Time Series in the Frequency Domain,Handbook of Statistics Vol. 3, pp. 383-408. D.R. Brillinger and P.R.Krishnaiah, eds. Amsterdam: North Holland.

Shumway, R.H. (1988). Applied Statistical Time Series Analysis. EnglewoodCliffs, NJ: Prentice-Hall.

Shumway, R.H., R.S. Azari, and Y. Pawitan (1988). Modeling mortality fluc-tuations in Los Angeles as functions of pollution and weather effects. En-viron. Res., 45, 224-241.

Shumway, R.H. and D.S. Stoffer (1991). Dynamic linear models with switching.J. Am. Stat. Assoc., 86, 763-769, (Correction: V87 p. 913).

Shumway, R.H. and K.L. Verosub (1992). State space modeling of paleocli-matic time series. In Pro. 5th Int. Meeting Stat. Climatol.. Toronto, pp.22-26, June, 1992.

Shumway, R.H. (1996). Statistical approaches to seismic discrimination. InMonitoring a Comprehensive Test Ban Treaty, pp. 791-803. A.M. Daintyand E.S. Husebye eds. Doordrecht, The Netherlands: Kluwer Academic

Shumway, R.H., S.E. Kim and R.R. Blandford (1999). Nonlinear estimationfor time series observed on arrays. Chapter 7, S. Ghosh, ed. Asymptotics,Nonparametrics and Time Series, pp. 227-258. New York: Marcel Dekker.

Small, C.G. and D.L. McLeish (1994). Hilbert Space Methods in Probabilityand Statistical Inference. New York: Wiley.

Smith, A.F.M. and M. West (1983). Monitoring renal transplants: An appli-cation of the multiprocess Kalman filter. Biometrics, 39, 867-878.

Spliid, H. (1983). A fast estimation method for the vector autoregressive mov-ing average model with exogenous variables. J. Am. Stat. Assoc., 78,843-849.

Stoffer, D.S. (1982). Estimation of Parameters in a Linear Dynamic Systemwith Missing Observations. Ph.D. Dissertation. Univ. California, Davis.


Stoffer, D.S. (1987). Walsh-Fourier analysis of discrete-valued time series. J.Time Series Anal., 8, 449-467.

Stoffer, D.S., M. Scher, G. Richardson, N. Day, and P. Coble (1988). A Walsh-Fourier analysis of the effects of moderate maternal alcohol consumptionon neonatal sleep-state cycling. J. Am. Stat. Assoc., 83, 954-963.

Stoffer, D.S. and K.D. Wall (1991). Bootstrapping state space models: Gaussianmaximum likelihood estimation and the Kalman filter. J. Am. Stat. As-soc., 86, 1024-1033.

Stoffer, D.S., D.E. Tyler, and A.J. McDougall (1993). Spectral analysis forcategorical time series: Scaling and the spectral envelope. Biometrika, 80,611-622.

Stoffer, D.S. and D.E. Tyler (1998). Matching sequences: Cross-spectral analy-sis of categorical time series Biometrika, 85, 201-213.

Stoffer, D.S. (1999). Detecting common signals in multiple time series usingthe spectral envelope. J. Am. Stat. Assoc., 94, 1341-1356.

Stoffer, D.S. and K.D. Wall (2004). Resampling in State Space Models. InState Space and Unobserved Component Models Theory and Applications,Chapter 9, pp. 227-258. Andrew Harvey, Siem Jan Koopman, and NeilShephard, eds. Cambridge: Cambridge University Press.

Subba-Rao, T. (1981). On the theory of bilinear time series models. J. R.Stat. Soc. B, 43, 244-255.

Sugiura, N. (1978). Further analysis of the data by Akaike’s information cri-terion and the finite corrections, Commun. Statist, A, Theory Methods, 7,13-26.

Taniguchi, M., M.L. Puri, and M. Kondo (1994). Nonparametric approach fornon-Gaussian vector stationary processes. J. Mult. Anal., 56, 259-283.

Tanner, M. and W.H. Wong (1987). The calculation of posterior distributionsby data augmentation (with discussion). J. Am. Stat. Assoc., 82, 528-554.

Tiao, G.C. and R.S. Tsay (1989). Model specification in multivariate timeseries (with discussion). J. Roy. Statist. Soc. B, 51, 157-213.

Tiao, G. C. and R.S. Tsay (1994). Some advances in nonlinear and adaptivemodeling in time series analysis. J. Forecast., 13, 109-131.

Tiao, G.C., R.S. Tsay and T .Wang (1993). Usefulness of linear transforma-tions in multivariate time series analysis. Empir. Econ., 18, 567-593.

Tierney, L. (1994). Markov chains for exploring posterior distributions (withdiscussion). Ann. Stat., 22, 1701-1728.

Tong, H. (1983). Threshold Models in Nonlinear Time Series Analysis. SpringerLecture Notes in Statistics, 21. New York: Springer-Verlag.

Tong, H. (1990). Nonlinear Time Series: A Dynamical System Approach.Oxford: Oxford Univ. Press.


Tsay, R. (1987). Conditional hetereroscadasticity in time series analysis. J.Am. Stat. Assoc., 82, 590-604.

Venables, W.N. and B.D. Ripley (1994). Modern Applied Statistics with S-Plus. New York: Springer-Verlag.

Walker, G. (1931). On periodicity in series of related terms. Proc. R. Soc.Lond., Ser. A, 131, 518-532.

Watson, G.S. (1966). Smooth regression analysis. Sankhya, 26, 359-378.

Weiss, A.A. (1984). ARMA models with ARCH errors. J. Time Series Anal., 5, 129-143.

West, M. and J. Harrison (1997). Bayesian Forecasting and Dynamic Models, 2nd ed. New York: Springer-Verlag.

Whittle, P. (1961). Gaussian estimation in stationary time series. Bull. Int. Stat. Inst., 33, 1-26.

Wiener, N. (1949). The Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications. New York: Wiley.

Wu, C.F. (1983). On the convergence properties of the EM algorithm. Ann. Stat., 11, 95-103.

Young, P.C. and D.J. Pedregal (1998). Macro-economic relativity: Government spending, private investment and unemployment in the USA. Centre for Research on Environmental Systems and Statistics, Lancaster University, U.K.

Yule, G.U. (1927). On a method of investigating periodicities in disturbed series with special reference to Wolfer's Sunspot Numbers. Phil. Trans. R. Soc. Lond., A226, 267-298.


Index

ACF, 22, 25: large sample distribution, 30, 519; multidimensional, 37; of an AR(p), 106; of an AR(1), 87; of an AR(2), 100; of an ARMA(1,1), 105; of an MA(q), 104; sample, 30
AIC, 53, 153: multivariate case, 302
AICc, 54, 153, 229: multivariate case, 302
Aliasing, 12
Amplitude, 176: of a filter, 225
Analysis of Power, see ANOPOW
ANOPOW, 423, 431, 432: designed experiments, 438
AR model, 14, 85: bootstrap, 137; conditional likelihood, 127; conditional sum of squares, 127; estimation, large sample distribution, 123, 529; likelihood, 126; maximum likelihood estimation, 126; missing data, 409; operator, 86; polynomial, 94; spectral density, 186, 228; threshold, 290; unconditional sum of squares, 126; vector, see VAR; with observational noise, 328

ARCH model: ARCH(m), 285; ARCH(1), 281; estimation, 282; GARCH, 286, 388
ARFIMA model, 272, 276
ARIMA model, 141: fractionally integrated, 276; multiplicative seasonal models, 159; multivariate, 302
ARMA model, 93: ψ-weights, 102; backcasts, 121; behavior of ACF and PACF, 109; bootstrap, 359; causality of, 95; conditional least squares, 128, 130; forecasts, 116; forecast mean square prediction error, 117; forecasts based on infinite past, 116; prediction intervals, 119; truncated prediction, 118; Gauss–Newton, 131; in state-space form, 357; invertibility of, 96; large sample distribution of estimators, 133; likelihood, 128; MLE, 128; multiplicative seasonal model, 156; pure seasonal model, 155; behavior of ACF and PACF for pure seasonal models, 156; unconditional least squares, 128, 131; vector, see VARMA model

ARMAX model, 305, 317, 356: bootstrap, 359; for cross-sectional data, 395; in state-space form, 356; with time-varying parameters, 397
Autocorrelation function, see ACF
Autocovariance function, 20, 25, 87: multidimensional, 36, 37; random sum of sines and cosines, 177; sample, 30
Autocovariance matrix, 35: sample, 36
Autoregressive Integrated Moving Average model, see ARIMA model
Autoregressive models, see AR model
Autoregressive Moving Average models, see ARMA model

Backcasting, 120
Backshift operator, 61
Bandwidth, 197, 214
Bartlett kernel, 207
Beam, 427
Best linear predictor, see BLP
BIC, 54
BLP, 111: definition, 111; m-step-ahead prediction, 115 (mean square prediction error, 115); one-step-ahead prediction, 112 (mean square prediction error, 112); stationary processes, 111
Bone marrow transplant series, 325, 352
Bonferroni inequality, 203
Bootstrap, 137, 198, 229, 359: stochastic volatility, 391
Bounded in probability (Op), 504

Canonical correlation, 309Cauchy sequence, 522Cauchy–Schwarz inequality, 501, 522Causal, 89, 95, 526

conditions for an AR(2), 97vector model, 306

CCF, 23, 27large sample distribution, 31sample, 31

Central Limit Theorem, 509M-dependent, 511

Cepstral analysis, 263Characteristic function, 507Chernoff information, 459Cluster analysis, 461Coherence, 216

estimation, 218hypothesis test, 218, 554multiple, 420

Completeness of L2, 502Complex normal distribution, 550Complex roots, 101Conditional least squares, 128Convergence in distribution, 506

Basic Approximation Theorem,507

Convergence in probability, 504Convolution, 220Cosine transform

large sample distribution, 539of a vector process, 416properties, 192

Cospectrum, 215of a vector process, 416

Cramer–Wold device, 507Cross-correlation function, see CCFCross-covariance function, 22

sample, 31


Cross-spectrum, 215
Cycle, 176
Daniell kernel, 204, 205
    modified, 205
Deconvolution, 435
Density function, 19
Designed experiments, see ANOPOW
Deterministic process, 532
Detrending, 49
DFT, 69
    inverse, 188
    large sample distribution, 539
    multidimensional, 257
    of a vector process, 416
        likelihood, 417
Differencing, 60, 61
Discrete wavelet transform, see DWT
Discriminant analysis, 451
DLM, 324, 355
    Bayesian approach, 376
    bootstrap, 359
    for cross-sectional data, 396
    form for mixed linear models, 400
    innovations form, 358
    maximum likelihood estimation
        large sample distribution, 347
        via EM algorithm, 344, 350
        via Newton–Raphson, 340
    MCMC methods, 379
    observability, 346
    observation equation, 325
    state equation, 324
    steady-state, 346
    with switching, 362
        EM algorithm, 370
        maximum likelihood estimation, 369
DNA series, 482, 488
Durbin–Levinson algorithm, 113
DWT, 239
Dynamic Fourier analysis, 232
Earthquake series, 10, 210, 232, 241, 245, 414, 448, 454, 460, 463
EEG sleep data, 480
EM algorithm, 342
    complete data likelihood, 343
    DLM with missing observations, 350
    expectation step, 343
    maximization step, 344
Explosion series, 10, 210, 232, 241, 245, 414, 448, 454, 460, 463
Exponentially Weighted Moving Averages, 142
Factor analysis, 470
    EM algorithm, 472
Federal Reserve Board Index
    production, 160
    unemployment, 160
Fejer kernel, 207, 214
FFT, 69
Filter, 61
    amplitude, 225, 226
    band-pass, 255
    design, 255
    high-pass, 222, 255
    linear, 220
    low-pass, 223, 255
    matrix, 227
    optimum, 252
    phase, 225, 226
    recursive, 255
    seasonal adjustment, 255
    spatial, 256
    time-invariant, 502
fMRI, see Functional magnetic resonance imaging series
Folding frequency, 177
Fourier transform, 184
    discrete, see DFT
    fast, see FFT
    finite, see DFT
    pairs, 184
Fractional difference, 62, 272
    fractional noise, 272
Frequency bands, 183, 197
Frequency response function, 221


    of a first difference filter, 222
    of a moving average filter, 222

Functional magnetic resonance imaging series, 9, 413, 440, 442, 445, 469, 474
Gaussian distribution, 19
Gibbs sampler, see MCMC
Glacial varve series, 62, 132, 151, 274
Global temperature series, 5, 58, 62, 327
Gradient vector, 341, 408
Growth rate, 143, 280
Hessian matrix, 341, 408
Hidden Markov model, 364, 368
    estimation, 370
Hilbert space, 522
    closed span, 523
    conditional expectation, 525
    projection mapping, 523
    regression, 524
Homogeneous difference equation
    first order, 98
    general solution, 102
    second order, 98
        solution, 99
Impulse response function, 220
Influenza series, 290, 371
Infrasound series, 427, 429, 432, 436, 437
Inner product space, 522
Innovations, 148, 331, 339
    standardized, 148
    steady-state, 346
Innovations algorithm, 115
Interest rate and inflation rate series, 359
Invertible, 92
    vector model, 306
J-divergence measure, 461
Johnson & Johnson quarterly earnings series, 4, 352
Joint distribution function, 18
Kalman filter, 331
    correlated noise, 356
    innovations form, 358
    Riccati equation, 346
    stability, 345
    with missing observations, 348
    with switching, 366
Kalman smoother, 335, 406
    for the lag-one covariance, 337
    with missing observations, 348
Kullback–Leibler information, 79, 458
LA Pollution – Mortality Study, 54, 75, 77, 294, 303, 305, 311, 318
Lag, 22, 28
Lagged regression model, 296
Lead, 28
Leakage, 208
    sidelobe, 208
Least squares estimation, see LSE
Likelihood
    AR(1) model, 126
    conditional, 127
    innovations form, 128, 340
Linear filter, see Filter
Linear process, 28, 95
Ljung–Box–Pierce statistic, 149
Local level model, 333, 336
Long memory, 62, 272
    estimation, 274
    estimation of d, 278
    spectral density, 277
Longitudinal data, 394, 400
LSE, 51
    conditional sum of squares, 127
    Gauss–Newton, 130
    unconditional, 126
MA model, 13, 90
    autocovariance function, 21, 104
    Gauss–Newton, 132
    mean function, 19
    operator, 91
    polynomial, 94


    spectral density, 185
Markov chain Monte Carlo, see MCMC
Maximum likelihood estimation, see MLE
MCMC, 377
    nonlinear and non-Gaussian state-space models, 381, 384
    rejection sampling, 379
Mean function, 19
Mean square convergence, 501
Method of moments estimators, see Yule–Walker
Minimum mean square error predictor, 110
Missing data, 348, 350
Mixed linear models, 400
MLE
    ARMA model, 128
    conditional likelihood, 127
    DLM, 340
    state-space model, 340
    via EM algorithm, 342
    via Newton–Raphson, 129, 340
    via scoring, 129
Mortality series, see LA Pollution – Mortality Study
Moving average model, see MA model
New York Stock Exchange, 6, 390
Newton–Raphson, 129
Normal distribution, 19
    multivariate, 550
NYSE, see New York Stock Exchange
Order in probability op, 504
Orthogonality property, 523
PACF, 107
    iterative solution, 114
    large sample results, 123
    of an AR(p), 108
    of an AR(1), 107
    of an MA(q), 108
    of an MA(1), 108
Parameter redundancy, 94
Partial autocorrelation function, see PACF
Partial autoregression matrices, 309
Partial canonical correlation, 310
Parzen window, 213
Period, 176
Periodogram, 70, 188
    distribution, 193
    matrix, 457
    scaled, 68
Phase, 177
    of a filter, 225
Pitch period, 6
Pollution series, see LA Pollution – Mortality Study
Prediction equations, 111
Prenatal smoking and growth series, 397
Prewhiten, 297
Principal components, 464
Projection Theorem, 523
Quadspectrum, 215
    of a vector process, 416
Random sum of sines and cosines, 177, 534, 536
Random walk, 15, 19, 24
    autocovariance function, 22
Recruitment series, 7, 33, 65, 109, 120, 194, 199, 205, 219, 248, 298
Regression
    ANOVA table, 52
    autocorrelated errors, 293
        Cochrane–Orcutt procedure, 294
    for jointly stationary series, 417
        ANOPOW table, 423
    Hilbert space, 524
    lagged, 245
    model, 49
    multiple correlation, 53
    multivariate, 302


    normal equations, 51
    periodic, 72
    polynomial, 72
    random coefficients, 434
    spectral domain, 417
    stochastic, 359, 434
        ridge correction, 435
    with deterministic inputs, 426
Return, 6, 143, 280
Riesz–Fisher Theorem, 502
Scaling, 480
Scatterplot matrix, 57, 64, 65
Scatterplot smoothers, 72
    kernel, 74
    lowess, 75, 77
    nearest neighbors, 75
    splines, 76, 77
Score vector, 341
Shasta Lake series, 412, 423
SIC, 54, 153, 229
    multivariate case, 302, 304
Signal plus noise, 16, 17, 251, 427
    mean function, 20
Signal-to-noise ratio, 16, 252
Sinc kernel, 214
Sine transform
    large sample distribution, 539
    of a vector process, 416
    properties, 192
Soil surface temperature series, 36, 38, 257
Southern Oscillation Index, 7, 33, 65, 194, 199, 205, 209, 219, 222, 229, 248, 254, 298
Spectral density, 183
    autoregression, 229
    estimation, 197
        adjusted degrees of freedom, 198
        bandwidth stability, 203
        confidence interval, 198
        degrees of freedom, 198
        large sample distribution, 197
        nonparametric, 228
        parametric, 228
        resolution, 203
    matrix, 217
        linear filter, 227
    of a filtered series, 221
    of a moving average, 185
    of an AR(2), 186
    of white noise, 184
    wavenumber, 256
Spectral distribution function, 182
Spectral envelope, 479
    categorical time series, 484
    real-valued time series, 489
Spectral Representation Theorem, 181, 182, 534, 536, 537
    vector process, 217, 537
Speech series, 6, 33
State-space model
    Bayesian approach, 333, 376
    general, 377
    linear, see DLM
    non-Gaussian, 377, 384
    nonlinear, 377, 384
        MCMC methods, 381
Stationary
    Gaussian series, 29
    jointly, 27
    strictly, 23
    weakly, 24
Stochastic process, 11
    realization, 11
Stochastic regression, 359
Stochastic trend, 141
Stochastic volatility
    bootstrap, 391
Stochastic volatility model, 388
    estimation, 390
Structural component model, 352, 371
Taper, 207, 209
    cosine bell, 207
Taylor series expansion in probability, 506
Tchebycheff inequality, 501


Temperature series, see LA Pollution – Mortality Study
Threshold autoregressive model, 290
Time series, 11
    categorical, 484
    complex-valued, 464
    multidimensional, 36, 256
    multivariate, 23, 35
    two-dimensional, 256
Transfer function model, 296
Transformation
    Box-Cox, 62
    via spectral envelope, 491
Triangle inequality, 522
Tukey-Hanning window, 214
U.S. GNP series, 144, 150, 154, 283, 490
U.S. macroeconomic series, 477
U.S. population series, 153
Unconditional least squares, 128
VAR model, 303, 304
    estimation
        large sample distribution, 316
    operator, 306
Variogram, 39, 45
VARMA model, 305
    autocovariance function, 306
    estimation
        Spliid algorithm, 317
    identifiability of, 309
Varve series, 278
VMA model, 306
    operator, 306
Volatility, 6, 280
Wavelet analysis, 235
    waveshrink, 244
Wavenumber spectrum, 256
    estimation, 257
Weak law of large numbers, 504
White noise, 12
    autocovariance function, 21
    Gaussian, 12
    vector, 303
Whittle likelihood, 231, 457
Wold Decomposition, 532
Yule–Walker
    equations, 122
        vector model, 307, 309
    estimators, 122
        AR(2), 123
        MA(1), 125
        large sample results, 123


Springer Texts in Statistics (continued from page ii)

Lehmann and Romano: Testing Statistical Hypotheses, Third Edition
Lehmann and Casella: Theory of Point Estimation, Second Edition
Lindman: Analysis of Variance in Experimental Design
Lindsey: Applying Generalized Linear Models
Madansky: Prescriptions for Working Statisticians
McPherson: Applying and Interpreting Statistics: A Comprehensive Guide, Second Edition
Mueller: Basic Principles of Structural Equation Modeling: An Introduction to LISREL and EQS
Nguyen and Rogers: Fundamentals of Mathematical Statistics: Volume I: Probability for Statistics
Nguyen and Rogers: Fundamentals of Mathematical Statistics: Volume II: Statistical Inference
Noether: Introduction to Statistics: The Nonparametric Way
Nolan and Speed: Stat Labs: Mathematical Statistics Through Applications
Peters: Counting for Something: Statistical Principles and Personalities
Pfeiffer: Probability for Applications
Pitman: Probability
Rawlings, Pantula and Dickey: Applied Regression Analysis
Robert: The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, Second Edition
Robert and Casella: Monte Carlo Statistical Methods
Rose and Smith: Mathematical Statistics with Mathematica
Ruppert: Statistics and Finance: An Introduction
Santner and Duffy: The Statistical Analysis of Discrete Data
Saville and Wood: Statistical Methods: The Geometric Approach
Sen and Srivastava: Regression Analysis: Theory, Methods, and Applications
Shao: Mathematical Statistics, Second Edition
Shorack: Probability for Statisticians
Shumway and Stoffer: Time Series Analysis and Its Applications: With R Examples, Second Edition
Simonoff: Analyzing Categorical Data
Terrell: Mathematical Statistics: A Unified Introduction
Timm: Applied Multivariate Analysis
Toutenburg: Statistical Analysis of Designed Experiments, Second Edition
Wasserman: All of Nonparametric Statistics
Wasserman: All of Statistics: A Concise Course in Statistical Inference
Weiss: Modeling Longitudinal Data
Whittle: Probability via Expectation, Fourth Edition
Zacks: Introduction to Reliability Analysis: Probability Models and Statistical Methods