Statistics of stochastic processes
Academic year 2014-15
Group projects

For each problem, prepare a concise report and enclose the R script used. Each group has different problems: for problem A, group 1 has file A1.dat, group 2 has A2.dat, and so on. For problems B and C, the problems are listed below. Groups are free to ask the teachers of the course for limited suggestions.

A. Analyse the data in the appropriate file, decomposing it into a trend, a seasonal component and a stationary time series. Find an appropriate model for the time series, discussing the methods used. Use the model found to predict the 2 years following the end of the data.

B. All the problems concern real examples, often already analysed in the scientific literature. Their analysis will often involve techniques not examined (or only briefly mentioned) in the lectures. Try different models and methods of analysis, keeping in mind that the data are real and presumably there is no “correct” model. In all cases, test the predictive power of the methods: use the first 80-85% of the data to fit a model, and then check its performance on the final part (for financial time series, what can generally be predicted is the volatility of the series). The datasets come from different scientific areas; they have been randomly assigned to the groups, but groups may agree to exchange problems, provided they inform me.

1) On the web page http://www.esapubs.org/archive/ecol/E085/043/appendix-A.htm one can find data on the ibex population in Gran Paradiso park, together with data on winter climate conditions (length of snow cover…). First analyse the univariate time series of the population using methods discussed in the course; then, through appropriate regressions or methods for multivariate time series, study whether the population dynamics is influenced by any of the climatic variables.

2) The files MIB.csv and FTSE.csv contain data on the MIB index (Milan stock market) and the FTSE-100 index (London stock market) from 1998 to now.
As this type of data is unlikely to be stationary, it is usual to analyse the log returns, i.e. log(x_t / x_{t-1}), but other approaches are possible. Find the auto- and cross-correlation structure of the 2 time series, possibly dividing them into shorter parts that appear more homogeneous (in particular, it should be possible to identify two market shocks); analyse the time series, checking in particular whether ARCH and GARCH models are adequate. Find a formula for the conditional variance.

3) Data on electroencephalograms (EEG) are often analysed with the aim of distinguishing healthy brain activity from symptoms of several diseases. An extract from an EEG of a volunteer is in the file Z093.txt. Try different models to analyse the data, noting that, when fitting ARMA models, very high-order models are generally selected; nonlinear models have often been used. Analyse whether the series may be deemed stationary (in terms of the mean and the first few moments) or whether it is better to split it into shorter series.

4) The R package evir http://cran.r-project.org/web/packages/evir/index.html contains the time series of daily log returns on the BMW and Siemens share prices from 1973 to 1996. Find the auto- and cross-correlation structure of the 2 time series, possibly dividing them into shorter parts that appear more homogeneous; analyse the time series, checking in particular whether ARCH and GARCH models are adequate. Find a formula for the conditional variance.

5) The datasets gtemp and gtemp2 from the R package astsa http://cran.r-project.org/package=astsa are two different time series (the first attempting to keep track of both land and oceans, the second based only on land meteorological stations) of the mean Earth temperature in the years 1880-2009. Use information from both series to obtain a “best” estimate of the actual mean global temperatures. Examine the series for stationarity; if they are not stationary, try to identify periods with a discernible trend and estimate it.

6) A famous 1942 article by Elton and Nicholson contains data on lynx catches in Canada; in particular, Table 4 presents the most comprehensive and reliable data, those of the Hudson’s Bay Company from 9 regions from 1821 to 1939. The series can be found in digital form at the Global Population Dynamics Database; one of these time series (or a similar one) is present in the datasets of the standard implementation of R; the ser. Analyse the spectrum of some of these time series, as well as their cross-spectrum, detecting possible periodicities. Fit the data with an ARMA model, and also explore simple nonlinear models.
[The periodicity in the data has often been explained by the predator-prey interaction between hare and lynx; data on hares from the same area and period are difficult to find on the web, but those who are interested may look at the recent data from the Kluane project, at http://www.zoology.ubc.ca/~krebs/kluane.html ]

7) From the Kluane project http://www.zoology.ubc.ca/~krebs/kluane.html I have extracted two files, snowshoe_hare.csv and lynx_tracks.csv. The file “snowshoe_hare.csv” contains data on the estimated density of hares in the fall and spring of all years starting in 1976 (the column “All Controls”), as well as for two areas (“Silver” and “Sulphur”) in the region. The file “lynx_tracks.csv” contains data on the number of lynx tracks (which can be considered a proxy for their density) in the winters from 1987-88. First, analyse the autocorrelation and the spectrum of each time series, detecting possible periodicities, and fit the data with appropriate ARMA models. Consider the cross-correlations between sites and between hare and lynx densities. Consider exploring simple nonlinear models for the joint densities of lynx and hare. Note that the .csv files use the semicolon as field separator and the dot as decimal point; keep this in mind when reading the files. Note also that the two series have different yearly frequencies.
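The note on the file format in problem 7 can be made concrete with a small sketch. It is written in Python purely for illustration; the report itself calls for an R script, where `read.csv(..., sep = ";", dec = ".")` plays the same role. The sample rows and the `Year` and `Season` columns are invented here; only the column name `All Controls` comes from the text above.

```python
import csv
import io

# Hypothetical sample in the stated format: semicolon as field separator,
# dot as decimal point. The values and the "Year"/"Season" columns are
# invented for illustration and do not come from the Kluane files.
SAMPLE = """Year;Season;All Controls
1976;fall;0.45
1977;spring;0.30
1977;fall;0.82
"""


def read_semicolon_csv(handle):
    """Return the rows of a semicolon-separated file as a list of dicts."""
    return list(csv.DictReader(handle, delimiter=";"))


def numeric_column(rows, name):
    """Convert one column to floats (dot as decimal point)."""
    return [float(row[name]) for row in rows]


rows = read_semicolon_csv(io.StringIO(SAMPLE))
density = numeric_column(rows, "All Controls")
print(density)
```

For a file on disk, the same function would be called on `open("snowshoe_hare.csv", newline="")` instead of the in-memory sample.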
8) The file “UK_measles.csv” contains data on measles cases recorded in England every 2 weeks from 1944 to 1967 (before the start of mass vaccination). There are 60 columns corresponding to 60 different English towns, and a final column giving the total over all England. Most of the analysis can be conducted on the series of total English cases only; you are asked to analyse the autocorrelation and the spectrum; there is an obvious seasonal component, but there are other periodicities to be detected. Finally, discuss the trend (if identified), and fit the random part of the data with appropriate ARMA models, or (if possible) with some nonlinear model. It can also be worthwhile to look at the correlations between different towns, checking whether the correlation decreases with distance (this analysis can be restricted to a few towns whose distances can be quickly obtained). This dataset has been analysed in dozens of publications; for instance, Finkenstädt and Grenfell [1] analysed the total series, while Grenfell et al. (2001) [2] also looked at the spatial structure.

C.

1) Generate n = 100 elements of X, an AR(2) process with φ1 = 0.5, φ2 = −0.6, σ² = 0.25, and 100 elements of Y, an independent ARMA(1,1) process with φ1 = −0.5, θ1 = 0.8, σ² = 1. Finally, let U_t = X_t − Y_t + W_t (t = 1..100), where W_t is a sequence of 100 independent normal variates with mean 0 and standard deviation 0.1. Write this down as a state-space model where U_t is the observation variable. Assuming that the model is known perfectly, use the values of U_t (t = 1..90) to predict the values of X_t and Y_t for t = 90..100 through Kalman’s method. Compare the predictions with the actual values.

2) Generate 60 simulations, each with n = 100 elements, of an ARMA(1,2) process with φ1 = −0.5, θ1 = −0.5, θ2 = 0.7 and σ² = 0.49. For each simulation, estimate the coefficients of the model (by maximum likelihood) and find in how many simulations they fall within the theoretical confidence intervals.
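The simulations requested in section C all rest on the ARMA recursion. As a rough illustration of what such a simulation does (in R, `arima.sim` provides this directly), the following Python sketch generates an ARMA(p, q) series by recursion after a burn-in period. The function name `simulate_arma`, the burn-in length and the seed are arbitrary choices made for this sketch; the sign convention assumed is X_t = φ1 X_{t-1} + ... + Z_t + θ1 Z_{t-1} + ..., with Gaussian white noise Z_t of variance σ².

```python
import math
import random


def simulate_arma(phi, theta, sigma2, n, burn_in=200, seed=None):
    """Simulate n values of an ARMA(p, q) process by direct recursion.

    phi and theta are lists of AR and MA coefficients (assumed convention:
    MA terms enter with a plus sign). The first burn_in values are dropped
    so that the zero start-up values have little effect on the output.
    """
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    total = n + burn_in
    z = [rng.gauss(0.0, sd) for _ in range(total)]  # Gaussian white noise
    x = [0.0] * total
    for t in range(total):
        ar = sum(phi[i] * x[t - 1 - i]
                 for i in range(len(phi)) if t - 1 - i >= 0)
        ma = sum(theta[j] * z[t - 1 - j]
                 for j in range(len(theta)) if t - 1 - j >= 0)
        x[t] = ar + z[t] + ma
    return x[burn_in:]


# Parameters of problem C.2: phi1 = -0.5, theta1 = -0.5, theta2 = 0.7, sigma^2 = 0.49
series = simulate_arma([-0.5], [-0.5, 0.7], 0.49, n=100, seed=1)
print(len(series))
```

Repeating the call with different seeds gives the independent replications on which the coefficients can then be estimated.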
3) Generate 100 simulations, each with n = 60 elements, of an ARMA(1,2) process with φ1 = −0.5, θ1 = 0.5, θ2 = −0.7 and σ² = 0.25. For each simulation, estimate the first two correlation coefficients and find in how many simulations they fall within the theoretical confidence intervals.

4) Generate n = 100 elements of an ARMA(2,1) process with φ1 = −0.1, φ2 = −0.7, θ1 = 0.5 for σ² = 0.25, 0.49 or 1 (3 different simulations). For each simulation, estimate the first two correlation coefficients and compare them with the theoretical confidence intervals. Repeat the procedure assuming that the white noise in the ARMA process is not composed of normal independent variates (as default in R) but is distributed as a t with 6 df and the same variance [see an example in the help in R].

5) Generate n = 100 elements of an ARMA(2,1) process X with φ1 = 0.3, φ2 = −0.7, θ1 = 0.5 and σ² = 0.49. Use the innovations algorithm to predict the values of X_t for t = 1..100; compare with the actual values. Repeat the procedure assuming that the white noise in the ARMA process is not composed of normal independent variates (as default in R) but is distributed as a t with 6 df and the same variance [see an example in the help in R].

6) Generate n = 100 elements of X, an AR(1) process with φ = −0.5, σ² = 1, and 100 elements of Y, an independent ARMA(2,1) process with φ1 = −0.1, φ2 = −0.7, θ1 = 0.5, σ² = 0.25. Finally, let U_t = X_t + Y_t + W_t (t = 1..100), where W_t is a sequence of 100 independent normal variates with mean 0 and standard deviation 0.1. Write this down as a state-space model where U_t is the observation variable. Assuming that the model is known perfectly, use the values of U_t (t = 1..T) to predict the values X_T and Y_T for T = 10, 20, 30, 40, 50, ..., 100 through Kalman’s method. Compare the predictions with the actual values.

7) Generate 100 simulations, each with n = 60 elements, of an ARMA(1,2) process with mean µ = 1.25 and parameters φ1 = −0.5, θ1 = 0.5, θ2 = −0.7 and σ² = 0.5. For each simulation, estimate the mean and the first two correlation coefficients and find in how many simulations they fall within the theoretical confidence intervals.

8) Generate 100 simulations, each with n = 60 elements, of an AR(2) process with mean µ = 1.25 and parameters φ1 = −0.5, φ2 = 0.2 and σ² = 0.5. For each simulation, estimate (using both the Yule-Walker and the maximum likelihood methods) the mean and the other parameters of the model, look (for both methods) at the empirical distribution of the coefficients, and compare it with the theoretical distributions.

[1] Finkenstädt, B. F., Grenfell, B. T. (2000) Time series modelling of childhood diseases: a dynamical systems approach, J. Royal Stat. Soc. C 49, 187-205, doi:10.1111/1467-9876.00187
[2] Grenfell, B. et al. (2001) Travelling waves and spatial hierarchies in measles epidemics, Nature 414, 716-723, doi:10.1038/414716a
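Problems C.4 and C.5 ask for t-distributed noise with 6 degrees of freedom and the same variance as in the Gaussian case. A Student t variable with ν > 2 degrees of freedom has variance ν/(ν − 2), that is 1.5 for ν = 6, so the raw draws have to be multiplied by sqrt(σ²(ν − 2)/ν) to reach variance σ². A minimal Python sketch of this rescaling follows (in R the same factor can be applied to `rt(n, df = 6)`); the function name `scaled_t_noise`, the sample size and the seed are illustrative assumptions.

```python
import math
import random


def scaled_t_noise(n, df, sigma2, seed=None):
    """Draw n Student-t variates with df degrees of freedom, rescaled so that
    their theoretical variance equals sigma2 (requires df > 2)."""
    if df <= 2:
        raise ValueError("the variance of a t distribution is finite only for df > 2")
    rng = random.Random(seed)
    scale = math.sqrt(sigma2 * (df - 2) / df)
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        v = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
        out.append(scale * z / math.sqrt(v / df))  # t = Z / sqrt(V / df), rescaled
    return out


# Illustrative check with sigma^2 = 0.49: the sample variance should be close to 0.49.
noise = scaled_t_noise(20000, df=6, sigma2=0.49, seed=42)
mean = sum(noise) / len(noise)
var = sum((e - mean) ** 2 for e in noise) / (len(noise) - 1)
print(round(mean, 3), round(var, 3))
```

Such a noise sequence can then replace the Gaussian innovations when simulating the ARMA process.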