4. Forecasting with ARMA models
MSE optimal forecasts
• Let $\hat{Y}_{t+1}$ denote a one-step-ahead forecast given information $F_t$. For ARMA models, $F_t = \{Y_t, Y_{t-1}, \ldots\}$.
• The optimal MSE forecast minimizes
$$E\left[(Y_{t+1} - \hat{Y}_{t+1})^2\right]$$
• MSE is a convenient forecast evaluation criterion with nice theoretical properties.
• MSE is not always a good way to evaluate forecasts from an economic perspective.
– It is symmetric. Do you care the same about underestimating the time to walk to the bus vs. overestimating?
– In general, non‐linear payoff functions are not optimized by MSE.
The optimal MSE forecast is the conditional expectation
• To see this, let $g(X_t)$ denote any other forecast, and expand the MSE around the conditional expectation $E[Y_{t+1}|X_t]$:
$$E\left[(Y_{t+1} - g(X_t))^2\right] = E\left[\left(Y_{t+1} - E[Y_{t+1}|X_t] + E[Y_{t+1}|X_t] - g(X_t)\right)^2\right]$$
$$= E\left[(Y_{t+1} - E[Y_{t+1}|X_t])^2\right] + 2E\left[(Y_{t+1} - E[Y_{t+1}|X_t])(E[Y_{t+1}|X_t] - g(X_t))\right] + E\left[(E[Y_{t+1}|X_t] - g(X_t))^2\right]$$
• The middle (cross-product) term has expectation zero by the Law of Iterated Expectations (LIE): conditioning on $X_t$, the only random thing given $X_t$ is $Y_{t+1}$, so
$$E\left[(Y_{t+1} - E[Y_{t+1}|X_t])(E[Y_{t+1}|X_t] - g(X_t))\right] = E\Big[\underbrace{E\big[Y_{t+1} - E[Y_{t+1}|X_t]\,\big|\,X_t\big]}_{=0}\,(E[Y_{t+1}|X_t] - g(X_t))\Big] = 0$$
• So the middle term vanishes and we are left with:
$$E\left[(Y_{t+1} - g(X_t))^2\right] = E\left[(Y_{t+1} - E[Y_{t+1}|X_t])^2\right] + E\left[(E[Y_{t+1}|X_t] - g(X_t))^2\right]$$
• The smallest the second term can be is zero, and it is zero only when
$$g(X_t) = E[Y_{t+1}|X_t]$$
• So the optimal MSE forecast is the conditional expectation.
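A minimal simulation sketch of this result (not from the slides; the quadratic data-generating process and the alternative forecast $g(X) = 1 + X$ are made-up examples): the conditional mean attains a smaller MSE than any other forecast rule.

```python
# Sketch: with Y = X^2 + eps, the conditional mean E[Y|X] = X^2 beats any
# other forecast rule g(X) in MSE (here g(X) = 1 + X, chosen arbitrarily).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = x**2 + rng.normal(size=n)                 # true conditional mean is X^2

mse_cond_mean = np.mean((y - x**2) ** 2)      # MSE of the optimal forecast
mse_other = np.mean((y - (1 + x)) ** 2)       # MSE of some other forecast
print(f"E[Y|X]: {mse_cond_mean:.3f}  g(X)=1+X: {mse_other:.3f}")  # former is smaller
```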
Restricting our attention to linear models, a similar result holds for Least Squares estimators
• Let $\beta$ solve $E[(Y_{t+1} - \beta'X_t)X_t] = 0$ so that $\beta'X_t$ is the linear projection of $Y_{t+1}$ on $X_t$, or the least squares solution.
• A nearly identical proof shows that the linear projection provides the MSE optimal forecast in the class of linear predictors.
• Let $\beta'X_t$ denote the least squares solution and let $g'X_t$ denote any other linear combination.
$$E\left[(Y_{t+1} - g'X_t)^2\right] = E\left[(Y_{t+1} - \beta'X_t)^2\right] + 2E\left[(Y_{t+1} - \beta'X_t)(\beta - g)'X_t\right] + E\left[\left((\beta - g)'X_t\right)^2\right]$$
• The middle term again has expectation zero:
$$E\left[(Y_{t+1} - \beta'X_t)(\beta - g)'X_t\right] = (\beta - g)'E\left[(Y_{t+1} - \beta'X_t)X_t\right] = 0$$
• The resulting expression
$$E\left[(Y_{t+1} - g'X_t)^2\right] = E\left[(Y_{t+1} - \beta'X_t)^2\right] + E\left[\left((\beta - g)'X_t\right)^2\right]$$
is minimized when $g'X_t = \beta'X_t$.
• So if the true conditional expectation is linear, then the linear model is the MSE optimal forecast.
• If the true model is non‐linear, then the linear model is the best linear predictor.
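The same point can be checked numerically. A sketch under assumed data ($Y = X^2 + \varepsilon$ again, so the truth is non-linear): least squares recovers the best linear predictor, and its forecast errors satisfy the orthogonality condition above.

```python
# Sketch: OLS on a non-linear truth. The fit is the best linear predictor,
# and the residuals are (numerically) orthogonal to the regressors.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
y = x**2 + rng.normal(size=n)                  # non-linear conditional mean

X = np.column_stack([np.ones(n), x])           # regressors: constant and x
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares solution
resid = y - X @ beta

print("beta:", beta)                     # near (1, 0): E[X^2] = 1, Cov(X, X^2) = 0
print("E[resid * X_t]:", X.T @ resid / n)      # near zero: orthogonality condition
```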
Forecasting ARMA models
• Consider the ARMA model:
$$\Phi(L)Y_t = \Theta(L)\varepsilon_t$$
where
$$\Phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \cdots - \phi_p L^p$$
and
$$\Theta(L) = 1 + \theta_1 L + \theta_2 L^2 + \cdots + \theta_q L^q$$
• If the model is weakly stationary we get:
$$\Phi(L)^{-1}\Phi(L)Y_t = \Phi(L)^{-1}\Theta(L)\varepsilon_t$$
or
$$Y_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}$$
where $\psi_0 = 1$.
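The $\psi$ weights can be computed from the recursion implied by $\Phi(L)(\psi_0 + \psi_1 L + \cdots) = \Theta(L)$. A minimal sketch (the helper name and the ARMA(1,1) numbers are made up for illustration):

```python
# Sketch of the psi-weight recursion: psi_0 = 1 and, for j >= 1,
# psi_j = theta_j + sum_{i=1}^{min(j,p)} phi_i * psi_{j-i}, with theta_j = 0 for j > q.
import numpy as np

def arma_psi_weights(phi, theta, n_weights):
    """psi_j weights of Y_t = sum_j psi_j eps_{t-j} for an ARMA(p, q) model."""
    phi, theta = np.asarray(phi, float), np.asarray(theta, float)
    psi = np.zeros(n_weights)
    psi[0] = 1.0
    for j in range(1, n_weights):
        psi[j] = theta[j - 1] if j <= len(theta) else 0.0
        for i in range(1, min(j, len(phi)) + 1):
            psi[j] += phi[i - 1] * psi[j - i]
    return psi

# ARMA(1,1) with phi = 0.5, theta = 0.3: psi_j = (0.5 + 0.3) * 0.5**(j-1) for j >= 1
print(arma_psi_weights([0.5], [0.3], 5))   # [1.0, 0.8, 0.4, 0.2, 0.1]
```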
Multi‐step ahead forecast
• Or,
$$Y_{t+k} = \underbrace{\sum_{j=0}^{k-1}\psi_j\varepsilon_{t+k-j}}_{\text{future, after time } t} + \underbrace{\sum_{j=k}^{\infty}\psi_j\varepsilon_{t+k-j}}_{\text{past, at or before time } t}$$
• If the $\varepsilon$'s are independent, then conditional on past $\varepsilon$'s, future $\varepsilon$'s have expectation zero:
$$E[Y_{t+k}|F_t] = \sum_{j=k}^{\infty}\psi_j\varepsilon_{t+k-j}$$
• The forecast error is
$$Y_{t+k} - E[Y_{t+k}|F_t] = \sum_{j=0}^{k-1}\psi_j\varepsilon_{t+k-j}$$
• Since the forecast errors are uncorrelated with the "X's" used to predict, this is the best linear predictor.
If the $\varepsilon$'s are white noise (not iid) then the forecast is the optimal linear predictor
• Denote
$$\hat{Y}_{t+k} \equiv E[Y_{t+k}|F_t] = \sum_{j=k}^{\infty}\psi_j\varepsilon_{t+k-j}$$
• The forecast error is
$$Y_{t+k} - \hat{Y}_{t+k} = \sum_{j=0}^{k-1}\psi_j\varepsilon_{t+k-j}$$
• Since the forecast errors are uncorrelated with the "X's" (past epsilons) this is the best linear predictor.
• The forecast error, defined as the difference between the outcome and its forecast, is denoted by:
$$e_{t+k} = Y_{t+k} - E[Y_{t+k}|F_t] = \sum_{j=0}^{k-1}\psi_j\varepsilon_{t+k-j}$$
• The k-step ahead forecast error variance is then given by:
$$Var(e_{t+k}) = Var\left(\sum_{j=0}^{k-1}\psi_j\varepsilon_{t+k-j}\right) = \sigma^2\sum_{j=0}^{k-1}\psi_j^2$$
• If the errors are Gaussian then the 95% prediction interval is given by:
$$\hat{Y}_{t+k} \pm 2\sqrt{\sigma^2\sum_{j=0}^{k-1}\psi_j^2}$$
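A short sketch of these two formulas (the AR(1) weights $\psi_j = \phi^j$, $\sigma$, $k$, and the point forecast value are made-up inputs):

```python
# Sketch: forecast error variance and 95% interval from the psi weights.
import numpy as np

phi, sigma, k = 0.8, 1.0, 5
psi = phi ** np.arange(k)                     # psi_0, ..., psi_{k-1} for an AR(1)

fe_var = sigma**2 * np.sum(psi**2)            # Var(e_{t+k}) = sigma^2 * sum psi_j^2
y_hat = 2.0                                   # hypothetical point forecast
half = 2 * np.sqrt(fe_var)
print(f"Var(e): {fe_var:.3f}  95% PI: ({y_hat - half:.2f}, {y_hat + half:.2f})")
```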
Example
• Consider the AR(1) model
$$(1 - \phi L)Y_t = \phi_0 + \varepsilon_t$$
• Pre-multiply both sides of $Y_{t+k}$ by:
$$\left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^{k-1}L^{k-1}\right)$$
yielding
$$\left(1 - \phi^k L^k\right)Y_{t+k} = \sum_{j=0}^{k-1}\phi^j\phi_0 + \sum_{j=0}^{k-1}\phi^j\varepsilon_{t+k-j}$$
or
$$Y_{t+k} = \sum_{j=0}^{k-1}\phi^j\phi_0 + \phi^k Y_t + \sum_{j=0}^{k-1}\phi^j\varepsilon_{t+k-j}$$
• The k-step ahead forecast is:
$$E[Y_{t+k}|F_t] = \sum_{j=0}^{k-1}\phi^j\phi_0 + \phi^k Y_t$$
• The forecast error is
$$\sum_{j=0}^{k-1}\phi^j\varepsilon_{t+k-j}$$
• The forecast error variance is:
$$Var(e_{t+k}) = \sigma^2\sum_{j=0}^{k-1}\phi^{2j}$$
• Notice that we could pre-multiply by
$$\left(1 + \phi L + \phi^2 L^2 + \cdots + \phi^{k-1}L^{k-1}\right)$$
regardless of whether $|\phi|$ is less than one (we didn't actually use the inverse operator).
• If $|\phi| < 1$, the point forecast is
$$E[Y_{t+k}|F_t] = \phi_0\frac{1 - \phi^k}{1 - \phi} + \phi^k Y_t$$
and the forecast error variance is:
$$Var(e_{t+k}) = \sigma^2\sum_{j=0}^{k-1}\phi^{2j} = \sigma^2\frac{1 - \phi^{2k}}{1 - \phi^2}$$
• If $\phi = 1$, the point forecast is given by:
$$E[Y_{t+k}|F_t] = k\phi_0 + Y_t$$
• The forecast error variance is given by:
$$Var(e_{t+k}) = \sigma^2\sum_{j=0}^{k-1}1 = \sigma^2 k$$
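A sketch collecting the AR(1) closed forms from this slide and the previous one (the function name and numeric inputs are made up):

```python
# Sketch of the AR(1) closed-form forecasts, covering both |phi| < 1 and the
# unit-root case phi = 1 (random walk with drift).
def ar1_forecast(y_t, phi0, phi, sigma, k):
    """k-step point forecast and forecast error variance for an AR(1)."""
    if phi == 1.0:
        point = k * phi0 + y_t                       # E[Y_{t+k}|F_t] = k*phi0 + Y_t
        fe_var = sigma**2 * k                        # variance grows linearly in k
    else:
        point = phi0 * (1 - phi**k) / (1 - phi) + phi**k * y_t
        fe_var = sigma**2 * (1 - phi**(2 * k)) / (1 - phi**2)
    return point, fe_var

print(ar1_forecast(y_t=2.0, phi0=0.5, phi=0.8, sigma=1.0, k=5))
print(ar1_forecast(y_t=2.0, phi0=0.5, phi=1.0, sigma=1.0, k=5))  # unit-root case
```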
A simple algorithm for multi‐step ahead forecasts
• 1 step:
$$E[Y_{t+1}|F_t] = \sum_{j=1}^{p}\phi_j Y_{t+1-j} + \sum_{i=1}^{q}\theta_i\varepsilon_{t+1-i}$$
• 2 step:
$$E[Y_{t+2}|F_t] = \phi_1\underbrace{E[Y_{t+1}|F_t]}_{\text{one-step ahead forecast}} + \sum_{j=2}^{p}\phi_j Y_{t+2-j} + \theta_1\underbrace{E[\varepsilon_{t+1}|F_t]}_{=0} + \sum_{i=2}^{q}\theta_i\varepsilon_{t+2-i}$$
• 3 step:
$$E[Y_{t+3}|F_t] = \phi_1 E[Y_{t+2}|F_t] + \phi_2 E[Y_{t+1}|F_t] + \sum_{j=3}^{p}\phi_j Y_{t+3-j} + \sum_{i=3}^{q}\theta_i\varepsilon_{t+3-i}$$
• So once you have the one-step you can get the two-step, once you have the two-step you can get the three-step, and so on.
• Plug in conditional expected values for future values of $Y$ and zero for the expectation of future values of $\varepsilon$. Plug in known past values for $Y$ and $\varepsilon$ at or before time $t$. A minimal sketch of this recursion appears below.
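The sketch (the helper name and test values are made up; it assumes at least $p$ past $Y$'s and $q$ past $\varepsilon$'s are supplied):

```python
# Sketch of the plug-in recursion: future Y's are replaced by their own
# forecasts, future eps's by zero, and known past Y's and eps's used as-is.
def arma_forecasts(y_hist, eps_hist, phi, theta, k_max):
    """Return [E[Y_{t+1}|F_t], ..., E[Y_{t+k_max}|F_t]] for an ARMA(p, q)."""
    y, eps, out = list(y_hist), list(eps_hist), []
    for _ in range(k_max):
        f = sum(phi[j] * y[-(j + 1)] for j in range(len(phi)))         # AR part
        f += sum(theta[i] * eps[-(i + 1)] for i in range(len(theta)))  # MA part
        y.append(f)        # plug the forecast in for the unknown future Y
        eps.append(0.0)    # and zero for the expectation of the future eps
        out.append(f)
    return out

# AR(1) check: with phi = 0.8 and Y_t = 2, forecasts are 0.8**k * 2
print(arma_forecasts([2.0], [0.0], phi=[0.8], theta=[], k_max=3))  # [1.6, 1.28, 1.024]
```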
5. Maximum Likelihood Estimation
• Let’s start with a simple example.
• Suppose that we take an iid sample of size 10 of whether a voter will vote for candidate A.
• Suppose the data look like this:
1 0 1 1 1 0 0 1 0 1 where a 1 denotes “favor candidate A”.
• Let p denote the true fraction of voters that favor candidate A. p is the parameter we want to estimate.
• If we knew p, how likely would it be to observe this sequence of data?
• Well, they are all independent and identically distributed. The joint probability that $x_1 = 1$, $x_2 = 0$, $x_3 = 1$, ... for the sample 1 0 1 1 1 0 0 1 0 1 is given by:
$$p(1-p)ppp(1-p)(1-p)p(1-p)p = p^6(1-p)^4$$
• More generally, for a sample of size n we can write the probability of observing the sample as
$$p^{n_1}(1-p)^{n_0}$$
where $n_1$ is the number of 1's and $n_0 = n - n_1$ is the number of 0's.
• This is simply the joint probability of n outcomes.
• For a given value of p,
$$L(\text{data}|p) = p^{n_1}(1-p)^{n_0}$$
tells us how likely it is to observe this particular data set. We therefore call this the likelihood function.
• The goal of maximum likelihood is to find the parameter value p that maximizes the likelihood function.
• In doing so we are finding the parameter value that makes it most likely that we observe our given sample.
• The log function is a monotonically increasing function. This means that the value of p that maximizes the likelihood function L is the same as the value of p that maximizes the log of the likelihood function:
$$\ell(\text{data}|p) = \ln L = n_1\ln p + n_0\ln(1-p)$$
• We can maximize the likelihood by taking the derivative and setting it to zero.
• The p that maximizes the log likelihood function will be our estimate of p, so we denote it by $\hat{p}$.
• Taking the derivative and setting it to zero yields:
$$\frac{d\,\ln L(\text{data}|p)}{dp} = \frac{n_1}{p} - \frac{n_0}{1-p} = 0$$
• The solution is:
$$\hat{p} = \frac{n_1}{n_0 + n_1} = \frac{n_1}{n}$$
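A quick numeric check of this example (the grid search is used purely for illustration):

```python
# Sketch: maximize the Bernoulli log likelihood n1*log(p) + n0*log(1-p) on a
# grid and compare with the closed form n1/n, using the sample from the slide.
import numpy as np

data = np.array([1, 0, 1, 1, 1, 0, 0, 1, 0, 1])
n1, n0 = data.sum(), len(data) - data.sum()

p_grid = np.linspace(0.001, 0.999, 999)
log_lik = n1 * np.log(p_grid) + n0 * np.log(1 - p_grid)
print("grid maximizer:", p_grid[np.argmax(log_lik)])   # ~0.6
print("closed form n1/n:", n1 / len(data))             # 0.6
```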
• For continuous random variables the density function describes probabilities.
• Let $f(y|\theta)$ denote a density function that depends on a possible vector of parameters $\theta$ (i.e. for the normal it depends on the mean and variance).
• If we take an iid sample of a continuous random variable then the likelihood of observing the sample is given by the product of the marginal densities.
$$f(y_1, y_2, \ldots, y_n|\theta) = f(y_1|\theta)f(y_2|\theta)\cdots f(y_n|\theta)$$
• It is convenient to take the log of the joint density function, in which case we get:
$$\ln f(y_1, y_2, \ldots, y_n|\theta) = \sum_{i=1}^{n}\ln f(y_i|\theta)$$
• As before, the value of $\theta$ that maximizes the function f will also maximize ln(f) and is called the Maximum Likelihood Estimate (MLE).
• Example: iid Normal likelihood function.
(on white board)
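Since the example is worked on the white board, here is only a minimal sketch of what the iid Normal log likelihood looks like in code, with its known closed-form maximizers (the sample mean, and the variance with divisor n; simulation values are made up):

```python
# Sketch: iid Normal log likelihood and its closed-form MLEs.
import numpy as np

def normal_loglik(y, mu, sigma2):
    """log L = sum_i log f(y_i | mu, sigma2) for iid Normal data."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (y - mu) ** 2 / (2 * sigma2))

rng = np.random.default_rng(2)
y = rng.normal(loc=1.0, scale=2.0, size=1000)

mu_hat, sigma2_hat = y.mean(), y.var()   # MLEs (np.var uses divisor n by default)
print(mu_hat, sigma2_hat, normal_loglik(y, mu_hat, sigma2_hat))
```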
How do we set up the likelihood for dependent time series data?
• The factorization of the likelihood into the product of marginal densities relied on an iid sample, an assumption that clearly fails for dependent time series data.
• Fact: without any loss of generality, the joint density function can always be expressed as:
$$f(y_T, y_{T-1}, \ldots, y_1|\theta) = f(y_T|y_{T-1}, y_{T-2}, \ldots, y_1;\theta)\, f(y_{T-1}|y_{T-2}, y_{T-3}, \ldots, y_1;\theta)\cdots f(y_2|y_1;\theta)\, f(y_1;\theta)$$
• Taking logs of the joint density then gives:
$$\ln f(y_1, y_2, \ldots, y_n|\theta) = \sum_{i=1}^{n}\ln f(y_i|y_{i-1}, y_{i-2}, \ldots, y_1;\theta)$$
• The value of $\theta$ that maximizes the likelihood is called the Maximum Likelihood Estimate (MLE) of $\theta$.
• All the models considered in class are different models for this conditional distribution.
• The result says that if we can write the conditional density $f(y_{t+1}|F_t)$ we can construct the likelihood.
• But this is exactly how we specified the models discussed so far. We defined the dynamics conditional on $F_t$.
• AR(1) model likelihood under Gaussian errors.
(on white board)
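Again the derivation is on the white board; below is a hedged sketch of the conditional Gaussian AR(1) log likelihood, conditioning on the first observation as the factorization above suggests (the function name and simulation values are made up):

```python
# Sketch: conditional log likelihood of a Gaussian AR(1), using
# y_t | y_{t-1} ~ N(phi0 + phi * y_{t-1}, sigma2) and summing over t = 2..T.
import numpy as np

def ar1_conditional_loglik(y, phi0, phi, sigma2):
    resid = y[1:] - phi0 - phi * y[:-1]          # y_t - E[y_t | y_{t-1}]
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2))

rng = np.random.default_rng(3)
y = np.zeros(500)
for t in range(1, 500):                          # simulate an AR(1): phi0=0.5, phi=0.8
    y[t] = 0.5 + 0.8 * y[t - 1] + rng.normal()

print(ar1_conditional_loglik(y, 0.5, 0.8, 1.0))  # log likelihood at the truth
```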
Variance of maximum likelihood parameter estimates
• The value of $\theta$ that maximizes the likelihood is our estimate, but how accurate is that estimate?
• If the assumed distribution is correct then the variance-covariance matrix of the parameter estimates is given by the inverse of the information matrix, denoted $I^{-1}$.
• Write the log likelihood function as:
$$L = \sum_{t=1}^{T}l_t \quad\text{where}\quad l_t = \ln f(y_t|y_{t-1}, y_{t-2}, \ldots, y_1;\theta)$$
• There are two ways to estimate the information matrix:
1)
$$\hat{I}_{2d} = -\frac{1}{T}\sum_{t=1}^{T}\left.\frac{d^2 l_t}{d\theta\, d\theta'}\right|_{\hat{\theta}} = -\frac{1}{T}\left.\frac{d^2 L}{d\theta\, d\theta'}\right|_{\hat{\theta}}$$
2)
$$\hat{I}_{op} = \frac{1}{T}\sum_{t=1}^{T}\left.\frac{dl_t}{d\theta}\frac{dl_t}{d\theta'}\right|_{\hat{\theta}}$$
Constructing the information matrix
• Then if $\theta_0$ is the true parameter value, the distribution of the parameter estimates is given by:
$$\hat{\theta} \sim N\left(\theta_0, \frac{1}{T}\hat{I}^{-1}\right)$$
• where $\hat{I}$ is one of the two estimates $\hat{I}_{2d}$ or $\hat{I}_{op}$.
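A concrete sketch using the Bernoulli example from section 5 (analytic score and second derivative of $l_t = x_t\ln p + (1-x_t)\ln(1-p)$; for that model the two estimates happen to coincide exactly at the MLE):

```python
# Sketch: the two information-matrix estimates, evaluated at p_hat.
import numpy as np

x = np.array([1, 0, 1, 1, 1, 0, 0, 1, 0, 1])
T = len(x)
p = x.mean()                                 # the MLE p_hat

score = x / p - (1 - x) / (1 - p)            # dl_t/dp at p_hat
hess = -x / p**2 - (1 - x) / (1 - p)**2      # d2 l_t / dp2 at p_hat

I_2d = -np.mean(hess)                        # second-derivative (Hessian) estimate
I_op = np.mean(score**2)                     # outer-product-of-gradients estimate
print(I_2d, I_op)                            # both equal 1/(p*(1-p)) here
print("Var(p_hat) approx:", 1 / (T * I_2d))  # (1/T) * I^{-1}
```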
• When the model is correctly specified the two estimates of the information matrix are the same in large samples. In small samples they will differ a little, although neither is preferred.
• There are several ways the model can be wrong.
– First, we could misspecify the dynamics (i.e. have the wrong AR model)
– Second, we could use the wrong shape for the conditional distribution (i.e. use a Normal when we should have used a t‐distribution)
Getting the standard errors right
• Interestingly, if we have the dynamics right but falsely assume Normality, the parameter estimates are still consistent under very general conditions.
• The standard errors will be wrong, however.
• The good news is that we know how to fix them.
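The slide does not spell the fix out here, but a standard recipe is the "sandwich" (quasi-maximum-likelihood) covariance, which combines the two information estimates above; a hedged sketch (the function name and array layout are assumptions):

```python
# Sketch of sandwich / QMLE standard errors: Var(theta_hat) is approximated by
# (1/T) * I_2d^{-1} I_op I_2d^{-1}, built from per-observation derivatives.
import numpy as np

def sandwich_cov(hessians, scores):
    """hessians: (T, k, k) per-obs d2 l_t; scores: (T, k) per-obs dl_t, at theta_hat."""
    T = scores.shape[0]
    I_2d = -hessians.mean(axis=0)            # Hessian-based information estimate
    I_op = scores.T @ scores / T             # OPG information estimate
    I_2d_inv = np.linalg.inv(I_2d)
    return I_2d_inv @ I_op @ I_2d_inv / T    # robust covariance of theta_hat

# When the model is correctly specified, I_2d = I_op in large samples and this
# collapses back to the usual (1/T) * I^{-1}.
```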