
Page 1

Forecast Evaluation

NATCOR

Forecasting with ARIMA models

Nikolaos Kourentzes

[email protected]

Page 2

Outline

1. Bias measures

2. Accuracy measures

3. Evaluation schemes

4. Prediction intervals

5. Parameter selection

6. Method selection


Page 3

Forecast Evaluation

Forecast errors → inaccurate results → loss (performance, financial, etc.)

Measuring the loss is important but often hard to track, so the forecasting error can be used as a proxy. Therefore it is important to track and evaluate forecast errors.

Forecast evaluation is a key activity in the forecasting process:

• Important for forecast monitoring

• Essential for method selection and parameterisation

Page 4

Forecast Error - Definition

$e_t = A_t - F_t$

[Figure: weekly sales of SKU A with the actuals $A_t$ and forecasts $F_t$, and the resulting forecast errors plotted over time.]

Positive errors: under-forecast. Negative errors: over-forecast.

Page 5

Measures of Bias

Instead of considering the complete vector of errors we can aggregate them using:

Mean Error (the most common measure of forecast bias): $ME = \frac{1}{n}\sum_{t=1}^{n} e_t$

Median Error: $MdE = \mathrm{median}(e_t)$

Measures of bias show whether we typically over- or under-forecast. Ideally this should be as close to zero as possible.

Forecasting method A ME: 12.67

Forecasting method B ME: -23.12

Forecasting method C ME: -0.076
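As a quick illustration, a minimal Python sketch of both bias measures (assuming numpy and equal-length arrays of actuals and forecasts; the function name is illustrative):

```python
import numpy as np

def bias_measures(actuals, forecasts):
    """Mean Error (ME) and Median Error (MdE); values near zero mean unbiased."""
    e = np.asarray(actuals, dtype=float) - np.asarray(forecasts, dtype=float)  # e_t = A_t - F_t
    return {"ME": np.mean(e), "MdE": np.median(e)}

# A forecast that typically overshoots gives a negative ME (over-forecasting).
print(bias_measures([100, 120, 90], [110, 130, 95]))  # ME = -8.33, MdE = -10.0
```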

[Figure: three example series and forecasts: positive bias (under-forecasting), negative bias (over-forecasting), and unbiased.]

Page 6

Measures of Bias

[Figure: the same series forecast three ways, with mean errors of 149.9, 0.1 and -150.1.]

Mean error = 149.9: we typically forecast less than we should. This forecast will lead to biased decisions.

Mean error = -150.1: we typically forecast more than we should. This forecast will lead to biased decisions.

Mean error = 0.1: this forecast shows no preference, and is therefore useful for objective decision making.

Page 7

A Note on Mean and Median Errors

[Figure: histogram of the error distribution with the mean and median error marked; the outliers affect the mean strongly.]

It is well known that the mean is affected by outliers and asymmetric distributions more than the median. In the context of forecasting:

• The median is insensitive to extremes (outliers) and better summarises typical performance.

• The mean is sensitive to extremes (outliers), which is useful when we are interested in them.

• Substantial differences between the mean and median errors suggest that the error distribution may contain outliers or be asymmetric.

Page 8

Bias and Magnitude of Errors - Accuracy

Mean Error = 0 does not tell us whether we are accurate, merely whether we are biased. To overcome this we can calculate squared ($e_t^2$) or absolute errors ($|e_t|$), which do not cancel out once aggregated.

       e_t    e_t^2   |e_t|
e1      -7       49       7
e2     +12      144      12
e3      -5       25       5
Sum      0      218      24
Mean     0    72.67       8

The mean error of zero suggests "no error?", while the squared and absolute errors reveal the true magnitude of the errors.

Page 9

Measures of Accuracy - Scale Dependent

Some common errors that can be defined using these operators are:

1. Mean Squared Error: $MSE = \frac{1}{n}\sum_{i=1}^{n} (A_i - F_i)^2$
• Sensitive to outliers (squares)
• Non-intuitive (units are squared)
• Scale dependent

2. Root Mean Squared Error: $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (A_i - F_i)^2}$
• As MSE, but the resulting units are not squared
• Scale dependent

3. Mean Absolute Error: $MAE = \frac{1}{n}\sum_{i=1}^{n} |A_i - F_i|$
• Robust to outliers
• Scale dependent

Scale-dependent errors can only be used to compare different methods on the same time series. If a time series is measured in "cars", the errors are also in "cars"! Similar problems occur due to the scale of the series. They should not be used for comparisons across different time series!

Obviously we can define median versions of the above errors.
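A similar minimal sketch of the three scale-dependent accuracy measures (same assumptions as the bias sketch above):

```python
import numpy as np

def accuracy_measures(actuals, forecasts):
    """Scale-dependent accuracy measures; comparable only on the same series."""
    e = np.asarray(actuals, dtype=float) - np.asarray(forecasts, dtype=float)
    mse = np.mean(e ** 2)
    return {"MSE": mse, "RMSE": np.sqrt(mse), "MAE": np.mean(np.abs(e))}

print(accuracy_measures([106, 100, 102], [101, 101, 101]))
# errors 5, -1, 1 -> MSE = 9.0, RMSE = 3.0, MAE = 2.33
```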

Page 10

A Note on Absolute and Squared Errors

[Figure: histograms of the errors $e_t$, the absolute errors $|e_t|$ and the squared errors $e_t^2$, with mean and median marked. Absolute errors: mean 158.8 units, median 117.8 units. Squared errors: mean 50,179 units² (224.0 units), median 14,022 units² (118.4 units). Notice how extreme the outliers become.]

Squared errors are sensitive to outliers, as these are inflated disproportionately relative to smaller errors. Absolute errors, on the other hand, do not rescale the errors, so the contribution of outliers is not exaggerated.

Page 11

Percentage Errors - Scale Independent

In order to compare across different time series we define a series of scale-independent measures; neither the level nor the units of the original time series matter.

Percentage errors: $PE_t = \frac{A_t - F_t}{A_t}$
• Expresses errors as a ratio to the actual level
• Free of units
• Requires a "meaningful" zero (0°C is not a "meaningful" zero), so that the actuals do not become negative

Based on the percentage errors we can define percentage bias and accuracy metrics. These will be scale and unit independent and therefore allow comparisons and aggregations between time series.

Page 12

Percentage Errors - Scale Independent

Some common errors that can be defined using these operators are:

1. Mean Absolute Percentage Error: $MAPE = \frac{1}{n}\sum_{i=1}^{n} \left| \frac{A_i - F_i}{A_i} \right|$
• Scale independent
• Very intuitive (the method is % wrong)
• Biased: positive and negative errors do not count equally!
• Requires a non-zero and positive denominator

2. Symmetric Mean Absolute Percentage Error (sMAPE)
• If you see it, avoid it! It has too many issues!

There are also median versions of the absolute percentage errors.

MAPE bias example:
Actual = 100, Forecast = 90: MAPE = |10|/100 = 10%
Actual = 90, Forecast = 100: MAPE = |-10|/90 = 11.111%
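A minimal MAPE sketch (assuming numpy), with an explicit guard for the zero-actuals problem shown on the next slide:

```python
import numpy as np

def mape(actuals, forecasts):
    """Mean Absolute Percentage Error; explodes when any actual is zero."""
    a = np.asarray(actuals, dtype=float)
    f = np.asarray(forecasts, dtype=float)
    if np.any(a == 0):
        return np.inf  # a single zero actual makes MAPE infinite
    return np.mean(np.abs((a - f) / a))

print(mape([106, 100, 102], [101, 101, 101]))  # ~0.0223, i.e. 2.23%
print(mape([106, 0, 102], [101, 101, 101]))    # inf
```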

Page 13

Percentage Error Example

Period  Actual (At)  Forecast (Ft)  et = At - Ft  PE = et/At  APE = |PE|
t+1     106          101            5             4.72%       4.72%
t+2     0            101            -101          Infinite    Infinite
t+3     102          101            1             0.98%       0.98%

MAPE = Infinite!

If there is a zero (or a value close to zero) actual, MAPE becomes infinite (or extremely large). Median versions will typically still give a usable value, as the infinite errors are ignored!

Period  Actual (At)  Forecast (Ft)  et = At - Ft  PE = et/At  APE = |PE|
t+1     106          101            5             4.72%       4.72%
t+2     100          101            -1            -1.00%      1.00%
t+3     102          101            1             0.98%       0.98%

MAPE = 2.23%

Page 14

Example of Scale Independent Errors

[Figure: weekly sales and forecasts for two SKUs of very different scales.]

SKU A: MAE = 134.0 computers, MAPE = 179.7%
SKU C: MAE = 6,117,390 iron nails, MAPE = 26.6%

The MAE of the first series is dwarfed by the second's, because the scale of the second series is in the millions. MAE also has the issue of units.

Page 15

Relative Errors

In order to compare against a benchmark method we define a series of relative error measures.

Relative errors: $RE_t = \frac{A_t - F_t}{A_t - F_t^{\mathrm{Benchmark}}}$
• Expresses errors as a ratio to the errors of another forecasting model, typically the naive
• Free of units (scale independent)
• Directly compares forecasting methods

Geometric Mean Relative Absolute Error: $GMRAE = \left( \prod_{i=1}^{n} \left| \frac{A_i - F_i}{A_i - F_i^{\mathrm{Naive}}} \right| \right)^{1/n}$
• Absolute form of relative errors
• Error < 1: method better than benchmark
• Error > 1: method worse than benchmark
• Error = 1: method as good as benchmark

To summarise across time series, we again use a geometric mean of the GMRAE of each series.
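A minimal GMRAE sketch (assuming numpy and a naive benchmark forecast); it computes the geometric mean through logarithms, which is the identity shown on the next slide:

```python
import numpy as np

def gmrae(actuals, forecasts, benchmark_forecasts):
    """Geometric mean of relative absolute errors against a benchmark.
    < 1: better than the benchmark; > 1: worse; = 1: equally good."""
    a = np.asarray(actuals, dtype=float)
    f = np.asarray(forecasts, dtype=float)
    b = np.asarray(benchmark_forecasts, dtype=float)
    rae = np.abs((a - f) / (a - b))
    return np.exp(np.mean(np.log(rae)))  # geometric mean computed via logs

# Example: the benchmark is the naive (previous actual) forecast.
actuals = [102.0, 99.0, 105.0]
naive = [100.0, 102.0, 99.0]
method = [101.0, 100.0, 103.0]
print(gmrae(actuals, method, naive))  # ~0.38: the method clearly beats the naive
```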

Page 16

Relative Summary Errors

Another way to calculate GMRAE is to use the mean of the logarithms of the RAEs:

$GMRAE = \left( \prod_{i=1}^{n} \left| \frac{A_i - F_i}{A_i - F_i^{\mathrm{Naive}}} \right| \right)^{1/n} = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \log \left| \frac{A_i - F_i}{A_i - F_i^{\mathrm{Naive}}} \right| \right)$

An alternative error metric is AvRelMAE (Average Relative MAE), which is calculated as:

$AvRelMAE = \left( \prod_{i=1}^{n} \frac{MAE_i^{\mathrm{Forecast}}}{MAE_i^{\mathrm{Benchmark}}} \right)^{1/n} = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \log \frac{MAE_i^{\mathrm{Forecast}}}{MAE_i^{\mathrm{Benchmark}}} \right)$

The idea is the following:
1. Calculate the MAE for each series (i = 1, …, n)
2. Calculate the ratios with the benchmark MAE
3. Average across the different series with a geometric mean (as we use ratios)

• More robust to calculate than GMRAE, but less sensitive to individual errors
• Same interpretation as GMRAE
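And a matching AvRelMAE sketch (assuming the per-series MAEs have already been computed, e.g. with an accuracy function like the one sketched earlier):

```python
import numpy as np

def avrel_mae(method_maes, benchmark_maes):
    """Geometric mean of per-series MAE ratios against a benchmark."""
    ratios = np.asarray(method_maes, dtype=float) / np.asarray(benchmark_maes, dtype=float)
    return np.exp(np.mean(np.log(ratios)))  # geometric mean computed via logs

# MAEs of a method and of the naive benchmark on three series of very different scales.
print(avrel_mae([120.0, 0.8, 5000.0], [150.0, 1.0, 4000.0]))  # ~0.93: better than the benchmark
```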

Page 17

Error Measure Remarks

• Can be calculated across any forecast horizon or aggregation of forecast horizons.

• Scale independent errors can be aggregated across time series as well.

• There is no best error measure! Depends on the data and question at hand.

• Error measures can help in assessing the bias, accuracy and robustness of a forecasting method, as well as its ranking against other methods.

• Different forecast error measures may produce different model rankings.

Page 18

Forecast Monitoring

Let us assume that we run two different forecasts and tracked their performance at every period. The blue forecast is more reactive and therefore better. But if this SKU was forecasted every month automatically without any human intervention, could we identify the problem?

[Figure: SKU B sales with SES(0.05) and SES(0.25) forecasts, and their out-of-sample t+1 to t+6 MAE over time. The poor forecast produces very high errors for prolonged periods; very high errors should signal an alert for manual intervention, and here the alert occurs very fast.]

Page 19

Forecast Monitoring

We can monitor automated forecasts by tracking their errors. This can be implemented in an unstructured way or with a control chart approach. The errors can be used raw or smoothed.

[Figure: control chart of the MAE over time, with ±1.96 standard deviation limits flagging periods with unexpectedly high errors.]
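A minimal control-chart sketch, as one possible implementation (the ±1.96 limits follow the figure; the standardisation approach here is an assumption, not the slides' prescription):

```python
import numpy as np

def flag_unusual_errors(errors, z=1.96):
    """Flag periods whose error is unexpectedly far from the typical error level."""
    e = np.asarray(errors, dtype=float)
    scores = (e - e.mean()) / e.std(ddof=1)  # standardised errors
    return np.where(np.abs(scores) > z)[0]   # indices breaching the control limits

errors = [50, 60, 55, 48, 52, 240, 58, 51]   # one period with an error spike
print(flag_unusual_errors(errors))           # -> [5]
```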

Page 20

Outline

1. Bias measures

2. Accuracy measures

3. Evaluation schemes

4. Prediction intervals

5. Parameter selection

6. Method selection


Page 21

In- and Out-of-Sample

The historical observations can be split into two subsets:
• In-sample: used for model building and parameterisation.
• Out-of-sample: used for model evaluation. This is not used in building the model and is not "seen" by the model. We use it to simulate true forecasts, instead of waiting for new unobserved values, in order to evaluate the forecasting performance of alternative forecasting models.

[Figure: a monthly series split at the forecast origin into the in-sample (used to build the model) and the out-of-sample (used to evaluate the model; note that the forecast is multiple steps ahead).]

Page 22

Static Origin Evaluation

[Figure: the same series with a single forecast origin at month 60 and a 12-month forecast into the out-of-sample.]

The simplest evaluation produces a single forecast in the out-of-sample subset. Let the forecast horizon be 12 months and the holdout (out-of-sample) 24 months. We then have a forecast for t+1, t+2, ..., t+12 from a single forecast origin, month 60. If we are interested in forecasting t+12 accurately, we have only one measurement, and therefore low confidence in our accuracy measurement. This evaluation scheme is called static origin evaluation.

Page 23: NATCOR Forecast Evaluation€¦ · Static Origin Evaluation 22 10 20 30 40 50 60 70 80 350 400 450 500 550 600 650 h s The simplest evaluation produce a single forecast in the out-of-sample

Limitations:

1. One forecast per lead time needs long track history

2. Forecast susceptible to corruption “strange” origins or targets may affect quality of forecasts

3. Averaging over different lead times corrupts summary error statistic

S t a t i c O r i g i n E v a l u a t i o n

23

Origin

In-sample Out-of-sample (Holdout)

Use to fit the model Use to evaluate the model

Forecasts

Page 24

Rolling Origin Evaluation

A way to overcome these limitations is the Rolling Origin Evaluation scheme.

[Diagram: the forecast origin is rolled forward step by step; each roll moves one observation from the out-of-sample into the (increased) in-sample and produces a new set of forecasts.]

We roll the forecast origin until there is not enough out-of-sample to use for evaluation.
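A minimal rolling origin loop (a sketch assuming a naive forecaster for illustration; any fitted model could be substituted). Note how the counts per lead time reproduce the table on the next slide:

```python
import numpy as np

def rolling_origin_errors(series, holdout, horizon):
    """Collect forecast errors per lead time by rolling the forecast origin."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    errors = {h: [] for h in range(1, horizon + 1)}
    for origin in range(n - holdout, n):       # roll the origin through the holdout
        fc = series[origin - 1]                # naive: last in-sample value, constant over the horizon
        for h in range(1, horizon + 1):
            if origin + h - 1 < n:             # stop when the out-of-sample runs out
                errors[h].append(series[origin + h - 1] - fc)
    return errors

errs = rolling_origin_errors(np.arange(100.0), holdout=10, horizon=5)
print({h: len(e) for h, e in errs.items()})    # {1: 10, 2: 9, 3: 8, 4: 7, 5: 6}
```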

Page 25

Rolling Origin Evaluation

Rolling origin evaluation:
1. Provides more forecasts per lead time
2. Overcomes the limitations of fixed origin evaluation
• Provides more forecasting history per lead time for an equal holdout sample
• Does not need to average over lead times
• Can overcome "strange" origins or targets

Number of forecasts per forecast lead time:

                 Holdout: 5                    Holdout: 10
Lead time   Fixed origin  Rolling origin  Fixed origin  Rolling origin
t+1         1             5               1             10
t+2         1             4               1             9
t+3         1             3               1             8
t+4         1             2               1             7
t+5         1             1               1             6

Page 26

Rolling Origin Evaluation

Using the previous example, a rolling origin evaluation would look like:

[Figure: two panels of rolling origin forecasts, SES with alpha = 0.05 and alpha = 0.20; the black dots are the forecast origins.]

Visualising the rolling origin forecasts makes it easier to appreciate the importance of smooth forecasts that filter noise.

Page 27

Evaluation Schemes - Sample Size

Having enough sample for model building and evaluation is crucial. Lack of sample severely restricts the selection of alternative models, as many require an abundance of data. On really short time series we can eventually only apply naive and simple average models.

Models that require parameterisation perform better when there is ample data: the more data available, the better the estimation of the parameters. Consider the issues with setting the gamma parameter of seasonal exponential smoothing models.

The same is true for evaluating forecasts. With a large sample many errors can be calculated, and therefore we have higher confidence in the estimated figure.

Sample size also affects our understanding of the time series components. How many observations are required to identify a seasonal time series?

Page 28

Evaluation Schemes - Sample Size

Let's try to find the t+1 forecast error of a method using absolute errors...

For 84 errors the MAE is 45.75, but let us assume that this is unknown.
The 1st error is 65.96 - running mean: 65.96; are we confident?
The 2nd error is 68.18 - running mean: 67.07; are we confident?
The 3rd error is 99.35 - running mean: 77.83; are we confident?
...
The 20th error is 112.68 - running mean: 57.30; are we confident?
The 30th error is 18.33 - running mean: 49.65; are we confident?

[Figure: the absolute errors per observation, with the cumulative MAE converging towards the final MAE of 45.75.]

Page 29

Evaluation Scheme Remarks

• The forecasting accuracy of one model is only meaningful relative to another model or benchmark. An error of 5% or 50% is uninformative without comparison against benchmarks.

• Naive methods make simple and effective benchmarks.

• There are no set rules determining the size of the out-of-sample (or holdout), however:

o It should be at least as long as the forecast horizon

o Leave enough in-sample data for model building

o Provide enough forecasts of the forecast horizon of interest

o A simple heuristic is 80% in-sample and 20% out-of-sample, but this is often inappropriate.

Page 30

Outline

1. Bias measures

2. Accuracy measures

3. Evaluation schemes

4. Prediction intervals

5. Parameter selection

6. Method selection


Page 31

Prediction Intervals

[Figure: a series with its SES forecast, the in-sample errors $e_t = A_t - F_t$, and the error probability density function with the 68.2%, 95.5% and 99.7% bands at ±1, ±2 and ±3 standard deviations.]

We can use the error distribution of valid models to formulate prediction intervals for the forecasting methods.

Page 32

Prediction Intervals

$PI_h = F_{t+h} \pm z_{\alpha/2} s_{e_h}$

Starting from the sample standard deviation of the h-step-ahead errors:

$s_{e_h} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} \left( e_{t+h,i} - \bar{e}_h \right)^2 }$

For an unbiased model the mean of the errors is zero ($\bar{e}_h = 0$), so:

$s_{e_h} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} e_{t+h,i}^2 } = \sqrt{MSE_h}$

The forecast is the expected value. Adding and subtracting the quantity $z_{\alpha/2} \cdot s_{e_h}$ to and from the forecast gives us the prediction intervals. The standard score for normal distributions (valid models) is easy to calculate:

Prediction interval   z_{α/2} score
50%                   0.67
90%                   1.64
95%                   1.96
99%                   2.58

Using prediction intervals we can visualise the confidence in our forecasts.
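A minimal sketch of these intervals (assuming unbiased errors, so that $s_{e_h} = \sqrt{MSE_h}$ as above):

```python
import numpy as np

def prediction_interval(forecast, errors_h, z=1.96):
    """95% interval by default; errors_h are in-sample h-step-ahead errors."""
    e = np.asarray(errors_h, dtype=float)
    s_eh = np.sqrt(np.mean(e ** 2))  # sqrt(MSE_h), assuming zero-mean (unbiased) errors
    return forecast - z * s_eh, forecast + z * s_eh

errors = [5.0, -3.0, 4.0, -6.0, 2.0]
print(prediction_interval(500.0, errors))  # approx (491.7, 508.3)
```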

Page 33

Prediction Intervals

[Figure: 80% and 90% prediction intervals for four series: SKU A, SKU C, UK Android Market Share and US Air Passengers.]

Observe that the prediction intervals vary with the series, method and horizon. We have more confidence in forecasts with tight PIs.

Page 34

Prediction Intervals

Calculating the PI formula can be complicated, as it requires the in-sample MSE at multiple forecast horizons (h):

$PI_h = F_{t+h} \pm z_{\alpha/2} s_{e_h} = F_{t+h} \pm z_{\alpha/2} \sqrt{MSE_h}$

This can be obtained by calculating the rolling origin in-sample MSE at the relevant forecast horizon. Alternatively, it can be approximated using the following formula:

$s_{e_h} \approx \sqrt{h} \sqrt{MSE_{t+1}}$

This is the square root of the horizon multiplied by the square root of the 1-step-ahead in-sample mean squared error. Note that there is substantial empirical evidence that this is a very rough approximation.
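Extending the sketch above with this approximation (again, only a rough heuristic):

```python
import numpy as np

def approx_interval(forecast_h, errors_1step, h, z=1.96):
    """h-step-ahead interval approximated from 1-step-ahead in-sample errors."""
    mse1 = np.mean(np.asarray(errors_1step, dtype=float) ** 2)
    s_eh = np.sqrt(h) * np.sqrt(mse1)  # s_eh ~ sqrt(h) * sqrt(MSE_{t+1})
    return forecast_h - z * s_eh, forecast_h + z * s_eh

errors = [5.0, -3.0, 4.0, -6.0, 2.0]
print(approx_interval(500.0, errors, h=4))  # twice as wide as the h=1 interval
```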

Page 35

Outline

1. Bias measures

2. Accuracy measures

3. Evaluation schemes

4. Prediction intervals

5. Parameter selection

6. Method selection


Page 36

Parameter Selection

We have seen for the exponential smoothing methods that we can select the smoothing parameters based on the theoretical properties of the methods and the characteristics of the time series:

• Low parameters imply long weighted averages and therefore robustness against outliers and increased noise.

• Higher parameters imply shorter weighted averages, reacting faster to new information and better handling breaks in the series.

However it may be desirable to automate the parameter selection process. We can use in-sample error metrics for this purpose.

Page 37

Parameter Selection

[Figure: SKU A with SES forecasts for alpha = 0.10 (in-sample MSE = 27,337) and alpha = 0.40 (in-sample MSE = 35,580). The lower alpha is the better parameter here, resulting in a better fit (lower error).]

Based on this idea we can optimise model parameters (for any exponential smoothing method, or more generally).

Page 38

Parameter Selection: SES Example

We can calculate the in-sample MSE for various values of alpha and identify the value that gives the lowest error.

[Figure: in-sample MSE as a function of alpha, with the minimum at α = 0.106.]

This result is very close to the one we chose manually (0.1). Obviously, for more complex methods we can optimise several parameters simultaneously.
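A sketch of this optimisation (assuming numpy and scipy; initialising the level with the first observation is one common convention, not necessarily the one used in the slides):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ses_insample_mse(alpha, series):
    """1-step-ahead in-sample MSE of SES for a given alpha."""
    level = series[0]                 # initialise with the first observation (an assumption)
    errors = []
    for y in series[1:]:
        errors.append(y - level)      # 1-step-ahead fit error
        level = alpha * y + (1 - alpha) * level  # SES level update
    return np.mean(np.square(errors))

# A flat but noisy series: a low alpha should be close to optimal.
series = np.random.default_rng(1).normal(500, 50, 200)
best = minimize_scalar(ses_insample_mse, bounds=(0.0, 1.0),
                       args=(series,), method="bounded")
print(f"optimal alpha = {best.x:.3f}")
```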

Page 39

Parameter Selection: SES Example

The same principle can be applied to choose the initialisation level value as well. We now vary both the initialisation and the smoothing parameter.

[Figure: SKU B with the optimised SES forecast: alpha = 0.3237, initial level = 1697.]

As the number of parameters (including initialisation values) increases, optimisation becomes more time-consuming and requires more data.

Page 40

Parameter Selection

Will the optimal parameters always be the best? ... In fact, no.

There are many reasons for this:

• Optimisation is done in-sample. The correlation between in-sample and out-of-sample errors has been shown to be low.

• Optimisation is (typically) done using t+1 MSE. In practice we forecast for longer horizons and our pragmatic cost functions are different.

• MSE is by construction very reactive to extreme errors, which may distort the error surface over which we search for the optimal values.

• Sample limitations as the number of parameters to optimise increases.

• Minimum error may not be the business objective. Companies may prefer consistency of forecasts across origins instead.

Optimisation is very useful for automation; however, human experts should override the identified parameters if they violate theory or objectives.

Page 41

Parameter Selection - Try It Out

https://kourentzes.shinyapps.io/shinySES

Experiment with setting the alpha parameter:
• Do you agree with the optimal value?
• Do the in-sample and out-of-sample errors behave the same way?

Page 42

Parameter Selection Remarks

• Optimisation for complex models is sensitive to the starting conditions (local optima). Different sets of initial values and parameters may give different results.

• Optimising on bias does not make sense, as positive and negative errors cancel out.

• Optimisation results may change depending on the error metric used. MSE is common, but other metrics may be useful.

[Figure: the log(MSE) surface over alpha and the initial level, with the minima under MSE, MAE and MAPE marked, and the resulting SES forecasts for SKU B.]

Error metric   Alpha    Initial level
MSE            0.3238   1697
MAE            0.3423   2046
MAPE           0.2574   2046

Page 43

Method Selection

We can use similar principles to select the appropriate forecasting method.

There are two major approaches:

• Using information criteria (usable only within a family of methods, e.g. exponential smoothing).

• Using a validation holdout sample and rolling origin evaluation.

These complement manual selection, based on understanding the characteristics of a time series.

Page 44

Method Selection - Why Fit Errors Do Not Work

To select between methods we cannot use the in-sample errors as we did for parameter selection. This is because more complex models will tend to have lower fit errors, even if their forecasts perform worse.

[Figure: SKU B with level exponential smoothing and trend-seasonal exponential smoothing forecasts.]

Level Exponential Smoothing: in-sample MAE(t+1) = 707.56; out-of-sample MAE(t+1 to t+6) = 321.28
Trend-Seasonal Exponential Smoothing: in-sample MAE(t+1) = 528.55; out-of-sample MAE(t+1 to t+6) = 1095.00

The trend-seasonal model fits better in-sample but forecasts worse in the holdout. More complex models are more flexible and have a higher potential to overfit compared to simpler models.

Page 45

Method Selection - Information Criteria

One approach is to penalise the fit of more complex models for the number of parameters they have (= complexity). We define information criteria as:

$IC = \ln(MSE) + p\,Q(n)$

• ln(MSE) is the logarithm of the 1-step-ahead in-sample MSE.
• p is the number of parameters (including initial values).
• Q(n) is the penalty function.
• n is the in-sample size.

Akaike Information Criterion (AIC) uses the penalty function $Q(n) = 2/n$.

Bayesian Information Criterion (BIC) uses the penalty function $Q(n) = \ln(n)/n$.

BIC penalises larger models more heavily. For exponential smoothing there are no significant differences in performance between the two.
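A minimal sketch of these criteria, following the slide's definitions exactly (note this is the slides' per-observation form, not the likelihood-based textbook form). The p and n values below are inferred, as an assumption, to reproduce the next slide's example:

```python
import numpy as np

def information_criteria(mse, p, n):
    """AIC and BIC as defined above: IC = ln(MSE) + p * Q(n)."""
    return {"AIC": np.log(mse) + p * (2 / n),
            "BIC": np.log(mse) + p * (np.log(n) / n)}

# Level exponential smoothing example, assuming p = 2 and n = 40 (inferred values).
print(information_criteria(mse=812506.27, p=2, n=40))
# {'AIC': 13.7079..., 'BIC': 13.7922...} -- matches the next slide
```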

Page 46

Information Criteria Example

[Figure: SKU B with level exponential smoothing and trend-seasonal exponential smoothing forecasts.]

Level Exponential Smoothing: in-sample MSE(t+1) = 812,506.27; AIC = 13.7079; BIC = 13.7923; out-of-sample MAE(t+1 to t+6) = 321.28
Trend-Seasonal Exponential Smoothing: in-sample MSE(t+1) = 413,584.09; AIC = 13.8326; BIC = 14.5926; out-of-sample MAE(t+1 to t+6) = 1095.00

Although the trend-seasonal model has the lower in-sample MSE, the level model has the lower AIC and BIC, in line with the holdout MAE.

Both information criteria give us the correct answer.

Page 47

Method Selection - Holdout Sample

For the other approach, we simply measure the error on a subset of the series that is not used for fitting the models. Both the validation set and the out-of-sample errors below come from rolling origin forecasts.

Level Exponential Smoothing: in-sample MAE(t+1) = 707.56; validation MAE(t+1 to t+6) = 655.64; out-of-sample MAE(t+1 to t+6) = 321.28
Trend-Seasonal Exponential Smoothing: in-sample MAE(t+1) = 528.55; validation MAE(t+1 to t+6) = 1028.61; out-of-sample MAE(t+1 to t+6) = 1095.00

Again we get the correct answer!

Page 48

Method Selection

Information Criteria
• Pros
  - Easy to calculate.
• Cons
  - Applicable only within a single family of methods.
  - Cannot always be aligned with the true cost function of the company.

Holdout Set
• Pros
  - Universal: can be used to select between any methods.
  - Can be fully aligned with the true cost function.
• Cons
  - Loses sample to the validation set.
  - If the sample size is not adequate for a reasonable rolling origin evaluation, the results may not be reliable.
  - Computationally complex.

Page 49

Method and Parameter Selection Remarks

• Forecast evaluation can help us automate both method and parameter selection.
• There are several alternative options; they often produce similar results.
• They are reliable enough for fully automatic operation (use forecast monitoring).
• The key benefit of statistics is automation, which is essential for modern business forecasting problems.
• Experienced human experts can outperform automatic methods:
  o They understand the structure of the time series.
  o They choose the method and parameters appropriately.

Page 50

Thank you for your attention!

Questions?

Nikolaos Kourentzes
Lancaster University Management School
Lancaster Centre for Forecasting, Lancaster, LA1 4YX
email: [email protected]

Forecasting blog: http://nikolaos.kourentzes.com

www.forecasting-centre.com/

Full or partial reproduction of the slides is not permitted without author’s consent. Please contact [email protected] for more information.