
Page 1: The M4 Competition in Progress

National Technical University of Athens – Forecasting & Strategy Unit
38th International Symposium on Forecasting

Boulder, Colorado, USA – June 2018

The M4 Competition in Progress

Evangelos Spiliotis, Spyros Makridakis, Vassilios Assimakopoulos

National Technical University of Athens – Forecasting & Strategy Unit
University of Nicosia – Institute for the Future

Forecast. Compete. Excel.

Page 2: The M4 Competition in Progress

The quest for the holy grail

What do we forecast?


The performance of forecasting methods strongly depends on the:
o Domain
o Frequency
o Length
o Characteristics
o ???
of the time series being examined,

as well as on various strategic decisions, such as the forecasting horizon, the computation time allowed (complexity), and the relevant information available

Page 3: The M4 Competition in Progress

The quest for the holy grail

What kind of method should we use?


There are too many types of methods and alternatives:
o Statistical
o Machine Learning
o Combination
o Judgmental
with contradictory results in the literature

Even if we knew which method is best for the examined application in general, a lot of work would still be needed to properly select and parameterize our forecasting model, as well as to pre-process our data

Page 4: The M4 Competition in Progress

The quest for the holy grail

Is there a golden rule or some best practices?


“ignorance of research findings, bias, sophisticated statistical procedures, and the proliferation of big data, have led forecasters to violate the Golden Rule. As a result, …, forecasting practice in many fields has failed to improve over the past half-century”.

Golden rule of forecasting: Be conservative (Armstrong et al., 2015)

“identify the main determinants of forecasting accuracy considering seven time series features and the forecasting horizon”

‘Horses for Courses’ in demand forecasting (Petropoulos et al., 2014)

“investigate which individual model selection is beneficial and when this approach should be preferred to aggregate selection or combination”

Simple versus complex selection rules for forecasting many time series (Fildes & Petropoulos, 2015)

Page 5: The M4 Competition in Progress

Evaluating Forecasting Performance

We need benchmarks....

New methods and forecasting approaches must perform well on well-known, diverse and representative data sets

This is exactly the scope of forecasting competitions: Learn how to improve the forecasting accuracy, and how such learning can be applied to advance the theory and practice of forecasting

✓ Encourage researchers and practitioners to develop new and more accurate forecasting methods

✓ Compare popular forecasting methods with new alternatives
✓ Document state-of-the-art methods and forecasting techniques used in academia and industry
✓ Identify best practices
✓ Set new research questions and try to provide proper answers


Page 6: The M4 Competition in Progress

Evaluating Forecasting Performance

Competitions will always be helpful....

➢ There will always be features of time series forecasting not previously studied under competition conditions
➢ There will always be new methods to be evaluated and validated
➢ As new performance metrics and statistical tests come to light, the results of previous competitions will always be put into question
➢ Technological advances affect the way forecasting is performed and enable more advanced, complex and computationally intensive approaches, previously inapplicable
➢ Exploding data volumes influence forecasting and its applications (more data to learn from, unstructured data sources, abnormal time series, new forecasting needs)


Page 7: The M4 Competition in Progress

The history of time series forecasting competitions


Makridakis and Hibon (1979)
• No participants
• 111 time series (yearly, quarterly & monthly)
• 22 methods

Major findings
• Simple methods do as well as or better than sophisticated ones
• Combining forecasts may improve forecasting accuracy
• Special events have a negative impact on forecasting performance

Establishing the idea of forecasting competitions

Page 8: The M4 Competition in Progress

The history of time series forecasting competitions


Establishing the idea of forecasting competitions

G. Jenkins

G.J.A. Stern

Automatic forecasting may be useless and less accurate than humans, while combining forecasts is quite risky

No one wants forecasts that accurate, nor has enough data to estimate them

M. B. Priestley

A model (simple data generation process) can perfectly describe and extrapolate your time series if identified and applied correctly

Page 9: The M4 Competition in Progress

The history of time series forecasting competitions


Makridakis et al. (1982)
• Seven participants
• 1001 time series (yearly, quarterly & monthly)
• 15 methods (plus 9 variations)
• Not real-time

M1: The first forecasting competition

Major findings
• Statistically sophisticated or complex methods do not necessarily provide more accurate forecasts than simpler ones.
• The relative ranking of the performance of the various methods varies according to the accuracy measure being used.
• The accuracy when various methods are combined outperforms, on average, the individual methods being combined and does very well in comparison to other methods.
• The accuracy of the various methods depends on the length of the forecasting horizon involved.

What's new?
• Real participants
• Many accuracy measures

Page 10: The M4 Competition in Progress

The history of time series forecasting competitions


Makridakis and Hibon (1993)
• 29 time series
• 16 methods (human forecasters, automatic methods and combinations)
• Real time

Major findings
• In most cases, forecasters failed to improve statistical forecasts based on their judgment
• Simple methods perform better in most cases, with the results being in agreement with previous studies

What's new?
• Combine statistical methods with judgment
• Ask questions to the companies involved
• Learn from previous errors and revise the next forecasts accordingly

M2: Incorporating judgment

Page 11: The M4 Competition in Progress

The history of time series forecasting competitions


Makridakis and Hibon (2000)
• 3003 time series
• 24 methods
• Not real time

Major findings
• The results of the previous studies and competitions were largely confirmed.
• New methods, such as the Theta of Assimakopoulos & Nikolopoulos (2000), and FSSs (Forecasting Support Systems), such as ForecastPro, proved their forecasting capabilities
• ANNs were relatively inaccurate

What's new?
• More methods (NNs and FSSs)
• More series

M3: The forecasting benchmark

“The M3 series have become the de facto standard test base in forecasting research. When any new univariate forecasting method is proposed, if it does not perform well on the M3 data compared to the results on other published algorithms, it is unlikely to receive any further attention or adoption.” (Kang, Hyndman & Smith-Miles, 2017)

Page 12: The M4 Competition in Progress

The history of time series forecasting competitions


Modern forecasting competitions

Neural network competitions (NN3 – 2006)
Crone, Hibon & Nikolopoulos (2011): 111 monthly M3 series & 59 submissions
✓ No CI (computational intelligence) method outperformed the original M3 contestants
✓ NNs may be inadequate for time series forecasting, especially for short series
✓ No “best practices” identified for utilizing CI methods

Kaggle Competitions

Tourism Forecasting Competition – Athanasopoulos & Hyndman (2010)
Web traffic (Wikipedia) competition – Anava & Kuznetsov (2017)

✓ Feedback significantly improves forecasting accuracy by providing motivation and guidance
✓ Fast results and conclusions

Page 13: The M4 Competition in Progress

Status quo and next steps


✓ Forecasting and time series analysis are two different things

✓ Models that produce more accurate forecasts should be preferred to those with better statistical properties

✓ Simple models work – especially for short series

✓ Out-of-sample and in-sample accuracy may significantly differ (Avoid over-fitting)

✓ Automatic forecasting algorithms work rather well – especially for long time series

✓ Combining methods helps us deal with uncertainty

So, what did we learn?

Page 14: The M4 Competition in Progress

Status quo and next steps


✓ Which are the “best practices” nowadays?

✓ How have advances in technology and algorithms affected forecasting?

✓ Are there any new methods that could really make a difference?

✓ How about prediction intervals?

✓ Similarities and differences between the various forecasting methods, including ML ones?

✓ Are the data of the forecasting competitions representative? Do other larger datasets support previous findings?

What else would be useful to learn (or verify) through M4?

Page 15: The M4 Competition in Progress

The M4 Competition

The dates

• Competition announced: Nov 1, 2017
• Competition starts: Jan 1, 2018
• Competition ends: May 31, 2018
• Preliminary results: Jun 18, 2018
• Final results and winners: Sep 28, 2018


• There was also a deadline extension (1 week) to encourage more participation
• Late submissions are not eligible for any prize

Page 16: The M4 Competition in Progress

The M4 Competition


The dataset (1/2)

Frequency | Micro | Industry | Macro | Finance | Demographic | Other | Total
Yearly | 6,538 | 3,716 | 3,903 | 6,519 | 1,088 | 1,236 | 23,000
Quarterly | 6,020 | 4,637 | 5,315 | 5,305 | 1,858 | 865 | 24,000
Monthly | 10,975 | 10,017 | 10,016 | 10,987 | 5,728 | 277 | 48,000
Weekly | 112 | 6 | 41 | 164 | 24 | 12 | 359
Daily | 1,476 | 422 | 127 | 1,559 | 10 | 633 | 4,227
Hourly | - | - | - | - | - | 414 | 414
Total | 25,121 | 18,798 | 19,402 | 24,534 | 8,708 | 3,437 | 100,000

✓ The largest forecasting competition, involving 100,000 business time series, to provide conclusions of statistical significance
✓ High frequency data, including Weekly, Daily and Hourly series
✓ Diverse time series collected from 23 reliable data sources & classified in 6 domains

*Data available at https://www.m4.unic.ac.cy/the-dataset/ or through the M4comp2018 R package

Page 17: The M4 Competition in Progress

The M4 Competition


The dataset (2/2)

[Figure: 2D visualization of the time series in the feature space of Kang et al. (2017) – Frequency, Seasonality, Trend, Randomness, ACF1 & Box-Cox λ – with panels for the Yearly, Quarterly, Monthly and Hourly series]

Page 18: The M4 Competition in Progress

The M4 Competition

✓ Produce point forecasts for the whole dataset – mandatory. Forecasting horizons as follows:
• 6 for yearly
• 8 for quarterly (2 years)
• 18 for monthly (1.5 years)
• 13 for weekly (3 months)
• 14 for daily (2 weeks)
• 48 for hourly data (2 days)

✓ Estimate prediction intervals (95% confidence) for the whole dataset – optional
✓ Submit before the deadline through the M4 site using a pre-defined file format
✓ Submit the code used to generate the forecasts, as well as a detailed method description, for reasons of reproducibility – optional but highly recommended. The supplementary material must be uploaded to the M4 GitHub* repo no later than June 10, 2018


The rules

* https://github.com/M4Competition

Page 19: The M4 Competition in Progress

The M4 Competition


Evaluation: Point Forecasts

Overall Weighted Average (OWA) of two accuracy measures:
• Mean Absolute Scaled Error (MASE)
• symmetric Mean Absolute Percentage Error (sMAPE)

$$\mathrm{MASE}=\frac{\frac{1}{h}\sum_{t=1}^{h}\left|Y_t-\hat{Y}_t\right|}{\frac{1}{n-m}\sum_{t=m+1}^{n}\left|Y_t-Y_{t-m}\right|}$$

$$\mathrm{sMAPE}=\frac{1}{h}\sum_{t=1}^{h}\frac{2\left|Y_t-\hat{Y}_t\right|}{\left|Y_t\right|+\left|\hat{Y}_t\right|}\times 100\%$$

where $Y_t$ is the post-sample value of the time series at point $t$, $\hat{Y}_t$ the estimated forecast, $h$ the forecasting horizon, $n$ the number of in-sample observations, and $m$ the frequency of the data

➢ Estimate MASE and sMAPE per series by averaging the errors computed per forecasting horizon
➢ Divide each error by that of Naïve 2 (Relative MASE and Relative sMAPE)
➢ Compute the OWA by averaging the Relative MASE and the Relative sMAPE

Page 20: The M4 Competition in Progress

The M4 Competition


Evaluation: Prediction Intervals

Mean Scaled Interval Score (MSIS)

➢ A penalty is calculated at the points where the real values fall outside the specified bounds
➢ The width of the prediction interval is added to the penalty, if any, to get the Interval Score (IS)
➢ The IS values estimated at the individual points are averaged to get the MIS value
➢ MIS is scaled by dividing its value by the mean absolute seasonal difference of the series
➢ The MSIS of all series is averaged to evaluate the total performance of the method

$$\mathrm{MSIS}=\frac{\frac{1}{h}\sum_{t=1}^{h}\left[(U_t-L_t)+\frac{2}{a}(L_t-Y_t)\mathbf{1}\{Y_t<L_t\}+\frac{2}{a}(Y_t-U_t)\mathbf{1}\{Y_t>U_t\}\right]}{\frac{1}{n-m}\sum_{t=m+1}^{n}\left|Y_t-Y_{t-m}\right|}$$

where $L_t$ and $U_t$ are the lower and upper bounds of the prediction intervals, $Y_t$ are the future observations of the series, $a$ is the significance level (0.05), and $\mathbf{1}$ is the indicator function (being 1 if its condition holds and 0 otherwise).
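A minimal sketch of the MSIS computation for one series, assuming numpy arrays for the in-sample history and the h future values, the submitted 95% bounds, seasonal period m, and a = 0.05 (names are illustrative):

```python
import numpy as np

def msis(insample, actual, lower, upper, m, a=0.05):
    """Mean Scaled Interval Score for a single series."""
    penalty = ((upper - lower)                                    # interval width
               + (2.0 / a) * (lower - actual) * (actual < lower)  # below lower bound
               + (2.0 / a) * (actual - upper) * (actual > upper)) # above upper bound
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))         # seasonal-diff MAE
    return np.mean(penalty) / scale
```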

Page 21: The M4 Competition in Progress

The M4 Competition


The benchmarks

1. Naïve 1 (S) – used to compare all methods (Prediction Intervals)
2. Seasonal Naïve (S)
3. Naïve 2 (S) – reference for estimating OWA
4. Simple Exponential Smoothing (S)
5. Holt’s Exponential Smoothing (S)
6. Damped Exponential Smoothing (S)
7. Combination of 4, 5 and 6 (C) – used to compare all methods (Point Forecasts)*
8. Theta (S)
9. MLP (ML)
10. RNN (ML)

10 benchmarks were used to facilitate comparisons between the participating methods: 7 classic Statistical methods, 1 Combination and 2 simplified Machine Learning ones

*Accurate, robust, simple & easy to understand
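As an illustration of benchmark 7, here is a hedged sketch of a Comb-style forecast (the arithmetic mean of SES, Holt and Damped exponential smoothing) using statsmodels; the official benchmark code applies the methods to seasonally adjusted data where needed, a step omitted here, and older statsmodels versions name the damping argument `damped` instead of `damped_trend`:

```python
import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, Holt

def comb_benchmark(y, h):
    """Average of the h-step SES, Holt and Damped Holt forecasts."""
    ses = SimpleExpSmoothing(y).fit().forecast(h)
    holt = Holt(y).fit().forecast(h)
    damped = Holt(y, damped_trend=True).fit().forecast(h)
    return (np.asarray(ses) + np.asarray(holt) + np.asarray(damped)) / 3.0
```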

Page 22: The M4 Competition in Progress

The M4 Competition


The prizes

Prize | Description | Amount
1st Prize | Best performing method according to OWA | 9,000 €
2nd Prize | Second-best performing method according to OWA | 4,000 €
3rd Prize | Third-best performing method according to OWA | 2,000 €
Prediction Intervals Prize | Best performing method according to MSIS | 5,000 €
The UBER Student Prize | Best performing method according to OWA | 5,000 €
The Amazon Prize | The best reproducible forecasting method according to OWA | 2,000 €

Six prizes, totalling 27,000 €

Sponsorships

Page 23: The M4 Competition in Progress

The M4 Competition


The participants (1/2)

✓ 50 submissions (20 with PIs)
✓ 17 countries

[Bar chart: number of participants per country]

Page 24: The M4 Competition in Progress

The M4 Competition


The participants (2/2)

✓ The majority utilized statistical methods or combinations, both of Statistical and ML models; only a few used pure ML ones*.
✓ More than half of the participants were affiliated with academia; the rest were either companies or individuals

[Bar chart: # of Participants per Affiliation Type – University, Company/Organization, Individual]

[Bar chart: # of Participants per Method Type – Combination, Statistical, Machine Learning, Other]

*These are rough classifications – more work is needed to verify them

Page 25: The M4 Competition in Progress

Evaluation of submissions – Point Forecasts


Rankings (1/5)

Rank | Team | Affiliation | Method | sMAPE | MASE | OWA | Diff from Comb (%)
1 | Smyl | Uber Technologies | Hybrid | 11.37 | 1.54 | 0.821 | -8.52
2 | Montero-Manso et al. | University of A Coruña & Monash University | Comb (S & ML) | 11.72 | 1.55 | 0.838 | -6.65
3 | Pawlikowski et al. | ProLogistica Soft | Comb (S) | 11.84 | 1.55 | 0.841 | -6.25
4 | Jaganathan & Prakash | Individual | Comb (S & ML) | 11.70 | 1.57 | 0.842 | -6.17
5 | Fiorucci, J. A. & Louzada | University of Brasilia & University of São Paulo | Comb (S) | 11.84 | 1.55 | 0.843 | -6.10
6 | Petropoulos & Svetunkov | University of Bath & Lancaster University | Comb (S) | 11.89 | 1.57 | 0.848 | -5.55
7 | Shaub | Harvard Extension School | Comb (S) | 12.02 | 1.60 | 0.860 | -4.13
8 | Legaki & Koutsouri | National Technical University of Athens | Statistical | 11.99 | 1.60 | 0.861 | -4.11
9 | Doornik et al. | University of Oxford | Comb (S) | 11.92 | 1.63 | 0.865 | -3.62
10 | Pedregal et al. | University of Castilla-La Mancha | Comb (S) | 12.11 | 1.61 | 0.869 | -3.19
11 | 4Theta (Benchmark) | - | Statistical | 12.15 | 1.63 | 0.874 | -2.65
12 | Roubinchtein | Washington State Employment Security Department | Comb (S) | 12.18 | 1.63 | 0.876 | -2.38
13 | Ibrahim | Georgia Institute of Technology | Statistical | 12.20 | 1.64 | 0.880 | -1.97
14 | Tartu M4 seminar | University of Tartu | Comb (S & ML) | 12.50 | 1.63 | 0.888 | -1.09
15 | Waheeb | Universiti Tun Hussein Onn Malaysia | Comb (S) | 12.15 | 1.71 | 0.894 | -0.40

Page 26: The M4 Competition in Progress

Evaluation of submissions – Point Forecasts


Rankings (2/5)

Rank | Team | Affiliation | Method | sMAPE | MASE | OWA | Diff from Comb (%)
16 | Darin & Stellwagen | Business Forecast Systems (Forecast Pro) | Statistical | 12.28 | 1.69 | 0.895 | 0.25
17 | Dantas & Cyrino Oliveira | Pontifical Catholic University of Rio de Janeiro | Comb (S) | 12.55 | 1.66 | 0.896 | 0.19
18 | Theta (Benchmark) | - | Statistical | 12.31 | 1.70 | 0.897 | 0.03
19 | Comb (Benchmark) | - | Comb (S) | 12.55 | 1.66 | 0.898 | 0.00
20 | Nikzad, A. | Scarsin (i2e) | Comb (S) | 12.37 | 1.72 | 0.907 | -1.01
21 | Damped (Benchmark) | - | Statistical | 12.66 | 1.68 | 0.907 | -1.02
22 | Segura-Heras et al. | Universidad Miguel Hernández & Universitat de Valencia | Comb (S) | 12.51 | 1.72 | 0.910 | -1.38
23 | Trotta | Individual | Machine Learning | 12.89 | 1.68 | 0.915 | -1.94
24 | Chen & Francis | Fordham University | Comb (S) | 12.55 | 1.73 | 0.915 | -1.96
25 | Svetunkov et al. | Lancaster University & University of Newcastle | Comb (S) | 12.46 | 1.74 | 0.916 | -2.01
26 | Talagala et al. | Monash University | Statistical | 12.90 | 1.69 | 0.917 | -2.12
27 | Sui & Rengifo | Fordham University | Comb (S) | 12.85 | 1.74 | 0.930 | -3.56
28 | Kharaghani | Individual | Comb (S) | 13.06 | 1.72 | 0.930 | -3.63
29 | Smart Forecast | Smart Cube | Comb (S) | 13.21 | 1.79 | 0.955 | -6.34
30 | Wainwright et al. | Oracle Corporation (Crystal Ball) | Statistical | 13.34 | 1.80 | 0.962 | -7.15

Page 27: The M4 Competition in Progress

Evaluation of submissions – Point Forecasts


Rankings (3/5): Top 6 performing methods

Smyl, S.
• Hybrid model mixing Exponential Smoothing with LSTM – estimated concurrently
• Hierarchical modeling – parameters estimated using information both from the whole dataset and from the individual series | Combinations are also considered

Montero-Manso, P., Talagala, T., Hyndman, R. J. & Athanasopoulos, G.
• Weighted average of ARIMA, ETS, TBATS, Theta, naïve, seasonal naïve, NN and LSTM
• Weights estimated through a gradient boosting tree (xgboost) using holdout tests

Pawlikowski, M., Chorowska, A. & Yanchuk, O.
• Weighted average of several statistical methods using holdout tests
• Pool defined based on time series characteristics / manual selection

Jaganathan, S. & Prakash, P.
• Combination of statistical methods as described in Armstrong, J. S. (2001)

Fiorucci, J. A. & Louzada, F.
• Weighted average of ARIMA, ETS & Theta
• Weights estimated using cross-validation

Petropoulos, F. & Svetunkov, I.
• Median of ETS, CES, ARIMA & Theta
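The common thread of these entries is pooling base forecasts and combining them. A minimal sketch of the two combination styles (an equal-footing median, as in Petropoulos & Svetunkov, and a weighted average with externally estimated weights, as in Montero-Manso et al.) might look as follows, with `base_forecasts` a k-methods-by-h array and all names illustrative:

```python
import numpy as np

def combine_median(base_forecasts):
    """Median of the base forecasts at each horizon step."""
    return np.median(base_forecasts, axis=0)

def combine_weighted(base_forecasts, weights):
    """Weighted average; the weights might come from holdout errors or
    a gradient boosting model, per the method descriptions above."""
    w = np.asarray(weights, dtype=float)
    return (w / w.sum()) @ base_forecasts
```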

Page 28: The M4 Competition in Progress

Evaluation of submissions – Point Forecasts


Rankings (4/5)

Spearman’s correlation coefficient of the rankings

Correlation | sMAPE | MASE | OWA
sMAPE | - | - | -
MASE | 0.88 | - | -
OWA | 0.94 | 0.98 | -

The final ranks, according to both MASE and sMAPE, are highly correlated with OWA, meaning that either can be used as a proxy to measure the relative performance of the individual methods

Page 29: The M4 Competition in Progress

Evaluation of submissions – Point Forecasts


Rankings (5/5): Multiple Comparisons with the Best (MCB)

OWA Rank | Participant
#1 | Smyl
#2 | Montero-Manso
#3 | Pawlikowski
#4 | Jaganathan
#5 | Fiorucci
#6 | Petropoulos

✓ The forecasts of the first six methods did not statistically differ
✓ Apart from these methods, the improvements of the rest over the benchmarks were minor

[MCB plot: average OWA ranks for the six top submissions (Smyl, Montero-Manso, Pawlikowski, Jaganathan, Fiorucci, Petropoulos) and the benchmarks (Naive, sNaive, Naive2, SES, Holt, Damped, Comb, Theta, MLP, RNN)]

Page 30: The M4 Competition in Progress

Evaluation of submissions – Point Forecasts


What about Complexity? – Future Work

• Does sub-optimality matter? (Nikolopoulos & Petropoulos, 2017)
• Forecasting performance (sMAPE) versus computational complexity (Makridakis et al., 2018)

Page 31: The M4 Competition in Progress


Comparing different types of methods

Median performance per Frequency & Domain

Type of Method | Yearly | Quarterly | Monthly | Weekly | Daily | Hourly | Total
Statistical | 0.93 | 0.93 | 0.95 | 0.97 | 1.00 | 1.00 | 0.97
Machine Learning | 1.27 | 1.16 | 1.20 | 1.00 | 1.93 | 0.92 | 1.48
Combination | 0.87 | 0.90 | 0.92 | 0.90 | 1.02 | 0.65 | 0.91
Other | 0.99 | 1.92 | 1.77 | 8.88 | 9.16 | 2.79 | 1.80

Type of Method | Macro | Micro | Demographic | Industry | Finance | Other | Total
Statistical | 0.95 | 0.98 | 0.95 | 0.99 | 0.97 | 0.97 | 0.98
Machine Learning | 1.20 | 1.16 | 1.44 | 1.43 | 1.41 | 1.56 | 1.48
Combination | 0.90 | 0.89 | 0.90 | 0.93 | 0.92 | 0.91 | 0.91
Other | 1.64 | 1.81 | 1.93 | 1.55 | 2.04 | 1.76 | 1.80

✓ In general, Combinations produced more accurate forecasts than the rest of the methods, regardless of the frequency and the domain of the data
✓ Out of the 17 methods that did better than the benchmarks, 12 were Combinations, 4 were Statistical and 1 was Hybrid
✓ Only 1 pure ML method performed better than Naive2

Page 32: The M4 Competition in Progress


Comparing different types of methods

Top 3 per Frequency & Domain

Frequency | 1st | 2nd | 3rd
Yearly | Smyl, S. (#1) | Legaki, N. Z. (#8) | Montero-Manso, P. (#2)
Quarterly | Montero-Manso, P. (#2) | Smyl, S. (#1) | Petropoulos, F. (#6)
Monthly | Smyl, S. (#1) | Jaganathan, S. (#4) | Montero-Manso, P. (#2)
Weekly | Darin, S. (#16) | Petropoulos, F. (#6) | Pawlikowski, M. (#3)
Daily | Pawlikowski, M. (#3) | Tartu M4 seminar (#14) | Fiorucci, J. A. (#5)
Hourly | Doornik, J. (#9) | Smyl, S. (#1) | Pawlikowski, M. (#3)

Domain | 1st | 2nd | 3rd
Macro | Smyl, S. (#1) | Jaganathan, S. (#4) | Montero-Manso, P. (#2)
Micro | Smyl, S. (#1) | Legaki, N. Z. (#8) | Pawlikowski, M. (#3)
Demographic | Montero-Manso, P. (#2) | Smyl, S. (#1) | Pawlikowski, M. (#3)
Industry | Montero-Manso, P. (#2) | Smyl, S. (#1) | Jaganathan, S. (#4)
Finance | Smyl, S. (#1) | Montero-Manso, P. (#2) | Fiorucci, J. A. (#5)
Other | Smyl, S. (#1) | Pawlikowski, M. (#3) | Montero-Manso, P. (#2)

Legend (color-coded in the original slide): Statistical – Combination

➢ Although the best performing methods for the whole dataset were also very accurate for the individual subsets, in many cases they were outperformed by other methods with a much lower rank – No method to fit them all


Page 33: The M4 Competition in Progress


Impact of forecasting horizon

Average sMAPE across 60 methods (benchmarks & submissions)

Frequency | Deterioration per period (%)
Yearly | 20
Quarterly | 13
Monthly | 6
Weekly | 7
Daily | 14
Hourly | 1

✓ The length of the forecasting horizon has a great impact on forecasting accuracy

✓ Only for hourly data did ML methods become competitive
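A hedged sketch of how such a deterioration figure can be computed, assuming `smape_by_step` is a methods-by-horizon array of sMAPE values per forecast step (an illustrative name, not the authors' exact pipeline):

```python
import numpy as np

def deterioration_per_period(smape_by_step):
    """Average % step-to-step growth of the across-method mean sMAPE."""
    mean_per_step = smape_by_step.mean(axis=0)            # average over methods
    growth = np.diff(mean_per_step) / mean_per_step[:-1]  # relative change per step
    return growth.mean() * 100.0
```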

Page 34: The M4 Competition in Progress


Impact of time series characteristics*

Average impact on forecasting accuracy (regression coefficient) per time series characteristic: k method types × 100,000 observations

Type of Method | Randomness | Trend | Seasonality | Linearity | Stability | Length
Machine Learning | 0.20 | -0.10 | -0.04 | 0.14 | -0.05 | -0.08
Statistical | 0.18 | -0.08 | -0.02 | 0.09 | -0.04 | 0.15
Combination | 0.17 | -0.09 | -0.02 | 0.10 | -0.03 | -0.02
Total | 0.18 | -0.08 | -0.02 | 0.10 | -0.04 | 0.06

Machine Learning:
• More data, better forecasts
• Not robust for noisy and linear series
• Good for seasonal series

Combinations:
• Robust for noisy data
• Bad at capturing seasonality

Statistical:
• Bad for trended & seasonal series
• Good at modeling linear patterns
• The less the data the better (use only the most recent observations)

$$\mathrm{sMAPE} = a \cdot \mathit{Randomness} + b \cdot \mathit{Trend} + \dots + f \cdot \mathit{Length}\ \text{*}$$
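As a sketch of the regression behind these coefficients: per method type, the achieved sMAPE is regressed on the (standardized) series features. Feature extraction is omitted and the helper below is illustrative, not the authors' exact estimation code:

```python
import numpy as np

def feature_impacts(features, smape_values):
    """OLS coefficients a..f of sMAPE on six standardized series features
    (Randomness, Trend, Seasonality, Linearity, Stability, Length)."""
    X = (features - features.mean(axis=0)) / features.std(axis=0)
    y = smape_values - smape_values.mean()   # center so no intercept is needed
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```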

Page 35: The M4 Competition in Progress

Evaluation of submissions – Prediction Intervals


Rankings

Rank | Team | Affiliation | Method | MSIS | Coverage | Diff from Naive (%)
1 | Smyl | Uber Technologies | Hybrid | 12.23 | 94.78% | 49.2%
2 | Montero-Manso et al. | University of A Coruña & Monash University | Comb (S & ML) | 14.33 | 95.96% | 40.4%
3 | Doornik et al. | University of Oxford | Comb (S) | 15.18 | 90.70% | 36.9%
4 | ETS (benchmark) | - | Statistical | 15.68 | 91.27% | 34.8%
5 | Fiorucci & Louzada | University of Brasilia & University of São Paulo | Comb (S) | 15.69 | 88.52% | 34.8%
6 | Petropoulos & Svetunkov | University of Bath & Lancaster University | Comb (S) | 15.98 | 87.81% | 33.6%
7 | Roubinchtein | Washington State Employment Security Department | Comb (S) | 16.50 | 88.93% | 31.4%
8 | Talagala et al. | Monash University | Statistical | 18.43 | 86.48% | 23.4%
9 | ARIMA (benchmark) | - | Statistical | 18.68 | 85.80% | 22.3%
10 | Ibrahim | Georgia Institute of Technology | Statistical | 20.20 | 85.62% | 16.0%
11 | Iqbal et al. | Wells Fargo Securities | Statistical | 22.00 | 86.41% | 8.5%
12 | Reilly | Automatic Forecasting Systems, Inc. (AutoBox) | Statistical | 22.37 | 82.87% | 7.0%
13 | Wainwright et al. | Oracle Corporation (Crystal Ball) | Statistical | 22.67 | 82.99% | 5.7%
14 | Segura-Heras et al. | Universidad Miguel Hernández & Universitat de Valencia | Comb (S) | 22.72 | 90.10% | 5.6%
15 | Naïve (benchmark) | - | Statistical | 24.05 | 86.40% | 0.0%

Page 36: The M4 Competition in Progress


Evaluation of submissions – Prediction Intervals

Median performance per Frequency

✓ Apart from the first two methods, the rest considerably underestimated the actual uncertainty
✓ On average, the coverage of the methods was only 86.4% (the target is 95%)
✓ Estimating uncertainty was more difficult for low frequency data, especially for the yearly series – limited sample & longer forecasting horizon

Page 37: The M4 Competition in Progress


Evaluation of submissions – Prediction Intervals

Median performance per Domain

✓ Demographic and Industry data were easier to predict – slower changes and fewer fluctuations
✓ Micro & Finance data are characterized by the highest levels of uncertainty – a challenge for business forecasting

Page 38: The M4 Competition in Progress


Impact of forecasting horizon

Average Coverage across 23 methods (benchmarks & submissions)

✓ The length of the forecasting horizon has a great impact on estimating the PIs correctly, especially for yearly, quarterly & monthly data


Page 39: The M4 Competition in Progress

Conclusions


✓ Hybrid methods, utilizing basic principles of statistical models and ML components, have great potential

✓ Combining forecasts of different methods significantly improves forecasting accuracy

✓ Pure ML methods are inadequate for time series forecasting

✓ Prediction intervals considerably underestimate the actual uncertainty

The accuracy of individual statistical or ML methods is low; hybrid approaches and combinations of methods are the way forward to improve forecasting accuracy and make forecasting more valuable

Five major findings

Page 40: The M4 Competition in Progress

Conclusions


✓ Complex methods did better than simple ones, but the improvements were not exceptional. Given the computational resources used, one can question whether they are also practical.

✓ The forecasting horizon has a negative effect on forecasting accuracy – both for point forecasts and PIs

✓ When using large samples, the variations reported between different error measures were insignificant

✓ Different methods should be used per series according to their characteristics, as well as their frequency and domain. Yet, learning from the masses seems mandatory.

✓ The majority of the forecasters exploited traditional forecasting approaches and mostly experimented with how to combine them

…and some minor, yet important ones

Page 41: The M4 Competition in Progress

Next Steps


➢ Understand why hybrid methods work better in order to advance them further and improve their forecasting performance

➢ Figure out how combinations should be performed and where the emphasis should be given – pool or weights?

➢ Study the elements of the top performing methods in terms of PIs and learn how to exploit and advance their features to better capture uncertainty

➢ Accept the drawbacks of ML methods and reveal ways to utilize their advantages in time series forecasting

➢ Experiment and discover new, more accurate forecasting approaches

Page 42: The M4 Competition in Progress

Thank you for your attention
Questions?

If you would like to learn more about M4 visit

https://www.m4.unic.ac.cy/

or contact me at

[email protected]


Page 43: The M4 Competition in Progress

References
• Armstrong, J. S., Green, K. C. & Graefe, A. (2015). Golden rule of forecasting: Be conservative. Journal of Business Research, 68(8), 1717-1731
• Armstrong, J. S. (2001). Combining forecasts. Retrieved from https://repository.upenn.edu/marketing_papers/34
• Athanasopoulos, G., Hyndman, R. J., Song, H. & Wu, D. C. (2011). The tourism forecasting competition. International Journal of Forecasting, 27(3), 822-844
• Athanasopoulos, G. & Hyndman, R. J. (2011). The value of feedback in forecasting competitions. International Journal of Forecasting, 27(3), 845-849
• Crone, S. F., Hibon, M. & Nikolopoulos, K. (2011). Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction. International Journal of Forecasting, 27(3), 635-660
• Fildes, R. & Petropoulos, F. (2015). Simple versus complex selection rules for forecasting many time series. Journal of Business Research, 68(8), 1692-1701
• Gneiting, T. & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359-378
• Hyndman, R. J. & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688
• Kang, Y., Hyndman, R. J. & Smith-Miles, K. (2017). Visualising forecasting algorithm performance using time series instance spaces. International Journal of Forecasting, 33(2), 345-358
• Makridakis, S., Hibon, M. & Moser, C. (1979). Accuracy of Forecasting: An Empirical Investigation. Journal of the Royal Statistical Society, Series A (General), 142(2), 97-145
• Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M. et al. (1982). The accuracy of extrapolation (time series) methods: results of a forecasting competition. Journal of Forecasting, 1, 111-153
• Makridakis, S., Chatfield, C., Hibon, M., Lawrence, M., Mills, T. et al. (1993). The M2-competition: A real-time judgmentally based forecasting study. International Journal of Forecasting, 9(1), 5-22
• Makridakis, S. & Hibon, M. (2000). The M3-Competition: results, conclusions and implications. International Journal of Forecasting, 16(4), 451-476
• Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2018). Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLOS ONE, 13(3), 1-26
• Montero-Manso, P., Netto, C. & Talagala, T. (2018). M4comp2018: Data from the M4-Competition. R package version 0.1.0
• Newbold, P. & Granger, C. (1974). Experience with Forecasting Univariate Time Series and the Combination of Forecasts. Journal of the Royal Statistical Society, Series A (General), 137(2), 131-165
• Nikolopoulos, K. & Petropoulos, F. (2017). Forecasting for big data: Does suboptimality matter? Computers & Operations Research (in press)
• Petropoulos, F., Makridakis, S., Assimakopoulos, V. & Nikolopoulos, K. (2014). ‘Horses for Courses’ in demand forecasting. European Journal of Operational Research, 237(1), 152-163
• Spiliotis, E., Patikos, A., Assimakopoulos, V. & Kouloumos, A. (2017). Data as a service: Providing new datasets to the forecasting community for time series analysis. 37th International Symposium on Forecasting, Cairns, Australia
