Visualisation of big time series data

Rob J Hyndman

with Earo Wang, Nikolay Laptev, Yanfei Kang, Kate Smith-Miles

Outline

1 The problem

2 Australian tourism demand

3 M3 competition data

4 Yahoo web traffic

5 What next?

The problem

Spectacle sales

Monthly sales data from 2000 to 2014
Provided by a large spectacle manufacturer
Split by brand (26), gender (3), price range (6), materials (4), and stores (600)
About a million disaggregated series

Fulcher collection

www.comp-engine.org/timeseries

38,190 time series from many sources

Over 20,000 real series from meteorology, medicine, audio, astrophysics, finance, etc.
Over 10,000 simulated series from various chaotic and stochastic models.

FRED: research.stlouisfed.org/fred2/

Quandl: www.quandl.com

How to plot lots of time series?

[Figure: a sequence of time series plots against Time, with both axes running from 0.0 to 1.0.]

Key idea

Examples for time series

lag correlation
size and direction of trend
strength of seasonality
timing of peak seasonality
spectral entropy

Called features or characteristics in the machine learning literature.

John W Tukey

Cognostics

Computer-produced diagnostics (Tukey and Tukey, 1985).
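To make the idea concrete, here is a minimal Python sketch that reduces each series to a few numbers. The function names and the choice of two features (lag-1 autocorrelation and a least-squares trend slope) are illustrative assumptions, not the feature set used in the talk.

```python
from statistics import fmean

def acf1(y):
    """Lag-1 autocorrelation of a sequence of numbers."""
    m = fmean(y)
    num = sum((a - m) * (b - m) for a, b in zip(y[:-1], y[1:]))
    den = sum((a - m) ** 2 for a in y)
    return num / den

def trend_slope(y):
    """Slope of the least-squares line of y against t = 0, 1, ..., n-1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = fmean(y)
    num = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def features(y):
    """One row of the feature table for a single series."""
    return {"acf1": acf1(y), "slope": trend_slope(y)}

# Two hypothetical series, used only for illustration.
series = {
    "rising": [1, 2, 3, 4, 5, 6, 7, 8],
    "zigzag": [1, -1, 1, -1, 1, -1, 1, -1],
}
table = {name: features(y) for name, y in series.items()}
```

Each series becomes one row of `table`, so a large collection can be plotted as points in feature space rather than as thousands of lines.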

Australian tourism demand

Quarterly data on visitor nights from 1998:Q1 to 2013:Q4
From: National Visitor Survey, based on annual interviews of 120,000 Australians aged 15+, collected by Tourism Research Australia.
Split by 7 states, 27 zones and 76 regions (a geographical hierarchy)
Also split by purpose of travel:
Holiday
Visiting friends and relatives (VFR)
Business
Other

304 disaggregated series

Domestic tourism demand: Victoria

[Figure: time series panels for each Victorian region (codes BAA to BEG) and purpose of travel (Hol, Vis, Bus, Oth).]

An STL decomposition
Tourism demand for holidays in Peninsula
$Y_t = S_t + T_t + R_t$, where $S_t$ is periodic with mean 0

[Figure: STL decomposition with panels data, seasonal, trend and remainder, plotted against time (2000 to 2010).]
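The structure of the decomposition can be illustrated with a short sketch. STL itself uses loess smoothing; the code below swaps in a simpler classical additive decomposition (centred moving-average trend plus seasonal means), so it is a stand-in rather than STL. The function name and the example series are hypothetical.

```python
from statistics import fmean

def classical_decompose(y, period):
    """Additive decomposition y = seasonal + trend + remainder.

    A classical moving-average decomposition, used as a simple stand-in
    for STL. Assumes an even period (e.g. 4 for quarterly data).
    """
    n, half = len(y), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = y[t - half:t + half + 1]
        # 2 x m centred moving average: the two end points get half weight.
        trend[t] = (sum(window) - 0.5 * (window[0] + window[-1])) / period
    # Average the detrended values for each position in the cycle.
    raw = []
    for k in range(period):
        vals = [y[t] - trend[t] for t in range(k, n, period) if trend[t] is not None]
        raw.append(fmean(vals))
    centre = fmean(raw)
    effects = [s - centre for s in raw]  # seasonal effects forced to mean 0
    seasonal = [effects[t % period] for t in range(n)]
    remainder = [None if trend[t] is None else y[t] - seasonal[t] - trend[t]
                 for t in range(n)]
    return seasonal, trend, remainder

# Hypothetical quarterly series: linear trend plus a fixed quarterly pattern.
pattern = [2.0, -1.0, -3.0, 2.0]
y = [10 + 0.5 * t + pattern[t % 4] for t in range(24)]
seasonal, trend, remainder = classical_decompose(y, period=4)
```

The trend is undefined for the first and last half-period, which is one practical reason to prefer STL for short series.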

Seasonal stacked bar chart

Place positive values above the origin and negative values below the origin
Map the bar length to the magnitude
Encode quarters by colours

[Figure: seasonal stacked bar chart for Holiday travel; x-axis Regions (BAA to BEG), y-axis Seasonal Component, colour Qtr (Q1 to Q4).]

Seasonal stacked bar chart: VIC

[Figure: seasonal stacked bar charts in four panels (Holiday, VFR, Business, Other); x-axis Regions (BAA to BEG), y-axis Seasonal Component, colour Qtr (Q1 to Q4).]

Trend analysis

Linearity: the long-term direction and strength of trend.

Curvature: the changing direction of trend.

Estimate by regression:

$T_t = \beta_0 + \beta_1 \phi_1(t) + \beta_2 \phi_2(t) + e_t$

where $\phi_k(t)$ is a $k$th-degree orthogonal polynomial in time $t$.

This separates the linearity ($\beta_1$) and curvature ($\beta_2$).
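A sketch of this regression in Python follows. It builds the degree-1 and degree-2 orthogonal polynomials by Gram-Schmidt and scales them to unit length; that scaling is an assumption, since the slide does not state one. Because the regressors are orthonormal and orthogonal to the constant, each coefficient is just an inner product with the trend. Function names are hypothetical.

```python
import math

def orthonormal_poly(n):
    """Orthonormal polynomials of degree 1 and 2 on t = 0..n-1 (Gram-Schmidt)."""
    def dot(a, b):
        return sum(x * z for x, z in zip(a, b))

    def normalise(v):
        norm = math.sqrt(dot(v, v))
        return [x / norm for x in v]

    t = [float(i) for i in range(n)]
    const = normalise([1.0] * n)
    # Degree 1: remove the constant component from t.
    c = dot(t, const)
    phi1 = normalise([x - c * k for x, k in zip(t, const)])
    # Degree 2: remove the constant and linear components from t^2.
    t2 = [x * x for x in t]
    c0, c1 = dot(t2, const), dot(t2, phi1)
    phi2 = normalise([x - c0 * k - c1 * p for x, k, p in zip(t2, const, phi1)])
    return phi1, phi2

def linearity_curvature(trend):
    """Least-squares coefficients of the trend on phi1 (linearity) and phi2 (curvature)."""
    phi1, phi2 = orthonormal_poly(len(trend))
    beta1 = sum(x * p for x, p in zip(trend, phi1))
    beta2 = sum(x * p for x, p in zip(trend, phi2))
    return beta1, beta2

# Hypothetical trend components: a straight line and a symmetric bowl.
straight = [3.0 + 2.0 * t for t in range(20)]
bowl = [(t - 9.5) ** 2 for t in range(20)]
lin_s, cur_s = linearity_curvature(straight)
lin_b, cur_b = linearity_curvature(bowl)
```

The straight line has large linearity and essentially zero curvature, while the bowl has the reverse, so the two coefficients measure different aspects of the trend.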

Trend analysis

[Figure: trend linearity by region (BAA to BEG) in four panels (Holiday, VFR, Business, Other); y-axis Trend Linearity, with a legend for Direction.]

Corrgram of remainder

[Figure: corrgram of the remainder components for all Victorian region-by-purpose series (BEEHol, BEFOth, BEEOth, ..., BEDBus), with a colour scale for the correlations.]

Compute the correlations among the remainder components

Render both the sign and magnitude using a colour mapping of two hues

Order variables according to the first principal component of the correlations.
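The first and third steps can be sketched in plain Python as below. The leading eigenvector of the correlation matrix is found by power iteration, and variables are sorted by their loading on it. The series names and values are hypothetical, and this is only one simple reading of "order by the first principal component".

```python
import math
from statistics import fmean

def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def first_eigenvector(matrix, iterations=500):
    """Leading eigenvector of a symmetric matrix by power iteration."""
    n = len(matrix)
    v = [1.0 + 0.01 * i for i in range(n)]  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def corrgram_order(remainders):
    """Correlation matrix of named series, and names sorted by first-PC loading."""
    names = list(remainders)
    corr = [[correlation(remainders[a], remainders[b]) for b in names] for a in names]
    loading = first_eigenvector(corr)
    order = [name for _, name in sorted(zip(loading, names))]
    return corr, order

# Hypothetical remainder components: A and B move together, C and D oppose them.
remainders = {
    "A": [1.0, -1.0, 2.0, -2.0, 1.0, -1.0],
    "C": [-1.0, 1.2, -2.1, 1.9, -0.8, 1.0],
    "B": [1.1, -0.9, 1.8, -2.2, 1.2, -1.0],
    "D": [-0.9, 1.0, -1.9, 2.1, -1.1, 0.8],
}
corr, order = corrgram_order(remainders)
```

After ordering, series that move together sit next to each other, which is what makes blocks of similar colour visible in the corrgram.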

[Figure: corrgram of the remainder components for the Victorian Holiday series (BDAHol, BDDHol, ..., BEEHol).]

Corrgram of remainder: TAS

[Figure: corrgram of the remainder components for the Tasmanian series (FCAHol, FBBHol, ..., FAABus).]

M3 forecasting competition

"The M3-Competition is a final attempt by the authors to settle the accuracy issue of various time series methods... The extension involves the inclusion of more methods/researchers (in particular in the areas of neural networks and expert systems) and more series."

Makridakis & Hibon, IJF 2000

3003 series

All data from business, demography, finance and economics.

Series length between 14 and 126.

Either non-seasonal, monthly or quarterly.

All time series positive.

Candidate features

STL decomposition: $Y_t = S_t + T_t + R_t$

Seasonal period

Strength of seasonality: $1 - \dfrac{\mathrm{Var}(R_t)}{\mathrm{Var}(Y_t - T_t)}$

Strength of trend: $1 - \dfrac{\mathrm{Var}(R_t)}{\mathrm{Var}(Y_t - S_t)}$

Spectral entropy: $H = -\int_{-\pi}^{\pi} f_y(\lambda) \log f_y(\lambda)\, d\lambda$, where $f_y(\lambda)$ is the spectral density of $Y_t$. Low values of $H$ suggest a time series that is easier to forecast (more signal).

Autocorrelations: $r_1, r_2, r_3, \dots$

Optimal Box-Cox transformation parameter
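A rough sketch of three of these features follows. The strength measures take the decomposition components as given. The spectral entropy is a discrete approximation: the Shannon entropy of the periodogram normalised to sum to one, rescaled to lie between 0 and 1. That estimator, the use of population variances, and all names are assumptions made for illustration.

```python
import cmath
import math
from statistics import pvariance

def strength_of_seasonality(y, trend, remainder):
    """1 - Var(R) / Var(Y - T)."""
    detrended = [a - b for a, b in zip(y, trend)]
    return 1.0 - pvariance(remainder) / pvariance(detrended)

def strength_of_trend(y, seasonal, remainder):
    """1 - Var(R) / Var(Y - S)."""
    deseasonalised = [a - b for a, b in zip(y, seasonal)]
    return 1.0 - pvariance(remainder) / pvariance(deseasonalised)

def spectral_entropy(y):
    """Entropy of the normalised periodogram, scaled to [0, 1]."""
    n = len(y)
    mean = sum(y) / n
    centred = [v - mean for v in y]
    power = []
    for k in range(1, n // 2 + 1):  # positive Fourier frequencies only
        coef = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                   for t, v in enumerate(centred))
        power.append(abs(coef) ** 2)
    total = sum(power)
    probs = [p / total for p in power if p > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(power))

# Hypothetical components of a 16-point series with period 4.
trend = [float(t) for t in range(16)]
seasonal = [2.0, 0.0, -2.0, 0.0] * 4
remainder = [0.1, -0.1] * 8
y = [s + t + r for s, t, r in zip(seasonal, trend, remainder)]
fs = strength_of_seasonality(y, trend, remainder)
ft = strength_of_trend(y, seasonal, remainder)

# A pure sine wave is highly predictable; a single impulse has a flat spectrum.
h_sine = spectral_entropy([math.sin(2 * math.pi * t / 8) for t in range(64)])
h_impulse = spectral_entropy([1.0] + [0.0] * 63)
```

The sine wave concentrates all its power at one frequency and so has entropy near 0, while the impulse spreads power evenly and has entropy near 1.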

Candidate features

[Figures: three example M3 series for each of Seasonality, Trend, ACF1, Spectral entropy and Box-Cox.]

[Figure: scatterplot matrix of the candidate features SpecEntr, Trend, Season, Freq, ACF and Lambda.]

Dimension reduction for time series

[Figure: pipeline from the time series to a scatterplot matrix of features (SpecEntr, Trend, Season, Freq, ACF, Lambda) via feature calculation, and then to a scatterplot of PC2 against PC1 via principal component decomposition.]
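The second step of the pipeline can be sketched as follows: standardise the feature matrix, then project each series onto the first two principal components. The eigenvectors are found by power iteration with deflation to keep the example dependency-free; a real analysis would use a linear algebra library. The feature values below are made up.

```python
import math
from statistics import fmean, pstdev

def standardise(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def top_eigen(matrix, iterations=1000):
    """Largest eigenvalue and its eigenvector, by power iteration."""
    n = len(matrix)
    v = [1.0 + 0.1 * i for i in range(n)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n))
    return value, v

def pca_2d(rows):
    """Scores on the first two PCs and the proportion of variance they explain."""
    z = standardise(rows)
    n, p = len(z), len(z[0])
    corr = [[sum(r[i] * r[j] for r in z) / n for j in range(p)] for i in range(p)]
    l1, v1 = top_eigen(corr)
    # Deflation: remove the first component, then find the next one.
    deflated = [[corr[i][j] - l1 * v1[i] * v1[j] for j in range(p)] for i in range(p)]
    l2, v2 = top_eigen(deflated)
    scores = [(sum(a * b for a, b in zip(r, v1)), sum(a * b for a, b in zip(r, v2)))
              for r in z]
    explained = (l1 + l2) / p  # the trace of a correlation matrix equals p
    return scores, explained

# Hypothetical feature matrix: one row per series, three features per row.
feature_rows = [
    [0.1, 0.2, 5.0],
    [0.4, 0.5, 3.0],
    [0.5, 0.7, 4.0],
    [0.8, 0.9, 1.0],
    [0.9, 1.0, 2.0],
    [0.3, 0.3, 6.0],
]
scores, explained = pca_2d(feature_rows)
```

Each entry of `scores` is the (PC1, PC2) position of one series, which is what gets plotted in the feature-space scatterplots.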

Feature space of M3 data

[Figure: scatterplot of PC2 against PC1 for the M3 series.]

First two PCs explain 68% of variation.

[Figures: the same PC1 versus PC2 scatterplot coloured by the value of each feature in turn: Freq, Season, Trend, ACF, SpecEntr, Lambda.]

Predictability

Three general forecasting methods:

Theta method: best overall in the 2000 M3 competition

ETS: exponential smoothing state space models

STL-AR: AR model applied to the seasonally adjusted series from STL, with the seasonal component forecast using the seasonal naive method.

Compute minimum MASE from all three methods
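As a small illustration, the sketch below computes the mean absolute scaled error (MASE), which scales the out-of-sample mean absolute error by the in-sample mean absolute error of the (seasonal) naive method, and then takes the minimum over methods. The three sets of forecasts are made-up placeholders, not output from Theta, ETS or STL-AR.

```python
from statistics import fmean

def mase(train, test, forecast, period=1):
    """Mean absolute scaled error, scaled by the in-sample (seasonal) naive MAE."""
    scale = fmean(abs(train[t] - train[t - period])
                  for t in range(period, len(train)))
    return fmean(abs(a - f) for a, f in zip(test, forecast)) / scale

train = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
test = [13.0, 15.0]
# Hypothetical forecasts standing in for the output of three methods.
forecasts = {
    "method_a": [14.0, 14.0],
    "method_b": [12.0, 12.0],
    "method_c": [14.8, 15.6],
}
scores = {name: mase(train, test, f) for name, f in forecasts.items()}
best_method = min(scores, key=scores.get)
min_mase = scores[best_method]
```

The minimum MASE gives one predictability number per series, and the name of the winning method gives a label that can be plotted in feature space.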

[Figures: three example series, each shown with the Theta, ETS and STL-AR methods.]

[Figures: PC1 versus PC2 scatterplots of the feature space in three panels (Low, Middle, High), shown for low, medium and high MASE values.]

[Figures: PC1 versus PC2 scatterplots for yearly, quarterly and monthly data, coloured by best method (Ets, NoDiff, Stlmar, Theta), comparing the actual best method with the SVM prediction.]

Generating new time series

We can use the feature space to:

Generate new time series with similar features to existing series

Generate new time series where there are holes in the feature space.

Let $\{PC_1, PC_2, \dots, PC_n\}$ be a population of time series of specified length and period.
A genetic algorithm uses a process of selection, crossover and mutation to evolve the population towards a target point $T_i$.
Optimize: $\text{Fitness}(PC_j) = -\sqrt{|PC_j - T_i|^2}$.
Initial population random with some series in neighbourhood of $T_i$.
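The loop can be sketched as below, under several simplifying assumptions: the feature space is a two-number summary (lag-1 autocorrelation and trend slope) rather than principal components, the initial population is purely random, and the selection, crossover and mutation operators are the simplest possible choices. All names and parameter values are hypothetical.

```python
import math
import random
from statistics import fmean

def feature_point(y):
    """Map a series to a 2-d point: (lag-1 autocorrelation, trend slope)."""
    n, m = len(y), fmean(y)
    den = sum((v - m) ** 2 for v in y)
    acf1 = sum((a - m) * (b - m) for a, b in zip(y, y[1:])) / den
    t_mean = (n - 1) / 2
    slope = (sum((t - t_mean) * (v - m) for t, v in enumerate(y))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return acf1, slope

def fitness(y, target):
    """Negative Euclidean distance from the series' feature point to the target."""
    return -math.dist(feature_point(y), target)

def evolve(target, length=24, pop_size=30, generations=60, seed=1):
    rng = random.Random(seed)
    population = [[rng.gauss(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda y: fitness(y, target), reverse=True)
        history.append(fitness(population[0], target))
        parents = population[: pop_size // 2]  # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # single-point crossover
            child = [v + rng.gauss(0, 0.1) for v in a[:cut] + b[cut:]]  # mutation
            children.append(child)
        population = parents + children
    best = max(population, key=lambda y: fitness(y, target))
    return best, history + [fitness(best, target)]

best, history = evolve(target=(0.5, 0.2))
```

Because the fitter half survives unchanged, the best fitness in `history` never gets worse from one generation to the next.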

Evolving new time series

[Figures: target points A to F in the PC1 versus PC2 feature space, with the target series A, B and C and the evolved series A to F.]

[Figures: PC1 versus PC2 scatterplots of the targets and of the evolved yearly, quarterly and monthly data.]

Questions raised

Can SVM be used to create a forecast selection routine to give better forecasts?

How much do M3 conclusions depend on the particular set of time series involved?

Has the M3 data set biased forecast method development?

What other features should we consider? What difference does it make?

Is PCA the right approach? Perhaps we should use multidimensional scaling? Or something else?

Should we use more than 2 PC dimensions?

Yahoo web-traffic
Tens of thousands of time series collected at one-hour intervals over one month.
Consisting of several server metrics (e.g. CPU usage and paging views) from many server farms globally.
Aim: find unusual (anomalous) time series.

[Figures: example hourly series for three server metrics, five series each (busy, memory, paging), plotted as value against date.]

Feature space
ACF1: first order autocorrelation = $\mathrm{Corr}(Y_t, Y_{t-1})$
Strength of trend and seasonality based on STL
Trend linearity and curvature
Size of seasonal peak and trough
Spectral entropy
Lumpiness: variance of block variances (block size 24).
Spikiness: variance of leave-one-out variances of STL remainders.
Level shift: maximum difference in trimmed means of consecutive moving windows of size 24.
Variance change: maximum difference in variances of consecutive moving windows of size 24.
Flat spots: discretize sample space into 10 equal-sized intervals. Find max run length in any interval.
Number of crossing points of mean line.
Kullback-Leibler score: maximum of $D_{KL}(P \| Q) = \int P(x) \ln \frac{P(x)}{Q(x)}\, dx$, where $P$ and $Q$ are estimated by kernel density estimators applied to consecutive windows of size 48.
Change index: time of maximum KL score
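Four of the simpler features in this list can be sketched in a few lines each. These are simplified readings of the definitions above: plain means replace trimmed means in the level shift, population variances are used throughout, and the function names and example series are hypothetical.

```python
from statistics import fmean, pvariance

def lumpiness(y, width=24):
    """Variance of the variances of non-overlapping blocks."""
    blocks = [y[i:i + width] for i in range(0, len(y) - width + 1, width)]
    return pvariance([pvariance(b) for b in blocks])

def level_shift(y, width=24):
    """Largest absolute difference in means of consecutive windows (no trimming)."""
    diffs = [abs(fmean(y[i:i + width]) - fmean(y[i - width:i]))
             for i in range(width, len(y) - width + 1)]
    return max(diffs)

def flat_spots(y, bins=10):
    """Longest run of consecutive values falling in the same equal-width interval."""
    lo, hi = min(y), max(y)
    span = (hi - lo) or 1.0
    codes = [min(int((v - lo) / span * bins), bins - 1) for v in y]
    best = run = 1
    for prev, cur in zip(codes, codes[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def crossing_points(y):
    """Number of times the series crosses its mean."""
    m = fmean(y)
    above = [v > m for v in y]
    return sum(a != b for a, b in zip(above, above[1:]))

# Hypothetical hourly series: two days around level 10, then two days around level 20.
y = [10 + (-1) ** t for t in range(48)] + [20 + (-1) ** t for t in range(48)]
summary = {
    "lumpiness": lumpiness(y),
    "lshift": level_shift(y),
    "fspots": flat_spots(y),
    "cpoints": crossing_points(y),
}
```

For this series the level shift picks up the jump from 10 to 20, while lumpiness is zero because every block has the same variance.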

Principal component analysis

[Figure: PCA biplot of the features (ACF1, lumpiness, entropy, lshift, vchange, cpoints, fspots, trend, linearity, curvature, spikiness, season, peak, trough, klscore, change.idx); standardized PC1 (28.7% explained var.) against standardized PC2 (17.3% explained var.).]

What is anomalous

[Figure: PCA biplot of the features, with standardized PC2 (17.3% explained var.) on the vertical axis.]

We need a measure of the anomalousness of a time series.

1 Rank points based on their local density.
2 Rank points based on whether they are within $\alpha$-convex hulls of different radius.

Bivariate kernel density

$\hat{f}(x; H) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - X_i)$

$\{X_1, X_2, \dots, X_n\}$ is a bivariate random sample
$K_H(x)$ is the standard normal kernel function

$H$ estimated by minimizing the sum of AMISE

Rank points based on $\hat{f}$ values in 2d PCA space.
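A minimal sketch of the ranking step follows. It uses a Gaussian kernel with bandwidth matrix $H = h^2 I$ and a fixed $h$, which is an assumption made to keep the example short; it does not estimate $H$ by minimising AMISE. The points and function names are hypothetical.

```python
import math

def kde_at(x, sample, h):
    """Bivariate Gaussian kernel density estimate at x with H = h^2 * I."""
    total = 0.0
    for xi in sample:
        d2 = (x[0] - xi[0]) ** 2 + (x[1] - xi[1]) ** 2
        total += math.exp(-d2 / (2 * h * h))
    return total / (len(sample) * 2 * math.pi * h * h)

def rank_by_density(sample, h=1.0):
    """Indices of the sample points ordered from lowest to highest density."""
    dens = [kde_at(x, sample, h) for x in sample]
    return sorted(range(len(sample)), key=dens.__getitem__)

# Hypothetical points in the 2d PCA space: a tight cluster plus one outlier.
points = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (0.1, -0.2), (6.0, 5.0)]
ranking = rank_by_density(points)
```

The first index in `ranking` is the point with the lowest estimated density, that is, the most anomalous series under this measure.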

Bivariate density ranking

[Figure: scatterplot of pc2 against pc1 with five points labelled 1 to 5.]

[Figure: the series S7793, S8494, S10464, S7833 and S1715, plotted as value against date.]

$\alpha$-convex hulls
The space generated by point pairs that can be touched by an empty disc of radius $\alpha$.
$\alpha \to \infty$ gives a convex hull.
Points can become isolated when $\alpha$ is small.

We rank points based on the value of $\alpha$ when they become isolated.

$\alpha$-convex hull ranking

[Figure: scatterplot in the 2d PCA space with five points labelled 1 to 5.]

[Figure: the series S10464, S7793, S8494, S1715 and S7826, plotted as value against date.]

HDR versus $\alpha$-convex hull

[Figures: scatterplots of pc2 against pc1 with five points labelled 1 to 5, one for the HDR boxplot and one for the $\alpha$-convex hull.]

Top 5 anomalous time series

[Figures: the top five series under each method, plotted as value against date. HDR: S7793, S8494, S10464, S7833, S1715. $\alpha$-convex hull: S10464, S7793, S8494, S1715, S7826.]

What next?

Develop a more comprehensive set of features that are reliable measures and fast to compute (e.g., for finance data).
Consider other dimension reduction methods and more than 2 dimensions.
Develop dynamic and interactive visualization tools.
Make methods available in an R package.

Some of the methods are already available in the anomalous package for R on GitHub.

Papers: robjhyndman.com

Code: github.com/robjhyndman

Email: [email protected]
