indexing and mining streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1...

347
CMU SCS Indexing and Mining Time Sequences Christos Faloutsos and Lei Li CMU

Upload: others

Post on 18-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Indexing and Mining Time

Sequences

Christos Faloutsos and Lei Li

CMU

Page 2: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 2

Outline

• Motivation

• Similarity Search and Indexing

• DSP (Digital Signal Processing)

• Linear Forecasting

• Kalman filters

• fractals and multifractals

• Non-linear forecasting

• Conclusions

Page 3: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 3

Problem definition

• Given: one or more sequences

x1 , x2 , … , xt , …

(y1, y2, … , yt, …

… )

• Find

– similar sequences; forecasts

– patterns; clusters; outliers

Page 4: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 4

Motivation - Applications

• Financial, sales, economic series

• Medical

– reactions to new drugs

– elderly care

Page 5: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

ECG - physionet.org

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 5

Page 6: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

EEG - epilepsy

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 6from wikipedia

Page 7: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 7

Motivation - Applications

(cont‟d)

• „Smart house‟

– sensors monitor temperature, humidity,

air quality

• video surveillance

Page 8: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 8

Motivation - Applications

(cont‟d)

• civil/automobile infrastructure

– bridge vibrations [Oppenheim+02]

– road conditions / traffic monitoring

Page 9: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 9

Motivation - Applications

(cont‟d)

• Weather, environment/anti-pollution

– volcano monitoring

– air/water pollutant monitoring

Page 10: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 10

Motivation - Applications

(cont‟d)

• Computer systems

– „Active Disks‟ (buffering, prefetching)

– web servers (ditto)

– network traffic monitoring

– ...

Page 11: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 11

Stream Data: Disk accesses

time

#bytes

Page 12: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 12

Problem #1:

Goal: given a signal (eg., #packets over time)

Find: patterns, periodicities, and/or compress

year

count lynx caught per year

(packets per day;

temperature per day)

Page 13: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 13

Problem#2: Forecast

Given xt, xt-1, …, forecast xt+1

0

10

20

30

40

50

60

70

80

90

1 3 5 7 9 11

Time Tick

Nu

mb

er o

f p

ack

ets

sen

t

??

Page 14: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 14

Problem#2‟: Similarity search

Eg., Find a 3-tick pattern, similar to the last one

0

10

20

30

40

50

60

70

80

90

1 3 5 7 9 11

Time Tick

Nu

mb

er o

f p

ack

ets

sen

t

??

Page 15: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 15

Problem #3:

• Given: A set of correlated time sequences

• Forecast „Sent(t)‟

0

10

20

30

40

50

60

70

80

90

1 3 5 7 9 11

Time Tick

Nu

mb

er o

f p

ack

ets

sent

lost

repeated

Page 16: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 16

Important observations

Patterns, rules, forecasting and similarity indexing are closely related:

• To do forecasting, we need

– to find patterns/rules

– to find similar settings in the past

• to find outliers, we need to have forecasts

– (outlier = too far away from our forecast)

Page 17: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 17

Important topics NOT in this

tutorial:

• Continuous queries

– [Babu+Widom ] [Gehrke+] [Madden+]

• Categorical data streams

– [Hatonen+96]

• Outlier detection (discontinuities)

– [Breunig+00]

Page 18: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 18

Outline

• Motivation

• Similarity Search and Indexing

• DSP

• Linear Forecasting

• Bursty traffic - fractals and multifractals

• Non-linear forecasting

• Conclusions

Page 19: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 19

Outline

• Motivation

• Similarity Search and Indexing

– distance functions: Euclidean;Time-warping

– indexing

– feature extraction

• DSP

• ...

Page 20: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 20

Importance of distance functions

Subtle, but absolutely necessary:

• A „must‟ for similarity indexing (->

forecasting)

• A „must‟ for clustering

Two major families

– Euclidean and Lp norms

– Time warping and variations

Page 21: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 21

Euclidean and Lp

n

i

ii yxyxD1

2)(),(x(t) y(t)

...

n

i

p

iip yxyxL1

||),(

•L1: city-block = Manhattan

•L2 = Euclidean

•L

Page 22: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 22

Observation #1

• Time sequence -> n-d

vector

...

Day-1

Day-2

Day-n

Page 23: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 23

Observation #2

Euclidean distance is

closely related to

– cosine similarity

– dot product

– „cross-correlation‟

function

...

Day-1

Day-2

Day-n

Page 24: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 24

Time Warping

• allow accelerations - decelerations

– (with or w/o penalty)

• THEN compute the (Euclidean) distance (+

penalty)

• related to the string-editing distance

Page 25: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 25

Time Warping

„stutters‟:

Page 26: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 26

Time Warping

Q: how to compute it?

A: dynamic programming

D( i, j ) = cost to match

prefix of length i of first sequence x with prefix of length j of second sequence y

Skip

Page 27: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 27

Thus, with no penalty for stutter, for sequences

x1, x2, …, xi,; y1, y2, …, yj

),1(

)1,(

)1,1(

min][][),(

jiD

jiD

jiD

jyixjiD x-stutter

y-stutter

no stutter

Time WarpingSkip

Page 28: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 34

Other Distance functions

• piece-wise linear/flat approx.; compare

pieces [Keogh+01] [Faloutsos+97]

• „cepstrum‟ (for voice [Rabiner+Juang])

– do DFT; take log of amplitude; do DFT again!

• Allow for small gaps [Agrawal+95]

Page 29: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

More distance functions.

• Chen + Ng [vldb‟04]: ERP „Edit distance

with Real Penalty‟: give a penalty to stutters

• Keogh+ [kdd‟04]: VERY NICE, based on

information theory: compress each

sequence (quantize + Lempel-Ziv), using

the other sequences‟ LZ tables

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 35

On The Marriage of Lp-norms and Edit Distance, Lei Chen,

Raymond T. Ng:, VLDB‟04

Towards Parameter-Free Data Mining, E. Keogh, S. Lonardi,

C.A. Ratanamahatana, KDD‟04

Page 30: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 36

Conclusions

Prevailing distances:

– Euclidean and

– time-warping

Page 31: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 37

Outline

• Motivation

• Similarity Search and Indexing

– distance functions

– indexing

– feature extraction

• DSP

• ...

Page 32: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 38

Indexing

Problem:

• given a set of time sequences,

• find the ones similar to a desirable query

sequence

Page 33: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 39

day

$price

1 365

day

$price

1 365

day

$price

1 365

distance function: by expert

Page 34: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 40

Idea: „GEMINI‟

Eg., „find stocks similar to MSFT’

Seq. scanning: too slow

How to accelerate the search?

[Faloutsos96]

Page 35: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 41

day1 365

day1 365

S1

Sn

F(S1)

F(Sn)

„GEMINI‟ - Pictorially

eg, avg

eg,. std

Page 36: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 42

GEMINI

Solution: Quick-and-dirty' filter:

• extract n features (numbers, eg., avg., etc.)

• map into a point in n-d feature space

• organize points with off-the-shelf spatial

access method („SAM‟)

• discard false alarms

Page 37: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 43

Examples of GEMINI

• Time sequences: DFT (up to 100 times

faster) [SIGMOD94];

• [Kanellakis+], [Mendelzon+]

Page 38: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 44

Indexing - SAMs

Q: How do Spatial Access Methods (SAMs)

work?

A: they group nearby points (or regions)

together, on nearby disk pages, and answer

spatial queries quickly („range queries‟,

„nearest neighbor‟ queries etc)

For example:

Page 39: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 45

R-trees

• [Guttman84] eg., w/ fanout 4: group nearby

rectangles to parent MBRs; each group ->

disk page

A

B

C

DE

FG

H

I

J

Skip

Page 40: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 46

R-trees

• eg., w/ fanout 4:

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

F GD E

H I JA B C

Skip

Page 41: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 47

R-trees

• eg., w/ fanout 4:

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

P1 P2 P3 P4

F GD E

H I JA B C

Skip

Page 42: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 48

R-trees - range search?

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

P1 P2 P3 P4

F GD E

H I JA B C

Skip

Page 43: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 49

R-trees - range search?

A

B

C

DE

FG

H

I

J

P1

P2

P3

P4

P1 P2 P3 P4

F GD E

H I JA B C

Skip

Page 44: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 50

Conclusions

• Fast indexing: through GEMINI

– feature extraction and

– (off the shelf) Spatial Access Methods

[Gaede+98]

Page 45: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 51

Outline

• Motivation

• Similarity Search and Indexing

– distance functions

– indexing

– feature extraction

• DSP

• ...

Page 46: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 52

Outline

• Motivation

• Similarity Search and Indexing

– distance functions

– indexing

– feature extraction

• DFT, DWT, DCT (data independent)

• SVD, etc (data dependent)

• MDS, FastMap

Page 47: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 53

DFT and cousins

• very good for compressing real signals

• more details on DFT/DCT/DWT: later

Page 48: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 54

DFT and stocks

0.00

2000.00

4000.00

6000.00

8000.00

10000.00

12000.00

1 11 21 31 41 51 61 71 81 91 101 111 121

Fourier appx actual

• Dow Jones Industrial

index, 6/18/2001-

12/21/2001

Page 49: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 55

DFT and stocks

0.00

2000.00

4000.00

6000.00

8000.00

10000.00

12000.00

1 11 21 31 41 51 61 71 81 91 101 111 121

Fourier appx actual

• Dow Jones Industrial

index, 6/18/2001-

12/21/2001

• just 3 DFT

coefficients give very

good approximation

1

10

100

1000

10000

100000

1000000

10000000

1 11 21 31 41 51 61 71 81 91 101 111 121

Series1

freq

Log(ampl)

Page 50: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 56

Outline

• Motivation

• Similarity Search and Indexing

– distance functions

– indexing

– feature extraction

• DFT, DWT, DCT (data independent)

• SVD etc (data dependent)

• MDS, FastMap

Page 51: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 57

SVD

• THE optimal method for dimensionality

reduction

– (under the Euclidean metric)

Page 52: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 58

Singular Value Decomposition

(SVD)• SVD (~LSI ~ KL ~ PCA ~ spectral

analysis...) LSI: S. Dumais; M. Berry

KL: eg, Duda+Hart

PCA: eg., Jolliffe

Details: [Press+],

[Faloutsos96]

day1

day2

Page 53: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 59

SVD

• Extremely useful tool

– (also behind PageRank/google and Kleinberg‟s

algorithm for hubs and authorities)

• But may be slow: O(N * M * M) if N>M

• any approximate, faster method?

Page 54: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 60

SVD shorcuts

• random projections (Johnson-Lindenstrauss

thm [Papadimitriou+ pods98])

Page 55: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 61

Random projections

• pick „enough‟ random directions (will be

~orthogonal, in high-d!!)

• distances are preserved probabilistically,

within epsilon

• (also, use as a pre-processing step for SVD

[Papadimitriou+ PODS98])

Page 56: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 62

Outline

• Motivation

• Similarity Search and Indexing

– distance functions

– indexing

– feature extraction

• DFT, DWT, DCT (data independent)

• SVD etc (data dependent), ICA

• MDS, FastMap

Page 57: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 63

Citation

• AutoSplit: Fast and Scalable Discovery of

Hidden Variables in Stream and Multimedia

Databases, Jia-Yu Pan, Hiroyuki Kitagawa,

Christos Faloutsos and Masafumi

Hamamoto

PAKDD 2004, Sydney, Australia

Page 58: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 64

Motivation:

(Q1) Find patterns in data• Motion capture data (broad jumps)

Left Knee

Right Knee

Energy exerted

Take-off

Landing

Page 59: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 65

PCA sometimes misses essential

features

• Best SVD axis: not always meaningful!

Page 60: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 66

Motivation:

(Q1) Find patterns in data

• Human would say

– Pattern 1: along

diagonal

– Pattern 2: along

vertical axis

• How to find these

automatically? Left Knee

Right Knee

60:1

1:1

Page 61: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 67

Motivation:

(Q2) Find hidden variables

Dow Jones Industrial Average

Alcoa

American

Express

Boeing

Caterpillar

Citi

Group

Find common

hidden variables,

and weights.

Page 62: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 68

Motivation:

(Q2) Find hidden variables

Hidden variable 1 Hidden variable 2

B1,CAT

B1,INTC B2,CAT

B2,INTC?

? ?

Caterpillar Intel

Page 63: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 69

Motivation:

(Q2) Find hidden variables

“Hidden variable 1” “Hidden variable 2”

0.940.63 0.03

0.64

Caterpillar Intel

Page 64: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 70

Motivation:

(Q2) Find hidden variables

“General trend” “Internet bubble”

0.940.63 0.03

0.64

Caterpillar Intel

Page 65: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 80

Outline

• Motivation

• Similarity Search and Indexing

– distance functions

– indexing

– feature extraction

• DFT, DWT, DCT (data independent)

• SVD (data dependent)

• MDS, FastMap

Page 66: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 81

MDS / FastMap

• but, what if we have NO points to start

with?

(eg. Time-warping distance)

• A: Multi-dimensional Scaling (MDS) ;

FastMap

Page 67: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 82

MDS/FastMap

01100100100O5

10100100100O4

100100011O3

100100101O2

100100110O1

O5O4O3O2O1~100

~1

Page 68: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 83

MDS

Multi Dimensional

Scaling

Page 69: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 84

FastMap

• Multi-dimensional scaling (MDS) can do

that, but in O(N**2) time

• FastMap [Faloutsos+95] takes O(N) time

Page 70: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 85

FastMap: Application

VideoTrails [Kobla+97]

scene-cut detection (about 10% errors)

Page 71: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 86

Outline

• Motivation

• Similarity Search and Indexing

– distance functions

– indexing

– feature extraction

• DFT, DWT, DCT (data independent)

• SVD (data dependent)

• MDS, FastMap, IsoMap etc

Page 72: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li #87

Variations

• Isomap [Tenenbaum, de Silva, Langford,

2000]

• LLE (Local Linear Embedding) [Roweis,

Saul, 2000]

• MVE (Minimum Volume Embedding)

[Shaw & Jebara, 2007]

Page 73: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li #88

Variations

• Isomap [Tenenbaum, de Silva, Langford,

2000]

• LLE (Local Linear Embedding) [Roweis,

Saul, 2000]

• MVE (Minimum Volume Embedding)

[Shaw & Jebara, 2007]

Page 74: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 89

Outline

• Motivation

• Similarity Search and Indexing

– distance functions

– indexing

– feature extraction

• DFT, DWT, DCT (data independent)

• SVD (data dependent)

• MDS, FastMap

Page 75: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 90

Conclusions - Practitioner‟s

guide

Similarity search in time sequences

1) establish/choose distance (Euclidean, time-

warping,…)

2) extract features (SVD, DWT, MDS), and use

an SAM (R-tree/variant) or a Metric Tree (M-

tree)

2‟) for high intrinsic dimensionalities, consider sequential

scan (it might win…)

Page 76: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 91

Books

• William H. Press, Saul A. Teukolsky, William T.

Vetterling and Brian P. Flannery: Numerical Recipes in C,

Cambridge University Press, 1992, 2nd Edition. (Great

description, intuition and code for SVD)

• C. Faloutsos: Searching Multimedia Databases by Content,

Kluwer Academic Press, 1996 (introduction to SVD, and

GEMINI)

Page 77: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 92

References

• Agrawal, R., K.-I. Lin, et al. (Sept. 1995). Fast Similarity

Search in the Presence of Noise, Scaling and Translation in

Time-Series Databases. Proc. of VLDB, Zurich,

Switzerland.

• Babu, S. and J. Widom (2001). “Continuous Queries over

Data Streams.” SIGMOD Record 30(3): 109-120.

• Breunig, M. M., H.-P. Kriegel, et al. (2000). LOF:

Identifying Density-Based Local Outliers. SIGMOD

Conference, Dallas, TX.• Berry, Michael: http://www.cs.utk.edu/~lsi/

Page 78: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 93

References

• Ciaccia, P., M. Patella, et al. (1997). M-tree: An Efficient

Access Method for Similarity Search in Metric Spaces.

VLDB.

• Foltz, P. W. and S. T. Dumais (Dec. 1992). “Personalized

Information Delivery: An Analysis of Information

Filtering Methods.” Comm. of ACM (CACM) 35(12): 51-

60.

• Guttman, A. (June 1984). R-Trees: A Dynamic Index

Structure for Spatial Searching. Proc. ACM SIGMOD,

Boston, Mass.

Page 79: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 94

References

• Gaede, V. and O. Guenther (1998). “Multidimensional

Access Methods.” Computing Surveys 30(2): 170-231.

• Gehrke, J. E., F. Korn, et al. (May 2001). On Computing

Correlated Aggregates Over Continual Data Streams.

ACM Sigmod, Santa Barbara, California.

Page 80: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li Part2.1 #95

References

• Gunopulos, D. and G. Das (2001). Time Series

Similarity Measures and Time Series Indexing.

SIGMOD Conference, Santa Barbara, CA.

• Eamonn J. Keogh, Themis Palpanas, Victor B.

Zordan, Dimitrios Gunopulos, Marc Cardle:

Indexing Large Human-Motion Databases. VLDB

2004: 780-791

Page 81: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 96

References

• Hatonen, K., M. Klemettinen, et al. (1996). Knowledge

Discovery from Telecommunication Network Alarm

Databases. ICDE, New Orleans, Louisiana.

• Jolliffe, I. T. (1986). Principal Component Analysis,

Springer Verlag.

Page 82: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 97

References

• Keogh, E. J., K. Chakrabarti, et al. (2001). Locally

Adaptive Dimensionality Reduction for Indexing Large

Time Series Databases. SIGMOD Conference, Santa

Barbara, CA.

• Kobla, V., D. S. Doermann, et al. (Nov. 1997).

VideoTrails: Representing and Visualizing Structure in

Video Sequences. ACM Multimedia 97, Seattle, WA.

Page 83: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 98

References

• Oppenheim, I. J., A. Jain, et al. (March 2002). A MEMS

Ultrasonic Transducer for Resident Monitoring of Steel

Structures. SPIE Smart Structures Conference SS05, San

Diego.

• Papadimitriou, C. H., P. Raghavan, et al. (1998). Latent

Semantic Indexing: A Probabilistic Analysis. PODS,

Seattle, WA.

• Rabiner, L. and B.-H. Juang (1993). Fundamentals of

Speech Recognition, Prentice Hall.

Page 84: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 99

References

• Traina, C., A. Traina, et al. (October 2000). Fast feature

selection using the fractal dimension,. XV Brazilian

Symposium on Databases (SBBD), Paraiba, Brazil.

Page 85: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 100

References• Dennis Shasha and Yunyue Zhu High Performance

Discovery in Time Series: Techniques and Case Studies Springer 2004

• Yunyue Zhu, Dennis Shasha ``StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time'‘ VLDB, August, 2002. pp. 358-369.

• Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. The Design of an Acquisitional Query Processor for Sensor Networks. SIGMOD, June 2003, San Diego, CA.

Page 86: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li #101

References

• Lawrence Saul & Sam Roweis. An Introduction to Locally Linear Embedding (draft)

• Sam Roweis & Lawrence Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, v.290 no.5500 , Dec.22, 2000. pp.2323--2326.

• B. Shaw and T. Jebara. "Minimum Volume Embedding" . Artificial Intelligence and Statistics, AISTATS, March 2007.

Page 87: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li #102

References

• Josh Tenenbaum, Vin de Silva and John Langford. A

Global Geometric Framework for Nonlinear

dimensionality Reduction. Science 290, pp. 2319-2323,

2000.

Page 88: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 103

Page 89: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 104

Outline

• Motivation

• Similarity Search and Indexing

• DSP (DFT, DWT)

• Linear Forecasting

• Kalman filters

• Bursty traffic - fractals and multifractals

• Non-linear forecasting

• Conclusions

Page 90: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 105

Outline

• DFT

– Definition of DFT and properties

– how to read the DFT spectrum

• DWT

– Definition of DWT and properties

– how to read the DWT scalogram

Page 91: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 106

Introduction - Problem#1

Goal: given a signal (eg., packets over time)

Find: patterns and/or compress

year

count

lynx caught per year

(packets per day;

automobiles per hour)-2000

0

2000

4000

6000

8000

1 14 27 40 53 66 79 92 105

Page 92: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 107

What does DFT do?

A: highlights the periodicities

Page 93: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 108

DFT: definition• For a sequence x0, x1, … xn-1

• the (n-point) Discrete Fourier Transform is

• X0, X1, … Xn-1 :

)/2exp(*/1

)1(

1,,0)/2exp(*/1

1

0

1

0

ntfjXnx

j

nfntfjxnX

n

t

ft

n

t

tf

inverse DFT

Skip

Page 94: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 109

DFT: definition

• Good news: Available in all symbolic math

packages, eg., in „mathematica‟

x = [1,2,1,2];

X = Fourier[x];

Plot[ Abs[X] ];

Page 95: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 110

DFT: Amplitude spectrum

actual mean mean+freq12

1

12

23

34

45

56

67

78

89

10

0

11

1

year

count

Freq.

Ampl.

freq=12

freq=0

)(Im)(Re 222

fffXXA Amplitude:

Page 96: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 111

DFT: examples

flat

0.000

0.010

0.020

0.030

0.040

0.050

0.060

0.070

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0

0.2

0.4

0.6

0.8

1

1.2

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

time freq

Amplitude

Skip

Page 97: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 112

DFT: examples

Low frequency sinusoid

-0.150

-0.100

-0.050

0.000

0.050

0.100

0.150

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0

0.2

0.4

0.6

0.8

1

1.2

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

time freq

Skip

Page 98: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 113

DFT: examples

• Sinusoid - symmetry property: Xf = X*n-f

-0.150

-0.100

-0.050

0.000

0.050

0.100

0.150

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0

0.2

0.4

0.6

0.8

1

1.2

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

time freq

Skip

Page 99: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 114

DFT: examples

• Higher freq. sinusoid

-0.080

-0.060

-0.040

-0.020

0.000

0.020

0.040

0.060

0.080

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0

0.1

0.2

0.3

0.4

0.5

0.6

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

time freq

Skip

Page 100: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 115

DFT: examples

examples

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

=

+

+

Skip

Page 101: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 116

DFT: examples

examples

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0

0.2

0.4

0.6

0.8

1

1.2

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Freq.

Ampl.

Skip

Page 102: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 117

Outline

• Motivation

• Similarity Search and Indexing

• DSP

• Linear Forecasting

• Bursty traffic - fractals and multifractals

• Non-linear forecasting

• Conclusions

Page 103: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 118

Outline

• Motivation

• Similarity Search and Indexing

• DSP

– DFT

• Definition of DFT and properties

• how to read the DFT spectrum

– DWT

Page 104: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 119

DFT: Amplitude spectrum

actual mean mean+freq12

1

12

23

34

45

56

67

78

89

10

0

11

1

year

count

Freq.

Ampl.

freq=12

freq=0

)(Im)(Re 222

fffXXA Amplitude:

Page 105: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 120

DFT: Amplitude spectrum

actual mean mean+freq12

1

12

23

34

45

56

67

78

89

10

0

11

1

year

count

Freq.

Ampl.

freq=12

freq=0

Page 106: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 121

1

12

23

34

45

56

67

78

89

10

0

11

1

DFT: Amplitude spectrum

actual mean mean+freq12

year

count

Freq.

Ampl.

freq=12

freq=0

Page 107: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 122

DFT: Amplitude spectrum

• excellent approximation, with only 2

frequencies!

• so what?

actual mean mean+freq12

1

12

23

34

45

56

67

78

89

10

0

11

1

Freq.

Page 108: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 123

DFT: Amplitude spectrum

• excellent approximation, with only 2

frequencies!

• so what?

• A1: (lossy) compression

• A2: pattern discovery 1

12

23

34

45

56

67

78

89

10

0

11

1

Page 109: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 124

DFT: Amplitude spectrum

• excellent approximation, with only 2

frequencies!

• so what?

• A1: (lossy) compression

• A2: pattern discovery

actual mean mean+freq12

Page 110: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 125

DFT - Conclusions

• It spots periodicities (with the „amplitude spectrum’)

• can be quickly computed (O( n log n)), thanks to the FFT algorithm.

• standard tool in signal processing (speech, image etc signals)

• (closely related to DCT and JPEG)

Page 111: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 126

Outline

• Motivation

• Similarity Search and Indexing

• DSP

– DFT

– DWT

• Definition of DWT and properties

• how to read the DWT scalogram

Page 112: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 127

Problem #1:

Goal: given a signal (eg., #packets over time)

Find: patterns, periodicities, and/or compress

year

count lynx caught per year

(packets per day;

virus infections per month)

Page 113: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 128

Wavelets - DWT

• DFT is great - but, how about compressing

a spike?

value

time

0

0.2

0.4

0.6

0.8

1

1.2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Page 114: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 129

0

0.2

0.4

0.6

0.8

1

1.2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Wavelets - DWT

• DFT is great - but, how about compressing

a spike?

• A: Terrible - all DFT coefficients needed!

0

0.2

0.4

0.6

0.8

1

1.2

1 3 5 7 9 11 13 15

Freq.

Ampl.value

time

Page 115: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 130

Wavelets - DWT

• DFT is great - but, how about compressing

a spike?

• A: Terrible - all DFT coefficients needed!

0

0.2

0.4

0.6

0.8

1

1.2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

value

time

0

0.2

0.4

0.6

0.8

1

1.2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Page 116: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 131

Wavelets - DWT

• Similarly, DFT suffers on short-duration

waves (eg., baritone, silence, soprano)

time

value

Page 117: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 132

Wavelets - DWT

• Solution#1: Short window Fourier

transform (SWFT)

• But: how short should be the window?

time

freq

time

value

Page 118: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 133

Wavelets - DWT

• Answer: multiple window sizes! -> DWT

time

freq

Time

domain DFT SWFT DWT

Page 119: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 134

Haar Wavelets

• subtract sum of left half from right half

• repeat recursively for quarters, eight-ths, ...

Page 120: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 135

Wavelets - construction

x0 x1 x2 x3 x4 x5 x6 x7

Skip

Page 121: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 136

Wavelets - construction

x0 x1 x2 x3 x4 x5 x6 x7

s1,0

+-

d1,0 s1,1d1,1 .......level 1

Skip

Page 122: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 137

Wavelets - construction

d2,0

x0 x1 x2 x3 x4 x5 x6 x7

s1,0

+-

d1,0 s1,1d1,1 .......

s2,0level 2

Skip

Page 123: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 138

Wavelets - construction

d2,0

x0 x1 x2 x3 x4 x5 x6 x7

s1,0

+-

d1,0 s1,1d1,1 .......

s2,0

etc ...

Skip

Page 124: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 139

Wavelets - construction

d2,0

x0 x1 x2 x3 x4 x5 x6 x7

s1,0

+-

d1,0 s1,1d1,1 .......

s2,0

Q: map each coefficient

on the time-freq. plane

t

f

Skip

Page 125: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 140

Wavelets - construction

d2,0

x0 x1 x2 x3 x4 x5 x6 x7

s1,0

+-

d1,0 s1,1d1,1 .......

s2,0

Q: map each coefficient

on the time-freq. plane

t

f

Skip

Page 126: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 141

Haar wavelets - code

#!/usr/bin/perl5

# expects a file with numbers

# and prints the dwt transform

# The number of time-ticks should be a power of 2

# USAGE

# haar.pl <fname>

my @vals=();

my @smooth; # the smooth component of the signal

my @diff; # the high-freq. component

# collect the values into the array @val

while(<>){

@vals = ( @vals , split );

}

my $len = scalar(@vals);

my $half = int($len/2);

while($half >= 1 ){

for(my $i=0; $i< $half; $i++){

$diff [$i] = ($vals[2*$i] - $vals[2*$i + 1] )/ sqrt(2);

print "\t", $diff[$i];

$smooth [$i] = ($vals[2*$i] + $vals[2*$i + 1] )/ sqrt(2);

}

print "\n";

@vals = @smooth;

$half = int($half/2);

}

print "\t", $vals[0], "\n" ; # the final, smooth component

Page 127: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 142

Wavelets - construction

Observation1:

„+‟ can be some weighted addition

„-‟ is the corresponding weighted difference

(„Quadrature mirror filters‟)

Observation2: unlike DFT/DCT,

there are *many* wavelet bases: Haar, Daubechies-

4, Daubechies-6, Coifman, Morlet, Gabor, ...

Page 128: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 143

Wavelets - how do they look

like?

• E.g., Daubechies-4

Page 129: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 144

Wavelets - how do they look

like?

• E.g., Daubechies-4

?

?

Page 130: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 145

Wavelets - how do they look

like?

• E.g., Daubechies-4

Page 131: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 146

Outline

• Motivation

• Similarity Search and Indexing

• DSP

– DFT

– DWT

• Definition of DWT and properties

• how to read the DWT scalogram

Page 132: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 147

Wavelets - Drill#1:

t

f

• Q: baritone/silence/soprano - DWT?

time

value

Page 133: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 148

Wavelets - Drill#1:

t

f

• Q: baritone/silence/soprano - DWT?

time

value

Page 134: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 149

Wavelets - Drill#2:

• Q: spike - DWT?

t

f

1 2 3 4 5 6 7 8

Page 135: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 150

Wavelets - Drill#2:

t

f

• Q: spike - DWT?

1 2 3 4 5 6 7 8

0.00 0.00 0.71 0.00

0.00 0.50

-0.35

0.35

Page 136: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 151

Wavelets - Drill#3:

• Q: weekly + daily periodicity, + spike -

DWT?

t

f

Page 137: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 152

Wavelets - Drill#3:

• Q: weekly + daily periodicity, + spike -

DWT?

t

f

Page 138: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 153

Wavelets - Drill#3:

• Q: weekly + daily periodicity, + spike -

DWT?

t

f

Page 139: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 154

Wavelets - Drill#3:

• Q: weekly + daily periodicity, + spike -

DWT?

t

f

Page 140: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 155

Wavelets - Drill#3:

• Q: weekly + daily periodicity, + spike -

DWT?

t

f

Page 141: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 156

Wavelets - Drill#3:

• Q: DFT?

t

f

t

f

DWT DFT

Page 142: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 157

Advantages of Wavelets

• Better compression (better RMSE with same

number of coefficients - used in JPEG-2000)

• fast to compute (usually: O(n)!)

• very good for „spikes‟

• mammalian eye and ear: Gabor wavelets

Page 143: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 158

Overall Conclusions

• DFT, DCT spot periodicities

• DWT : multi-resolution - matches

processing of mammalian ear/eye better

• All three: powerful tools for compression,

pattern detection in real signals

• All three: included in math packages

– (matlab, „R‟, mathematica, … - often in

spreadsheets!)

Page 144: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 159

Overall Conclusions

• DWT : very suitable for self-similar traffic

• DWT: used for summarization of streams [Gilbert+01], db histograms etc

Page 145: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 160

Resources - software and urls

• http://www.dsptutor.freeuk.com/jsanalyser/

FFTSpectrumAnalyser.html : Nice java

applets for FFT

• http://www.relisoft.com/freeware/freq.html

voice frequency analyzer (needs

microphone)

Page 146: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 161

Resources: software and urls

• xwpl: open source wavelet package from

Yale, with excellent GUI

• http://monet.me.ic.ac.uk/people/gavin/java

/waveletDemos.html : wavelets and

scalograms

Page 147: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 162

Books

• William H. Press, Saul A. Teukolsky, William T.

Vetterling and Brian P. Flannery: Numerical Recipes in C,

Cambridge University Press, 1992, 2nd Edition. (Great

description, intuition and code for DFT, DWT)

• C. Faloutsos: Searching Multimedia Databases by Content,

Kluwer Academic Press, 1996 (introduction to DFT,

DWT)

Page 148: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 163

Additional Reading

• [Gilbert+01] Anna C. Gilbert, Yannis Kotidis and S.

Muthukrishnan and Martin Strauss, Surfing Wavelets on

Streams: One-Pass Summaries for Approximate Aggregate

Queries, VLDB 2001

Page 149: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 164

Page 150: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 165

Outline

• Motivation

• Similarity Search and Indexing

• DSP

• Linear Forecasting

• Kalman filters

• Bursty traffic - fractals and multifractals

• Non-linear forecasting

• Conclusions

Page 151: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 166

Forecasting

"Prediction is very difficult, especially about

the future." - Nils Bohr

http://www.hfac.uh.edu/MediaFutures/thoughts.html

Page 152: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 167

Outline

• Motivation

• ...

• Linear Forecasting

– Auto-regression: Least Squares; RLS

– Co-evolving time sequences

– Examples

– Conclusions

Page 153: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 168

Problem#2: Forecast

• Example: give xt-1, xt-2, …, forecast xt

0

10

20

30

40

50

60

70

80

90

1 3 5 7 9 11

Time Tick

Nu

mb

er o

f p

ack

ets

sen

t

??

Page 154: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 169

Forecasting: Preprocessing

MANUALLY:

remove trends spot periodicities

0

1

2

3

4

5

6

1 2 3 4 5 6 7 8 9 10

0

0.5

1

1.5

2

2.5

3

3.5

1 3 5 7 9 11 13

timetime

7 days

Page 155: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 170

Problem#2: Forecast

• Solution: try to express

xt

as a linear function of the past: xt-2, xt-2, …,

(up to a window of w)

Formally:

0102030405060708090

1 3 5 7 9 11Time Tick

??noisexaxax wtwtt 11

Page 156: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 171

(Problem: Back-cast; interpolate)• Solution - interpolate: try to express

xt

as a linear function of the past AND the future:

xt+1, xt+2, … xt+wfuture; xt-1, … xt-wpast

(up to windows of wpast , wfuture)

• EXACTLY the same algo‟s

0102030405060708090

1 3 5 7 9 11Time Tick

??

Page 157: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 172

Linear Regression: idea

40

45

50

55

60

65

70

75

80

85

15 25 35 45

Body weight

patient weight height

1 27 43

2 43 54

3 54 72

……

N 25 ??

• express what we don‟t know (= „dependent variable‟)

• as a linear function of what we know (= „indep. variable(s)‟)

Body height

Page 158: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 173

Linear Auto Regression:

Time Packets

Sent (t-1)

Packets

Sent(t)

1 - 43

2 43 54

3 54 72

……

N 25 ??

Page 159: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 174

Linear Auto Regression:

40

45

50

55

60

65

70

75

80

85

15 25 35 45

Number of packets sent (t-1)N

um

ber

of

pa

cket

s se

nt

(t)

Time Packets

Sent (t-1)

Packets

Sent(t)

1 - 43

2 43 54

3 54 72

……

N 25 ??

• lag w=1

• Dependent variable = # of packets sent (S [t])

• Independent variable = # of packets sent (S[t-1])

„lag-plot‟

Page 160: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 175

Outline

• Motivation

• ...

• Linear Forecasting

– Auto-regression: Least Squares; RLS

– Co-evolving time sequences

– Examples

– Conclusions

Page 161: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 176

More details:

• Q1: Can it work with window w>1?

• A1: YES!

xt-2

xt

xt-1

Page 162: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 177

More details:

• Q1: Can it work with window w>1?

• A1: YES! (we‟ll fit a hyper-plane, then!)

xt-2

xt

xt-1

Page 163: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 178

More details:

• Q1: Can it work with window w>1?

• A1: YES! (we‟ll fit a hyper-plane, then!)

xt-2

xt-1

xt

Page 164: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 179

More details:

• Q1: Can it work with window w>1?

• A1: YES! The problem becomes:

X[N w] a[w 1] = y[N 1]

• OVER-CONSTRAINED– a is the vector of the regression coefficients

– X has the N values of the w indep. variables

– y has the N values of the dependent variable

Skip

Page 165: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 180

More details:

• X[N w] a[w 1] = y[N 1]

N

w

NwNN

w

w

y

y

y

a

a

a

XXX

XXX

XXX

2

1

2

1

21

22221

11211

,,,

,,,

,,,

Ind-var1 Ind-var-w

time

Skip

Page 166: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 181

More details:

• X[N w] a[w 1] = y[N 1]

N

w

NwNN

w

w

y

y

y

a

a

a

XXX

XXX

XXX

2

1

2

1

21

22221

11211

,,,

,,,

,,,

Ind-var1 Ind-var-w

time

Skip

Page 167: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 182

More details

• Q2: How to estimate a1, a2, … aw = a?

• A2: with Least Squares fit

• (Moore-Penrose pseudo-inverse)

• a is the vector that minimizes the RMSE

from y

a = ( XT X )-1 (XT y)

Skip

Page 168: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 183

Even more details

• Q3: Can we estimate a incrementally?

• A3: Yes, with the brilliant, classic method

of „Recursive Least Squares‟ (RLS) (see,

e.g., [Yi+00], for details) - pictorially:

Skip

Page 169: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 184

Even more details

• Given:

Independent Variable

Dep

enden

t V

aria

ble

Skip

Page 170: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 185

Even more details

Independent Variable

Dep

enden

t V

aria

ble

.

new point

Skip

Page 171: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 186

Even more details

Independent Variable

Dep

enden

t V

aria

ble

RLS: quickly compute new best fit

new point

Skip

Page 172: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 187

Even more details

• Straightforward Least

Squares

– Needs huge matrix

(growing in size)

O(N×w)

– Costly matrix operation

O(N×w2)

• Recursive LS

– Need much smaller, fixed

size matrix

O(w×w)

– Fast, incremental

computation

O(1×w2)

N = 106, w = 1-100

Skip

Page 173: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 188

Even more details

• Q4: can we „forget‟ the older samples?

• A4: Yes - RLS can easily handle that

[Yi+00]:

Skip

Page 174: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 189

Adaptability - „forgetting‟

Independent Variable

eg., #packets sent

Dep

enden

t V

aria

ble

eg.,

#by

tes

sent

Skip

Page 175: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 202

Outline

• Motivation

• ...

• Linear Forecasting

– Auto-regression: Least Squares; RLS

– Co-evolving time sequences

– Examples

– Conclusions

Page 176: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 203

Co-Evolving Time Sequences

• Given: A set of correlated time sequences

• Forecast „Repeated(t)‟

0

10

20

30

40

50

60

70

80

90

1 3 5 7 9 11

Time Tick

Nu

mb

er o

f p

ack

ets

sent

lost

repeated

??

Page 177: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 204

Solution:

Q: what should we do?

Page 178: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 205

Solution:

Least Squares, with

• Dep. Variable: Repeated(t)

• Indep. Variables: Sent(t-1) … Sent(t-w);

Lost(t-1) …Lost(t-w); Repeated(t-1), ...

• (named: „MUSCLES‟ [Yi+00])

Page 179: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 206

Time Series Analysis - Outline

• Auto-regression

• Least Squares; recursive least squares

• Co-evolving time sequences

• Examples

• Conclusions

Page 180: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 207

Conclusions - Practitioner‟s

guide

• AR(IMA) methodology: prevailing method

for linear forecasting

• Brilliant method of Recursive Least Squares

for fast, incremental estimation.

• See [Box-Jenkins]

Page 181: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 208

Resources: software and urls

• MUSCLES: Prof. Byoung-Kee Yi:

http://www.postech.ac.kr/~bkyi/

or [email protected]

• free-ware: „R‟ for stat. analysis

(clone of Splus)

http://cran.r-project.org/

Page 182: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 209

Books

• George E.P. Box and Gwilym M. Jenkins and Gregory C.

Reinsel, Time Series Analysis: Forecasting and Control,

Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.)

• Brockwell, P. J. and R. A. Davis (1987). Time Series:

Theory and Methods. New York, Springer Verlag.

Page 183: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 210

Additional Reading

• [Papadimitriou+ vldb2003] Spiros Papadimitriou, Anthony

Brockwell and Christos Faloutsos Adaptive, Hands-Off

Stream Mining VLDB 2003, Berlin, Germany, Sept. 2003

• [Yi+00] Byoung-Kee Yi et al.: Online Data Mining for

Co-Evolving Time Sequences, ICDE 2000. (Describes

MUSCLES and Recursive Least Squares)

Page 184: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 211

Next: Kalman filters

BREAK!

Page 185: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Indexing and Mining Time

Sequences

Part 4

Kalman Filters

Christos Faloutsos

Lei Li

CMU

KDD 2010 1Copyright: C. Faloutsos & L. Li, 2010

Page 186: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Outline

• Intuition, example, and definition

• Extensions

• Kalman filters at work

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 2

Page 187: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Intuition

• Tracking moving objects, estimate velocity

and acceleration on the fly

3KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

from FIFA 2010RoboCup 2010

Page 188: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Linear Dynamical System

• Known parameters

– Original Kalman Filters [Kalman 1960, Rauch

1965]

• Unknown parameters

– Parameter estimation through EM algorithm

[Shumway et al 1982, Ghahramani 1996]

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 4

Page 189: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Kalman Filters

Given observations of the soccer ball position

t=1..T, “Model parameters”

Goal: two types of prediction

Kalman filtering: Estimate the true position,

velocity & acceleration based on the

previous observations

Kalman smoothing: Estimate for every time

tick, based on all observationsKDD 2010 Copyright: C. Faloutsos & L. Li, 2010 5

?

?

Page 190: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Kalman Filters (intuition)

t=1, soccer with initial pos, vel and acc.

To estimate the future

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 6

v1

a1

p1

past observations

Page 191: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Kalman Filters (intuition)

t=2, according to Newton’s law, it should be...

7KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

v1

a1

p1

2p~

2v~2a~

Page 192: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Kalman Filters (intuition)

t=2, according to Newton’s law, it should be...

however, imperfect soccer/kick movement…

8KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

v1

a1

p1

v2

a2

p2

Page 193: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Kalman Filters (intuition)

Now take a photo, due to imperfect camera…

9KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

v1

a1

p1

v2

a2

p2

Page 194: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Kalman Filters (intuition)

What is the best estimate for next time tick?

10KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

v1

a1

p1

v2

a2

p2

Page 195: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Kalman Filters (intuition)

What is the best estimate for next time tick?

11KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

v1

a1

p1

2p

2v2a

Page 196: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Some math notationname

A transition matrix

C transmission/ projection/

output matrix

Q transition covariance

R transmission/ projection/

output covariance

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 12

v1

a1

p1

‘Newton’s dynamics’: Transition matrix A

2p~

2v~2a~

v1

a1

p1

2p

2v2a

Page 197: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Example

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 13

observation

hidden states

(details)

z1=(p1, v1, a1)T

x1=(observed1)

A= 1,1,1/2

0, 1, 1

0, 0, 1

v1

a1

p1

2p

2v2a

C=(1, 0, 0)

transition matrix

output matrix

transition covariance Q

output covariance R

Page 198: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Example

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 14

observation

hidden states

(details)

z1=(p1, v1, a1)T

x1=(observed1)

A= 1,1,1/2

0, 1, 1

0, 0, 1

C=(1, 0, 0)

transition matrix

output matrix

p2=p1+ v1*∆t + 0.5*a1*∆t2

v2=v1+a1*∆t

a2=a1

v1

a1

p1

Page 199: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Kalman Filtering (intuition)

Step1, forecast next time tick before observation

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 15

v1

a1

p1

‘Newton’s dynamics’:

Transition matrix A

2p~

2v~2a~

Page 200: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Kalman Filtering (intuition)

Step 2: adjust estimation after observation

16KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

v1

a1

p1

2p

2v2a

Page 201: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Example: Kalman filtering

(forward)

17

Time*

observed

Position

*

** *

*

Given:

a sequence of observations, Model parameters (A, C …)

Goal: remove noise and forecast real position

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 202: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Example: Kalman filtering

(forward)

18

Time*

observed

estimated

Position

[R. E. Kalman. A new approach to linear filtering and prediction problems. 1960]

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

t=1

Page 203: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Example: Kalman filtering

19

Time*

observed

estimated

Position

*

QAVAP

RCPCCPK

PKIV

zACxKzA

T

nn

T

n

T

nn

nnn

nnnnn

11

1

11

1

11

ˆ

)(

)(ˆ

)ˆ(ˆz

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

t=2

Intuition: #2 may

be close to #1

Using the “Newton

Dynamics”

Page 204: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

20

Time*

observed

Position

** *

*

Example: Kalman filtering

*estimated

QAVAP

RCPCCPK

PKIV

zACxKzA

T

nn

T

n

T

nn

nnn

nnnnn

11

1

11

1

11

ˆ

)(

)(ˆ

)ˆ(ˆz

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 205: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Kalman Filters

Given observations of the soccer position

t=1..T, and model parameters (A, C …)

Goal: two types of prediction

Kalman filtering: Estimate the true position,

velocity & acceleration based on the

previous observations

Kalman smoothing: Estimate for every time

tick, based on all observationsKDD 2010 Copyright: C. Faloutsos & L. Li, 2010 21

?

?

Page 206: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Kalman Smoothing

Given: all observation x1, …, xn

Estimate: the hidden state for every time tick

zt (t=1..n)

Difference from Kalman filtering:

bring future observation back in history

estimate

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 22

??

Page 207: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

23

Time

Forward

*

observed

Position

** *

*

Recap: Kalman filtering

*estimated

QAVAP

RCPCCPK

PKIV

zACxKzA

T

nn

T

n

T

nn

nnn

nnnnn

11

1

11

1

11

ˆ

)(

)(ˆ

)ˆ(ˆz

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 208: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Example: Kalman Smoothing

24

Backward

Time*

observed

Position

** *

*

**

11

1

1

ˆ

(ˆˆ

)ˆˆ(ˆz

n

T

nn

T

nnnnnn

nnnnn

PAVJ

JPVJVV

zAzJz

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

[Rauch et al, Maximum likelihood estimates of linear dynamic systems. 1965]

Page 209: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Example: Kalman Smoothing

25

Backward

Time*

observed

Position

** *

*

*

estimated

*

**

*

*

1

1

1

ˆ

(ˆˆ

)ˆˆ(ˆz

n

T

nn

T

nnnnnn

nnnnn

PAVJ

JPVJVV

zAzJz

2

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 210: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Example: Kalman Smoothing

26

Backward

Time*

observed

Position

** *

*

*

estimated

*

**

*

*

Reconstructed

signal after

smoothing

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 211: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Outline

• Intuition, example, and definition

– Original Kalman

– Kalman filters with parameter estimation

• Extensions

• Kalman filters at work

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 27

Page 212: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

What if not know the model

parameters?

• E.g. Datacenter sensor temperatures

• no longer “Newton dynamics”

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 28

v1

a1

p1

Transition matrix A =

output matrix C =

2p

2v2a

Page 213: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Graphical Model Representation

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 29

z1 = µ0 + ω0

zn+1= A∙zn + ωn

xn = C∙zn + εn

Model parameters:

θ={µ0, Q0, A, Q, C, R}

observation

hidden states

(details)

Z1 Z2 Z3 Z4

X1 X2 X3X4

N(A∙z1,Q)

N(µ0,Q0)

N(C∙z3,R)N(C∙z1,R) N(C∙z2,R) N(C∙z4,R) …

N(A∙z2,Q) N(A∙z3,Q) N(A∙z4,Q)

Gaussian

noise

Page 214: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Linear Dynamical Systems:

parametersname meaning & example

µ0 initial state for hidden

variable

e.g. initial position, velocity &

acceleration

A transition matrix how the states move forward, e.g.

soccer flying in the air

C transmission/ projection/

output matrix

hidden state observation, e.g.

camera taking picture of the soccer

Q0 Initial covariance

Q transition covariance how precision is the soccer motion

R transmission/ projection

covariance

i.e. observation noise; e.g. how

accurate is the cameraKDD 2010 Copyright: C. Faloutsos & L. Li, 2010 30

Page 215: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Estimating parameters of LDS• Given: a sequence of observations (e.g. car

positions)

• Find: best-fit parameters

• Basic principle: maximum likelihood

• Through Expectation-Maximization Alg.

31

Z1 Z2 Z3 Z4

X1 X2 X3X4

N(A∙z1,Q)

N(µ0,Q0)

N(C∙z3,R)N(C∙z1,R) N(C∙z2,R) N(C∙z4,R)…

N(A∙z2,Q) N(A∙z3,Q) N(A∙z4,Q)

Page 216: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Learning LDS: EM alg.

• E-step: Kalman filtering-smoothing

– Estimate the hidden variables based on

observation and current parameters

• M-step:

– Update parameters (A, C…) to

maximize

the log-likelihood.

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 32

Z1

Z2

Z3

Z4

X1

X2

X3

X4

N(A∙z1,Q)

N(µ0,Q0)

N(C∙z3,R)N(C∙z1,R) N(C∙z2,R) N(C∙z4,R)…

N(A∙z2,Q) N(A∙z3,Q) N(A∙z4,Q)

[Shumway et al 1982, Ghahramani 1996]

Page 217: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

EM alg. Intuition

33

E-step: compute hidden states using Kalman

filtering-smoothing (exactly as before)

Time*

observed

Position

** *

*

*

estimated

*

**

*

*

Reconstructed

signal after

smoothing

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 218: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

EM alg.

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 34

M-step: Maximizing the log-likelihood

by solving 0A

L

0C

L

...

Page 219: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Outline

• Intuition, example, and definition

• Extensions

– Handling missing values

– Switching LDS

– Particle filters

• Kalman filters at work

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 35

Page 220: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Motivation Examples

• Motion Capture:– Markers on human actors

– Cameras used to track the 3D positions

– Duration: 100-500

– 93 dimensional body-local coordinates after preprocessing (31-bones)

• Sensor data missing due to:– Low battery

– RF error

36

From mocap.cs.cmu.edu

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 221: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Time

sensor 1

sensor 2

sensorm

blackout

Illustration of data

37KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 222: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo: Intuition

38

Position of

Left hand

marker

Position of

right hand

marker

missing

Recover using

Correlation

among multiple

sequences

[Lei Li et al DynaMMo. KDD 2009]KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 223: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo: Intuition

39

missing

Recover using

Dynamics

temporal moving

pattern

Position of

Left hand

marker

Position of

right hand

marker

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

[Lei Li et al DynaMMo. KDD 2009]

Page 224: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

LDS with Missing Value

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 40

observation

hidden states Z1 Z2 Z3 Z4

X1 X2 X3X4

N(A∙z1,Q)

N(µ0,Q0)

N(C∙z3,R)N(C∙z1,R) N(C∙z2,R) N(C∙z4,R) …

N(A∙z2,Q) N(A∙z3,Q) N(A∙z4,Q)

partially

observed

Page 225: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo Illustration:

estimate all colored elements

41x1 x2 x3

× × ×

×× ×

C C C

Details in [Li+2009]

z1 z2 z3

A A A

Page 226: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo Illustration: step 1

estimate hidden variables

42x1 x2 x3

× × ×

×× ×

C C C

Details in [Li+2009]

z1 z2 z3

A A A

Page 227: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo Illustration: step 2

estimate missing values

43x1 x2 x3

× × ×

×× ×

Details in [Li+2009]

z1 z2 z3

C C C

A A A

Page 228: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo Illustration: step 3

update model parameters

44x1 x2 x3

× × ×

×× ×

C C C

Details in [Li+2009]

z1 z2 z3

A A A

Page 229: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo Intuition, forward-

backward algorithm

• How to recover the missing values?

45x1 x2 x3

Page 230: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo: How to Recover?

46

×

x1

hidden variables

e.g. velocity,

acceleration

projection

matrix C

z1

C

Page 231: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo: How to Recover?

47

×

×

×

x1 x2

Transition

matrix

projection

matrix

z1 z2

A

C C

Page 232: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo: How to Recover?

48

×

x1 x2

× ×

×z1 z2 z3

C C

x3

A A

Page 233: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo: How to Recover?

49x1 x2 x3

× × ×

××

C C C

z1 z2 z3

A A

Page 234: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo: How to Recover?

50x1 x2 x3

× × ×

×× ×

C C C

z1 z2 z3

similarly backward pass

A A A

Page 235: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

DynaMMo Learning

51

Guess Initial

Estimate Hidden

Recover Missing

Update Model

Fix X,

Estimate P(Z|X;):E(zn|X;), E(znz’n|X;)E(znz’n+1|X;)

fix Z,

estimate

E(X_missing|Z;)Using E(zn|X;),

Fix both X and Z,

estimate new model

parameters

argmax E[log(X,Z;)]

Random

Guess model

parameters

1

2

3

0

(details)

Page 236: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Result – Better Missing Value Recovery

52

Reconstruction

error

Average missing length

Ideal

Proposed DynaMMo

MSVD

[Srebro’03]Linear

Interpolation

Spline

Dataset:

CMU Mocap #16

mocap.cs.cmu.edumore results in [Li+2009]

Page 237: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Outline

• Intuition, example, and definition

• Extensions

– Handling missing values

– Switching LDS

– Particle filters

• Kalman filters at work

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 53

Page 238: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Switching LDS: Intuition

• Tracking human motion

– Start from slow walking

– Transition to running for a while

– Gradually stop

• Each part corresponds to a different

state & dynamics

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 54

Page 239: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Switching LDS

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 55

S={running,

walking}

Z2,1 Z2,2 Z2,3 Z2,4

X1 X2 X3X4

s1 s2 s3 s4

Z1,1 Z1,2 Z1,3 Z1,4

Page 240: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Switching LDS: example

S=1, walking

A=A1, C=C1

S=2, running

A=A2 , C=C2

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 56

Z2,1 Z2,2 Z2,3 Z2,4

X1 X2 X3X4

s1 s2 s3 s4

Z1,1 Z1,2 Z1,3 Z1,4

Page 241: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Switching LDS in Penalty Kick

Before touching ball

S=1, running towards left

A=A1, C=C1

After hitting ball

S=2, kicking towards right

A=A2 , C=C2

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 57

Z2,1 Z2,2 Z2,3 Z2,4

X1 X2 X3X4

s1 s2 s3 s4

Z1,1 Z1,2 Z1,3 Z1,4

From FIFA 2010

Page 242: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Estimating Parameters for SLDS

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 58

Approximate inference: Variational EM

[Ghahramani&Hinton. Variational learning for switching state-space

models.2000]

Z2,1 Z2,2 Z2,3 Z2,4

X1 X2 X3X4

s1 s2 s3 s4

Z1,1 Z1,2 Z1,3 Z1,4

Z2,1 Z2,2 Z2,3 Z2,4

s1 s2 s3 s4

Z1,1 Z1,2 Z1,3 Z1,4

E-step

Page 243: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Outline

• Intuition, example, and definition

• Extensions

– Handling missing values

– Switching LDS

– Particle filters

• Kalman filters at work

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 59

Page 244: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Particle Filters

• What if non-Gaussian noise?

• Inference using Markov chain Monte Carlo

sampling

60[Gordon et al, Nonlinear/non-gaussian bayesian state

estimation. 1993]

v1

a1

p1

v2

a2

p2

Page 245: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Outline

• Intuition, example, and definition

• Extensions

• Kalman filters at work

– Segmentation & Compression

– Parallel learning on Multi-core

– Motion Stitching

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 61

Page 246: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

How to Segment

• Segment by threshold on reconstruction error

62

original data

reconstruction

error

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

[Lei Li et al DynaMMo. KDD 2009]

Page 247: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Results – Segmentation

• Find the transition during “running” to

“stop”.

63

left hip

left femur

reconstruction

error

KDD 2010

Page 248: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Results – Segmentation• Find the transition during “running” to

“stop”.

64

left hip

left femur

reconstruction

error

run stopslow

down

KDD 2010

Page 249: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

How to Compress (DynaMMo)

65

Original data w/ missing values

hidden variables

keep only a portion

(optimal samples)

C

DynaMMo

A

Page 250: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Outline

• Intuition, example, and definition

• Extensions

• Kalman filters at work

– Segmentation & Compression

– Parallel learning on Multi-core

– Motion Stitching

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 66

Page 251: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Challenge illustration

Expectation-Maximization Alg.

Step 1

Step 2

Step 3

Step 4

Step 5

Step 6

Step 7

Step 8

Timeline for E-step (forward-backward) in learning LDS

1 2 3 4 5

67

EM can

only use

single CPU

due to data

dependency

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 252: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Parallel learning by

Cut-And-Stitch Method

Step 1

Step 2

Step 3

Step 4

1 2 3 4 5

68

Goal:

with 2 CPUs

[Li et al 2008b]

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

[Li et al Cut-And-Stitch. KDD 2008]

Page 253: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

It works!

69

sp

eed

up

# of processors

ideal

Proposed

Cut-And-Stitch

EM algorithm

Dataset:

58 motion sequences

CMU Mocap #16

mocap.cs.cmu.edu,

tested on NCSA super

computer,

Page 254: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Outline

• Intuition, example, and definition

• Extensions

• Kalman filters at work

– Segmentation & Compression

– Parallel learning on Multi-core

– Motion Stitching

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 70

Page 255: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Motion Stitching

A Database Approach• Select best

stitchable

segments from a

set of basic motion

pieces and

generate new

natural motions

71KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 256: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Problem Definition

• Given two motion-capture sequences that are to be

stitched together, how can we assess the goodness

of the stitching?

72

12

3

Which stitching looks best?

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 257: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Proposed Method:

Laziness Score [Li+2008a]

• Conjecture: less human effort more natural

• Proposed: use Kalman filters to estimate position,

velocity, acceleration Compute effort/ energy

73KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 258: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Which continues to?

Green or Blue?

74

straight moving U-Turn

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

Page 259: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Result – Laziness-score prefers

straightforward moving

75

straight moving U-Turn

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010

[Li et al Laziness Score. Eurographics 2008]

Page 260: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Conclusion• Intuition, example, and definition

– Original kalman filter (known parameters)• Kalman filtering

• Kalman smoothing

– Kalman filters with parameter estimation (EM)

• Extensions– Handling missing values

– Switching linear dynamical

– Particle filters (MCMC sampling)

• Kalman filters at work– Segmentation & compression

– Parallel learning

– Motion stitchingKDD 2010 Copyright: C. Faloutsos & L. Li, 2010 76

Page 261: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

References

• Zoubin Ghahramani and Geoffrey E. Hinton. Parameter estimation for linear dynamical systems. 1996

• Zoubin Ghahramani and Geoffrey E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):831–864, 2000.

• N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-gaussian bayesian state estimation. IEE Proceedings F Radar and Signal Processing, 140(2):107–113, 1993.

• R. H. Shumway and D. S. Stoffer. An approach to time series smoothing and forecasting using the em algorithm. Journal of Time Series Analysis, 3:253–264, 1982.

• R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME šC Journal of Basic Engineering, (82 (Series D)):35–45, 1960.

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 77

Page 262: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

References (cont’)

• H. Rauch, F Tung, and C T Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, pages 1445–1450, 1965.

• Lei Li, Wenjie Fu, Fan Guo, Todd C Mowry, and Christos Faloutsos. Cut-and-stitch: efficient parallel learning of linear dynamical systems on smps. In Proceedings of the 14th ACM SIGKDD, pages 471–479, 2008.

• Lei Li, James McCann, Christos Faloutsos, and Nancy Pollard. Laziness is a virtue: Motion stitching using effort minimization. In Short Papers Proceedings of EUROGRAPHICS, 2008.

• Lei Li, James McCann, Nancy S Pollard, and Christos Faloutsos. DynaMMo: mining and summarization of coevolving sequences with missing values. In Proceedings of the 15th ACM SIGKDD,, pages 507–516, 2009.

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 78

Page 263: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Software

• DynaMMo code (matlab) for missing value,

compression & segmentation.

• Parallel learning (in C) for LDS

• http://www.cs.cmu.edu/~leili/

• http://www.cs.cmu.edu/~leili/pubs/dynamm

o.2.1.2.zip

• http://www.cs.cmu.edu/~leili/paralearn/para

learn.0.1.zip (running on gcc 4.2.0 above)

KDD 2010 Copyright: C. Faloutsos & L. Li, 2010 79

Page 264: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 212

Page 265: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 213

Outline• Motivation

• Similarity Search and Indexing

• DSP

• Linear Forecasting

• Kalman filters

• Bursty traffic - fractals and multifractals

• Non-linear forecasting

• Conclusions

Page 266: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 214

Outline

• Motivation

• ...

• Linear Forecasting

• Bursty traffic - fractals and multifractals

– Problem

– Main idea (80/20, Hurst exponent)

– Results

Page 267: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 215

Recall: Problem #1:

Goal: given a signal (eg., #bytes over time)

Find: patterns, periodicities, and/or compress

time

#bytes Bytes per 30‟

(packets per day;

earthquakes per year)

Page 268: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 216

Problem #1

• model bursty traffic

• generate realistic traces

• (Poisson does not work)

time

# bytes

Poisson

Page 269: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 217

Motivation

• predict queue length distributions (e.g., to

give probabilistic guarantees)

• “learn” traffic, for buffering, prefetching,

„active disks‟, web servers

Page 270: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 218

Q: any „pattern‟?

time

# bytes• Not Poisson

• spike; silence; more

spikes; more silence…

• any rules?

Page 271: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 219

solution: self-similarity

# bytes

time time

# bytes

Page 272: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 220

solution: self-similarity

# bytes

time time

# bytes

Page 273: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 221

solution: self-similarity

# bytes

time time

# bytes

Page 274: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 222

solution: self-similarity

# bytes

time time

# bytes

Page 275: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 223

But:

• Q1: How to generate realistic traces;

extrapolate; give guarantees?

• Q2: How to estimate the model parameters?

Page 276: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 224

Outline

• Motivation

• ...

• Linear Forecasting

• Bursty traffic - fractals and multifractals

– Problem

– Main idea (80/20, Hurst exponent)

– Results

Page 277: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 225

Approach

• Q1: How to generate a sequence, that is

– bursty

– self-similar

– and has similar queue length distributions

Page 278: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 226

Approach

• A: „binomial multifractal‟ [Wang+02]

• ~ 80-20 „law‟:

– 80% of bytes/queries etc on first half

– repeat recursively

• b: bias factor (eg., 80%)

Page 279: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 227

binary multifractals

20 80

Page 280: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 228

binary multifractals

20 80

Page 281: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 229

Parameter estimation

• Q2: How to estimate the bias factor b?

Page 282: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 230

Parameter estimation

• Q2: How to estimate the bias factor b?

• A: MANY ways [Crovella+96]

– Hurst exponent

– variance plot

– even DFT amplitude spectrum! („periodogram‟)

– More robust: „entropy plot‟ [Wang+02]

Page 283: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 231

Entropy plot

• Rationale:

– burstiness: inverse of uniformity

– entropy measures uniformity of a distribution

– find entropy at several granularities, to see

whether/how our distribution is close to

uniform.

Page 284: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 232

Entropy plot

• Entropy E(n) after n

levels of splits

• n=1: E(1)= - p1 log2(p1)-

p2 log2(p2)

p1 p2

% of bytes

here

Page 285: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 233

Entropy plot

• Entropy E(n) after n

levels of splits

• n=1: E(1)= - p1 log(p1)-

p2 log(p2)

• n=2: E(2) = - Si p2,i *

log2 (p2,i)

p2,1 p2,2 p2,3 p2,4

Page 286: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 234

Real traffic

• Has linear entropy plot

(-> self-similar)

# of levels (n)

Entropy

E(n)

0.73

Page 287: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 235

Observation - intuition:

intuition: slope =

intrinsic dimensionality =

info-bits per coordinate-bit

– unif. Dataset: slope =1

– multi-point: slope = 0

# of levels (n)

Entropy

E(n)

Skip

Page 288: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 236

Entropy plot - Intuition

• Slope ~ intrinsic dimensionality (in fact,

„Information fractal dimension‟)

• = info bit per coordinate bit - eg

Dim = 1

Pick a point;

reveal its coordinate bit-by-bit -

how much info is each bit worth to me?

Skip

Page 289: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 237

Entropy plot

• Slope ~ intrinsic dimensionality (in fact,

„Information fractal dimension‟)

• = info bit per coordinate bit - eg

Dim = 1

Is MSB 0?

„info‟ value = E(1): 1 bit

Skip

Page 290: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 238

Entropy plot

• Slope ~ intrinsic dimensionality (in fact,

„Information fractal dimension‟)

• = info bit per coordinate bit - eg

Dim = 1

Is MSB 0?

Is next MSB =0?

Skip

Page 291: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 239

Entropy plot

• Slope ~ intrinsic dimensionality (in fact,

„Information fractal dimension‟)

• = info bit per coordinate bit - eg

Dim = 1

Is MSB 0?

Is next MSB =0?

Info value =1 bit

= E(2) - E(1) =

slope!

Skip

Page 292: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 240

Entropy plot

• Repeat, for all points at same position:

Dim=0

Skip

Page 293: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 241

Entropy plot

• Repeat, for all points at same position:

• we need 0 bits of info, to determine position

• -> slope = 0 = intrinsic dimensionality

Dim=0

Skip

Page 294: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 242

Entropy plot

• Real (and 80-20) datasets can be in-

between: bursts, gaps, smaller bursts,

smaller gaps, at every scale

Dim = 1

Dim=0

0<Dim<1

Skip

Page 295: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 243

(Fractals)

• What set of points could have behavior

between point and line?

Page 296: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 244

Cantor dust

• Eliminate the middle third

• Recursively!

Page 297: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 245

Cantor dust

Page 298: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 246

Cantor dust

Page 299: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 247

Cantor dust

Page 300: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 248

Cantor dust

Page 301: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 249

Dimensionality?

(no length; infinite # points!)

Answer: log2 / log3 = 0.6

Cantor dust

Page 302: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 250

Some more entropy plots:

• Poisson vs real

Poisson: slope = ~1 -> uniformly distributed

1 0.73

Page 303: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 251

B-model

• b-model traffic gives perfectly

linear plot

• Lemma: its slope is

slope = -b log2b - (1-b) log2 (1-b)

• Fitting: do entropy plot; get

slope; solve for b

E(n)

n

Page 304: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 252

Outline

• Motivation

• ...

• Linear Forecasting

• Bursty traffic - fractals and multifractals

– Problem

– Main idea (80/20, Hurst exponent)

– Experiments - Results

Page 305: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 253

Experimental setup

• Disk traces (from HP [Wilkes 93])

• web traces from LBL

http://repository.cs.vt.edu/

lbl-conn-7.tar.Z

Page 306: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 254

Model validation

• Linear entropy plots

Bias factors b: 0.6-0.8

smallest b / smoothest: nntp traffic

Page 307: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 255

Web traffic - results

• LBL, NCDF of queue lengths (log-log scales)

(queue length l)

Prob( >l)

How to give guarantees?

Page 308: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 256

Web traffic - results

• LBL, NCDF of queue lengths (log-log scales)

(queue length l)

Prob( >l)20% of the requests

will see

queue lengths <100

Page 309: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 257

Conclusions

• Multifractals (80/20, „b-model‟,

Multiplicative Wavelet Model (MWM)) for

analysis and synthesis of bursty traffic

• can give (probabilistic) guarantees

Page 310: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 258

Books

• Fractals: Manfred Schroeder: Fractals, Chaos, Power

Laws: Minutes from an Infinite Paradise W.H. Freeman

and Company, 1991 (Probably the BEST book on

fractals!)

Page 311: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 259

Further reading:

• Crovella, M. and A. Bestavros (1996). Self-Similarity in

World Wide Web Traffic, Evidence and Possible Causes.

Sigmetrics.

• [ieeeTN94] W. E. Leland, M.S. Taqqu, W. Willinger,

D.V. Wilson, On the Self-Similar Nature of Ethernet

Traffic, IEEE Transactions on Networking, 2, 1, pp 1-15,

Feb. 1994.

Page 312: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 260

Further reading

• [Riedi+99] R. H. Riedi, M. S. Crouse, V. J. Ribeiro, and R.

G. Baraniuk, A Multifractal Wavelet Model with

Application to Network Traffic, IEEE Special Issue on

Information Theory, 45. (April 1999), 992-1018.

• [Wang+02] Mengzhi Wang, Tara Madhyastha, Ngai Hang

Chang, Spiros Papadimitriou and Christos Faloutsos, Data

Mining Meets Performance Evaluation: Fast Algorithms

for Modeling Bursty Traffic, ICDE 2002, San Jose, CA,

2/26/2002 - 3/1/2002.

Page 313: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 261

Page 314: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 262

Outline

• Motivation

• Similarity Search and Indexing

• DSP

• Linear Forecasting

• Kalman filters

• Bursty traffic - fractals and multifractals

• Non-linear forecasting

• Conclusions

Page 315: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 263

Detailed Outline

• Non-linear forecasting

– Problem

– Idea

– How-to

– Experiments

– Conclusions

Page 316: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 264

Recall: Problem #1

Given a time series {xt}, predict its future

course, that is, xt+1, xt+2, ...

Time

Value

Page 317: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 265

How to forecast?

• ARIMA - but: linearity assumption

• ANSWER: „Delayed Coordinate

Embedding‟ = Lag Plots [Sauer92]

Page 318: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 266

General Intuition (Lag Plot)

xt-1

xt

4-NNNew Point

Interpolate

these…

To get the final

prediction

Lag = 1,

k = 4 NN

Page 319: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 267

Questions:

• Q1: How to choose lag L?

• Q2: How to choose k (the # of NN)?

• Q3: How to interpolate?

• Q4: why should this work at all?

Page 320: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 268

Q1: Choosing lag L

• Manually (16, in award winning system by

[Sauer94])

Page 321: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 269

Q2: Choosing number of

neighbors k

• Manually (typically ~ 1-10)

Page 322: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 270

Q3: How to interpolate?

How do we interpolate between the

k nearest neighbors?

A3.1: Average

A3.2: Weighted average (weights drop

with distance - how?)

Page 323: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 271

Q3: How to interpolate?

A3.3: Using SVD - seems to perform best

([Sauer94] - first place in the Santa Fe

forecasting competition)

Xt-1

xt

Page 324: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 272

Q4: Any theory behind it?

A4: YES!

Page 325: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 273

Theoretical foundation

• Based on the “Takens‟ Theorem”

[Takens81]

• which says that long enough delay vectors

can do prediction, even if there are

unobserved variables in the dynamical

system (= diff. equations)

Page 326: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 274

Theoretical foundation

Example: Lotka-Volterra equations

dH/dt = r H – a H*P

dP/dt = b H*P – m P

H is count of prey (e.g., hare)

P is count of predators (e.g., lynx)

Suppose only P(t) is observed (t=1, 2, …).

H

P

Skip

Page 327: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 275

Theoretical foundation

• But the delay vector space is a faithful

reconstruction of the internal system state

• So prediction in delay vector space is as

good as prediction in state space

Skip

H

P

P(t-1)

P(t)

Page 328: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

Solution to Volterra-Lotka eq.

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 276from wikipedia

time# prey

# predators

prey

predators

Page 329: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 277

Detailed Outline

• Non-linear forecasting

– Problem

– Idea

– How-to

– Experiments

– Conclusions

Page 330: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 278

Datasets

Logistic Parabola:xt = axt-1(1-xt-1) + noise Models population of flies [R. May/1976]

time

x(t

)

Lag-plot

Page 331: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 279

Datasets

Logistic Parabola:xt = axt-1(1-xt-1) + noise Models population of flies [R. May/1976]

time

x(t

)

Lag-plot

ARIMA: fails

Page 332: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 280

Logistic Parabola

Timesteps

Value

Our Prediction from

here

Page 333: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 281

Logistic Parabola

Timesteps

Value

Comparison of prediction

to correct values

Page 334: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 282

Datasets

LORENZ: Models convection currents in the air

dx / dt = a (y - x)

dy / dt = x (b - z) - y

dz / dt = xy - c z

Skip

Page 335: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 283

LORENZ

Timesteps

Value

Comparison of prediction

to correct values

Skip

Page 336: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 284

Datasets

Time

Value

• LASER: fluctuations in a Laser over time (used in Santa Fe competition)

Skip

Page 337: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 285

Laser

Timesteps

Value

Comparison of prediction

to correct values

Skip

Page 338: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 286

Conclusions

• Lag plots for non-linear forecasting

(Takens‟ theorem)

• suitable for „chaotic‟ signals

Page 339: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 287

References

• Deepay Chakrabarti and Christos Faloutsos F4: Large-

Scale Automated Forecasting using Fractals CIKM 2002,

Washington DC, Nov. 2002.

• Sauer, T. (1994). Time series prediction using delay

coordinate embedding. (in book by Weigend and

Gershenfeld, below) Addison-Wesley.

• Takens, F. (1981). Detecting strange attractors in fluid

turbulence. Dynamical Systems and Turbulence. Berlin:

Springer-Verlag.

Page 340: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 288

References

• Weigend, A. S. and N. A. Gerschenfeld (1994). Time

Series Prediction: Forecasting the Future and

Understanding the Past, Addison Wesley. (Excellent

collection of papers on chaotic/non-linear forecasting,

describing the algorithms behind the winners of the Santa

Fe competition.)

Page 341: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 289

Overall conclusions

• Similarity search: Euclidean/time-warping;

feature extraction and SAMs

Page 342: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 290

Overall conclusions

• Similarity search: Euclidean/time-warping;

feature extraction and SAMs

• Signal processing: DWT is a powerful tool

Page 343: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 291

Overall conclusions

• Similarity search: Euclidean/time-warping;

feature extraction and SAMs

• Signal processing: DWT is a powerful tool

• Linear Forecasting: AR (Box-Jenkins)

Page 344: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 292

Overall conclusions

• Similarity search: Euclidean/time-warping;

feature extraction and SAMs

• Signal processing: DWT is a powerful tool

• Linear Forecasting: AR (Box-Jenkins)

• Kalman filters & extensions: forecasting,

pattern discovery, segmentation

Page 345: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 293

Overall conclusions

• Similarity search: Euclidean/time-warping;

feature extraction and SAMs

• Signal processing: DWT is a powerful tool

• Linear Forecasting: AR (Box-Jenkins)

• Kalman filters & extensions: forecasting,

pattern discovery, segmentation

• Bursty traffic: multifractals (80-20 „law‟)

Page 346: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 294

Overall conclusions

• Similarity search: Euclidean/time-warping;

feature extraction and SAMs

• Signal processing: DWT is a powerful tool

• Linear Forecasting: AR (Box-Jenkins)

• Kalman filters & extensions: forecasting,

pattern discovery, segmentation

• Bursty traffic: multifractals (80-20 „law‟)

• Non-linear forecasting: lag-plots (Takens)

Page 347: Indexing and Mining Streamscs.kangwon.ac.kr/~ysmoon/courses/2011_1/grad_mining/slides/03.pdf · 1 11 21 31 41 51 61 71 81 91 101 111 121 Fourier appx actual • Dow Jones Industrial

CMU SCS

KDD 2010 (c) 2010, C. Faloutsos, Lei Li 295

christos AT cs.cmu.edu

www.cs.cmu.edu/~christos

leili AT cs.cmu.edu

www.cs.cmu.edu/~leili

www.cs.cmu.edu/~leili/pubs/dynammo.2.1.2.zip

THANK YOU!