DB/IR '06 C. Faloutsos #1 Carnegie Mellon Data Mining on Streams Christos Faloutsos CMU

Upload: myles-lindsey

Post on 15-Jan-2016


Page 1: Carnegie Mellon DB/IR '06C. Faloutsos#1 Data Mining on Streams Christos Faloutsos CMU

DB/IR '06 C. Faloutsos #1

Carnegie Mellon

Data Mining on Streams

Christos Faloutsos

CMU

Page 2

THANK YOU!

• Prof. Panos Ipeirotis

• Julia Mills

Page 3

Outline

• Problem and motivation

• Single-sequence mining: AWSOM

• Co-evolving sequences: SPIRIT

• Lag correlations: BRAID

• Conclusions

Page 4

Problem definition - example

Each sensor collects data (x1, x2, …, xt, …)

Page 5

Problem definition

• Given: one or more sequences x1, x2, …, xt, … (y1, y2, …, yt, …; …)

• Find: patterns; correlations; outliers, incrementally!

Page 6

Limitations / Challenges

Find patterns using a method that is

• nimble: limited resources

– Memory

– Bandwidth, power, CPU

• incremental: on-line, ‘any-time’ response

– single pass (‘you get to see it only once’)

• automatic: no human intervention

– e.g., in remote environments

Page 7

Application domains

• Sensor devices

– Temperature, weather measurements

– Road traffic data

– Geological observations

– Patient physiological data

• Embedded devices

– Network routers

– Intelligent (active) disks

Page 8

Motivation - Applications (cont’d)

• ‘Smart house’

– sensors monitor temperature, humidity, air quality

• video surveillance

Page 9

Motivation - Applications (cont’d)

• civil/automobile infrastructure

– bridge vibrations [Oppenheim+02]

– road conditions / traffic monitoring

Page 10

Motivation - Applications (cont’d)

• Weather, environment/anti-pollution

– volcano monitoring

– air/water pollutant monitoring

Page 11

Motivation - Applications (cont’d)

• Computer systems

– ‘Active Disks’ (buffering, prefetching)

– web servers (ditto)

– network traffic monitoring

– ...

Page 12

InteMon

w/ Evan Hoke, Jimeng Sun

self-* PetaByte data center at CMU

Page 13

Outline

• Problem and motivation

• Single-sequence mining: AWSOM

• Co-evolving sequences: SPIRIT

• Lag correlations: BRAID

• conclusions

Page 14

Single sequence mining - AWSOM

• with Spiros Papadimitriou (CMU -> IBM)

• Anthony Brockwell (CMU/Stat)

Page 15

Problem definition

• Semi-infinite streams of values (time series) x1, x2, …, xt, …

• Find patterns, forecasts, outliers…

[Figure: example time series, annotated “Periodicity? (daily)”, “Periodicity? (twice daily)” and “Noise”??]

Page 16

Requirements / Goals

• Adapt and handle arbitrary periodic components

and

• nimble (limited resources, single pass)

• on-line, any-time

• automatic (no human intervention/tuning)

Page 17

Overview

• Introduction / Related work

• Background

• Main idea

• Experimental results

Page 18

Wavelets: Example – Haar transform

[Figure: Haar transform of a signal xt on the time/frequency plane, with coefficients W1,1 to W1,4, W2,1, W2,2, W3,1 and the “constant” coefficient V4,1; axes: time, frequency]

Page 19

Wavelets: Why we like them

• Wavelets compress many real signals well:

– Image compression and processing

– Vision

– Astronomy, seismology, …

• Wavelet coefficients can be updated as new points arrive
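The last bullet can be illustrated with a minimal sketch. It assumes the orthonormal Haar convention (pairwise sums and differences scaled by 1/√2) and numbers levels from 1 at the finest scale, as in the W1,1 … W3,1 labels; the class name `StreamingHaar` and its fields are hypothetical, and this is not the AWSOM code itself.

```python
import math


class StreamingHaar:
    """Emit Haar detail coefficients as soon as each one is complete.

    Hypothetical illustration: only one unpaired smooth value is kept
    per level, so memory grows as O(log N) for N points seen so far.
    """

    def __init__(self):
        # pending[l] is the unpaired smooth value waiting at level l, or None
        self.pending = []

    def push(self, x):
        """Consume one value; return the (level, detail) pairs it completes."""
        completed = []
        value, level = float(x), 0
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                # No partner yet: park the value and stop.
                self.pending[level] = value
                return completed
            left = self.pending[level]
            self.pending[level] = None
            detail = (left - value) / math.sqrt(2)
            smooth = (left + value) / math.sqrt(2)
            completed.append((level + 1, detail))
            # The smooth value becomes an input to the next coarser level.
            value, level = smooth, level + 1
```

Each call does amortized constant work: every second point completes a level-1 coefficient, every fourth a level-2 coefficient, and so on.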

Page 20

Overview

• Introduction / Related work

• Background

• Main idea

• Experimental results

Page 21

AWSOM

[Figure: signal xt and its wavelet coefficients W1,1 to W1,4, W2,1, W2,2, W3,1, V4,1 on the time/frequency plane]

Page 22

AWSOM

[Figure: the same signal xt and wavelet coefficients on the time/frequency plane]

Page 23

AWSOM - idea

[Figure: coefficients Wl,t-2, Wl,t-1, Wl,t at level l, and Wl’,t’-2, Wl’,t’-1, Wl’,t’ at another level l’]

$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$

$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$

Page 24

More details…

• Update of wavelet coefficients (incremental)

• Update of linear models (incremental; RLS)

• Feature selection (single-pass)

– Not all correlations are significant

– Throw away the insignificant ones (“noise”)
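The slide names RLS (recursive least squares) for the linear-model update. The following is a minimal textbook sketch of RLS, assuming no forgetting factor and an initial inverse-correlation matrix of `delta` times the identity; the class name `RLS` and the parameter `delta` are hypothetical, and the code is not taken from AWSOM.

```python
import math


class RLS:
    """Recursive least squares for y ≈ w · x with k regression coefficients.

    Illustrative sketch: each update costs O(k^2) time and never
    revisits earlier samples.
    """

    def __init__(self, k, delta=1e6):
        self.w = [0.0] * k
        # P approximates the inverse of the regressors' correlation matrix.
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def update(self, x, y):
        """Fold in one (regressor vector, target) pair; return the prior error."""
        k = len(self.w)
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        err = y - sum(self.w[i] * x[i] for i in range(k))
        for i in range(k):
            self.w[i] += gain[i] * err
            for j in range(k):
                # P is symmetric, so x^T P equals Px transposed.
                self.P[i][j] -= gain[i] * Px[j]
        return err
```

As a usage example, feeding a pure sinusoid s_t with regressors (s_{t-1}, s_{t-2}) should drive the weights toward (2 cos ω, −1), the exact two-term recurrence of a sinusoid with angular frequency ω.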

Page 25

Complexity

• Model update

– Space: $O(\lg N + m k^2) \approx O(\lg N)$

– Time: $O(k^2) \approx O(1)$

Where

– N: number of points (so far)

– k: number of regression coefficients; fixed

– m: number of linear models; $O(\lg N)$

Page 26

Overview

• Introduction / Related work

• Background

• Main idea

• Experimental results

Page 27

Results - Synthetic data

• Triangle pulse

• Mix (sine + square)

• AR captures wrong trend (or none)

• Seasonal AR estimation fails

[Figure: forecasts by AWSOM, AR, and Seasonal AR]

Page 28

Results - Real data

• Automobile traffic

– Daily periodicity

– Bursty “noise” at smaller scales

• AR fails to capture any trend

• Seasonal AR estimation fails

Page 29

Results - real data

• Sunspot intensity

– Slightly time-varying “period”

• AR captures wrong trend

• Seasonal ARIMA

– wrong downward trend, despite help by human!

Page 30

Conclusions

• Adapt and handle arbitrary periodic components

and

• nimble

– Limited memory (logarithmic)

– Constant-time update

• on-line, any-time

– Single pass over the data

• automatic: No human intervention/tuning

Page 31

Outline

• Problem and motivation

• Single-sequence mining: AWSOM

• Co-evolving sequences: SPIRIT

• Lag correlations: BRAID

• conclusions

Page 32

Part 2

SPIRIT: Mining co-evolving streams

[Papadimitriou, Sun, Faloutsos, VLDB05]

Page 33

Motivation

• E.g., chlorine concentration in water distribution network

Page 34

Motivation

May have hundreds of measurements, but it is unlikely they are completely unrelated!

[Figure: chlorine concentrations in a water distribution network over Phase 1, Phase 2, Phase 3; normal operation]

Page 35

Motivation

[Figure: chlorine concentrations in a water distribution network over Phase 1, Phase 2, Phase 3; normal operation, then major leak; curves for sensors near the leak and sensors away from the leak]

Page 36

Motivation

[Figure: same plot as on Page 35]

Page 37

Motivation

We would like to discover a few “hidden (latent) variables” that summarize the key trends

[Figure: actual measurements (n streams) of chlorine concentrations next to the k hidden variable(s), Phase 1; k = 1]

Page 38

Motivation

We would like to discover a few “hidden (latent) variables” that summarize the key trends

[Figure: actual measurements (n streams) of chlorine concentrations next to the k hidden variable(s), Phase 1 and Phase 2; k = 2]

Page 39

Motivation

We would like to discover a few “hidden (latent) variables” that summarize the key trends

[Figure: actual measurements (n streams) of chlorine concentrations next to the k hidden variable(s), Phase 1, Phase 2 and Phase 3; k = 1]

Page 40

Goals

• Discover “hidden” (latent) variables for:

– Summarization of main trends for users

– Efficient forecasting, spotting outliers/anomalies

and the usual:

• nimble: Limited memory requirements

• on-line, any-time: (single pass etc)

• automatic: No special parameters to tune

Page 41

Related work: Stream mining

• Stream SVD [Guha, Gunopulos, Koudas / KDD03]

• StatStream [Zhu, Shasha / VLDB02]

• Clustering [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]

• Classification [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]

Page 42

Related work: Stream mining

• Piecewise approximations [Palpanas, Vlachos, Keogh, et al / ICDE 2004]

• Queries on streams [Dobra, Garofalakis, Gehrke, et al / SIGMOD02], [Madden, Franklin, Hellerstein, et al / OSDI02], [Considine, Li, Kollios, et al / ICDE04], [Hammad, Aref, Elmagarmid / SSDBM03]

• …

Page 43

Overview: Part 2

• Method

• Experiments

• Conclusions & Other work

Page 44

Stream correlations

• Step 1: How to capture correlations?

• Step 2: How to do it incrementally, when we have a very large number of points?

• Step 3: How to dynamically adjust the number of hidden variables?

Page 45

1. How to capture correlations?

[Figure: temperature t1 of the first sensor versus time, between 20°C and 30°C]

Page 46

1. How to capture correlations?

[Figure: temperatures of the first sensor and the second sensor (t2) versus time, between 20°C and 30°C]

Page 47

1. How to capture correlations

Correlations: Let’s take a closer look at the first three value-pairs…

[Figure: scatter plot of Temperature t2 versus Temperature t1, both axes from 20°C to 30°C]

Page 48

1. How to capture correlations

First three lie (almost) on a line in the space of value-pairs…

• O(n) numbers for the slope, and

• One number for each value-pair (offset on line)

[Figure: scatter plot of Temperature t2 versus Temperature t1 (20°C to 30°C) with the points for time=1, time=2, time=3 on a line; offset = “hidden variable”]

Page 49

1. How to capture correlations

Other pairs also follow the same pattern: they lie (approximately) on this line

[Figure: scatter plot of Temperature t2 versus Temperature t1 (20°C to 30°C)]

Page 50

Stream correlations

• Step 1: How to capture correlations?

• Step 2: How to do it incrementally, when we have a very large number of points?

• Step 3: How to dynamically adjust the number of hidden variables?

Page 51

Incremental updates

[Figure: scatter plot of Temperature T2 versus Temperature T1 (20°C to 30°C), with the “error” of a point marked]

Page 52

Incremental updates

• Algorithm runs in O(n) where n = # of streams

• no need to access old data

[Figure: scatter plot versus Temperature T1 (20°C to 30°C), with the “error” of a point marked]

Page 53

Stream correlations: Principal Component Analysis (PCA)

• The “line” is the first principal component (PC)

• This line is optimal: it minimizes the sum of squared projection errors

Page 54

2. Incremental update

Given number of hidden variables k

• Assuming k is known

• We know how to update the slope

For each new point x and for i = 1, …, k:

• $y_i := w_i^T x$ (proj. onto $w_i$)

• $d_i \leftarrow d_i + y_i^2$ (energy $\propto$ i-th eigenval.)

• $e_i := x - y_i w_i$ (error)

• $w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)

• $x \leftarrow x - y_i w_i$ (repeat with remainder)

[Figure: a point x, its projection y1 onto w1, the error e1, and w1 updated]
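The five update steps can be sketched directly. This is a plain transcription of the bullets on this slide, assuming the weight vectors start as unit vectors and the energies `d` start at a small positive value; the function name `spirit_update` and those initial values are illustrative assumptions, not details given in the talk.

```python
def spirit_update(W, d, x):
    """Apply the per-point update for a new measurement vector x.

    W: list of k weight vectors (one per hidden variable), updated in place.
    d: list of k running energies, updated in place.
    Returns the k hidden-variable values y_1..y_k for this point.
    """
    x = list(x)
    ys = []
    for i in range(len(W)):
        w = W[i]
        y = sum(wj * xj for wj, xj in zip(w, x))       # projection onto w_i
        d[i] += y * y                                   # accumulate energy
        e = [xj - y * wj for xj, wj in zip(x, w)]      # projection error
        w = [wj + (y / d[i]) * ej for wj, ej in zip(w, e)]  # nudge w_i
        W[i] = w
        x = [xj - y * wj for xj, wj in zip(x, w)]      # pass on the remainder
        ys.append(y)
    return ys
```

Each call touches every weight vector once, so the cost is O(nk) per point for n streams and k hidden variables, and no earlier points are stored.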

Page 55

Stream correlations

• Step 1: How to capture correlations?

• Step 2: How to do it incrementally, when we have a very large number of points?

• Step 3: How to dynamically adjust k, the number of hidden variables?

Page 56

Answer

• When the reconstruction accuracy is too low (say, <95%)

• then introduce another hidden variable (k++)

• [How to initialize its values: tricky]
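One way to read “reconstruction accuracy” is as the fraction of signal energy retained by the hidden variables. The sketch below makes that assumption explicit: it compares the running sum of squared hidden-variable values with the running sum of squared measurements and reports when the ratio falls below the 95% mentioned on the slide. The class name `EnergyMonitor` is hypothetical, and the sketch only signals that k should grow; it deliberately leaves out how to initialize the new variable.

```python
class EnergyMonitor:
    """Track how much of the stream's energy the hidden variables retain."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.total = 0.0  # running sum of squared measurements
        self.kept = 0.0   # running sum of squared hidden-variable values

    def observe(self, x, ys):
        """Record one point; return True if another hidden variable is needed."""
        self.total += sum(v * v for v in x)
        self.kept += sum(v * v for v in ys)
        return self.total > 0 and self.kept < self.threshold * self.total
```

The energy comparison is only meaningful when the weight vectors are close to orthonormal, which is an assumption of this sketch.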

Page 57

Missing values

[Figure: scatter plot of Temperature T2 versus Temperature T1 (20°C to 30°C), marking the true values (pair), all possible value pairs (given only t1), and the best guess (given correlations: intersection)]

Page 58

Forecasting

• Assume we want to forecast the next value for a particular stream (e.g. auto-regression)

[Figure: n streams, with the next value of one stream marked “?”]

Page 59

Forecasting

• Option 1: One complex model per stream

– Next value = function of previous values on all streams

– Captures correlations

– Too costly! [ ~ O(n³) ]

[Figure: n streams]

Page 60

Forecasting

• Option 1: One complex model per stream

• Option 2: One simple model per stream

– Next value = function of previous value on same stream

– Worse accuracy, but maybe acceptable

– But, still need n models

[Figure: n streams]

Page 61

Forecasting

k hidden vars: k << n and already capture correlations

Only k simple models

Efficiency & robustness

[Figure: n streams reduced to k hidden variables]

Page 62

Time/space requirements: Incremental PCA

O(nk) space (total) and time (per tuple), i.e.,

• Independent of # points

• Linear w.r.t. # streams (n)

• Linear w.r.t. # hidden variables (k)

In fact,

• Can be done in real time

Page 63

Overview: Part 2

• Method

• Experiments

• Conclusions & Other work

Page 64

Experiments: Chlorine concentration

• 166 streams

• 2 hidden variables (~4% error)

[Figure: measurements and reconstruction] [CMU Civil Engineering]

Page 65

Experiments: Chlorine concentration

[Figure: hidden variables] [CMU Civil Engineering]

• Both capture global, periodic pattern

• Second: ~ first, but phase-shifted

• Can express any phase-shift…

Page 66

Experiments: Light measurements

• 54 sensors

• 2-4 hidden variables (~6% error)

[Figure: measurement and reconstruction]

Page 67

Experiments: Light measurements

• 1 & 2: main trend (as before)

• 3 & 4: potential anomalies and outliers

[Figure: hidden variables, two of them marked “intermittent”]

Page 68

Conclusions

SPIRIT:

• Discovers hidden variables for

– Summarization of main trends for users

– Efficient forecasting, spotting outliers/anomalies

• Incremental, real time computation

• nimble: With limited memory

• automatic: No special parameters to tune

Page 69

Outline

• Problem and motivation

• Single-sequence mining: AWSOM

• Co-evolving sequences: SPIRIT

• Lag correlations: BRAID

• Conclusions

Page 70

Part 3: BRAID: Discovering Lag Correlations in Multiple Streams

Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos

SIGMOD’05

Page 71

Lag Correlations

• Examples

– A decrease in interest rates typically precedes an increase in house sales by a few months

– Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later

Page 72

Lag Correlations

• Example of lag-correlated sequences

These sequences are correlated with lag l=1300 time-ticks

[Figure: the two sequences and their CCF (Cross-Correlation Function)]

Page 73

Lag Correlations

• Example of lag-correlated sequences

[Figure: CCF (Cross-Correlation Function)]

How to compute it

• quickly

• cheaply

• incrementally

Page 74

Challenging Problems

• Problem definitions

– Given two co-evolving sequences X and Y, determine

• Whether there is a lag correlation

• If yes, what is the lag length l

– Given k numerical sequences, X1, …, Xk, report

• Which pairs have a lag correlation

• The corresponding lag for each pair

Page 75

Our solution

• Ideal characteristics:

– ‘Any-time’ processing, and fast: computation time per time tick is constant

– Nimble: memory space requirement is sub-linear in the sequence length

– Accurate: approximation introduces small error

Page 76

Related Work

• Sequence indexing

– Agrawal et al. (FODO 1993)

– Faloutsos et al. (SIGMOD 1994)

– Keogh et al. (SIGMOD 2001)

• Compression (wavelet and random projections)

– Gilbert et al. (VLDB 2001), Guha et al. (VLDB 2004)

– Dobra et al. (SIGMOD 2002), Ganguly et al. (SIGMOD 2003)

• Data Stream Management

– Abadi et al. (VLDB Journal 2003)

– Motwani et al. (CIDR 2003)

– Chandrasekaran et al. (CIDR 2003)

– Cranor et al. (SIGMOD 2003)

Page 77

Related Work

• Pattern discovery

– Clustering for data streams: Guha et al. (TKDE 2003)

– Monitoring multiple streams: Zhu et al. (VLDB 2002)

– Forecasting: Yi et al. (ICDE 2000), Papadimitriou et al. (VLDB 2003)

• None of the previously published methods focuses on this problem

Page 78

Overview

• Introduction / Related work

• Background

• Main ideas

• Theoretical analysis

• Experimental results

Page 79

Main Idea (1)

• Incremental computation

– Sufficient statistics

• Sum of X: $Sx(1,n) = \sum_{t=1}^{n} x_t$

• Square sum of X: $Sxx(1,n) = \sum_{t=1}^{n} x_t^2$

• Inner-product for X and the shifted Y: $Sxy(l) = \sum_{t=l+1}^{n} x_t\, y_{t-l}$

– Compute R(l) incrementally: $R(l) = \dfrac{C(l)}{\sqrt{Vx(l+1,n)\, Vy(1,n-l)}}$

• Covariance of X and Y: $C(l) = Sxy(l) - \dfrac{Sx(l+1,n)\, Sy(1,n-l)}{n-l}$

• Variance of X: $Vx(l+1,n) = Sxx(l+1,n) - \dfrac{\left(Sx(l+1,n)\right)^2}{n-l}$
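The sufficient statistics on this slide can be maintained with constant work per time tick for any single lag. The sketch below is the exact, per-lag version: it keeps the last l+1 values of Y in a buffer so that y_{t−l} is available when x_t arrives. That buffer is an assumption of the sketch (the smoothing and lag-probing ideas on the following slides are what avoid it), and the class name `LagCorrelation` is hypothetical.

```python
import math
from collections import deque


class LagCorrelation:
    """Incrementally track R(l) for one fixed lag l.

    Pairs x_t with y_{t-l}, i.e. it measures how well X follows Y
    after a delay of l ticks.
    """

    def __init__(self, lag):
        self.lag = lag
        self.ybuf = deque(maxlen=lag + 1)  # holds y_{t-lag} .. y_t
        self.m = 0                          # number of aligned pairs, n - lag
        self.sx = self.sxx = 0.0            # Sx(l+1,n), Sxx(l+1,n)
        self.sy = self.syy = 0.0            # Sy(1,n-l), Syy(1,n-l)
        self.sxy = 0.0                      # Sxy(l)

    def push(self, x, y):
        """Consume the values of both sequences at the current tick."""
        self.ybuf.append(y)
        if len(self.ybuf) <= self.lag:
            return  # not enough history yet to form an aligned pair
        yl = self.ybuf[0]  # y_{t - lag}
        self.m += 1
        self.sx += x
        self.sxx += x * x
        self.sy += yl
        self.syy += yl * yl
        self.sxy += x * yl

    def r(self):
        """Return the correlation coefficient R(l), or 0.0 if undefined."""
        m = self.m
        if m < 2:
            return 0.0
        c = self.sxy - self.sx * self.sy / m
        vx = self.sxx - self.sx ** 2 / m
        vy = self.syy - self.sy ** 2 / m
        if vx <= 0.0 or vy <= 0.0:
            return 0.0
        return c / math.sqrt(vx * vy)
```

Running one tracker per candidate lag reproduces the naive CCF; its cost grows with the number of lags tracked, which is the expense the rest of the method is designed to reduce.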

Page 80

Main Idea (2)

• Sequence smoothing

[Figure: lag correlation windows over Time, up to t=n]

Page 81

Main Idea (2)

• Sequence smoothing

– Means of windows for each level

– Sufficient statistics computed from the means

– CCF computed from the sufficient statistics

– But, it allows a partial redundancy

[Figure: lag correlation windows per Level (starting at h=0) over Time, up to t=n]

Page 82

Main Idea (3)

• Geometric lag probing

[Figure: lag correlation windows per Level (starting at h=0) over Time, up to t=n]

Page 83

Main Idea (3)

• Geometric lag probing

– Use colored windows

– Keep track of only a geometric progression of the lag values: $l = \{0, 1, 2, 4, 8, \dots, 2^h, \dots\}$

– Use a cubic spline to interpolate

[Figure: lag correlation windows per Level (starting at h=0) over Time, up to t=n]
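A small sketch of the probing idea: generate the geometric set of lags, then estimate the CCF at an untracked lag from its neighbours. Note the substitution: the slide calls for a cubic spline, while this sketch uses piecewise-linear interpolation as a simpler stand-in. The function names `geometric_lags` and `interpolate_ccf` are hypothetical.

```python
def geometric_lags(max_lag):
    """Return the probed lags 0, 1, 2, 4, ... up to max_lag."""
    lags = [0]
    lag = 1
    while lag <= max_lag:
        lags.append(lag)
        lag *= 2
    return lags


def interpolate_ccf(lags, values, lag):
    """Estimate the CCF at `lag` from values known at the probed lags.

    Linear stand-in for the cubic spline named on the slide.
    """
    points = list(zip(lags, values))
    for (l0, v0), (l1, v1) in zip(points, points[1:]):
        if l0 <= lag <= l1:
            return v0 + (v1 - v0) * (lag - l0) / (l1 - l0)
    raise ValueError("lag outside the probed range")
```

Only about log2(max_lag) + 2 lags are tracked, which is where the logarithmic space bound comes from.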

Page 84

Overview

• Introduction / Related work

• Background

• Main ideas

• Theoretical analysis

• Experimental results

Page 85

Experimental results

• Setup

– Intel Xeon 2.8GHz, 1GB memory, Linux

– Datasets: Sines, SpikeTrains, Humidity, Light, Temperature, Kursk, Sunspots

– Enhanced BRAID, b=16

• Evaluation

– Estimation error of lag correlations

– Computation time

Page 86

Detecting Lag Correlations (2)

• SpikeTrains

[Figure: CCF (Cross-Correlation Function)]

BRAID closely estimates the correlation coefficients

Page 87

Detecting Lag Correlations (3)

• Humidity

[Figure: CCF (Cross-Correlation Function)]

BRAID closely estimates the correlation coefficients

Page 88

Detecting Lag Correlations (4)

• Light

[Figure: CCF (Cross-Correlation Function)]

BRAID closely estimates the correlation coefficients

Page 89

Detecting Lag Correlations (5)

• Kursk

[Figure: CCF (Cross-Correlation Function)]

BRAID closely estimates the correlation coefficients

Page 90

Estimation Error

• Largest relative error is about 1%

Datasets       Lag correlation      Estimation error (%)
               Naive     BRAID
Sines            716       716      0.000
SpikeTrains     2841      2830      0.387
Humidity        3842      3855      0.338
Light            567       570      0.529
Kursk           1463      1472      0.615
Sunspots        1156      1168      1.038
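The error column appears to be the relative difference between the two detected lags, taking the naive result as the reference. The following check works through that arithmetic for the figures on this slide; the formula is inferred from the numbers rather than stated in the talk, and the names `lags` and `relative_error_pct` are hypothetical.

```python
# (naive lag, BRAID lag) per dataset, as read from the table on this slide
lags = {
    "Sines": (716, 716),
    "SpikeTrains": (2841, 2830),
    "Humidity": (3842, 3855),
    "Light": (567, 570),
    "Kursk": (1463, 1472),
    "Sunspots": (1156, 1168),
}


def relative_error_pct(naive, braid):
    """Relative error of the BRAID lag against the naive lag, in percent."""
    return 100.0 * abs(braid - naive) / naive


errors = {
    name: round(relative_error_pct(naive, braid), 3)
    for name, (naive, braid) in lags.items()
}
```

For example, Sunspots gives 100 × |1168 − 1156| / 1156, which rounds to the 1.038 shown and is the largest entry.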

Page 91

Performance

• Almost linear w.r.t. sequence length

• Up to 40,000 times faster

Page 92

Group Lag Correlations

• Two correlated pairs from 55 Temperature sequences

• Each sensor is located in a different place

[Figure: sequences #16 and #19 with the estimation of their CCF; sequences #47 and #48 with the estimation of their CCF]

Page 93

Conclusions

Automatic lag correlation detection on stream data

• incremental: online, ‘any-time’

• nimble

– O(log n) space, O(1) time to update the statistics

– Up to 40,000 times faster than the naive implementation

• Accurate

– Detecting the correct lag within 1% relative error or less

Page 94

Overall Conclusions

• Mining streaming numerical data: challenging!

• Extensions: streaming matrix data (e.g., network traffic matrix)

[Figure: traffic matrix with axes IP-source, IP-destination, time]

Page 95

Thank you

• christos <at> cs.cmu.edu

• www.cs.cmu.edu/~christos

• [InteMon demo]