sensor and graph mining


Upload: monita

Post on 07-Jan-2016



Page 1: Sensor and Graph Mining

INTEL 04 C. Faloutsos 1

School of Computer Science, Carnegie Mellon

Sensor and Graph Mining

Christos Faloutsos

Carnegie Mellon University & IBM
www.cs.cmu.edu/~christos

Page 2: Sensor and Graph Mining

Joint work with

• Anthony Brockwell (CMU/Stat)
• Deepayan Chakrabarti (CMU)
• Spiros Papadimitriou (CMU)
• Chenxi Wang (CMU)
• Yang Wang (CMU)

Page 3: Sensor and Graph Mining

Outline

• Introduction - motivation
• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions

Page 4: Sensor and Graph Mining

Introduction

• Sensor devices
  – Temperature, weather measurements
  – Road traffic data
  – Geological observations
  – Patient physiological data
• Embedded devices
  – Network routers
  – Intelligent (active) disks

Page 5: Sensor and Graph Mining

Introduction

• Limited resources
  – Memory
  – Bandwidth
  – Power
  – CPU
• Remote environments
  – No human intervention

Page 6: Sensor and Graph Mining

Introduction – problem definition

• Given a semi-infinite stream of values (time series) x1, x2, …, xt, …
• Find patterns, forecasts, outliers…

Page 7: Sensor and Graph Mining

Introduction

• E.g.,

[figure: example time series, annotated “Periodicity? (daily)”, “Periodicity? (twice daily)” and “Noise”??]

Page 8: Sensor and Graph Mining

Introduction

[figure: time series, annotated “Periodicity? (daily)”, “Periodicity? (twice daily)” and “Noise”??]

• Can we capture these patterns
  – automatically
  – with limited resources?

Page 9: Sensor and Graph Mining

Related work
Statistics: Time series forecasting

• Main problem:
  “[…] The first step in the analysis of any time series is to plot the data [and inspect the graph]” [Brockwell 91]
• Typically:
  – Resource intensive
  – Cannot update online
• AR(I)MA and seasonal variants
• ARFIMA, GARCH, …

Page 10: Sensor and Graph Mining

Related work
Databases: Continuous Queries

• Typically, different focus:
  – “Compression”
  – Not generative models
• Largely orthogonal problem…
  – Gilbert, Guha, Indyk et al. (STOC 2002)
  – Garofalakis, Gibbons (SIGMOD 2002)
  – Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh (ICDE 2003)
  – Gehrke, Korn, et al. (SIGMOD 2001); Dobra, Garofalakis, Gehrke et al. (SIGMOD 2002)
  – Guha, Koudas (ICDE 2003); Datar, Gionis, Indyk et al. (SODA 2002)
  – Madden+ [SIGMOD02], [SIGMOD03]

Page 11: Sensor and Graph Mining


Goals

• Adapt and handle arbitrary periodic components

• No human intervention/tuning

Also:

• Single pass over the data

• Limited memory (logarithmic)

• Constant-time update

Page 12: Sensor and Graph Mining

Outline

• Introduction - motivation
• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions

Page 13: Sensor and Graph Mining

Wavelets
“Straight” signal

[figure: signal xt vs. time t, with one panel per component I1 through I8 laid out along the time axis]

Page 14: Sensor and Graph Mining

Wavelets
Introduction – Haar

[figure: signal xt vs. time t and its Haar coefficients W1,1 … W1,4, W2,1, W2,2, W3,1 and V4,1; axes: time and frequency]

Page 15: Sensor and Graph Mining

Wavelets

• So?
• Wavelets compress many real signals well…
  – Image compression and processing
  – Vision; Astronomy, seismology, …
• Wavelet coefficients can be updated as new points arrive [Kotidis+]
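
The incremental update mentioned on this slide can be sketched as follows. This is a minimal illustration of a streaming orthonormal Haar transform, not the algorithm of [Kotidis+]; the class name `StreamingHaar` and its attributes are hypothetical. It keeps at most one unpaired value per level, so memory grows logarithmically with the number of points seen, and each new point costs amortized constant time.

```python
import math


class StreamingHaar:
    """Streaming orthonormal Haar transform (illustrative sketch)."""

    def __init__(self):
        self.pending = []  # pending[l]: unpaired smooth value at level l, or None
        self.details = []  # emitted (level, detail coefficient); finest level is 1

    def push(self, x):
        level, value = 0, float(x)
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                # No partner yet at this level: store the value and stop.
                self.pending[level] = value
                return
            left = self.pending[level]
            self.pending[level] = None
            detail = (left - value) / math.sqrt(2)
            smooth = (left + value) / math.sqrt(2)
            self.details.append((level + 1, detail))
            # Carry the smooth (average-like) part up to the next level.
            level, value = level + 1, smooth


haar = StreamingHaar()
for x in [4, 2, 5, 5]:
    haar.push(x)
print(haar.details)
```

After four points the sketch has emitted two level-1 details and one level-2 detail, and holds a single level-2 smooth value.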

Page 16: Sensor and Graph Mining

Wavelets
Correlations

[figure: signal xt vs. time t and its Haar coefficients W1,1 … W1,4, W2,1, W2,2, W3,1 and V4,1 on the time/frequency plane, with coefficients marked “=”]

Page 17: Sensor and Graph Mining

Wavelets
Correlations

[figure: signal xt vs. time t and its Haar coefficients W1,1 … W1,4, W2,1, W2,2, W3,1 and V4,1 on the time/frequency plane]

Page 18: Sensor and Graph Mining

Main idea
Correlations

• Wavelets are good…
• …we can do even better
  – One number…
  – …and the fact that they are equal/correlated

Page 19: Sensor and Graph Mining

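Proposed method

[figure: wavelet coefficients $W_{l,t-2}$, $W_{l,t-1}$, $W_{l,t}$ at level $l$, and $W_{l',t'-2}$, $W_{l',t'-1}$, $W_{l',t'}$ at level $l'$]

$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$

$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$

Small windows suffice… (k~4)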

Page 20: Sensor and Graph Mining

More details…

• Update of wavelet coefficients (incremental)
• Update of linear models (incremental; RLS)
• Feature selection (single-pass)
  – Not all correlations are significant
  – Throw away the insignificant ones
  – very important!!

[see paper]
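
The incremental update of a linear model by recursive least squares (RLS) can be sketched as below. This is the textbook RLS recursion, shown under assumptions: the class name `RecursiveLeastSquares`, the parameter `init_scale`, and the synthetic data are hypothetical and are not taken from the paper. Each update touches only a k-by-k matrix, so it costs O(k²) time and space regardless of how many points have been seen.

```python
import numpy as np


class RecursiveLeastSquares:
    """Textbook RLS for y ~ x . beta (illustrative sketch)."""

    def __init__(self, k, init_scale=1e6, forgetting=1.0):
        self.beta = np.zeros(k)           # current coefficient estimate
        self.P = np.eye(k) * init_scale   # running inverse-covariance estimate
        self.lam = forgetting             # 1.0 means no forgetting

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        gain = Px / (self.lam + x @ Px)
        error = y - x @ self.beta         # prediction error before the update
        self.beta = self.beta + gain * error
        self.P = (self.P - np.outer(gain, Px)) / self.lam
        return error


# Synthetic check: recover y = 2*x1 - 1*x2 from a stream of samples.
rng = np.random.default_rng(0)
rls = RecursiveLeastSquares(k=2)
for _ in range(50):
    x = rng.standard_normal(2)
    rls.update(x, 2.0 * x[0] - 1.0 * x[1])
print(rls.beta)
```

In the setting of the slides, `x` would hold the k previous wavelet coefficients selected as regressors and `y` the coefficient being predicted.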

Page 21: Sensor and Graph Mining

Complexity

• Model update
  – Space: $O(\lg N + m k^2) = O(\lg N)$
  – Time: $O(k^2) = O(1)$

Where
  – N: number of points (so far)
  – k: number of regression coefficients; fixed
  – m: number of linear models; $O(\lg N)$

[see paper]

Page 22: Sensor and Graph Mining

Outline

• Introduction - motivation
• Problem #1: Stream Mining
  – Motivation
  – Main idea
  – Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions

Page 23: Sensor and Graph Mining

Setup

• First half used for model estimation
• Models applied forward to forecast entire second half
• AR, Seasonal AR (SAR): R
  – Simplest possible estimation – no maximum likelihood estimation (MLE), etc.
• … vs. Python scripts
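
The evaluation protocol on this slide (estimate on the first half, forecast the whole second half) can be sketched for a plain AR(p) baseline as follows. This is an assumption-laden illustration: the functions `fit_ar` and `forecast` are hypothetical, the fit is ordinary least squares rather than whatever estimator the R runs used, and the sine series is synthetic.

```python
import numpy as np


def fit_ar(series, p):
    """Least-squares AR(p): x[t] ~ sum_i coeffs[i] * x[t-1-i]."""
    x = np.asarray(series, dtype=float)
    # Each row holds the p previous values, most recent first.
    rows = np.array([x[t - p:t][::-1] for t in range(p, len(x))])
    coeffs, *_ = np.linalg.lstsq(rows, x[p:], rcond=None)
    return coeffs


def forecast(history, coeffs, steps):
    """Iterate the fitted model forward, feeding forecasts back in."""
    buf = list(history[-len(coeffs):])  # last p values, oldest first
    out = []
    for _ in range(steps):
        nxt = float(np.dot(coeffs, buf[::-1]))
        out.append(nxt)
        buf = buf[1:] + [nxt]
    return np.array(out)


t = np.arange(400)
series = np.sin(2 * np.pi * t / 20)
half = len(series) // 2
coeffs = fit_ar(series[:half], p=2)
predicted = forecast(series[:half], coeffs, steps=half)
max_error = np.max(np.abs(predicted - series[half:]))
print(max_error)
```

A pure sine obeys an exact two-term recurrence, so AR(2) forecasts it almost perfectly; the signals in the experiments that follow are chosen so that this is not the case.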

Page 24: Sensor and Graph Mining

Results
Synthetic data – Triangle pulse

• Triangle pulse
• AR captures wrong trend (or none)
• Seasonal AR (SAR) estimation fails

Page 25: Sensor and Graph Mining

Results
Synthetic data – Mix

• Mix (sine + square pulse)
• AR captures wrong trend (or none)
• Seasonal AR estimation fails

Page 26: Sensor and Graph Mining

Results
Real data – Automobile

• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales

(filtered)

Page 27: Sensor and Graph Mining

Results
Real data – Automobile

• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales
• AR fails to capture any trend (average)
• Seasonal AR estimation fails

Page 28: Sensor and Graph Mining

Results
Real data – Automobile

• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales
• AWSOM spots periodicities, automatically

Page 29: Sensor and Graph Mining

Results
Real data – Automobile

• Automobile traffic
  – Daily periodicity with rush-hour peaks
  – Bursty “noise” at smaller time scales
• Generation with identified noise

Page 30: Sensor and Graph Mining

Results
Real data – Sunspot

• Sunspot intensity – Slightly time-varying “period”
• AR captures wrong trend (average)
• Seasonal ARIMA
  – Captures immediate wrong downward trend
  – Requires human to determine seasonal component period (fixed)

Page 31: Sensor and Graph Mining

Results
Real data – Sunspot

• Sunspot intensity – Slightly time-varying “period”

Estimation: 40 minutes (R) vs. 9 seconds (Python)

Page 32: Sensor and Graph Mining

Variance

• Variance (log-power) vs. scale:
  – “Noise” diagnostic (if decreasing linear…)
  – Can use to estimate noise parameters (~Hurst exponent)

[figure: log-power vs. scale, with an annotation at “~ 1 hour”]
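
A log-power vs. scale diagnostic of this kind can be sketched as below. This is a rough illustration, not the paper's procedure: the function `haar_log_power` is hypothetical, it uses a batch Haar transform for brevity, and it assumes the signal length is divisible by 2 raised to the number of levels. The slide ties the slope to the Hurst exponent; the exact mapping is not reproduced here.

```python
import numpy as np


def haar_log_power(signal, levels):
    """log2 of the mean squared Haar detail coefficient at each level."""
    smooth = np.asarray(signal, dtype=float)
    log_power = []
    for _ in range(levels):
        even, odd = smooth[0::2], smooth[1::2]
        detail = (even - odd) / np.sqrt(2)
        smooth = (even + odd) / np.sqrt(2)
        log_power.append(np.log2(np.mean(detail ** 2)))
    return np.array(log_power)


# White noise has the same power at every scale, so the fitted slope
# should be close to zero; correlated noise would show a trend.
rng = np.random.default_rng(0)
white = rng.standard_normal(2 ** 14)
log_power = haar_log_power(white, levels=8)
slope = np.polyfit(np.arange(1, 9), log_power, 1)[0]
print(log_power, slope)
```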

Page 33: Sensor and Graph Mining

Running time

[figure: running time; axes: stream size (N) vs. time (t)]

Page 34: Sensor and Graph Mining


Space requirements

Equal total number of model parameters

Page 35: Sensor and Graph Mining

Conclusion

• Adapt and handle arbitrary periodic components
• No human intervention/tuning
• Single pass over the data
• Limited memory (logarithmic)
• Constant-time update

Page 36: Sensor and Graph Mining

Conclusion

No human:
• Adapt and handle arbitrary periodic components
• No human intervention/tuning

Limited resources:
• Single pass over the data
• Limited memory (logarithmic)
• Constant-time update

Page 37: Sensor and Graph Mining

Outline

• Introduction - motivation
• Problem #1: Streams
• Problem #2: Graphs & Virus propagation
  – Motivation & problem definition
  – Related work
  – Main idea
  – Experiments
• Conclusions

Page 38: Sensor and Graph Mining

Introduction

Internet Map [lumeta.com]
Food Web [Martinez ’91]
Protein Interactions [genomebiology.com]
Friendship Network [Moody ’01]

► Graphs are ubiquitous

Page 39: Sensor and Graph Mining

Introduction

• What can we do with graph analysis?
  – Immunization;
  – Information dissemination
  – Network value of a customer [Domingos+]

[figure: “Needle exchange” networks of drug users [Weeks et al. 2002], with “bridges” marked]

Page 40: Sensor and Graph Mining

Problem definition

• Q1: How does a virus spread across an arbitrary network?
• Q2: Will it create an epidemic?
• (In a sensor setting with a ‘gossip’ protocol: will a rumor/query spread?)

Page 41: Sensor and Graph Mining

Framework

• Susceptible-Infected-Susceptible (SIS) model
  – Cured nodes immediately become susceptible

[figure: state diagram with two states, “Susceptible/healthy” and “Infected & infectious”; transitions labeled “Infected by neighbor” and “Cured internally”]

Page 42: Sensor and Graph Mining

The model

• (virus) Birth rate β: probability that an infected neighbor attacks
• (virus) Death rate δ: probability that an infected node heals

[figure: node N with neighbors N1, N2, N3; legend: Infected, Healthy; edges labeled “Prob. β”, “Prob. β”, “Prob. δ”]
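
One time step of this SIS model can be sketched as below. This is a simplified, synchronous reading of the slide, offered as an assumption: the function `sis_step` and the adjacency-dict representation are hypothetical, and details such as whether a node cured in a step can be re-infected in that same step are choices made here, not taken from [Wang+03].

```python
import random


def sis_step(neighbors, infected, beta, delta, rng):
    """One synchronous SIS step; returns the new set of infected nodes.

    neighbors: dict mapping each node to an iterable of its neighbors.
    beta: probability that an infected neighbor attacks successfully.
    delta: probability that an infected node heals.
    """
    nxt = set()
    for node, nbrs in neighbors.items():
        if node in infected:
            # An infected node stays infected unless it heals.
            if rng.random() >= delta:
                nxt.add(node)
        else:
            # A healthy node is attacked independently by each infected neighbor.
            for nb in nbrs:
                if nb in infected and rng.random() < beta:
                    nxt.add(node)
                    break
    return nxt


# Star with hub 0 and four spokes; start with only the hub infected.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
rng = random.Random(0)
state = {0}
for _ in range(10):
    state = sis_step(star, state, beta=0.3, delta=0.2, rng=rng)
print(sorted(state))
```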

Page 43: Sensor and Graph Mining

Epidemic threshold

Defined as the value of τ, such that
if β/δ < τ
an epidemic cannot happen

Thus,
• given a graph
• compute its epidemic threshold

Page 44: Sensor and Graph Mining

Epidemic threshold

What should τ depend on?
• avg. degree? and/or highest degree?
• and/or variance of degree?
• and/or determinant of the adjacency matrix?

Page 45: Sensor and Graph Mining

Basic Homogeneous Model

Homogeneous graphs [Kephart-White ’91, ’93]
• Epidemic threshold = 1/<k>
• Homogeneous connectivity <k>, i.e., all nodes have ~same degree (unrealistic)

Page 46: Sensor and Graph Mining

Power-law Networks

• Model for Barabási-Albert networks
  – [Pastor-Satorras & Vespignani, ’01, ’02]
  – Epidemic threshold = <k> / <k²>
  – for BA type networks, with only γ = 3 (γ = power-law exponent, i.e., the slope of the power law)
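
The degree-moment threshold <k> / <k²> depends only on the degree sequence, which the following sketch makes concrete. The function name `degree_moment_threshold` and the two toy degree sequences are hypothetical illustrations.

```python
def degree_moment_threshold(degrees):
    """Threshold <k> / <k^2> computed from a list of node degrees."""
    n = len(degrees)
    mean_k = sum(degrees) / n                    # first moment <k>
    mean_k2 = sum(d * d for d in degrees) / n    # second moment <k^2>
    return mean_k / mean_k2


# Regular graph, every node of degree 4: reduces to 1/<k>.
regular = degree_moment_threshold([4, 4, 4, 4, 4])

# Skewed degrees (one hub of degree 3 and three leaves).
skewed = degree_moment_threshold([3, 1, 1, 1])
print(regular, skewed)
```

When all degrees are equal the expression collapses to 1/<k>, the homogeneous-model threshold; a heavier tail inflates <k²> and pushes the threshold down.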

Page 47: Sensor and Graph Mining

Epidemic threshold

• Homogeneous graphs: 1/<k>
• BA (γ = 3): <k> / <k²>
• more complicated graphs?
• arbitrary, REAL graphs?
• how many parameters??

Page 48: Sensor and Graph Mining

Epidemic threshold

• [Theorem] We have no epidemic, if

β/δ < τ = 1/λ1,A

Page 49: Sensor and Graph Mining

Epidemic threshold

• [Theorem] We have no epidemic, if

β/δ < τ = 1/λ1,A

where
  – β: attack prob.
  – δ: recovery prob.
  – τ: epidemic threshold
  – λ1,A: largest eigenvalue of adj. matrix A

Proof: [Wang+03]
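
Evaluating the theorem's condition for a given graph can be sketched with NumPy as follows. The function names `epidemic_threshold` and `dies_out` are hypothetical, and the sketch assumes an undirected graph, so the adjacency matrix is symmetric and `numpy.linalg.eigvalsh` applies.

```python
import numpy as np


def epidemic_threshold(adjacency):
    """tau = 1 / lambda_1, the largest eigenvalue of a symmetric adjacency matrix."""
    a = np.asarray(adjacency, dtype=float)
    lam1 = np.linalg.eigvalsh(a)[-1]  # eigvalsh returns eigenvalues in ascending order
    return 1.0 / lam1


def dies_out(beta, delta, adjacency):
    """True when beta/delta is below the threshold, i.e. no epidemic is predicted."""
    return beta / delta < epidemic_threshold(adjacency)


# Triangle (complete graph on 3 nodes): lambda_1 = 2, so tau = 0.5.
triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
tau = epidemic_threshold(triangle)
print(tau, dies_out(0.1, 0.5, triangle), dies_out(0.4, 0.5, triangle))
```

A dense eigen-solver is fine for small examples; for a graph with thousands of nodes, a sparse solver for the leading eigenvalue would be the more practical choice.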

Page 50: Sensor and Graph Mining

Epidemic threshold for various networks

• sanity checks / older results:
• Homogeneous networks
  – λ1,A = <k>; τ = 1/<k>
  – where <k> = average degree
  – This is the same result as that of Kephart & White!

Page 51: Sensor and Graph Mining

Epidemic threshold for various networks

• sanity checks / older results:
• Star networks
  – λ1,A = sqrt(d); τ = 1/sqrt(d)
  – where d = the degree of the central node
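
The star-network value can be checked by hand. The short derivation below is a sketch that assumes, by symmetry, an eigenvector taking one value a on the hub and one value b on every leaf; the symbols a and b are introduced here for illustration only.

```latex
% Star: one hub joined to d leaves; eigenvector with x_hub = a, x_leaf = b.
\begin{align*}
(A x)_{\text{hub}}  &= d\,b = \lambda a, \\
(A x)_{\text{leaf}} &= a = \lambda b \\
\Rightarrow\quad \lambda^2 &= d, \qquad
\lambda_{1,A} = \sqrt{d}, \qquad
\tau = \frac{1}{\sqrt{d}} .
\end{align*}
% Example: a hub with d = 99 spokes gives \tau = 1/\sqrt{99} \approx 0.1005.
```

The remaining eigenvalues of the star are 0 and −sqrt(d), so sqrt(d) is indeed the largest.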

Page 52: Sensor and Graph Mining

Epidemic threshold for various networks

• sanity checks / older results:
• Infinite, power-law networks
  – λ1,A = ∞; τ = 0: *any* virus has a chance! [Barabasi et al]
• Finite power-law networks
  – τ = 1/λ1,A

Page 53: Sensor and Graph Mining

Outline

• Introduction - motivation
• Problem #1: Streams
• Problem #2: Graphs & Virus propagation
  – Motivation & problem definition
  – Related work
  – Main idea
  – Experiments
• Conclusions

Page 54: Sensor and Graph Mining

Experiments

• 2 graphs
  – Star network: one “hub” + 99 “spokes”
  – “Oregon” Internet AS graph:
    • 10,900 nodes, 31,180 edges
    • topology.eecs.umich.edu/data.html
• More in our paper: [SRDS ’03]

Page 55: Sensor and Graph Mining

Experiments (Star)

[figure: three curves]
• β/δ > τ (above threshold)
• β/δ = τ (at the threshold)
• β/δ < τ (below threshold)

Page 56: Sensor and Graph Mining


Experiments (Oregon)

β/δ > τ (above threshold)

β/δ = τ (at the threshold)

β/δ < τ (below threshold)

Page 57: Sensor and Graph Mining

Our prediction vs. previous prediction

• our predictions are more accurate

[figure: two panels, “Oregon” and “Star”; x-axis: β/δ; y-axis: number of infected nodes; each panel marks the “PL3” and “Our” predictions]

Page 58: Sensor and Graph Mining


Conclusions

We found an epidemic threshold

√ that applies to any network topology

√ and it depends only on one parameter of the graph

Page 59: Sensor and Graph Mining


Overall conclusions

• Automatic stream mining: AWSOM

• graphs and virus propagation: eigenvalue

Page 60: Sensor and Graph Mining

Ongoing / related work

• Streams
  – how to find hidden variables on multiple streams [w/ Spiros and Jimeng Sun]
  – ‘network tomography’ [w/ Airoldi +]
• Graphs
  – graph partitioning [w/ Deepay+]
  – important subgraphs [w/ Tomkins + McCurley]
  – graph generators [RMAT, w/ Deepay]

Page 61: Sensor and Graph Mining

Thank you!

Contact info:
christos @ cs.cmu.edu
spapadim @ cs.cmu.edu
deepay @ cs.cmu.edu

Page 62: Sensor and Graph Mining

Main References

• Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos: Adaptive, Hands-Off Stream Mining, VLDB 2003, Berlin, Germany, Sept. 2003.
• [Wang+03] Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos: Epidemic Spreading in Real Networks: an Eigenvalue Viewpoint, SRDS 2003, Florence, Italy.

Page 63: Sensor and Graph Mining


Additional References

• Connection Subgraphs, C. Faloutsos, K. McCurley, A. Tomkins, SIAM-DM 2004 workshop on link analysis

• RMAT: A recursive graph generator, D. Chakrabarti, Y. Zhan, C. Faloutsos, SIAM-DM 2004

• iFilter: Network tomography using particle filters, Edoardo Airoldi, Christos Faloutsos (submitted)