School of Computer Science, Carnegie Mellon
Sensor and Graph Mining
Christos Faloutsos
Carnegie Mellon University & IBM
www.cs.cmu.edu/~christos
USC 04 C. Faloutsos 2
Joint work with
• Anthony Brockwell (CMU/Stat)
• Deepayan Chakrabarti (CMU)
• Spiros Papadimitriou (CMU)
• Chenxi Wang (CMU)
• Yang Wang (CMU)
Outline
• Introduction - motivation
• Problem #1: Stream Mining
– Motivation
– Main idea
– Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions
Introduction
• Sensor devices
– Temperature, weather measurements
– Road traffic data
– Geological observations
– Patient physiological data
• Embedded devices
– Network routers
– Intelligent (active) disks
Introduction
• Limited resources
– Memory
– Bandwidth
– Power
– CPU
• Remote environments
– No human intervention
Introduction – problem definition
• Given a semi-infinite stream of values (time series) x1, x2, …, xt, …
• Find patterns, forecasts, outliers…
Introduction
• E.g.,
[Figure: example time series, annotated “Periodicity? (daily)”, “Periodicity? (twice daily)”, and “Noise”??]
• Can we capture these patterns
– automatically
– with limited resources?
Related work
Statistics: Time series forecasting
• AR(I)MA and seasonal variants
• ARFIMA, GARCH, …
• Main problem:
“[…] The first step in the analysis of any time series is to plot the data [and inspect the graph]” [Brockwell 91]
• Typically:
– Resource intensive
– Cannot update online
Related work
Databases: Continuous Queries
• Typically, different focus:
– “Compression”
– Not generative models
• Largely orthogonal problem…
– Gilbert, Guha, Indyk et al. (STOC 2002)
– Garofalakis, Gibbons (SIGMOD 2002)
– Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh (ICDE 2003)
– Gehrke, Korn, et al. (SIGMOD 2001); Dobra, Garofalakis, Gehrke et al. (SIGMOD 2002)
– Guha, Koudas (ICDE 2003); Datar, Gionis, Indyk et al. (SODA 2002)
– Madden+ [SIGMOD02], [SIGMOD03]
Goals
• Adapt and handle arbitrary periodic components
• No human intervention/tuning
Also:
• Single pass over the data
• Limited memory (logarithmic)
• Constant-time update
Wavelets
“Straight” signal
[Figure: signal xt plotted against time t, with eight panels I1, I2, …, I8, each drawn along the same time axis]
Wavelets
Introduction – Haar
[Figure: signal xt and its Haar decomposition on a time (horizontal) vs. frequency (vertical) grid: W1,1, W1,2, W1,3, W1,4; W2,1, W2,2; W3,1; V4,1]
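A minimal sketch of a Haar decomposition may help make the time/frequency grid concrete. It assumes the orthonormal normalization (division by sqrt(2)) and a "left minus right" sign convention, neither of which is stated on the slides; the function name `haar_decompose` and the sample signal are illustrative only.

```python
import math

def haar_decompose(x):
    """Return (details, smooth) for a signal whose length is a power of two.

    details[l - 1] holds the level-l detail coefficients W_{l,1}, W_{l,2}, ...
    smooth is the single remaining scaling (average-like) coefficient.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(x)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Scaled difference of each pair: detail (high-frequency) coefficient.
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        # Scaled sum of each pair: input to the next, coarser level.
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, approx[0]

signal = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
details, smooth = haar_decompose(signal)
for level, coeffs in enumerate(details, start=1):
    print("level", level, [round(c, 3) for c in coeffs])
print("smooth", round(smooth, 3))
```

For an 8-point signal this yields 4, 2, and 1 detail coefficients at levels 1, 2, and 3, which matches the shape of the W grid in the figure.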
Wavelets
• So?
• Wavelets compress many real signals well…
– Image compression and processing
– Vision; Astronomy, seismology, …
• Wavelet coefficients can be updated as new points arrive [Kotidis+]
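One way to see why coefficients can be updated as points arrive is the following sketch of a streaming Haar transform. It is not the method of [Kotidis+]; it is a simplified illustration under the same orthonormal convention as above, and the class name `StreamingHaar` is hypothetical. The `emitted` list is kept only for inspection; a real stream processor would hand each coefficient to a consumer and discard it.

```python
import math

class StreamingHaar:
    """Emit Haar detail coefficients one at a time as values arrive."""

    def __init__(self):
        self.pending = []  # pending[l]: unpaired smooth value at level l, or None
        self.emitted = []  # (level, coefficient) pairs, in arrival order

    def update(self, value):
        level = 0
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                # No partner yet: park the value and wait for the next one.
                self.pending[level] = value
                return
            left = self.pending[level]
            self.pending[level] = None
            # A pair is complete: emit its detail, carry its smooth value upward.
            self.emitted.append((level + 1, (left - value) / math.sqrt(2)))
            value = (left + value) / math.sqrt(2)
            level += 1

stream = StreamingHaar()
for value in [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]:
    stream.update(value)
print(stream.emitted)
```

Only one pending value per level is kept, so the working state grows with the number of levels (logarithmic in the points seen), and each new point triggers an amortized constant number of pair merges.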
Wavelets
Correlations
[Figure: the Haar time/frequency grid for xt (W1,1 … W1,4; W2,1, W2,2; W3,1; V4,1), with an “=” annotation between coefficients]
Main idea
Correlations
• Wavelets are good…
• …we can do even better
– One number…
– …and the fact that they are equal/correlated
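Proposed method
[Figure: neighboring wavelet coefficients used as regressors, within a level and across levels]
$W_{l,t} = \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$
$W_{l',t'} = \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$
Small windows suffice… (k~4)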
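The per-level linear model can be illustrated with a batch least-squares fit over a window of k previous coefficients. This is only a sketch: the slides update the models incrementally, the coefficient sequence here is synthetic (a hypothetical AR(2) process standing in for one level's wavelet coefficients), and the helper names `solve` and `fit_window_model` are made up for the example.

```python
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_window_model(coeffs, k=4):
    """Least-squares fit of coeffs[t] ~ sum_i beta[i] * coeffs[t - 1 - i], i < k."""
    rows = [[coeffs[t - 1 - i] for i in range(k)] for t in range(k, len(coeffs))]
    targets = [coeffs[t] for t in range(k, len(coeffs))]
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic stand-in for one level's coefficients: w_t = 0.5 w_{t-1} - 0.3 w_{t-2} + noise.
rng = random.Random(0)
series = [0.0, 0.0]
for _ in range(2000):
    series.append(0.5 * series[-1] - 0.3 * series[-2] + rng.gauss(0.0, 1.0))
beta = fit_window_model(series, k=4)
print([round(b, 2) for b in beta])
```

With k = 4 the fit should place most of the weight on the first two lags and leave the other two near zero, which is the kind of insignificant coefficient a feature-selection step could drop.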
More details…
• Update of wavelet coefficients (incremental)
• Update of linear models (incremental; RLS)
• Feature selection (single-pass)
– Not all correlations are significant
– Throw away the insignificant ones
– very important!!
[see paper]
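The incremental model update can be sketched with textbook recursive least squares (RLS). The version below has no forgetting factor and uses an assumed initialization (`delta = 1e3`); the paper's exact variant is not given on the slides, and the class name and test data are illustrative.

```python
import math

class RecursiveLeastSquares:
    """Textbook RLS for y ~ beta . x, updated one observation at a time."""

    def __init__(self, k, delta=1e3):
        self.beta = [0.0] * k
        # P tracks the inverse of (X^T X + I / delta); start with a large diagonal.
        self.p = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def update(self, x, y):
        k = len(x)
        px = [sum(self.p[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * px[i] for i in range(k))
        gain = [v / denom for v in px]
        err = y - sum(b * xi for b, xi in zip(self.beta, x))
        self.beta = [b + g * err for b, g in zip(self.beta, gain)]
        # P <- P - gain (x^T P); P is symmetric, so x^T P equals px transposed.
        self.p = [[self.p[i][j] - gain[i] * px[j] for j in range(k)]
                  for i in range(k)]
        return err

model = RecursiveLeastSquares(k=2)
for t in range(200):
    x = [math.sin(0.3 * t), math.cos(0.7 * t)]
    model.update(x, 2.0 * x[0] - 1.0 * x[1])
print([round(b, 4) for b in model.beta])
```

Each update touches only the k-by-k matrix `p`, so its cost is O(k^2) regardless of how many points have been seen.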
Complexity
• Model update
Space: $O(\lg N + m k^2) \approx O(\lg N)$
Time: $O(k^2) \approx O(1)$
Where
– N: number of points (so far)
– k: number of regression coefficients; fixed
– m: number of linear models; $O(\lg N)$
[see paper]
Setup
• First half used for model estimation
• Models applied forward to forecast entire second half
• AR, Seasonal AR (SAR): R
– Simplest possible estimation: no maximum likelihood estimation (MLE), etc.
• … vs. Python scripts
Results
Synthetic data – Triangle pulse
• Triangle pulse
• AR captures wrong trend (or none)
• Seasonal AR (SAR) estimation fails
Results
Synthetic data – Mix
• Mix (sine + square pulse)
• AR captures wrong trend (or none)
• Seasonal AR estimation fails
Results
Real data – Automobile
• Automobile traffic
– Daily periodicity with rush-hour peaks
– Bursty “noise” at smaller time scales
[Figure: automobile traffic series (filtered)]
• AR fails to capture any trend (average)
• Seasonal AR estimation fails
• AWSOM spots periodicities, automatically
• Generation with identified noise
Results
Real data – Sunspot
• Sunspot intensity
– Slightly time-varying “period”
• AR captures wrong trend (average)
• Seasonal ARIMA
– Captures immediate wrong downward trend
– Requires human to determine seasonal component period (fixed)
• Estimation: 40 minutes (R) vs. 9 seconds (Python)
Variance
• Variance (log-power) vs. scale:
– “Noise” diagnostic (if decreasing linear…)
– Can use to estimate noise parameters (~Hurst exponent)
[Figure annotation: ~ 1 hour]
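A rough sketch of the variance-vs-scale diagnostic: compute the mean power of the Haar detail coefficients at each level, take log2, and fit a line. The normalization, the level numbering (higher level = coarser scale), and the cutoff `min_coeffs` are assumptions of this example, so the sign and size of the slope may differ from the plot on the slide; the function names are illustrative.

```python
import math
import random

def detail_levels(x):
    """Orthonormal Haar detail coefficients per level (finest level first)."""
    approx, levels = list(x), []
    while len(approx) >= 2:
        pairs = list(zip(approx[0::2], approx[1::2]))
        levels.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return levels

def log_power_by_scale(x, min_coeffs=64):
    """(level, log2 mean squared detail) for levels with enough coefficients."""
    points = []
    for level, coeffs in enumerate(detail_levels(x), start=1):
        if len(coeffs) < min_coeffs:
            break
        power = sum(c * c for c in coeffs) / len(coeffs)
        points.append((level, math.log2(power)))
    return points

def slope(points):
    """Ordinary least-squares slope of y against x."""
    n = len(points)
    mx = sum(px for px, _ in points) / n
    my = sum(py for _, py in points) / n
    num = sum((px - mx) * (py - my) for px, py in points)
    return num / sum((px - mx) ** 2 for px, _ in points)

rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(4096)]
walk, total = [], 0.0
for step in white:
    total += step
    walk.append(total)

white_slope = slope(log_power_by_scale(white))
walk_slope = slope(log_power_by_scale(walk))
print(round(white_slope, 2), round(walk_slope, 2))
```

Under these conventions white noise gives a roughly flat line while a random walk gives a clearly positive slope; a near-linear relationship is what makes the slope usable as an estimate of a noise parameter.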
Running time
[Figure: time (t) vs. stream size (N)]
Space requirements
Equal total number of model parameters
Conclusion
No human:
• Adapt and handle arbitrary periodic components
• No human intervention/tuning
Limited resources:
• Single pass over the data
• Limited memory (logarithmic)
• Constant-time update
Outline
• Introduction - motivation
• Problem #1: Streams
• Problem #2: Graphs & Virus propagation
– Motivation & problem definition
– Related work
– Main idea
– Experiments
• Conclusions
Introduction
Internet Map [lumeta.com]
Food Web [Martinez ’91]
Protein Interactions [genomebiology.com]
Friendship Network [Moody ’01]
► Graphs are ubiquitous
Introduction
• What can we do with graph analysis?
– Immunization;
– Information dissemination
– network value of a customer [Domingos+]
[Figure: “Needle exchange” networks of drug users [Weeks et al. 2002], labeled “bridges”]
Problem definition
• Q1: How does a virus spread across an arbitrary network?
• Q2: Will it create an epidemic?
• (in a sensor setting, with a ‘gossip’ protocol, will a rumor/query spread?)
Framework
• Susceptible-Infected-Susceptible (SIS) model
– Cured nodes immediately become susceptible
[Figure: two states, “Susceptible/healthy” and “Infected & infectious”, with transitions “Infected by neighbor” and “Cured internally”]
The model
• (virus) Birth rate β: probability that an infected neighbor attacks
• (virus) Death rate δ: probability that an infected node heals
[Figure: node N with neighbors N1, N2, N3; legend: Infected, Healthy; edge labels: Prob. β, Prob. β, Prob. δ]
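The SIS dynamics with birth rate β and death rate δ can be sketched as a discrete-time simulation. The synchronous update, the rule that a node cured in a step is not re-infected within that same step, and the helper names (`star_graph`, `sis_step`, `simulate`) are assumptions of this example rather than details taken from the slides.

```python
import random

def star_graph(spokes):
    """Adjacency lists for one hub (node 0) connected to `spokes` leaves."""
    adj = {0: list(range(1, spokes + 1))}
    for leaf in range(1, spokes + 1):
        adj[leaf] = [0]
    return adj

def sis_step(adj, infected, beta, delta, rng):
    """One synchronous SIS time step; returns the new set of infected nodes."""
    nxt = set()
    for node, nbrs in adj.items():
        if node in infected:
            # An infected node heals with probability delta.
            if rng.random() >= delta:
                nxt.add(node)
        else:
            # Each infected neighbor attacks independently with probability beta.
            if any(n in infected and rng.random() < beta for n in nbrs):
                nxt.add(node)
    return nxt

def simulate(adj, seeds, beta, delta, steps, seed=0):
    """Return the number of infected nodes at time 0, 1, ..., steps."""
    rng = random.Random(seed)
    infected = set(seeds)
    counts = [len(infected)]
    for _ in range(steps):
        infected = sis_step(adj, infected, beta, delta, rng)
        counts.append(len(infected))
    return counts

adj = star_graph(99)
print(simulate(adj, seeds={0}, beta=0.05, delta=0.8, steps=20))
```

Plotting the returned counts over time for different β/δ ratios is one way to see whether an infection dies out or persists on a given graph.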
Epidemic threshold
Defined as the value of τ, such that
if β/δ < τ an epidemic cannot happen
Thus,
• given a graph
• compute its epidemic threshold
Epidemic threshold
What should τ depend on?
• avg. degree? and/or highest degree?
• and/or variance of degree?
• and/or determinant of the adjacency matrix?
Basic Homogeneous Model
Homogeneous graphs [Kephart-White ’91, ’93]
• Epidemic threshold = 1/<k>
• Homogeneous connectivity <k>, i.e., all nodes have ~same degree (unrealistic)
Power-law Networks
• Model for Barabási-Albert networks
– [Pastor-Satorras & Vespignani, ’01, ’02]
– Epidemic threshold = <k> / <k^2>
– holds only for BA-type networks, with γ = 3 (γ = power-law exponent)
Epidemic threshold
• Homogeneous graphs: 1/<k>
• BA (γ = 3): <k> / <k^2>
• more complicated graphs ?
• arbitrary, REAL graphs ?
• how many parameters??
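The two earlier threshold formulas depend only on the degree sequence, which a short sketch can make explicit. The function names are illustrative, and applying the BA formula to a star is done here purely to show the arithmetic, not because the model is meant for that topology.

```python
def homogeneous_threshold(degrees):
    """1 / <k>: threshold under the homogeneous-connectivity assumption."""
    mean_k = sum(degrees) / len(degrees)
    return 1.0 / mean_k

def power_law_threshold(degrees):
    """<k> / <k^2>: threshold formula quoted for BA-type networks."""
    mean_k = sum(degrees) / len(degrees)
    mean_k2 = sum(d * d for d in degrees) / len(degrees)
    return mean_k / mean_k2

regular = [4] * 10        # every node has degree 4
star = [99] + [1] * 99    # one hub, 99 spokes

print(homogeneous_threshold(regular), power_law_threshold(regular))
print(homogeneous_threshold(star), power_law_threshold(star))
```

On a regular graph the two formulas agree (both give 1/4 here); on the star they differ widely, which illustrates why a formula tied to one graph family is hard to trust on arbitrary graphs.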
Epidemic threshold
• [Theorem] We have no epidemic, if
β/δ < τ = 1/λ1,A
where
– β: attack prob.
– δ: recovery prob.
– τ: epidemic threshold
– λ1,A: largest eigenvalue of adj. matrix A
Proof: [Wang+03]
Epidemic threshold for various networks
• sanity checks / older results:
• Homogeneous networks
– λ1,A = <k>; τ = 1/<k>
– where <k> = average degree
– This is the same result as that of Kephart & White!
• Star networks
– λ1,A = sqrt(d); τ = 1/sqrt(d)
– where d = the degree of the central node
• Infinite, power-law networks
– λ1,A = ∞; τ = 0: *any* virus has a chance! [Barabasi et al]
• Finite power-law networks
– τ = 1/λ1,A
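The sanity checks can be reproduced numerically with a small power-iteration sketch for λ1,A. Iterating on A + I instead of A is a choice made for this example, so that bipartite graphs such as stars (whose spectrum is symmetric around zero) still converge; the function names and the fixed iteration count are likewise assumptions.

```python
import math

def largest_eigenvalue(adj, iterations=500):
    """Estimate lambda_1 of an undirected graph given as adjacency lists.

    Runs power iteration on A + I and subtracts the shift at the end.
    """
    nodes = list(adj)
    v = {n: 1.0 for n in nodes}
    lam = 0.0
    for _ in range(iterations):
        # w = (A + I) v
        w = {n: v[n] + sum(v[m] for m in adj[n]) for n in nodes}
        # Rayleigh quotient of A + I at the current vector v.
        lam = sum(v[n] * w[n] for n in nodes) / sum(x * x for x in v.values())
        norm = math.sqrt(sum(x * x for x in w.values()))
        v = {n: x / norm for n, x in w.items()}
    return lam - 1.0

def epidemic_threshold(adj):
    """tau = 1 / lambda_1, as in the theorem."""
    return 1.0 / largest_eigenvalue(adj)

# Star: one hub (node 0) with d = 99 spokes, so lambda_1 should be sqrt(99).
star = {0: list(range(1, 100))}
star.update({leaf: [0] for leaf in range(1, 100)})

# Complete graph on 5 nodes: every node has degree 4, so lambda_1 should be 4.
k5 = {i: [j for j in range(5) if j != i] for i in range(5)}

print(largest_eigenvalue(star), math.sqrt(99))
print(largest_eigenvalue(k5), epidemic_threshold(k5))
```

For large real graphs a sparse eigensolver would be the more practical choice; the dictionary-based iteration here is meant only to keep the example dependency-free.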
Experiments
• 2 graphs
– Star network: one “hub” + 99 “spokes”
– “Oregon” Internet AS graph:
• 10,900 nodes, 31,180 edges
• topology.eecs.umich.edu/data.html
• More in our paper: [SRDS ’03]
Experiments (Star)
[Figure: three cases: β/δ > τ (above threshold); β/δ = τ (at the threshold); β/δ < τ (below threshold)]
Experiments (Oregon)
[Figure: three cases: β/δ > τ (above threshold); β/δ = τ (at the threshold); β/δ < τ (below threshold)]
Our prediction vs. previous prediction
• our predictions are more accurate
[Figure: two panels, “Oregon” and “Star”; number of infected nodes vs. β/δ; markers labeled “PL3” and “Our”]
Conclusions
We found an epidemic threshold
√ that applies to any network topology
√ and it depends only on one parameter of the graph
Overall conclusions
• Automatic stream mining: AWSOM
• graphs and virus propagation: eigenvalue
Ongoing / related work
• Streams
– how to find hidden variables on multiple streams [w/ Spiros and Jimeng Sun]
– ‘network tomography’ [w/ Airoldi +]
• Graphs
– graph partitioning [w/ Deepay+]
– important subgraphs [w/ Tomkins + McCurley]
– graph generators [RMAT, w/ Deepay]
Thank you!
Contact info:
christos @ cs.cmu.edu
spapadim @ cs.cmu.edu
deepay @ cs.cmu.edu
Main References
• Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos: Adaptive, Hands-Off Stream Mining, VLDB 2003, Berlin, Germany, Sept. 2003.
• [Wang+03] Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos: Epidemic Spreading in Real Networks: an Eigenvalue Viewpoint, SRDS 2003, Florence, Italy.
Additional References
• Connection Subgraphs, C. Faloutsos, K. McCurley, A. Tomkins, SIAM-DM 2004 workshop on link analysis
• RMAT: A recursive graph generator, D. Chakrabarti, Y. Zhan, C. Faloutsos, SIAM-DM 2004
• iFilter: Network tomography using particle filters, Edoardo Airoldi, Christos Faloutsos (submitted)