Data Mining Meets Systems: Tools and Case Studies

Post on 15-Jan-2016

Page 1: Data Mining Meets Systems: Tools and Case Studies

CMU SCS

Data Mining Meets Systems: Tools and Case Studies

Christos Faloutsos

SCS CMU

Page 2: Data Mining Meets Systems: Tools and Case Studies

PDL 2008 C. Faloutsos #2

Thanks

Spiros Papadimitriou (CMU->IBM)

Mengzhi Wang (CMU->Google)

Jimeng Sun (CMU -> IBM)

Page 3: Data Mining Meets Systems: Tools and Case Studies

Outline

• Problem 1: workload characterization

• Problem 2: self-* monitoring

• Problem 3: BGP mining

• (Problem 4: sensor mining)

• (Problem 5: Large graphs & hadoop)

Tools: fractals, SVD, wavelets, tensors, PageRank

Page 4: Data Mining Meets Systems: Tools and Case Studies

Problem #1:

Goal: given a signal (e.g., #bytes over time)

Find: patterns, periodicities, and/or compress

[Figure: #bytes vs. time; bytes per 30’ (or packets per day; earthquakes per year)]

Page 5: Data Mining Meets Systems: Tools and Case Studies

Problem #1

• model bursty traffic

• generate realistic traces

• (Poisson does not work)

[Figure: # bytes vs. time, with a plot labeled Poisson]

Page 6: Data Mining Meets Systems: Tools and Case Studies

Motivation

• predict queue length distributions (e.g., to give probabilistic guarantees)

• “learn” traffic, for buffering, prefetching, ‘active disks’, web servers

Page 7: Data Mining Meets Systems: Tools and Case Studies

Q: any ‘pattern’?

[Figure: # bytes vs. time]

• Not Poisson

• spike; silence; more spikes; more silence…

• any rules?

Page 8: Data Mining Meets Systems: Tools and Case Studies

solution: self-similarity

[Figure: two plots of # bytes vs. time]

Page 9: Data Mining Meets Systems: Tools and Case Studies

But:

• Q1: How to generate realistic traces; extrapolate?

• Q2: How to estimate the model parameters?

Page 10: Data Mining Meets Systems: Tools and Case Studies

Approach

• Q1: How to generate a sequence that is
– bursty
– self-similar
– and has similar queue length distributions

Page 11: Data Mining Meets Systems: Tools and Case Studies

Approach

• A: ‘binomial multifractal’ [Wang+02]

• ~ 80-20 ‘law’:
– 80% of bytes/queries etc. on first half
– repeat recursively

• b: bias factor (e.g., 80%)
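
The recursive 80-20 split can be sketched in a few lines of Python. This is a minimal illustration, not the code from [Wang+02]: the function name `b_model`, its parameters, and the optional random choice of which half receives the larger share are assumptions made here. With no random generator supplied, the heavy share always lands on the first half, as the bullet above describes.

```python
import random


def b_model(total, levels, b=0.8, rng=None):
    """Spread `total` over 2**levels time slots.

    At every level each interval is cut in two and a fraction `b` of its
    volume goes to one half, the remaining 1 - b to the other half.
    """
    trace = [float(total)]
    for _ in range(levels):
        finer = []
        for volume in trace:
            heavy, light = volume * b, volume * (1.0 - b)
            # Assumption: if an rng is given, pick the heavy side at random;
            # otherwise the heavy share always goes to the first half.
            if rng is not None and rng.random() < 0.5:
                heavy, light = light, heavy
            finer.extend((heavy, light))
        trace = finer
    return trace


# 1024 slots carrying one million bytes in total, bias factor 80%.
trace = b_model(total=1_000_000, levels=10, b=0.8, rng=random.Random(0))
```

The total volume is preserved at every level, so `sum(trace)` stays equal to `total` up to rounding.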

Page 12: Data Mining Meets Systems: Tools and Case Studies

binary multifractals

[Figure: 20 / 80 split]

Page 14: Data Mining Meets Systems: Tools and Case Studies

Parameter estimation

• Q2: How to estimate the bias factor b?

Page 15: Data Mining Meets Systems: Tools and Case Studies

Parameter estimation

• Q2: How to estimate the bias factor b?

• A: MANY ways [Crovella+96]
– Hurst exponent
– variance plot
– even DFT amplitude spectrum! (‘periodogram’)
– More robust: ‘entropy plot’ [Wang+02]

Mengzhi Wang, Tara Madhyastha, Ngai Hang Chang, Spiros Papadimitriou and Christos Faloutsos, Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic, ICDE 2002

Page 16: Data Mining Meets Systems: Tools and Case Studies

Entropy plot

• Rationale:
– burstiness: inverse of uniformity
– entropy measures uniformity of a distribution
– find entropy at several granularities, to see whether/how our distribution is close to uniform.

Page 17: Data Mining Meets Systems: Tools and Case Studies

Entropy plot

• Entropy E(n) after n levels of splits

• n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$

[Figure: the interval split into two halves; p1, p2 = % of bytes in each half]

Page 18: Data Mining Meets Systems: Tools and Case Studies

Entropy plot

• Entropy E(n) after n levels of splits

• n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$

• n=2: $E(2) = -\sum_{i=1}^{4} p_{2,i} \log_2 p_{2,i}$

[Figure: the interval split into four quarters, with fractions p2,1, p2,2, p2,3, p2,4]
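
The entropies E(n) at successive granularities can be computed by starting from the finest level and merging neighbouring slots. The sketch below is illustrative only: the function name `entropy_plot` and the requirement that the trace length be a power of two are assumptions, not details taken from [Wang+02].

```python
import math


def entropy_plot(trace):
    """Return [E(0), E(1), ..., E(n)] for a trace with 2**n slots.

    E(k) is the base-2 entropy of the fractions of the total volume that
    fall into each of the 2**k equal-width intervals.
    """
    n = len(trace).bit_length() - 1
    if len(trace) != 2 ** n:
        raise ValueError("trace length must be a power of two")
    total = float(sum(trace))
    # Finest level first; merging adjacent pairs yields the coarser levels.
    level = [x / total for x in trace]
    entropies = []
    while True:
        # Empty intervals contribute nothing (0 * log 0 is taken as 0).
        entropies.append(-sum(p * math.log2(p) for p in level if p > 0))
        if len(level) == 1:
            break
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    entropies.reverse()
    return entropies


# A perfectly uniform trace gains one bit of entropy per level of splitting.
uniform = entropy_plot([1.0] * 8)
```

Plotting the returned list against its index gives the entropy plot; a trace whose volume sits in a single slot yields an entropy of zero at every level.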

Page 19: Data Mining Meets Systems: Tools and Case Studies

Real traffic

• Has linear entropy plot (-> self-similar)

[Figure: entropy E(n) vs. # of levels (n); slope 0.73]

Page 20: Data Mining Meets Systems: Tools and Case Studies

Observation - intuition:

intuition: slope = intrinsic dimensionality =~ ‘degrees of freedom’, or info-bits per coordinate-bit
– uniform dataset: slope = 1
– multi-point: slope = 0

[Figure: entropy E(n) vs. # of levels (n); slope 0.73]

Page 21: Data Mining Meets Systems: Tools and Case Studies

Some more entropy plots:

• Poisson vs real

Poisson: slope ~ 1 -> uniformly distributed

[Figure: entropy plots, with slopes 1 (Poisson) and 0.73 (real)]

Page 22: Data Mining Meets Systems: Tools and Case Studies

B-model

• b-model traffic gives perfectly linear plot

• Lemma: its slope is $-b \log_2 b - (1-b) \log_2 (1-b)$

• Fitting: do entropy plot; get slope; solve for b

[Figure: E(n) vs. n]
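
The fitting recipe (entropy plot, slope, solve for b) can be sketched as follows. The helper names `binary_entropy`, `fit_slope` and `solve_bias` are hypothetical; the use of a least-squares line and of bisection is an assumption about one reasonable way to carry out the steps, not a description of the authors' implementation. Bisection works because the lemma's expression falls monotonically from 1 to 0 as b goes from 0.5 to 1.

```python
import math


def binary_entropy(b):
    """Slope predicted by the lemma: -b log2 b - (1-b) log2 (1-b)."""
    if b <= 0.0 or b >= 1.0:
        return 0.0
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)


def fit_slope(entropies):
    """Least-squares slope of E(n) against n = 0, 1, 2, ... (needs >= 2 points)."""
    count = len(entropies)
    mean_x = (count - 1) / 2.0
    mean_y = sum(entropies) / count
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(entropies))
    den = sum((x - mean_x) ** 2 for x in range(count))
    return num / den


def solve_bias(slope, tol=1e-9):
    """Find b in [0.5, 1] whose predicted slope matches the measured one."""
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_entropy(mid) > slope:
            lo = mid  # predicted slope still too high, so b must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0


# A perfectly linear entropy plot generated with b = 0.8 recovers b.
entropies = [k * binary_entropy(0.8) for k in range(11)]
b_hat = solve_bias(fit_slope(entropies))
```

Only b >= 0.5 is searched because b and 1 - b give the same slope.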

Page 23: Data Mining Meets Systems: Tools and Case Studies

Experimental setup

• Disk traces (from HP [Wilkes 93])

• web traces from LBL: http://repository.cs.vt.edu/lbl-conn-7.tar.Z

Page 24: Data Mining Meets Systems: Tools and Case Studies

Model validation

• Linear entropy plots

Bias factors b: 0.6-0.8

smallest b / smoothest: nntp traffic

Page 25: Data Mining Meets Systems: Tools and Case Studies

Web traffic - results

• LBL, NCDF of queue lengths (log-log scales)

[Figure: Prob(> l) vs. queue length l]
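
A queue-length NCDF of this kind can be obtained from any trace by feeding it through a simulated queue. The sketch below is an illustration under stated assumptions, not the experimental setup of the talk: it assumes a single server that drains a fixed amount per time slot, an unbounded buffer, and the hypothetical helper names `queue_lengths` and `ncdf`.

```python
def queue_lengths(arrivals, service_per_slot):
    """Backlog after each slot: q <- max(0, q + arrivals - service)."""
    backlog, history = 0.0, []
    for arrived in arrivals:
        backlog = max(0.0, backlog + arrived - service_per_slot)
        history.append(backlog)
    return history


def ncdf(values, thresholds):
    """Fraction of slots whose queue length exceeds each threshold l."""
    count = len(values)
    return [sum(1 for v in values if v > l) / count for l in thresholds]


# Toy trace: a burst of 5 units, two silent slots, then 3 more units.
lengths = queue_lengths([5, 0, 0, 3], service_per_slot=2)
tail = ncdf(lengths, thresholds=[0, 1, 2])
```

Running the same two functions on a real trace and on a synthetic one, then plotting both results on log-log axes, gives the kind of comparison the slide refers to.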

Page 26: Data Mining Meets Systems: Tools and Case Studies

Conclusions

• Multifractals (80/20, ‘b-model’, Multiplicative Wavelet Model (MWM)) for analysis and synthesis of bursty traffic

Page 27: Data Mining Meets Systems: Tools and Case Studies

Books

• Fractals: Manfred Schroeder, Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W.H. Freeman and Company, 1991. (Probably the BEST book on fractals!)

Page 28: Data Mining Meets Systems: Tools and Case Studies

Outline

• Problem 1: workload characterization

• Problem 2: self-* monitoring

• Problem 3: BGP mining

• (Problem 4: sensor mining)

• (Problem 5: Large graphs & hadoop)

Page 29: Data Mining Meets Systems: Tools and Case Studies

Clusters/data center monitoring

• Monitor correlations of multiple measurements

• Automatically flag anomalous behavior

• InteMon: intelligent monitoring system

– warsteiner.db.cs.cmu.edu/demo/intemon.jsp

Page 30: Data Mining Meets Systems: Tools and Case Studies

Publication

Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, Christos Faloutsos. InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. ACM SIGOPS Operating Systems Review, 40(3):38-44. ACM Press, July 2006

Page 31: Data Mining Meets Systems: Tools and Case Studies

Under the hood: SVD

• Singular Value Decomposition

• Done incrementally

Spiros Papadimitriou, Jimeng Sun and Christos Faloutsos. Streaming Pattern Discovery in Multiple Time-Series. VLDB 2005, Trondheim, Norway.

Page 32: Data Mining Meets Systems: Tools and Case Studies

Singular Value Decomposition (SVD)

• SVD (~LSI ~ KL ~ PCA ~ spectral analysis...)

LSI: S. Dumais; M. Berry

KL: e.g., Duda+Hart

PCA: e.g., Jolliffe

Details: [Press+]

[Figure: u of CPU1 vs. u of CPU2, with points at t=1, t=2]
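
For two measurements, such as the two CPU axes in the figure, the first singular direction has a closed form, which makes the idea easy to try out. The sketch below is a batch computation written for illustration; it is not the incremental method of the VLDB 2005 paper, and the names `first_singular_direction` and `residuals` are hypothetical. Points that lie far from the dominant direction are candidates for anomalous behavior.

```python
import math


def first_singular_direction(rows):
    """rows: (u1, u2) pairs, one per time tick.

    Returns the first right singular vector v of the T x 2 data matrix and
    the matching singular value, via the top eigenvector of X^T X.
    """
    sxx = sum(x * x for x, _ in rows)
    syy = sum(y * y for _, y in rows)
    sxy = sum(x * y for x, y in rows)
    # Principal-axis angle of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    v = (math.cos(theta), math.sin(theta))
    energy = sxx * v[0] ** 2 + 2.0 * sxy * v[0] * v[1] + syy * v[1] ** 2
    return v, math.sqrt(max(0.0, energy))


def residuals(rows, v):
    """Distance of each point from the line through the origin along v."""
    return [abs(x * v[1] - y * v[0]) for x, y in rows]


# Two CPUs that move together, plus one tick where only CPU1 is busy.
ticks = [(0.2, 0.2), (0.5, 0.5), (0.9, 0.9), (0.8, 0.1)]
direction, sigma = first_singular_direction(ticks[:3])
errors = residuals(ticks, direction)
```

Here the direction is fitted on the first three ticks only, so the last tick stands out through its larger residual.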

Page 36: Data Mining Meets Systems: Tools and Case Studies

Outline

• Problem 1: workload characterization

• Problem 2: self-* monitoring

• Problem 3: BGP mining

• (Problem 4: sensor mining)

• (Problem 5: Large graphs & hadoop)

Page 37: Data Mining Meets Systems: Tools and Case Studies

BGP updates

With

• Aditya Prakash (CMU)

• Michalis Faloutsos (UC Riverside)

• Nicholas Valler (UC Riverside)

• Dave Andersen (CMU)

Page 38: Data Mining Meets Systems: Tools and Case Studies

Time Series: #Updates per 600s, Washington Router 09/2004-09/2006

Tool #0: Time plot

Page 39: Data Mining Meets Systems: Tools and Case Studies

Tool #0: Time plot

• Observation #1: Missing values

• Observation #2: Bursty

Page 40: Data Mining Meets Systems: Tools and Case Studies

Tool #1: Wavelets

Page 41: Data Mining Meets Systems: Tools and Case Studies

Wavelets - DWT

• Short window Fourier transform (SWFT)

• But: how short should the window be?

[Figure: time-frequency plane (freq vs. time) and signal (value vs. time)]

Page 42: Data Mining Meets Systems: Tools and Case Studies

Wavelets - DWT

• Answer: multiple window sizes! -> DWT

[Figure: freq vs. time tilings for the time domain, DFT, SWFT and DWT]

Page 43: Data Mining Meets Systems: Tools and Case Studies

Haar Wavelets

• subtract sum of left half from right half

• repeat recursively for quarters, eighths, ...
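
The two bullets above translate almost directly into code. The sketch below follows the slide's wording literally (right-half sum minus left-half sum, then the same on ever shorter blocks) and leaves the coefficients unscaled; textbook Haar transforms additionally normalize each level. The function name `haar_details` and the power-of-two length requirement are assumptions made for this illustration.

```python
def haar_details(signal):
    """Unnormalized Haar detail coefficients, coarsest level first.

    Level 0 compares the two halves of the whole signal, level 1 the two
    halves of each half, and so on down to neighbouring samples.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    levels = []
    width = n
    while width >= 2:
        half = width // 2
        row = []
        for start in range(0, n, width):
            left = sum(signal[start:start + half])
            right = sum(signal[start + half:start + width])
            row.append(right - left)
        levels.append(row)
        width = half
    return levels


# A single spike leaves a non-zero coefficient at every scale.
spike = haar_details([0, 0, 0, 8, 0, 0, 0, 0])
```

A constant signal gives zeros at all levels, while an isolated spike shows up at every level, from the coarsest to the finest.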

Page 44: Data Mining Meets Systems: Tools and Case Studies

‘Tornado Plot’ for Washington Router: Dark areas correspond to high energy

[Figure: tornado plot; vertical axis from low freq. to high freq., horizontal axis: time]

Page 45: Data Mining Meets Systems: Tools and Case Studies

Tornado Plot: Wavelet Transform for Washington Router 09/2004-09/2006, All coefficients and Detail levels 1-12

Observations:

1. Obvious Spikes (E1): tornados that “touch down”

2. Prolonged Spikes (E2 and E3): when coarser scales have high values but finer scales do not

3. Intermittent Waves (E4 and E5): High-energy entries at nearby scales correspond to local periodic motion

Page 46: Data Mining Meets Systems: Tools and Case Studies

E2: Prolonged Spike: sustained period of relatively high activity

Magnification of updates on 28th Aug. 2005

[Figure: # updates vs. time]

Page 47: Data Mining Meets Systems: Tools and Case Studies

Tool #2: logarithms

Page 48: Data Mining Meets Systems: Tools and Case Studies

Tool #2: logarithms

Prominent ‘clothesline’ at ~ 50 updates per 600 secs.

Culprit IP addresses:

192.211.42.0/24
216.109.38.0/24
207.157.115.0/24

All from Alabama (Supercomputing Center)!

Page 49: Data Mining Meets Systems: Tools and Case Studies

Outline

• Problem 1: workload characterization

• Problem 2: self-* monitoring

• Problem 3: BGP mining

• (Problem 4: sensor mining)

• (Problem 5: Large graphs & hadoop)

Tools: fractals, SVD, wavelets, tensors, PageRank

Page 50: Data Mining Meets Systems: Tools and Case Studies

Main point

Two-way street:

<- DM can use such infrastructures to find patterns

-> DM can help such systems/networks etc to become self-healing, self-adjusting, ‘self-*’

Hot topic in Data Mining: finding patterns in Tera- and Peta-bytes

Page 51: Data Mining Meets Systems: Tools and Case Studies

Additional resources

• Machine learning classes at SCS/MLD

• Tom Mitchell’s book on Machine Learning
– Classification
– Clustering/Anomaly detection
– Support vector machines
– Graphical models
– Bayesian networks
– <etc etc>

Page 52: Data Mining Meets Systems: Tools and Case Studies

www.cs.cmu.edu/~christos

For code, papers etc

WeH 7107 christos <at> cs