advanced data mining - datalab.snu.ac.krukang/courses/20f-adm/l1-intro.pdf · movies-coming-on.html...

U Kang 1 Advanced Data Mining Introduction U Kang Seoul National University


Post on 25-Sep-2020


Page 1: Advanced Data Mining - datalab.snu.ac.krukang/courses/20F-ADM/L1-intro.pdf · movies-coming-on.html POSTS Anomaly, Event, and Fraud Detection in Large Graph Datasets, L. Akoglu, C

U Kang 1

Advanced Data Mining

Introduction

U Kang, Seoul National University


In This Lecture

- Motivation
- Overview of Topics


Outline

- Motivation
- Overview of Topics
- Conclusion


Motivation

There are many kinds of “big data”:
- Graph
- Time series
- Text
- Image
- …


Main Questions

How can we find patterns and models from big data?

How can we do it in a scalable way?


What is this course about?

This course covers advanced theories, algorithms and systems for mining big data.

Topics
- Graph
- Spectral analysis
- Large-scale distributed systems (e.g., MapReduce)
- Singular Value Decomposition, tensors
- Time series, approximation, graph compression, community detection, anomaly detection


Outline

- Motivation
- Overview of Topics
  - Graph
  - Spectral Analysis
  - MapReduce
  - SVD, Tensor
  - Recommendation
  - Other tools
- Conclusion


Graph Mining

- What does the Internet look like?
- What does Facebook look like?
- What is ‘normal’/‘abnormal’?
- Which patterns/laws hold?

Large datasets reveal patterns and anomalies that may be invisible otherwise.

Figure source: MRFerocius, Social Network, 2011, https://stackoverflow.com/questions/4594962/social-network-directed-graph-library-for-net


Power Law

Are real graphs random? No! Real graphs follow power laws.


Node (closeness) centrality

[Figure: example graph with nodes labeled A, B, and C]

Q: If you have to pick one person to advertise, whom would you choose?

[Kang et al. SDM’10]
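One common way to make this question precise is closeness centrality: a node is central if its shortest-path distances to all other nodes are small. The following is a minimal sketch, assuming an unweighted, connected, undirected graph; the toy graph and the function name `closeness` are illustrative and are not taken from the slides.

```python
from collections import deque

def closeness(graph, source):
    """Closeness of `source`: (n - 1) / sum of shortest-path distances.

    Assumes an unweighted, connected, undirected graph given as
    {node: [neighbors]}.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search from the source
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return (len(graph) - 1) / sum(dist.values())

# Hypothetical toy graph: A is a hub, the other nodes sit further out.
graph = {
    "A": ["B", "C", "D"],
    "B": ["A"],
    "C": ["A", "E"],
    "D": ["A"],
    "E": ["C"],
}
scores = {node: closeness(graph, node) for node in graph}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Running one breadth-first search per node is exact but costs time proportional to nodes times edges, which is why approximation becomes attractive on very large graphs.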


Spectral Graph Analysis

Solve graph problems using the theory of linear algebra:

- Adjacency matrix
- Eigenvector
- Apply the solution

Example: random walks on the graph (e.g., protein interaction)

Figure source: Wikipedia, Schizophrenia PPI, 2016, https://en.wikipedia.org/wiki/Protein%E2%80%93protein_interaction


Triangle Counting [Kang et al. PAKDD’11]

Real social networks have a lot of triangles
- Friends of friends are friends

But triangles are expensive to compute (3-way join; several approximation algorithms)

Q: Can we do that quickly?
A: Yes!

$\#\text{triangles} = \frac{1}{6} \sum_i \lambda_i^3$

(and, because of skewness in eigenvalues, we only need the top few eigenvalues!)
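The identity behind this count is that the sum of the cubed eigenvalues of the adjacency matrix A equals trace(A^3), which counts closed walks of length 3, and every triangle contributes 6 such walks. The following is a minimal sketch in pure Python that checks trace(A^3) / 6 against a brute-force count; the toy graph and function names are assumptions for illustration, and no eigen-solver is used here.

```python
from itertools import combinations

def matmul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def triangles_by_trace(A):
    """trace(A^3) / 6, which equals (1/6) * sum_i lambda_i^3."""
    A3 = matmul(matmul(A, A), A)
    return sum(A3[i][i] for i in range(len(A))) // 6

def triangles_brute_force(A):
    """Check every node triple; this is the expensive 3-way join."""
    n = len(A)
    return sum(1 for i, j, k in combinations(range(n), 3)
               if A[i][j] and A[j][k] and A[i][k])

# Hypothetical toy graph: a 4-clique on nodes 0-3 plus a pendant node 4.
A = [
    [0, 1, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
count_trace = triangles_by_trace(A)
count_brute = triangles_brute_force(A)
print(count_trace, count_brute)
```

On a large graph one would not form A^3 explicitly; the point of the eigenvalue form is that a few of the largest-magnitude eigenvalues already give a good approximation of the sum.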


Motivation

Many kinds of big data
- Crawled documents
- Web request logs
- …

Many ‘large scale computations’
- Inverted index
- Graph operations
- Summaries of the number of pages crawled per host
- Most frequent queries in a given day
- …


Motivation

But developing the code is very complex:
- How to parallelize the computation?
- How to distribute the data?
- How to handle failures?


Motivation

Failures
- Assume a machine works for 3 years without failure
- What is the expected number of failed machines when operating 1 million machines?
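A back-of-the-envelope answer can be worked out as follows. This is only a sketch under an assumption the slide does not state: failures are spread evenly over time, so each machine fails on average once every 3 years. The per-day and per-hour units are also chosen here for illustration.

```python
# Assumption (not stated on the slide): failures are spread evenly over
# time, so a machine that lasts 3 years fails on average once per 3 years.
machines = 1_000_000
lifetime_days = 3 * 365

failures_per_day = machines / lifetime_days
failures_per_hour = failures_per_day / 24
print(f"{failures_per_day:.0f} failures/day, {failures_per_hour:.0f} failures/hour")
```

Under that assumption, failures are an everyday event at this scale, which is why a large-scale system has to handle them automatically.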


MapReduce Example: histogram of fruit names

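[Figure: data flow from HDFS through Map 0, Map 1, and Map 2, then Shuffle, then Reduce 0 and Reduce 1, and back to HDFS. The mappers emit (apple, 1), (apple, 1), (strawberry, 1), and (orange, 1); the reducers output (apple, 2), (orange, 1), and (strawberry, 1).]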

map( fruit ) {
    output(fruit, 1);
}

reduce( fruit, v[1..n] ) {
    sum = 0;
    for (i = 1; i <= n; i++)
        sum = sum + v[i];
    output(fruit, sum);
}
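The same histogram can be simulated on a single machine to make the three phases concrete. This is a minimal sketch, not Hadoop code: the function names and the input splits are hypothetical, with the splits chosen to reproduce the counts in the example.

```python
from collections import defaultdict

def map_fn(fruit):
    """Emit (fruit, 1) for each input record."""
    yield fruit, 1

def reduce_fn(fruit, values):
    """Sum all counts that were shuffled to the same key."""
    return fruit, sum(values)

def run_mapreduce(splits):
    # Map phase: each split would be handled by one mapper.
    intermediate = [pair for split in splits
                    for record in split
                    for pair in map_fn(record)]
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

# Hypothetical input splits, chosen to match the counts in the example.
splits = [["apple"], ["apple", "strawberry"], ["orange"]]
histogram = run_mapreduce(splits)
print(histogram)
```

In a real deployment the framework, not the user, takes care of distributing the splits, moving data during the shuffle, and re-running tasks after a failure; the user only writes the two functions.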


Singular Value Decomposition (SVD)

Essential tool for
- Concept discovery
- Dimensionality reduction
- Finding fixed points
- Solving linear systems
- …


SVD - Example

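A = U Λ V^T

[Figure: a document-term matrix A (rows: CS and MD documents; columns: the terms data, info, retrieval, brain, lung) written as the product U x Λ x V^T, with two concepts, CS and Medical]

- U: doc-concept similarity
- Λ: ‘strength’ of each concept (e.g., the ‘strength’ of the CS-concept)
- V^T: term-concept similarity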
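To see the three factors on actual numbers, the following sketch extracts the strongest concept from a small document-term matrix by power iteration on A^T A. The matrix values are hypothetical (the slide's numbers are not in the transcript), the function name is illustrative, and in practice a library routine such as `numpy.linalg.svd` would normally be used instead.

```python
import math

def top_singular_triple(A, iterations=100):
    """Power iteration on A^T A: returns (sigma, u, v) for the top concept."""
    rows, cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        # u = A v, then v = A^T u, renormalized each round.
        u = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(A[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

# Hypothetical document-term counts; columns: data, info, retrieval, brain, lung.
A = [
    [1, 1, 1, 0, 0],  # CS documents
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],  # MD documents
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
]
sigma, u, v = top_singular_triple(A)
print(round(sigma, 2), [round(x, 2) for x in v])
```

Here `sigma` plays the role of the ‘strength’ of the CS-concept, `u` holds the doc-concept similarities (nonzero only for the CS documents), and `v` holds the term-concept similarities (nonzero only for data, info, and retrieval).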


What is a Tensor?

N-D generalization of a matrix:

          data  mining  classif.  tree  ...
John       13     11      22       55   ...
Peter       5      4       6        7   ...
Mary      ...    ...     ...      ...   ...
Nick      ...    ...     ...      ...   ...
...       ...    ...     ...      ...   ...

One such author x keyword matrix per year (KDD’20, KDD’21, KDD’22, ...) stacks into a 3-D tensor.


Motivating Applications

Why are tensors useful?
- Multi-way semantic indexing
- Sensor data analysis


Multi-way Semantic Indexing

Data: author, keyword, year

[Figure: Authors x Keywords matrices, with groups labeled DB and DM]

Sun, Jimeng, Dacheng Tao, and Christos Faloutsos. "Beyond streams and graphs: dynamic tensor analysis." KDD. 2006.


Sensor Data Analysis

Data: location, type, time

1st factor (main trend): (a1) daily pattern, (b1) main pattern, (c1) main correlation

2nd factor (major abnormal trend): (a2) abnormal residual, (b2) three abnormal sensors, (c2) voltage anomaly

[Figure: tensor streams and core tensor]

Sun, Jimeng, Spiros Papadimitriou, and Philip S. Yu. "Window-based tensor analysis on high-dimensional and multi-aspect streams." ICDM. 2006.


Recommender System

Search vs. recommender system
- Search: a user actively looks for what the user wants (e.g., by entering a keyword in a search engine)
- Recommender system: the system automatically provides recommended items to users


Real World Applications

Amazon.com

35 percent of what consumers purchase on Amazon comes from recommendations

https://c1.staticflickr.com/5/4067/4551424756_3e176d6939_z.jpg


Real World Applications

Netflix

Personalization and recommendations save ≥ $1B per year

https://www.flickr.com/photos/wfryer/2661730729


Matrix Factorization for CF

Map each user and each item to a low-dimensional space

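[Figure: users and items placed in a two-dimensional latent space; one axis runs from serious to escapist, the other from geared toward males to geared toward females]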

Koren et al., Matrix Factorization Techniques for Recommender Systems, IEEE Computer, 2009
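The idea of mapping users and items into a shared low-dimensional space can be sketched with a small stochastic gradient descent loop. This is an illustration under stated assumptions rather than the method of the cited article: the ratings, the hyperparameters (`k`, `lr`, `reg`, `epochs`), and the function names are all hypothetical.

```python
import math
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02,
              epochs=2000, seed=0):
    """Fit user factors P and item factors Q by SGD on observed ratings."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating: inner product of user and item vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Hypothetical (user, item, rating) triples; four entries are unobserved.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
    (3, 1, 1), (3, 2, 4), (3, 3, 5),
]
P, Q = factorize(ratings, n_users=4, n_items=4)
rmse = math.sqrt(sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
                 / len(ratings))
print(f"training RMSE: {rmse:.3f}")
```

Once the factors are fitted, `predict(P, Q, u, i)` can be evaluated for any unobserved pair, which is what turns the factorization into a recommender.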


Tool 1: Time Series Analysis

Given: one or more sequences
    x_1, x_2, …, x_t, …
    (y_1, y_2, …, y_t, …)

Task
- Find similar sequences
- Forecast future values
- Classify sequences (e.g., fault or normal)


Matrix Profile

Repeated earthquakes

Yeh, Chin-Chia Michael, et al. "Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets." 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016. (https://www.cs.ucr.edu/~eamonn/matrix_profile_i.pptx)


Matrix Profile

Abnormal heartbeat detection from ECG

Yeh, Chin-Chia Michael, et al. "Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets." 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016. (https://www.cs.ucr.edu/~eamonn/matrix_profile_i.pptx)

The maximum value in the matrix profile indicates a discord.
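A matrix profile stores, for every subsequence of a time series, the distance to its nearest neighbor elsewhere in the series. The following is a naive quadratic-time sketch using z-normalized Euclidean distance. It is not the algorithm from the cited paper, and the signal, the window length `m`, and the size of the exclusion zone are assumptions chosen for illustration.

```python
import math

def znorm(window):
    """Z-normalize a window (zero mean, unit standard deviation)."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    return [(x - mean) / std if std > 0 else 0.0 for x in window]

def matrix_profile(series, m):
    """Naive O(n^2 * m) matrix profile with a trivial-match exclusion zone."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    exclusion = m // 2
    profile = []
    for i, a in enumerate(windows):
        best = math.inf
        for j, b in enumerate(windows):
            if abs(i - j) <= exclusion:
                continue  # skip overlapping, trivially similar windows
            dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            best = min(best, dist)
        profile.append(best)
    return profile

# Hypothetical signal: a sine wave with period 20 and a glitch at t = 100..104.
series = [math.sin(2 * math.pi * t / 20) for t in range(200)]
for t in range(100, 105):
    series[t] = 1.5
profile = matrix_profile(series, m=20)
discord = max(range(len(profile)), key=profile.__getitem__)
motif = min(range(len(profile)), key=profile.__getitem__)
print(discord, motif)
```

The window with the largest profile value (`discord`) overlaps the glitch, while windows with near-zero values (`motif`) are the repeated pattern, which matches the two uses shown on these slides: anomalies and repeated events.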


Tool 2 : Approximation

Flajolet-Martin sketch:
- Let’s say there are n unique numbers in a set S
- We save this information in memory
- Task: given a new number i, check whether i is in S

Question: how much memory do we need to answer such a question?

Answer: O(n) bytes, usually. But the Flajolet-Martin sketch uses only O(log(n)) bits to do it almost accurately.
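The Flajolet-Martin sketch is usually described as an estimator of how many distinct elements have been seen, using one small bitmap per hash function. The following is a minimal sketch of that use. The class name `FMSketch`, the number of bitmaps, and the choice of md5 as the hash are assumptions made for illustration.

```python
import hashlib

PHI = 0.77351  # correction constant from the Flajolet-Martin analysis

def _hash(item, seed):
    """64-bit hash of (seed, item); md5 is used only as a convenient mixer."""
    digest = hashlib.md5(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

class FMSketch:
    def __init__(self, num_bitmaps=64):
        self.bitmaps = [0] * num_bitmaps  # each bitmap needs O(log n) bits

    def add(self, item):
        for seed in range(len(self.bitmaps)):
            h = _hash(item, seed)
            # index of the least-significant 1 bit (63 if the hash is 0)
            r = (h & -h).bit_length() - 1 if h else 63
            self.bitmaps[seed] |= 1 << r

    def estimate(self):
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while (bitmap >> r) & 1:  # position of the lowest unset bit
                r += 1
            total += r
        return 2 ** (total / len(self.bitmaps)) / PHI

sketch = FMSketch()
for x in range(5000):
    sketch.add(x)
estimate_once = sketch.estimate()
for x in range(1000):  # re-adding seen items must not change the estimate
    sketch.add(x)
estimate_again = sketch.estimate()
print(round(estimate_once))
```

Because adding an item only sets bits, two sketches can be merged with a bitwise OR, which is what makes this kind of summary convenient for combining neighborhood sets in graph computations.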


Tool 2 : Approximation

Application : speed-up the graph computation

For 2 billion edges:
- standard closeness takes 30,000 years
- effective closeness takes ~1 day

1,000,000 times faster!


Tool 3 : Graph Compression

[Figure: the original graph vs. the result of SlashBurn]


Tool 4 : Community Detection

How to find good communities in a graph?

http://en.wikipedia.org/wiki/Community_structure


Tool 5 : Anomaly Detection

How to find outliers, or anomalies?


OddBall at work (Posts)

[Figure: plot over posts, with axes # citations and # cross-citations]

Dataset: 223K posts, 217K citations

Posts highlighted in the figure:
http://instapundit.com/archives/025235.php
http://www.sizemore.co.uk/2005/08/i-feel-some-movies-coming-on.html

Anomaly, Event, and Fraud Detection in Large Graph Datasets, L. Akoglu, C. Faloutsos, in ACM WSDM 2013 tutorial, Rome, Italy


Conclusion

Advanced theories, algorithms and systems for mining big data.

Topics
- Graph
- Spectral analysis
- Large-scale distributed systems (e.g., MapReduce)
- Singular Value Decomposition, tensors
- Time series, approximation, graph compression, community detection, anomaly detection


Questions?