advanced data mining - datalab.snu.ac.krukang/courses/20f-adm/l1-intro.pdf · movies-coming-on.html...

Post on 25-Sep-2020

U Kang 1

Advanced Data Mining

Introduction

U Kang
Seoul National University

U Kang 2

In This Lecture

Motivation
Overview of Topics

U Kang 3

Outline

Motivation
Overview of Topics
Conclusion

U Kang 4

Motivation

There are many kinds of “big data”
    Graph
    Time series
    Text
    Image
    …

U Kang 5

Main Questions

How can we find patterns and models from big data?

How can we do it in a scalable way?

U Kang 6

What is this course about?

This course covers advanced theories, algorithms, and systems for mining big data.

Topics
    Graph
    Spectral Analysis
    Large scale distributed system (e.g. MapReduce)
    Singular Value Decomposition, Tensor
    Time series, approximation, graph compression, community detection, anomaly detection

U Kang 7

Outline

Motivation
Overview of Topics
    Graph
    Spectral Analysis
    MapReduce
    SVD, Tensor
    Recommendation
    Other tools

Conclusion

U Kang 9

Graph Mining

What does the Internet look like?
What does Facebook look like?
What is ‘normal’/‘abnormal’?
Which patterns/laws hold?

Large datasets reveal patterns and anomalies that may be invisible otherwise

Figure credit: MRFerocius, Social Network, 2011, https://stackoverflow.com/questions/4594962/social-network-directed-graph-library-for-net

U Kang 11

Power Law

Are real graphs random? No!

U Kang 12

Node (closeness) centrality

[Figure: graph with candidate nodes A, B, and C]

Q: If you have to pick 1 person to advertise, who do you want to choose?

[Kang et al. SDM’10]
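The question on this slide can be made concrete with a small sketch. It uses the textbook closeness definition (number of other reachable nodes divided by the sum of shortest-path distances), which is not necessarily the exact variant used in the cited paper. The five-person path graph and its node names are hypothetical and are not the A/B/C graph from the slide.

```python
from collections import deque

def closeness(graph, source):
    """Textbook closeness: (# other reachable nodes) / (sum of BFS distances)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:          # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0

# Hypothetical undirected path graph: u1 - u2 - u3 - u4 - u5
graph = {
    "u1": ["u2"],
    "u2": ["u1", "u3"],
    "u3": ["u2", "u4"],
    "u4": ["u3", "u5"],
    "u5": ["u4"],
}

scores = {node: closeness(graph, node) for node in graph}
best = max(scores, key=scores.get)   # the node closest, on average, to everyone else
print(best, scores)
```

In this toy graph the middle node has the highest closeness, which matches the intuition of advertising through the person who can reach everyone in the fewest hops. Running one BFS per node is what makes exact closeness expensive on very large graphs.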

U Kang 13

Outline

Motivation
Overview of Topics
    Graph
    Spectral Analysis
    MapReduce
    SVD, Tensor
    Recommendation
    Other tools

Conclusion

U Kang 14

Spectral Graph Analysis

Solve graph problems using the theory of linear algebra
    Adjacency matrix
    Eigenvector
    Apply the solution

Random walks on the graph (e.g. protein interaction)

Figure credit: Wikipedia, Schizophrenia PPI, 2016, https://en.wikipedia.org/wiki/Protein%E2%80%93protein_interaction
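As a rough illustration of the adjacency matrix, eigenvector, and random walk steps, the sketch below runs a random walk on a tiny graph by power iteration and checks that it settles on an eigenvector of the transition matrix. The 4-node graph is hypothetical, and this is only one simple instance of spectral analysis, not the course's specific method.

```python
import numpy as np

# Hypothetical undirected graph: a triangle 0-1-2 plus node 3 attached to node 2.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

degrees = A.sum(axis=1)
P = A / degrees[:, None]            # row-stochastic transition matrix of the walk

# Power iteration: start uniform and repeatedly take one step of the walk.
pi = np.full(A.shape[0], 1.0 / A.shape[0])
for _ in range(1000):
    pi = pi @ P

# For a connected, non-bipartite undirected graph, the walk converges to
# the distribution proportional to node degree.
expected = degrees / degrees.sum()
print(pi, expected)
```

The limit `pi` satisfies `pi @ P == pi`, so it is a left eigenvector of `P` with eigenvalue 1. The graph problem (where does a random walker spend its time?) has become an eigenvector problem.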

U Kang 15

Triangle Counting [Kang et al. PAKDD’11]

Real social networks have a lot of triangles
    Friends of friends are friends

But triangles are expensive to compute (3-way join; several approx. algos)

Q: Can we do that quickly? A: Yes!

$\#\text{triangles} = \frac{1}{6} \sum_i \lambda_i^3$

(and, because of skewness in eigenvalues, we only need the top few eigenvalues!)
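The eigenvalue formula can be checked on a small example. In the sketch below, the λ_i are taken to be the eigenvalues of the symmetric adjacency matrix of an undirected graph. The 5-node graph is hypothetical, and the choice of k = 2 top eigenvalues is an arbitrary illustration rather than a recommendation from the slides.

```python
import numpy as np

# Hypothetical undirected graph: nodes 0-3 form a 4-clique (4 triangles),
# and node 4 hangs off node 3 (adds no triangle).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
n = 5
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

# Exact count via closed walks of length 3: each triangle is counted 6 times.
exact = int(round(np.trace(A @ A @ A) / 6))

# The same count from all eigenvalues of the adjacency matrix.
eigenvalues = np.linalg.eigvalsh(A)
spectral = float(np.sum(eigenvalues ** 3) / 6)

# Approximation that keeps only the k eigenvalues of largest magnitude.
k = 2
top = eigenvalues[np.argsort(-np.abs(eigenvalues))[:k]]
approx = float(np.sum(top ** 3) / 6)

print(exact, spectral, approx)
```

With all eigenvalues the two counts agree up to floating-point error. The truncated sum is only an approximation, and how good it is depends on how skewed the eigenvalues of the graph are.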

U Kang 16

Outline

Motivation
Overview of Topics
    Graph
    Spectral Analysis
    MapReduce
    SVD, Tensor
    Recommendation
    Other tools

Conclusion

U Kang 17

Motivation

Many kinds of big data
    Crawled documents
    Web request logs
    …

Many ‘large scale computations’
    Inverted index
    Graph operations
    Summaries of the number of pages crawled per host
    Most frequent queries in a given day
    …

U Kang 18

Motivation

But developing the code is very complex:
    How to parallelize the computation?
    How to distribute the data?
    How to handle failures?

U Kang 19

Motivation

Failures
    Assume a machine works for 3 years without failure
    What is the expected number of failed machines when operating 1 million machines?
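One way to work through this question is the back-of-the-envelope calculation below. It assumes the question asks for failures per day, and that "works for 3 years" means each machine fails on average once every 3 years, independently of the others. Both are assumptions, since the slide does not state the time window.

```python
machines = 1_000_000
days_between_failures = 3 * 365      # assumption: one failure per machine every 3 years

# Under the assumption above, each machine fails on a given day with this probability.
failure_prob_per_day = 1 / days_between_failures

# By linearity of expectation, expected failures per day is machines * probability.
expected_failures_per_day = machines * failure_prob_per_day
print(f"{expected_failures_per_day:.0f} expected failures per day")
```

Under these assumptions roughly nine hundred machines fail every day, so failure handling has to be part of the system rather than an exceptional case.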

U Kang 20

MapReduce Example: histogram of fruit names

[Figure: data flow HDFS → Map 0, Map 1, Map 2 → Shuffle → Reduce 0, Reduce 1 → HDFS]
Map outputs: (apple, 1), (apple, 1), (strawberry, 1), (orange, 1)
Reduce outputs: (apple, 2), (orange, 1), (strawberry, 1)

map( fruit ) {
    output(fruit, 1);
}

reduce( fruit, v[1..n] ) {
    sum = 0;
    for (i = 1; i <= n; i++)
        sum = sum + v[i];
    output(fruit, sum);
}
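The same histogram can be simulated on a single machine to see the three phases in order. This is a minimal in-memory sketch, not Hadoop code: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and a real system would run the map and reduce calls on different machines with the shuffle moving data between them.

```python
from collections import defaultdict

def map_fn(fruit):
    # Emit one (key, value) pair per input record.
    yield (fruit, 1)

def reduce_fn(fruit, values):
    # Combine all values that share the same key.
    return (fruit, sum(values))

def run_mapreduce(records):
    # Map phase: apply map_fn to every record.
    intermediate = [pair for record in records for pair in map_fn(record)]

    # Shuffle phase: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: apply reduce_fn once per key.
    return dict(reduce_fn(key, values) for key, values in groups.items())

histogram = run_mapreduce(["apple", "apple", "strawberry", "orange"])
print(histogram)
```

The programmer writes only `map_fn` and `reduce_fn`. Parallelization, data distribution, and failure handling are what the framework takes over.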

U Kang 21

Outline

Motivation
Overview of Topics
    Graph
    Spectral Analysis
    MapReduce
    SVD, Tensor
    Time Series
    Other tools

Conclusion

U Kang 22

Singular Value Decomposition (SVD)

Essential tool for
    Concept discovery
    Dimensionality reduction
    Finding fixed points
    Solving linear systems
    …

U Kang 27

SVD - Example

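A = U Λ V^T

[Figure: a document-term matrix A (document groups: CS, MD; terms: data, info, retrieval, brain, lung) written as the product U x Λ x V^T, with two concepts: CS and Medical]

    U: doc-concept similarity
    Λ: ‘strength’ of each concept (e.g., the CS-concept)
    V^T: term-concept similarity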
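The decomposition can be reproduced with NumPy on a small matrix. The counts below are hypothetical (the numbers in the slide's figure are not recoverable from the transcript). They are chosen so that the first three documents use only CS terms and the last two only medical terms.

```python
import numpy as np

terms = ["data", "info", "retrieval", "brain", "lung"]

# Hypothetical document-term counts: rows 0-2 are CS documents, rows 3-4 are MD documents.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 1, 1],
], dtype=float)

# A = U diag(s) Vt; singular values come back in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k strongest concepts (dimensionality reduction).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("concept strengths:", s[:k])
print("term-concept rows:", np.round(Vt[:k, :], 3))
print("doc-concept columns:", np.round(U[:, :k], 3))
```

Because this toy matrix has exactly two concepts, two singular values carry everything and `A_k` reproduces `A`. The first row of `Vt` is nonzero only on the CS terms and the second only on the medical terms. The signs of singular vectors are arbitrary, so only their magnitudes should be interpreted.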

U Kang 29

What is a Tensor?

N-D generalization of a matrix:

[Figure: a stack of author-by-keyword count matrices, one slice per year: KDD’20, KDD’21, KDD’22]

         data  mining  classif.  tree  ...
John     13    11      22        55    ...
Peter    5     4       6         7     ...
Mary     ...   ...     ...       ...   ...
Nick     ...   ...     ...       ...   ...
...      ...   ...     ...       ...   ...

U Kang 30

Motivating Applications

Why are tensors useful?
    Multi-way semantic indexing
    Sensor data analysis

U Kang 31

Multi-way Semantic Indexing

Data: author, keyword, year

[Figure: author-by-keyword data with groups labeled DB and DM; axes: Authors, Keywords]

Sun, Jimeng, Dacheng Tao, and Christos Faloutsos. "Beyond streams and graphs: dynamic tensor analysis." KDD. 2006.

U Kang 32

Sensor Data Analysis

Data: location, type, time

1st factor (Main trend)
    (a1) daily pattern
    (b1) main pattern
    (c1) Main correlation

2nd factor (Major abnormal trend)
    (a2) abnormal residual
    (b2) three abnormal sensors
    (c2) Voltage anomaly

[Figure labels: Tensor Streams, Core Tensor]

Sun, Jimeng, Spiros Papadimitriou, and Philip S. Yu. "Window-based tensor analysis on high-dimensional and multi-aspect streams." ICDM. 2006.

U Kang 33

Outline

Motivation
Overview of Topics
    Graph
    Spectral Analysis
    MapReduce
    SVD, Tensor
    Recommendation
    Other tools

Conclusion

U Kang 34

Recommender System

Search vs. recommender system
    Search: a user actively looks for what the user wants (e.g., by entering a keyword in a search engine)
    Recommender system: the system automatically provides recommended items to users

U Kang 35

Real World Applications

Amazon.com

35 percent of what consumers purchase on Amazon comes from recommendations

https://c1.staticflickr.com/5/4067/4551424756_3e176d6939_z.jpg

U Kang 36

Real World Applications

Netflix

Personalization and recommendations save ≥ $1B per year

https://www.flickr.com/photos/wfryer/2661730729

U Kang 37

Matrix Factorization for CF

Map each user and each item to a low-dimensional space

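[Figure: users and movies placed in a two-dimensional latent space; one axis runs from Serious to Escapist, the other from Geared toward males to Geared toward females]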

Koren et al., Matrix Factorization Techniques for Recommender Systems, IEEE Computer, 2009
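A minimal sketch of the idea: each user u and each item i gets a k-dimensional vector, and a rating is predicted by their dot product. The vectors are fitted here with plain stochastic gradient descent on the observed ratings. The ratings, the dimension k = 2, the learning rate, and the regularization weight are all hypothetical choices for illustration, not values from the slides or from Koren et al.

```python
import numpy as np

# Hypothetical observed ratings as (user, item, rating) on a 1-5 scale.
ratings = [
    (0, 0, 5.0), (0, 1, 4.0), (0, 2, 1.0),
    (1, 0, 4.0), (1, 1, 5.0), (1, 3, 1.0),
    (2, 2, 5.0), (2, 3, 4.0), (2, 0, 1.0),
    (3, 2, 4.0), (3, 3, 5.0), (3, 1, 1.0),
]
n_users, n_items, k = 4, 4, 2

rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((n_users, k))   # user vectors
Q = 0.1 * rng.standard_normal((n_items, k))   # item vectors

def rmse():
    """Root mean squared error over the observed ratings."""
    errors = [r - P[u] @ Q[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

initial_rmse = rmse()
lr, reg = 0.05, 0.01                          # assumed step size and L2 weight
for epoch in range(500):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]
        p_u = P[u].copy()                     # keep the old user vector for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_u - reg * Q[i])
final_rmse = rmse()

# Predict a rating that was never observed: user 0 on item 3.
prediction = float(P[0] @ Q[3])
print(initial_rmse, final_rmse, prediction)
```

After training, users with similar tastes end up with nearby vectors in the low-dimensional space, and a missing rating is filled in by the dot product of the corresponding user and item vectors.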

U Kang 38

Outline

Motivation
Overview of Topics
    Graph
    Spectral Analysis
    MapReduce
    SVD, Tensor
    Time Series
    Other tools

Conclusion

U Kang 39

Tool 1: Time Series Analysis

Given: one or more sequences x_1, x_2, …, x_t, … (y_1, y_2, …, y_t, …)

Task
    Find similar sequences
    Forecast future values
    Classify sequences (e.g., fault or normal)

U Kang 40

Matrix Profile

Repeated earthquakes

Yeh, Chin-Chia Michael, et al. "Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets." 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016. (https://www.cs.ucr.edu/~eamonn/matrix_profile_i.pptx)

U Kang 41

Matrix Profile

Abnormal heartbeat detection from ECG

The maximum value in the matrix profile indicates a discord

Yeh, Chin-Chia Michael, et al. "Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets." 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016. (https://www.cs.ucr.edu/~eamonn/matrix_profile_i.pptx)
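To make the discord idea concrete, the sketch below computes a matrix profile by brute force: for every length-m subsequence it records the z-normalized Euclidean distance to its nearest non-overlapping neighbor. This naive O(n^2 m) loop is only an illustration and is not the fast algorithm from the cited paper. The synthetic sine series, the window length m = 20, and the exclusion zone of m/2 are assumptions.

```python
import numpy as np

def znorm(x):
    """Z-normalize a subsequence (zero mean, unit variance)."""
    std = x.std()
    return (x - x.mean()) / std if std > 0 else np.zeros_like(x)

def matrix_profile(series, m):
    """Naive matrix profile: distance from each subsequence to its nearest neighbor."""
    n = len(series) - m + 1
    subs = np.array([znorm(series[i:i + m]) for i in range(n)])
    profile = np.full(n, np.inf)
    exclusion = m // 2                     # ignore trivial matches with near-identical offsets
    for i in range(n):
        for j in range(n):
            if abs(i - j) > exclusion:
                d = np.linalg.norm(subs[i] - subs[j])
                profile[i] = min(profile[i], d)
    return profile

# Hypothetical series: a clean sine wave with half of one cycle flattened to zero.
t = np.arange(200)
series = np.sin(2 * np.pi * t / 20)
series[100:110] = 0.0

profile = matrix_profile(series, m=20)
discord = int(np.argmax(profile))          # start index of the most unusual subsequence
print(discord, profile[discord])
```

Every normal cycle has a near-identical copy one period away, so its profile value is close to zero. Only the subsequences overlapping the flattened region have no good match, which is why the maximum of the profile points at the anomaly.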

U Kang 43

Tool 2 : Approximation

Flajolet-Martin sketch:
    Let’s say there are n unique numbers in a set S
    We save this information in memory
    Task: given a new number i, check whether i is in S

Question: how much memory do we need to answer such a question?

Answer: O(n) bytes, usually. But the Flajolet-Martin sketch uses only O(log(n)) bits to estimate, almost accurately, how many unique numbers have been seen
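The sketch below shows one common way to implement a Flajolet-Martin style estimator. It never stores the elements themselves. Each element is hashed, the position of the lowest 1 bit of the hash is recorded in a small bitmap, and the number of distinct elements is estimated from the lowest bit position that was never set. Averaging over 64 hash functions to reduce variance, the use of `hashlib.blake2b` as the hash, and the class name `FMSketch` are assumptions made for this example.

```python
import hashlib

PHI = 0.77351   # correction constant from Flajolet and Martin's analysis

def _hash(value, seed):
    """Deterministic 64-bit hash; the seed selects one of several hash functions."""
    data = f"{seed}:{value}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def _lowest_set_bit(x):
    """Index of the least-significant 1 bit of a positive integer."""
    return (x & -x).bit_length() - 1

class FMSketch:
    def __init__(self, num_hashes=64, num_bits=32):
        self.num_hashes = num_hashes
        self.num_bits = num_bits
        self.bitmaps = [0] * num_hashes    # one small bitmap per hash function

    def add(self, value):
        for seed in range(self.num_hashes):
            h = _hash(value, seed)
            r = _lowest_set_bit(h) if h else self.num_bits - 1
            self.bitmaps[seed] |= 1 << min(r, self.num_bits - 1)

    def estimate(self):
        # R = index of the lowest bit that is still 0; E[R] is about log2(PHI * n).
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):
                r += 1
            total += r
        return 2 ** (total / self.num_hashes) / PHI

sketch = FMSketch()
stream = list(range(2000)) + list(range(500))   # 2000 distinct values, 500 repeats
for x in stream:
    sketch.add(x)
estimate = sketch.estimate()
print(round(estimate))
```

Each bitmap needs only about log2(n) bits, and adding a value that was already seen leaves the sketch unchanged, so repeats cost no extra memory. The price is that the answer is an estimate rather than an exact count.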

U Kang 44

Tool 2 : Approximation

Application: speeding up graph computation

For 2 billion edges:
    - standard closeness takes 30,000 years
    - effective closeness takes ~1 day!
    1,000,000 times faster!

U Kang 45

Tool 3 : Graph Compression

[Figure panels: Original vs. SlashBurn]

U Kang 46

Tool 4 : Community Detection

How to find good communities in a graph?

http://en.wikipedia.org/wiki/Community_structure

U Kang 47

Tool 5 : Anomaly Detection

How to find outliers, or anomalies?

U Kang 48

OddBall at work (Posts)

[Figure: scatter plot of posts; x-axis: # citations, y-axis: # cross-citations; 223K posts, 217K citations]

http://instapundit.com/archives/025235.php
http://www.sizemore.co.uk/2005/08/i-feel-some-movies-coming-on.html

Anomaly, Event, and Fraud Detection in Large Graph Datasets, L. Akoglu, C. Faloutsos, in ACM WSDM 2013 tutorial, Rome, Italy

U Kang 49

Outline

Motivation
Overview of Topics
Conclusion

U Kang 50

Conclusion

Advanced theories, algorithms, and systems for mining big data.

Topics
    Graph
    Spectral Analysis
    Large scale distributed system (e.g. MapReduce)
    Singular Value Decomposition, Tensor
    Time series, approximation, graph compression, community detection, anomaly detection

U Kang 51

Questions?
