advanced data mining - datalab.snu.ac.krukang/courses/20f-adm/l2-graphbasic.pdf · protein...

U Kang 1 Advanced Data Mining Graph Basics and Diameter U Kang Seoul National University


Post on 25-Sep-2020

U Kang 1

Advanced Data Mining

Graph Basics and Diameter

U Kang
Seoul National University


In This Lecture

Basic definitions in graph mining
Small World Phenomenon
Diameter over time


Outline

Basic Definition
Small World Phenomenon
Diameter over Time
Conclusion


Basic Definition

Graph: a way of specifying relationships among a collection of items.
  Nodes (or vertices): the items
  Edges: the relationships between pairs of items


Basic Definition

Types of graph: directed vs. undirected graph

[Figure: a directed graph and an undirected graph]


Basic Definition

Types of graph: weighted vs. unweighted graph

[Figure: a weighted graph with edge weights 1.2, 0.1, and 2.5, and an unweighted graph]


Basic Definition

Types of graph: simple vs. attributed graph

[Figure: a simple graph and an attributed graph whose nodes carry labels: CEO, Research Manager, Assistant, Marketing Manager]


Examples of graph

ARPANET, December 1970 1)
Social network (e.g., Facebook, Twitter) 2)

1) UCLA and BBN, ARPANET in December 1970, 1970, https://commons.wikimedia.org/wiki/File:ARPANET_1970_Map.png
2) MRFerocius, Social Network, 2011, https://stackoverflow.com/questions/4594962/social-network-directed-graph-library-for-net


More Examples

Protein interactions 1)
World Wide Web document network 2)
Patent
DBLP

1) Wikipedia, Schizophrenia PPI, 2016, https://en.wikipedia.org/wiki/Protein%E2%80%93protein_interaction
2) Wikipedia, Logo of the English Wikipedia, 2001, https://en.wikipedia.org/wiki/English_Wikipedia


Even more examples

Call graph (who calls whom)
Email graph (who emailed whom)
Movie-actor database from IMDB
(more examples?)


More definitions

Two vertices are adjacent if they share a common edge
Two adjacent vertices are neighbors
An edge is incident with another edge if they share a vertex
An edge is incident with its two endpoint vertices
The degree of a node is the number of its neighbors
  Directed graph: in-degree (number of incoming edges), out-degree (number of outgoing edges)
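As a minimal sketch of the degree definitions, the snippet below counts neighbors from an edge list. The function names and the toy edge list are hypothetical and chosen only for illustration; the graph is assumed to be simple (no self-loops or repeated edges).

```python
from collections import defaultdict


def degrees(edges):
    """Degree of each node when the (u, v) pairs are read as undirected edges."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {node: len(adj) for node, adj in neighbors.items()}


def in_out_degrees(edges):
    """(in-degree, out-degree) of each node when the pairs are read as u -> v."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(in_deg) | set(out_deg)
    return {n: (in_deg[n], out_deg[n]) for n in nodes}


# Hypothetical toy graph, not taken from the slides.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
undirected = degrees(toy_edges)
directed = in_out_degrees(toy_edges)
```

Reading the same edge list both ways shows the difference: node C has degree 3 as an undirected node, but in-degree 2 and out-degree 1 when the edges are directed.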


Connection

Connected graph

Disconnected graph


Connected Component

[Figure: an example graph with labeled nodes, split into a giant connected component and two smaller disconnected components]
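One way to find connected components is to start a breadth-first search from every node that has not been visited yet. The sketch below assumes an undirected graph given as a node list and an edge list; the function name and the toy graph are hypothetical and are not the graph from the slide's figure.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Return the connected components as sets of nodes, largest first."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical toy graph: one larger component, one pair, one isolated node.
toy_nodes = ["A", "B", "C", "D", "E", "F", "G"]
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
components = connected_components(toy_nodes, toy_edges)
```

The first entry of `components` plays the role of the giant connected component; the remaining entries are the smaller disconnected components.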


Special Families of Graph

Star graph
  Equation for |E| (=m) as a function of |V| (=n)?
Chain graph
  Equation for |E| (=m) as a function of |V| (=n)?
Complete graph = full graph = clique graph
  Equation for |E| (=m) as a function of |V| (=n)?
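For reference, the standard edge counts for these three families can be worked out as follows, assuming undirected simple graphs on n nodes (no self-loops, no repeated edges). The slide itself leaves these as questions, so this is a worked answer rather than a quotation from the lecture.

```latex
\begin{align*}
\text{Star graph (one center, } n-1 \text{ leaves):} \quad & m = n - 1 \\
\text{Chain graph (} n \text{ nodes in a line):} \quad & m = n - 1 \\
\text{Complete graph (every pair connected):} \quad & m = \binom{n}{2} = \frac{n(n-1)}{2}
\end{align*}
```

The star and the chain are both trees, so m grows linearly with n, while the complete graph grows quadratically.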


Special Families of Graph

Bipartite graph

Complete bipartite graph


Path, Cycle, Walk

Path: a sequence of vertices in which each consecutive pair is connected by an edge
Cycle: a path whose start and end vertices are the same
Simple path: no repeated vertices
Simple cycle: no repeated vertices, except the start vertex (= end vertex)

Some authors use ‘path’ for simple path (no repetition), and ‘walk’ for path (repetition allowed)


Subgraph

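Subgraph: a graph whose nodes and edges are subsets of the nodes and edges of the original graph

Induced subgraph: the subgraph induced by a set of vertices, i.e., those vertices together with every original edge whose two endpoints are both in the set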
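A short sketch of the induced-subgraph definition: keep exactly the edges whose two endpoints both lie in the chosen vertex set. The function name and the toy edge list are hypothetical.

```python
def induced_subgraph(edges, vertex_subset):
    """Edges of the subgraph induced by `vertex_subset`."""
    keep = set(vertex_subset)
    # An edge survives only if both of its endpoints were selected.
    return [(u, v) for u, v in edges if u in keep and v in keep]


# Hypothetical toy graph, not taken from the slides.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]
sub_edges = induced_subgraph(toy_edges, {"A", "B", "C"})
```

Dropping any of the three surviving edges would still give a subgraph on A, B, C, but it would no longer be the induced one.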


Outline

Basic Definition
Small World Phenomenon
Diameter over Time
Conclusion


Distance in graph

Length of a path: the number of steps (edges) from beginning to end

[Figure: a social network centered on “You”, built up one distance level at a time]

Distance 1: “friends”
Distance 2: “friends of friends”
Distance 3: “friends of friends of friends”


Algorithm: Breadth First Search

Find one-step neighbors.
Find two-step neighbors.
Continue…
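The level-by-level expansion can be sketched with a queue: each node is assigned its distance the first time it is reached. This is a minimal sketch for an unweighted, undirected graph; the function name and the toy friendship graph are hypothetical.

```python
from collections import defaultdict, deque


def bfs_distances(edges, source):
    """Shortest-path distance (in edges) from `source` to every reachable node."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:  # first visit gives the shortest distance
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


# Hypothetical friendship graph.
toy_edges = [("You", "Ann"), ("You", "Bob"), ("Ann", "Cat"), ("Cat", "Dan")]
dist = bfs_distances(toy_edges, "You")
```

Here Ann and Bob are at distance 1 (friends), Cat at distance 2 (friend of a friend), and Dan at distance 3.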


Radius and Diameter

Radius of a node v: the largest shortest-path distance from v to any other node

Diameter of a graph: the maximum radius over all nodes
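Using the definitions on this slide, both quantities can be computed by running one breadth-first search per node. The sketch below assumes a connected, unweighted, undirected graph; the helper names and the chain example are hypothetical. One search per node is fine for small graphs but too slow for very large ones.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Adjacency sets for an undirected graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def node_radius(adj, source):
    """Largest shortest-path distance from `source` to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())


def graph_diameter(adj):
    """Maximum node radius over all nodes (assumes a connected graph)."""
    return max(node_radius(adj, v) for v in adj)


# Hypothetical chain graph A - B - C - D.
chain = build_adjacency([("A", "B"), ("B", "C"), ("C", "D")])
```

On the chain, an end node such as A has radius 3, a middle node such as B has radius 2, and the diameter is 3.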


Small World Phenomenon

Milgram’s experiment

Six degrees of separation

The median value is “6”

[Figure: histogram of chains; x-axis: number of intermediaries, y-axis: number of chains; N = 64]

- Jeffrey Travers and Stanley Milgram, An Experimental Study of the Small World Problem, Sociometry, Vol. 32, No. 4, 1969, pp. 432


Small World Phenomenon

Erdos Number
  Paul Erdos: published ~1500 papers
  Erdos number: the distance to Paul Erdos in the co-authorship graph

[Figures: 1) Paul Erdos with Terence Tao; 2) illustration of Erdos numbers]

1) Billy and Grace Tao, Paul Erdos with Terence Tao, 1985, https://commons.wikimedia.org/wiki/File:Paul_Erdos_with_Terence_Tao.jpg
2) H2g2bob, Erdosnumber, 2006, https://commons.wikimedia.org/wiki/File:Erdosnumber.png


Small World Phenomenon

Erdos Number
  Albert Einstein: 2
  Enrico Fermi: 3
  Noam Chomsky: 4
  Most mathematicians have Erdos numbers of at most 4 or 5


Small World Phenomenon

Kevin Bacon number
  Distance to Kevin Bacon in the actor-actor graph from the Internet Movie Database (IMDB) as of 1994
  Average Bacon number of actors: 2.9

- GabboT, Kevin Bacon TIFF 2015, 2015, https://commons.wikimedia.org/wiki/File:Kevin_Bacon_TIFF_2015.jpg


Outline

Basic Definition
Small World Phenomenon
Diameter over Time
  Observation
  Model
Conclusion

Materials based on Jure Leskovec’s slides: J. Leskovec, J. Kleinberg, C. Faloutsos. “Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations”, ACM SIGKDD, 2005


Graph Evolution

How do the graphs evolve over time?

Application
  Network simulation
  Network prediction

Conventional wisdom
  Constant average degree: the number of edges grows linearly with the number of nodes
  Slowly growing diameter

New findings
  Densification power law: networks are becoming denser over time
  Shrinking diameter


Temporal Evolution of Graphs

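Densification Power Law
  Networks are becoming denser over time
  Average degree is increasing

  $E(t) \propto N(t)^{a}$, or equivalently $\log E(t) = a \log N(t) + c$

where N(t) and E(t) are the numbers of nodes and edges at time t, a is the densification exponent, and c is a constant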


Graph Densification – A closer look

Densification Power Law: $E(t) \propto N(t)^{a}$

Densification exponent: 1 ≤ a ≤ 2
  a=1: linear growth, constant out-degree (assumed in the literature so far)
  a=2: quadratic growth, clique

What are the exponents a for real graphs?
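One common way to estimate the exponent a from a sequence of graph snapshots is a least-squares line fit on the log-log plot of edge count against node count; the slope of that line is the estimate. The sketch below is an assumption about how such a fit could be done, not the procedure confirmed by the slides, and it uses synthetic data generated with a known exponent rather than any of the real datasets.

```python
import math


def densification_exponent(node_counts, edge_counts):
    """Slope of the least-squares line through (log N, log E) points."""
    xs = [math.log(n) for n in node_counts]
    ys = [math.log(e) for e in edge_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic snapshots (hypothetical): edges follow E = 2 * N^1.5 exactly.
synthetic_nodes = [100, 200, 400, 800, 1600]
synthetic_edges = [2 * n ** 1.5 for n in synthetic_nodes]
a_hat = densification_exponent(synthetic_nodes, synthetic_edges)
```

Because the synthetic points lie exactly on a power law, the fitted slope recovers the exponent 1.5 up to floating-point error; real data would scatter around the line.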


Densification – Physics Citations

Citations among physics papers
  1992: 1,293 papers, 2,717 citations
  2003: 29,555 papers, 352,807 citations
For each month M, create a graph of all citations up to month M

[Figure: log-log plot of E(t) versus N(t); densification exponent 1.69]


Densification – Patent Citations

Citations among patents granted
  1975: 334,000 nodes, 676,000 edges
  1999: 2.9 million nodes, 16.5 million edges
Each year is a data point

[Figure: log-log plot of E(t) versus N(t); densification exponent 1.66]


Densification – Autonomous Systems

Graph of Internet
  1997: 3,000 nodes, 10,000 edges
  2000: 6,000 nodes, 26,000 edges
One graph per day

[Figure: log-log plot of E(t) versus N(t); densification exponent 1.18]


Densification – Affiliation Network

Authors linked to their publications
  1992: 318 nodes, 272 edges
  2002: 60,000 nodes (20,000 authors, 38,000 papers), 133,000 edges

[Figure: log-log plot of E(t) versus N(t); densification exponent 1.15]


Graph Densification – Summary

The traditional constant out-degree assumption does not hold

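Real-world graphs: the number of edges grows as $E(t) \propto N(t)^{a}$ with a > 1 (between 1.15 and 1.69 in the four datasets above)

The average degree is increasing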
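The link between the exponent and the average degree can be made explicit with a short derivation. It assumes the power law holds with a proportionality constant c (a symbol introduced here for illustration) and uses the average out-degree of a directed graph, which is the number of edges divided by the number of nodes.

```latex
\begin{align*}
E(t) &= c \, N(t)^{a} \\
\bar{d}_{\text{out}}(t) &= \frac{E(t)}{N(t)} = c \, N(t)^{a-1}
\end{align*}
```

For a = 1 the average out-degree stays at the constant c, which is the traditional assumption. For a > 1 it grows as the graph gains nodes, which is what densification means.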


Evolution of the Diameter

Prior work on power-law graphs hints at a slowly growing diameter:
  diameter ~ O(log N)
  diameter ~ O(log log N)

What is happening in real data?

Diameter shrinks over time
  As the network grows, the distances between nodes slowly decrease


Diameter – ArXiv citation graph

Citations among physics papers
1992–2003
One graph per year

[Figure: diameter versus time in years]


Diameter – “Autonomous Systems”

Graph of Internet
One graph per day
1997–2000

[Figure: diameter versus number of nodes]


Diameter – “Affiliation Network”

Graph of collaborations in physics: authors linked to papers
10 years of data

[Figure: diameter versus time in years]


Diameter – “Patents”

Patent citation network
25 years of data

[Figure: diameter versus time in years]


Outline

Basic Definition
Small World Phenomenon
Diameter over Time
  Observation
  Model
Conclusion


Models

Existing graph generation models do not capture the Densification Power Law and shrinking diameters

Can we find a simple model of local behavior that naturally leads to the observed phenomena?


“Forest Fire” model

How do people make friends in a new environment?

1. First find a person and make friends
2. Follow one of their friends
3. Continue recursively
4. From time to time get introduced to a new person

Forest Fire model imitates exactly this process


“Forest Fire” – the Model

A node arrives
Randomly chooses an “ambassador”
Starts burning nodes (with probability p) and adds links to burned nodes
“Fire” spreads recursively
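The steps above can be sketched as a small simulation. This is a deliberately simplified, undirected variant that follows only what the slide states: a single burning probability p applied independently to each neighbor. The published model has additional details (such as separate forward and backward burning parameters) that the slide does not show, so the function names, the `seed` parameter, and the exact burning rule here are assumptions for illustration.

```python
import random
from collections import deque


def forest_fire_graph(num_nodes, burn_prob, seed=0):
    """Grow an undirected graph one node at a time with a simplified fire rule."""
    rng = random.Random(seed)
    adj = {0: set()}  # start from a single node
    for new_node in range(1, num_nodes):
        # The arriving node picks an ambassador uniformly at random.
        ambassador = rng.randrange(new_node)
        burned = {ambassador}
        queue = deque([ambassador])
        # The fire spreads recursively: each unburned neighbor of a burned
        # node catches fire with probability burn_prob.
        while queue:
            u = queue.popleft()
            for w in sorted(adj[u]):
                if w not in burned and rng.random() < burn_prob:
                    burned.add(w)
                    queue.append(w)
        # The new node links to every burned node.
        adj[new_node] = set(burned)
        for b in burned:
            adj[b].add(new_node)
    return adj


def edge_count(adj):
    """Number of undirected edges in an adjacency-set dictionary."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


tree = forest_fire_graph(50, 0.0)    # fire never spreads: one link per node
clique = forest_fire_graph(50, 1.0)  # fire always spreads: links to everyone
```

The two extreme settings bracket the behavior: with p = 0 each new node links only to its ambassador, giving n - 1 edges, and with p = 1 it links to every existing node, giving a clique. Intermediate values of p produce graphs between those two regimes.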


Forest Fire in Action

Forest Fire generates graphs that Densify and have Shrinking Diameter


Outline

Basic Definition
Small World Phenomenon
Diameter over Time
Conclusion


Conclusion

Definitions: make sure you know them
Small World Phenomenon
  Six degrees of separation
Diameter over time
  Shrinking diameter


Questions?