
Data mining --- mining graphs

CAP5771.1, University of South Florida

Xiaoning Qian

10/13/11

Today’s Lecture

1. Complex networks

2. Graph representation for networks

3. Markov chain

4. Viral propagation

5. Google’s PageRank


1. M Faloutsos, P Faloutsos, C Faloutsos, On power-law relationships of the Internet topology. Comput. Commun. Rev. 29:251-262, 1999

2. JM Kleinberg, Navigation in a small world: it is easier to find short chains between points in some networks than others. Nature, 406:845, 2000

3. AL Barabasi, Linked: The New Science of Networks. Cambridge, MA, 2002

4. DJ Watts, The “new” science of networks. Annu. Rev. Sociol. 30:243–70, 2004

5. DL Alderson, Catching the “network science” bug: Insight and opportunity for the operations researcher. Operations Research 56(5): 1047-1065, 2008

6. MEJ Newman, Networks: An Introduction. Oxford, 2010


“New” Science of Networks


The first network/graph problem: find a tour crossing every bridge just once (Euler, 1735)


Network Science


Bridges of Königsberg

1. S Milgram, The small world problem. Psychol.Today 2:60–67, 1967

“New” Science
  Unprecedented number of empirical networks
  Much larger scale networks
  Visualization does not convey enough information
  Computers are much more powerful
  Highly interdisciplinary


Network Science


Mining networks/graphs

Topology/structure of complex networks
  Global: degrees, centrality, connectivity, etc.
    Scale-free (power-law) networks: six degrees of separation?
  Local: clustering (community), network motifs, etc.

Dynamics/behavior of complex networks
  Global: the topological effect on dynamics
    How do information, viruses, diseases, rumors, etc. propagate?
  Local: how individual nodes behave


Network Science



[Figures: examples of complex networks, one per slide]

Complex Networks (Yeast PPI)

Complex Networks (Yeast signaling)

Complex Networks (food web)

Complex Networks (friendship)

Complex Networks (romantic relation)

Complex Networks (author citation)

Complex Networks (Internet)

Complex Networks (Web)


Mathematics of Networks (Graphs)


What is a network/graph?
  A collection of vertices/nodes joined by edges
  Different types of vertices and edges:
    Directed vs. undirected
    Weighted vs. binary
    Labeled vs. unlabeled
    Bipartite graphs
    Hypergraphs

Mathematically, G = {V, E}



Undirected network: $\langle v_i, v_j \rangle \in E \Rightarrow \langle v_j, v_i \rangle \in E$



Adjacency matrix L
  Symmetric for undirected graphs
  Square matrix for (self-)graphs; rectangular for bipartite graphs
  $L_{ij} = e_{ij}$ if $\langle v_i, v_j \rangle \in E$

Matrix analysis for graph mining!

Simple graphs, connected graphs, complete graphs, …
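As a minimal sketch of the adjacency-matrix representation, the following builds L from an edge list and checks the symmetry property for undirected graphs. The helper names `adjacency_matrix` and `is_symmetric`, the edge-list format `(i, j, weight)`, and the small 4-node example are illustrative assumptions, not part of the lecture.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n matrix L with L[i][j] = e_ij for each edge <v_i, v_j>.

    `edges` is a list of (i, j, weight) triples (hypothetical input format).
    """
    L = [[0] * n for _ in range(n)]
    for i, j, w in edges:
        L[i][j] = w
        if not directed:
            # Undirected: <v_i, v_j> in E implies <v_j, v_i> in E.
            L[j][i] = w
    return L


def is_symmetric(L):
    """Return True if L equals its transpose."""
    n = len(L)
    return all(L[i][j] == L[j][i] for i in range(n) for j in range(n))


# Hypothetical example: a triangle (0, 1, 2) with one extra node attached to 2.
edges = [(0, 1, 1), (1, 2, 1), (2, 0, 1), (2, 3, 1)]
L_undirected = adjacency_matrix(4, edges)
L_directed = adjacency_matrix(4, edges, directed=True)

print(is_symmetric(L_undirected), is_symmetric(L_directed))
```

The undirected matrix is symmetric by construction, while the directed one generally is not.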



Node degree $c_i$
  The number of edges incident with vertex $v_i$
  Neighbor set
  Input-, output-degrees
  Degree distributions (power-law)
  …

Trail (distinct edges), path (distinct nodes), cycle, cut, …
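The degree quantities listed above can be read directly off an adjacency matrix. The sketch below is illustrative only: the function names and example matrices are assumptions, and for the directed case it assumes that a nonzero `L[i][j]` means an edge pointing from v_i to v_j.

```python
from collections import Counter


def degrees(L):
    """Degree of each node: number of nonzero entries in its row (undirected graph)."""
    return [sum(1 for x in row if x != 0) for row in L]


def out_degrees(L):
    """Out-degree: nonzero entries per row (assumes L[i][j] != 0 means v_i -> v_j)."""
    return degrees(L)


def in_degrees(L):
    """In-degree: nonzero entries per column under the same assumption."""
    n = len(L)
    return [sum(1 for i in range(n) if L[i][j] != 0) for j in range(n)]


def degree_distribution(L):
    """Fraction of nodes having each degree value."""
    counts = Counter(degrees(L))
    n = len(L)
    return {k: counts[k] / n for k in sorted(counts)}


# Hypothetical undirected graph: triangle (0, 1, 2) plus node 3 attached to node 2.
L = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
# Hypothetical directed graph: 0 -> 1, 0 -> 2, 1 -> 2.
D = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]

print(degrees(L), degree_distribution(L))
print(out_degrees(D), in_degrees(D))
```

On real networks the degree distribution would be estimated over many more nodes; four nodes are far too few to show a power law.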

Sequential data


Markov chain



What is a Markov chain?

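Finite Markov chain -- (Q, P)
  Q = {q1, q2, …, qs}: a finite set of states
  P: state transition probability matrix

Given a sequence of observations $x_1, x_2, \ldots, x_n$, the probability of the sequence is:

  $p(x_1, \ldots, x_n) = p(x_1) \prod_{t=2}^{n} p(x_t \mid x_1, \ldots, x_{t-1})$

For a first-order time-homogeneous Markov chain:

  $p(x_t \mid x_1, \ldots, x_{t-1}) = p(x_t \mid x_{t-1})$

Hence,

  $p(x_1, \ldots, x_n) = p(x_1) \prod_{t=2}^{n} p(x_t \mid x_{t-1})$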

Finite Markov chain -- (Q, P)
  Q = {B, q1, q2, …, qs}: a finite set of states
  P: state transition probability matrix
  initial state probability:

The probability of a sequence can be expressed with P:

Note: the outputs are the states at each time, so the states are observable!
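Because the states are observable, the probability of a state sequence under a first-order chain is the initial-state probability times a product of one-step transition probabilities. The following is a small sketch of that computation; the two-state chain, the dictionary layout `trans[previous][current]`, and the function name are hypothetical choices made for illustration.

```python
def sequence_probability(seq, init, trans):
    """Probability of an observed state sequence under a first-order Markov chain.

    init[s]           : probability that the chain starts in state s
    trans[prev][cur]  : probability of moving from state prev to state cur
    """
    prob = init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= trans[prev][cur]
    return prob


# Hypothetical two-state chain (values chosen only for illustration).
init = {"A": 0.5, "B": 0.5}
trans = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.5, "B": 0.5},
}

prob = sequence_probability(["A", "A", "B", "B"], init, trans)
print(prob)  # 0.5 * 0.9 * 0.1 * 0.5
```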


What is a Markov chain?


3-state Markov chain model for the weather:

Q = {Rain (or snow), Cloudy, Sunny};

P is given in the figure; Initial state probability

[Figure: state-transition diagram over the states R, C, S with edge probabilities 0.4, 0.3, 0.3, 0.2, 0.2, 0.6, 0.8, 0.1, 0.1]


An example

Chapman-Kolmogorov equation: $p(x_n) = P^{n-1} p(x_1)$

Limiting distribution (stationary/steady-state distribution)
  Irreducibility, periodicity, ergodicity
  $p = P p$

How to solve for p?
  Eigen-decomposition of P
  Power method
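The power method can be sketched as repeated application of P until the distribution stops changing. The example below assumes a column-stochastic P (entry `P[i][j]` is the probability of moving to state i from state j, so that p = P p holds for a column vector p). The numeric matrix is an assumption: it uses the nine probabilities of the weather example arranged as in the classic three-state weather chain, which may not match the edge assignment in the lecture's figure.

```python
def mat_vec(P, p):
    """Matrix-vector product P p for a list-of-rows matrix."""
    return [sum(P[i][j] * p[j] for j in range(len(p))) for i in range(len(P))]


def power_method(P, tol=1e-12, max_iter=10_000):
    """Iterate p <- P p from the uniform distribution until it stops changing."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        q = mat_vec(P, p)
        if max(abs(a - b) for a, b in zip(p, q)) < tol:
            return q
        p = q
    return p


# ASSUMED transition matrix (columns = current state, rows = next state),
# state order: Rain, Cloudy, Sunny. Each column sums to 1.
P = [
    [0.4, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.3, 0.2, 0.8],
]

stationary = power_method(P)
print(stationary)
```

Under this assumed matrix the iteration settles at (2/11, 3/11, 6/11), which can be verified by substituting into p = P p. For an irreducible, aperiodic chain the result does not depend on the starting distribution.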


Chapman-Kolmogorov Equation

Random walk on graphs (network diffusion) is a Markov process.
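To make the connection concrete, a random walk that moves from a node to a uniformly chosen neighbor has a transition matrix obtained by normalizing each column of the adjacency matrix by that node's degree. The sketch below assumes an undirected graph with no isolated nodes; the function names and the example graph are illustrative. It also checks a standard fact: on a connected undirected graph, the distribution proportional to node degree is stationary for this walk.

```python
def random_walk_matrix(L):
    """Column-normalize an adjacency matrix: P[i][j] = L[i][j] / c_j.

    Assumes every node has at least one edge (no zero columns).
    """
    n = len(L)
    col_sums = [sum(L[i][j] for i in range(n)) for j in range(n)]
    return [[L[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def mat_vec(P, p):
    """Matrix-vector product P p."""
    return [sum(P[i][j] * p[j] for j in range(len(p))) for i in range(len(P))]


# Hypothetical undirected graph: triangle (0, 1, 2) plus node 3 attached to node 2.
L = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

P = random_walk_matrix(L)
node_degrees = [sum(row) for row in L]
total = sum(node_degrees)
p_degree = [c / total for c in node_degrees]  # distribution proportional to degree

print(mat_vec(P, p_degree), p_degree)  # the two vectors agree
```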


Random walk on graphs


The algorithm of Google---PageRank


What’s behind Google?



PageRank

What is an important Webpage?
  There are many Webpages pointing to it
  Important Webpages point to more important Webpages
  Importance diffuses based on links between Webpages

Vertices: Webpages; Edges: hyperlinks

HITS: JM Kleinberg


PageRank

Diffusion (random walk) on the Web

$p_i$: importance of page i; $L_{ij}$: link from page j to page i

[Equation: expression in λ, not recoverable]

Hence, the problem becomes a Markov chain problem (diffusion process):

[Equation: expression in λ, not recoverable]


PageRank

Diffusion (random walk with restart) on the Web

$p_i$: importance of page i; $L_{ij}$: link from page j to page i

[Equation: expression in λ, not recoverable]


PageRank

Diffusion (random walk with restart) on the Web

Matrix form:

[Equation: expression in λ, not recoverable; its terms are annotated "Pseudocount" and "Diffusion factor"]


PageRank

How do we solve this?

Note that p is used only for ranking, so its absolute values are not critical. WLOG, we assume p is normalized.

Hence, the problem becomes a Markov chain problem (diffusion process):

[Equation: expression in λ, not recoverable]
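A random walk with restart can be solved by the same power iteration used for any Markov chain. The sketch below uses one common formulation of PageRank, which is an assumption rather than the lecture's exact equation: p_i = (1 - λ)/N + λ Σ_j (L_ij / c_j) p_j, where c_j is the number of outgoing links of page j, λ is taken to be the diffusion factor, (1 - λ)/N is the restart (pseudocount) term, and p is normalized to sum to 1. The function name, the default λ = 0.85, and the three-page example are illustrative.

```python
def pagerank(L, lam=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for an assumed random-walk-with-restart formulation.

    L[i][j] = 1 if page j links to page i (convention stated in the text).
    lam     : assumed diffusion factor; (1 - lam) / n is the restart term.
    """
    n = len(L)
    # c[j] = number of outgoing links of page j (column sum of L).
    c = [sum(L[i][j] for i in range(n)) for j in range(n)]
    if any(cj == 0 for cj in c):
        raise ValueError("this sketch assumes every page has an outgoing link")
    p = [1.0 / n] * n
    for _ in range(max_iter):
        q = [
            (1 - lam) / n + lam * sum(L[i][j] * p[j] / c[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(p, q)) < tol:
            return q
        p = q
    return p


# Hypothetical web of three pages: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
L = [
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
]

ranks = pagerank(L)
order = sorted(range(len(ranks)), key=lambda i: -ranks[i])
print(ranks, order)
```

Page 2 receives links from both other pages and ranks first; page 0 inherits most of page 2's importance through its single outgoing link. Pages without outgoing links need extra handling that this sketch deliberately leaves out.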


Viral propagation

How does the virus spread over the network?
Will it become an “epidemic” outbreak?
How fast will the virus die out or become “epidemic”?
How should we design “robust” networks to prevent cascading failures?

* A Ganesh et al., The effect of network topology on the spread of epidemics. INFOCOM, 2005

* D Chakrabarti, Tools for large graph mining. Ph.D. Thesis, CMU, 2005


Mathematical Epidemiology

SIR (Susceptible-Infective-Recovered) model

SIS (Susceptible-Infective-Susceptible) model
  Catching the disease from infective neighbors (birth rate): β
  Recovery rate: δ
  Epidemic threshold: τ


SIS model


????

Sum and Product rules in probability!!

SIS model is again a Markov process!


SIS model




SIS model

With appropriate approximations, we can derive

$p(v_i^t = \text{susceptible}) = p(v_i^{t-1} = \text{susceptible})\, \zeta_i + p(v_i^{t-1} = \text{infective})\, \delta$

$1 - p(v_i^t) = [1 - p(v_i^{t-1})]\, \zeta_i + p(v_i^{t-1})\, \delta$

and

$p_t = [\beta L + (1-\delta) I]\, p_{t-1}$


SIS model

With appropriate approximations, we can derive

$p_t = [\beta L + (1-\delta) I]\, p_{t-1}$

Eigen-decomposition of the matrix $S = [\beta L + (1-\delta) I]$

$p_t = S^t p_0$

Hence,


SIS model

With appropriate approximations, we can derive

$p_t = [\beta L + (1-\delta) I]\, p_{t-1}$

Eigen-decomposition of the matrix $S = [\beta L + (1-\delta) I]$

Epidemic threshold:

Hence,
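The linearized recursion with S = βL + (1-δ)I can be explored numerically. Since S shares its eigenvectors with L and has eigenvalues 1 - δ + βλ_i(L), the infection probabilities decay when 1 - δ + βλ_1(L) < 1, that is, when β/δ < 1/λ_1(L), where λ_1(L) is the largest eigenvalue of the adjacency matrix. Reading 1/λ_1(L) as the epidemic threshold τ is an assumption consistent with this recursion, not a transcription of the lecture's formula. The function names, parameter values, and example graph below are illustrative.

```python
def mat_vec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def largest_eigenvalue(L, iters=2000):
    """Estimate the largest eigenvalue of a symmetric nonnegative matrix L.

    Power iteration is run on L + I so that it also converges on bipartite
    graphs; the Rayleigh quotient of L is returned at the end.
    """
    v = [1.0] * len(L)
    for _ in range(iters):
        w = [a + b for a, b in zip(mat_vec(L, v), v)]  # (L + I) v
        norm = max(abs(x) for x in w)
        v = [x / norm for x in w]
    Lv = mat_vec(L, v)
    return sum(a * b for a, b in zip(v, Lv)) / sum(a * a for a in v)


def sis_step(p, L, beta, delta):
    """One step of the linearized model: p_t = [beta L + (1 - delta) I] p_{t-1}."""
    Lp = mat_vec(L, p)
    return [beta * lp + (1 - delta) * pi for lp, pi in zip(Lp, p)]


def simulate(p0, L, beta, delta, steps):
    """Apply sis_step repeatedly. Values are not clipped to [0, 1], because the
    linear approximation is only meaningful while infection probabilities are small."""
    p = list(p0)
    for _ in range(steps):
        p = sis_step(p, L, beta, delta)
    return p


# Hypothetical undirected graph: triangle (0, 1, 2) plus node 3 attached to node 2.
L = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

lam1 = largest_eigenvalue(L)
threshold = 1.0 / lam1
p0 = [0.1] * 4
below = simulate(p0, L, beta=0.05, delta=0.2, steps=200)  # beta/delta = 0.25
above = simulate(p0, L, beta=0.15, delta=0.2, steps=200)  # beta/delta = 0.75

print(threshold, sum(below), sum(above))
```

With β/δ below the assumed threshold the total infection probability shrinks toward zero; above it, the linearized model grows without bound, which signals an outbreak rather than a literal probability.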

Networks/graphs are everywhere and require new tools to study them efficiently and effectively.

Random walk (Markov chain) on graphs and its extensions can be a useful technique to “mine” complex networks/graphs
  PageRank
  Viral propagation

Have you learned anything? :)

I am teaching Biological Network Analysis, Spring 2012.


Summary
