mining frequent subgraphs

13
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Mining Frequent Subgraphs COMP 790-90 Seminar Spring 2011

Upload: oswald

Post on 26-Jan-2016

31 views

Category:

Documents


2 download

DESCRIPTION

Mining Frequent Subgraphs. COMP 790-90 Seminar Spring 2011. FFSM: Fast Frequent Subgraph Mining -- An Overview:. How to solve graph isomorphism problem? A Novel Graph Canonical Form: CAM How to tackle subgraph isomorphism problem (NP-complete)? Incrementally maintained embeddings - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Mining Frequent Subgraphs

COMP 790-90 Seminar

Spring 2011

Page 2: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

04/21/232

FFSM: Fast Frequent Subgraph Mining -- An Overview:

How to solve graph isomorphism problem?

A Novel Graph Canonical Form: CAM

How to tackle subgraph isomorphism problem (NP-complete)?

Incrementally maintained embeddings

How to enumerate subgraphs:An Efficient Data Structure: CAM Tree

Two Operations: CAM-join, CAM-extension.

Page 3: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

04/21/233

Adjacency Matrix Every diagonal entry of adjacency matrix M corresponds to a distinct vertex in G and is filled with the label of this vertex. Every off-diagonal entry in the lower triangle

part of M1 corresponds to a pair of vertices in G and is filled with the label of the edge between the two vertices and zero if there is no edge.

1for an undirected graph, the upper triangle is always a mirror of the lower triangle

p2 p5

a

b

b d

y

x

y

y

y

(P)

p1

p3p4

c

M2

0

y

b

by

c0y0

d00

xy

a

M3

0

0

d

bx

a0yy

cy0

0y

b

M1

y

b

by

d000

cy0

xy

a

0

Page 4: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

04/21/234

CodeA Code of n n adjacency matrix M is defined as sequence of lower triangular entries (including the diagonal entries) in the order:M1,1 M2,1 M2,2 … Mn,1 Mn,2 …Mn,n-1 Mn,n

M1

y

b

by

d000

cy0

xy

a

0

Code(M1): aybyxb0y0c00y0d < Code(M2): aybyxb00yd0y00c < Code(M3): bxby0d0y0cyy00aM2

y

b

by

c0y0

d00

xy

a

M3

0

0

d

bx

a0yy

cy0

0y

b

0

The Canonical Adjacency Matrix is the one produces the maximal code, using lexicographic order.

Page 5: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

04/21/235

MP SubmatrixFor an m m matrix A, an n n matrix B is A’s maximal proper submatrix (MP Submatrix), iff B is obtained by removing the last none-zero entry from A.

M6

y

b

by

d000

cy0

xy

a

0

M5

b

by

cy0

xy

a

0

M4

b

by

xy

a

M2

by

a

a

M1 M3

b

by

0y

a

We define a CAM is connected iff the corresponding graph is connected. Theorem I: A CAM’s MP submatrix is CAMTheorem II: A connected CAM’s MP submatrix is connected

Page 6: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

04/21/236

CAM Tree: Subgraphs

y

0

b

by

d000

cy0

0y

a

0

y

y

a

c0y

b0

b

by

a

a

bx

b

b

cy

b

dy

b

cy0

by

a

dy0

by

a

d0y0

c0y

bx

b

c0y

bx

b

d0y

bx

b

c d

bx0

by

a

0

0

y

a

c0y

bx

b

d0y0

bx0

by

a

0

0

y

a

cy0

bx

b

dy00

bx0

by

a

y

0

b

by

d000

cy0

x0

a

0

y

y

a

c0y

bx

b

y

0

b

by

c000

dy0

x0

a

0

y

y

a

d0y

bx

b

bxy

by

a

p2 p5

a

b

b d

y

x

y

y

y

(P)

p1

p3p4

c

b0y

by

a

bx0

by

a

d0y0

b0y

by

a

Page 7: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

04/21/237

CAM Tree: Frequent Subgraphs

bxy

by

a

b0y

by

a

bx0

by

a

by

a

a

bx

b

b

a

b

b

y

xy

(Q)

q1

q3

q2

p2 p5

a

b

b d

y

x

y

y

y

(P)

p1

p3p4

c

a

b

b

y

y

(S)

s1

s3

s2

= 2/3

Page 8: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

04/21/238

How to Enumerate Nodes in a CAM Tree?

Two operations to explore CAM tree:CAM-Join

CAM-Extension

Augmenting CAM tree with Suboptimal CAMsObjectives:

no false dismissal

no redundancy

Plus: We want to this efficiently!

Page 9: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

04/21/239

Suboptimal Tree

bx

b

b

cy

b

dy

b

by

a

a

bxy

by

a

cy0

by

a

dy0

by

a

j e e

y

0

b

by

d000

cy0

xy

a

y

y

b

by

c000

d00

xy

a

jj

j j

d0y0

c0y

bx

b

c0y0

d0y

bx

b

j

0

y

y

a

c0y

bx

b

d0y0

bxy

by

a

j e e

dy00

bxy

by

a

cy00

bxy

by

a

j ej e

c0y

bx

b

cy0

bx

b

dy0

bx

b

d0y

bx

b

c dWe define a Suboptimal CAM as a matrix that its MP submatrix is a CAM.

p2 p5

a

b

b d

y

x

y

y

y

(P)

p1

p3p4

c

Page 10: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

04/21/2310

SummaryTheorem:For a graph G, let CK-1 (Ck) be set of the suboptimal CAMs of all size-(K-1) (K) subgraphs of G (K ≥ 2). Every member of set CK can be enumerated unambiguously either by joining two members of set CK-1 or by extending a member in CK-1.

Page 11: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

04/21/2311

Experimental StudyPredictive Toxicology Evaluation Competition (PTE)

Contains: 337 compounds

Each graph contains 27 nodes and 27 edges on average

NIH DTP Anti-Viral Screen Test (DTP CA/CM)Chemicals are classified to be Confirmed Active (CA), Confirmed Moderate Active (CM) and Confirmed Inactive (CI).

We formed a dataset contains CA (423) and CM (1083).

Each graph contains 25 nodes and 27 edges on average

Page 12: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

04/21/2312

Performance (PTE)

Support Threshold (%) Support Threshold (%)

Page 13: Mining Frequent Subgraphs

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

04/21/2313

Performance (DTP CACM)

Support Threshold (%) Support Threshold (%)