Graph Indexing Techniques Seoul National University IDB Lab. Kisung Kim 2011. 3. 23


Graph Indexing Techniques

Seoul National University IDB Lab.

Kisung Kim, 2011. 3. 23


Outline

• Category of graph queries
• Querying in collection DB
• References


Category of Graph Queries: Matching Type

• Exact subgraph matching
  – Find graphs in the DB that contain all components of the query graph

• Similarity subgraph matching
  – Find graphs in the DB that contain some components of the query graph
  – A similarity measure is needed

• Supergraph matching
  – Find graphs in the DB that are contained in the query graph

[Figure: a query graph shown next to an exact subgraph match and a similarity subgraph match]


Category of Graph Queries: Target DB

• Collection DB: a large number of small graphs
  – e.g. chemical compounds
  – Retrieval component: IDs of graphs which contain matching parts

• Large graphs: a small number of large graphs
  – e.g. social network, RDF graph
  – Retrieval component: all matching subgraphs

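[Figure: Querying Collection DB. A query graph is evaluated against the graphs G1 to G7; the result is a graph ID list: G1, G3, G5.]

[Figure: Querying Large Graphs. A query graph is evaluated against a large graph; the results are the matching subgraphs.]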


Query Processing in Collection DB

• Processing flow
  – Query → Filtering → Candidate graph set → Verification → Answer, over the graphs in the DB

• Verification uses the usual pair-wise subgraph isomorphism algorithm

• Most techniques focus on filtering
  – The cost of verification is high
  – Filtering reduces the number of verification runs
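The filtering-and-verification flow above can be sketched in a few lines of Python. This is a minimal illustration and not any of the surveyed systems: the graph representation, the helper names (`make_graph`, `label_filter`, `is_subgraph`, `answer`) and the label-count filter are assumptions made for the example, and the verifier is a brute-force test that is only practical for tiny graphs.

```python
from collections import Counter
from itertools import permutations


def make_graph(labels, edges):
    """labels: {node: label}; edges: iterable of undirected (u, v) pairs."""
    return {"labels": dict(labels), "edges": {frozenset(e) for e in edges}}


def label_filter(query, db):
    """Cheap filter (assumed stand-in for an index): keep graphs that have
    at least as many nodes of each label as the query."""
    need = Counter(query["labels"].values())
    keep = []
    for gid, g in db.items():
        have = Counter(g["labels"].values())
        if all(have[label] >= n for label, n in need.items()):
            keep.append(gid)
    return keep


def is_subgraph(query, graph):
    """Brute-force pair-wise subgraph isomorphism test (exponential time)."""
    q_nodes = list(query["labels"])
    for image in permutations(graph["labels"], len(q_nodes)):
        m = dict(zip(q_nodes, image))
        labels_ok = all(query["labels"][u] == graph["labels"][m[u]] for u in q_nodes)
        edges_ok = all(frozenset(m[u] for u in e) in graph["edges"]
                       for e in query["edges"])
        if labels_ok and edges_ok:
            return True
    return False


def answer(query, db):
    """Return (candidate ids after filtering, answer ids after verification)."""
    candidates = label_filter(query, db)
    return candidates, [gid for gid in candidates if is_subgraph(query, db[gid])]


# Hypothetical three-graph collection and a query shaped B - A - C.
DB = {
    "g1": make_graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (1, 3), (2, 3)]),
    "g2": make_graph({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)]),
    "g3": make_graph({1: "A", 2: "A"}, [(1, 2)]),
}
QUERY = make_graph({"x": "A", "y": "B", "z": "C"}, [("x", "y"), ("x", "z")])
```

Here `answer(QUERY, DB)` keeps g1 and g2 as candidates, since g3 lacks B and C nodes, and verification then rejects g2 because it has no A-C edge. The expensive test runs on two graphs instead of three, which is the point of filtering.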


Query Processing in Large Graphs

• Processing flow
  – Query → Index search → Candidate node sets → Building subgraphs → Answer subgraphs

• Focus on node indexing
  – To reduce the search space
  – Use structural information of nodes

• Build subgraphs by joining candidate nodes
  – Join methods have received relatively little research
  – Optimization using join ordering


Graph Indexing Techniques

Technique | Target Database | Query Type | Approach
GraphGrep [Shasha et al., PODS’02] | Collection DB | Exact | Feature (path) based index
gIndex [Yan et al., SIGMOD’04] | Collection DB | Exact | Feature (graph) based index
Grafil [Yan et al., SIGMOD’05] | Collection DB | Exact & Similarity | Feature based similarity search
C-tree [He and Singh, ICDE’06] | Collection DB | Exact & Similarity | Closure based index
QuickSI [Shang et al., VLDB’08] | Collection DB | Exact | Verification algorithm
Tale [Tian and Patel, ICDE’08] | Collection DB | Exact & Similarity | Similarity search using node index
GraphQL [He and Singh, SIGMOD’08] | Large graphs | Exact | Node indexing
Spath [Zhao and Han, VLDB’10] | Large graphs | Exact | Node indexing using neighborhood information


GraphGrep(1/2) [Shasha et al. PODS’02]

• The first work to adopt the filtering-and-verification framework
• Path-based index
  – Fingerprint of the database
  – Enumerate the set of all paths (length <= L) of all graphs in the DB
  – For each path, the number of occurrences in each graph is stored in a hash table

[Figure: three example graphs g1, g2, g3 with node labels A to E, and the path index built from them]

Index:
Key | g1 | g2 | g3
h(CA) | 1 | 0 | 1
h(ABCB) | 2 | 2 | 0
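As a rough sketch of how such a path index could be built, the code below enumerates the label sequences of all simple paths with at most `max_len` edges and stores per-graph counts. Several things are assumptions for illustration: the adjacency-list representation, the function names, counting each traversal direction separately, and using a Python dict in place of the hash function h. Single-character labels are assumed so that joined keys stay unambiguous.

```python
from collections import Counter, defaultdict


def label_paths(labels, adj, max_len):
    """Count label sequences of all simple paths with at most max_len edges.

    Single nodes (zero edges) are included, and an undirected path is counted
    once per traversal direction (e.g. both "AB" and "BA").
    """
    counts = Counter()

    def walk(path):
        counts["".join(labels[v] for v in path)] += 1
        if len(path) - 1 < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:  # keep the path simple
                    walk(path + [nxt])

    for start in labels:
        walk([start])
    return counts


def build_index(db, max_len):
    """index[path_key][graph_id] = number of occurrences of that path."""
    index = defaultdict(dict)
    for gid, (labels, adj) in db.items():
        for key, n in label_paths(labels, adj, max_len).items():
            index[key][gid] = n
    return index


# Hypothetical graphs: G1 is the path B - A - C, G2 is the single edge A - B.
G1 = ({0: "A", 1: "B", 2: "C"}, {0: [1, 2], 1: [0], 2: [0]})
G2 = ({0: "A", 1: "B"}, {0: [1], 1: [0]})
INDEX = build_index({"g1": G1, "g2": G2}, max_len=2)
```

With these inputs, `INDEX["AB"]` records one occurrence in each graph, while `INDEX["BAC"]` only has an entry for g1.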


GraphGrep(2/2): Query Processing

• Filtering
  – Make the fingerprint of query q
    – Hash all paths (length <= L) of q
  – Compare the fingerprint of the query with the fingerprint of the database
    – Discard a graph whose value in the fingerprint is less than the value in the query fingerprint

• Verification
  – Run subgraph isomorphism tests on the remaining candidates

[Figure: the graphs g1, g2, g3 and a query graph with nodes A, B, C]

Index:
Key | g1 | g2 | g3
h(AB) | 2 | 2 | 1
h(AC) | 1 | 0 | 1
h(BAC) | 2 | 0 | 1

Query fingerprint: AB: 1, AC: 1, BAC: 1

Candidates = {g1, g3}, which are passed on to verification
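The filtering step can be replayed on the index and query fingerprint shown on this slide. The snippet is a minimal sketch: the dict layout and the name `filter_candidates` are assumptions, and plain path strings stand in for the hashed keys h(...).

```python
def filter_candidates(index, query_fp, graph_ids):
    """Keep graphs whose count is >= the query's count for every query path."""
    candidates = set(graph_ids)
    for key, needed in query_fp.items():
        row = index.get(key, {})  # a path missing from the index occurs nowhere
        candidates = {g for g in candidates if row.get(g, 0) >= needed}
    return candidates


# Values copied from the slide's index table and query fingerprint.
INDEX = {
    "AB": {"g1": 2, "g2": 2, "g3": 1},
    "AC": {"g1": 1, "g2": 0, "g3": 1},
    "BAC": {"g1": 2, "g2": 0, "g3": 1},
}
QUERY_FP = {"AB": 1, "AC": 1, "BAC": 1}

CANDIDATES = filter_candidates(INDEX, QUERY_FP, ["g1", "g2", "g3"])
```

g2 is discarded because its counts for AC and BAC are 0, which leaves {g1, g3}, the same candidate set as on the slide.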


gIndex(1/6) [Yan et al., SIGMOD’04]

• The path-based approach has weak points
  – A path is too simple: structural information is lost
  – There are too many paths: the set of paths in a graph database is usually huge

• Solution
  – Use graph structures instead of paths as the basic index feature

[Figure: a sample database of carbon-skeleton graphs, a query graph, and the paths in the query graph. The paths cannot filter any graphs in the database.]


gIndex(2/6): Frequent Fragment

• The number of graph structures is large, so index only frequent subgraphs

• support(g)
  – The number of graphs in D (the graph database) that contain g as a subgraph

• minSup
  – Minimum support threshold
  – Index a fragment g only if support(g) ≥ minSup

• Size-increasing support
  – The number of frequent fragments grows as the fragment size increases
  – Low minSup for small fragments, high minSup for large fragments
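A small sketch of how a size-increasing threshold might be applied when selecting fragments to index. It assumes the set of graphs containing each fragment is already known (computing it needs subgraph isomorphism tests, which are omitted here). The fragment names, the ID sets and the step function `min_sup` are hypothetical values chosen for illustration.

```python
def support(containing_ids):
    """support(g): number of database graphs that contain fragment g."""
    return len(containing_ids)


def min_sup(size):
    """Hypothetical size-increasing threshold: 1 up to size 2, then 2."""
    return 1 if size <= 2 else 2


def frequent_fragments(fragments):
    """fragments: {name: (size in edges, ids of graphs containing it)}."""
    return sorted(
        name
        for name, (size, ids) in fragments.items()
        if support(ids) >= min_sup(size)
    )


# Hypothetical fragments over a four-graph database.
FRAGMENTS = {
    "A-B": (1, {"g1", "g2", "g3"}),
    "A-B-A": (2, {"g1"}),
    "A-B-A-B": (3, {"g2", "g3"}),
    "B-A-B-B": (3, {"g4"}),
}
SELECTED = frequent_fragments(FRAGMENTS)
```

The size-3 fragment with support 1 is dropped while the size-2 fragment with the same support is kept, so small fragments are indexed liberally and large ones only when they are frequent.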


gIndex(3/6): Frequent Fragment

[Figure: fragments of size 1 to 4 over the node labels A and B, each annotated with its frequency F (between 1 and 4). The thresholds are minSup=1 for size 1, minSup=1 for size 2, minSup=2 for size 3 and minSup=2 for size 4.]


gIndex(4/6): Discriminative Fragment

• Redundant fragment
  – A fragment whose indexed graphs are already indexed by its subgraphs
  – We don’t need to include redundant fragments

• Discriminative fragment
  – A fragment which is not redundant

[Figure: database graphs g1 to g4 over the node labels A and B, and fragments f1, f2, f3 of sizes 2 and 3]

Df1 = {g1, g2, g3}
Df2 = {g2, g3, g4}
Df3 = {g2, g3} = Df1 ∩ Df2
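The redundancy test suggested by the figure can be written as a set comparison. This sketch uses the exact-equality reading of the slide (a fragment is redundant when its ID list equals the intersection of its subgraphs' ID lists); the function name is an assumption.

```python
def is_redundant(d_fragment, d_subfragments):
    """True if the fragment's ID list equals the intersection of the ID lists
    of its indexed subgraphs, so indexing it adds no filtering power."""
    common = set.intersection(*(set(d) for d in d_subfragments))
    return set(d_fragment) == common


# ID lists taken from the figure.
D_F1 = {"g1", "g2", "g3"}
D_F2 = {"g2", "g3", "g4"}
D_F3 = {"g2", "g3"}

F3_REDUNDANT = is_redundant(D_F3, [D_F1, D_F2])
```

For the figure's values f3 is redundant, because intersecting the lists of f1 and f2 already yields {g2, g3}. If f3 occurred only in g2, its list would be strictly smaller than the intersection and the fragment would be discriminative.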


gIndex(5/6): gIndex Tree

• Use a graph serialization method
  – For fast graph isomorphism checking during index search
  – DFS coding [Yan et al. ICDM’02]
  – Translate a graph into a unique edge sequence

• gIndex Tree
  – Prefix tree which consists of the edge sequences of discriminative fragments
  – Record all size-n discriminative fragments in level n
  – Black nodes: discriminative fragments
    – Have ID lists: the ids of graphs containing f_i
  – White nodes: redundant fragments; kept for Apriori pruning

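[Figure: DFS coding. A graph with node labels X, X, Z, Y and edge labels a, b has its vertices named v0 to v3 and is serialized as the edge sequence <(v0,v1),(v1,v2),(v2,v0),(v1,v3)>.]

[Figure: gIndex tree. A prefix tree with levels 0 to 2, edges e1, e2, e3 and fragments f1, f2, f3.]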


gIndex(6/6): Searching

• Searching process
  – Given a query q, enumerate all of q’s fragments (size <= maxSize)
  – Locate the fragments in the gIndex tree
  – Intersect the ID lists associated with the fragments

• Apriori pruning
  – Generating every fragment is inefficient
  – If a fragment is not in the gIndex tree, we need not check its supergraphs any more
  – Redundant fragments need to be recorded for Apriori pruning

[Figure: the gIndex tree (levels 0 to 2, edges e1, e2, e3, fragments f1, f2, f3) searched with the query <e1, e2, e3, e4, e5>]

Fragments enumerated: <e1>, <e1, e2>, <e1, e2, e3>, <e1, e2, e3, e4> (stop), <e2>, …
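The search with Apriori pruning can be sketched on a toy prefix tree. This is a deliberate simplification and not the gIndex implementation: fragments are treated as contiguous runs of the query's edge sequence, mirroring the enumeration in the figure, whereas gIndex enumerates subgraphs and looks up their canonical DFS codes. The tree layout, the function names and the ID lists are hypothetical.

```python
def build_tree(fragments):
    """fragments: {edge sequence (tuple): set of graph ids, or None if the
    fragment is redundant and kept only for pruning}."""
    root = {"children": {}, "ids": None}
    for seq, ids in fragments.items():
        node = root
        for edge in seq:
            node = node["children"].setdefault(edge, {"children": {}, "ids": None})
        node["ids"] = ids
    return root


def search(root, query, all_ids):
    """Return (candidate graph ids, fragments that were found in the tree)."""
    candidates = set(all_ids)
    located = []
    for start in range(len(query)):
        node = root
        for end in range(start, len(query)):
            node = node["children"].get(query[end])
            if node is None:
                break  # Apriori pruning: no larger fragment can be indexed
            located.append(tuple(query[start:end + 1]))
            if node["ids"] is not None:  # discriminative fragment
                candidates &= node["ids"]
    return candidates, located


# Hypothetical index: <e1, e2> is redundant (no ID list of its own).
TREE = build_tree({
    ("e1",): {1, 2, 3},
    ("e1", "e2"): None,
    ("e1", "e2", "e3"): {1, 2},
    ("e2",): {2, 3, 4},
})
RESULT, LOCATED = search(TREE, ["e1", "e2", "e3", "e4", "e5"], {1, 2, 3, 4})
```

The lookup of <e1, e2, e3, e4> fails, so extension from e1 stops there without trying <e1, e2, e3, e4, e5>. The redundant node <e1, e2> contributes no ID list, but it is what allows the walk to reach <e1, e2, e3>.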


Grafil(1/4) [Yan et al., SIGMOD’05]

• Subgraph similarity search
• Feature-based approach
• Similarity search using relaxed queries
  – Relax a query by deleting k edges
  – Missed edges incur missed features

• Main question
  – What is the maximum number of missed features when relaxing a query with k missed edges?

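[Figure: each database graph G1, G2, …, Gn has a feature vector {u1, u2, …, un}; the query has the feature vector {v1, v2, …, vn}. The figure contrasts subgraph exact search with subgraph similarity search.]

Subgraph exact search: $u_i \ge v_i$ for $1 \le i \le n$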


Grafil(2/4): Feature Misses

Query features: fa = 1, fb = 2, fc = 4 (7 in total)

Relaxed queries (missing 1 edge):
Relaxed query | fa | fb | fc | Remaining features | Feature misses
1 | 1 | 0 | 3 | 4 | 7-4=3
2 | 0 | 1 | 2 | 3 | 7-3=4
3 | 0 | 1 | 2 | 3 | 7-3=4

Maximum feature misses: mmax = 4


Grafil(3/4): Feature Miss Estimation

• Problem
  – Given a query Q and a set of features contained in Q, if the relaxation ratio is given, what is the maximal number of features that can be missed?

• Use the edge-feature matrix
  – Find the maximum number of columns that can be hit by k rows
  – k: the number of missing edges in Q

• Classic maximum coverage problem (set k-cover)
  – Proved NP-complete

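[Figure: the query graph with edges e1, e2, e3 and its features fa, fb, fc]

Edge-Feature Matrix:
   | fa | fb1 | fb2 | fc1 | fc2 | fc3 | fc4
e1 | 0 | 1 | 1 | 1 | 0 | 0 | 0
e2 | 1 | 1 | 0 | 0 | 1 | 0 | 1
e3 | 1 | 0 | 1 | 0 | 0 | 1 | 1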
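For a matrix this small, the maximum number of columns hit by k rows can be found by trying every choice of k edges. The sketch below does exactly that on the slide's matrix. It is an exhaustive search written for illustration (the function name is an assumption) and does not scale, which is consistent with the problem being NP-complete.

```python
from itertools import combinations

# Edge-feature matrix copied from the slide: a 1 means that deleting the edge
# (row) destroys that feature occurrence (column).
FEATURES = ["fa", "fb1", "fb2", "fc1", "fc2", "fc3", "fc4"]
EDGE_FEATURE = {
    "e1": [0, 1, 1, 1, 0, 0, 0],
    "e2": [1, 1, 0, 0, 1, 0, 1],
    "e3": [1, 0, 1, 0, 0, 1, 1],
}


def max_feature_misses(matrix, k):
    """Largest number of columns hit by any choice of k rows (brute force)."""
    n_cols = len(next(iter(matrix.values())))
    best = 0
    for rows in combinations(matrix, k):
        hit = sum(1 for j in range(n_cols) if any(matrix[e][j] for e in rows))
        best = max(best, hit)
    return best


M_MAX_1 = max_feature_misses(EDGE_FEATURE, 1)
M_MAX_2 = max_feature_misses(EDGE_FEATURE, 2)
```

With k = 1 the result is 4, reached by deleting e2 or e3, which agrees with mmax = 4 on the Feature Misses slide. With k = 2 any pair of edges hits 6 of the 7 columns.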


Grafil(4/4): Feature Conjugation

• The misses of one feature can be compensated by occurrences of other features in G

• Using all the features together in one filter would deteriorate the filtering performance

• Solution
  – Use multiple filters
  – Feature set selection

[Figure: a query over the node labels A, B, C with features fa (3 occurrences) and fb (4 occurrences), and a database graph. With relaxation ratio = 1 and mmax = 4, the feature misses are (3-0)+0 = 3 ≤ mmax.]
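The arithmetic in the figure corresponds to a count-based filter: sum, over the query's features, how many occurrences the graph is short of, and keep the graph if the total does not exceed mmax. The sketch below assumes this reading. The function names and the graph's feature counts are hypothetical, chosen so that the sum reproduces (3-0)+0 = 3.

```python
def feature_misses(query_vec, graph_vec):
    """Total shortfall of the graph's feature counts against the query's.
    A surplus in one feature does not offset a shortfall in another."""
    return sum(max(0, v - graph_vec.get(f, 0)) for f, v in query_vec.items())


def passes_filter(query_vec, graph_vec, m_max):
    """Keep the graph as a candidate if it misses at most m_max features."""
    return feature_misses(query_vec, graph_vec) <= m_max


QUERY_VEC = {"fa": 3, "fb": 4}   # counts from the figure
GRAPH_VEC = {"fa": 0, "fb": 4}   # hypothetical graph: no fa, enough fb
MISSES = feature_misses(QUERY_VEC, GRAPH_VEC)
KEPT = passes_filter(QUERY_VEC, GRAPH_VEC, m_max=4)
```

The graph passes the combined filter even though it contains no occurrence of fa at all. That weakness of a single filter over all features is what the slide's multiple-filter solution is meant to address.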


References

• [Shasha et al., PODS’02] Dennis Shasha, Jason T. L. Wang, Rosalba Giugno. Algorithmics and Applications of Tree and Graph Searching. PODS, 2002.
• [Yan et al., SIGMOD’04] Xifeng Yan, Philip S. Yu, Jiawei Han. Graph Indexing: A Frequent Structure-based Approach. SIGMOD, 2004.
• [Yan et al., SIGMOD’05] Xifeng Yan, Philip S. Yu, Jiawei Han. Substructure Similarity Search in Graph Databases. SIGMOD, 2005.
• [Tian and Patel, ICDE’08] Yuanyuan Tian, Jignesh M. Patel. TALE: A Tool for Approximate Large Graph Matching. ICDE, 2008.
• [He and Singh, SIGMOD’08] Huahai He, Ambuj K. Singh. Graphs-at-a-time: Query Language and Access Methods for Graph Databases. SIGMOD, 2008.
• [Zhao and Han, VLDB’10] Peixiang Zhao, Jiawei Han. On Graph Query Optimization in Large Networks. VLDB, 2010.
• [He and Singh, ICDE’06] Huahai He, Ambuj K. Singh. Closure-Tree: An Index Structure for Graph Queries. ICDE, 2006.
• [Shang et al., VLDB’08] Haichuan Shang, Ying Zhang, Xuemin Lin, Jeffrey Xu Yu. Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism. VLDB, 2008.
