graph indexing techniques seoul national university idb lab. kisung kim 2011. 3. 23

Graph Indexing Techniques

Seoul National University IDB Lab.

Kisung Kim, 2011. 3. 23

Outline

• Category of graph queries
• Querying in collection DB
• References

Category of Graph Queries: Matching Type

• Exact subgraph matching
  – Find graphs in the DB that contain all components of the query graph
• Similarity subgraph matching
  – Find graphs in the DB that contain some components of the query graph
  – A similarity measure is needed
• Supergraph matching
  – Find graphs in the DB that are contained in the query graph

[Figure: a query graph with an exact subgraph match and a similarity subgraph match.]

Category of Graph Queries: Target DB

• Collection DB: a large number of small graphs
  – e.g. chemical compounds
  – Retrieval component: IDs of graphs which contain matching parts
• Large graphs: a small number of large graphs
  – e.g. social network, RDF graph
  – Retrieval component: all matching subgraphs

[Figure: Querying Collection DB. A query graph is matched against graphs G1 to G7; the result is a graph ID list (G1, G3, G5).]

[Figure: Querying Large Graphs. A query graph is matched against a large graph; the results are the matching subgraphs.]

Query Processing in Collection DB

• Processing flow
  – Query → Filtering (over the graphs in the DB) → Candidate graph set → Verification → Answer
• Verification uses the usual pairwise subgraph isomorphism algorithm
• Most techniques focus on filtering
  – The cost of verification is high
  – Filtering reduces the number of verification runs

Query Processing in Large Graphs

• Processing flow
  – Query → Index search → Candidate node sets → Building subgraphs → Answer subgraphs
• Focus on node indexing
  – To reduce the search space
  – Uses structural information of nodes
• Build subgraphs by joining candidate nodes
  – Join methods have received relatively little research
  – Optimization using join ordering


Technique                           Target database  Query type          Approach
GraphGrep [Shasha et al., PODS’02]  Collection DB    Exact               Feature (path) based index
gIndex [Yan et al., SIGMOD’04]      Collection DB    Exact               Feature (graph) based index
Grafil [Yan et al., SIGMOD’05]      Collection DB    Exact & Similarity  Feature based similarity search
C-tree [He and Singh, ICDE’06]      Collection DB    Exact & Similarity  Closure based index
QuickSI [Shang et al., VLDB’08]     Collection DB    Exact               Verification algorithm
Tale [Tian and Patel, ICDE’08]      Collection DB    Exact & Similarity  Similarity search using node index
GraphQL [He and Singh, SIGMOD’08]   Large graphs     Exact               Node indexing
Spath [Zhao and Han, VLDB’10]       Large graphs     Exact               Node indexing using neighborhood information


GraphGrep(1/2) [Shasha et al. PODS’02]

• First work to adopt the filtering-and-verification framework
• Path-based index
  – The index is a fingerprint of the database
  – Enumerate all paths (length <= L) of all graphs in the DB
  – For each path, the number of occurrences in each graph is stored in a hash table

[Figure: three example graphs g1, g2, g3 with node labels A to E, and the index built from them.]

Index
Key      g1  g2  g3
h(CA)    1   0   1
…
h(ABCB)  2   2   0
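The path enumeration described above can be sketched as follows. This is a minimal illustration and not the GraphGrep implementation: the graph encoding (a label dict plus an adjacency dict), the function names `label_paths` and `build_index`, and the two toy graphs are all assumptions. Paths are counted once per direction, single-node paths are included, and a plain Python dict stands in for the hash table h(·).

```python
from collections import Counter

def label_paths(labels, adj, max_len):
    """Count label paths with at most max_len edges (no repeated node in a path)."""
    counts = Counter()

    def extend(path):
        counts["".join(labels[v] for v in path)] += 1
        if len(path) - 1 == max_len:  # path already has max_len edges
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for v in labels:
        extend([v])
    return counts

def build_index(db, max_len):
    """Map each label path to {graph_id: occurrence count} (nonzero counts only)."""
    index = {}
    for gid, (labels, adj) in db.items():
        for path, count in label_paths(labels, adj, max_len).items():
            index.setdefault(path, {})[gid] = count
    return index

# Hypothetical toy database: g1 is the chain A-B-C, g2 is a node A with two B neighbors.
db = {
    "g1": ({0: "A", 1: "B", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]}),
    "g2": ({0: "A", 1: "B", 2: "B"}, {0: [1, 2], 1: [0], 2: [0]}),
}
index = build_index(db, max_len=2)
print(index["AB"])   # {'g1': 1, 'g2': 2}
print(index["ABC"])  # {'g1': 1}
```

The number of enumerated paths grows quickly with `max_len`, which is why the slide bounds the path length by L.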

GraphGrep(2/2): Query Processing

• Filtering
  – Make the fingerprint of query q
    – Hash all paths (length <= L) of q
  – Compare the fingerprint of the query with the fingerprint of the database
    – Discard a graph whose value in the fingerprint is less than the value in the query fingerprint
• Verification
  – Run subgraph isomorphism tests on the remaining candidates

[Figure: example graphs g1, g2, g3 and a query graph with nodes A, B, C.]

Index
Key      g1  g2  g3
h(AB)    2   2   1
h(AC)    1   0   1
h(BAC)   2   0   1

Query fingerprint: AB:1, AC:1, BAC:1
Candidates = {g1, g3}, which are passed to verification
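The filtering step can be sketched with the counts shown on this slide. The function name `filter_candidates` and the dict layout are assumptions made for illustration; a graph survives only if every query path occurs in it at least as often as in the query.

```python
def filter_candidates(index, query_counts, graph_ids):
    """Keep graphs whose count for every query path is >= the query's count."""
    candidates = []
    for gid in graph_ids:
        if all(index.get(path, {}).get(gid, 0) >= need
               for path, need in query_counts.items()):
            candidates.append(gid)
    return candidates

# Counts taken from the index table on the slide.
index = {
    "AB":  {"g1": 2, "g2": 2, "g3": 1},
    "AC":  {"g1": 1, "g2": 0, "g3": 1},
    "BAC": {"g1": 2, "g2": 0, "g3": 1},
}
query = {"AB": 1, "AC": 1, "BAC": 1}

candidates = filter_candidates(index, query, ["g1", "g2", "g3"])
print(candidates)  # ['g1', 'g3']
```

g2 is discarded because it contains no AC path; g1 and g3 still need a subgraph isomorphism test, since passing the filter does not guarantee a match.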

gIndex(1/6) [Yan et al., SIGMOD’04]

• The path-based approach has weak points
  – A path is too simple: structural information is lost
  – There are too many paths: the set of paths in a graph database is usually huge
• Solution
  – Use graph structures instead of paths as the basic index feature

[Figure: a sample database of graphs whose nodes are all labeled c, a query graph, and the paths in the query graph. The paths cannot filter any graphs in the database.]

gIndex(2/6): Frequent Fragment

• The number of graph structures is large, so index only frequent subgraphs
• support(g)
  – The number of graphs in D (the graph database) that contain g as a subgraph
• minSup
  – Minimum support threshold
  – Index a fragment g only if support(g) ≥ minSup
• Size-increasing support
  – The number of frequent fragments grows as fragment size increases
  – Use a low minSup for small fragments and a high minSup for large fragments
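A size-increasing threshold can be sketched as a function of fragment size. The step function below (minSup = 1 up to size 2, minSup = 2 beyond) mirrors the thresholds used in the example on the next slide; the fragment names, sizes, and supports are hypothetical.

```python
def min_sup(size):
    """Hypothetical size-increasing threshold: 1 for size <= 2, otherwise 2."""
    return 1 if size <= 2 else 2

def frequent_fragments(fragments):
    """fragments maps a name to (size, support); keep those meeting min_sup(size)."""
    return sorted(name for name, (size, support) in fragments.items()
                  if support >= min_sup(size))

# Hypothetical fragments: name -> (size in edges, support).
fragments = {
    "f_a": (1, 3),
    "f_b": (2, 1),
    "f_c": (3, 1),
    "f_d": (3, 2),
    "f_e": (4, 1),
}
frequent = frequent_fragments(fragments)
print(frequent)  # ['f_a', 'f_b', 'f_d']
```

With a single uniform threshold of 2, the small fragment `f_b` would be dropped as well; the size-dependent threshold keeps small fragments while pruning rare large ones.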

gIndex(3/6): Frequent Fragment

[Figure: a graph database over node labels A and B and its fragments of size 1 to 4, each annotated with its frequency F (between 1 and 4). The thresholds are minSup=1 for size 1 and size 2, and minSup=2 for size 3 and size 4.]

gIndex(4/6): Discriminative Fragment

• Redundant fragment
  – A fragment whose indexed graphs are already indexed by its subgraphs
  – Redundant fragments do not need to be included in the index
• Discriminative fragment
  – A fragment that is not redundant

[Figure: graphs g1 to g4 and fragments f1, f2, f3 of size 2 and size 3.]

Df1 = {g1, g2, g3}
Df2 = {g2, g3, g4}
Df3 = {g2, g3} = Df1 ∩ Df2
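The redundancy test in this example can be sketched as a set comparison. This follows the slide's definition only (a fragment is redundant when its ID list equals the intersection of its subfragments' ID lists); the function name `is_redundant` is an assumption, and any ratio-based relaxation of the test is left out.

```python
def is_redundant(fragment_ids, subfragment_id_lists):
    """True if the fragment's ID list equals the intersection of its subfragments' lists."""
    common = set.intersection(*(set(ids) for ids in subfragment_id_lists))
    return set(fragment_ids) == common

# ID lists from the slide.
D_f1 = {"g1", "g2", "g3"}
D_f2 = {"g2", "g3", "g4"}
D_f3 = {"g2", "g3"}

print(is_redundant(D_f3, [D_f1, D_f2]))    # True: f3 adds no filtering power
print(is_redundant({"g2"}, [D_f1, D_f2]))  # False: such a fragment would be discriminative
```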

gIndex(5/6): gIndex Tree

• Use a graph serialization method
  – For fast graph isomorphism checking during index search
  – DFS coding [Yan et al. ICDM’02]
  – Translates a graph into a unique edge sequence
• gIndex tree
  – Prefix tree which consists of the edge sequences of discriminative fragments
  – Records all size-n discriminative fragments in level n
  – Black nodes: discriminative fragments
    – Have ID lists: the IDs of graphs containing fi
  – White nodes: redundant fragments, kept for Apriori pruning

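[Figure: DFS coding. A graph with node labels X, Y, Z and edge labels a, b is assigned node order v0, v1, v2, v3 and translated into the edge sequence <(v0,v1),(v1,v2),(v2,v0),(v1,v3)>.]

[Figure: gIndex tree with levels 0, 1, 2, …; edges e1, e2, e3 lead to fragment nodes f1, f2, f3.]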
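A prefix tree over edge sequences can be sketched as below. The class and function names are assumptions, and the edge labels e1, e2, e3 are placeholders for entries of a DFS code; computing the canonical DFS code itself is not shown. A node whose `graph_ids` is `None` plays the role of a white (redundant) node.

```python
class Node:
    def __init__(self):
        self.children = {}      # next edge in the sequence -> child node
        self.graph_ids = None   # None marks a redundant (white) node

def insert(root, edge_seq, graph_ids=None):
    """Insert an edge sequence; pass graph_ids only for discriminative fragments."""
    node = root
    for edge in edge_seq:
        node = node.children.setdefault(edge, Node())
    if graph_ids is not None:
        node.graph_ids = set(graph_ids)

def lookup(root, edge_seq):
    """Return the node for edge_seq, or None if the sequence is not in the tree."""
    node = root
    for edge in edge_seq:
        node = node.children.get(edge)
        if node is None:
            return None
    return node

root = Node()
insert(root, ("e1",), {"g1", "g2", "g3"})   # discriminative, level 1
insert(root, ("e1", "e2"))                  # redundant, level 2
insert(root, ("e1", "e2", "e3"), {"g2"})    # discriminative, level 3

print(lookup(root, ("e1", "e2", "e3")).graph_ids)  # {'g2'}
print(lookup(root, ("e2",)))                        # None
```

Because each level of the tree corresponds to one more edge, a fragment with n edges sits at depth n, which matches the slide's "size-n fragments in level n".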

gIndex(6/6): Searching

• Searching process
  – Given a query q, enumerate all of q’s fragments (size <= maxSize)
  – Locate the fragments in the gIndex tree
  – Intersect the ID lists associated with the fragments
• Apriori pruning
  – Generating every fragment is inefficient
  – If a fragment is not in the gIndex tree, its supergraphs need not be checked
  – Redundant fragments need to be recorded for Apriori pruning

[Figure: gIndex tree (levels 0, 1, 2, …) searched with the query <e1, e2, e3, e4, e5>. Fragments are generated in the order <e1>, <e1, e2>, <e1, e2, e3>, <e1, e2, e3, e4> (stop), <e2>, …]
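The search loop with Apriori pruning can be sketched as follows. This is a strong simplification, stated as an assumption: fragments are modeled as contiguous runs of the query's edge sequence, whereas a real implementation enumerates connected subgraphs and canonicalizes them with DFS codes. The index is a plain dict from edge-sequence tuples to ID sets, with `None` for redundant fragments; all index contents are hypothetical.

```python
def search(index, query_edges, graph_ids, max_size):
    """Intersect ID lists of indexed query fragments, pruning unindexed extensions."""
    candidates = set(graph_ids)
    n = len(query_edges)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_size) + 1):
            fragment = tuple(query_edges[start:end])
            if fragment not in index:
                break  # Apriori pruning: no larger fragment with this prefix is indexed
            ids = index[fragment]
            if ids is not None:  # redundant fragments carry no ID list
                candidates &= ids
    return candidates

# Hypothetical index contents.
index = {
    ("e1",): {"g1", "g2", "g3"},
    ("e1", "e2"): None,                # redundant, kept so the walk can continue
    ("e1", "e2", "e3"): {"g2", "g3"},
    ("e2",): {"g1", "g2", "g3", "g4"},
}
query = ["e1", "e2", "e3", "e4", "e5"]
result = search(index, query, {"g1", "g2", "g3", "g4"}, max_size=4)
print(sorted(result))  # ['g2', 'g3']
```

If the redundant entry `("e1", "e2")` were dropped from the index, the walk would stop there and never reach `("e1", "e2", "e3")`, which is why redundant fragments must be recorded.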

Grafil(1/4) [Yan et al., SIGMOD’05]

• Subgraph similarity search
• Feature-based approach
• Similarity search using relaxed queries
  – Relax a query by deleting k edges
  – Missed edges incur missed features
• Main question
  – What is the maximum number of missed features when relaxing a query with k missed edges?

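[Figure: each database graph G1, G2, …, Gn has a feature vector {u1, u2, …, un}, and the query has a feature vector {v1, v2, …, vn}. The vectors are compared for both subgraph exact search and subgraph similarity search.]

Subgraph exact search: for 1 ≤ i ≤ n, u_i ≥ v_i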
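The two feature-vector tests can be sketched side by side. The exact-search condition is the one on the slide; for similarity search, the sketch assumes a graph passes when its total number of missed feature occurrences, summed as max(0, v_i − u_i), does not exceed a bound `m_max`. Function names and the example vectors are illustrative.

```python
def passes_exact(u, v):
    """Exact search filter: every feature occurs in the graph at least as often as in the query."""
    return all(ui >= vi for ui, vi in zip(u, v))

def feature_misses(u, v):
    """Total number of query feature occurrences that the graph lacks."""
    return sum(max(0, vi - ui) for ui, vi in zip(u, v))

def passes_similarity(u, v, m_max):
    """Similarity filter: keep the graph if it misses at most m_max feature occurrences."""
    return feature_misses(u, v) <= m_max

v = [3, 4]   # query feature counts (illustrative)
u = [0, 4]   # graph feature counts (illustrative)

print(passes_exact(u, v))          # False
print(feature_misses(u, v))        # 3
print(passes_similarity(u, v, 4))  # True
```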

Grafil(2/4): Feature Misses

[Figure: a query and its three relaxed queries, each missing 1 edge.]

                 fa  fb  fc  Features  Feature misses
Query            1   2   4   7
Relaxed query 1  1   0   3   4         7 - 4 = 3
Relaxed query 2  0   1   2   3         7 - 3 = 4
Relaxed query 3  0   1   2   3         7 - 3 = 4

Maximum feature misses: mmax = 4

Grafil(3/4): Feature Miss Estimation

• Problem
  – Given a query Q and a set of features contained in Q, if the relaxation ratio is given, what is the maximal number of features that can be missed?
• Use the edge-feature matrix
  – Find the maximum number of columns that can be hit by k rows
  – k: the number of missing edges in Q
• Classic maximum coverage problem (set k-cover)
  – Proved NP-complete

[Figure: a query with edges e1, e2, e3 and its features fa, fb, fc.]

Edge-Feature Matrix
     fa  fb1  fb2  fc1  fc2  fc3  fc4
e1   0   1    1    1    0    0    0
e2   1   1    0    0    1    0    1
e3   1   0    1    0    0    1    1
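For a query this small, the bound can be computed by exhaustive search over the edge-feature matrix above. The sketch below is a brute-force illustration, not the estimation method of the Grafil paper: it tries every set of k rows and counts the columns hit by at least one of them. The function name is an assumption.

```python
from itertools import combinations

def max_feature_misses(matrix, k):
    """Largest number of columns hit by any choice of k rows (exhaustive search)."""
    n_cols = len(next(iter(matrix.values())))
    best = 0
    for rows in combinations(matrix, k):
        hit = sum(1 for c in range(n_cols) if any(matrix[r][c] for r in rows))
        best = max(best, hit)
    return best

# Rows are edges, columns are fa, fb1, fb2, fc1, fc2, fc3, fc4 (from the slide).
matrix = {
    "e1": [0, 1, 1, 1, 0, 0, 0],
    "e2": [1, 1, 0, 0, 1, 0, 1],
    "e3": [1, 0, 1, 0, 0, 1, 1],
}
print(max_feature_misses(matrix, 1))  # 4
print(max_feature_misses(matrix, 2))  # 6
```

The number of row subsets grows combinatorially with the number of edges, so exhaustive search is only practical for tiny queries; this is where the NP-completeness noted above becomes a practical concern.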

Grafil(4/4): Feature Conjugation

• The misses of one feature can be compensated by occurrences of other features in G
• Using all the features together in one filter would deteriorate the filtering performance
• Solution
  – Use multiple filters
  – Feature set selection

[Figure: a query whose features fa and fb occur 3 and 4 times, matched against a graph with relaxation ratio = 1 and mmax = 4. The feature misses are (3 - 0) + 0 = 3 ≤ mmax.]



References

• [Shasha et al., PODS’02] Dennis Shasha, Jason T. L. Wang, Rosalba Giugno. Algorithmics and Applications of Tree and Graph Searching. PODS, 2002.
• [Yan et al., SIGMOD’04] Xifeng Yan, Philip S. Yu, Jiawei Han. Graph Indexing: A Frequent Structure-based Approach. SIGMOD, 2004.
• [Yan et al., SIGMOD’05] Xifeng Yan, Philip S. Yu, Jiawei Han. Substructure Similarity Search in Graph Databases. SIGMOD, 2005.
• [Tian and Patel, ICDE’08] Yuanyuan Tian, Jignesh M. Patel. TALE: A Tool for Approximate Large Graph Matching. ICDE, 2008.
• [He and Singh, SIGMOD’08] Huahai He, Ambuj K. Singh. Graphs-at-a-time: Query Language and Access Methods for Graph Databases. SIGMOD, 2008.
• [Zhao and Han, VLDB’10] Peixiang Zhao, Jiawei Han. On Graph Query Optimization in Large Networks. VLDB, 2010.
• [He and Singh, ICDE’06] Huahai He, Ambuj K. Singh. Closure-Tree: An Index Structure for Graph Queries. ICDE, 2006.
• [Shang et al., VLDB’08] Haichuan Shang, Ying Zhang, Xuemin Lin, Jeffrey Xu Yu. Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism. VLDB, 2008.
