algorithmics and applications of tree and graph searching dennis shasha, jason t. l. wang, and...

38
Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Rosalba Giugno Presenters: Presenters: Jerod Watson & Christan Grant Jerod Watson & Christan Grant

Upload: alvin-patterson

Post on 22-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Algorithmics and Applications of Tree and Graph Searching

Dennis Shasha, Jason T. L. Wang, and Rosalba Dennis Shasha, Jason T. L. Wang, and Rosalba GiugnoGiugno

Presenters:Presenters:Jerod Watson & Christan GrantJerod Watson & Christan Grant

Page 2: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Introduction Searching in Trees

Approximate Containment Queries Path-Only Searches Extension to Trees

Searching in Graphs Keygraph Searching in Graph DBs GraphGrep Subgraph Matching

Conclusion

Page 3: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Introduction

Modern search engines Keyword-based queries Impressive speed

Several research efforts have attempted to generalize keyword search to keytree and keygraph searching

Page 4: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

XQuery

Page 5: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

AQUA Query

Page 6: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Query expressed as a tree pattern, termed “query tree”

DB can be represented as single tree or as set of trees

Each tree could be ordered or unordered

Queries often concerned with the parent-child, ancestor-descendant”, or path relationship among nodes

Queries can be expressed by containment mapping.

Page 7: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Query tree may contain fixed length don’t cares (FLDCs) ex. “?”

Query tree may contain variable length don’t cares (VLDCs) ex. “*”

This class of queries referred to as approximate containment (AC) queries

Page 8: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Path-Only Searches

Many AC queries are concerned with paths only. Ex. “Find the descendants of Mary who

is a child of John”

XISS is an indexing and querying system designed to support regular path expressions

Page 9: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Extension to Trees

Pathfix algorithm Phase 1: Encodes each root-to-leaf path

of every data tree into a suffix array DB Phase 2: Compares the query tree Q

with each data tree D in the DB allowing a difference of DIFF

Page 10: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Handling Don’t Cares Partition query into connected subtrees

having don’t cares Match each of those don’t care free

subtrees with data trees in the DB For the matched subtrees that belong to

the same data tree, determine whether they combine to match the query based on the matching semantics of the don’t cares.

Filtering

Page 11: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Implementation ATreeGrep

Page 12: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant
Page 13: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphsGraphs

Page 14: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Graphs

Abstract data type of elements (nodes or vertices) interconnected by edges.

A graph is a specialized tree in which there is no constraint on the number of paths is possible from a node

No root Graph may contain cycles

Page 15: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Keygraph Searching

Searching for a particular graph or order of elements inside of a large graph (i.e. internet)

Searching for a particular graph or structure among several graphs (i.e. chemical elements)

Use indexing to reduce complexity

Page 16: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Keygraph Searching

Three basic steps1. Reduce the search space by filtering2. Formulate query into simple

structures3. Match

Page 17: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Keygraph Searching (survey) A* algorithm GraphDB Daylight Lore

Page 18: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

A*

Seminal work by Nilson (1980) Route finding algorithm that keeps track of its

visited nodes and the distance it has traveled.  Applications:

        Protein databases (discovery and search)        Image databases        Chinese character databases        CAD circuit data and software source code

Page 19: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

A*

Pseudocode        function A*(start,goal)            var closed := the empty set            var q := make_queue(path(start))            while q is not empty                var p := remove_first(q)                 var x := the last node of p                if x in closed                    continue                if x = goal                    return p                add x to closed                foreach y in successors(x)                     enqueue(q, p, y)            return failure

Page 20: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphDB

Specifies a data model and query model.

1. Queries are in the form of regular expressions       

2. Nodes are classes representing data objects   

3. Edges are classes to store paths in the database      

4. Path classes are and indexing data structures are used to index database

Provides graph and search operations to:       

Shortest path between two nodes       

Subgraphs from a starting node and range

Page 21: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphDB

Page 22: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Daylight

"Provide the best known computer algorithms for chemical information processing to those who need them."

Uses finger printing to index/prune

Page 23: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

ChemDBChemDB(Contains 6.5 million unique structures or subgraphs)

Page 24: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Lore

Database management system for XML Modeled using rooted labeled subgraph Indexed in four ways for fast regular

expression use Vindex, Tindex, Lindex, Pindex(Data Guide)

Page 25: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Lore

1) Vindex: For each edge labeled l, all nodes are index with incomming edges labeled l and some unique atomic value that satisfy some condition.

2) Tindex: A text index for all nodes with l-labeled edges a with a string of specific values containing specific words       

3) Lindex: Link index to index nodes with outgoing l-labled edges      

4) Pindex (DataGuide): indexes all nodes reachable from root through labled path.

The DataGuide is used by all queries from root.  Other queries traverse paths using indexs(1-3), pruning

what is not a match.

Page 26: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Tindex (1999)

A Data structure to index semistructured database nodes that are reachable from several regular path expressions    

T-index may be more efficient than P-index because it relaxes some constraints

Reportedly in graph of size 1500 T-index is 13% of database

Page 27: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphGrep

Uses variable length paths (cyclic or acyclic) to index DB.  This provides for efficient filtering.

Nodes have ids (numbers) and labels (letters).    

Page 28: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphGrep

Index Construction1. Choose an lp max indexing length

2. Create “path-representation”3. Create fingerprint

Page 29: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphGrep

Filtering the Database1. Query graph is parsed and a

fingerprint built2. Fingerprint are compared

1. If a graph has at least one value in its fingerprint that is less than the query fingerprint it is discarded.

2. Remaining graphs may contain > 1 sub graphs

Page 30: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphGrep

Filtering the Database Takes linear time to the size of the

database But discards 99% of database!!!

Page 31: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphGrep

Finding Subgraphs Matching with Queries Query tree depth first traversal

branches are decomposed into sequences of overlapping label-paths (patterns)

Page 32: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphGrep

Overlaps1. Last node in a patters coincides with

first node of next pattern (e.g. ABCB (lp = 3) ABC CB)

2. If a node has branches, it is included in the first pattern of every branch

3. The first node in a cycle is visited twice

Page 33: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphGrep

Matching Example1. Select the set of paths2. Combine lists with constraints3. Remove lists with equal id nodes in

non overlapping positions

Page 34: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphGrep

Techniques for Queries with Wildcards Consider the parts of the query graph that

is between wild cards (like pathfix) The cartesian product of the components

that match are valid. An entry in the cartesian product is a valid

path (length = wildcards) between nodes.

Page 35: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphGrep

1 GHz pentium III NCI databases (1,000 – 16,000

nodes) Average 20 nodes in db (max 270

nodes) Queries 13-189 nodes Lp = 4 and 10

Page 36: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

GraphGrep

Linear in size of DB Different lp influence running time

Page 37: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Conclusions / Questions

Searching in Trees Introduces ATreeGrep

Searching in GraphsIntroduces GraphGrep

Page 38: Algorithmics and Applications of Tree and Graph Searching Dennis Shasha, Jason T. L. Wang, and Rosalba Giugno Presenters: Jerod Watson & Christan Grant

Thanks to:

God Class Wikipedia Various other Googled sources