xrank: ranked keyword search over xml documents

XRANK: Ranked Keyword Search over XML Documents. Paper by Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram. Presentation by Meghana Kshirsagar and Nitin Gupta, Indian Institute of Technology, Bombay.



Page 1: XRANK: Ranked Keyword Search over XML Documents

XRANK: Ranked Keyword Search over XML Documents

Presentation by: Meghana Kshirsagar, Nitin Gupta

Indian Institute of Technology, Bombay

Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram

Page 2: XRANK: Ranked Keyword Search over XML Documents

Outline

● Motivation
● Problem Definition, Query Semantics
● Ranking Function
● A New Data Structure – Dewey Inverted List (DIL)
● Algorithms
● Performance Evaluation

Page 3: XRANK: Ranked Keyword Search over XML Documents

Motivation

Page 4: XRANK: Ranked Keyword Search over XML Documents

Motivation - I

● Why do we need search over XML data?

● Why not use search techniques used on WWW (keyword search on HTML)?

Page 5: XRANK: Ranked Keyword Search over XML Documents

Motivation - II
Keyword Search: XML vs. HTML

HTML

● Structure
– Links: document-to-document
– Tags: format specifiers
● Ranking
– Result: document
– Page-level ranking
– Proximity: width (distance between words)

XML

● Structure
– Links: IDREFs and XLinks
– Tags: content specifiers
● Ranking
– Result: XML element (a tree)
– Element-level ranking
– Proximity: width and height

Page 6: XRANK: Ranked Keyword Search over XML Documents

Problem Definition, Query Semantics, and Ranking

Page 7: XRANK: Ranked Keyword Search over XML Documents

Problem Definition

● Input: Set of keywords

● Output: Ranked XML elements

What is a result? How to rank results ?

Page 8: XRANK: Ranked Keyword Search over XML Documents

Bird's eye view of the system

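[Figure: system overview. Query keywords go to the Query Evaluator, which returns results using the data structures (DIL); the DIL is built from the XML document repository after preprocessing (ElemRank computation).]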

Page 9: XRANK: Ranked Keyword Search over XML Documents

What is a result?

● A minimal Steiner tree of XML elements

● The result set is the set of XML elements that contain every query keyword at least once, after excluding the keyword occurrences inside contained elements that are themselves results (if any).

Page 10: XRANK: Ranked Keyword Search over XML Documents

result 1

result 2

Page 11: XRANK: Ranked Keyword Search over XML Documents

Result: Graphical representation

containment edge

descendant

ancestor

Page 12: XRANK: Ranked Keyword Search over XML Documents

Ranking: Which results to return first?

Properties:

The ranking function should
● reflect result specificity
● consider keyword proximity
● be hyperlink aware

Ranking function: f (height, width, link-structure)

Page 13: XRANK: Ranked Keyword Search over XML Documents

Less specific result

More specific result

Page 14: XRANK: Ranked Keyword Search over XML Documents

Ranking Function

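For a single XML element (node):

$$r(v_1, k_i) = \mathrm{ElemRank}(v_t) \times \mathrm{decay}^{\,t-1}$$

where $v_1$ is the result element, $v_t$ is the descendant of $v_1$ that directly contains keyword $k_i$, and $v_1, \dots, v_t$ is the containment path between them.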
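To make the effect of the decay term concrete, here is a minimal Python sketch of the per-keyword rank. The function name `keyword_rank` and the default decay value are illustrative assumptions, not taken from the slides; only the shape of the formula (ElemRank of the element that directly contains the keyword, scaled by decay raised to the path length minus one) comes from the slide.

```python
def keyword_rank(elem_rank_vt, t, decay=0.75):
    """Rank of result element v1 with respect to one keyword occurrence.

    elem_rank_vt: ElemRank of v_t, the element directly containing the keyword.
    t: number of elements on the containment path v1 .. vt (t = 1 means v1 == vt).
    decay: value in (0, 1); the default here is a hypothetical choice.
    """
    if t < 1:
        raise ValueError("the path v1 .. vt has at least one element")
    return elem_rank_vt * decay ** (t - 1)


# The same keyword occurrence is worth less to a less specific (higher) ancestor.
specific = keyword_rank(0.4, t=1)
general = keyword_rank(0.4, t=3)
```

With any decay below 1, `general` is smaller than `specific`, which is how the function rewards result specificity.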

Page 15: XRANK: Ranked Keyword Search over XML Documents

Ranking Function

Combining ranks in case of multiple occurrences:

Overall Rank:

Page 16: XRANK: Ranked Keyword Search over XML Documents

Semantics of the ranking function

$$r(v_1, k_i) = \mathrm{ElemRank}(v_t) \times \mathrm{decay}^{\,t-1}$$

Specificity (height)

Proximity

Link structure

Page 17: XRANK: Ranked Keyword Search over XML Documents

ElemRank Computation – adopt PageRank??

● PageRank

● Short-comings:

Fails to capture:
✗ bidirectional transfer of “ElemRanks”
✗ discrimination between edge-types (containment and hyperlink)
✗ aggregation of “ElemRanks” for reverse containment relationships

Page 18: XRANK: Ranked Keyword Search over XML Documents

ElemRank Computation - I

● Ne = total # of XML elements

● Nh(u) = # hyperlinks from 'u'

● Nc(u) = # children of 'u'

● E = HE ∪ CE ∪ CE'

● CE' = reverse containment edges

● Consider both forward and reverse ElemRank propagation.

Page 19: XRANK: Ranked Keyword Search over XML Documents

ElemRank Computation - II

● Separate containment and hyperlink edges

● CE = containment edges

● HE = hyperlink edges

● ElemRank (sub-elements) ∝ 1 / (# sibling sub-elements)

Page 20: XRANK: Ranked Keyword Search over XML Documents

ElemRank Computation - III

● Sum over the reverse-containment edges, instead of distributing the weight

● Nd = total # XML documents

● Nde(v) = # elements in the XML doc containing v

● ElemRank (parent) ∝ Sum (ElemRank (sub-elements))
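The three refinements can be put together in a short iterative sketch. The formula itself is not reproduced in this transcript, so the update rule below is an assumption based on the quantities the slides define: a base term divided by Nd × Nde(v), hyperlink rank split over Nh(u), forward containment rank split over Nc(u), and reverse containment rank summed without splitting. The function name, the dictionary-based inputs, and the fixed iteration count are all illustrative choices.

```python
def elem_rank(elements, doc_of, children, hyperlinks,
              d1=0.35, d2=0.25, d3=0.25, iterations=50):
    """Iteratively compute ElemRank-style scores (sketch under stated assumptions).

    elements:   list of element ids
    doc_of:     element id -> document id
    children:   parent id -> list of child ids (containment edges CE)
    hyperlinks: source id -> list of target ids (hyperlink edges HE)
    d1, d2, d3: weights for hyperlink, forward and reverse containment edges
    """
    n_docs = len(set(doc_of.values()))          # Nd
    doc_size = {}                               # Nde, per document
    for e in elements:
        doc_size[doc_of[e]] = doc_size.get(doc_of[e], 0) + 1

    rank = {e: 1.0 / len(elements) for e in elements}
    for _ in range(iterations):
        new = {e: (1 - d1 - d2 - d3) / (n_docs * doc_size[doc_of[e]])
               for e in elements}
        for u in elements:
            links = hyperlinks.get(u, ())
            for v in links:
                new[v] += d1 * rank[u] / len(links)   # split over Nh(u)
            kids = children.get(u, ())
            for v in kids:
                new[v] += d2 * rank[u] / len(kids)    # split over Nc(u)
                new[u] += d3 * rank[v]                # summed, not split
        rank = new
    return rank


ranks = elem_rank(
    elements=["root", "a", "b"],
    doc_of={"root": "d", "a": "d", "b": "d"},
    children={"root": ["a", "b"]},
    hyperlinks={"a": ["b"]},
)
```

In this toy document, `b` ends up ranked above its sibling `a` because it additionally receives rank over the hyperlink from `a`.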

Page 21: XRANK: Ranked Keyword Search over XML Documents

Datastructures and Algorithms

Page 22: XRANK: Ranked Keyword Search over XML Documents

Naïve Algorithm

Approach:
● XML element ~ doc
● Use “keyword search on WWW”

Limitations:
● Space overhead (in inverted indices)
● Failure to model hierarchical relationships (ancestor~descendant)
● Inaccurate ranking

Need a new data structure which can model hierarchical relationships!

Answer: Dewey Inverted Lists

Page 23: XRANK: Ranked Keyword Search over XML Documents

Labeling nodes using Dewey Ids

Page 24: XRANK: Ranked Keyword Search over XML Documents

Dewey Inverted Lists

● One entry per keyword

● Entry for keyword 'k' has Dewey-IDs of elements directly containing 'k'

A simple equi merge-join of Dewey-ID lists won't work! Need to compute prefixes.
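As an illustration of the idea, the following Python sketch builds a Dewey inverted list for one small document. The numbering convention (the first component is a document id, children are numbered from 0) and the naive whitespace tokenisation are assumptions made for this example; ranks and position lists, which the real lists also carry, are left out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dil(xml_text, doc_id=0):
    """Map each keyword to the sorted Dewey IDs of elements directly containing it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)

    def visit(elem, dewey):
        # Text directly inside elem: its own text plus the tails of its children.
        words = (elem.text or "").split()
        for child in elem:
            words.extend((child.tail or "").split())
        for word in {w.lower() for w in words}:
            index[word].append(dewey)
        for pos, child in enumerate(elem):
            visit(child, dewey + (pos,))

    visit(root, (doc_id,))
    return {word: sorted(ids) for word, ids in index.items()}


dil = build_dil(
    "<paper><title>XQL and XML</title><author>Ricardo</author></paper>",
    doc_id=5,
)
# dil["xql"] == [(5, 0)] and dil["ricardo"] == [(5, 1)]
```

The two keywords land in sibling elements with different Dewey IDs, so an equality join finds nothing; the element containing both, `(5,)`, only shows up as their common prefix.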

Page 25: XRANK: Ranked Keyword Search over XML Documents

System Architecture

Page 26: XRANK: Ranked Keyword Search over XML Documents

DIL : Query Processing

● Simple equality merge-join will not work

● Need to find LCP (longest common prefix) over all elements with query keyword-match.

● Single pass over the inverted lists suffices!

● Compute LCP while merging the ILs of individual keywords.

● ILs are sorted on Dewey-IDs

Page 27: XRANK: Ranked Keyword Search over XML Documents

Datastructures

● Array of all inverted lists : invertedList[]

● invertedList[i] for keyword 'i'● each invertedList[i] is sorted on Dewey-ID

● Heap to maintain top-m results : resultHeap

● Stack to store current Dewey-ID, ranks, position List, longest common prefixes : deweyStack

Page 28: XRANK: Ranked Keyword Search over XML Documents

Algorithm on DILs - Abstract

While all inverted lists are not processed:

● Read the next entry from the DIL having the smallest Dewey-ID
– call this 'currentEntry'

● Find the longest common prefix (lcp) between the stack components and the entry read from the DIL
– lcp (deweyStack, currentEntry)

● Pop non-matching entries from deweyStack; add a result to the heap if appropriate
– check if the current top-of-stack contains all keywords
– if yes, compute OverallRank and put this result onto the heap
– else, non-matching entries are popped one component at a time, updating (rank, posList) on each pop

● Push the non-matching part of 'currentEntry' onto 'deweyStack'
– non-matching components of 'currentEntry.deweyID' are pushed onto the stack

● Update the components of the top entry of deweyStack
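The control flow of this loop can be sketched in Python as follows. This is a simplified version under stated assumptions: it tracks only which keywords each stack frame has seen, it omits the rank and position-list bookkeeping and the top-m heap, and it returns plain Dewey IDs of results. The function name and the dictionary input format are hypothetical.

```python
import heapq


def dil_results(inverted_lists):
    """Single-pass merge over Dewey-sorted inverted lists.

    inverted_lists: keyword -> list of Dewey IDs (tuples), each list sorted.
    Returns Dewey IDs of elements containing all keywords, ignoring keyword
    occurrences inside descendants that are already results.
    """
    keywords = list(inverted_lists)
    streams = [[(dewey, i) for dewey in inverted_lists[k]]
               for i, k in enumerate(keywords)]
    stack = []      # frames: [dewey_component, set_of_keyword_indexes_seen]
    results = []

    def pop_frame():
        component, seen = stack.pop()
        if len(seen) == len(keywords):
            # Complete result: report it and do not pass its keywords upward.
            results.append(tuple(f[0] for f in stack) + (component,))
        elif stack:
            stack[-1][1] |= seen    # partial match: propagate to the parent

    for dewey, kw in heapq.merge(*streams):     # smallest Dewey ID first
        lcp = 0
        while (lcp < min(len(stack), len(dewey))
               and stack[lcp][0] == dewey[lcp]):
            lcp += 1
        while len(stack) > lcp:                 # pop non-matching components
            pop_frame()
        for component in dewey[lcp:]:           # push non-matching part
            stack.append([component, set()])
        stack[-1][1].add(kw)                    # update the top entry
    while stack:
        pop_frame()
    return results


found = dil_results({
    "xql": [(5, 0, 3, 0, 0), (6, 0, 3, 8, 3)],
    "ricardo": [(5, 0, 3, 0, 1)],
})
# found == [(5, 0, 3, 0)]
```

The Dewey IDs in the usage example are the ones that appear as "Smallest ID" in the trace slides; which keyword each belongs to is assumed here for illustration.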

Page 29: XRANK: Ranked Keyword Search over XML Documents

ExampleQuery: “XQL Ricardo”

Page 30: XRANK: Ranked Keyword Search over XML Documents

Algorithm Trace – Step 1

Rank[i] = Rank due to keyword 'i'
PosList[i] = List of occurrences of keyword 'i'

DIL: invertedList[] DeweyStack

push all components and find rank, posList

Smallest ID: 5.0.3.0.0

Page 31: XRANK: Ranked Keyword Search over XML Documents

Algorithm Trace – Step 2

DIL: invertedList[] DeweyStack

find lcp and pop non-matching components

Smallest ID: 5.0.3.0.1

Page 32: XRANK: Ranked Keyword Search over XML Documents

Algorithm Trace – Step 3

DIL: invertedList[] DeweyStack

updated rank, posList

Smallest ID: 5.0.3.0.1

Page 33: XRANK: Ranked Keyword Search over XML Documents

Algorithm Trace – Step 4

DIL: invertedList[] DeweyStack

push non-matching components

Smallest ID: 5.0.3.0.1

Page 34: XRANK: Ranked Keyword Search over XML Documents

Algorithm Trace – Step 5

DIL: invertedList[] DeweyStack

find lcp, update, finally pop all components

Smallest ID: 6.0.3.8.3

Page 35: XRANK: Ranked Keyword Search over XML Documents

Problems with DIL

● Scans the entire inverted-list for all keywords before a result is output

● Very inefficient for top-k computation

Page 36: XRANK: Ranked Keyword Search over XML Documents

Other Techniques - RDIL

● Ranked Dewey Inverted List:
– For efficient top-k result computation
– IL is ordered by ElemRank
– Each IL has a B+ tree index on the Dewey-IDs

● Algorithm with RDIL uses a threshold

Page 37: XRANK: Ranked Keyword Search over XML Documents

Algorithm using RDIL (Abstract)

● Choose the next entry from one of the invertedList[] in a round-robin fashion
– say chosen IL = invertedList[i]
– d = top-ranked Dewey-ID from invertedList[i]

● Find the longest common prefix that contains all query-keywords
– Probe the B+ tree index of all other keyword ILs for the longest common prefix
– Claim: for each other query-keyword 'j',
  – d2 = smallest Dewey-ID in invertedList[j] that is not smaller than d
  – d3 = immediate predecessor of d2
  – lcp = max_prefix (lcp (d, d2), lcp (d, d3))

● Check if 'lcp' is a complete result

● Recompute 'threshold' = sum (ElemRank of last processed element in each query keyword IL)

● If (rank of top-k results on heap >= threshold) return;
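The probing claim can be illustrated with a small Python sketch in which a sorted list and `bisect` stand in for the B+ tree index. It relies on the observation that, in a Dewey-sorted list, the entry sharing the longest prefix with d is one of d's two neighbours in sort order. The function names are hypothetical, and reading d2 as the first entry not smaller than d is an assumption about the slide's wording.

```python
from bisect import bisect_left


def common_prefix(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def longest_prefix_with_list(d, sorted_ids):
    """Longest prefix d shares with any Dewey ID in a sorted list.

    Only the successor (d2) and its immediate predecessor (d3) are inspected.
    """
    pos = bisect_left(sorted_ids, d)
    neighbours = []
    if pos < len(sorted_ids):
        neighbours.append(sorted_ids[pos])        # d2
    if pos > 0:
        neighbours.append(sorted_ids[pos - 1])    # d3
    return max((common_prefix(d, c) for c in neighbours), key=len, default=())


other_list = [(5, 0, 1), (5, 0, 3, 0, 1), (6, 0, 3)]
prefix = longest_prefix_with_list((5, 0, 3, 0, 0), other_list)
# prefix == (5, 0, 3, 0)
```

Each probe costs one index lookup per other keyword, which is why this approach pays off only when few entries need to be examined before the threshold is met.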

Page 38: XRANK: Ranked Keyword Search over XML Documents

Performance of RDIL

● Works well for queries with highly correlated keywords

● However, it becomes equivalent to (actually worse than) DIL for totally uncorrelated keywords

● Need an intermediate technique

Page 39: XRANK: Ranked Keyword Search over XML Documents

HDIL

● Uses both DIL and RDIL

● Adaptive strategy:
– Start with RDIL
– Switch to DIL if performance is bad

● Performance?
– Estimated remaining time for RDIL = (m – r) * t / r
  ● t = time spent so far
  ● r = no. of results above threshold so far
  ● m = desired no. of results
– Estimated remaining time for DIL?
  ● No. of query-keywords is known
  ● Size of each IL is known
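The switching rule can be written down in a few lines of Python. The RDIL estimate is the formula from the slide; the DIL estimate is passed in as a number because the slide only says it can be derived from the number of keywords and the list sizes. Returning infinity when no result has been found yet is an assumption made for this sketch, as are the function names.

```python
def rdil_remaining_time(m, r, t):
    """Estimated remaining RDIL time: (m - r) * t / r.

    m: desired number of results
    r: number of results above the threshold so far
    t: time spent so far
    """
    if r == 0:
        return float("inf")     # assumption: no progress yet, no finite estimate
    return (m - r) * t / r


def should_switch_to_dil(m, r, t, dil_estimate):
    """Switch when RDIL is expected to take longer than a full DIL scan."""
    return rdil_remaining_time(m, r, t) > dil_estimate


remaining = rdil_remaining_time(m=10, r=2, t=4.0)   # (10 - 2) * 4.0 / 2 = 16.0
```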

Page 40: XRANK: Ranked Keyword Search over XML Documents

HDIL

● Data structures?

– Store full IL sorted on Dewey-ID

– Store small fraction of IL sorted on ElemRank

– Share the leaf level between IL and B+ tree (in RDIL)

– Overhead : top levels of B+ tree

Page 41: XRANK: Ranked Keyword Search over XML Documents

Updating the lists

● Updates are easy

● Insertion is very bad
– techniques from Tatarinov et al.
– we've seen a better technique in this course :) – OrdPath

Page 42: XRANK: Ranked Keyword Search over XML Documents

Evaluation

● Criteria:
– no. of query-keywords
– correlation between query-keywords
– desired no. of query results
– selectivity of keywords

● Setup:
– Datasets used: DBLP, XMark
– d1 = 0.35, d2 = 0.25, d3 = 0.25
– 2.8GHz Pentium IV + 1GB RAM + 80GB HDD

Page 43: XRANK: Ranked Keyword Search over XML Documents

Performance - 1

Page 44: XRANK: Ranked Keyword Search over XML Documents

Performance - 2

Page 45: XRANK: Ranked Keyword Search over XML Documents

Critique

● New datastructure (DIL) defined to represent hierarchical relationships accurately and efficiently.

● Hyperlinks and IDREFs are considered only while computing ElemRank. Not used while returning results.

● Only containment edges (ancestor-descendant) are considered while computing result trees.

● Works only on trees, can't handle graphs.

Page 48: XRANK: Ranked Keyword Search over XML Documents

The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents

Jens Graupmann, Ralf Schenkel, Gerhard Weikum
Max-Planck-Institut für Informatik

Presentation by: Nitin Gupta, Meghana Kshirsagar

Indian Institute of Technology Bombay

Page 49: XRANK: Ranked Keyword Search over XML Documents

Why another search engine?

● To cope with diversity in the structures and annotations of the data

● Ranked retrieval paradigm for producing relevance-ordered result lists rather than mere Boolean retrieval

● Shortcomings of current search engines, which lack:
– Concept awareness
– Context awareness (or link awareness)
– Abstraction awareness
– A query language

Page 50: XRANK: Ranked Keyword Search over XML Documents

Concept awareness

● Example: the query researcher max planck yields many results about researchers who work at institutes of the Max Planck Society

● A better formulation would be researcher person=“max planck”

● Objective attained by
– Transformation to XML
– Data annotation

Page 51: XRANK: Ranked Keyword Search over XML Documents

Concept awareness :: Transformation

<H1>Experiments</H1>

... Text1 ...

<H2>Settings</H2>

... Text2 ...

<H1> ...

<Experiments>

... Text1 ...

<Settings>

... Text2 ...

</Settings>

</Experiments>

...
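The heading-to-element transformation shown above can be sketched with Python's standard `html.parser`. This is only an illustration under simplifying assumptions: each heading is plain text, element names are derived by replacing non-word characters with underscores, all other HTML tags are ignored, and the output is a fragment that would still need a single wrapping root element. The class name `HeadingsToXml` is hypothetical.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import escape


class HeadingsToXml(HTMLParser):
    """Turn <h1>..<h6> headings into nested XML elements around the text below them."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.open_sections = []     # stack of (heading level, element name)
        self.heading_level = None   # set while inside a heading tag

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.heading_level = int(tag[1])

    def handle_endtag(self, tag):
        if re.fullmatch(r"h[1-6]", tag):
            self.heading_level = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.heading_level is None:
            self.parts.append(escape(text))     # ordinary content
            return
        # A new heading closes every open section at the same or a deeper level.
        while self.open_sections and self.open_sections[-1][0] >= self.heading_level:
            self.parts.append("</%s>" % self.open_sections.pop()[1])
        name = re.sub(r"\W+", "_", text)
        self.open_sections.append((self.heading_level, name))
        self.parts.append("<%s>" % name)

    def to_xml(self, html):
        self.feed(html)
        self.close()                            # flush any buffered trailing text
        while self.open_sections:
            self.parts.append("</%s>" % self.open_sections.pop()[1])
        return "".join(self.parts)


xml_fragment = HeadingsToXml().to_xml(
    "<H1>Experiments</H1> Text1 <H2>Settings</H2> Text2 <H1>Results</H1> Text3"
)
```

The second `<H1>` closes both the open `Settings` and `Experiments` elements before opening a new one, which gives the nesting shown on the slide.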

Page 52: XRANK: Ranked Keyword Search over XML Documents

Abstraction Awareness

● Example: Synonyms, Ontologies

● Is a connection to various encyclopedias or wikis possible?

● Objective attained by using
– Ontology service: provides quantified ontological information to the system
– Preprocessed information based on focused web crawls to estimate statistical correlations between the characteristic words of related concepts

Page 53: XRANK: Ranked Keyword Search over XML Documents

Context Awareness

● Query may not be answered by web search engines as no single web page may be a match

● Unlike usual navigation axes in XML, context should go beyond trees.

● Consider graph structure spanned by Xlink/XPointer references and href hyperlinks

● Objective attained by
– introduction of the concept of a SPHERE

Page 54: XRANK: Ranked Keyword Search over XML Documents

Context Awareness :: Sphere

● What is a sphere?

– The relevance of an element for a group of query conditions is determined not just by its own content, but also by the content of neighboring elements, including linked documents, in an environment of the element called its sphere.

Page 55: XRANK: Ranked Keyword Search over XML Documents

Query Language

● A query $S = (Q, J)$ consists of

– a set $Q = \{G_1, \dots, G_q\}$ of query groups

– a set $J = \{J_1, \dots, J_m\}$ of join conditions

● Each $Q_i$ consists of

– a set of keyword conditions $t_1, \dots, t_k$

– a set of concept-value conditions $c_1 = v_1, \dots, c_l = v_l$

● Each join has the form $Q_i.v = Q_j.w$ (exact) or $Q_i.v \sim Q_j.w$ (similarity)

Page 56: XRANK: Ranked Keyword Search over XML Documents

Query Language

● Example: German professors who teach database courses and have projects on XML

– P(professor, location=~Germany)

– C(course, ~databases)

– R(~project, ~XML)

● Example: Gothic and Romanic churches at the same location

– A(gothic, church)

– B(romanic, church)

– A.location = B.location

Page 57: XRANK: Ranked Keyword Search over XML Documents

Data Model

● Collection X = (D, L) of XML documents D together with a set L of (href, XPointer, or XLink) links between their elements

● Consider all attributes as elements: then the element-level graph $G_E(X) = (V_E(X), E_E(X))$ has the union of all the elements of the documents as nodes and undirected edges between them

● Each edge has a nonnegative weight

– 1 for parent-child; λ for links

● A distance function $\delta_X(x,y)$ computes the weight of a shortest path in $G_E(X)$ between x and y
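A minimal Python sketch of this data model follows: parent-child edges get weight 1, link edges get weight λ, and the distance is computed with Dijkstra's algorithm. The helper names `build_element_graph` and `element_distance`, the edge-list input format, and the value of λ used in the example are assumptions for illustration.

```python
import heapq
from math import inf


def build_element_graph(parent_child, links, lam):
    """Undirected element-level graph: weight 1 for parent-child, lam for links."""
    graph = {}
    for (x, y), w in [(e, 1.0) for e in parent_child] + [(e, lam) for e in links]:
        graph.setdefault(x, []).append((y, w))
        graph.setdefault(y, []).append((x, w))
    return graph


def element_distance(graph, source, target):
    """Weight of a shortest path between two elements (Dijkstra); inf if unreachable."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > best.get(u, inf):
            continue                      # stale heap entry
        for v, w in graph.get(u, ()):
            nd = d + w
            if nd < best.get(v, inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return inf


graph = build_element_graph(
    parent_child=[("doc1", "a"), ("a", "b"), ("doc2", "c")],
    links=[("b", "c")],
    lam=2.0,
)
distance = element_distance(graph, "doc1", "c")   # 1 + 1 + 2.0 = 4.0
```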

Page 58: XRANK: Ranked Keyword Search over XML Documents

Spheres and Query Groups

● Node score ns(n,t) is computed using the Okapi BM25 model

● Similarity condition ~K: compute the expansion exp(K) of the keyword. The node score is defined as

$$\max_{x \in \exp(K)} \mathrm{sim}(K,x) \cdot ns(n,x)$$

where sim(K,x) is the ontological similarity

● Concept value:

– ns(n, c=v) = 0 if name(n) ≠ c

– ns(n, c=v) = ns(n,v) otherwise

● Similarity concept value ~c = v: sim(name(n), c) * ns(n,v)

● This is insufficient

– in the presence of linked documents

– when content is spread over several elements

Page 59: XRANK: Ranked Keyword Search over XML Documents

Spheres and Query Groups

Sphere $S_d(n)$: set of nodes at distance d from node n

$$s_d(n,t) = \sum_{v \in S_d(n)} ns(v,t)$$

$$s(n,t) = \sum_{i} s_i(n,t) \cdot \alpha^i$$

Example with α = 0.5:

s(1,t) = 1 + 4*0.5 + 2*0.25 + 5*0.125 = 4.125

s(2,t) = 3 + 0*0.5 + 0*0.25 + 1*0.125 = 3.125

$$s(n, G) = \sum_j s(n,t_j) + \sum_j s(n, c_j = v_j)$$
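The sphere score can be sketched in Python as a breadth-first walk that adds each node's score damped by α raised to its distance. For simplicity this sketch assumes every edge has weight 1 (so link weights λ are ignored), and the small graph is made up so that the per-distance totals are 1, 4, 2 and 5, the same totals as in the first worked example. The function name and the dictionary inputs are hypothetical.

```python
from collections import deque


def sphere_score(graph, ns, n, alpha=0.5, max_dist=3):
    """s(n, t): sum over distances i of (total node score at distance i) * alpha**i.

    graph: node -> list of neighbours (unit-weight edges, an assumption here)
    ns:    node -> node score ns(v, t) for one condition t
    """
    dist = {n: 0}
    queue = deque([n])
    score = 0.0
    while queue:
        v = queue.popleft()
        score += ns.get(v, 0.0) * alpha ** dist[v]
        if dist[v] == max_dist:
            continue                      # do not look beyond the largest sphere
        for w in graph.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return score


# Made-up graph: totals per distance from "a" are 1, 3 + 1 = 4, 2 and 5.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"], "e": ["d"]}
ns = {"a": 1.0, "b": 3.0, "c": 1.0, "d": 2.0, "e": 5.0}
score = sphere_score(graph, ns, "a")      # 1 + 4*0.5 + 2*0.25 + 5*0.125
```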

Page 60: XRANK: Ranked Keyword Search over XML Documents

Spheres and Query Groups :: Ranking

Create a connection graph G(N) = (V(N), E(N))

Weight of an edge between x and y:

– 0 if x and y are not connected

– $1 / (\delta_X(x,y) + 1)$ otherwise

Compactness C(N) of a potential answer N is then the sum of the edge weights of a maximal spanning tree for G(N), and the score is given by:

$$s(N, S) = \beta \, C(N) + (1 - \beta) \sum_i s(n_i, G_i)$$
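As a sketch of how compactness and the final score fit together, the Python below builds the maximal spanning tree with Kruskal's algorithm over edge weights 1/(δ + 1) and then mixes compactness with the summed group scores. The function names, the pairwise-distance dictionary, and the value of β in the example are assumptions; pairs missing from the dictionary are treated as unconnected (weight 0).

```python
def compactness(nodes, distance):
    """Sum of edge weights of a maximal spanning tree of the connection graph.

    distance: (x, y) -> shortest-path distance between answer nodes x and y.
    """
    edges = sorted(
        ((1.0 / (d + 1.0), x, y) for (x, y), d in distance.items()),
        reverse=True,                     # heaviest edges first (Kruskal)
    )
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    total = 0.0
    for weight, x, y in edges:
        rx, ry = find(x), find(y)
        if rx != ry:                      # edge joins two components: keep it
            parent[rx] = ry
            total += weight
    return total


def answer_score(c, group_scores, beta):
    """s(N, S) = beta * C(N) + (1 - beta) * sum of per-group sphere scores."""
    return beta * c + (1.0 - beta) * sum(group_scores)


c = compactness(
    ["a", "b", "c"],
    {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 3.0},
)                                         # keeps the two 0.5 edges, drops the 0.25 edge
total_score = answer_score(c, [4.0, 2.0], beta=0.5)
```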

Page 61: XRANK: Ranked Keyword Search over XML Documents

Spheres and Query Groups :: Joins

New virtual links to form an extended collection X' = (D, L')

– Connect the elements that match the join

– Similarity join: for Qi.v ~ Qj.w, consider the sets N(v) (resp. N(w)) of elements that have name v (w) or contain v (w) in their content. For each pair x ∈ N(v), y ∈ N(w), add a link {x,y} with weight 1/csim(x,y)

Page 62: XRANK: Ranked Keyword Search over XML Documents

System Architecture

Content stored in inverted lists with corresponding tf*idf-style term statistics

Indexer stores with each element the corresponding Dewey encoding of its position within the document

Focused web crawls used to estimate statistical correlations between the characteristic words of related concepts. Current version uses Dice coefficient.
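For reference, the Dice coefficient between two terms can be computed from the sets of crawled documents that contain each of them. The Python sketch below shows the standard definition, 2|X ∩ Y| / (|X| + |Y|); how the system actually collects these sets is not described on the slide, so the set-of-document-ids input and the function name are assumptions.

```python
def dice_coefficient(docs_with_x, docs_with_y):
    """Dice coefficient of two terms, given the sets of documents containing each."""
    if not docs_with_x and not docs_with_y:
        return 0.0                        # neither term occurs anywhere
    shared = len(docs_with_x & docs_with_y)
    return 2.0 * shared / (len(docs_with_x) + len(docs_with_y))


correlation = dice_coefficient({1, 2, 3}, {2, 3, 4})   # 2 * 2 / (3 + 3)
```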

Page 63: XRANK: Ranked Keyword Search over XML Documents

Query Processor

1. First compute a result list for each query group

2. Add virtual links for join conditions

3. Compute the compactness of a subset of all potential answers of the query in order to return the top-k results

1. Compute a list of results for each of query keywords and concept-value conditions.

2. Candidate nodes: Nodes that are at distance at most D from any node that occurs in at least one of the lists. Sphere score is computed only for these nodes since only these can have a non-zero score!

3. For each candidate node N, look up the node scores of the nodes in the sphere of N and add these scores with a proper damping factor.

Page 64: XRANK: Ranked Keyword Search over XML Documents

Query Processor

● Virtual links: the processor considers only a limited set of possible end points for efficient computation

● Nodes in the spheres up to distance D around nodes with nonzero sphere score for any query group

– Why? Any other node will have distance at least D+1 to any result node and thus contributes at most 1/((D+1)+1) to the compactness, which is negligible

– This set of candidate nodes can be computed on the fly

● Set further reduced by testing join attributes; for example, A.x = B.y results in two sets of potential end points.

Page 65: XRANK: Ranked Keyword Search over XML Documents

Query Processor

● Generating answers

– Naïve method: generate all possible potential answers from the answers to query groups, compute connection graphs and compactness, and finally their score

– For top-k answers, use Fagin's Threshold Algorithm with sorted lists only

  ● Input: sorted lists of node scores and pairwise node scores (edges)

  ● Output: k potential answers with the best scores

Page 66: XRANK: Ranked Keyword Search over XML Documents

Experiments

● Sun V40z, 16GB RAM, Windows 2003 Server, Tomcat 4.0.6 environment, Oracle 10g database

● Benchmarks: XMach, XMark, INEX, TREC

– Designed for XQuery-style exact match; semantically poor tags

– Does not consider XML at all

Wikipedia Collection from the Wikipedia project: HTML Collection transformed into XML and annotated

Wikipedia++ Collection: Extension of Wikipedia with IMDB data, with generated XML files for each movie and actor

DBLP++ Collection: Based on the DBLP project which indexes more than 480,000 publications

INEX: Set of 12,107 XML documents, a set of queries with and without structural constraints

Page 67: XRANK: Ranked Keyword Search over XML Documents

Experiments

Conversion from HTML to XML

Dataset Statistics

Page 68: XRANK: Ranked Keyword Search over XML Documents

Experiments

● SSE-basic: basic version limited to keyword conditions using sphere-based scoring

● SSE-CV: basic version plus concept-value conditions

● SSE-QC: CV version plus query groups (full context awareness)

● SSE-Join: full version with all features

● SSE-KW: very restricted version with simple keyword search

● GoogleWiki: Google search restricted to Wikipedia.org

● Google~Wiki: Google on wikipedia.org with Google's ~ operator for query expansion

● GoogleWeb: Google search on the entire web

● Google~Web: Google search on the entire web with query expansion

Page 69: XRANK: Ranked Keyword Search over XML Documents

Experiments

Aggregated results for Wikipedia

Page 70: XRANK: Ranked Keyword Search over XML Documents

Experiments

Aggregated results for Wikipedia++ and DBLP++

Page 71: XRANK: Ranked Keyword Search over XML Documents

Experiments

Graph showing the average runtimes for different versions

Page 72: XRANK: Ranked Keyword Search over XML Documents

Thank you