
Gökay Burak AKKUŞ

Ece AKSU

XRANK

XRANK: Ranked Keyword Search over XML Documents


This Paper...

Describes the architecture, implementation and evaluation of the XRANK system

The contributions of the paper are:
(a) the problem definition and system architecture
(b) an algorithm for computing the ranking of XML elements
(c) new inverted list index structures and associated query processing algorithms
(d) an experimental evaluation of XRANK


Overview

Problem: Efficiently producing ranked results for keyword search queries over hierarchical XML documents.

New challenges:
1. Returns deeply nested XML elements.
2. Ranking is at the granularity of an XML element (not the document).
3. Keyword proximity is more complex.


Overview - 2

This paper presents the XRANK system to handle these features of XML keyword search.

XRANK offers both space & performance benefits

XRANK generalizes a hyperlink based HTML search engine such as Google.

XRANK can be used to query both HTML and XML documents.


Keyword Search Querying - 1

Keyword search querying

Advantages:
  It is simple
  Users do not have to learn a complex query language
  Users can issue queries without any prior knowledge about the structure of the underlying data

Consequences:
  The interface is flexible
  Queries may not always be precise and can return a large number of query results



An important requirement for keyword search is to rank the query results so that the most relevant results appear first.

Certain limitations of the HTML data model make such systems ineffective in many domains.
  HTML is a presentation language
  HTML cannot capture much semantics



The XML data model addresses this limitation by allowing for extensible element tags. (Example: Figure.1)


Querying XML Documents

One approach is the sophisticated query language XQuery
  Effective in some cases
  Users have to learn a complex query language and understand the schema of the underlying XML

An alternative approach is XRANK
  Retain the simple keyword search query interface
  Exploit XML’s tagged and nested structure during query processing


New Challenges

Keyword searching over XML introduces many new challenges.

1. The result of the keyword search query can be a deeply nested XML element.
  Return the ‘deepest’ node.

2. Ranking is not solely based on hyperlinks.
  The semantics of containment links (relating parent and child elements) is very different from that of hyperlinks (such as IDREFs and XLinks).


New Challenges

3. The notion of proximity among keywords is more complex.

In HTML, proximity among keywords translates directly to the distance between keywords in a document.

For XML there is a 2-dimensional proximity metric:
  Keyword distance
  Ancestor distance


XML Data Model

XML is a hierarchical format for data representation and exchange.

An XML document consists of a root element, nested sub-elements, attributes and values, and supports intra-document and inter-document references.


XML Data Model-2

Intra-document references are represented using IDREFs.

Inter-document references are represented using XLink.

Both IDREFs and XLinks are referred to as hyperlinks!


Definitions

A collection of hyperlinked XML documents can be defined as a directed graph $G = (N, CE, HE)$, where:
$N$: the set of nodes, $N = N_E \cup N_V$
$N_E$: the set of elements
$N_V$: the set of values
$CE$: the set of containment edges relating nodes
$HE$: the set of hyperlink edges relating nodes


Definitions - 2

The edge $(u, v) \in CE$ iff $v$ is a value or nested sub-element of $u$.

The edge $(u, v) \in HE$ iff $u$ contains a hyperlink reference to $v$.

An element $u$ is a sub-element of an element $v$ if $(v, u) \in CE$.

An element $u$ is the parent of node $v$ if $(u, v) \in CE$.

The predicate $\mathit{contains}^*(v, k)$ is true if the node $v$ directly or indirectly contains the keyword $k$.


Keyword Query Results

There are two possible semantics for keyword search queries:

Conjunctive keyword query semantics: elements that contain all of the query keywords are returned.

Disjunctive keyword query semantics: elements that contain at least one of the query keywords are returned.

This paper focuses on conjunctive keyword query semantics.


Keyword Query Results - 2

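$Q = \{k_1, \ldots, k_n\}$

$R_0 = \{\, v \mid v \in N_E \wedge \forall k \in Q\ (\mathit{contains}^*(v, k)) \,\}$

$R_0$ is the set of elements that directly or indirectly contain all of the query keywords.

$\mathit{Result}(Q) = \{\, v \mid \forall k \in Q\ \exists c \in N\ ((v, c) \in CE \wedge c \notin R_0 \wedge \mathit{contains}^*(c, k)) \,\}$

This definition ensures that only the most specific results are returned.

It also ensures that an element that has multiple independent occurrences of the query keywords is returned, even if one of its sub-elements already contains all of the query keywords.

$CE$ edges are considered for the result set; $HE$ edges are considered for ranking.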
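
The two set definitions on this slide can be checked on a small example. The sketch below is only an illustration: the toy workshop/paper/section tree, the dictionaries `CHILDREN` and `DIRECT`, and the function names are all assumptions made for this example, not part of XRANK.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy collection: element -> nested sub-elements (edges in CE).
CHILDREN: Dict[str, List[str]] = {
    "workshop": ["paper1", "paper2"],
    "paper1": ["title1", "section1"],
    "paper2": ["title2"],
    "title1": [],
    "section1": [],
    "title2": [],
}

# Hypothetical keywords held by the value nodes directly under each element.
DIRECT: Dict[str, Set[str]] = {
    "workshop": set(),
    "paper1": set(),
    "paper2": set(),
    "title1": {"xml"},
    "section1": {"xml", "ranking"},
    "title2": {"ranking"},
}


def contained_keywords(v: str) -> Set[str]:
    """All keywords k for which contains*(v, k) holds."""
    found = set(DIRECT[v])
    for c in CHILDREN[v]:
        found |= contained_keywords(c)
    return found


def keyword_query(query: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (R0, Result(Q)) under conjunctive semantics."""
    r0 = {v for v in CHILDREN if query <= contained_keywords(v)}
    result = set()
    for v in CHILDREN:
        # Value children are never in R0, so direct keywords always count.
        seen = set(DIRECT[v])
        for c in CHILDREN[v]:
            if c not in r0:  # skip children that are already in R0
                seen |= contained_keywords(c)
        if query <= seen:
            result.add(v)
    return r0, result


r0, result = keyword_query({"xml", "ranking"})
print(sorted(r0))      # ['paper1', 'section1', 'workshop']
print(sorted(result))  # ['section1']
```

Here paper1 and workshop contain both keywords, but only through section1, so only the most specific element is returned.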


Keyword Query Results - 3

Returning XML elements provides more context information, but it also poses interesting user-interface challenges.

One solution is to allow the user to navigate up to the ancestors of the query result.

Another solution is to predefine a set of “answer nodes” AN. This may require knowledge of the domain and the underlying XML schema.

XRANK supports both.


Ranking Keyword Query Results

Desired Properties of Ranking Function:
1) Result specificity: more specific results should be ranked higher than less specific results. This is one dimension of result proximity.
2) Keyword proximity: this is another dimension of result proximity.
3) Hyperlink awareness: the ranking should take the hyperlinked structure of XML documents into account.


Ranking Function: Definition

ElemRank is defined at the granularity of an element and takes the nested structure of XML into account.

Similar to Google’s PageRank.

Let $Q = (k_1, k_2, \ldots, k_n)$ and $R = \mathit{Result}(Q)$.

Consider a result element $v_1 \in R$.

First define the ranking of $v_1$ with respect to one query keyword $k_i$, $r(v_1, k_i)$, before defining the overall rank, $\mathit{rank}(v_1, Q)$.


Ranking with respect to one keyword

There exists a sub-element/value node $v_2$ of $v_1$ such that $v_2 \notin R_0$ and $\mathit{contains}^*(v_2, k_i)$.

There is a sequence of containment edges in $CE$ of the form $(v_1, v_2), (v_2, v_3), \ldots, (v_t, v_{t+1})$ such that $v_{t+1}$ is a value node that directly contains the keyword $k_i$.


Ranking with respect to one keyword

r(v1, ki) does not depend on the ElemRank of the result node v1, except when v1 = vt, for 2 reasons:
1. Less specific results indeed get lower ranks.
2. The rank is in fact related to ElemRank(v1) due to certain properties of containment edges.

For multiple occurrences of ki in v1, the per-occurrence ranks are combined with an aggregation function f, where f = max.


Overall Ranking

The overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity p(v1, k1, k2, …, kn).
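
The formula images for these ranking slides did not survive, so the following is a hedged sketch rather than the slides' exact definition. It assumes the decay form r(v1, ki) = ElemRank(vt) × decay^(t-1) from the XRANK paper, combines multiple occurrences with f = max, and treats the proximity measure p as a number supplied by the caller. The function names and all input values are made up for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# One occurrence of a keyword under the result element v1:
# (ElemRank of v_t, t), where v_t is the element that directly
# contains the keyword and t is its position on the containment path.
Occurrence = Tuple[float, int]


def occurrence_rank(elem_rank_vt: float, t: int, decay: float) -> float:
    """Assumed form: ElemRank(v_t) scaled down once per level below v1."""
    return elem_rank_vt * decay ** (t - 1)


def keyword_rank(
    occurrences: Iterable[Occurrence],
    decay: float,
    f: Callable[[Iterable[float]], float] = max,
) -> float:
    """Combine the ranks of all occurrences of one keyword with f (max here)."""
    return f(occurrence_rank(e, t, decay) for e, t in occurrences)


def overall_rank(
    per_keyword: List[List[Occurrence]], proximity: float, decay: float = 0.5
) -> float:
    """Sum of the per-keyword ranks, multiplied by the proximity measure."""
    return sum(keyword_rank(occ, decay) for occ in per_keyword) * proximity


# Hypothetical numbers: k1 occurs twice under v1, k2 once.
k1 = [(0.4, 1), (0.8, 3)]
k2 = [(0.6, 2)]
print(overall_rank([k1, k2], proximity=0.5))
```

With t = 1 the element that directly contains the keyword is v1 itself, which is the one case where the rank depends on ElemRank(v1).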


XRANK System Architecture


XRANK System Architecture-2

ElemRank Computation Module
  Computes the ElemRanks of XML elements
  ElemRanks are combined with ancestor information

HDIL
  The combined information is used to generate an index structure called HDIL

The Query Evaluator Module
  Evaluates queries using HDIL
  Returns ranked results


ElemRank Computational Module

ElemRank is a measure of the objective importance of an XML element and is based on the hyperlinked structure of XML docs.

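The PageRank function is the sum of 2 probabilities:
  Visiting v at random (a random jump, taken with probability 1 - d, where d = 0.85)
  Visiting v by navigating through hyperlinks (taken with probability d)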
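
As a reminder of what ElemRank starts from, here is a minimal sketch of standard PageRank by power iteration. It shows plain PageRank, not ElemRank itself. The three-node link graph is hypothetical, and the sketch assumes every node has at least one outgoing link.

```python
from typing import Dict, List


def pagerank(
    links: Dict[str, List[str]], d: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Power iteration for p(v) = (1 - d)/N + d * sum(p(u)/out(u)) over links u -> v."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random-jump part: every node receives (1 - d) / N.
        new_rank = {v: (1.0 - d) / n for v in nodes}
        # Navigation part: each node splits d * rank evenly over its out-links.
        for u, outs in links.items():
            share = d * rank[u] / len(outs)
            for v in outs:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical hyperlink graph: a -> b, a -> c, b -> c, c -> a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

Because rank only flows along the direction of the links, this formulation is unidirectional.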


ElemRank Computational Module

PageRank is unidirectional.

Forward ElemRank propagation: paper --> section

Reverse ElemRank propagation: paper --> workshop


Refinements of PageRank

Bi-directional transfer of ElemRanks
Discrimination between containment and hyperlink edges
Aggregate ElemRanks for reverse containment relationships


Bi-directional Transfer of ElemRanks

A simple solution is to add reverse containment edges.

Drawback: this does not distinguish between containment and hyperlink edges.


Discrimination between containment and hyperlink edges

It weights forward and reverse containment relationships similarly.


Aggregate ElemRanks for reverse containment relationships

XRANK System

Efficiently Evaluating XML Keyword Search Queries


Efficiently Evaluating XML Keyword Search Queries

Naïve Approach
Dewey Inverted List (DIL)
Ranked Dewey Inverted List (RDIL)
Hybrid Dewey Inverted List (HDIL)


Naïve Approach

Main difference between XML and HTML keyword search: the granularity of query results
  XML keyword search returns elements
  HTML keyword search returns documents

One way to do XML keyword search:
  Treat each element as a document


Problems of Naïve Approach

Space Overhead
Spurious Query Results
Inaccurate Ranking of Results


Space Overhead

An inverted list contains, for each keyword, the list of documents that contain the keyword.

For XML documents, it contains the list of elements.

This causes a large space overhead, because each inverted list contains:
(1) the XML element that directly contains the keyword
(2) all of the ancestors of (1), redundantly


Spurious Query Results

The naïve approach ignores ancestor-descendant relationships.
  All elements are treated as independent documents
  Results will not correspond to the desired semantics for XML keyword search


Inaccurate Ranking of Results

Existing approaches do not take result specificity into account when ranking results.


Dewey Inverted List (DIL)

The naïve approach has drawbacks:
  It decouples the representation of ancestors and descendants

Dewey encoding of element IDs jointly captures ancestor and descendant information.


DIL

An interesting feature: the ID of an ancestor is a prefix of the ID of a descendant.

Ancestor-descendant relationships are implicitly captured in the Dewey ID.
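
The prefix property is easy to state in code. In this minimal sketch a Dewey ID is represented as a tuple of integers; the dotted string form and the helper names are assumptions for illustration.

```python
from typing import Tuple

DeweyId = Tuple[int, ...]


def parse_dewey(text: str) -> DeweyId:
    """Turn a dotted ID such as '5.0.3.0.1' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_ancestor(a: DeweyId, d: DeweyId) -> bool:
    """True if a is a proper ancestor of d, i.e. a is a proper prefix of d."""
    return len(a) < len(d) and d[: len(a)] == a


def common_prefix(a: DeweyId, b: DeweyId) -> DeweyId:
    """Dewey ID of the deepest common ancestor (empty if there is none)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


print(is_ancestor(parse_dewey("5.0.3"), parse_dewey("5.0.3.0.1")))       # True
print(is_ancestor(parse_dewey("5.0.3"), parse_dewey("5.0.30")))          # False
print(common_prefix(parse_dewey("5.0.3.0.0"), parse_dewey("5.0.3.0.1"))) # (5, 0, 3, 0)
```

Comparing component by component, rather than comparing the dotted strings, avoids treating 5.0.3 as a prefix of 5.0.30.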


DIL Data Structure

The inverted list for a keyword k contains the Dewey IDs of all the XML elements that directly contain the keyword k.

For multiple documents: the first component of each Dewey ID is the document ID.


DIL Data Structure -2

An entry in DIL contains:
  The ElemRank of the corresponding XML element
  The list of all positions where the keyword k appears in that element

Entries are sorted by Dewey IDs.

The size of DIL is smaller than that of the Naïve Approach.
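
A DIL-style index can be sketched in a few lines. The `Entry` class, the `build_dil` function and the sample elements are hypothetical, and positions are taken as word offsets within the element's own text, which is a simplification made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

DeweyId = Tuple[int, ...]


@dataclass
class Entry:
    dewey: DeweyId        # element that directly contains the keyword
    elem_rank: float      # ElemRank of that element
    positions: List[int]  # where the keyword occurs in the element's text


def build_dil(
    elements: Iterable[Tuple[DeweyId, float, List[str]]]
) -> Dict[str, List[Entry]]:
    """Build keyword -> entries, with each list sorted by Dewey ID."""
    index: Dict[str, List[Entry]] = defaultdict(list)
    for dewey, elem_rank, words in elements:
        positions: Dict[str, List[int]] = defaultdict(list)
        for offset, word in enumerate(words):
            positions[word.lower()].append(offset)
        for word, offsets in positions.items():
            index[word].append(Entry(dewey, elem_rank, offsets))
    for entries in index.values():
        entries.sort(key=lambda e: e.dewey)
    return dict(index)


# Hypothetical elements: (Dewey ID, ElemRank, words directly in the element).
dil = build_dil([
    ((5, 0, 3, 0, 1), 0.2, ["XQL", "language"]),
    ((5, 0, 3, 0, 0), 0.3, ["XML", "query", "language", "XML"]),
])
print([e.dewey for e in dil["language"]])  # [(5, 0, 3, 0, 0), (5, 0, 3, 0, 1)]
print(dil["xml"][0].positions)             # [0, 3]
```

Only elements that directly contain a keyword get an entry; ancestors are recoverable from the Dewey ID prefix, which is where the space saving over the naïve approach comes from.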


DIL Query Processing

An algorithm that works in a single pass over the query keyword inverted lists.

The key idea:
  Merge the query keyword inverted lists
  Simultaneously compute the longest common prefix of the Dewey IDs in different lists
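
The merge-and-prefix idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm verbatim: it tracks only which keywords occur under each element (no ElemRank, positions or ranking), and the stack layout and the name `dil_query` are invented for this example.

```python
import heapq
from typing import Dict, List, Tuple

DeweyId = Tuple[int, ...]


def dil_query(lists: Dict[str, List[DeweyId]]) -> List[DeweyId]:
    """Single pass over Dewey-sorted inverted lists; returns the most specific
    elements that contain all keywords (conjunctive semantics)."""
    n = len(lists)
    tagged = [[(d, i) for d in ids] for i, ids in enumerate(lists.values())]
    # Each stack item: [Dewey component, keywords seen below, has-result-below].
    stack: List[list] = []
    results: List[DeweyId] = []

    def pop() -> None:
        comp, seen, contains_all = stack.pop()
        if len(seen) == n:
            results.append(tuple(item[0] for item in stack) + (comp,))
            contains_all = True
        if stack:
            if contains_all:
                stack[-1][2] = True   # parent is in R0; do not pass keywords up
            else:
                stack[-1][1] |= seen  # pass partial keyword set to the parent

    for dewey, kw in heapq.merge(*tagged):
        # Longest common prefix between the stack path and the new Dewey ID.
        lcp = 0
        while lcp < len(stack) and lcp < len(dewey) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:
            pop()
        for comp in dewey[lcp:]:
            stack.append([comp, set(), False])
        stack[-1][1].add(kw)
    while stack:
        pop()
    return results


# Hypothetical lists: 'xml' in 0.0.0 and 0.0.1, 'ranking' in 0.0.1 and 0.1.0.
print(dil_query({"xml": [(0, 0, 0), (0, 0, 1)],
                 "ranking": [(0, 0, 1), (0, 1, 0)]}))  # [(0, 0, 1)]
```

Because the lists are sorted by Dewey ID, all entries under a common ancestor arrive together, so one scan of each list is enough.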


Ranked Dewey Inverted List (RDIL)

“If inverted lists are long (due to common keywords or large document collections) even the cost of a single scan of the inverted list can be expensive, especially if the users want only the top few results.”


RDIL -2

One solution:
  Order the inverted lists by the ElemRank instead of by the Dewey ID
  Higher ranked results will appear first in the inverted list
  Threshold Algorithm


RDIL Data Structure

RDIL is similar to DIL except that:

Inverted lists are ordered by ElemRank,

Each inverted list has a B+-tree index of the Dewey ID field.


RDIL Query Processing

Consider an entry retrieved from the inverted list of keyword ki.

The entry contains the Dewey ID d of a top-ranked element that directly contains the query keyword ki.

To determine a query result, the longest prefix of d that also contains the other query keywords needs to be determined.
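
The prefix lookup can be illustrated with a sorted list and binary search standing in for the B+-tree on the Dewey ID field. That substitution, the function names and the sample IDs are assumptions made for this sketch. It relies on the fact that, in Dewey order, the entry sharing the longest prefix with d is adjacent to the position where d would be inserted.

```python
from bisect import bisect_left
from typing import List, Tuple

DeweyId = Tuple[int, ...]


def lcp_len(a: DeweyId, b: DeweyId) -> int:
    """Length of the longest common prefix of two Dewey IDs."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_prefix_len(d: DeweyId, sorted_ids: List[DeweyId]) -> int:
    """Longest prefix of d shared with any ID in a Dewey-sorted list."""
    i = bisect_left(sorted_ids, d)
    best = 0
    for j in (i - 1, i):  # only the two neighbours of d need checking
        if 0 <= j < len(sorted_ids):
            best = max(best, lcp_len(d, sorted_ids[j]))
    return best


def candidate_result(d: DeweyId, other_lists: List[List[DeweyId]]) -> DeweyId:
    """Longest prefix of d that also contains every other query keyword."""
    n = min((longest_prefix_len(d, ids) for ids in other_lists), default=len(d))
    return d[:n]


d = (5, 0, 3, 0, 1)  # hypothetical top-ranked entry for keyword ki
print(candidate_result(d, [[(5, 0, 3, 0, 0), (6, 0, 3, 8, 3)]]))  # (5, 0, 3, 0)
print(candidate_result(d, [[(5, 0, 3, 0, 0), (6, 0, 3, 8, 3)],
                           [(5, 0, 1), (8, 2)]]))                 # (5, 0)
```

An empty result tuple means that no element contains d together with all the other keywords.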


Hybrid Dewey Inverted List (HDIL)

In many cases RDIL is likely to perform well.

It may perform worse than DIL when there is a query where keywords are not correlated.


HDIL -2

The individual query keywords occur relatively frequently in the document collection but rarely occur together in the same document.

Since the number of results is small, RDIL has to scan most (or all) of the inverted lists to produce the output.

Can we combine the benefits of DIL and RDIL without replicating the entire inverted list index?


HDIL Query Processing

An adaptive strategy: periodically monitor performance and calculate:
  Time spent: t
  The number of results above the threshold: r
  Estimated time remaining for RDIL = (m - r) * t / r, where m is the desired number of query results

If the estimated time is more than the expected time for DIL, then switch to DIL.
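
The switching rule is simple arithmetic and can be written down directly. In this sketch the function names are hypothetical, the expected DIL time is assumed to be given by the caller, and the handling of r = 0 (treat the estimate as unbounded) is an added assumption that the slide does not cover.

```python
def estimated_rdil_remaining(t: float, r: int, m: int) -> float:
    """Estimated time for RDIL to find the remaining results: (m - r) * t / r."""
    if r == 0:
        # Assumption: with no results yet, the estimate is treated as unbounded.
        return float("inf")
    return max(m - r, 0) * t / r


def should_switch_to_dil(t: float, r: int, m: int, expected_dil_time: float) -> bool:
    """Switch when RDIL's estimated remaining time exceeds DIL's expected time."""
    return estimated_rdil_remaining(t, r, m) > expected_dil_time


# Hypothetical checkpoint: 2.0 s spent, 4 of the desired 10 results found.
print(estimated_rdil_remaining(t=2.0, r=4, m=10))                     # 3.0
print(should_switch_to_dil(t=2.0, r=4, m=10, expected_dil_time=2.5))  # True
print(should_switch_to_dil(t=2.0, r=4, m=10, expected_dil_time=5.0))  # False
```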


Experimental Evaluation

Experimental Setup
Quality and Ranking Function
Space requirements
Query Performance, with respect to:
  (1) the number of query keywords
  (2) the correlation between the keywords
  (3) the desired number of query results
  (4) the selectivity of the keywords
