Ranked Information Retrieval on XML Data

Seminar “Informationsorganisation und -suche mit XML”

Dr. Ralf Schenkel, summer semester (SS) 2003

Saarland University

July 8, 2003

Bernadette Blum, Christian Nicolaus, Markus Uhl


Outline

1. Introduction to Information Retrieval

2. Information Retrieval on XML Data

3. Approaches

   1. ELIXIR
      - The ELIXIR language
      - The ELIXIR query processing algorithm
      - Experiments, Conclusion

   2. XRANK
      - Data model
      - Ranking function
      - Data structures and algorithms
      - Experiments

4. Conclusion


1. Introduction to Information Retrieval

• Definition:

– Information Retrieval (IR) is the technology for searching in collections (corpora, intranets, Web) of weakly structured documents: text, HTML, XML, ...

– search engines, digital libraries, similarity search on scientific data

• Vector space model (text analysis):

– based on word occurrence frequency

– documents and queries are vectors

– result ranking based on similarity metric in vector space


1. Introduction to Information Retrieval (II)

• Link analysis (structure analysis):

– weighting documents

– improve result ranking

Page rank approach (I):

– web as directed graph G

– “random walk” of a web surfer

• follow hyperlinks with probability (1-ε)

• “random jump” with probability ε


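1. Introduction to Information Retrieval (III)

Page rank approach (II):

$$r(q) = \varepsilon \cdot \frac{1}{n} + (1 - \varepsilon) \sum_{(p,q) \in G} \frac{r(p)}{\mathrm{outdegree}(p)}$$

– first term: probability of a “random jump” to document q (ε: probability of a “random jump”, n: number of documents)

– second term: probability of reaching q by following a hyperlink (1-ε: probability of following a hyperlink)

[Figure: example graph with n = 5 documents. A document with three outgoing hyperlinks follows each of them with probability (1-ε)/3; every document is reached by a “random jump” with probability ε/5.]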
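The random-surfer model above can be sketched as a simple power iteration. This is a minimal illustration, not the implementation of any search engine: the function name `pagerank`, the jump probability `eps=0.15`, the fixed number of iterations, and the uniform treatment of pages without outgoing links are all assumptions made for the example.

```python
def pagerank(links, eps=0.15, iterations=50):
    """Iterate r(q) = eps/n + (1-eps) * sum over (p,q) in G of r(p)/outdegree(p).

    links: dict mapping every page to the list of pages it links to.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share eps/n.
        new_rank = {q: eps / n for q in pages}
        for p in pages:
            targets = links[p]
            if targets:
                # Follow each outgoing hyperlink with equal probability.
                for q in targets:
                    new_rank[q] += (1 - eps) * rank[p] / len(targets)
            else:
                # Assumption: a page without links jumps uniformly at random.
                for q in pages:
                    new_rank[q] += (1 - eps) * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical three-page web graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this toy graph, page `c` is linked from both `a` and `b`, so it ends up with a higher rank than `b`, which is linked only from `a`.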


2. Information Retrieval on XML Data

• XML: standard for exchange of structured data and documents

• existing query languages (e.g. XML-QL, Quilt, XQL, … XQuery)

– no ranked or weighted results based on textual similarity

– but extensions (XXL, XIRQL …)

2 approaches:

• ELIXIR: SQL-like approach

• XRANK: keyword-based approach


3.1 ELIXIR

• ELIXIR = “expressive and efficient language for XML information retrieval”

• extension to XML-QL: similarity operator “~”

• “~” computed by WHIRL

• returns best r answers

ELIXIR – The ELIXIR language

• Syntax:

– XML-QL Syntax (SQL-like)

CONSTRUCT <item>$b</>
WHERE <items.book year=$yb>$b</> in "db.xml",
      <items.cd>$c</> in "db.xml",
      $yb > 1990,
      $b ~ $c.

– CONSTRUCT clause: output format

– WHERE clause: pattern statements + predicates

– $yb > 1990: boolean operator

– $b ~ $c: ELIXIR’s similarity operator

• similarity can be computed even between 2 variables (expressiveness)

• no nested queries


ELIXIR – The ELIXIR language (II)

WHIRL (I):

• Word-based Heterogeneous Information Retrieval Logic

• extends DATALOG with “~”

• only relational data

• efficiently supports ranked IR

• Syntax (Horn clause):

output($y, $a, $t) :- book($y, $a, $t), $y>1950, $t~$a.

– output($y, $a, $t): output relation

– book($y, $a, $t): input relation

– the body is a conjunction of relational predicates

– $y>1950: boolean operator

– $t~$a: similarity operator


ELIXIR – The ELIXIR language (III)

WHIRL (II):

• Similarity computation “~”:

– standard IR term vector techniques

– weighting terms (TF-IDF values)

– cosine measure:

$$\mathrm{sim}(d, d') = \frac{\sum_{t \in V} d_t \cdot d'_t}{\lVert d \rVert \cdot \lVert d' \rVert}$$

(V: vocabulary of distinct terms; terms t ∈ V; documents d, d’ ∈ R^{|V|})
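The cosine measure over TF-IDF vectors can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not WHIRL's code: the whitespace tokenizer, the weighting variant `tf * log(1 + n/df)`, and the helper names `tfidf_vectors` and `cosine` are chosen for the example only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting variant: tf * log(1 + n/df).
        vectors.append({t: tf[t] * math.log(1 + n / df[t]) for t in tf})
    return vectors


def cosine(d1, d2):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


titles = ["Being and nothingness", "Being there", "Milk cow blues"]
vecs = tfidf_vectors(titles)
sim_related = cosine(vecs[0], vecs[1])    # the titles share the term "being"
sim_unrelated = cosine(vecs[0], vecs[2])  # the titles share no term
```

Titles that share no term get similarity 0, so a ranked query never needs to consider such pairs.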

ELIXIR – The ELIXIR query processing algorithm

Example (naïve approach):

XML-QL query Q2:

<q2>
CONSTRUCT <tuple><b>$b</><c>$c</></>
WHERE <items.book>$b</> in "db.xml",
      <items.cd>$c</> in "db.xml"
</>

Similarity computation for every tuple ($b, $c)

full cross product!


ELIXIR – The ELIXIR query processing algorithm (II)

Problem:

full cross product !


Solution:

• do not simply map the full XML data into the relational model

• invoke WHIRL as a “subroutine” (efficiency)

Avoid generating the full cross product!

ELIXIR – The ELIXIR query processing algorithm (III)

2 pattern statements with variables that are compared with a similarity predicate => distinct Q2j queries

ELIXIR – The ELIXIR query processing algorithm (IV)

Start query Q1

3 stages: intermediate queries Q2, Q3, Q4

1. Partition into a set, Q21 … Q2N, of XML-QL queries

– avoid generating the full cross product

– ordinary predicates

2. WHIRL query Q3

– similarity predicates

– ordered table of the r best answers

3. XML-QL query Q4

– transformation of Q3’s output

– into the XML structure specified by Q1

Example (Step I – Partition into Q2n queries):

XML-QL query Q21:

<q21>
CONSTRUCT <tuple><b>$b</></>
WHERE <items.book>$b</> in "db.xml"
</>

Result of Q21:

<q21><tuple><b>Traditional Ukrainian cookery</></>
<tuple><b>Being and nothingness</></>
<tuple><b>Shooting Elvis</></></>

XML-QL query Q22:

<q22>
CONSTRUCT <tuple><c>$c</></>
WHERE <items.cd>$c</> in "db.xml"
</>

Result of Q22:

<q22><tuple><c>Ukrainian folk music</></>
<tuple><c>Being there</></>
<tuple><c>Milk cow blues</></></>

Avoid generating the full cross product!

ELIXIR – The ELIXIR query processing algorithm (V)

Example (Step II – WHIRL query Q3):

WHIRL query Q3:

q3($b) :- q21($b), q22($c), $b ~ $c.

Input q21:

<q21><tuple><b>Traditional Ukrainian cookery</></>
<tuple><b>Being and nothingness</></>
<tuple><b>Shooting Elvis</></></>

Input q22:

<q22><tuple><c>Ukrainian folk music</></>
<tuple><c>Being there</></>
<tuple><c>Milk cow blues</></></>

Output q3:

<q3><tuple><b>Traditional Ukrainian cookery</></>
<tuple><b>Being and nothingness</></></>
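One way to see why the similarity join does not need the full cross product is to generate candidate pairs from an inverted index over terms. The sketch below is a deliberate simplification and not WHIRL's actual search procedure: it scores pairs with a cosine over binary term vectors instead of TF-IDF weights, and the names `terms`, `similarity` and `top_r_similarity_join` are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def terms(text):
    return set(text.lower().split())


def similarity(a, b):
    """Cosine over binary term vectors (simplified stand-in for TF-IDF)."""
    ta, tb = terms(a), terms(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))


def top_r_similarity_join(left, right, r):
    # Inverted index over the right-hand relation: term -> row numbers.
    index = defaultdict(set)
    for j, text in enumerate(right):
        for term in terms(text):
            index[term].add(j)
    scored = []
    for b in left:
        # Only rows sharing at least one term can have non-zero similarity,
        # so all other pairs of the cross product are never touched.
        candidates = set()
        for term in terms(b):
            candidates |= index.get(term, set())
        for j in candidates:
            scored.append((similarity(b, right[j]), b, right[j]))
    return heapq.nlargest(r, scored)


q21 = ["Traditional Ukrainian cookery", "Being and nothingness", "Shooting Elvis"]
q22 = ["Ukrainian folk music", "Being there", "Milk cow blues"]
best = top_r_similarity_join(q21, q22, r=2)
```

With the example data only two of the nine possible pairs are scored, and the two returned book titles are the same two that appear in the q3 output above. The order may differ from a TF-IDF ranking because the weights are simplified.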

ELIXIR – The ELIXIR query processing algorithm (VI)

Example (Step III – XML-QL query Q4):

XML-QL query Q4:

<results>
CONSTRUCT <item>$b</>
WHERE <q3.tuple><b>$b</></> in "q3.xml"
</>

Input q3:

<q3><tuple><b>Traditional Ukrainian cookery</></>
<tuple><b>Being and nothingness</></></>

Final XML output:

<results><item>Traditional Ukrainian cookery</>
<item>Being and nothingness</></>

ELIXIR – The ELIXIR query processing algorithm (VII)


ELIXIR – Experiments, Conclusion

Experiments:

Total processing time …

– … depends on the details of each query and the input data

– … increases only marginally with the number of answers r

– … increases linearly with the number of similarity join predicates

– … is dominated by the partition of the initial query (Step 1), because parsing and traversing are expensive


ELIXIR – Experiments, Conclusion (II)

Conclusion:

• ELIXIR extends XML-QL by supporting IR similarity features for ranking

• similarity joins even between 2 variables (expressiveness)

• Algorithm:

– rewrite the original ELIXIR query into a series of intermediate XML-QL and WHIRL queries

– no full cross product, only filtered tuples of variable bindings (efficiency)

• But …

– only non-nested queries

– strict three-stage approach may be suboptimal in some cases (partition)


XRANK: Ranked Keyword Search over XML Documents


Introduction

XRANK - Keyword Search over XML documents

• results: XML elements that contain all searched keywords

• ranking: at the granularity of XML elements, based on hyperlink structure

• advantages:

– the user does not have to learn a query language

– no knowledge about the structure of the XML documents is needed

• generalized keyword search engine (both HTML and XML are possible)


Data Model

• G = (V, CE, HE): collection of XML documents

• V: set of XML elements (tags and attributes)

• CE: set of containment edges

• HE: set of hyperlink edges

• (u,v) ∈ CE ⇔ v is a sub-element of u

• (u,v) ∈ HE ⇔ u contains a hyperlink to v

• contains(v,k) ⇔ v (in)directly contains the keyword k


Example: XML Graph

[Figure: example XML graph showing XML elements and their values]


Keyword Query Results (1)

How to define the results of keyword search queries over XML documents?

The result set is the union (⋃) of:

• elements that contain all keywords, where no sub-element contains all keywords

• elements with at least one sub-element containing all keywords and at least one sub-element containing some keywords


Ranking Elements

How to rank XML elements?

• ElemRank: extension of PageRank at the granularity of elements

• objective importance of XML elements, based on the hyperlinked and nested structure of XML elements


ElemRank (1)

• n: number of XML elements

• n_c(u): number of sub-elements of u

• n_h(u): number of outgoing hyperlinks from u

• CE^{-1} = {(v,u) | (u,v) ∈ CE}: “reverse containment edges”

• E = HE ∪ CE ∪ CE^{-1}

[Figure: element u with n_c(u) = 3 and n_h(u) = 3; legend: containment edge, reverse containment edge, hyperlink edge]


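• d_1: probability of following a hyperlink

• d_2: probability of using a containment edge

• d_3: probability of using a reverse containment edge

• ε = 1 - d_1 - d_2 - d_3: probability of a random jump

[Figure: example with n = 10 elements. Each of the three hyperlink edges of u is used with probability d_1/3 + ε/10, each of the three containment edges with probability d_2/3 + ε/10, the reverse containment edge with probability d_3/1 + ε/10; every other element is reached with probability ε/10. Legend: containment edge, reverse containment edge, hyperlink edge.]

ElemRank:

$$e(v) = \frac{1 - d_1 - d_2 - d_3}{n} + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{n_h(u)} + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{n_c(u)} + d_3 \sum_{(u,v) \in CE^{-1}} \frac{e(u)}{1} \qquad (0 \le d_1, d_2, d_3 \le 1)$$

– first term: random navigation

– second term: navigation via hyperlinks

– third term: navigation via forward containment edges

– fourth term: navigation via reverse containment edges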
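The ElemRank recurrence can be evaluated by fixed-point iteration, in the same way as PageRank. The sketch below is illustrative only: the parameter names `d1`, `d2`, `d3` stand for the probabilities of following a hyperlink, a containment edge and a reverse containment edge, and their default values, the number of iterations and the toy graph are arbitrary assumptions.

```python
def elem_rank(elements, ce, he, d1=0.35, d2=0.25, d3=0.25, iterations=100):
    """Fixed-point iteration of the ElemRank recurrence.

    ce: containment edges (parent, child); he: hyperlink edges (source, target).
    """
    n = len(elements)
    n_c = {u: 0 for u in elements}  # number of sub-elements of u
    n_h = {u: 0 for u in elements}  # number of outgoing hyperlinks of u
    for u, _ in ce:
        n_c[u] += 1
    for u, _ in he:
        n_h[u] += 1
    e = {v: 1.0 / n for v in elements}
    for _ in range(iterations):
        # Random jump share for every element.
        new_e = {v: (1 - d1 - d2 - d3) / n for v in elements}
        for u, v in he:
            new_e[v] += d1 * e[u] / n_h[u]  # via hyperlinks
        for u, v in ce:
            new_e[v] += d2 * e[u] / n_c[u]  # forward containment: parent to child
            new_e[u] += d3 * e[v]           # reverse containment: child to parent
        e = new_e
    return e


# Hypothetical collection: two CDs, each with a title; CD 1 links to CD 2.
elements = ["cds", "cd1", "cd2", "title1", "title2"]
ce = [("cds", "cd1"), ("cds", "cd2"), ("cd1", "title1"), ("cd2", "title2")]
he = [("cd1", "cd2")]
ranks = elem_rank(elements, ce, he)
```

In this toy graph `cd1` and `cd2` are symmetric except for the hyperlink, so `cd2` receives the higher ElemRank.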



Ranking Function (1)

A ranking function should take into account:

• result specificity

• hyperlinks

• keyword proximity

contains(v,k) ⇔ ∃ sequence of containment edges (v_1,v_2), ..., (v_{n-1},v_n) with v_1 = v such that v_n directly contains k

$$r(v,k) = \mathrm{ElemRank}(v_n) \cdot \mathrm{decay}^{\,n-1} \qquad (0 \le \mathrm{decay} \le 1)$$

– ElemRank(v_n): based on the hyperlinked structure

– decay^{n-1}: result specificity


Ranking Function (2)

• m occurrences of keyword k ⇒ computation of r_1, ..., r_m

$$r^*(v,k) = f(r_1, \ldots, r_m)$$

(with accumulation function f, e.g. max or sum)

• query q consists of keywords k_1, ..., k_n

$$R(v,q) = \Big( \sum_{1 \le i \le n} r^*(v,k_i) \Big) \cdot p(v, k_1, \ldots, k_n)$$

p = keyword proximity measure
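The two ranking formulas can be combined in a short sketch. The function names `keyword_rank` and `overall_rank` are hypothetical, the proximity measure p is assumed to be given as a plain number, and `decay=0.5` is an arbitrary value chosen so that the arithmetic is easy to follow.

```python
def keyword_rank(elem_rank_vn, n, decay):
    """r(v,k) = ElemRank(v_n) * decay^(n-1) for one occurrence of k."""
    return elem_rank_vn * decay ** (n - 1)


def overall_rank(occurrences, decay, proximity=1.0, f=max):
    """R(v,q) = (sum over keywords of r*(v,k_i)) * p(v, k_1, ..., k_n).

    occurrences: keyword -> list of (ElemRank(v_n), n) pairs, one per
    occurrence; f is the accumulation function (e.g. max or sum).
    """
    total = 0.0
    for occs in occurrences.values():
        total += f([keyword_rank(e, n, decay) for e, n in occs])
    return total * proximity


# Result element: the first <CD> (Dewey ID 0.0).
# "R.E.M." occurs in <title> 0.0.0 (ElemRank 75), so v_1 = 0.0, v_2 = 0.0.0, n = 2.
# "Religion" occurs in <title> 0.0.2.0 (ElemRank 88), so n = 3.
occurrences = {"R.E.M.": [(75, 2)], "Religion": [(88, 3)]}
score = overall_rank(occurrences, decay=0.5)
```

With these numbers the score is 75 · 0.5 + 88 · 0.25 = 37.5 + 22 = 59.5. The deeper occurrence contributes less, which is how the decay factor rewards specific results.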

<CDs>
  <CD id="1">
    <title> R.E.M. – Out Of Time </title>
    <song> <title> Radio Song </title> <time> 4:12 </time> </song>
    <song> <title> Losing My Religion </title> <time> 4:26 </time> </song>
    ...
  </CD>
  <CD id="2">
    <title> R.E.M. – Automatic For... </title>
    ...
  </CD>
  ...
</CDs>


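XRANK Architecture

[Figure: XRANK architecture. XML documents are passed to the ElemRank computation, which yields XML elements with ElemRanks; these are stored in index structures. The Query Evaluator receives a keyword search query, accesses the data through the index structures and algorithms, and returns a ranked result list.]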


Naïve Approach

• naïve inverted list: for each keyword, contains all XML elements that contain the keyword

key1: elem11, elem12, ...

key2: elem21, elem22, ...

etc.

• problems: space overhead, spurious results, inaccurate ranking

Dewey IDs

0          <CDs>
0.0          <CD>
0.0.0          <title>  R.E.M. – Out Of Time
0.0.1          <song>
0.0.1.0          <title>  Radio Song
0.0.1.1          <time>   4:12
0.0.2          <song>
0.0.2.0          <title>  Losing My Religion
0.0.2.1          <time>   4:26
...
0.1          <CD>
0.1.0          <title>  R.E.M. – Automatic For The People
...


DIL – Data Structure

• Dewey inverted list (DIL):

– contains the Dewey IDs of all XML elements that directly contain the keyword

– sorted by Dewey ID (ascending)

Keyword    Dewey ID   ElemRank   position list
R.E.M.     0.0.0      75         [0]
           0.1.0      80         [0]
           …
Religion   0.0.2.0    88         [2]
           …
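A Dewey inverted list can be built with one depth-first pass over the document. The following sketch uses Python's standard `xml.etree.ElementTree`; the function name `build_dil` is hypothetical, keywords are plain whitespace-separated tokens, and ElemRanks and position lists are left out to keep the example short.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML = """<CDs>
  <CD><title>R.E.M. – Out Of Time</title>
      <song><title>Radio Song</title><time>4:12</time></song>
      <song><title>Losing My Religion</title><time>4:26</time></song></CD>
  <CD><title>R.E.M. – Automatic For The People</title></CD>
</CDs>"""


def build_dil(root):
    """Map each keyword to the sorted Dewey IDs of the elements that
    directly contain it (text following child elements is ignored here)."""
    index = defaultdict(list)

    def visit(elem, dewey):
        for word in (elem.text or "").split():
            index[word].append(dewey)
        # The i-th child of an element extends the parent's Dewey ID by i.
        for i, child in enumerate(elem):
            visit(child, dewey + (i,))

    visit(root, (0,))
    # Tuples compare component-wise, which is exactly Dewey ID order.
    return {k: sorted(set(v)) for k, v in index.items()}


dil = build_dil(ET.fromstring(XML))
```

For this document `dil["R.E.M."]` holds the IDs 0.0.0 and 0.1.0 and `dil["Religion"]` holds 0.0.2.0, matching the Dewey IDs in the table above.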


DIL – Query Processing (1)

• key idea: computation of the longest common prefix (lcp) of Dewey IDs

[Figure: three-step animation (1., 2., 3.) of DIL query processing. Each step shows a table with the columns Dewey ID, rank[1], rank[2], posList[1], posList[2]; the lcp of the Dewey IDs is computed from one step to the next.]
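The longest common prefix of two Dewey IDs identifies the deepest element that contains both of them. A minimal sketch, with hypothetical helper names `parse`, `fmt` and `lcp`:

```python
def parse(dewey):
    """'0.0.2.0' -> (0, 0, 2, 0)"""
    return tuple(int(part) for part in dewey.split("."))


def fmt(dewey):
    """(0, 0) -> '0.0'"""
    return ".".join(str(part) for part in dewey)


def lcp(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


# "R.E.M." occurs in 0.0.0 and "Religion" in 0.0.2.0.
same_cd = fmt(lcp(parse("0.0.0"), parse("0.0.2.0")))
# "R.E.M." also occurs in 0.1.0, which belongs to the second CD.
other_cd = fmt(lcp(parse("0.1.0"), parse("0.0.2.0")))
```

Here `same_cd` is `0.0`, the first `<CD>` element, which is the most specific element containing both keywords, while `other_cd` is only the root `0`.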



RDIL – Data Structure

• ranked Dewey inverted list (RDIL):

– inverted list sorted by ElemRank (descending)

– each Dewey ID in the list has a position in the B+-tree

– B+-tree sorted by Dewey ID (ascending)

Keyword    Dewey ID   ElemRank
R.E.M.     0.1.0      80
           0.0.0      75
           …

B+-tree on Dewey IDs: 0.0.0, 0.1.0, …


RDIL – Query Processing (1)

[Figure: three-step animation of RDIL query processing. Each keyword (key1, key2, key3) has an inverted list of entries sorted by ElemRank and a B+-tree on Dewey IDs. The lists are processed in turn: for the current entry (entry11, then entry21, etc.) the lcp with its Dewey ID is computed using the other keywords' B+-trees, and the result is inserted into the result heap.]

• threshold Ω = ∑ ranking of the current list entries

• max. reachable ranking ≤ Ω

The RDIL algorithm stops if

threshold Ω < lowest ElemRank in result heap

because

max. reachable ranking ≤ Ω < lowest ElemRank in result heap

⇒ max. reachable ranking < lowest ElemRank in result heap
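The stopping rule follows the pattern of a generic threshold algorithm for top-k queries over score-sorted lists. The sketch below illustrates only that pattern and is not the RDIL algorithm itself: it works on plain items instead of Dewey IDs and lcp computation, replaces the B+-tree lookups with dictionaries, treats an item missing from a list as scoring 0, and the name `threshold_top_k` is hypothetical.

```python
import heapq


def threshold_top_k(lists, k):
    """lists: one list per keyword of (score, item) pairs, sorted by score
    in descending order. An item's overall score is the sum of its scores."""
    # Random access by item (stands in for the B+-tree lookups).
    lookup = [{item: score for score, item in lst} for lst in lists]
    heap = []  # min-heap holding the k best (overall score, item) pairs
    seen = set()
    depth = 0
    while any(depth < len(lst) for lst in lists):
        threshold = 0.0
        for lst in lists:
            if depth < len(lst):
                score, item = lst[depth]
                threshold += score  # sum of the scores at the current depth
                if item not in seen:
                    seen.add(item)
                    total = sum(d.get(item, 0.0) for d in lookup)
                    heapq.heappush(heap, (total, item))
                    if len(heap) > k:
                        heapq.heappop(heap)
        depth += 1
        # No unseen item can exceed the threshold, so stop once the
        # threshold drops below the lowest score in the result heap.
        if len(heap) == k and threshold < heap[0][0]:
            break
    return sorted(heap, reverse=True)


lists = [
    [(88, "a"), (80, "b"), (10, "c")],
    [(75, "b"), (70, "a"), (5, "d")],
]
top = threshold_top_k(lists, k=1)
```

In this example the scan stops after two rounds: the threshold is then 80 + 70 = 150, which is below the best overall score 158 of item `a`, so the entries for `c` and `d` are never read.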


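XRANK Architecture

[Figure: XRANK architecture as before, with DIL / RDIL as the index structures. XML documents pass through the ElemRank computation, which yields XML elements with ElemRanks; the Query Evaluator receives a keyword search query, accesses the data through DIL / RDIL, and returns a ranked result list.]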


Experimental Results (1)

[Figure: high keyword correlation. Execution time (sec.), axis from 0 to 1.2, versus number of keywords (1 to 5), for DIL and RDIL.]


Experimental Results (2)

[Figure: low keyword correlation. Execution time (sec.), axis from 0 to 2, versus number of keywords (1 to 5), for DIL and RDIL.]


Comparison DIL - RDIL

DIL:

• inverted lists sorted by Dewey ID

• computes the longest common prefix of Dewey IDs

• extracts the minimum of all remaining Dewey IDs

• all lists are completely scanned

• outperforms RDIL if keyword correlation is low

RDIL:

• inverted lists sorted by ElemRank

• chooses the next list sequentially

• stops if a certain threshold is reached

• outperforms DIL if keyword correlation is high


Conclusion

2 approaches:

ELIXIR:

– SQL-like, structure-based search

– extends XML-QL by supporting IR similarity features for ranking

– ranked results based only on textual similarity (even between 2 variables)

XRANK:

– keyword-based search à la Google

– ranked results based on textual similarity

– hierarchical and hyperlinked structure
