ranked information retrieval on xml data

Ranked Information Retrieval on XML Data. Seminar “Informationsorganisation und -suche mit XML”, Dr. Ralf Schenkel, summer semester (SS) 2003, Saarland University, 8 July 2003. Bernadette Blum, Christian Nicolaus, Markus Uhl


Post on 08-Jan-2016


Page 1: Ranked Information Retrieval on  XML Data

Ranked Information Retrieval on XML Data

Seminar “Informationsorganisation und -suche mit XML”

Dr. Ralf Schenkel, SS 2003

Saarland University

8 July 2003

Bernadette Blum, Christian Nicolaus, Markus Uhl

Page 2: Ranked Information Retrieval on  XML Data

Outline

1. Introduction to Information Retrieval

2. Information Retrieval on XML Data

3. Approaches

3.1 ELIXIR
- The ELIXIR language
- The ELIXIR query processing algorithm
- Experiments, Conclusion

3.2 XRANK
- Data model
- Ranking function
- Data structures and algorithms
- Experiments

4. Conclusion

Page 3: Ranked Information Retrieval on  XML Data

1. Introduction to Information Retrieval

• Definition:

– Information Retrieval (IR) is the technology for searching in collections (corpora, intranets, Web) of weakly structured documents: text, HTML, XML, ...

– search engines, digital libraries, similarity search on scientific data

• Vector space model (text analysis):

– based on word occurrence frequency

– documents and queries are vectors

– result ranking based on similarity metric in vector space

Page 4: Ranked Information Retrieval on  XML Data

1. Introduction to Information Retrieval (II)

• Link analysis (structure analysis):

– weighting documents

– improve result ranking

PageRank approach (I):

– web as directed graph G

– “random walk” of a web surfer

• follow hyperlinks with probability (1-ε)

• “random jump” with probability ε

Page 5: Ranked Information Retrieval on  XML Data

1. Introduction to Information Retrieval (III)

PageRank approach (II):

$$ r(q) = \varepsilon \cdot \frac{1}{n} + (1 - \varepsilon) \cdot \sum_{(p,q) \in G} \frac{r(p)}{\mathrm{outdegree}(p)} $$

(n: number of documents)

– first term: probability of reaching document q by a “random jump”

– second term: probability of reaching q by following a hyperlink

[Figure: example with n = 5 documents; each of the 3 hyperlinks leaving a document is followed with probability (1-ε)/3, and each “random jump” target is chosen with probability ε/5.]
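
The random-surfer model on these slides can be sketched as a simple fixed-point iteration. This is a minimal illustration, not the presenters' code: the function name `pagerank`, the parameter name `eps` for the random-jump probability, the default value 0.15, and the uniform treatment of pages without outgoing links are all assumptions.

```python
def pagerank(graph, eps=0.15, iterations=50):
    """Iterate r(q) = eps/n + (1-eps) * sum over links (p,q) of r(p)/outdegree(p).

    graph maps each page to the list of pages it links to.
    """
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share eps/n.
        new_rank = {q: eps / n for q in pages}
        for p, targets in graph.items():
            if targets:
                share = (1 - eps) * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: a page without outgoing links jumps uniformly.
                for q in pages:
                    new_rank[q] += (1 - eps) * rank[p] / n
        rank = new_rank
    return rank


example_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(example_web)
print(ranks)
```

In the three-page example, page c is linked from both a and b, so it ends up with a higher rank than b.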

Page 6: Ranked Information Retrieval on  XML Data

2. Information Retrieval on XML Data

• XML: standard for exchange of structured data and documents

• existing query languages (e.g. XML-QL, Quilt, XQL, … XQuery)

– no ranked or weighted results based on textual similarity

– but extensions (XXL, XIRQL …)

2 approaches:

– ELIXIR: SQL-like approach

– XRANK: keyword-based approach

Page 7: Ranked Information Retrieval on  XML Data

3.1 ELIXIR

• ELIXIR = “expressive and efficient language for XML information retrieval”

• extension to XML-QL: similarity operator “~”

• “~” computed by WHIRL

• returns best r answers

Page 8: Ranked Information Retrieval on  XML Data

ELIXIR – The ELIXIR language

• Syntax:

– XML-QL syntax (SQL-like)

CONSTRUCT <item>$b</>
WHERE <items.book year=$yb>$b</> in "db.xml",
      <items.cd>$c</> in "db.xml",
      $yb > 1990,
      $b ~ $c.

– CONSTRUCT clause: output format

– WHERE clause: pattern statements + predicates

– “$yb > 1990”: boolean operator

– “$b ~ $c”: ELIXIR’s similarity operator

• similarity calculation even between 2 variables (→ expressiveness)

• no nested queries

Page 9: Ranked Information Retrieval on  XML Data

ELIXIR – The ELIXIR language (II)

WHIRL (I):

• Word-based Heterogeneous Information Retrieval Logic

• extends DATALOG with “~”

• only relational data

• efficiently supports ranked IR

• Syntax (Horn clause):

output($y, $a, $t) :- book($y, $a, $t), $y>1950, $t~$a.

– “output($y, $a, $t)”: output relation

– “book($y, $a, $t)”: input relation

– clause body: conjunction of relational predicates

– “$y>1950”: boolean operator

– “$t~$a”: similarity operator

Page 10: Ranked Information Retrieval on  XML Data

ELIXIR – The ELIXIR language (III)

WHIRL (II):

• Similarity computation “~”:

– standard IR term vector techniques

– weighting terms (TF-IDF values)

– cosine measure:

$$ \mathrm{sim}(d,d') = \sum_{t \in V} \frac{d_t \cdot d'_t}{\lVert d \rVert \cdot \lVert d' \rVert} $$

(V: vocabulary of distinct terms; terms t ∈ V; documents d, d’ ∈ R^{|V|})
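
The cosine measure over TF-IDF term vectors can be sketched in a few lines of Python. This is an illustration under stated assumptions, not WHIRL's implementation: the helper names `tfidf_vectors` and `cosine`, whitespace tokenization, and the IDF variant log(1 + N/df) are choices made here for brevity.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed IDF variant: log(1 + N/df), which is never zero.
        vectors.append({t: tf[t] * math.log(1 + n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """sim(d, d') = sum_t d_t * d'_t / (||d|| * ||d'||)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


titles = ["Ukrainian folk music", "Traditional Ukrainian cookery", "Milk cow blues"]
vecs = tfidf_vectors(titles)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Titles that share a term get a positive score, titles without any common term get 0.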

Page 11: Ranked Information Retrieval on  XML Data

ELIXIR – The ELIXIR query processing algorithm

Example (naïve approach), XML-QL query Q2:

<q2>
CONSTRUCT <tuple><b>$b</><c>$c</></>
WHERE <items.book>$b</> in "db.xml",
      <items.cd>$c</> in "db.xml"
</>

Similarity computation for every tuple ($b, $c)

→ full cross product!

Page 12: Ranked Information Retrieval on  XML Data

ELIXIR – The ELIXIR query processing algorithm (II)

Problem:

full cross product!

Page 13: Ranked Information Retrieval on  XML Data

ELIXIR – The ELIXIR query processing algorithm (III)

Solution:

• do not simply map the full XML data into the relational model

• invoke WHIRL as a “subroutine” (→ efficiency)

Avoid generating the full cross product!

Page 14: Ranked Information Retrieval on  XML Data

ELIXIR – The ELIXIR query processing algorithm (IV)

Start query Q1

3 stages: intermediate queries Q2, Q3, Q4

1. Partition into a set, Q2_1 … Q2_N, of XML-QL queries

– avoids generating the full cross product

– evaluates the ordinary predicates

– 2 pattern statements with variables that are compared with a similarity predicate => distinct Q2_j queries

2. WHIRL query Q3

– evaluates the similarity predicates

– returns an ordered table of the r best answers

3. XML-QL query Q4

– transforms Q3’s output

– into the XML structure specified by Q1

Page 15: Ranked Information Retrieval on  XML Data

ELIXIR – The ELIXIR query processing algorithm (V)

Example (Step I – partition into Q2_n queries):

XML-QL query Q21:

<q21>
CONSTRUCT <tuple><b>$b</></>
WHERE <items.book>$b</> in "db.xml"
</>

XML-QL query Q22:

<q22>
CONSTRUCT <tuple><c>$c</></>
WHERE <items.cd>$c</> in "db.xml"
</>

Output of Q21:

<q21><tuple><b>Traditional Ukrainian cookery</></>
<tuple><b>Being and nothingness</></>
<tuple><b>Shooting Elvis</></></>

Output of Q22:

<q22><tuple><c>Ukrainian folk music</></>
<tuple><c>Being there</></>
<tuple><c>Milk cow blues</></></>

Avoid generating the full cross product!

Page 16: Ranked Information Retrieval on  XML Data

ELIXIR – The ELIXIR query processing algorithm (VI)

Example (Step II – WHIRL query Q3):

Input (output of Q21 and Q22):

<q21><tuple><b>Traditional Ukrainian cookery</></>
<tuple><b>Being and nothingness</></>
<tuple><b>Shooting Elvis</></></>

<q22><tuple><c>Ukrainian folk music</></>
<tuple><c>Being there</></>
<tuple><c>Milk cow blues</></></>

WHIRL query Q3:

q3($b) :- q21($b), q22($c), $b ~ $c.

Output of Q3:

<q3><tuple><b>Traditional Ukrainian cookery</></>
<tuple><b>Being and nothingness</></></>

Page 17: Ranked Information Retrieval on  XML Data

ELIXIR – The ELIXIR query processing algorithm (VII)

Example (Step III – XML-QL query Q4):

Input (output of Q3):

<q3><tuple><b>Traditional Ukrainian cookery</></>
<tuple><b>Being and nothingness</></></>

XML-QL query Q4:

<results>
CONSTRUCT <item>$b</>
WHERE <q3.tuple><b>$b</></> in "q3.xml"
</>

Final XML output:

<results><item>Traditional Ukrainian cookery</>
<item>Being and nothingness</></>
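
The three stages of the example can be mimicked in Python to make the data flow concrete. This is only a sketch under several assumptions: the contents of `DB_XML` are a hypothetical reconstruction of "db.xml" (the slides show only query outputs), the function names are invented here, and a simple Jaccard word overlap stands in for WHIRL's TF-IDF cosine. The small inverted index in stage 2 is likewise a stand-in for WHIRL's own search strategy; it merely avoids scoring pairs that share no word.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical "db.xml"; the element layout is an assumption.
DB_XML = """
<items>
  <book>Traditional Ukrainian cookery</book>
  <book>Being and nothingness</book>
  <book>Shooting Elvis</book>
  <cd>Ukrainian folk music</cd>
  <cd>Being there</cd>
  <cd>Milk cow blues</cd>
</items>
"""


def word_overlap(a, b):
    """Jaccard word overlap, a stand-in for the TF-IDF cosine measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def stage1_partition(root):
    """Like Q21 and Q22: bind $b and $c separately, without pairing them."""
    return ([e.text for e in root.findall("book")],
            [e.text for e in root.findall("cd")])


def stage2_similarity_join(books, cds, r):
    """Like Q3: return the r best (score, $b, $c) pairs."""
    index = defaultdict(set)          # word -> CD titles containing it
    for c in cds:
        for w in c.lower().split():
            index[w].add(c)
    scored = []
    for b in books:
        # Only CDs sharing at least one word with b are candidates.
        candidates = set().union(*(index.get(w, set()) for w in b.lower().split()))
        scored.extend((word_overlap(b, c), b, c) for c in candidates)
    scored.sort(reverse=True)
    return scored[:r]


def stage3_construct(best_pairs):
    """Like Q4: wrap the surviving $b bindings in the requested XML structure."""
    results = ET.Element("results")
    for _, b, _ in best_pairs:
        ET.SubElement(results, "item").text = b
    return ET.tostring(results, encoding="unicode")


root = ET.fromstring(DB_XML.strip())
books, cds = stage1_partition(root)
best = stage2_similarity_join(books, cds, r=2)
output = stage3_construct(best)
print(output)
```

With r = 2 the two books that share a word with a CD title survive, matching the set of items in the final XML output of the example; the order depends on the similarity function and may differ from the slides.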

Page 18: Ranked Information Retrieval on  XML Data

ELIXIR – Experiments, Conclusion

Experiments:

Total processing time …

– … depends on the details of each query and on the input data

– … increases only marginally with the number of answers r

– … increases linearly with the number of similarity join predicates

– … is dominated by the partition of the initial query (Step 1), because parsing and traversing are expensive

Page 19: Ranked Information Retrieval on  XML Data

ELIXIR – Experiments, Conclusion (II)

Conclusion:

• ELIXIR extends XML-QL by supporting IR similarity features for ranking

• similarity joins even between 2 variables (expressiveness)

• Algorithm:

– rewrite original ELIXIR query in a series of intermediate XML-QL and WHIRL queries.

– no full cross product, only filtered tuples of variable bindings (efficiency)

• But …

– only non-nested queries

– strict three-stage approach may be suboptimal in some cases (partition)

Page 20: Ranked Information Retrieval on  XML Data

XRANK: Ranked Keyword Search over XML Documents

Page 21: Ranked Information Retrieval on  XML Data

Introduction

XRANK – keyword search over XML documents

• results: XML elements that contain all searched keywords

• ranking: at the granularity of XML elements, based on hyperlink structure

• advantages:

– user does not have to learn a query language

– no knowledge about the structure of the XML documents is needed

• generalized keyword search engine (both HTML and XML are possible)

Page 22: Ranked Information Retrieval on  XML Data

Data Model

• G = (V, CE, HE): collection of XML documents

• V: set of XML elements (tags and attributes)

• CE: set of containment edges

• HE: set of hyperlink edges

• (u,v) ∈ CE ⇔ v is a sub-element of u

• (u,v) ∈ HE ⇔ u contains a hyperlink to v

• contains(v,k) ⇔ v directly or indirectly contains the keyword k

Page 23: Ranked Information Retrieval on  XML Data

Example: XML Graph

[Figure: example XML graph; legend: XML element, value.]

Page 24: Ranked Information Retrieval on  XML Data

Keyword Query Results (1)

How to define results of keyword search queries over XML documents?

• elements with at least one sub-element containing all keywords and at least one sub-element containing some keywords

• elements that contain all keywords, where no sub-element contains all keywords

Page 25: Ranked Information Retrieval on  XML Data

Ranking Elements

How to rank XML elements?

• extension of PageRank at the granularity of elements

• objective importance of XML elements

• based on the hyperlinked and nested structure of XML elements

→ ElemRank

Page 26: Ranked Information Retrieval on  XML Data

ElemRank (1)

• n: number of XML elements

• n_c(u): number of sub-elements of u

• n_h(u): number of outgoing hyperlinks from u

• CE^{-1} = {(v,u) | (u,v) ∈ CE}: “reverse containment edges”

• E = HE ∪ CE ∪ CE^{-1}

[Figure: example element u with n_c(u) = 3 and n_h(u) = 3; legend: containment edge, reverse containment edge, hyperlink edge.]

Page 27: Ranked Information Retrieval on  XML Data

ElemRank (2)

• d_1: probability of following a hyperlink

• d_2: probability of using a containment edge

• d_3: probability of using a reverse containment edge

• ε = 1 - d_1 - d_2 - d_3: probability of a random jump

[Figure: example with n = 10 elements around an element with 3 hyperlink edges, 3 containment edges and 1 reverse containment edge; the edges are labeled d_1/3 + ε/10, d_2/3 + ε/10 and d_3/1 + ε/10 respectively, and elements reachable only by a random jump are labeled ε/10. Legend: containment edge, reverse containment edge, hyperlink edge.]

Page 28: Ranked Information Retrieval on  XML Data

ElemRank (3)

$$ e(v) = (1 - d_1 - d_2 - d_3) \cdot \frac{1}{n} + d_1 \cdot \sum_{(u,v) \in HE} \frac{e(u)}{n_h(u)} + d_2 \cdot \sum_{(u,v) \in CE} \frac{e(u)}{n_c(u)} + d_3 \cdot \sum_{(u,v) \in CE^{-1}} \frac{e(u)}{1} $$

(0 ≤ d_1, d_2, d_3 ≤ 1)

– first term: random navigation

– second term: navigation via hyperlinks

– third term: navigation via forward containment edges

– fourth term: navigation via reverse containment edges
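
The ElemRank recurrence can be evaluated by repeated substitution, in the same way as PageRank. The following sketch makes several assumptions: the names `d1`, `d2`, `d3` denote the probabilities of following a hyperlink, a containment edge and a reverse containment edge, the default values are arbitrary illustration values, and the function name `elem_rank` is invented here.

```python
def elem_rank(elements, containment, hyperlinks,
              d1=0.35, d2=0.25, d3=0.25, iterations=100):
    """Iterate e(v) = (1-d1-d2-d3)/n + d1*sum_HE e(u)/n_h(u)
                      + d2*sum_CE e(u)/n_c(u) + d3*sum_CE^-1 e(u).

    containment: list of (parent, child) edges, i.e. CE
    hyperlinks:  list of (source, target) edges, i.e. HE
    """
    n = len(elements)
    children = {u: [] for u in elements}
    links = {u: [] for u in elements}
    for parent, child in containment:
        children[parent].append(child)
    for source, target in hyperlinks:
        links[source].append(target)

    e = {v: 1.0 / n for v in elements}
    for _ in range(iterations):
        new_e = {v: (1 - d1 - d2 - d3) / n for v in elements}
        for u in elements:
            for v in links[u]:                 # hyperlink edge (u, v)
                new_e[v] += d1 * e[u] / len(links[u])
            for v in children[u]:
                # forward containment edge (u, v)
                new_e[v] += d2 * e[u] / len(children[u])
                # reverse containment edge (v, u): every child has one parent
                new_e[u] += d3 * e[v]
        e = new_e
    return e


scores = elem_rank(
    elements=["root", "a", "b"],
    containment=[("root", "a"), ("root", "b")],
    hyperlinks=[("a", "b")],
)
print(scores)
```

In the tiny example, elements a and b are siblings, but b is additionally the target of a hyperlink from a and therefore receives the higher ElemRank.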

Page 29: Ranked Information Retrieval on  XML Data

Ranking Function (1)

• ranking functions should take into account:

– result specificity

– hyperlinks (ElemRank, based on the hyperlinked structure)

– keyword proximity

• result specificity:

contains(v,k) ⇒ ∃ sequence (v_1,v_2), ..., (v_{n-1},v_n) of containment edges with v_1 = v s.t. v_n directly contains k

$$ r(v,k) = \mathrm{ElemRank}(v_n) \cdot \mathrm{decay}^{\,n-1} \qquad (0 \le \mathrm{decay} \le 1) $$

Page 30: Ranked Information Retrieval on  XML Data

Ranking Function (2)

• m occurrences of keyword k → computation of r_1, ..., r_m

r*(v,k) = f(r_1, ..., r_m)

(with accumulation function f, e.g. max or sum)

• query q consists of keywords k_1, ..., k_n

$$ R(v,q) = \Big( \sum_{1 \le i \le n} r^*(v,k_i) \Big) \cdot p(v,k_1,\dots,k_n) $$

p = keyword proximity measure
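
A small worked example may help to see how the pieces of the ranking function combine. The sketch below assumes that each keyword occurrence is described by the ElemRank of the element that directly contains it and by its depth below v (the exponent n-1), that the per-keyword ranks are summed over the query keywords, and that the proximity measure p is supplied as a plain number; all names and values are illustrative.

```python
def keyword_rank(elem_rank_of_hit, depth_below_v, decay):
    """r(v,k) = ElemRank(v_n) * decay^(n-1) for one occurrence of k."""
    return elem_rank_of_hit * decay ** depth_below_v


def overall_rank(occurrences_per_keyword, proximity=1.0, combine=max, decay=0.75):
    """R(v,q): combine the occurrences per keyword, sum over keywords, scale by p.

    occurrences_per_keyword maps each keyword to a list of
    (ElemRank of the containing element, depth below v) pairs.
    """
    total = 0.0
    for occurrences in occurrences_per_keyword.values():
        total += combine(keyword_rank(e, d, decay) for e, d in occurrences)
    return total * proximity


# Hypothetical occurrences of two keywords below some element v.
hits = {
    "rem": [(80, 1)],                 # one level below v
    "religion": [(88, 2), (40, 0)],   # two levels below v, and in v itself
}
print(overall_rank(hits, decay=0.5))               # f = max
print(overall_rank(hits, decay=0.5, combine=sum))  # f = sum
```

With decay = 0.5 and f = max, the keyword ranks are 80 · 0.5 = 40 and max(88 · 0.25, 40) = 40, so R = 80; with f = sum the second keyword contributes 22 + 40 = 62 and R = 102.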

Page 31: Ranked Information Retrieval on  XML Data

<CDs>
  <CD id="1">
    <title> R.E.M. – Out Of Time </title>
    <song> <title> Radio Song </title> <time> 4:12 </time> </song>
    <song> <title> Losing My Religion </title> <time> 4:26 </time> </song>
    ...
  </CD>
  <CD id="2">
    <title> R.E.M. – Automatic For... </title>
    ...
  </CD>
  ...
</CDs>

Page 32: Ranked Information Retrieval on  XML Data

XRANK Architecture

[Figure: XRANK architecture. Components: XML documents, ElemRank computation, XML elements with ElemRanks, index structures & algorithms, Query Evaluator (data access). Input: keyword search query. Output: ranked result list.]

Page 33: Ranked Information Retrieval on  XML Data

Naïve Approach

• naïve inverted list: contains all XML elements that contain the keyword

key1: elem11, elem12, ...
key2: elem21, elem22, ...
etc.

• problems: space overhead, spurious results, inaccurate ranking

Page 34: Ranked Information Retrieval on  XML Data

Dewey IDs

0 <CDs>
  0.0 <CD>
    0.0.0 <title> R.E.M. – Out Of Time
    0.0.1 <song>
      0.0.1.0 <title> Radio Song
      0.0.1.1 <time> 4:12
    0.0.2 <song>
      0.0.2.0 <title> Losing My Religion
      0.0.2.1 <time> 4:26
    ...
  0.1 <CD>
    0.1.0 <title> R.E.M. – Automatic For The People
    ...
  ...
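
Dewey IDs of this kind can be assigned with a short recursive traversal: the ID of a child is the ID of its parent followed by the child's position among its siblings. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `assign_dewey_ids` and the abbreviated sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def assign_dewey_ids(element, prefix="0"):
    """Yield (dewey_id, element) pairs in document order."""
    yield prefix, element
    for position, child in enumerate(element):
        # The child ID extends the parent ID by the sibling position.
        yield from assign_dewey_ids(child, f"{prefix}.{position}")


doc = ET.fromstring(
    "<CDs>"
    "<CD><title>R.E.M. - Out Of Time</title>"
    "<song><title>Radio Song</title><time>4:12</time></song>"
    "<song><title>Losing My Religion</title><time>4:26</time></song></CD>"
    "<CD><title>R.E.M. - Automatic For The People</title></CD>"
    "</CDs>"
)
ids = dict(assign_dewey_ids(doc))
for dewey_id, elem in ids.items():
    print(dewey_id, elem.tag, (elem.text or "").strip())
```

A useful property of this numbering is that the ID of every ancestor is a prefix of the ID of its descendants, so 0.0.2.0 lies below 0.0.2, 0.0 and 0.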

Page 35: Ranked Information Retrieval on  XML Data

DIL – Data Structure

• Dewey inverted list:

– contains the Dewey IDs of all XML elements that directly contain the keyword

– sorted by Dewey ID (ascending)

Keyword    Dewey ID   ElemRank   Position list
R.E.M.     0.0.0      75         [0]
R.E.M.     0.1.0      80         [0]
Religion   0.0.2.0    88         [2]
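
The structure of a Dewey inverted list can be sketched with a dictionary from keywords to posting lists. The input format (tuples of Dewey ID, text and ElemRank), the whitespace tokenization and the function name `build_dil` are assumptions; the ElemRank values are simply taken from the example table.

```python
from collections import defaultdict


def build_dil(elements):
    """Build keyword -> [(dewey_id, elem_rank, position_list), ...].

    elements: iterable of (dewey_id, text, elem_rank) for the elements
    that directly contain text; dewey_id is a tuple of integers.
    """
    index = defaultdict(list)
    for dewey_id, text, elem_rank in elements:
        positions = defaultdict(list)
        for pos, word in enumerate(text.lower().split()):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            index[word].append((dewey_id, elem_rank, pos_list))
    for postings in index.values():
        postings.sort(key=lambda entry: entry[0])   # ascending Dewey ID
    return dict(index)


dil = build_dil([
    ((0, 1, 0), "R.E.M. Automatic For The People", 80),
    ((0, 0, 2, 0), "Losing My Religion", 88),
    ((0, 0, 0), "R.E.M. Out Of Time", 75),
])
print(dil["r.e.m."])
print(dil["religion"])
```

Storing Dewey IDs as integer tuples makes the ascending sort order agree with document order, which a string comparison would not guarantee once a component reaches 10.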

Page 36: Ranked Information Retrieval on  XML Data

DIL – Query Processing (1)

• key idea: computation of the longest common prefix (lcp) of Dewey IDs

[Figure: Dewey stack after step 1; columns: DeweyID, rank[1], rank[2], posList[1], posList[2], pot_result (y/n).]

Page 37: Ranked Information Retrieval on  XML Data

DIL – Query Processing (2)

[Figure: Dewey stacks after steps 1 and 2, with the longest common prefix (lcp) marked; columns: DeweyID, rank[1], rank[2], posList[1], posList[2], pot_result (y/n).]

Page 38: Ranked Information Retrieval on  XML Data

DIL – Query Processing (3)

[Figure: Dewey stacks after steps 1, 2 and 3, with the longest common prefixes (lcp) marked; columns: DeweyID, rank[1], rank[2], posList[1], posList[2], pot_result (y/n).]
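
The longest common prefix of two Dewey IDs identifies the deepest element that contains both of them. The sketch below shows this for two posting lists; it is deliberately naive. It compares all pairs instead of merging the sorted lists in a single pass with a Dewey stack, and it keeps only the deepest common ancestors, which is a simplification of the result definition given earlier. The function names are assumptions.

```python
def longest_common_prefix(a, b):
    """Return the longest common prefix of two Dewey IDs (tuples of ints)."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def deepest_common_ancestors(list1, list2):
    """Naive pairwise variant: elements containing one hit from each list
    that have no descendant with the same property."""
    candidates = {longest_common_prefix(a, b) for a in list1 for b in list2}
    candidates.discard(())   # empty prefix: the hits are in different documents
    return sorted(
        c for c in candidates
        if not any(o != c and o[:len(c)] == c for o in candidates)
    )


rem_postings = [(0, 0, 0), (0, 1, 0)]      # Dewey IDs containing "R.E.M."
religion_postings = [(0, 0, 2, 0)]         # Dewey IDs containing "Religion"
print(deepest_common_ancestors(rem_postings, religion_postings))
```

For the CD example the only remaining element is 0.0, the first <CD>: it contains “R.E.M.” in its title and “Religion” in one of its songs.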

Page 39: Ranked Information Retrieval on  XML Data

RDIL – Data Structure

• ranked Dewey inverted list:

– each Dewey ID in the list has a position in the B+-tree

– B+-tree sorted by Dewey ID (ascending)

– inverted list sorted by ElemRank (descending)

Keyword   Dewey ID   ElemRank
R.E.M.    0.1.0      80
R.E.M.    0.0.0      75

B+-tree on Dewey IDs: 0.0.0, 0.1.0, …

Page 40: Ranked Information Retrieval on  XML Data

RDIL – Query Processing (1)

[Figure: inverted lists for key1, key2 and key3 with entries entry11 … entry33, each list sorted by ElemRank and each with a B+-tree on Dewey IDs; labels: “lcp with Dewey ID 11”, “result heap”.]

Page 41: Ranked Information Retrieval on  XML Data

RDIL – Query Processing (2)

[Figure: inverted lists for key1, key2 and key3 with entries entry11 … entry33, each list sorted by ElemRank and each with a B+-tree on Dewey IDs; labels: “lcp with Dewey ID 21”, “result heap”, “etc.”.]

Page 42: Ranked Information Retrieval on  XML Data

RDIL – Query Processing (3)

[Figure: inverted lists for key1, key2 and key3 with entries entry11 … entry33, each list sorted by ElemRank and each with a B+-tree on Dewey IDs.]

∑ ranking = threshold Ω

max. reachable ranking ≤ Ω

Page 43: Ranked Information Retrieval on  XML Data

RDIL – Query Processing (4)

The RDIL algorithm stops if

threshold Ω < lowest ElemRank in result heap

because

max. reachable ranking ≤ Ω < lowest ElemRank in result heap

⇒ max. reachable ranking < lowest ElemRank in result heap
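
The stopping rule can be illustrated with a threshold-style top-k loop. The sketch below is a strong simplification and not the RDIL algorithm itself: it matches identical element IDs instead of computing longest common prefixes of Dewey IDs, it uses a dictionary lookup where RDIL probes a B+-tree, it scores a result by the sum of its ranks, and it assumes that Ω is the sum of the ranks last read from each list. The function name `rdil_top_k` is hypothetical.

```python
import heapq


def rdil_top_k(ranked_lists, k):
    """ranked_lists: one list per keyword of (element, rank), sorted by rank descending."""
    if any(not lst for lst in ranked_lists):
        return []
    lookups = [dict(lst) for lst in ranked_lists]   # stand-in for the B+-trees
    last_seen = [lst[0][1] for lst in ranked_lists]
    heap, seen = [], set()                          # min-heap of (score, element)
    for pos in range(max(len(lst) for lst in ranked_lists)):
        for i, lst in enumerate(ranked_lists):      # choose the next list sequentially
            if pos >= len(lst):
                continue
            element, rank = lst[pos]
            last_seen[i] = rank
            if element not in seen:
                seen.add(element)
                # Only elements present in every list contain all keywords.
                if all(element in lookup for lookup in lookups):
                    score = sum(lookup[element] for lookup in lookups)
                    heapq.heappush(heap, (score, element))
                    if len(heap) > k:
                        heapq.heappop(heap)
            # No unseen element can score higher than the threshold.
            threshold = sum(last_seen)
            if len(heap) == k and threshold < heap[0][0]:
                return sorted(heap, reverse=True)
    return sorted(heap, reverse=True)


list1 = [("a", 90), ("b", 80), ("c", 10), ("d", 5)]
list2 = [("b", 85), ("a", 70), ("d", 8), ("c", 6)]
print(rdil_top_k([list1, list2], k=1))
```

In the example the loop stops after two entries per list: the threshold has dropped to 80 + 70 = 150, which is below the best score found so far (165 for element b), so the low-ranked entries c and d are never read.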

Page 44: Ranked Information Retrieval on  XML Data

XRANK Architecture

[Figure: XRANK architecture with DIL / RDIL as index structure. Components: XML documents, ElemRank computation, XML elements with ElemRanks, DIL / RDIL, Query Evaluator (data access). Input: keyword search query. Output: ranked result list.]

Page 45: Ranked Information Retrieval on  XML Data

Experimental Results (1)

[Figure: high keyword correlation; execution time (sec., 0 to 1.2) of DIL and RDIL versus number of keywords (1 to 5).]

Page 46: Ranked Information Retrieval on  XML Data

Experimental Results (2)

[Figure: low keyword correlation; execution time (sec., 0 to 2) of DIL and RDIL versus number of keywords (1 to 5).]

Page 47: Ranked Information Retrieval on  XML Data

Comparison DIL – RDIL

DIL:

• inverted lists sorted by Dewey ID

• computes the longest common prefix on Dewey IDs

• extracts the minimum of all remaining Dewey IDs

• all lists are completely scanned

• outperforms RDIL if keyword correlation is low

RDIL:

• inverted lists sorted by ElemRank

• chooses the next list sequentially

• stops if a certain threshold is reached

• outperforms DIL if keyword correlation is high

Page 48: Ranked Information Retrieval on  XML Data

Conclusion

2 approaches

ELIXIR:

– SQL-like, structure-based search

– extends XML-QL by supporting IR similarity features for ranking

– ranked results based only on textual similarity (even between 2 variables)

XRANK:

– keyword-based search à la Google

– ranked results based on textual similarity

– hierarchical and hyperlinked structure