a ranking scheme for xml information retrieval based on benefit and reading effort

A Ranking Scheme for XML Information Retrieval

Based on Benefit and Reading Effort

Toshiyuki Shimizu (Kyoto University)Masatoshi Yoshikawa (Kyoto University)

ICADL 2007 12th December

2

XML-IR systems Growing demand for XML Information

Retrieval (XML-IR) Systems We can identify meaningful document

fragments by encoding documents in XML ex) Sections, subsections and paragraphs

in scholarly articles Browsing only document fragments relevant to a

certain topic The most simple form of queries for XML-IR is

just a set of keywords Simple, intuitively understandable, yet useful form

of queries, especially for unskilled end-users Active research area as in INEX*

* INitiative for the Evaluation of XML Retrieval (http://inex.is.informatik.uni-duisburg.de/)

3

Results of XML-IR Systems Document fragment (element)

With relevance degree (Score) ex) Query term was “XML”

<?xml version="1.0"?><article> <sec> XML labeling The structure of XML is a

tree, and each node in the XML is labeled.

We can get tag name of each XML element.

</sec> <sec> Tree index XML index is constructed

using the labels </sec></article>

Score

articlee0

pe4

pe3

sec

e1

sec

e5

pe2

e6 e7pp

0.56

0.64 0.35

0.9 0.800.4 0.33

4

Naïve XML-IR System

e3 (0.9)e7 (0.8)e1 (0.64)e0 (0.56)e2 (0.4)e5 (0.35)e4 (0.33)

Thorough strategy of INEX 2005 Simply retrieves relevant elements from all

elements and ranks them in order of relevanceScore

articlee0

pe4

pe3

sec

e1

sec

e5

pe2

e6 e7pp

0.56

0.64 0.35

0.9 0.800.4 0.33

Thorough is considered for system evaluation User behavior of browsing search results must be

considered

5

Problems of Thorough Retrieval for XML-IR Nesting elements

Browsing both elements is useless Ancestor element ea Descendant element ed

ed has been fully seen Descendant element ed Ancestor element ea

ea has been partially seen before

Element size Elements retrieved by XML-IR systems varies widely

in size Large element, such as article (whole document) Small element, such as p (paragraph)

Total output size of top-k elements is uncontrollable by simply giving an integer k

6

Overview of our Approach Introduction of the concepts of benefit and

reading effort Users can control the total output size Systems can retrieve non-overlapping elements

7

Properties of Benefit and Reading Effort (1/2) Benefit

The benefit of an element is the amount of gain about the query by reading the element

Assumption 1: The benefit of an element is greater than or equal to the sum of the benefit of the child elements

Information complementation among sibling elements

ex) For two query terms A and Be6 contains topics about A e7 contains topics about B The benefit of e5 seems to be greater than the sum of benefit of e6 and e7

sec

e5

e6 e7pp

8

Properties of Benefit and Reading Effort (2/2) Reading Effort

The reading effort of an element is the amount of cost by reading the content of the element

Assumption 2: The reading effort of an element is less than or equal to the sum of the reading effort of the child elements

Readability of continuous readingex) Users can read the same content

more easily by reading e5 rather than separate e6 and e7

sec

e5

e6 e7pp

:e7:e6

:e5:

9

Overview of our Approach Introduction of the concepts of benefit and

reading effort Users can control the total output size Systems can retrieve non-overlapping elements

Flexible retrieval Users specify a threshold for the total amount of

reading effort The systems return relevant elements that provide

larger benefit and that can be read within specified reading effort

10

Flexible Retrieval Systems calculate benefit and reading effort A variant of knapsack problems

ex) Threshold of reading effort : 15

articlee0

pe4

pe3

sec

e1

sec

e5

pe2 e6 e7

pp

2818

238

52

109

155

5028

150

108

Effort ReadingBenefit

Retrieve {e2, e3} (Total benefit : 11)

11

Flexible Retrieval Systems calculate benefit and reading effort A variant of knapsack problems

ex) Threshold of reading effort : 20

articlee0

pe4

pe3

sec

e1

sec

e5

pe2 e6 e7

pp

2818

238

52

109

155

5028

150

108

Effort ReadingBenefit

Retrieve {e3, e7} (Total benefit : 17)

Search Result Continuity

The running example violate search result continuity The content of element set for reading effort r must be contained

in the content of element set for reading effort r’ if r <= r’

The optimal solution is NP-hard (A variant of knapsack problems) may violate search result continuity

Greedy retrieval algorithm12

articlee0

pe4

pe3

sece1

sece5

pe2 e6 e7

pp

2818

238

52

109

155

5028

150

108

ex) reading effort : 15 Retrieve {e2, e3}

(benefit : 11) reading effort : 20

Retrieve {e3, e7}(benefit : 17)

13

Retrieval Algorithm Based on the result of Thorough strategy*

Adjust benefit and reading effort for nesting elements of retrieved element, and rerank

Remove overlapping contents by nestings

e3 (0.9)e7 (0.8)e1 (0.64)e0 (0.56)e2 (0.4)e5 (0.35)e4 (0.33)

* Simply retrieves relevant elements from all elements and ranks them in order of relevance

articlee0

pe4

pe3

sec

e1

sec

e5

pe2 e6 e7

pp

2818

238

52

109

155

5028

150

108

0.56

0.64 0.35

0.9 0.800.4

0.33

Result of Thorough

14

Retrieval Algorithm

articlee0

pe4

pe3

sec

e1

sec

e5

pe2 e6 e7

pp

2818

238

52

109

155

5028

150

108

0.56

0.64 0.35

0.9 0.800.4

0.33

Result of Thorough

e3 (0.9)e3 (0.9)e7 (0.8)e1 (0.64)e0 (0.56)e2 (0.4)e5 (0.35)e4 (0.33) 18

9

0.5

4019

0.48

e1 (0.5)e0 (0.48) Adjust e1 , e0

Amount of benefit : 9Amount of reading effort : 10

Our result

Threshold of reading effort : 40

15

Retrieval Algorithm

articlee0

pe4

pe3

sec

e1

sec

e5

pe2 e6 e7

pp

238

52

109

155 15

010

8

0.35

0.9 0.800.4

0.33

e3 (0.9)e3 (0.9)e7 (0.8)

e2 (0.4)e5 (0.35)e4 (0.33) 13

0

0

3011

0.37

e1 (0.5)e0 (0.48)

e7 (0.8)

4019

0.48

189

0.5

e5 (0)

e0 (0.37)Adjust and rerank e5 , e0


Result of Thorough Our result



16

Retrieval Algorithm

articlee0

pe4

pe3

sec

e1

sec

e5

pe2 e6 e7

pp

52

109

155 15

010

8

0.9 0.800.4

0.33

e3 (0.9)e3 (0.9)e7 (0.8)

e2 (0.4)

e4 (0.33)

122

0.17

e1 (0.5)e7 (0.8)

189

0.5e5 (0)

e0 (0.37)

e1 (0.5)30

11

0.37e0 (0.17) Adjust and rerank e0




130

0


17

Retrieval Algorithm

articlee0

pe4

pe3

sec

e1

sec

e5

pe2 e6 e7

pp

52

109

155 15

010

8

0.9 0.800.4

0.33

e3 (0.9)e7 (0.8)

e2 (0.4)e4 (0.33)

e1 (0.5)

e7 (0.8)

189

0.5e5 (0)

e1 (0.5)

e0 (0.17)




130

0

122

0.17


18

Retrieval Algorithm

articlee0

pe4

pe3

sec

e1

sec

e5

pe2 e6 e7

pp

52

109

155 15

010

8

0.9 0.800.4

0.33

e3 (0.9)e7 (0.8)

e2 (0.4)e4 (0.33)

e1 (0.5)

e7 (0.8)

189

0.5e5 (0)

e1 (0.5)

e0 (0.17)




130

0

122

0.17


19

Retrieval Algorithm

articlee0

pe4

pe3

sec

e1

sec

e5

pe2 e6 e7

pp

52

109

155 15

010

8

0.9 0.800.4

0.33

e3 (0.9)e7 (0.8)

e2 (0.4)e4 (0.33)

e1 (0.5)

e7 (0.8)

189

0.5e5 (0)

e1 (0.5)

e0 (0.17)

Our result


Result of Thorough


130

0

122

0.17


20

Evaluation Metrics Based on benefit and reading effort

b/e graph (benefit/effort graph)

Comparison with BTIL (Best Thorough Input List) BTIL system is the system which use actual benefit and

reading effort Actual benefit is calculated using manually constructed

assessments (e.g. INEX) We can observe relative effectiveness of benefit

changing the specified threshold of reading effort Use the same values for reading effort between

implemented system and BTIL system

21

Implemented system retrieves {e3, e7}Obtained actual benefit is 10

BTIL system retrieves {e3, e6} Obtained actual benefit is 23

For the threshold value 30 of reading effort

articlee0

pe4

pe3

sece1

sece5

pe2 e6 e7

pp

2818

238

52

109

155

5028

150

108

articlee0

pe4

pe3

sece1

sece5

pe2 e6 e7

pp

2820

2313

50

1010

1510

5035

1513

100

Calculated benefit / reading effort Actual benefit / reading effort

22

Examples of b/e Graph using INEX 2005 Test Collection (1/2) XML document set, Topics, Assessments Calculate actual benefit and reading effort from

Assessments

ex (Exhaustivity): Highly exhaustive (HE) 1 Partially exhaustive (PE) 0.5 Not exhaustive(NE) 0

rsize: relevant text length (in number of characters) size: element length (in number of characters)

We implemented a system using tf-ief ief stands for inverse element frequency satisfies Assumptions for benefit and reading effort

rsizeexbenefite *. sizeeffortreadinge _.

,, : parameter

23

Topic 207 Topic 206

Examples of b/e Graph using INEX 2005 Test Collection (2/2)

We can observe relative effectiveness of implemented systems against BTIL system

24

Conclusions and Future Works Conclusions

Introduction of benefit and reading effort Handling nesting elements Variety of element size

Algorithm for flexible retrieval Result elements change depending on the specified

reading effort System evaluation

Future Works Introduction of switching effort

Cost of switching a result item in the results list Retrieving numerous results increases the cost of

browsing Integration with user interface

a ranking scheme for xml information retrieval based on benefit and reading effort

Documents