xml retrieval

XML RETRIEVAL Tarık Teksen Tutal 21.07.2011


Page 1: XML Retrieval

XML RETRIEVAL

Tarık Teksen Tutal, 21.07.2011

Page 2: XML Retrieval

INFORMATION RETRIEVAL

XML (Extensible Markup Language)

XQuery

Text Centric vs Data Centric

Page 3: XML Retrieval

BASIC XML CONCEPTS

Page 4: XML Retrieval

XML

Ordered, Labeled Tree

XML Element

XML Attribute

XML DOM (Document Object Model): Standard for accessing and processing XML documents.

Page 5: XML Retrieval

XML STRUCTURE

An Example:

Page 6: XML Retrieval

XML DOM OBJECT

XML DOM Object of the Sample in the Previous Slide

Nodes in a Tree

Parse the Tree Top Down
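As a minimal sketch of a top-down traversal, the Python snippet below parses a small document with the standard-library `xml.dom.minidom` module and visits each element before its children. The sample document and the helper name `walk` are assumptions for illustration; the example from the original slide is not available in this transcript.

```python
from xml.dom import minidom

# Hypothetical sample document, used only for illustration.
SAMPLE = """<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail</verse>
    </scene>
  </act>
</play>"""


def walk(node, depth=0):
    """Yield (depth, tag name, attributes) for every element, parent first."""
    if node.nodeType == node.ELEMENT_NODE:
        yield depth, node.tagName, dict(node.attributes.items())
        for child in node.childNodes:
            # Text and whitespace nodes are skipped by the nodeType check.
            yield from walk(child, depth + 1)


doc = minidom.parseString(SAMPLE)
elements = list(walk(doc.documentElement))
for depth, tag, attrs in elements:
    print("  " * depth + tag, attrs)
```

Each element is reported before its descendants, so the output mirrors the ordered, labeled tree, with attributes attached to the element that carries them.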

Page 7: XML Retrieval

XPATH

Standard for enumerating paths in an XML document collection

Query language for selecting nodes from an XML document

Defined by the World Wide Web Consortium (W3C)
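To illustrate node selection with path expressions, the sketch below uses Python's standard-library `xml.etree.ElementTree`, which supports only a limited subset of XPath. The sample document is a hypothetical one chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
SAMPLE = """<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail</verse>
    </scene>
  </act>
</play>"""

root = ET.fromstring(SAMPLE)

# ".//title" selects every title element at any depth below the root.
all_titles = [e.text for e in root.findall(".//title")]

# "./act/scene/title" selects only titles reached through act and scene.
scene_titles = [e.text for e in root.findall("./act/scene/title")]

# A predicate on an attribute value restricts the selected elements.
first_act = root.findall("./act[@number='I']")

print(all_titles)
print(scene_titles)
print(len(first_act))
```

A full XPath engine offers more axes and functions than this subset; the point here is only that a path expression names a set of nodes by their position in the tree.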

Page 8: XML Retrieval

SCHEMA

Puts Constraints on the Structure of Allowable XML

Two Standards for Schemas:

XML DTD (Document Type Definition)

XML Schema

Page 9: XML Retrieval

CHALLENGES IN XML RETRIEVAL

Page 10: XML Retrieval

STRUCTURED DOCUMENT RETRIEVAL PRINCIPLE

A system should always retrieve the most specific part of a document answering the query

In a «Cookbook» collection, if a user queries «Apple Pie», the system should return the relevant chapter, «Apple Pie», of the book «Apple Desserts», not the entire book.

In the same example, however, if the user queries «Apple», the whole book should be returned rather than a single chapter.

Page 11: XML Retrieval

INDEXING UNIT

Unstructured:

Files on PC, Pages on the Web, E-Mail Messages etc.

Structured:

Non-Overlapping Pseudodocuments

Top-Down

Bottom-Up

Index All the Elements

Page 12: XML Retrieval

INDEXING UNIT

Non-Overlapping Pseudodocuments

Not Coherent: the pseudodocuments may not make sense to the user as self-contained units

Page 13: XML Retrieval

INDEXING UNIT

Top-Down

Start with one of the largest units (e.g., book in a book collection)

Postprocess the search results to find, for each book, the subelement that is the best hit.

Can fail to return the best element, since the relevance of a book is generally not a good predictor of the relevance of its subelements.

Page 14: XML Retrieval

INDEXING UNIT

Bottom-Up

Search all leaves and select the relevant ones

Extend them to larger units in postprocessing

Can fail to return the best element, since the relevance of a subelement is generally not a good predictor of the relevance of larger units.

Page 15: XML Retrieval

INDEXING UNIT

Index All the Elements

Not Useful to Index Some Elements (e.g., ISBN)

Creates redundancy (Deeper Level Elements are Returned Several Times)
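The redundancy can be made concrete with a small sketch that treats every element as its own indexing unit. The cookbook document, the placeholder ISBN, and the function name `indexing_units` are assumptions for illustration, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical cookbook document; the ISBN value is a placeholder.
SAMPLE = """<book>
  <title>Apple Desserts</title>
  <isbn>0000000000</isbn>
  <chapter>
    <title>Apple Pie</title>
    <para>Peel the apples</para>
  </chapter>
</book>"""


def indexing_units(root):
    """Return (path, text) for every element, treating each as its own unit."""
    units = []

    def visit(node, parent_path):
        path = f"{parent_path}/{node.tag}"
        # itertext() yields the text of the element and all its descendants.
        text = " ".join(" ".join(node.itertext()).split())
        units.append((path, text))
        for child in node:
            visit(child, path)

    visit(root, "")
    return units


units = indexing_units(ET.fromstring(SAMPLE))
for path, text in units:
    print(path, "->", text)

# The same sentence is indexed once per enclosing element.
print(sum("Peel" in text for _, text in units), "units contain 'Peel'")
```

The paragraph text appears in the unit for the paragraph, the chapter, and the book, so a query matching it can return all three nested elements.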

Page 16: XML Retrieval

NESTED ELEMENTS

To Get Rid of Redundancy:

Discard All Small Elements

Discard All Element Types that Users do not Look at (Based on the Logs of a Working XML Retrieval System)

Discard All Element Types that Assessors Generally do not Judge to be Relevant (If Relevance Assessments are Available)

Only Keep Element Types that a System Designer or Librarian has Deemed to be Useful Search Results

Page 17: XML Retrieval

NESTED ELEMENTS

Remove Nested Elements in a Postprocessing Step

Collapse Several Nested Elements in the Results List and then Highlight Results
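One simple way to remove nested elements in postprocessing is sketched below: walk the ranked list and keep an element only if no higher-ranked kept element from the same document contains it or is contained in it. The result format `(doc_id, path, score)` and the function names are assumptions for illustration; real systems may use other policies, such as keeping the highest-level element.

```python
def is_nested(path_a, path_b):
    """True if one path is an ancestor of (or equal to) the other."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    shorter = min(len(a), len(b))
    return a[:shorter] == b[:shorter]


def remove_nested(ranked):
    """Keep the best-ranked element of each chain of nested elements.

    ranked: list of (doc_id, path, score), sorted by decreasing score.
    Paths carry positional indices so that siblings stay distinguishable.
    """
    kept = []
    for doc_id, path, score in ranked:
        overlaps = any(
            doc_id == kept_doc and is_nested(path, kept_path)
            for kept_doc, kept_path, _ in kept
        )
        if not overlaps:
            kept.append((doc_id, path, score))
    return kept


# Hypothetical ranked result list.
ranked = [
    ("d1", "/book/chapter[2]/section[1]", 0.9),
    ("d1", "/book/chapter[2]", 0.8),   # contains the first hit
    ("d1", "/book/chapter[3]", 0.5),
    ("d2", "/book", 0.4),
    ("d1", "/book", 0.3),              # contains the first hit
]
filtered = remove_nested(ranked)
for hit in filtered:
    print(hit)
```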

Page 18: XML Retrieval

VECTOR SPACE MODEL FOR XML RETRIEVAL

Page 19: XML Retrieval

LEXICALIZED SUBTREES

Goal: have each dimension of the vector space encode a word together with its position within the XML tree

Map XML documents to lexicalized subtrees

Take each text node (leaf) and break it into multiple nodes, one for each word.

E.g. split Bill Gates into Bill and Gates

Define the dimensions of the vector space to be lexicalized subtrees of documents – subtrees that contain at least one vocabulary term

Page 20: XML Retrieval

LEXICALIZED SUBTREES

Page 21: XML Retrieval

LEXICALIZED SUBTREES

Queries and documents can be represented as vectors in this space of lexicalized subtrees

Matches can then be computed, for example, by using the Vector Space Formalism

Vector Space Formalism: Unstructured vs Structured Retrieval

Dimensions: Vocabulary Terms (Unstructured) vs Lexicalized Subtrees (Structured)

Page 22: XML Retrieval

DIMENSIONS: TRADEOFF

Dimensionality of Space vs Accuracy of Results

Restrict Dimensions to Vocabulary Terms: Yields a Standard Vector Space Retrieval System, Whose Results May Not Match the Structure of the Query

Separate Lexicalized Dimension for Each Subtree: Dimensionality of the Space Becomes too Large

Page 23: XML Retrieval

DIMENSIONS: COMPROMISE

Index All Paths that End with a Single Vocabulary Term (XML-Context Term Pairs)

Structural Term <c, t>: a pair of XML-context c and vocabulary term t
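A minimal sketch of extracting structural terms from a document is given below. It represents an XML context as the path of element names from the root and lowercases and splits text on whitespace; both choices, the sample document, and the function name `structural_terms` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structural_terms(root):
    """Count <c, t> pairs: c is the element path, t a lowercased token."""
    counts = Counter()

    def visit(node, parent_path):
        path = f"{parent_path}/{node.tag}"
        for token in (node.text or "").lower().split():
            counts[(path, token)] += 1
        for child in node:
            visit(child, path)
            # Text following a child still belongs to the current element.
            for token in (child.tail or "").lower().split():
                counts[(path, token)] += 1

    visit(root, "")
    return counts


# Hypothetical document in which the same words occur in two contexts.
SAMPLE = "<book><title>Julius Caesar</title><author>Julius Caesar</author></book>"
terms = structural_terms(ET.fromstring(SAMPLE))
for (context, term), count in sorted(terms.items()):
    print(context, term, count)
```

The word "caesar" produces two different dimensions here, one for the title context and one for the author context, which is what lets a structured query distinguish them.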

Page 24: XML Retrieval

CONTEXT RESEMBLANCE

Measures the similarity between a path in a query and a path in a document

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd if and only if we can transform cq into cd by inserting additional nodes
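The formula for this slide did not survive in the transcript. The definition below is a reconstruction that uses the symbols defined above and agrees with the example values given for cq4 (3/4 and 3/5):

```latex
\mathrm{CR}(c_q, c_d) =
\begin{cases}
\dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\[1ex]
0 & \text{otherwise}
\end{cases}
```

With this definition, an exact match gives CR = 1, and every node that must be inserted into the query path lowers the score.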

Page 25: XML Retrieval

CONTEXT RESEMBLANCE

CR(cq4, cd2) = 3/4 = 0.75

CR(cq4, cd3) = 3/5 = 0.6
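These values can be reproduced with a short sketch, assuming CR is defined as (1 + |cq|) / (1 + |cd|) when cq matches cd and 0 otherwise, and that "matches" means cq is obtained from cd by deleting nodes. The paths are modeled as lists of node labels; the labels themselves are hypothetical, chosen only so that the path lengths are 2, 3 and 4.

```python
def matches(cq, cd):
    """True if cq can be turned into cd by inserting additional nodes."""
    remaining = iter(cd)
    # Each query node must be found, in order, among the document nodes.
    return all(node in remaining for node in cq)


def context_resemblance(cq, cd):
    """CR(cq, cd) under the assumed definition described above."""
    if matches(cq, cd):
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0


# Hypothetical paths with 2, 3 and 4 nodes.
cq4 = ["book", "Gates"]
cd2 = ["book", "author", "Gates"]
cd3 = ["book", "author", "lastname", "Gates"]

print(context_resemblance(cq4, cd2))
print(context_resemblance(cq4, cd3))
print(context_resemblance(["book", "title"], ["book", "author"]))
```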

Page 26: XML Retrieval

DOCUMENT SIMILARITY MEASURE

Final Score for a Document

Variant of the Cosine Measure

Also called «SimNoMerge»

Not a True Cosine Measure Since Its Value can be Larger than 1.0

Page 27: XML Retrieval

DOCUMENT SIMILARITY MEASURE

V is the vocabulary of non-structural terms

B is the set of all XML contexts

weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively

Standard weighting, e.g. idf_t × wf_{t,d}, where idf_t depends on which elements we use to compute df_t
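The formula that these symbols belong to is missing from the transcript. A reconstruction consistent with the definitions above, and with the standard textbook form of SimNoMerge, is:

```latex
\mathrm{SimNoMerge}(q, d) =
\sum_{c_k \in B} \sum_{c_l \in B} \mathrm{CR}(c_k, c_l)
\sum_{t \in V} \mathrm{weight}(q, t, c_k)\,
\frac{\mathrm{weight}(d, t, c_l)}
     {\sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)}}
```

Only the document vector is length-normalized, and one query context can match several document contexts, which is why the value can exceed 1.0.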

Page 28: XML Retrieval

SIMNOMERGE ALGORITHM

SCOREDOCUMENTSWITHSIMNOMERGE(q, B, V, N, normalizer)
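The body of the algorithm is not included in this transcript. The Python sketch below shows one way such a scoring loop could look, assuming CR is (1 + |cq|) / (1 + |cd|) for matching contexts. The `index` dictionary stands in for a postings lookup and replaces the V argument, since terms are taken from the query; it, the toy data, and all weights are assumptions for illustration.

```python
def matches(cq, cd):
    """True if cq can be turned into cd by inserting additional nodes."""
    remaining = iter(cd)
    return all(node in remaining for node in cq)


def cr(cq, cd):
    """Context resemblance under the assumed definition."""
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0


def score_documents_with_simnomerge(q, B, index, N, normalizer):
    """Score N documents (ids 0..N-1) against a structured query.

    q:          dict mapping (context, term) -> query weight
    B:          iterable of all contexts (tuples of node names)
    index:      dict mapping (context, term) -> list of (doc_id, weight)
    normalizer: per-document normalization factors
    """
    score = [0.0] * N
    for (cq, term), wq in q.items():
        for c in B:
            resemblance = cr(cq, c)
            if resemblance > 0:
                for doc_id, wd in index.get((c, term), []):
                    score[doc_id] += resemblance * wq * wd
    return [s / norm for s, norm in zip(score, normalizer)]


# Hypothetical toy collection with three documents.
B = [("book", "title"), ("book", "chapter", "title"), ("book", "author")]
index = {
    (("book", "title"), "caesar"): [(0, 1.0)],
    (("book", "chapter", "title"), "caesar"): [(1, 2.0)],
    (("book", "author"), "caesar"): [(2, 1.0)],
}
q = {(("book", "title"), "caesar"): 1.0}
normalizer = [1.0, 2.0, 1.0]

scores = score_documents_with_simnomerge(q, B, index, 3, normalizer)
print(scores)
```

Document 0 matches the query context exactly, document 1 matches with one inserted node, and document 2 contains the term only in a non-matching context, so it receives no score.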

Page 29: XML Retrieval

EVALUATION OF XML RETRIEVAL

Page 30: XML Retrieval

INEX

INitiative for the Evaluation of XML Retrieval

Yearly standard benchmark evaluation that has produced test collections (documents, sets of queries, and relevance judgments)

Originally based on a collection of IEEE journal articles; since 2006, INEX has used the much larger English Wikipedia as its test collection

The relevance of documents is judged by human assessors.

Page 31: XML Retrieval

INEX TOPICS

Content Only (CO): Regular Keyword Queries, Like in Unstructured IR

Content and Structure (CAS): Structural Constraints in Addition to Keywords

Relevance Assessments are More Complicated for CAS Topics

Page 32: XML Retrieval

INEX RELEVANCE ASSESSMENTS

INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance

Component Coverage: Evaluates Whether the Element Retrieved is «Structurally» Correct

Topical Relevance

Page 33: XML Retrieval

INEX RELEVANCE ASSESSMENTS

Component Coverage:

Exact coverage (E): The information sought is the main topic of the component and the component is a meaningful unit of information

Too small (S): The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information

Too large (L): The information sought is present in the component, but is not the main topic

No coverage (N): The information sought is not a topic of the component

Topical Relevance: Highly Relevant (3), Fairly Relevant (2), Marginally Relevant (1) and Nonrelevant (0)

Page 34: XML Retrieval

COMBINING THE RELEVANCE DIMENSIONS

Not all combinations are possible, e.g. 3N (highly relevant, yet no coverage) cannot occur

Quantization:
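The quantization function itself is missing from this transcript. The sketch below encodes one commonly cited INEX 2002 quantization as a lookup table from relevance-coverage codes to a single value; the exact values are an assumption and should be checked against the INEX documentation. The function name `quantize` is hypothetical.

```python
# Assumed quantized value for each (topical relevance, coverage) code.
QUANTIZATION = {
    "3E": 1.00,
    "2E": 0.75, "3L": 0.75,
    "1E": 0.50, "2L": 0.50, "2S": 0.50,
    "1S": 0.25, "1L": 0.25,
    "0N": 0.00,
}


def quantize(relevance, coverage):
    """Map a relevance grade (0-3) and coverage letter (E, S, L, N) to a value."""
    code = f"{relevance}{coverage}"
    if code not in QUANTIZATION:
        # Combinations such as 3N are not expected to occur.
        raise ValueError(f"unsupported combination: {code}")
    return QUANTIZATION[code]


print(quantize(3, "E"))
print(quantize(2, "S"))
```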

Page 35: XML Retrieval

INEX EVALUATION MEASURES

Precision and Recall can be applied

Sum Grades vs Binary Relevance

Overlap is not accounted for: nested elements can appear in the same search result

Recent INEX focus: develop algorithms and evaluation measures that return non-redundant results lists and evaluate them properly.
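The difference between summing grades and using binary relevance can be sketched as follows, assuming each retrieved component already has a quantized grade between 0 and 1. The grade values and function names are hypothetical.

```python
def graded_precision(grades):
    """Precision where each retrieved component contributes its grade."""
    return sum(grades) / len(grades)


def binary_precision(grades):
    """Precision where any component with a nonzero grade counts as relevant."""
    return sum(1 for g in grades if g > 0) / len(grades)


# Hypothetical quantized grades of four retrieved components.
grades = [1.0, 0.5, 0.25, 0.0]

print(graded_precision(grades))
print(binary_precision(grades))
```

Summing grades rewards a system for returning components with high relevance and exact coverage, whereas the binary variant treats a marginal hit the same as a perfect one. Neither variant corrects for nested elements being counted more than once.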