Post on 20-Dec-2015

Integrating Keyword Search into XML Query Processing

Presentation By: Alex Kremer

Ariel Rosenblatt

XML Query Language (XML-QL)
Extending XML-QL with Keyword Search
Extended XML-QL Implementation Using RDBMS

Bibliography (well-formed, but invalid)
Bibliography Article elements are from different sources
Same information, but using different XML Schemas / DTDs (Document Type Definitions)

XML Queries
XML is becoming the Data Storage and Exchange Format of choice in many applications
Handling of XML data requires a rich and powerful Query Language
Allow for querying the content and structure of an XML document
Varying or unknown structures can make formulating queries very difficult

XML Queries: Why not SQL/OQL
XML is not rigidly structured
In XML the schema can exist with the data as tag names
If a DTD is not available, the schema is built while the document is parsed
Missing elements or multiple occurrences of the same element
This flexibility is crucial for EDI (Electronic Data Interchange)

XML Query Requirements
W3C Working Group Goals:
  Support different usage scenarios
  Define data model + query operators
  Define query language syntax
  Interoperate with other XML working groups

XML Query Requirements: Usage Scenarios
Human-readable documents
  Manuals, Books, Articles
Data-oriented documents
  XML representation of: Database data, Object data, …
  XML representation might be either: Physical or Virtual

XML Query Requirements: Usage Scenarios Contd.
Mixed model documents:
  Hybrid of document-oriented and data-oriented
  Catalogues, Patient health records, …
Administrative data:
  Configuration files, User profiles, Administrative logs

XML Query Requirements: Usage Scenarios Contd.
Filtering streams:
  On-line filtering / extracting / transforming / routing of XML data streams
  Logs of email messages, Network packets, Stock market data, Newswire feeds
Document Object Model (DOM)
  Perform queries on DOM structures to return sets of nodes that meet the specified criteria

XML Query Requirements: Usage Scenarios Contd.
Multiple syntactic environments for queries embedded in:
  URL, XML, JSP or ASP pages, a string in a general-purpose programming language

XML Query Requirements: Interoperability
Results must be returned in a DOM-compatible manner
XPath (used in XPointer and XSLT)
  XPath expressibility and search facilities should be used in query syntax
Usage of XML Schema (XSDL) and/or DTD

XML Query Languages: Proposals to W3C
XQL (heavily based on XPath)
XML-QL

XML-QL
It is declarative
It is “relationally complete”; in particular it can express joins
Simple enough to enable optimizations
It can extract data from existing XML documents and construct new documents (transformations)

XML-QL: Syntax

WHERE clause specifies how to filter data from the input XML dataset

CONSTRUCT clause specifies how to assemble the query results in XML

WHERE ( xml-pattern [ ELEMENT_AS $elem_var ] )*

IN url, ( predicate )*

CONSTRUCT xml-pattern | $variable

XML-QL: Example #1

WHERE <article>
        <author><name>$N</name></author>
        <title>$T</title>
      </article> ELEMENT_AS $E IN “bibliography.xml”,
      $N like *Florescu*
CONSTRUCT <result> $E </result>

Yields the following result [figure: result document, not captured in the transcript]

XML-QL Explained: The Data Model
A set of XML documents must be represented (XML Data Set)
XML elements in a dataset can be partitioned according to their types
Need to represent information in a lossless manner (the original data set must be recreatable from the representation)

XML-QL Explained: Data Model Representation

[Figure: graph of the Bibliography dataset. Element nodes ID00 to ID14; edges labeled article, title, author, name, source, id, link and date; leaf values such as “Daniela Florescu” and “W3C”.]

XML-QL Explained: Data Model Representation
Dataset D is represented as a graph G_D:
Nodes:
  Element e → node N_e, uniquely labeled ID_e
  Data value v → leaf L_v, uniquely labeled v
Edges:
  (N_e, N_e’) labeled with the tag of e’, if e’ is directly nested within e (<e><e’>…</e’></e>)
  (N_e, L_v) labeled with “”, if v is directly contained within e (<e>v</e>)
  (N_e, L_v) labeled with attribute name a, if v is the value of attribute a of element e (<e a=“v”>…</e>)
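The node and edge rules above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name build_graph, the triple representation of edges, the pre-order ID numbering and the sample document are all hypothetical, and text that follows a nested element (mixed content) is ignored.

```python
import xml.etree.ElementTree as ET
from itertools import count


def build_graph(xml_text):
    """Return (nodes, edges) for the data graph of one XML document.

    nodes: element-node labels ("ID00", "ID01", ...), numbered in pre-order.
    edges: (source, label, target) triples; a target is either another
           element-node label or a ("value", text) leaf.
    """
    ids = count()
    nodes, edges = [], []

    def visit(elem):
        node = f"ID{next(ids):02d}"
        nodes.append(node)
        # <e a="v">: edge labeled with the attribute name a, pointing to leaf v.
        for name, value in elem.attrib.items():
            edges.append((node, name, ("value", value)))
        # <e>v</e>: edge labeled with the empty string, pointing to leaf v.
        text = (elem.text or "").strip()
        if text:
            edges.append((node, "", ("value", text)))
        # <e><e'>...</e'></e>: edge labeled with the tag of e'.
        for child in elem:
            edges.append((node, child.tag, visit(child)))
        return node

    visit(ET.fromstring(xml_text))
    return nodes, edges


# Hypothetical sample data, not the bibliography.xml used in the slides.
SAMPLE = """<bibliography>
  <article id="4">
    <title>Keyword Search</title>
    <author><name>Daniela Florescu</name></author>
  </article>
</bibliography>"""

nodes, edges = build_graph(SAMPLE)
for edge in edges:
    print(edge)
```

Keeping leaves as ("value", text) pairs makes it easy to tell data values apart from element nodes when a query graph is later matched against this structure.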

XML-QL Explained: Query Processing
An XML pattern can also be modeled by a graph
  Some labels in the graph are now variables
The result of the evaluation of query q on the input D is:
  Each mapping from the graph G_q to the graph G_D which preserves the constant labels
  This mapping induces a substitution of the variables in the query with constant values

XML-QL Explained: A Query Graph for Example #1

WHERE <article>
        <author><name>$N</name></author>
        <title>$T</title>
      </article> ELEMENT_AS $E IN “bibliography.xml”,
      $N like *Florescu*
CONSTRUCT <result> $E </result>

[Figure: query graph. An article node with a title edge leading to $T, and an author edge followed by a name edge leading to “*Florescu*”.]

XML-QL Explained: Query Processing, Example #1

[Figure: the query graph of Example #1 matched against the Bibliography data graph, one article at a time.]

No <author>
No <name>: “name” is an attribute
Match! Add ID08 to Results
  $E = ID08
  $T = “Integrating Keyword Search…”

XML-QL: Advanced Queries
Example #2 (More Florescu)

We now look for articles where the author name can also be an attribute!

WHERE <article>
        <*><author><name>$N</name></author></*>
        <title>$T</title>
      </article> ELEMENT_AS $E IN “bibliography.xml”,
      $N like *Florescu*
CONSTRUCT <result> $E </result>
union
WHERE <article>
        <*><author><_ name=$N></_></author></*>
        <title>$T</title>
      </article> ELEMENT_AS $E IN “bibliography.xml”,
      $N like *Florescu*
CONSTRUCT <result> $E </result>

XML-QL: Disadvantages
We need to know the XML structure in order to query
We can still perform more efficient queries, where we get all the information available, but
  These queries can easily grow very complex, as seen previously

XML-QL: Keyword Search Extension
Addition of a special predicate called contains to XML-QL
Tests the existence of a given word within an XML element
Works on partially known or unknown XML structure
Allows querying several XML documents with different structures

Extended XML-QL: The contains Predicate
The contains predicate has 4 arguments, ($E, word, depth, location):
  $E is an XML element variable
  word is the word we are searching for
  depth is an integer expression limiting the depth at which the word is found within the element
  location is a boolean expression over the set of constants {tag_name, attribute_name, content, attribute_value}
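A minimal in-memory sketch of these semantics is shown here. It does not follow the RDBMS implementation described later, and several details are assumptions rather than statements from the slides: depth is counted as the distance in the data graph (the element's own tag is at depth 0, its attributes and direct content at depth 1, and so on), the location expression is modeled as a set of allowed constants, and matching is on exact, case-sensitive whitespace-separated words.

```python
import xml.etree.ElementTree as ET

# "any" in the slides is modeled as the full set of location constants.
ANY = frozenset({"tag_name", "attribute_name", "content", "attribute_value"})


def occurrences(elem, depth=0):
    """Yield (word, depth, location) for every word at or below elem."""
    yield elem.tag, depth, "tag_name"
    for name, value in elem.attrib.items():
        yield name, depth + 1, "attribute_name"
        for word in value.split():
            yield word, depth + 1, "attribute_value"
    for word in (elem.text or "").split():
        yield word, depth + 1, "content"
    for child in elem:
        yield from occurrences(child, depth + 1)


def contains(elem, word, max_depth, locations=ANY):
    """True if word occurs within elem, no deeper than max_depth,
    in one of the allowed locations."""
    return any(
        w == word and d <= max_depth and loc in locations
        for w, d, loc in occurrences(elem)
    )


# Hypothetical sample data.
article = ET.fromstring(
    '<article id="4"><title>Keyword Search</title>'
    "<author><name>Daniela Florescu</name></author></article>"
)
author = article.find("author")
attr_author = ET.fromstring('<author name="Daniela Florescu"/>')

print(contains(author, "Florescu", 3, {"content", "attribute_value"}))
print(contains(attr_author, "Florescu", 3, {"content", "attribute_value"}))
print(contains(article, "Florescu", 2))
```

Under the depth convention assumed here, the name can be found from the author element whether it is stored as element content or as an attribute value, which is the case a single contains call is meant to cover.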

Extended XML-QL: Example #3
We can use the extended XML-QL to formulate a query which yields the same result as Example #2

WHERE <article>
        <author></author> ELEMENT_AS $A
        <title>$T</title>
      </article> ELEMENT_AS $E IN “bibliography.xml”,
      contains($A, “Florescu”, 3, content or attribute_value)
CONSTRUCT <result> $E </result>

Extended XML-QL: Example #4
We are able to query unstructured data (full text search) within a set of articles:

WHERE <article></article> ELEMENT_AS $E IN “bibliography.xml”,
      contains($E, “Florescu”, 3, any)
CONSTRUCT <result> $E </result>

Yielding the result [figure: result document, not captured in the transcript]

Implementing the contains predicate
The authors suggest an implementation of the XML-QL extension on top of a commercial RDBMS: Oracle 8, IBM DB2, MS-SQL, …

Implementation Using RDBMS
Reasons:
  Easy to implement an extended XML query processor
  Universally available
  RDBMSs allow mixing XML data with other (relational) data
  Very good performance over large volumes of data

Relational Support for Full-text Indexing
Use of extended Inverted Files to implement:
  The contains predicate
  Finding of relevant XML data sources (URLs) in a distributed environment
We will use RDBMS to implement Inverted Files

Inverting Files
For our needs the inverted file will contain tuples of the following format:
  <word, elID, depth, location>
Examples from bibliography.xml:
  <“article”, elID01, 0, tag>
  <“id”, elID01, 1, attr>
  <“Requirements”, elID01, 2, value>
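As a rough sketch, tuples of this form could be produced by walking every element and recording each word found at or below it, with the depth taken relative to that element. The function names, the pre-order elID numbering, the attr_value label for attribute values and the sample document are all assumptions made for illustration; only the tag, attr and value labels come from the examples above.

```python
import xml.etree.ElementTree as ET


def occurrences(elem, depth=0):
    """Yield (word, depth, location) for every word at or below elem."""
    yield elem.tag, depth, "tag"
    for name, value in elem.attrib.items():
        yield name, depth + 1, "attr"
        for word in value.split():
            yield word, depth + 1, "attr_value"  # hypothetical label
    for word in (elem.text or "").split():
        yield word, depth + 1, "value"
    for child in elem:
        yield from occurrences(child, depth + 1)


def build_inverted_file(root):
    """Return a sorted list of (word, elID, depth, location) tuples."""
    tuples = set()
    # root.iter() visits elements in document order, so IDs are pre-order.
    for el_id, elem in enumerate(root.iter()):
        for word, depth, location in occurrences(elem):
            tuples.add((word, f"elID{el_id:02d}", depth, location))
    return sorted(tuples)


# Hypothetical sample data, not the bibliography.xml used in the slides.
SAMPLE = """<bibliography>
  <article id="1">
    <title>Requirements Draft</title>
  </article>
</bibliography>"""

inverted = build_inverted_file(ET.fromstring(SAMPLE))
for row in inverted:
    print(row)
```

Each word is recorded once per enclosing element, so the file grows with document depth; that growth is one motivation for the partitioning discussed on the following slides.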

Storing Inverted Files in RDBMS: Unique Internal elIDs
Unique element IDs are modeled as records containing:
  Document locators (URLs)
  Element locators within the document
    Using absolute positions (start, end)
    Using unique identifiers specified by DTD (explicit id attribute)
Why not XPointer?

Storing Inverted Files in RDBMS: Unique elID Schemes
After normalization the authors propose the following scheme:
  Elements(elID, docid, start_pos, end_pos, type, id_val)
  Documents(docid, URL)
From this point elID can be used as an internal key for faster processing
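The two relations could be declared as follows, here using SQLite through Python's sqlite3 module purely for illustration (the slides name Oracle 8, IBM DB2 and MS-SQL). The column types, constraints and the inserted row are assumptions; only the table and column names come from the scheme above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Documents (
    docid INTEGER PRIMARY KEY,
    URL   TEXT NOT NULL UNIQUE
);
CREATE TABLE Elements (
    elID      INTEGER PRIMARY KEY,   -- internal key used by the index
    docid     INTEGER NOT NULL REFERENCES Documents(docid),
    start_pos INTEGER,               -- absolute start position in the document
    end_pos   INTEGER,               -- absolute end position in the document
    type      TEXT NOT NULL,         -- tag name of the element
    id_val    TEXT                   -- explicit id attribute, if any
);
""")

# Hypothetical row: positions and id value are made up.
conn.execute("INSERT INTO Documents VALUES (1, 'bibliography.xml')")
conn.execute("INSERT INTO Elements VALUES (1, 1, 16, 120, 'article', '1')")

# Resolve an internal elID back to a document URL and a position range.
row = conn.execute("""
    SELECT d.URL, e.start_pos, e.end_pos
    FROM Elements e JOIN Documents d ON d.docid = e.docid
    WHERE e.elID = 1
""").fetchone()
print(row)
```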

Storing Inverted Files in RDBMS
Natural way, using the scheme: contains(elID, word, depth, location)
  Huge!
We partition it into word tables; for each keyword <word> in the dataset: <word>(elID, depth, location)
Virtually all IR (Information Retrieval) systems use partitioning by word

Storing Inverted Files in RDBMS: Further Partitioning
We use further partitioning to optimize the query processing:
  The type (tag) of the element is usually known at predicate evaluation time by looking at the XML pattern of the query
  We further partition the individual <word> tables by the type of the element they are in: <word>-<type>(elID, depth, location)
Table examples: Name-author, Florescu-name
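The word-by-type partitioning can be pictured with a small in-memory sketch in which each <word>-<type> table is a list stored under its table name. The function name, the input row layout and the sample rows are hypothetical; a real implementation would create one relational table per key.

```python
from collections import defaultdict


def partition(rows):
    """Group inverted-file rows into <word>-<type> tables.

    rows: iterable of (word, el_type, elID, depth, location), where
          el_type is the tag of the element identified by elID.
    Returns {"<word>-<type>": [(elID, depth, location), ...]}.
    """
    tables = defaultdict(list)
    for word, el_type, el_id, depth, location in rows:
        tables[f"{word}-{el_type}"].append((el_id, depth, location))
    return dict(tables)


# Hypothetical rows: the same word occurrence is indexed once per
# enclosing element, at increasing depth.
rows = [
    ("Florescu", "name", 11, 1, "content"),
    ("Florescu", "author", 10, 2, "content"),
    ("Florescu", "article", 8, 3, "content"),
    ("Integrating", "title", 9, 1, "content"),
]

tables = partition(rows)
print(sorted(tables))
```

Because the query pattern usually fixes the element type, a contains predicate on an author variable only needs to read the Florescu-author table rather than every occurrence of the word.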

Implementation: Extended XML-QL Query Processing
Two Ways:
  Replicating the whole XML data in an RDBMS
    XML-QL processing is entirely performed in an RDBMS
  Distributed XML Query Processing
    Only the index (contains) is stored in an RDBMS

Replicating the XML Data in an RDBMS
The binary table approach:
  For each type (tag name or attribute name), a table is built with the following scheme: <type>(parent, element, value)
  The parent element contains the element of type <type>
  element is null if a <type> has no sub-elements or if <type> is an attribute name (in that case we are usually interested in the value)

Replicating the XML Data in an RDBMS: XML-QL Queries
Every XML-QL query can be translated into an equivalent SQL query
The SQL query will process the binary tables of the replicated XML data

XML-QL to SQL: Example #5 (from Example #1)

WHERE <article>
        <author><name>$N</name></author>
        <title>$T</title>
      </article> ELEMENT_AS $E IN “bibliography.xml”,
      $N like *Florescu*
CONSTRUCT <result> $E </result>

SELECT article.element
FROM article, author, name, title
WHERE article.element = author.parent AND
      author.element = name.parent AND
      article.element = title.parent AND /* title exists */
      name.value LIKE '%Florescu%'
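To make the translation concrete, the SQL query can be run against a handful of hand-written binary tables. The sketch below uses SQLite via Python's sqlite3 module as a stand-in for the commercial RDBMS; the element IDs, titles and the second author name are invented sample data, not the contents of bibliography.xml.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE author  (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE name    (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE title   (parent INTEGER, element INTEGER, value TEXT);

-- Article 1: has a title but no author, so it cannot match.
INSERT INTO article VALUES (0, 1, NULL);
INSERT INTO title   VALUES (1, NULL, 'Hypothetical title A');

-- Article 8: title plus two authors with nested name elements.
INSERT INTO article VALUES (0, 8, NULL);
INSERT INTO title   VALUES (8, NULL, 'Hypothetical title B');
INSERT INTO author  VALUES (8, 10, NULL);
INSERT INTO name    VALUES (10, NULL, 'Daniela Florescu');
INSERT INTO author  VALUES (8, 12, NULL);
INSERT INTO name    VALUES (12, NULL, 'Hypothetical Coauthor');
""")

QUERY = """
SELECT article.element
FROM article, author, name, title
WHERE article.element = author.parent
  AND author.element = name.parent
  AND article.element = title.parent   -- title exists
  AND name.value LIKE '%Florescu%'
"""

matches = [row[0] for row in conn.execute(QUERY)]
print(matches)
```

Each nesting step in the XML pattern becomes one equality join between an element column and a parent column, which is why deeply nested patterns translate into long join chains.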

Extended XML-QL to SQL: Keyword Search
Processing the contains predicate involves the use of inverted file tables
The word-type table has to be joined with the previous result
The word-type table is the table that results from partitioning by word and then by type

Extended XML-QL to SQL: Example #6

WHERE <article>
        <author></author> ELEMENT_AS $A
        <title>$Ttext</title> ELEMENT_AS $T
      </article> ELEMENT_AS $E IN “bibliography.xml”,
      contains($A, “Florescu”, 3, any),
      contains($T, “Integrating”, 3, any)
CONSTRUCT <result> $Ttext </result>

SELECT title.value
FROM article, author, title, "Florescu-author", "Integrating-title"
WHERE article.element = author.parent AND
      author.element = "Florescu-author".elID AND
      article.element = title.parent AND
      title.element = "Integrating-title".elID
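The join with the word-type tables can be tried out on a small hand-built database. The sketch uses SQLite via Python's sqlite3 module, with all rows invented for illustration. Two points are assumptions rather than part of the slides: title rows carry an element ID so that the word table can join on it, and the explicit depth <= 3 tests are added here to reflect the third argument of contains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE author  (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE title   (parent INTEGER, element INTEGER, value TEXT);
CREATE TABLE "Florescu-author"   (elID INTEGER, depth INTEGER, location TEXT);
CREATE TABLE "Integrating-title" (elID INTEGER, depth INTEGER, location TEXT);

-- Article 4: author matches, but the title lacks the keyword.
INSERT INTO article VALUES (0, 4, NULL);
INSERT INTO title   VALUES (4, 5, 'A hypothetical title');
INSERT INTO author  VALUES (4, 6, NULL);
INSERT INTO "Florescu-author" VALUES (6, 1, 'attribute_value');

-- Article 8: both keywords are present.
INSERT INTO article VALUES (0, 8, NULL);
INSERT INTO title   VALUES (8, 9, 'Integrating hypothetical things');
INSERT INTO author  VALUES (8, 10, NULL);
INSERT INTO "Florescu-author"   VALUES (10, 2, 'content');
INSERT INTO "Integrating-title" VALUES (9, 1, 'content');
""")

QUERY = """
SELECT title.value
FROM article, author, title,
     "Florescu-author" AS fa, "Integrating-title" AS ig
WHERE article.element = author.parent
  AND author.element = fa.elID AND fa.depth <= 3
  AND article.element = title.parent
  AND title.element = ig.elID AND ig.depth <= 3
"""

titles = [row[0] for row in conn.execute(QUERY)]
print(titles)
```

Table names that contain a hyphen have to be written as quoted identifiers; a naming scheme without special characters would avoid the quoting.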

Distributed XML Query Processing
XML data can be indexed in RDBMS, but
The XML data cannot be stored in the RDBMS
  Reasons: volume (entire www) or legal
The mediator (query interface):
  Uses inverted files in RDBMS, but
  Accesses the data sources to compute the full query result (Expensive!)
  Load relevant documents/elements into RDBMS and process the query as described before (XML-QL to SQL)

Distributed XML Query Processing: Elements Retrieval
Use of Inverted Files for the retrieval of relevant documents/elements:
  Evaluate contains predicates to disqualify irrelevant elements
  Further reduce the dataset needed to process the remaining basic XML-QL query
    This is an optimization since retrieval of remote data is expensive
Load the relevant documents/elements

Distributed XML Query Processing: Reducing Retrieval

WHERE <article>
        <author><name>$N</name></author>
        <title>$T</title>
      </article> ELEMENT_AS $E IN “bibliography.xml”,
      $T like *XML*
CONSTRUCT <result> $N </result>

Get the intersection of the elID sets from:
  author-article
  name-article
  title-article
  XML-article
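The intersection step might look like this when the word-type tables are held in memory. The function name, the dictionary layout and every row are hypothetical; in the approach described here the tables live in the RDBMS and the intersection would be computed there.

```python
def candidate_elements(word_tables, words, el_type):
    """Return the elIDs of el_type elements that contain every word.

    word_tables: {"<word>-<type>": [(elID, depth, location), ...]}
    """
    id_sets = [
        {row[0] for row in word_tables.get(f"{word}-{el_type}", [])}
        for word in words
    ]
    return set.intersection(*id_sets) if id_sets else set()


# Hypothetical index contents for three article elements (1, 4 and 8).
word_tables = {
    "author-article": [(4, 1, "tag"), (8, 1, "tag")],
    "name-article": [(4, 2, "attr"), (8, 2, "tag")],
    "title-article": [(1, 1, "tag"), (4, 1, "tag"), (8, 1, "tag")],
    "XML-article": [(1, 2, "content"), (8, 2, "content")],
}

to_fetch = candidate_elements(
    word_tables, ["author", "name", "title", "XML"], "article"
)
print(to_fetch)
```

The result is a set of candidates that still has to be checked against the full query after loading, since the index alone does not confirm the exact nesting required by the pattern. Only those candidates need to be retrieved from the remote source.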

Conclusions
XML-QL can be extended to support keyword search
Use of RDBMS:
  Inverted Files can be stored and queried using an RDBMS
  XML data itself can be replicated and queried in the RDBMS
  Keyword search and overall XML query processing can be carried out very efficiently
Data structure influence:
  The more structure is known, the faster a query will be executed
  Totally unstructured queries can be executed very fast
  The more structure is known, the higher the quality of the query results
