Storing and Querying Ordered XML Using Relational Database System Swapna Dhayagude

Post on 21-Dec-2015

Agenda

Ordered XML Data Model

Order Encoding Methods

Shredding Ordered XML into Relations

Translating XML queries to SQL

Performance Evaluation

Ordered XML Data Model

XML document as a tree structure
- Relation as the ‘root’
- Nodes represent elements
- Leaf nodes hold data values

Document Type Descriptor
- Schema information about the XML document

Order
- A salient feature of an XML document

Significance of order in XML

Order is important for:

Reconstruction of XML documents
- To ensure a lossless mapping from XML to RDB

Performance
- The choice of order encoding dramatically affects performance
- Enables efficient translation of XML queries into SQL

Order-based functionality of XPath and XQuery
- XPath: a simple ‘path based’ query language
- XQuery: a more complex query language based on XPath

Three dimensions of XML order

Evaluation of order-based axes
- XPath expressions requiring document order
  1. preceding
  2. following

Inter-element order
- The result set enforces document order among its elements

Intra-element order
- For reconstruction, document order is important

How is order encoded ?

- Order is preserved using a simple numbering scheme
- Each node is represented using a node_id
- The node_id is stored as a data value within the relation
- Numbering schemes capture enough information to reconstruct XML documents

Order Based Functionality of XPath

XPath follows a step-by-step, sequential evaluation
- Each step is applied to a single node (the context node)
- The result of each step is a set of nodes {node1, node2, ..., nodeN}

XPath syntax

Path ::= /Step1/Step2/…/StepN

where each XPath step is defined as follows:

Step ::= Axis::Node-test Predicate*

Axis selects a direction of navigation
- e.g. child::title selects all children of the context node that are ‘title’ elements

Order Based Functionality of XPath

Axes specify the direction of navigation in an XML document
- Up: parent, ancestor
- Down: child, descendant
- Left: preceding, preceding-sibling
- Right: following, following-sibling

Order Based Functionality of XQuery

BEFORE operator
- Returns nodes from the first sequence that are before some node in the second sequence

AFTER operator
- Returns nodes from the first sequence that are after some node in the second sequence

XQuery supports range predicates
- Allows selection of a range of elements from a sequence
- e.g. /play/act[2 TO 4] returns act #2, act #3 and act #4 in document order
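As a rough illustration of what the range predicate /play/act[2 TO 4] selects, the sketch below applies the same idea to a small in-memory document using Python's standard xml.etree.ElementTree module. The sample document and the helper name select_range are assumptions made for this example; they are not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: one play with five acts.
doc = ET.fromstring(
    "<play>"
    "<title>Hamlet</title>"
    "<act n='1'/><act n='2'/><act n='3'/><act n='4'/><act n='5'/>"
    "</play>"
)

def select_range(parent, tag, first, last):
    """Return the first..last (1-based, inclusive) children named `tag`.

    findall() yields matching children in document order, so slicing
    the list mirrors a positional range predicate such as [2 TO 4].
    """
    matches = parent.findall(tag)
    return matches[first - 1:last]

acts = select_range(doc, "act", 2, 4)
print([a.get("n") for a in acts])  # prints ['2', '3', '4']
```

The point of the sketch is that a range predicate depends only on the position of each element among the matching nodes, which is exactly the information an order encoding has to preserve.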

Global Order Encoding Methods

Global Order Encoding
- Absolute positioning of nodes
- Best performance on queries: query evaluation requires only a simple comparison between node positions
- Worst performance on updates, especially inserts

Example tree with global order numbers:

play(1)
    title(2)
        text#(3)
    act(4)
        title(5)
            text#(6)
        scene(7)
    act(8)

Global Order Encoding (contd)

- Initially, sparse numbering is used for Global Order Encoding
- Sparse numbering brings down the cost of renumbering on inserts/updates, which results in better update performance
- Makes intra-element and inter-element ordering easy, since global document order is readily available
- Drawback: performs poorly on inserts (Local Order offers better performance for inserts/updates)
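The effect of sparse numbering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gap size of 10, the flat list representation, and the function names number_sparsely and insert_after are all hypothetical and do not come from the slides.

```python
def number_sparsely(tags, gap=10):
    """Assign global ids, spaced `gap` apart, to nodes listed in document order."""
    return [(tag, (i + 1) * gap) for i, tag in enumerate(tags)]

def insert_after(nodes, index, tag, gap=10):
    """Insert `tag` right after nodes[index].

    Returns the number of existing nodes that had to be renumbered.
    """
    left = nodes[index][1]
    has_next = index + 1 < len(nodes)
    right = nodes[index + 1][1] if has_next else left + 2 * gap
    if right - left > 1:
        # A free id exists between the neighbours: no renumbering needed.
        nodes.insert(index + 1, (tag, (left + right) // 2))
        return 0
    # Gap exhausted: the new node and every later node get fresh ids.
    nodes.insert(index + 1, (tag, None))
    for k in range(index + 1, len(nodes)):
        nodes[k] = (nodes[k][0], left + (k - index) * gap)
    return len(nodes) - (index + 2)

sparse = number_sparsely(["play", "title", "text#", "act"])
print(insert_after(sparse, 1, "new"))         # prints 0

dense = number_sparsely(["play", "title", "text#", "act"], gap=1)
print(insert_after(dense, 1, "new", gap=1))   # prints 2
```

With gaps, an insert usually finds a free id between its neighbours. Once the gap is used up, every node after the insertion point must be renumbered, which is the drawback noted above.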

Global Order Renumbering Scenario

Inserting a new element in an existing document causes many nodes to be renumbered

In the adjoining figure, the highlighted nodes need to be renumbered (Global Order requires the most renumbering of the three schemes)

[Figure: the play tree with a new element inserted; the nodes that must be renumbered are highlighted]

Local Order Encoding Methods

Local Order Encoding
1. Relative positioning of nodes
2. Best performance on updates
3. Worst performance on queries

Example tree with local order numbers (position among siblings):

play(1)
    title(1)
        text(1)
    act(2)
        title(1)
            text(1)
        scene(2)
    act(3)

Local Order Encoding (continued….)

How does Local Order encoding reconstruct the absolute position?
- The relative position of a node is combined with the relative position of its parent, and so on for every ancestor up to the root
- The combination yields a vector that uniquely identifies the absolute position within the document

(relative position of node) + (relative positions of ancestors) = (absolute position of node in the document)
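The combination of relative positions can be sketched as follows. The node keys and the dictionary layout are assumptions made for illustration; only the sibling positions are taken from the example play tree.

```python
# key -> (parent key, position among siblings). Keys are arbitrary labels.
EDGE = {
    "play":   (None,     1),
    "title1": ("play",   1),
    "text1":  ("title1", 1),
    "act1":   ("play",   2),
    "title2": ("act1",   1),
    "text2":  ("title2", 1),
    "scene":  ("act1",   2),
    "act2":   ("play",   3),
}

def position_vector(key):
    """Collect sibling positions from the node up to the root, root first."""
    vec = []
    while key is not None:
        parent, s_index = EDGE[key]
        vec.append(s_index)
        key = parent
    return tuple(reversed(vec))

print(position_vector("scene"))  # prints (1, 2, 2)

# Comparing the vectors element by element recovers document order.
in_document_order = sorted(EDGE, key=position_vector)
print(in_document_order[:4])     # prints ['play', 'title1', 'text1', 'act1']
```

Note that building the vector requires one parent lookup per level of the tree. This hints at why queries that need global order are costly under Local Order.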

Local Order Renumbering Scenario

As opposed to Global Order Encoding, Local Order requires a minimum number of nodes to be renumbered

This is a major advantage, since it dramatically reduces the cost of inserts

[Figure: the play tree with local order numbers and a new element inserted; only a few nodes need to be renumbered]

Local Order Encoding (continued….)

- Incurs low overhead on updates
- Only following siblings of the new node may require renumbering
- Drawback: the lack of global order information results in complex evaluation of the following and preceding axes

Dewey Order Encoding Methods

Dewey Order Encoding
1. Strikes a balance between Global and Local
2. Reasonable performance on updates and queries

Example tree with Dewey order labels:

play(1)
    title(1.1)
        text(1.1.1)
    act(1.2)
        title(1.2.1)
            text(1.2.1.1)
        scene(1.2.2)
    act(1.3)

Dewey Order Encoding

- Each path uniquely identifies the absolute position of a node in the document
- Query processing is similar to that of Global Order
- Only following siblings (and their descendants, whose paths share the changed prefix) may require renumbering
- Drawback: extra space is required to store the path from the root to each node
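To make the renumbering behaviour concrete, the sketch below represents each Dewey label as a tuple of integers and counts how many labels change on an insert. The function name insert_child and the tuple representation are assumptions for this example; the labels follow the sibling positions of the example play tree.

```python
LABELS = [
    (1,),            # play
    (1, 1),          # title
    (1, 1, 1),       # text
    (1, 2),          # act
    (1, 2, 1),       # title
    (1, 2, 1, 1),    # text
    (1, 2, 2),       # scene
    (1, 3),          # act
]

def insert_child(labels, parent, position):
    """Insert a new child of `parent` at 1-based sibling `position`.

    Returns (new labels in document order, number of relabelled nodes).
    Only following siblings and their descendants are shifted.
    """
    depth = len(parent)
    new_labels, changed = [], 0
    for label in labels:
        below_parent = len(label) > depth and label[:depth] == parent
        if below_parent and label[depth] >= position:
            label = label[:depth] + (label[depth] + 1,) + label[depth + 1:]
            changed += 1
        new_labels.append(label)
    new_labels.append(parent + (position,))
    return sorted(new_labels), changed

_, deep = insert_child(LABELS, (1, 2), 2)   # new node before scene
_, shallow = insert_child(LABELS, (1,), 2)  # new node before the first act
print(deep, shallow)                        # prints 1 5
```

An insert deep in the tree touches few labels, while an insert near the root shifts whole subtrees. This is consistent with Dewey Order sitting between Local and Global Order in update cost.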

Dewey Order Renumbering Scenario

Renumbering required is more than that for Local Encoding, however much less than that for Global Encoding

[Figure: the play tree with a new element inserted under Dewey Order; the nodes that must be renumbered are highlighted]

Shredding XML into Relations

Schema-less Case
- The schema of the input XML documents is unknown
- Edge approach: each document is stored as a single table

Schema-aware Case
- The schema of the input XML documents is available
- Inlining:
  - Single occurrence of a child: store within the parent relation
  - Multiple occurrences: store in a separate relation

Inlining

- Inlining is an effective way of storing and querying XML, provided the document schema is available
- Inlining adapts to Global, Local and Dewey Orders
- Every relation requires an additional column to encode document order
- Storing order information for ‘inlined’ elements is unnecessary (an element's position is determined from the position of its parent and from the document schema)

Storing Order Information – Schema less case

The Edge Approach
- Each document is stored as a single table
- Each tuple within the table represents a node

Edge (id, parent_id, name, value)
- id: serves as the primary key
- parent_id: serves as a foreign key, providing the link to the node’s parent
- name: stores the tag name of the element
- value: stores the text value

Storing Order Information – Schema less case

The Edge approach adapts differently to Global, Local and Dewey Order

Global Order: Edge (id, parent_id, end_desc_id, path_id, value)
- end_desc_id: id of the last descendant of a node

Local Order: Edge (id, parent_id, sIndex, path_id, value)
- sIndex: sibling index of a node

Dewey Order: Edge (dewey, path_id, value)
- dewey: represents both order and ancestor information

Query Translation for Global Order

Edge (id, parent_id, end_desc_id, path_id, value)

Translation of following / preceding
- following: select nodes from the Edge table whose id > end_desc_id of the context node
- preceding: select nodes from the Edge table whose end_desc_id < id of the context node (this excludes the ancestors of the context node)

Translation of following-sibling / preceding-sibling
- following-sibling: select nodes whose id > id of the context node AND whose parent_id = parent_id of the context node
- preceding-sibling: select nodes whose id < id of the context node AND whose parent_id = parent_id of the context node

Note: the above expressions are NOT actual SQL statements
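The conditions above can be tried out with Python's built-in sqlite3 module. The following is a minimal sketch, not the translation produced by the system described in the slides. The table is simplified (a name column stands in for path_id and value), the rows come from the example play tree with global numbers, and the end_desc_id values were derived by hand for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Edge (id INTEGER PRIMARY KEY, parent_id INTEGER, "
    "end_desc_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO Edge VALUES (?, ?, ?, ?)",
    [
        (1, None, 8, "play"),
        (2, 1, 3, "title"),
        (3, 2, 3, "text"),
        (4, 1, 7, "act"),
        (5, 4, 6, "title"),
        (6, 5, 6, "text"),
        (7, 4, 7, "scene"),
        (8, 1, 8, "act"),
    ],
)

# One condition per axis; e is a candidate node, c is the context node.
AXES = {
    "following": "e.id > c.end_desc_id",
    "preceding": "e.end_desc_id < c.id",
    "following-sibling": "e.parent_id = c.parent_id AND e.id > c.id",
    "preceding-sibling": "e.parent_id = c.parent_id AND e.id < c.id",
}

def axis(name, context_id):
    """Return ids on the given axis of the context node, in document order."""
    sql = (
        "SELECT e.id FROM Edge e JOIN Edge c ON c.id = ? "
        f"WHERE {AXES[name]} ORDER BY e.id"
    )
    return [row[0] for row in conn.execute(sql, (context_id,))]

print(axis("following", 4))          # prints [8]
print(axis("preceding", 4))          # prints [2, 3]
print(axis("following-sibling", 2))  # prints [4, 8]
```

Every axis reduces to a comparison of integer columns plus an ORDER BY on id, which is why Global Order is the cheapest scheme to query.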

Query Translation for Local Order

Edge (id, parent_id, sIndex, path_id, value)

Translation of following-sibling / preceding-sibling (similar to Global and Dewey Order)

Translation of following / preceding (a complex task)
1. Compute all ancestors of the context node: {anc}
2. Compute the following siblings of the context node and of each ancestor: {anc_sib}
3. Compute the descendants of {anc_sib}

Challenges
- Without knowledge of the XML schema, retrieving ancestors/descendants is a complex task
- It involves recursion
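The three steps can be sketched in plain Python over an in-memory copy of a Local Order Edge table. The ids are deliberately arbitrary to stress that they carry no order information. The row layout and the function names are assumptions for illustration, and a real translation would express the recursion in SQL rather than in application code.

```python
# (id, parent_id, sIndex, name). Ids are arbitrary and carry no order.
EDGE = [
    (50, None, 1, "play"),
    (20, 50, 1, "title"),
    (70, 20, 1, "text"),
    (10, 50, 2, "act"),
    (60, 10, 1, "title"),
    (30, 60, 1, "text"),
    (80, 10, 2, "scene"),
    (40, 50, 3, "act"),
]
ROWS = {row[0]: row for row in EDGE}

def children(node_id):
    """Children of a node, sorted by sibling index."""
    kids = [r for r in EDGE if r[1] == node_id]
    return sorted(kids, key=lambda r: r[2])

def subtree(node_id):
    """The node followed by all its descendants, in document order."""
    out = [node_id]
    for kid in children(node_id):
        out.extend(subtree(kid[0]))  # recursion: depth is not known up front
    return out

def following(context_id):
    """Ids on the following axis of the context node, in document order."""
    result = []
    node = ROWS[context_id]
    while node[1] is not None:                       # step 1: climb ancestors
        later = [s for s in children(node[1]) if s[2] > node[2]]  # step 2
        for sib in later:
            result.extend(subtree(sib[0]))           # step 3: add descendants
        node = ROWS[node[1]]
    return result

print(following(60))  # prints [80, 40]
print(following(70))  # prints [10, 60, 30, 80, 40]
```

Each call walks up the tree and back down again, which contrasts with the single integer comparison that Global Order needs for the same axis.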

Query Translation for Dewey Order

Edge (dewey, path_id, value)

dewey column
- Stored as a variable-length byte string
- Replaces parent_id and end_desc_id of the Global Edge table
- Encodes parent and descendant information within the dewey path, which eliminates the need to store them separately

Drawback
- Storage overhead due to the large number of bytes allocated to each component
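One way to picture the dewey column is sketched below. The slides do not specify the byte encoding, so this example assumes a fixed two bytes per component (big-endian), chosen only because it keeps the code short. With that assumption, plain byte-wise comparison matches document order and a prefix test identifies descendants.

```python
import struct

def encode(path):
    """Encode a Dewey path such as (1, 2, 1) as a byte string.

    Assumption: every component fits in two bytes (0..65535).
    """
    return b"".join(struct.pack(">H", component) for component in path)

def is_descendant(node, ancestor):
    """True if `node` lies strictly below `ancestor` in the tree."""
    return len(node) > len(ancestor) and node.startswith(ancestor)

def parent_of(node):
    """Drop the last two-byte component to obtain the parent's key."""
    return node[:-2]

a, b = encode((1, 2)), encode((1, 10))
print(a < b)                # prints True
print("1.2" < "1.10")       # prints False
print(is_descendant(encode((1, 2, 1)), a))  # prints True
print(len(encode((1, 2, 1, 1))))            # prints 8
```

Comparing the dotted strings directly gets the order of 1.2 and 1.10 wrong, which is one reason a binary encoding helps. The last line shows the cost: a node four levels deep already needs eight bytes under this assumed encoding, in line with the storage drawback noted above.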

Query Translation in Inlining

Essentially uses the same algorithm as the Edge approach, with two differences:
- XML data can be spread across several tables, so evaluating axes requires access to multiple tables rather than a single Edge table
- The translation algorithm does not use recursion, since the schema contains sufficient information about the depth and position of nodes

Drawback
- Data is partitioned across many tables, which can become too many to handle

Storage Requirements

Table 1: Storage requirements of the Global, Local and Dewey encoding methods

Order Scheme    Edge                        Inlining
                Table Size    Index Size    Table Size    Index Size
Global          52.1 MB       57.9 MB       44.1 MB       28.9 MB
Local           52.1 MB       87.9 MB       47.7 MB       36.8 MB
Dewey           48.9 MB       38.7 MB       44.5 MB       15.8 MB

Performance

All experiments are based on the Shakespeare’s Plays dataset.

Table 2: Test Queries

Query   Query Definition
Q1      /play
Q2      /play/act//speech
Q3      /play/act/scene/speech
Q4      /play/act/scene/speech[2]
Q5      /play/act/scene/*[2]
Q6      /play/act/scene/speech[1 TO 3]
Q7      /play/act[2]/following::speech
Q8      /play/act/scene/speech/speaker/following-sibling::line[2]
Q9      //act/scene/speech BEFORE /play/act[2]

Select and Reconstruct Modes

XPath queries essentially run in two different modes
- Select mode: the result set contains only the IDs of the nodes satisfying the XPath expression
- Reconstruct mode: entire XML fragments are extracted from the database in document order

Ordered Selection: Edge Results

[Chart: time in seconds (Y axis, 0 to 18) for queries Q1 to Q9 (X axis), with one series each for Global, Local and Dewey]

Inlining Results

[Chart: results for queries Q1 to Q9, with one series each for Global, Local and Dewey; Y axis scale 0 to 8]

Reconstruction

- In reconstruct mode, XML documents need to be extracted from the database in document order
- The optimizer's inability to pick the best plan produced poor results
- Using ‘tuned’ SQL queries, on the other hand, yielded better results

Note: Queries Q3, Q4, Q5 and Q9 performed disastrously (far beyond the range of the indicated scale)

[Chart: results for queries Q1 to Q9, Initial versus Tuned; Y axis scale 0 to 10]

Performance

Results based on experiments
- Global Order is the most efficient order encoding method
- Dewey Order follows with the second-best performance
- Local Order uses sorting very often, which degrades overall performance
- Typically, Inlining performs better than Edge
- In general, the XML document parsing overhead was greater than the XPath processing cost

Performance

Conclusions based on results
- An RDBMS can efficiently support ordered XML
- Global Order is the best for query workloads
- Dewey Order is slightly less efficient than Global Order, but best for a mix of queries and updates
- Schema information makes Local Order a viable alternative
- Relational optimizers lack an understanding of the hierarchical XML structure

Acknowledgements…

Prof. Elke Rundensteiner

Thank You …