Storing and Querying Ordered XML Using a Relational Database System
Swapna Dhayagude
Post on 21-Dec-2015
Agenda
Ordered XML Data Model
Order Encoding Methods
Shredding Ordered XML into Relations
Translating XML queries to SQL
Performance Evaluation
Ordered XML Data Model
XML document as a tree structure
- Relation as the ‘root’
- Nodes represent elements
- Leaf nodes hold data values
Document Type Definition (DTD)
- schema information about the XML document
Order - a salient feature of an XML document
Significance of order in XML
Order –
- Important for the reconstruction of XML documents: ensures a lossless mapping from XML to RDB
- Performance issues: the choice of order encoding dramatically affects performance and enables efficient translation of XML queries into SQL
Order based functionality of XPath and XQuery
- XPath – a simple ‘path based’ query language
- XQuery – a complex query language based on XPath
Three dimensions of XML order
Evaluation of order based axes
- XPath expressions requiring document order
1. preceding
2. following
Inter Element Order
- the result set enforces document order among its elements
Intra Element Order
- document order is important for reconstruction
How is order encoded?
Order is preserved using a simple numbering scheme
Each node is represented using a node_id
The node_id is stored as a data value within the relation
Numbering schemes capture enough information to reconstruct XML documents
Order Based Functionality of XPath
XPath follows a step-by-step sequential evaluation
- Each step is applied to a single node (the context node)
- The result of each step is a set of nodes {node1, node2, ..., nodeN}
XPath syntax
Path ::= /Step1/Step2/…/StepN
where each XPath step is defined as follows:
Step ::= Axis::Node-test Predicate*
Axis selects a direction of navigation
e.g. child::title would select all children that are ‘title’ elements
Order Based Functionality of XPath
Axes – specify the direction of navigation in an XML document
Up: parent, ancestor
Down: child, descendant
Left: preceding, preceding-sibling
Right: following, following-sibling
Order Based Functionality of XQuery
BEFORE operator
- Returns nodes from the first sequence that are before some node in the second sequence
AFTER operator
- Returns nodes from the first sequence that are after some node in the second sequence
XQuery supports range predicates
- allow selection of a range of elements from a sequence
e.g. /play/act[2 TO 4]
returns act #2, act #3 and act #4 in document order.
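The BEFORE, AFTER and range semantics above can be sketched in a few lines of Python. This is only an illustration: nodes are modelled as hypothetical (document position, label) pairs, and the function names are not taken from the presentation or from any XQuery implementation.

```python
# A node is a (document_position, label) pair; a sequence is a list of nodes.
# The sample positions below are made up for illustration.

def before(seq1, seq2):
    """Nodes of seq1 that come before at least one node of seq2."""
    if not seq2:
        return []
    last = max(pos for pos, _ in seq2)
    return [node for node in seq1 if node[0] < last]

def after(seq1, seq2):
    """Nodes of seq1 that come after at least one node of seq2."""
    if not seq2:
        return []
    first = min(pos for pos, _ in seq2)
    return [node for node in seq1 if node[0] > first]

def position_range(seq, lo, hi):
    """1-based inclusive range, like [lo TO hi], returned in document order."""
    return sorted(seq)[lo - 1:hi]

acts = [(4, "act#1"), (20, "act#2"), (35, "act#3"), (50, "act#4"), (70, "act#5")]
speeches = [(10, "speech-a"), (25, "speech-b"), (60, "speech-c")]
```

With this sample data, position_range(acts, 2, 4) plays the role of /play/act[2 TO 4].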
Global Order Encoding Methods
Global Order Encoding
- Absolute positioning of nodes
- Best performance on queries: query evaluation requires only a simple comparison between node positions
- Worst performance on updates, especially deletes
[Figure: example document tree with global order numbers]
play(1)
  title(2)
    text#(3)
  act(4)
    title(5)
      text#(6)
    scene(7)
  act(8)
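A minimal sketch of how global order numbers like those in the example tree could be assigned: one pre-order traversal with a running counter. The sample document and the function name global_order are assumptions made for illustration; text after a closing tag (tail text) is ignored to keep the sketch short.

```python
import xml.etree.ElementTree as ET

def global_order(root):
    """Number elements and their text in pre-order; return [(position, name)]."""
    numbered = []
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1
        numbered.append((counter, elem.tag))
        # Text directly inside the element becomes its own 'text#' node.
        if elem.text and elem.text.strip():
            counter += 1
            numbered.append((counter, "text#"))
        for child in elem:
            visit(child)

    visit(root)
    return numbered

# Hypothetical document shaped like the example tree.
doc = ET.fromstring(
    "<play><title>Hamlet</title>"
    "<act><title>Act I</title><scene/></act>"
    "<act/></play>"
)
```

Because every node gets an absolute position, comparing two nodes in document order is a single integer comparison.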
Global Order Encoding (contd)
Initially, sparse numbering is used for Global Order Encoding
- Sparse numbering brings down the cost of renumbering on inserts/updates, which gives better performance on updates
- Makes intra-element and inter-element ordering easy (since global document order is readily available)
Drawback - inserts can still be expensive, since many nodes may need renumbering (Local Order offers better performance for inserts/updates)
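A sketch of the idea behind sparse numbering, under assumed details: positions are spaced GAP apart (the value 100 is arbitrary), a new node takes the midpoint of the free interval, and a full renumbering happens only when no free position is left. The helper names are hypothetical.

```python
GAP = 100  # assumed spacing between initial positions

def sparse_positions(count, gap=GAP):
    """Initial sparse numbering: gap, 2*gap, 3*gap, ..."""
    return [gap * (i + 1) for i in range(count)]

def insert_before(positions, index):
    """Insert a node before positions[index].

    Returns (new_positions, number_of_existing_nodes_renumbered).
    """
    lo = positions[index - 1] if index > 0 else 0
    hi = positions[index] if index < len(positions) else lo + 2 * GAP
    if hi - lo > 1:
        # A free position exists: no existing node has to change.
        new_pos = (lo + hi) // 2
        return positions[:index] + [new_pos] + positions[index:], 0
    # Gap exhausted: renumber the whole sequence sparsely.
    fresh = sparse_positions(len(positions) + 1)
    old_nodes = fresh[:index] + fresh[index + 1:]
    changed = sum(1 for a, b in zip(positions, old_nodes) if a != b)
    return fresh, changed
```

With dense numbering (1, 2, 3, ...) every insert falls into the second branch, which is the renumbering cost the slide refers to.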
Global Order Renumbering Scenario
Inserting a new element in an existing document causes many nodes to be renumbered
In the adjoining figure, the highlighted nodes need to be renumbered (the largest number among the ordering schemes)
[Figure: the example tree with global order numbers and a new element inserted; the highlighted nodes are those that must be renumbered]
Local Order Encoding Methods
Local Order Encoding
1. Relative positioning of nodes
2. Best performance on updates
3. Worst performance on queries
[Figure: example document tree with local order numbers]
play(1)
  title(1)
    text(1)
  act(2)
    title(1)
      text(1)
    scene(2)
  act(3)
Local Order Encoding (continued)
How does Local Order encoding reconstruct the absolute position?
The relative position of a node is combined with the relative position of its parent (and, in turn, of each ancestor)
The combination yields a vector that uniquely identifies the absolute position within the document
(relative position of node) + (relative positions of ancestors) = (absolute position of node in the document)
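The combination of relative positions can be sketched as follows. The node identifiers and the nodes dictionary are hypothetical; each entry stores only the parent and the sibling position, as Local Order does.

```python
# node id -> (parent id, position among siblings); shaped like the example tree.
nodes = {
    "play":       (None, 1),
    "title":      ("play", 1),
    "text":       ("title", 1),
    "act1":       ("play", 2),
    "act1_title": ("act1", 1),
    "act1_text":  ("act1_title", 1),
    "act1_scene": ("act1", 2),
    "act2":       ("play", 3),
}

def position_vector(node_id):
    """Walk up to the root, collecting sibling positions; return root-first."""
    vector = []
    while node_id is not None:
        parent, sibling_index = nodes[node_id]
        vector.append(sibling_index)
        node_id = parent
    return tuple(reversed(vector))

def document_order(node_ids):
    """Sorting by the vectors recovers document order."""
    return sorted(node_ids, key=position_vector)
```

Note that the vector has to be rebuilt by walking up the tree, which is why order comparisons cost more here than with a stored global position.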
Local Order Renumbering Scenario
As opposed to Global Order Encoding, Local Order requires a minimum number of nodes to be renumbered
This is a major advantage, since it dramatically reduces the cost of inserts
[Figure: the example tree with local order numbers and a new element inserted; the highlighted nodes are those that must be renumbered]
Local Order Encoding (continued)
Incurs low overhead on updates
- Only following siblings may require renumbering
Drawback – lack of global order information results in complex evaluation of the following and preceding axes
Dewey Order Encoding Methods
Dewey Order Encoding
1. Strikes a balance between Global and Local
2. Reasonable performance on updates and queries
[Figure: example document tree with Dewey order paths]
play(1)
  title(1.1)
    text(1.1.1)
  act(1.2)
    title(1.2.1)
      text(1.2.1.1)
    scene(1.2.2)
  act(1.3)
Dewey Order Encoding
Each path uniquely identifies the absolute position of a node in the document
Query processing is similar to that of Global Order
Only following siblings (and their descendants, whose paths contain the sibling's position) may require renumbering
Drawback – extra space is required to store the path from the root to each node
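A minimal sketch of why Dewey paths support query processing similar to Global Order: comparing two paths component by component gives document order, and a prefix test gives the ancestor relationship. Paths are written as dotted strings here purely for illustration, and the function names are hypothetical.

```python
def parse(dewey):
    """'1.2.1' -> (1, 2, 1); tuples compare component by component."""
    return tuple(int(part) for part in dewey.split("."))

def is_ancestor(a, b):
    """True if the node at path a is a proper ancestor of the node at path b."""
    pa, pb = parse(a), parse(b)
    return len(pa) < len(pb) and pb[:len(pa)] == pa

def is_following(node, context):
    """node is after context in document order and is not its descendant."""
    return parse(node) > parse(context) and not is_ancestor(context, node)

def is_preceding(node, context):
    """node is before context in document order and is not its ancestor."""
    return parse(node) < parse(context) and not is_ancestor(node, context)
```

For example, is_following("1.3", "1.2.1") holds, while "1.2.2" is not on the following axis of "1.2" because it is a descendant.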
Dewey Order Renumbering Scenario
Requires more renumbering than Local Order encoding, but much less than Global Order encoding
[Figure: the example tree with a new element inserted under Dewey order; the highlighted nodes are those that must be renumbered]
Shredding XML into Relations
Schema-less Case
- Unknown schema of input XML documents
- Edge Approach: each document is stored as a single table
Schema-aware Case
- Schema of input XML documents is available
- Inlining:
  - Single occurrence of a child: stored within the parent relation
  - Multiple occurrences: stored as a new relation (table)
Inlining
Inlining is an effective way of storing and querying XML, provided the document schema is available
Inlining adapts to Global, Local and Dewey Orders
Every relation requires an additional column to encode document order
Storing order information for ‘inlined’ elements is unnecessary (element position is determined from the position of the parent and from the document schema)
Storing Order Information – Schema less case
The Edge Approach
- Each document is stored as a table
- Each tuple within the table represents a node
Edge (id, parent_id, name, value)
- id: serves as the primary key
- parent_id: serves as a foreign key, linking to the node’s parent
- name: stores the tag name of the element
- value: stores the text value
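The Edge (id, parent_id, name, value) layout can be tried out with SQLite from Python. This is a sketch under assumptions: the table name edge, the function shred and the sample document are illustrative, text is stored in the value column of its element rather than as a separate node, and ids are handed out in pre-order.

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred(xml_text, conn):
    """Store every element of the document as one tuple of the edge table."""
    conn.execute(
        "CREATE TABLE edge ("
        " id INTEGER PRIMARY KEY,"
        " parent_id INTEGER REFERENCES edge(id),"
        " name TEXT,"
        " value TEXT)"
    )
    next_id = 0

    def visit(elem, parent_id):
        nonlocal next_id
        next_id += 1
        my_id = next_id
        text = elem.text.strip() if elem.text and elem.text.strip() else None
        conn.execute(
            "INSERT INTO edge VALUES (?, ?, ?, ?)",
            (my_id, parent_id, elem.tag, text),
        )
        for child in elem:
            visit(child, my_id)

    visit(ET.fromstring(xml_text), None)

# Hypothetical sample document.
SAMPLE = (
    "<play><title>Hamlet</title>"
    "<act><title>Act I</title><scene/></act>"
    "<act/></play>"
)
```

Calling shred(SAMPLE, sqlite3.connect(":memory:")) yields one row per element, each linked to its parent through parent_id.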
Storing Order Information – Schema less case
Edge approach adapts differently to Global, Local and Dewey
Global Order
Edge (id, parent_id, end_desc_id, path_id, value)
- end_desc_id – id of the last descendant of a node
Local Order
Edge (id, parent_id, sIndex, path_id, value)
- sIndex – sibling index of a node
Dewey Order
Edge (dewey, path_id, value)
- dewey – represents both order and ancestor information
Query Translation for Global Order
Edge (id, parent_id, end_desc_id, path_id, value)
Translation of following/preceding
- following: select nodes from the Edge table whose id > end_desc_id of the context node
- preceding: select nodes from the Edge table whose end_desc_id < id of the context node (this excludes the ancestors of the context node)
Translation of following-sibling/preceding-sibling
- following-sibling: select nodes with id > id of the context node AND parent_id = parent_id of the context node
- preceding-sibling: select nodes with id < id of the context node AND parent_id = parent_id of the context node
Note: the above expressions are NOT actual SQL statements
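One possible way to turn the conditions above into real SQL, shown with SQLite from Python. It is a sketch, not the translation used in the presentation: the table is simplified to (id, parent_id, end_desc_id, name), with a name column standing in for path_id and value, and the rows are typed in by hand from the global order example tree.

```python
import sqlite3

# (id, parent_id, end_desc_id, name), following the global order example tree.
ROWS = [
    (1, None, 8, "play"),
    (2, 1, 3, "title"),
    (3, 2, 3, "text#"),
    (4, 1, 7, "act"),
    (5, 4, 6, "title"),
    (6, 5, 6, "text#"),
    (7, 4, 7, "scene"),
    (8, 1, 8, "act"),
]

# One WHERE condition per axis; n is a candidate node, c is the context node.
AXIS_CONDITION = {
    "following": "n.id > c.end_desc_id",
    "preceding": "n.end_desc_id < c.id",
    "following-sibling": "n.parent_id = c.parent_id AND n.id > c.id",
    "preceding-sibling": "n.parent_id = c.parent_id AND n.id < c.id",
}

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edge (id INTEGER PRIMARY KEY, parent_id INTEGER,"
        " end_desc_id INTEGER, name TEXT)"
    )
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?)", ROWS)
    return conn

def axis(conn, axis_name, context_id):
    """Return the ids on the given axis of the context node, in document order."""
    sql = (
        "SELECT n.id FROM edge n, edge c WHERE c.id = ? AND "
        + AXIS_CONDITION[axis_name]
        + " ORDER BY n.id"
    )
    return [row[0] for row in conn.execute(sql, (context_id,))]
```

Every axis is a single self-join with integer comparisons, which is why Global Order does well on queries.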
Query Translation for Local Order
Edge (id, parent_id, sIndex, path_id, value)
Translation of following-sibling/preceding-sibling (similar to Global and Dewey Order)
Translation of following/preceding (complex task)
1. Compute all ancestors of the context node – {anc}
2. Compute the following siblings of the context node and of its ancestors – {anc_sib}
3. Compute the descendants of {anc_sib}
Challenges: without knowledge of the XML schema, retrieving ancestors/descendants is a complex task that involves recursion
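The three steps can be sketched with a recursive common table expression in SQLite. This is an illustrative translation under assumptions, not the one from the presentation: the table is simplified to (id, parent_id, sIndex, name), the ids are deliberately scrambled to show that they carry no order, and the result is returned as an unordered set.

```python
import sqlite3

# (id, parent_id, sIndex, name): same tree as the local order example.
ROWS = [
    (50, None, 1, "play"),
    (20, 50, 1, "title"),
    (70, 20, 1, "text#"),
    (10, 50, 2, "act"),
    (80, 10, 1, "title"),
    (30, 80, 1, "text#"),
    (60, 10, 2, "scene"),
    (40, 50, 3, "act"),
]

FOLLOWING_SQL = """
WITH RECURSIVE
  anc(id, parent_id, sIndex) AS (      -- step 1: context node and its ancestors
    SELECT id, parent_id, sIndex FROM edge WHERE id = :ctx
    UNION ALL
    SELECT e.id, e.parent_id, e.sIndex
    FROM edge e JOIN anc a ON e.id = a.parent_id
  ),
  foll(id) AS (
    SELECT s.id                        -- step 2: their following siblings
    FROM edge s JOIN anc a
      ON s.parent_id = a.parent_id AND s.sIndex > a.sIndex
    UNION ALL
    SELECT e.id                        -- step 3: descendants of those siblings
    FROM edge e JOIN foll f ON e.parent_id = f.id
  )
SELECT id FROM foll
"""

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edge (id INTEGER PRIMARY KEY, parent_id INTEGER,"
        " sIndex INTEGER, name TEXT)"
    )
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?)", ROWS)
    return conn

def following(conn, context_id):
    """Ids on the following axis of the context node (unordered)."""
    return {row[0] for row in conn.execute(FOLLOWING_SQL, {"ctx": context_id})}
```

Two recursive steps are needed for one axis, compared with a single comparison under Global Order.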
Query Translation for Dewey Order
Edge (dewey, path_id, value)
dewey column
- stored as a variable length byte string
- replaces parent_id and end_desc_id of the Global Edge table
- encodes parent and descendant information within the dewey path
- eliminates the need to store parent_id and child_id
Drawback:
Storage overhead due to the large number of bytes allocated to each component
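A sketch of one possible byte-string encoding, to show why byte comparison can stand in for document order and where the storage overhead comes from. The fixed width of two bytes per component is an assumption chosen for simplicity; it is not necessarily the encoding used in the presentation.

```python
import struct

BYTES_PER_COMPONENT = 2  # assumed fixed width; caps each component at 65535

def encode(dewey):
    """'1.2.10' -> big-endian bytes, one fixed-width field per component."""
    parts = [int(p) for p in dewey.split(".")]
    return b"".join(struct.pack(">H", p) for p in parts)

def is_descendant(node, ancestor):
    """A proper descendant's byte string starts with the ancestor's bytes."""
    n, a = encode(node), encode(ancestor)
    return len(n) > len(a) and n.startswith(a)

paths = ["1.10", "1.2", "1.2.1"]
by_bytes = sorted(paths, key=encode)   # document order
by_text = sorted(paths)                # plain string order, which is wrong here
```

Each level of depth costs another BYTES_PER_COMPONENT bytes per node, so deep documents pay for the convenience of having order and ancestry in one column.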
Query Translation in Inlining
Essentially uses the same algorithm as the Edge approach, but with 2 extensions
- XML data can be spread across several tables, so evaluating axes requires access to multiple tables as opposed to just one Edge table
- The translation algorithm does not use recursion (since the schema contains sufficient information about the depth and position of nodes)
Drawback:
Data is partitioned across many tables; too many tables to handle
Storage Requirements
Table 1: Storage requirements of the Global, Local and Dewey Encoding Methods
Order Scheme   Edge Table Size   Edge Index Size   Inlining Table Size   Inlining Index Size
Global         52.1 MB           57.9 MB           44.1 MB               28.9 MB
Local          52.1 MB           87.9 MB           47.7 MB               36.8 MB
Dewey          48.9 MB           38.7 MB           44.5 MB               15.8 MB
Performance
All experiments are based on the Shakespeare’s Plays dataset.
Table 2: Test Queries
Query   Query Definition
Q1      /play
Q2      /play/act//speech
Q3      /play/act/scene/speech
Q4      /play/act/scene/speech[2]
Q5      /play/act/scene/*[2]
Q6      /play/act/scene/speech[1 TO 3]
Q7      /play/act[2]/following::speech
Q8      /play/act/scene/speech/speaker/following-sibling::line[2]
Q9      //act/scene/speech BEFORE /play/act[2]
Select and Reconstruct Modes
XPath queries essentially run in 2 different modes
Select Mode: the result set contains only the IDs of the nodes satisfying the XPath expression
Reconstruct Mode: entire XML fragments are extracted from the database in document order
Ordered Selection Edge Results
[Figure: bar chart of execution time in seconds (y axis, 0 to 18) for queries Q1 to Q9 (x axis), comparing Global, Local and Dewey order]
Reconstruction
In reconstruct mode, XML documents need to be extracted from the DB in document order
The optimizer's inability to pick the best plan led to poor results
On the other hand, using ‘tuned’ SQL queries yielded better results
Note: Queries Q3, Q4, Q5 and Q9 performed very poorly (far beyond the indicated scale)
[Figure: bar chart (y axis, 0 to 10) for queries Q1 to Q9 (x axis), comparing Initial and Tuned SQL queries]
Performance
Results based on experiments
- Global Order is the most efficient order encoding method
- Dewey Order follows with the second best performance
- Local Order sorts very often, which degrades overall performance
- Inlining typically performs better than Edge
- In general, the XML document parsing overhead exceeded the XPath processing cost
Performance
Conclusions based on results
- An RDBMS can efficiently support ordered XML
- Global Order is the best for query workloads
- Dewey Order is slightly less efficient than Global Order, but best for a mix of queries and updates
- Schema information makes Local Order a viable alternative
- Relational optimizers do not comprehend the hierarchical XML structure