L09: Introduction to XML Data Management
XML and XML Query Languages
Structural Summary and Coding Scheme
Managing XML Data in Relational Systems

XML and XML Query Languages

Extensible Markup Language for data


A W3C standard to complement HTML
• http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, October 2000)
Standard for publishing and interchange
Origins: structured text (SGML)
• A "cleaner" SGML for the Internet
Motivation: HTML describes presentation, XML describes content


XML – Describing the Content

<project>
  <talk>
    <title> XML Query Processing &amp; Optimization </title>
    <date> March 18, 2004 </date>
    <instructor>
      <name> Lu Hongjun </name> <affiliation> HKUST </affiliation> <email> [email protected] </email>
      <name> Jeffrey X. Yu </name> <affiliation> CUHK </affiliation> <email> [email protected] </email>
    </instructor>
  </talk>
</project>


XML Document/Data

Hierarchical document format for information exchange on the WWW
Self-describing data (tags)
Nested element structures with a single root
Element data can have
• Attributes
• Sub-elements

Basic XML Structures

Elements: <title>…</title>, <name>…</name>
• Open and close tags, or an "empty tag"
• Ordered and nestable; an element can be empty
Attributes
PCDATA/CDATA
An XML document has a single root element
Well-formed XML document: its tags are properly matched and nested

Basic XML Structures: Attributes

Single-valued, unordered
<project proj_id="P1234" budget="1000000">
  <title> XML Data Management </title>
  …
  <year> 2003-2004 </year>
</project>

Special types: ID, IDREF, IDREFS
<member id="m007"> <name> James </name> </member>
<project id="p123">
  <title> XML Data Management </title>
  <member idref="m007 m008"/>
</project>

Other XML Structures

Processing instructions: instructions for applications
<?xml version="1.0"?>

CDATA sections: treat content as character data
<![CDATA[<tag>Whatever!!!</tag><whatever>]]>

Comments: just like HTML
<!-- Comments -->

Entities: external resources and macros
• &my-entity; (non-parameter entity)
• %param-entity; (parameter entity, for DTDs)


Data Centric vs. Document centric

Data-centric:
<project>
  <pname> XML </pname>
  <member ID="&3" age="50">
    <name> H. Lu </name>
    <email> [email protected] </email>
    <publication author="H. Lu">
      <title> Managing XML data using RDBMS </title>
      <year> 2001 </year>
    </publication>
    …
  </member>
  <member ID="&24" age="35">
    <name> J.X. Yu </name>
    <project>
      <pname> Data mining </pname>
    </project>
  </member>
</project>

Document-centric:
<bio>
  <p> Dr Lu is a professor at <b> HKUST. </b> He worked at <b> NUS </b> before 1998. </p>
</bio>

XML Data Model

Several competing models
Document Object Model (DOM)
• A platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents


DOM Core Interface : Node

DOM tree: a tree-like structure of Node objects; the root of the tree is a document object.
Node object (nodeName, nodeValue, nodeType, parentNode, childNodes, firstChild, lastChild, previousSibling, nextSibling, attributes, ownerDocument)


DOM Interface

Each node of the document tree may have a number of child nodes, contained in a NodeList object.
Two ways of accessing a node object
• Based on the location of an object in the document tree
• Based on the name of an object

A Sample DOM Tree

[Figure: DOM tree of the sample project document. Element nodes pname, member, name, email and publication carry object ids such as &26, &27, &28, &70 and &71; leaf values include "H. Lu", "J.X. Yu", "Managing …", "2001" and "Data mining".]

project node: tagName = "project", nodeValue = null
publication node: tagName = "publication", nodeValue = null
name node: tagName = "name", nodeValue = "H. Lu"

Data Graph

Similar to a DOM tree, but may use different notations to represent an XML document

[Figure: data graph of the sample project document, with edges labeled pname, member, name, email and publication and leaf values "H. Lu", "J.X. Yu", "Managing …", "2001" and "Data mining".]

Document Type Definition

Inherited from SGML DTD standard
BNF grammar establishing constraints on element structure and content
Specification of attributes and their types
Definitions of entities

A Sample DTD

[Figure: graph of the DTD; legible labels: ID, name, title, year]

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE Research [
<!ELEMENT project (pname, member*, publication*)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT member (name, email?, publication*, project*)>
<!ATTLIST member ID ID #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT publication (title, year)>
<!ATTLIST publication author IDREF #IMPLIED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT year (#PCDATA)>
]>


XML Query Languages

There have been a large number of proposals during the past few years:
• XPath [Clark, DeRose, W3C 1999]
• XQuery [Boag, Chamberlin et al., W3C 2003]
• XML-QL [Deutsch, Fernandez et al., QL99]
• XQL [Robie, Lapp, QL99]
• XML-GL [Ceri, Comai et al., WWW99]
• Quilt [Chamberlin, Robie et al., 2000]
From W3C:
• XQuery 1.0 (W3C Working Draft, 12 November 2003): http://www.w3.org/TR/xquery/
• XPath 2.0 (W3C Working Draft, 12 November 2003): http://www.w3.org/TR/xpath20/

XPath: XML Path Language

The purpose
• To address the nodes of an XML tree using a path notation for navigating through the hierarchical structure of an XML document
• Uses a compact, non-XML syntax
• Designed to be embedded in a host language (e.g., XSLT, XQuery)
XPath expressions
• String of characters
• The value of an expression is always an ordered collection of zero or more items (atomic values, nodes)

XPath: Steps

An XPath expression has the following syntax:
Path ::= /Step1/Step2/…/Stepn
where each XPath step is defined as follows:
Step ::= Axis::Node-test Predicate*
Axis specifies the "direction" in which the document should be navigated. For example, child::title[position() = 2]
There are 13 axes: child, descendant, descendant-or-self, parent, ancestor, ancestor-or-self, following, preceding, following-sibling, preceding-sibling, attribute, self, namespace

XPath Path Expressions

project                      matches a project element
*                            matches any element
/                            matches the root
/project                     matches a project element under the root
project/member               matches a member in project
project//name                matches a name in project, at any depth
//title                      matches a title at any depth
member|publication           matches a member or a publication
@age                         matches an age attribute
project/member/@age          matches the age attribute of a member in project
project/member[@age<"45"]    matches a member in project with age < 45

XPath Query Examples

/project/member/name: matches a name of a member in project
Result: <name> H. Lu </name>
        <name> J.X. Yu </name>

A path that selects venue elements
Result: empty – there is no venue element

//pname: matches a pname at any depth
Result: <pname> XML </pname>
        <pname> Data mining </pname>

/project/member/name/text(): text of name elements
Result: H. Lu
        J.X. Yu
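
The path queries above can be tried out with Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath. The following is a minimal sketch, not part of the original slides: the document is a simplified copy of the sample, the IDs `m3` and `m24` are stand-ins for `&3` and `&24` (a bare `&` is not well-formed XML), and the `<` comparison is done in Python because ElementTree has no such predicate.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the sample document (assumption: ids renamed, email omitted).
DOC = """
<project>
  <pname>XML</pname>
  <member ID="m3" age="50">
    <name>H. Lu</name>
    <publication>
      <title>Managing XML data using RDBMS</title>
      <year>2001</year>
    </publication>
  </member>
  <member ID="m24" age="35">
    <name>J.X. Yu</name>
    <project><pname>Data mining</pname></project>
  </member>
</project>
"""

root = ET.fromstring(DOC)  # root is the outer <project> element


def texts(elements):
    """Return the stripped text of each element."""
    return [e.text.strip() for e in elements]


# /project/member/name
member_names = texts(root.findall("./member/name"))
# //pname
project_names = texts(root.findall(".//pname"))
# /project/member[publication]
with_publication = [m.get("ID") for m in root.findall("./member[publication]")]
# /project/member[@age < "45"], with the comparison done in Python
younger = [m.get("ID") for m in root.findall("./member") if int(m.get("age")) < 45]

print(member_names, project_names, with_publication, younger)
```

A full XPath engine would evaluate the `@age < "45"` predicate directly; the list comprehension only imitates it.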

More XPath Queries

/project/member[publication]
<member ID="&3" age="50">
  <name> H. Lu </name>
  <email> [email protected] </email>
  <publication author="H. Lu">
    <title> Managing XML data using RDBMS </title>
    <year> 2001 </year>
  </publication>
  …
</member>

/project/member[@age < "45"]
<member ID="&24" age="35">
  <name> J.X. Yu </name>
  <project>
    <pname> Data mining </pname>
  </project>
</member>

/project[member/@age < "25"]
No element returned

/project/member[email/text()]
[email protected]

XQuery 1.0: An XML Query Language


W3C Working Draft 12 November 2003: http://www.w3.org/TR/xquery/
XPath expressions are still the basic building block

XQuery

FOR/LET clauses produce an ordered list of tuples of bound variables
• FOR $x IN expr binds $x to each value in the list expr
• LET $x := expr binds $x to the entire list expr; useful for common subexpressions and for aggregations
WHERE clause produces a pruned list of tuples of bound variables
The result is an instance of the XML Query data model

XQuery Examples

FOR $x IN /project/member/publication
WHERE $x/year > 2000
RETURN <recentpub> { $x/title } </recentpub>

<active_members>
{
  FOR $m IN distinct(document("project.xml")//member)
  LET $p := document("project.xml")//publication[author = $m]
  WHERE count($p) > 10
  RETURN $m
}
</active_members>

distinct = a function that eliminates duplicates
count = an (aggregate) function that returns the number of elements
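
The FOR/WHERE/RETURN pipeline of the first query can be imitated step by step in Python. This is only an analogy under stated assumptions, not an XQuery engine: the small document and its second publication ("Older work", 1998) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: one recent and one older publication.
DOC = (
    "<project><member>"
    "<publication><title>Managing XML data using RDBMS</title><year>2001</year></publication>"
    "<publication><title>Older work</title><year>1998</year></publication>"
    "</member></project>"
)

root = ET.fromstring(DOC)
results = []
for pub in root.findall("./member/publication"):   # FOR $x IN /project/member/publication
    if int(pub.findtext("year")) > 2000:            # WHERE $x/year > 2000
        recent = ET.Element("recentpub")            # RETURN <recentpub> { $x/title } </recentpub>
        recent.append(pub.find("title"))
        results.append(recent)

serialized = [ET.tostring(r, encoding="unicode") for r in results]
print(serialized)
```

Each loop iteration corresponds to one tuple of bound variables; the `if` prunes the tuple list exactly as the WHERE clause does.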

Structural Summary and Coding Scheme

Structural Summary

A structural summary for a data graph G_D(V_D, E_D) is another labeled graph G_I(V_I, E_I).
Each node v_i ∈ G_I represents a set of nodes, extent(v_i), with extent(v_i) ⊆ V_D.
An edge (v_i, v_i') exists in G_I if there is an edge (v_d, v_d') in G_D with v_d ∈ extent(v_i) and v_d' ∈ extent(v_i').
The summary preserves all the paths in the data graph. A path expression query can be executed on G_I instead of G_D, which is usually more efficient because G_I is typically much smaller than G_D.

Structural Summary

Basically, nodes in the data graph are grouped based on certain criteria, and each group of nodes is represented by one node in the summary.
The size of the summary is determined by the grouping criteria.
Desired properties when evaluating path expression queries using a summary:
• The results are safe: they contain no false negatives. If not safe, only approximate answers can be obtained.
• The results are precise: they contain no false positives. If not precise, the results must be validated against the data graph.

Structural Summary

[Figure: a data graph with nodes a1, a2, a3, b1, b2, b3, c1, c2, c3 (left) and its structural summary (right)]


Sample Structural Summaries

Query workload independent summaries
• DataGuide
• 1-index [Milo, Suciu, ICDT99]
• A(k)-index [Kaushik, Shenoy, ICDE02]
Query workload dependent summaries
• APEX [Chung, Min et al., SIGMOD02]
• D(k)-index [Chen, Lim et al., SIGMOD03]

Data Guides

DataGuide: dynamic structural summary of the current database
• Each label path in the database appears once in the DataGuide
• No extraneous paths in the DataGuide
Maintained incrementally as the database evolves
Serves the role of a schema
C1 is duplicated to achieve determinism in DataGuides

Bisimilarity and 1-Index

Most existing structural summaries are based on graph bisimilarity, defined as follows:
Two data nodes u and v are bisimilar (u ≈ v) if
• u and v have the same label;
• if u' is a parent of u, then there is a parent v' of v such that u' ≈ v', and vice versa.
Intuitively, if two nodes are bisimilar, the set of paths coming into them is the same.

Tova Milo and Dan Suciu. Index structures for path expressions. In ICDT'99, 277-295, January 1999.

H. Lu/HKUST L09: Introduction to XML Data Management 33


1-index: each index node represents an equivalence class in which the data nodes are mutually bisimilar.
Evaluating a path expression query using the 1-index is
• safe: the result always contains the result of evaluating on the data graph;
• precise: the result contains no false data node.

1-index can be big
Formally, based on the notion of k-bisimilarity (≈_k), which is defined inductively:
• For any two nodes u and v, u ≈_0 v iff u and v have the same label;
• u ≈_k v iff u ≈_{k-1} v and, for every parent u' of u, there is a parent v' of v such that u' ≈_{k-1} v', and vice versa.
Intuitively, if two data nodes are k-bisimilar, the set of paths of length ≤ k coming into them is the same.


A(k)-Index: groups nodes based on their local structure (incoming paths of length up to k) instead of the global path information
• Data nodes in each index node of the A(k)-index are mutually k-bisimilar
Evaluating a path expression query using the A(k)-index:
• safe: its result always contains the result of evaluating on the data graph;
• precise only for path expressions of length at most k; for longer ones the result may contain false data nodes and must be validated against the data graph.

Raghav Kaushik, Pradeep Shenoy, Philip Bohannon and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. ICDE'02, 129-140.

C2 and C3 can be grouped because their length-2 incoming paths are the same
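
The inductive definition of k-bisimilarity translates into an iterative refinement: start from the label partition (≈_0) and, in each round, split blocks by the set of blocks their parents belong to. The sketch below is a naive illustration on a made-up graph (node names `r`, `a1`, `b1`, `d1`, `d2`, `c1`, `c2` are hypothetical, not the graph from the slide figure), not the construction algorithm from the A(k)-index paper.

```python
from collections import defaultdict


def k_bisimulation(labels, edges, k):
    """Partition nodes into k-bisimilarity classes.

    labels: dict node -> label; edges: iterable of (parent, child) pairs.
    """
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
    # Round 0: u and v are 0-bisimilar iff they have the same label.
    block = dict(labels)
    for _ in range(k):
        # u and v stay together iff they were together before and their
        # parents fall into the same set of previous-round blocks.
        block = {
            u: (block[u], frozenset(block[p] for p in parents[u]))
            for u in labels
        }
    groups = defaultdict(list)
    for node, signature in block.items():
        groups[signature].append(node)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data graph: c1 and c2 share parent labels but not grandparent labels.
LABELS = {"r": "root", "a1": "A", "b1": "B", "d1": "D", "d2": "D", "c1": "C", "c2": "C"}
EDGES = [("r", "a1"), ("r", "b1"), ("a1", "d1"), ("b1", "d2"), ("d1", "c1"), ("d2", "c2")]

a1_index = k_bisimulation(LABELS, EDGES, 1)
a2_index = k_bisimulation(LABELS, EDGES, 2)
print(a1_index)
print(a2_index)
```

With k = 1 the two C nodes share one index node because their length-1 incoming paths agree; with k = 2 they are separated. Raising k until the partition stops changing yields the full bisimulation used by the 1-index on this graph.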

APEX: Adaptive Path Index

1-index, A(k)-index and F&B index are all workload independent
APEX: Adaptive Path Index
• Maintains two types of paths in the summary:
  • All paths of length two, so that all queries can be answered using APEX
  • Full paths for those paths that frequently appear in the query workload, so that frequently asked queries can be answered efficiently
• A hash table is included in the index so that partial matching queries with the descendant-or-self axis (//) can be processed efficiently

C-W Chung, J-K Min, K. Shim. APEX: An Adaptive Path Index for XML Data. SIGMOD 02.

D(k)-Index


A generalization of the 1-Index and the A(k)-Index
• Assigns different local bisimilarities to index nodes in the summary structure according to the query load, to optimize its structure
• For any two index nodes n_i and n_j, k(n_i) ≥ k(n_j) − 1 if there is an edge from n_i to n_j, where k(n_i) and k(n_j) are the local bisimilarities of n_i and n_j, respectively
Advantages over the 1-Index and the A(k)-Index
• Workload-sensitive
• Can be updated more efficiently

Qun Chen, Andrew Lim and Kian Win Ong. D(k)-index: An adaptive structural summary for graph-structured data. SIGMOD 03, 134-144.

Node (Edge) Encoding

Structural relationships
• Is node u an ancestor of node v?
• Is node u the parent of node v?
Assign a unique code to each node (edge) in the data graph so that the above questions can be answered by looking at the codes rather than the original data graph.
Issues:
• Length of the code
• Complexity of computing the structural relationship between two nodes from their codes
• Efficient code generation and code maintenance

XML Data Coding Scheme

Region-based
• An XML document is ordered
• Codes are assigned based on the lexicographical location of an element in the original document
Path-based
• An XML document is nested
• Codes are assigned based on the nesting structure of the document, or the path that reaches an element from the root
There are quite a number of variants for both categories of coding schemes

XML Region Based Coding

Region code: (start, end, level)
• u is an ancestor of v iff u.start < v.start < u.end
• u is the parent of v iff, additionally, u.level = v.level − 1
Only one depth-first traversal is needed for code generation
Property: strictly nesting
• Two regions are either completely disjoint (cases 1, 4) or one contains the other (cases 2, 3)
• Formally, a.start < b.start < a.end if a is an ancestor of b

[Figure: cases 1 to 4, the possible relative positions of the regions of nodes a and b under the root]

Sample of Region Codes


[Figure: document tree with two paper elements (children title, allauthors with author elements, year, conf; legible values: jane, poe, jack, lee, 2001, 2003, VLDB, XML, XML database), each node annotated with its level and region code, e.g. (26,28)]

The order of start values is also the document order
The region can also be interpreted as an interval
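
Region codes can be produced in a single depth-first traversal, after which the ancestor and parent tests reduce to integer comparisons. The sketch below is one possible numbering under stated assumptions: it counts only element open and close events (text is not numbered), starts levels at 1, and uses a small made-up document rather than the one in the figure.

```python
import xml.etree.ElementTree as ET


def region_codes(root):
    """Assign (start, end, level) to every element in one depth-first pass."""
    codes = {}
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1               # position of the open tag
        start = counter
        for child in elem:
            visit(child, level + 1)
        counter += 1               # position of the close tag
        codes[elem] = (start, counter, level)

    visit(root, 1)
    return codes


def is_ancestor(u, v):
    """u is an ancestor of v iff u.start < v.start < u.end."""
    return u[0] < v[0] < u[1]


def is_parent(u, v):
    """Parent test: ancestor test plus u.level = v.level - 1."""
    return is_ancestor(u, v) and u[2] == v[2] - 1


doc = ET.fromstring(
    "<paper><title>XML</title>"
    "<allauthors><author>jane</author><author>jack</author></allauthors></paper>"
)
codes = region_codes(doc)
paper = codes[doc]
title = codes[doc.find("title")]
allauthors = codes[doc.find("allauthors")]
first_author = codes[doc.find("allauthors/author")]
print(paper, title, allauthors, first_author)
```

Because regions nest strictly, checking `v.start` against `u`'s interval is enough; `v.end` never has to be compared.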

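Dewey Encoding

[Figure: tree with elements name (value "blah") and phone; phone has children office, home and mobile with values 1234, 5678 and 0000; each node is labeled with its Dewey code]

Node d is a descendant of the node with code 1.2.3 iff 1.2.3 is a prefix of d.Dewey

Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. SIGMOD 2002.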
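
The prefix test on Dewey codes should compare whole components, not raw strings: as a string, "1.2" is a prefix of "1.23", yet the two nodes are unrelated. A minimal sketch (function names are illustrative, not from the cited paper):

```python
def parse(code):
    """Turn a Dewey code such as "1.2.3" into a tuple of integers."""
    return tuple(int(part) for part in code.split("."))


def is_ancestor(a, d):
    """a is an ancestor of d iff a's components are a proper prefix of d's."""
    a_parts, d_parts = parse(a), parse(d)
    return len(a_parts) < len(d_parts) and d_parts[: len(a_parts)] == a_parts


def is_parent(a, d):
    """Parent test: ancestor test plus exactly one extra component."""
    return is_ancestor(a, d) and len(parse(d)) == len(parse(a)) + 1


# Sorting by parsed components gives document order.
ordered = sorted(["1.10", "1.2", "1", "1.2.3"], key=parse)
print(ordered)
```

Comparing the parsed tuples also orders siblings numerically, so 1.2 sorts before 1.10.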

Managing XML Data in Relational Systems

XML-Enabled DB Systems

IBM DB2 XML Extender
• XML column support, XML Collection, files linked from the DBMS, or Character Large Objects (CLOBs)
• Side tables serve as XML indexes
Oracle 9i
• CLOB, OracleText Cartridge, XMLType, and XML SQL Utility
Microsoft SQL Server
• CLOBs, generic Edge technique and user-defined decomposition (from XML to tables), XML views

Storing XML Data in RDBMSs

RDBMS: a mature technology
RDBMSs are widely available
• Less investment to adopt the new technology
• Easy to integrate with other existing applications
Impedance mismatch
• Two-level nature of relational schema (tuples and attributes) vs. arbitrary nesting of XML DTD
• Flat structure vs. recursion
• Structure-based and content-based query

XQuery vs SQL: Different Culture

Data characteristics
• Relational data: regular, homogeneous, flat structure in nature, and no order among tuples
• XML data: irregular, heterogeneous, unpredictable structure, order sensitive
Query languages
• SQL: select-from-where, with the capability to support some fix-point operations
• XQuery: FLWOR (pronounced "flower"): for-let-where-order by-return; simple/regular path expressions

Storing XML Data in RDBMSs: Architecture

[Figure: architecture for storing XML data in an RDBMS, with an automatic schema/data mapping layer on top of a commercial RDBMS]

Storing XML Data in RDBMSs: Issues

Schema/Data mapping: Automate storage of XML in RDBMS

Query mapping: Provide XML views of relational sources

Result construction: Export existing data as XML

XML-Relational Mapping

Model mapping
• Database schemas represent constructs of the XML document model
• DTD independent
• [Florescu & Kossmann 99; Yoshikawa et al., TOIT 01]
Structure mapping
• Database schemas represent the logical structure of target XML documents
• DTD dependent
• [Shanmugasundaram et al., VLDB 99]

A Simple XML Document

<project>
  <pname> XML </pname>
  <member ID="&3">
    <name> H. Lu </name>
    <email> [email protected] </email>
    <publication author="H. Lu">
      <title> Managing XML data using RDBMS </title>
      <year> 2001 </year>
    </publication>
    …
  </member>
  <member ID="&24">
    <name> J.X. Yu </name>
    <project>
      <pname> Data mining </pname>
    </project>
  </member>
</project>


A Sample DOM Tree

[Figure: DOM tree of the simple XML document, with element nodes pname, member, name, email and publication, object ids such as &26, &27, &28, &70 and &71, and leaf values "H. Lu", "J.X. Yu", "Managing …", "2001" and "Data mining"]

Model Mapping: Document Model to Relation

Database schema represents the constructs of XML documents
Fixed database schema for all XML documents
• Data graph: tree (may contain cycles)
• Relational schema represents a tree
Pros and cons
• DTD is not required; documents may not conform to a DTD
• Fixed schema: no schema evolution issue
• Handles large collections of documents with various DTDs
• Semantics get (totally) lost

Model Mapping – Edge/Monet Approach

Edge oriented approach
• Single table schema [Florescu & Kossmann 99]: Edge (source, ordinal, target, label, flag, value)
• Monet [Schmidt et al., WebDB00]: multiple tables, horizontal partitions of edge table on …

Source  Ordinal  Target  Label     Flag  Value
&1      1        &2      "Pname"   val   "XML"
&1      1        &3      "Member"  ref   -
&1      2        &24     "Member"  ref   -
&3      1        &26     "name"    val   "H.Lu"

Note: Document ID is omitted here

Querying with Edge

select name.Value
from Edge dbgroup, Edge member, Edge age, Edge name
where dbgroup.Label = 'DBGroup'
  and member.Label = 'Member'
  and age.Label = 'Age'
  and name.Label = 'Name'
  and dbgroup.Source = 0
  and dbgroup.Target = member.Source
  and member.Target = age.Source
  and member.Target = name.Source
  and cast (age.Value as int) > 20
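
To see the Edge approach end to end, the sketch below shreds a small document into a single Edge table in SQLite and runs the query from this slide. Everything beyond the table layout and the query is an assumption made for illustration: the `DBGroup` document, the integer node ids (0 stands for the document root), and the choice to number `Ordinal` by position among all siblings.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical document matching the labels used in the query.
DOC = (
    "<DBGroup>"
    "<Member><Name>Ann</Name><Age>30</Age></Member>"
    "<Member><Name>Bob</Name><Age>18</Age></Member>"
    "</DBGroup>"
)


def shred(root):
    """Produce one Edge row (source, ordinal, target, label, flag, value) per element."""
    ids = count(1)
    rows = []

    def visit(elem, source, ordinal):
        target = next(ids)
        if len(elem) == 0:   # leaf element: store its text as the value
            rows.append((source, ordinal, target, elem.tag, "val", (elem.text or "").strip()))
        else:                # inner element: the edge only references the target node
            rows.append((source, ordinal, target, elem.tag, "ref", None))
            for position, child in enumerate(elem, start=1):
                visit(child, target, position)

    visit(root, 0, 1)
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Edge (Source INTEGER, Ordinal INTEGER, Target INTEGER,"
    " Label TEXT, Flag TEXT, Value TEXT)"
)
conn.executemany("INSERT INTO Edge VALUES (?, ?, ?, ?, ?, ?)", shred(ET.fromstring(DOC)))

QUERY = """
SELECT name.Value
FROM Edge dbgroup, Edge member, Edge age, Edge name
WHERE dbgroup.Label = 'DBGroup' AND member.Label = 'Member'
  AND age.Label = 'Age' AND name.Label = 'Name'
  AND dbgroup.Source = 0
  AND dbgroup.Target = member.Source
  AND member.Target = age.Source
  AND member.Target = name.Source
  AND CAST(age.Value AS INTEGER) > 20
"""
names = [row[0] for row in conn.execute(QUERY)]
print(names)
```

Each step of the path costs one self-join on the Edge table, which is why long path expressions are expensive under this mapping.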


Model Mapping – Node Approach

XRel [Yoshikawa et al., TOIT 2001]
Four table schema
• Element(pathID, start, end, ordinal)
• Attribute(pathID, start, end, value)
• Text(pathID, start, end, value)
• Path(pathID, pathexp)

Path
PathID  PathExp
1       #/Project
2       #/Project#/pname
3       #/Project#/member
4       #/Project#/member#/name
…       …

Element
PathID  Start  End  Ordinal
2       1      5    1
4       6      9    1
4       21     25   2
…       …      …    …

Text
PathID  Start  End  Value
2       3      4    "XML"
4       7      8    "H. Lu"
4       22     24   "J.X. Yu"
…       …      …    …

Querying with XRel

select v2.Value
from Element e1, Path p1, Path p2, Path p3, Text v1, Text v2
where p1.Pathexp = '#/DBGroup#/Member'
  and p2.Pathexp = '#/DBGroup#/Member#/Age'
  and p3.Pathexp = '#/DBGroup#/Member#/Name'
  and e1.PathID = p1.PathID
  and v1.PathID = p2.PathID
  and v2.PathID = p3.PathID
  /* containment testing */
  and e1.Start < v1.Start and e1.End > v1.End
  and e1.Start < v2.Start and e1.End > v2.End
  and cast(v1.Value as int) > 20
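
The XRel query relies on region containment instead of edge joins. The sketch below loads hand-made rows into SQLite and runs the same query. The data, the positions, and the names `xrel_path`, `xrel_element`, `xrel_text`, `StartPos` and `EndPos` are assumptions made for this example (`END` is an SQL keyword, so the columns are renamed); the Attribute table is left out because the query does not use it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE xrel_path (PathID INTEGER, PathExp TEXT);
CREATE TABLE xrel_element (PathID INTEGER, StartPos INTEGER, EndPos INTEGER, Ordinal INTEGER);
CREATE TABLE xrel_text (PathID INTEGER, StartPos INTEGER, EndPos INTEGER, Value TEXT);

INSERT INTO xrel_path VALUES
  (1, '#/DBGroup'), (2, '#/DBGroup#/Member'),
  (3, '#/DBGroup#/Member#/Name'), (4, '#/DBGroup#/Member#/Age');

-- Hypothetical regions for two members: Ann (30) and Bob (18).
INSERT INTO xrel_element VALUES
  (1, 1, 22, 1),
  (2, 2, 11, 1), (3, 3, 6, 1), (4, 7, 10, 1),
  (2, 12, 21, 2), (3, 13, 16, 1), (4, 17, 20, 1);
INSERT INTO xrel_text VALUES
  (3, 4, 5, 'Ann'), (4, 8, 9, '30'),
  (3, 14, 15, 'Bob'), (4, 18, 19, '18');
""")

QUERY = """
SELECT v2.Value
FROM xrel_element e1, xrel_path p1, xrel_path p2, xrel_path p3,
     xrel_text v1, xrel_text v2
WHERE p1.PathExp = '#/DBGroup#/Member'
  AND p2.PathExp = '#/DBGroup#/Member#/Age'
  AND p3.PathExp = '#/DBGroup#/Member#/Name'
  AND e1.PathID = p1.PathID
  AND v1.PathID = p2.PathID
  AND v2.PathID = p3.PathID
  /* containment testing: both text nodes lie inside the same Member region */
  AND e1.StartPos < v1.StartPos AND e1.EndPos > v1.EndPos
  AND e1.StartPos < v2.StartPos AND e1.EndPos > v2.EndPos
  AND CAST(v1.Value AS INTEGER) > 20
"""
names = [row[0] for row in conn.execute(QUERY)]
print(names)
```

The path table resolves the whole path expression with string comparisons, so the number of joins no longer grows with the path length, only with the number of branches in the query.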


Structural Mapping: Simplifying DTDs

DTD element specifications can be of arbitrary complexity
• <!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)> is valid!
Simplify the DTD for translation purposes. Key observations:
• It is not necessary to regenerate the DTD from the relational schema
• XML queries query the position of an element relative to its siblings, and the parent/child relationships

DTD Simplification: Transformations

Flattening transformations
(e1, e2)*  →  e1*, e2*
(e1, e2)?  →  e1?, e2?
(e1|e2)    →  e1?, e2?

Simplification transformations
e1**  →  e1*
e1*?  →  e1*
e1?*  →  e1*
e1??  →  e1?

Grouping transformations
..., a*, ..., a*, ...  →  a*, ...
..., a*, ..., a?, ...  →  a*, ...
..., a?, ..., a*, ...  →  a*, ...
..., a?, ..., a?, ...  →  a*, ...
..., a, ..., a, ...    →  a*, ...

Example:
<!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)>
becomes
<!ELEMENT a (b*, c?, e*, f*)>

[Deutsch, Fernandez, and Suciu, SIGMOD99]
[Shanmugasundaram, Tufte, He, Zhang, DeWitt, and Naughton, VLDB99]

A Sample DTD

<!ELEMENT book (booktitle, author)>
<!ELEMENT booktitle (#PCDATA)>
<!ELEMENT author (name, address)>
<!ATTLIST author id ID #REQUIRED>
<!ELEMENT name (firstname?, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT address ANY>
<!ELEMENT article (title, author*, contactauthor)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT contactauthor EMPTY>
<!ATTLIST contactauthor authorID IDREF #IMPLIED>
<!ELEMENT monograph (title, author, editor)>
<!ELEMENT editor (monograph*)>
<!ATTLIST editor name CDATA #REQUIRED>

[Figure: DTD graph; legible labels: firstname, lastname, address, authorid]

[Shanmugasundaram et al., VLDB 99]

DTD to Relational Schema: Naïve Approach

Each element ==> relation
Each attribute of an element ==> column of the relation
Connect elements using foreign keys

<!ELEMENT author (name, address)>
<!ATTLIST author id ID #REQUIRED>
<!ELEMENT name (firstname?, lastname)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT address ANY>

author (authorID: integer, id: string)
name (nameID: integer, authorID: integer)
firstname (firstnameID: integer, nameID: integer, value: string)
lastname (lastnameID: integer, nameID: integer, value: string)
address (addressID: integer, authorID: integer, value: string)
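
The naïve mapping is mechanical enough to sketch in a few lines. The function below is an illustration under simplifying assumptions: the DTD fragment is given as plain dictionaries (no DTD parsing), every element has at most one parent, and elements without sub-elements get a `value` column. The names `naive_schema`, `CHILDREN` and `ATTRIBUTES` are invented for this example.

```python
def naive_schema(children, attributes):
    """Map each element to a relation: own id, parent foreign key, attributes, value."""
    parent_of = {child: parent for parent, kids in children.items() for child in kids}
    schema = {}
    for element, kids in children.items():
        columns = [f"{element}ID"]                       # surrogate key
        if element in parent_of:
            columns.append(f"{parent_of[element]}ID")    # foreign key to the parent
        columns.extend(attributes.get(element, []))      # one column per attribute
        if not kids:
            columns.append("value")                      # leaf elements carry content
        schema[element] = columns
    return schema


# The author fragment of the sample DTD, written out by hand.
CHILDREN = {
    "author": ["name", "address"],
    "name": ["firstname", "lastname"],
    "firstname": [],
    "lastname": [],
    "address": [],
}
ATTRIBUTES = {"author": ["id"]}

schema = naive_schema(CHILDREN, ATTRIBUTES)
for relation, columns in schema.items():
    print(f"{relation} ({', '.join(columns)})")
```

The output reproduces the five relations listed above, which makes the fragmentation problem visible: reading one author's first and last names touches four of them.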

Basic Inlining Technique

Problem of the naïve approach: fragmentation (too many tables)
• Results in 5 relations in the previous example, so retrieving the first and last names of an author requires joins
Intuition:
• Inline as many sub-elements as possible
• Do not inline only if it is a set sub-element (RDBMSs do not all support set-valued columns)
Connect relations using foreign keys
• Can handle recursion
A document can be rooted at any element
• Create a separate relation for each root

Basic Inlining Technique: Relation Schemas

article (articleID: integer, article.contactauthor.authorid: string, article.title: string)

article.author (article.authorID: integer, article.author.parentID: integer, article.author.name.firstname: string, article.author.name.lastname: string, article.author.address: string, article.author.authorid: string)

[Figure: DTD graph; legible labels: firstname, lastname, address, authorid]


Basic Inlining Technique: Pros & Cons

+ Reduces number of joins for queries like "get the first and last names of a book author"
+ Efficient for queries such as "list all authors of books"
- Queries like "list all authors with name Ullman" need a union of 5 queries!
- Large number of relations:
• Unrolling recursive strongly connected components (major)
• Separate relational schema for each element as root (minor)

Shared Inlining Technique

Intuition:
• Inline as many sub-elements as possible
• Do not inline only if it is a shared, recursive or set sub-element
• An element node is represented in exactly one relation
Technique: map the following nodes into relations:
• Shared: in-degree >= 2 in the DTD graph
• Root elements: in-degree = 0
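
Which elements get their own relation under Shared can be read off the DTD graph. The sketch below encodes the sample DTD by hand as (parent, child, quantifier) edges and applies three rules: in-degree 0, in-degree >= 2, or reached through a `*` edge (a set sub-element). It is a simplification: recursion is covered here only because the recursive edge editor → monograph happens to be a `*` edge, and the names `DTD_EDGES` and `shared_relations` are invented for this example.

```python
from collections import Counter

# Edges of the sample DTD graph: (parent, child, quantifier on the child).
DTD_EDGES = [
    ("book", "booktitle", ""), ("book", "author", ""),
    ("author", "name", ""), ("author", "address", ""),
    ("name", "firstname", "?"), ("name", "lastname", ""),
    ("article", "title", ""), ("article", "author", "*"), ("article", "contactauthor", ""),
    ("monograph", "title", ""), ("monograph", "author", ""), ("monograph", "editor", ""),
    ("editor", "monograph", "*"),
]


def shared_relations(edges):
    """Return the elements that become separate relations under Shared."""
    elements = {e for parent, child, _ in edges for e in (parent, child)}
    in_degree = Counter(child for _, child, _ in edges)
    set_children = {child for _, child, q in edges if q in ("*", "+")}
    return sorted(
        e for e in elements
        if in_degree[e] == 0        # root elements
        or in_degree[e] >= 2        # shared elements
        or e in set_children        # set sub-elements cannot be inlined
    )


relations = shared_relations(DTD_EDGES)
print(relations)
```

All remaining elements (booktitle, name, firstname, lastname, address, contactauthor, editor) are inlined into the relation of their nearest ancestor that has one.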

Issues with Sharing Elements

Parent of elements not fixed at schema level
Need to store type and ids of parents (or if there are no parents)
• parentCODE field (type of parent)
• parentID field (id of parent)
• Not a foreign key relationship

Shared: Relational Schema

book (bookID: integer, book.booktitle.isroot: boolean, book.booktitle: string)

article (articleID: integer, article.contactauthor.isroot: boolean, article.contactauthor.authorid: string)

monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.editor.isroot: boolean, monograph.editor.name: string)

title (titleID: integer, title.parentID: integer, title.parentCODE: integer, title: string)

author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.isroot: boolean, author.name.firstname.isroot: boolean, author.name.firstname: string, author.name.lastname.isroot: boolean, author.name.lastname: string, author.address.isroot: boolean, author.address: string, author.authorid: string)

Shared Inlining Techniques: Pros & Cons

+ Reduces number of joins for queries like "get the first and last names of an author"
+ Efficient for queries such as "list all authors with name Ullman"
- Sharing whenever possible implies extra joins for path expressions
• "Article with a given title name"

Hybrid Inlining Technique

Inlines some elements that are shared (kept as separate relations) in Shared
• Elements with in-degree >= 2 that are not set sub-elements or recursive
Handles set and recursive sub-elements as in Shared

Hybrid: Relational Schema

book (bookID: integer, book.booktitle.isroot: boolean, book.booktitle: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)

article (articleID: integer, article.contactauthor.isroot: boolean, article.contactauthor.authorid: string, article.title.isroot: boolean, article.title: string)

monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.title: string, monograph.editor.isroot: boolean, monograph.editor.name: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)

author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.isroot: boolean, author.name.firstname.isroot: boolean, author.name.firstname: string, author.name.lastname.isroot: boolean, author.name.lastname: string, author.address.isroot: boolean, author.address: string, author.authorid: string)

Hybrid Inlining Technique: Pros & Cons

+ Reduces joins through shared elements (that are not set or recursive elements)
+ Shares some strengths of Shared:
• Reduces joins for queries like "get first and last names of a book author"
- Requires more SQL sub-queries to retrieve all authors with name Ullman
• Tradeoff between reducing the number of queries and reducing the number of joins
• Shared targets query reduction; Hybrid targets join reduction


More on Shared and Hybrid

Shared and Hybrid have pros and cons
In many cases, Shared and Hybrid are nearly identical
• Number of joins per SQL query ~ path length
• Mainly due to the large number of set nodes
• A problem, as join processing is expensive!

Regular Expressions

Path expression queries can be represented by regular expressions.

Consider path expressions of the following form:

r = (r)* | (r)+ | (r)? | r1/r2 | r1|r2 | r1//r2 | name.

*: 0 or more occurrences

+: 1 or more occurrences

? : 0 or 1 occurrences

r1/r2 : form a path from r1 to r2 (child)

r1//r2 : form a path from r1 to r2 (descendant)

| : disjunction.

Example: find the names of the authors of all members' publications

member (ID, name, email, PARENTID);
publication (ID, title, author, year, PARENTID);

select m2.name
from member m1, publication, member m2
where publication.PARENTID = m1.ID
  and publication.author = m2.ID

[Figure: data graph fragment with member and publication nodes (object ids &5, &7, &8, &9)]

RPE Expansion

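List the title of publications for all projects

Substitute //: the descendant step is replaced by the explicit paths

project/member/(project/member)*/publication | project/(member/project)*/publication

[Figure: graph with project, member and publication nodes (ID, name, title, year; object ids &5, &7, &8, &9, &12)]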

RPE Expansion

List the title of publications for all projects

project/member/(project/member)*/publication/title | project/(member/project)*/publication/title

Expanding *:

select project.publication.title
union
select project.member.publication.title
union
select project.member.project.publication.title

[Figure: graph with project, member and publication nodes (ID, name, title, year; object ids &5, &7, &8, &9, &12)]
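
Expanding a starred sub-path means enumerating 0, 1, 2, … repetitions up to some bound and issuing one query per resulting path. A minimal sketch, assuming the bound is supplied by the caller (how it is chosen is a separate question) and using an invented helper name `expand_star`:

```python
def expand_star(prefix, loop, suffix, max_repeats):
    """Expand prefix/(loop)*/suffix into explicit paths with 0..max_repeats repetitions."""
    return [
        "/".join(prefix + loop * repeats + suffix)
        for repeats in range(max_repeats + 1)
    ]


# project/(member/project)*/publication/title, expanded up to two repetitions.
paths = expand_star(["project"], ["member", "project"], ["publication", "title"], 2)
for path in paths:
    print(path)
```

Each expanded path then becomes one select block, and the blocks are combined with union as on the slide.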

Recursive Path Expression Queries to SQL

Some DBMSs support least fixed-point computation, e.g., the WITH statement in DB2

Path expression: project/(member/project)*/publication

WITH R(PARENTID, ID) AS (
  select m.PARENTID, p1.ID
  from member m, project p1
  where m.ID = p1.PARENTID
  UNION ALL
  select R.PARENTID, p1.ID
  from R, member m, project p1
  where R.ID = m.PARENTID and m.ID = p1.PARENTID)
select p3.*
from project p2, R, publication p3
where p2.ID = R.PARENTID and R.ID = p3.PARENTID;
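
The same fixed-point query can be run in SQLite, which spells the construct `WITH RECURSIVE`. The sketch below uses made-up rows (three nested projects, one publication each) and adds one condition that is not in the slide, `p2.PARENTID IS NULL`, so that only the top-level project is used as the starting point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (ID INTEGER, PARENTID INTEGER);
CREATE TABLE member (ID INTEGER, PARENTID INTEGER);
CREATE TABLE publication (ID INTEGER, PARENTID INTEGER, title TEXT);

-- project 1 > member 10 > project 2 > member 20 > project 3
INSERT INTO project VALUES (1, NULL), (2, 10), (3, 20);
INSERT INTO member VALUES (10, 1), (20, 2);
INSERT INTO publication VALUES (100, 1, 'A'), (101, 2, 'B'), (102, 3, 'C');
""")

QUERY = """
WITH RECURSIVE R(PARENTID, ID) AS (
  -- one member/project step
  SELECT m.PARENTID, p1.ID
  FROM member m, project p1
  WHERE m.ID = p1.PARENTID
  UNION ALL
  -- extend an existing chain by one more member/project step
  SELECT R.PARENTID, p1.ID
  FROM R, member m, project p1
  WHERE R.ID = m.PARENTID AND m.ID = p1.PARENTID
)
SELECT p3.title
FROM project p2, R, publication p3
WHERE p2.ID = R.PARENTID AND R.ID = p3.PARENTID
  AND p2.PARENTID IS NULL   -- assumption: start from the top-level project only
ORDER BY p3.title
"""
titles = [row[0] for row in conn.execute(QUERY)]
print(titles)
```

On this data the query returns the publications of the nested projects but not publication 'A' of the top-level project itself: R only contains chains with at least one member/project step, so the zero-repetition case of the `*` needs a separate branch.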







Expanding Recursive Path Expression Queries

Expanding wild cards before sending to the DBMS
• Transitive closure operation is not always supported by RDBMSs
• Transitive closure with arbitrary nesting seems not to be supported
• Can handle nested recursive queries (though DB2 does not support them)
How many SQL statements are required?
• Keep executing SQL until an empty result is returned
• VXMLR approach: keep statistics [Zhou et al., VLDB 2001]

Query Translation for Structural Mapping

Translating XML-QL into SQL [Shanmugasundaram et al., VLDB99]
• Simple path expressions to SQL
• Simple recursive path expressions to SQL
• Arbitrary path expressions to simple recursive path expressions
Discussion based on the Shared approach

Queries with Simple Path Expressions

WHERE <book>
        <booktitle> The Selfish Gene </booktitle>
        <author>
          <name>
            <firstname> $f </firstname>
            <lastname> $l </lastname>
          </name>
        </author>
      </book> IN * CONFORMING TO pubs.dtd
CONSTRUCT <result> $f $l </result>

Select A."author.name.firstname", A."author.name.lastname"
From author A, book B
Where B.bookID = A.parentID
  AND A.parentCODE = 0
  AND B."book.booktitle" = 'The Selfish Gene'

Queries with Recursive Path Expressions

WHERE <*.monograph>
        <editor.(monograph.editor)*>
          <name> $n </name>
        </>
        <title> Subclass Cirripedia </title>
      </> IN * CONFORMING TO pubs.dtd
CONSTRUCT <result> $n </result>

With Q1 (monographID, name) AS (
  Select X.monographID, X."editor.name"
  From monograph X
  Where X.title = 'Subclass Cirripedia'
  UNION ALL
  Select Z.monographID, Z."editor.name"
  From Q1 Y, monograph Z
  Where Y.monographID = Z.parentID
    AND Z.parentCODE = 0)
Select A.name
From Q1 A

Queries with Arbitrary Path Expressions

Split complex path expression to (possibly many) simple recursive path expressions

Has effect of splitting a single XML-QL query to (possibly many) SQL queries

Can handle nested recursive queries

WHERE <(article|monograph).$*.name> $n </>

CONSTRUCT <name> $n </>


