managing xml data dan suciu university of washington

Managing XML Data

Dan Suciu

University of Washington

How the Web is Today

• HTML documents

• often generated by applications

• consumed by humans only

• easy access: across platforms, across organizations

No Application Interoperability

• HTML not understood by applications– screen scraping brittle

• database technology: client-server– still vendor specific

• companies merge, form partnerships; need interoperability fast

people are inventive: send data by fax !

New Universal Data Exchange Format: XML

A recommendation from the W3C

• XML = data

• XML generated by applications

• XML consumed by applications

• easy access: across platforms, organizations

XML data management problems

Paradigm Shift on the Web

• from documents (HTML) to data (XML)

• from information retrieval to data

management

Paradigm shift for databases

• from relational model to semistructured

dat model

• from data processing to data/query

translation

• from storage to transport

What This Tutorial is About

• what the database community has done– semistructured data model– query languages, schemas

• what the Web community has done:– data formats/models: XML– transformation language (XSL), schemas

• where they meet and where they differ

Outline

• Semistructured and XML data models

• Query languages

• Schemas

• Systems issues

• Conclusions

Part 1Semistructured and XML Data

Models

Semistructured Data

Origins:

• integration of heterogeneous sources

• data sources with non-rigid structure

• biological data

• Web data

The Semistructured Data Model

&o1

&o12 &o24 &o29

&o43&96

&243 &206

&25

“Serge”“Abiteboul”

1997

“Victor”“Vianu”

122 133

paperbook

paper

references

referencesreferences

authortitle

yearhttp

author

authorauthor

title publisherauthor

authortitle

page

firstnamelastname

firstname lastname firstlast

Bib

Object Exchange Model (OEM) complex object

atomic object

[H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, J. Widom, 1997]

Syntax for Semistructured Data

Bib: &o1 { paper: &o12 { … },

book: &o24 { … },

paper: &o29

{ author: &o52 “Abiteboul”,

author: &o96 { firstname: &243 “Victor”,

lastname: &o206 “Vianu”},

title: &o93 “Regular path queries with constraints”,

references: &o12,

references: &o24,

pages: &o25 { first: &o64 122, last: &o92 133}

}

}

Syntax for Semistructured Data

May omit oid’s:

{ paper: { author: “Abiteboul”,

author: { firstname: “Victor”,

lastname: “Vianu”},

title: “Regular path queries …”,

page: { first: 122, last: 133 }

}

}

Characteristics of Semistructured Data

• missing or additional attributes

• multiple attributes

• different types in different objects

• heterogeneous collections

self-describing, irregular data, no a priori structure

Comparison with Relational Data

{ row: { name: “John”, phone: 3634 },

row: { name: “Sue”, phone: 6343 },

row: { name: “Dick”, phone: 6363 }

}

n a m e p h o n e

J o h n 3 6 3 4

S u e 6 3 4 3

D i c k 6 3 6 3

row row row

name name namephone phone phone

“John” 3634“Sue” “Dick”6343 6363

XML

• a W3C standard to complement HTML

• origins: structured text SGML

• motivation:– HTML describes presentation– XML describes content

• • http://www.w3.org/TR/2000/REC-xml-20001006

(version 2, 10/2000)

SGMLXMLHTML4.0

From HTML to XML

HTML describes the presentation

HTML

<h1> Bibliography </h1>

<p> <i> Foundations of Databases </i>

Abiteboul, Hull, Vianu

<br> Addison Wesley, 1995

<p> <i> Data on the Web </i>

Abiteoul, Buneman, Suciu

<br> Morgan Kaufmann, 1999

XML<bibliography>

<book> <title> Foundations… </title>

<author> Abiteboul </author>

<author> Hull </author>

<author> Vianu </author>

<publisher> Addison Wesley </publisher>

<year> 1995 </year>

</book>

…

</bibliography>XML describes the content

XML Terminology• tags: book, title, author, …• start tag: <book>, end tag: </book>• elements: <book>…<book>,<author>…</author>• elements are nested• empty element: <red></red> abbrv. <red/>• an XML document: single root element

well formed XML document: if it has matching tags

More XML: Attributes

<book price = “55” currency = “USD”>

<title> Foundations of Databases </title>


…

<year> 1995 </year>

</book>

attributes are alternative ways to represent data

More XML: Oids and References

<person id=“o555”> <name> Jane </name> </person>

<person id=“o456”> <name> Mary </name>

<children idref=“o123 o555”/>

</person>

<person id=“o123” mother=“o456”><name>John</name>

</person>

oids and references in XML are just syntax

XML Data Model

Several competing models:• Document Object Model (DOM):

– http://www.w3.org/TR/2001/WD-DOM-Level-3-CMLS-20010209/ (2/2001)

– class hierarchy (node, element, attribute,…)– objects have behavior– defines API to inspect/modify the document

• XSL data model• Infoset

– PSV (post schema validation)

• XML Query data model (next)

XML Query Data Model

• http://www.w3.org/TR/query-datamodel/2/2001

• Describes XML as a tree, specialized nodes

• Uses a functional-style notation (think ML)


Element node (simplified definition):

• elemNode : (QNameValue, {AttrNode }, [ ElemNode | ValueNode]) ElemNode

• QNameValue = means “a tag name”• {...} = means “set of...”• [...] = means “list of ...”


• Reads: “give me a tag, a set of attributes, a list of elements/values, and I will return an element”


Example

<book price = “55”

currency = “USD”>

<title> Foundations … </title>




<year> 1995 </year>

</book>







<year> 1995 </year>

</book>

book1= elemNode(book, {price2, currency3}, [title4, author5, author6, author7, year8])

price2 = attrNode(…) /* next */currency3 = attrNode(…)title4 = elemNode(title, string9)…

book1= elemNode(book, {price2, currency3}, [title4, author5, author6, author7, year8])

price2 = attrNode(…) /* next */currency3 = attrNode(…)title4 = elemNode(title, string9)…


Attribute node:

• attrNode : (QNameValue, ValueNode) AttrNode


Example







<year> 1995 </year>

</book>







<year> 1995 </year>

</book>

price2 = attrNode(price,string10) string10 = valueNode(…) /* next */currency3 = attrNode(currency, string11)string11 = valueNode(…)

price2 = attrNode(price,string10) string10 = valueNode(…) /* next */currency3 = attrNode(currency, string11)string11 = valueNode(…)


Value node:• ValueNode = StringValue |

BoolValue | FloatValue …

• stringValue : string StringValue• boolValue : boolean BoolValue• floatValue : float FloatValue


Example







<year> 1995 </year>

</book>







<year> 1995 </year>

</book>

price2 = attrNode(price,string10)string10 = valueNode(stringValue(“55”))currency3 = attrNode(currency, string11)string11 = valueNode(stringValue(“USD”))

title4 = elemNode(title, string9)string9 = valueNode(stringValue(“Foundations…”))

price2 = attrNode(price,string10)string10 = valueNode(stringValue(“55”))currency3 = attrNode(currency, string11)string11 = valueNode(stringValue(“USD”))

title4 = elemNode(title, string9)string9 = valueNode(stringValue(“Foundations…”))

XML Namespaces

• http://www.w3.org/TR/REC-xml-names (1/99)

• name ::= [prefix:]localpart

<book xmlns:isbn=“www.isbn-org.org/def”>

<title> … </title>

<number> 15 </number>

<isbn:number> …. </isbn:number>

</book>

<book xmlns:isbn=“www.isbn-org.org/def”>

<title> … </title>

<number> 15 </number>

<isbn:number> …. </isbn:number>

</book>

<tag xmlns:mystyle = “http://…”>

…

<mystyle:title> … </mystyle:title>

<mystyle:number> …

</tag>

<tag xmlns:mystyle = “http://…”>

…

<mystyle:title> … </mystyle:title>

<mystyle:number> …

</tag>

XML Namespaces

• syntactic: <number> , <isbn:number>

• semantic: provide URL for schema

defined here

XML v.s. Semistructured Data

• both described best by a graph

• both are schema-less, self-describing

Similarities and Differences

<person id=“o123”>

<name> Alan </name>

<age> 42 </age>

<email> ab@com </email>

</person>

{ person: &o123

{ name: “Alan”,

age: 42,

email: “ab@com” }

}

person

name age email

Alan 42 ab@com

person

name age email

Alan 42 ab@com

father father

<person father=“o123”> …</person>

{ person: { father: &o123 …}}

similar on trees, different on graphs

More Differences

• XML is ordered, ssd is not

• XML can mix text and elements: <talk> Making Java easier to type and easier to type

<speaker> Phil Wadler </speaker>

</talk>

• XML has lots of other stuff: entities, processing instructions, comments

Very important:these differences make XML data management harder

Summary of Data Models

• semistructured data, XML

• data is self-describing, irregular

• schema embedded with the data

Part 2Query Languages


• Query languages

• Schemas

• Systems issues

• Conclusions

Query Languages: Outline

• Path expressions:– XPath

• Declarative languages (query languages)– XML-QL– Quilt– XQuery

• Recursive languages– structural recursion– XSL

XPath• http://www.w3.org/TR/xpath (11/99)

• Building block for other W3C standards:– XSL Transformations (XSLT) – XML Link (XLink)– XML Pointer (XPointer)– XML Query

• Was originally part of XSL

Example for XPath Queries<bib>

<book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year></book><book price=“55”> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year></book>

</bib>

<bib><book> <publisher> Addison-Wesley </publisher> <author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <title> Foundations of Databases </title> <year> 1995 </year></book><book price=“55”> <publisher> Freeman </publisher> <author> Jeffrey D. Ullman </author> <title> Principles of Database and Knowledge Base Systems </title> <year> 1998 </year></book>

</bib>

XPath: Simple Expressions

/bib/book/year

Result: <year> 1995 </year>

<year> 1998 </year>

/bib/paper/year

Result: empty (there were no papers)

XPath: Restricted Kleene Closure

//author

Result:<author> Serge Abiteboul </author> <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author> <author> Victor Vianu </author> <author> Jeffrey D. Ullman </author>

/bib//first-nameResult: <first-name> Rick </first-name>

Xpath: Text Nodes

/bib/book/author/text()

Result: Serge Abiteboul

Jeffrey D. Ullman

Rick Hull doesn’t appear because he has firstname, lastname

Xpath: Wildcard

//author/*

Result: <first-name> Rick </first-name>

<last-name> Hull </last-name>

* Matches any element

Xpath: Attribute Nodes

/bib/book/@price

Result: “55”

@price means that price is has to be an attribute

Xpath: Qualifiers

/bib/book/author[firstname]

Result: <author> <first-name> Rick </first-name>

<last-name> Hull </last-name>

</author>

Xpath: More Qualifiers

/bib/book/author[firstname][address[//zip][city]]/lastname

Result: <lastname> … </lastname>

<lastname> … </lastname>

Xpath: More Qualifiers

/bib/book[@price < “60”]

/bib/book[author/@age < “25”]

/bib/book[author/text()]

Xpath: Summarybib matches a bib element

* matches any element

/ matches the root element

/bib matches a bib element under root

bib/paper matches a paper in bib

bib//paper matches a paper in bib, at any depth

//paper matches a paper at any depth

paper|book matches a paper or a book

@price matches a price attribute

bib/book/@price matches price attribute in book, in bib

bib/book/[@price<“55”]/author/lastname matches…

Xpath: More Details

• An Xpath expression, p, establishes a relation between:– A context node, and– A node in the answer set

• In other words, p denotes a function:– S[p] : Nodes -> {Nodes}

• Examples:– author/firstname– . = self– .. = parent– part/*/*/subpart/../name = part/*/*[subpart]/name

The Root and the Root

• <bib> <paper> 1 </paper> <paper> 2 </paper> </bib>• bib is the “document element”• The “root” is above bib

• /bib = returns the document element• / = returns the root

• Why ? Because we may have comments before and after <bib>; they become siblings of <bib>

• This is advanced xmlogy

Xpath: More Details

• We can navigate along 13 axes:ancestorancestor-or-selfattributechilddescendantdescendant-or-selffollowingfollowing-siblingnamespaceparentprecedingpreceding-siblingself

Xpath: More Details

• Examples:– child::author/child:lastname =

author/lastname– child::author/descendant::zip = author//zip– child::author/parent::* = author/..– child::author/attribute::age = author/@age

XML-QL: A Query Language for XML

• http://www.w3.org/TR/NOTE-xml-ql (8/98)• Also:

– [Deutsch, Fernandez, Florescu, Levy, Suciu, 99]

• features:– regular path expressions– patterns, templates– Skolem Functions

• based on OEM data model

Pattern Matching in XML-QL

WHERE <book language=“french”> <publisher> <name> Morgan Kaufmann </name> </publisher> <author> $a </author> </book> in “www.a.b.c/bib.xml”CONSTRUCT $a

WHERE <book language=“french”> <publisher> <name> Morgan Kaufmann </name> </publisher> <author> $a </author> </book> in “www.a.b.c/bib.xml”CONSTRUCT $a

$a = a variable

Path Expressions in XML-QL

WHERE <book language=“french”> <publisher/name> Morgan Kaufmann </> <author> $a </> </> in “www.a.b.c/bib.xml”CONSTRUCT $a

WHERE <book language=“french”> <publisher/name> Morgan Kaufmann </> <author> $a </> </> in “www.a.b.c/bib.xml”CONSTRUCT $a

Note: </> abbreviates </book> or </result> or ...

Here: we use Xpath expressions; Original XML-QL: used math syntax

Simple Constructors in XML-QL

WHERE <book language = $l> <author> $a </> </> in “www.a.b.c/bib.xml”CONSTRUCT <result> <author> $a </> <lang> $l </> </>

WHERE <book language = $l> <author> $a </> </> in “www.a.b.c/bib.xml”CONSTRUCT <result> <author> $a </> <lang> $l </> </>

<result> <author>Smith</author><lang>English</lang></result><result> <author>Smith</author><lang>Mandarin</lang></result><result> <author>Doe</author><lang>English</lang></result>

Skolem Functions in XML-QL

WHERE <book language = $l> <author> $a </> </> in “www.a.b.c/bib.xml”CONSTRUCT <result> <author id=F($a)> $a</> <lang> $l </> </>

WHERE <book language = $l> <author> $a </> </> in “www.a.b.c/bib.xml”CONSTRUCT <result> <author id=F($a)> $a</> <lang> $l </> </>

<result> <author>Smith</author> <lang>English</lang> <lang>Mandarin</lang> </result><result> <author>Doe</author> <lang>English</lang> </result>

XQuery

• Based on Quilt (which is based on XML-QL)

• http://www.w3.org/TR/xquery/2/2001

• Xquery = XML-QL – patterns – Skolem + aggregates + duplicates + based on XML data model

XQuery

<result>

FOR $x in /bib/book

WHERE $x/year > 1995

RETURN <newtitle>

$x/title

</newtitle>

</result>

XQuery

• FOR $x in expr -- binds $x to each value in the list expr

• LET $x = expr -- binds $x to the entire list expr– Useful for common subexpressions and for

aggregations

XQuery

<big_publishers> FOR $p IN distinct(document("bib.xml")//publisher) LET $b := document("bib.xml")/book[publisher = $p] WHERE count($b) > 100 RETURN $p

</big_publishers>

distinct = a function that eliminates duplicatescount = a (aggregate) function that returns the number of elms

XQuery

Summary:

• FOR-LET-WHERE-RETURN = FLWR

FOR/LET Clauses

WHERE Clause

RETURN Clause

List of tuples

List of tuples

Instance of Xquery data model

Structural Recursion

Data as sets with a union operator:

{a:3, a:{b:”one”, c:5}, b:4} =

{a:3} U {a:{b:”one”,c:5}} U {b:4}


f(T1 U T2) = f(T1) U f(T2)f({L: T}) = f(T)f({}) = {}f(V) = if isInt(V) then {result: V} else {}

f(T1 U T2) = f(T1) U f(T2)f({L: T}) = f(T)f({}) = {}f(V) = if isInt(V) then {result: V} else {}

Example: retrieve all integers in the data

a a b

b c3

“one” 5

4result result result

3 5 4

standard textbook programming on trees


Example: increase all engine prices by 10%

f(T1 U T2) = f(T1) U f(T2)f({L: T}) = if L= engine then {L: g(T)} else {L: f(T)}f({}) = {}f(V) = V

f(T1 U T2) = f(T1) U f(T2)f({L: T}) = if L= engine then {L: g(T)} else {L: f(T)}f({}) = {}f(V) = V

g(T1 U T2) = g(T1) U g(T2)g({L: T}) = if L= price then {L:1.1*T} else {L: g(T)}g({}) = {}g(V) = V

g(T1 U T2) = g(T1) U g(T2)g({L: T}) = if L= price then {L:1.1*T} else {L: g(T)}g({}) = {}g(V) = V

engine body

part price

price price

part price

100

1000

100

1000

engine body

part price

price price

part price

110

1100

100

1000

XSL

• http://www.w3.org/TR/xslt.html11/99

• purpose: stylesheet specification language:– stylesheet: XML -> HTML– in general: XML -> XML

XSL Templates and Rules

• query = collection of template rules

• template rule = match pattern + template

<xsl:template> <xsl:apply-templates/> </xsl:template>

<xsl:template match = “/bib/*/title”> <result> <xsl:value-of/> </result></xsl:template>


<xsl:template match = “/bib/*/title”> <result> <xsl:value-of/> </result></xsl:template>

Retrieve all book titles:

Flow Control in XSL


<xsl:template match=“a”> <A><xsl:apply-templates/></A></xsl:template>

<xsl:template match=“b”> <B><xsl:apply-templates/></B></xsl:template>

<xsl:template match=“c”> <C><xsl:value-of/></C></xsl:template>


<xsl:template match=“a”> <A><xsl:apply-templates/></A></xsl:template>

<xsl:template match=“b”> <B><xsl:apply-templates/></B></xsl:template>

<xsl:template match=“c”> <C><xsl:value-of/></C></xsl:template>

XSL is Structural Recursion

Equivalent to:

f(T1 U T2) = f(T1) U f(T2)f({L: T}) = if L= c then {C: t} else L= b then {B: f(t)} else L= a then {A: f(t)} else f(t)f({}) = {}f(V) = V

f(T1 U T2) = f(T1) U f(T2)f({L: T}) = if L= c then {C: t} else L= b then {B: f(t)} else L= a then {A: f(t)} else f(t)f({}) = {}f(V) = V

XSL query = single functionXSL query with modes = multiple function

XSL and Structural Recursion

XSL:• trees only• may loop

Structural Recursion:• arbitrary graphs• always terminates

<xsl:template match = “e”> <xsl:apply-patterns select=“/”/></xsl:template>

<xsl:template match = “e”> <xsl:apply-patterns select=“/”/></xsl:template>

stack overflow on IE 5.0

add the following rule:

Summary of Query Languages

• Path navigation– Xpath: restricted regular path expressions

• Declarative languages:– XML-QL– Quilt/ Xquery

• Structural recursion:– UnQL– XSLT

Part 3Schemas


• Query languages

• Schemas

• Systems issues

• Conclusions

Schemas

• why ?– XML: to describe semantics– semistructured data: to improve processing

• what ?– semistructured data: foundational – XML: several concrete proposals

Schemas

• when ?– semistructured data, XML: a posteriori– RDBMS: a priori, to interpret binary data

• how ?– semistructured data: schema is independent – XML: schema is hardwired with the data

Outline

• schemas for semistructured data:– foundations– schema extraction

• schemas for XML:– DTD– XML-Schema

Schemas: An Example

&r1

&c1 &c2

&s2 &s3 &s6 &s7

&s10

companycompany

nameaddress name

url

address

“Widget” “Trenton” “Gadget”

“www.gp.fr”

“Paris”

&p2&p1 &p3

&s0 &s1 &s4 &s5 &s8 &s9

personperson

person

“Smith”

nameposition name phonename

position

“Manager” “Jones” “5552121” “Dupont” “Sales”

employeemanages

c.e.o.works-for works-for

works-for

c.e.o.

&a1

&a2 &a3

&a4&a5

&a6&a7

description

description

procurement salesrep

contact

task

eval1997

1998

“on target”

“below target”

Some database:

Lower-Bound Schemas

Root

Company Employee

string

companyperson

works-for

c.e.o.

address

name

managed-by

name

The Two Questions to Ask

Conformance: does that data conform to this schema ?

Classification: if so, then which objects belong to what classes ?

Graph Simulation

Definition Two edge-labeled graphs G1, G2

A simulation is a relation R between nodes:• if (x1, x2) in R, and (x1,a,y1) in G1,

then exists (x2,a,y2) in G2 (same label)

s.t. (y1,y2) in R

x1 x2

a

R

G1 G2

y1

a

Ry2

Note: a simulation can be efficiently computed [Henzinger, et a. 1995]

Using Simulation

Data graph D, schema S

• upper bound schema:– conformance: find simulation R from D to S– classification: check if (x,c) in R

• lower bound schema– conformance: find simulation R from S to D– classification: check if (c,x) in R

[Buneman, Davidson, Fernandez, Suciu 1997]

Example

&r1

&c1 &c2

&s2 &s3 &s6 &s7

&s10

companycompany

nameaddress name

url

address

“Widget” “Trenton” “Gadget”

“www.gp.fr”

“Paris”

&p2&p1 &p3

&s0 &s1 &s4 &s5 &s8 &s9

personperson

person

“Smith”

nameposition namephonename

position

“Manager” “Jones” “5552121” “Dupont” “Sales”

employeemanages

c.e.o.works-for works-for

works-forc.e.o.

&a1

&a2 &a3

&a4&a5

&a6&a7

description

description

procurement salesrep

contact

task

eval1997

1998

“on target”

“below target”

Root

Company Employee

string

company

person

works-for

c.e.o.

address

name

managed-by

name

Root

Company Employee

string

company

person

works-for

c.e.o. | employee

name | address | url

managed-by

name | phone | position

Any

description

-

DatabaseLower Bound Upper Bound

simulation: efficient technique for checking conformance to schema

Application 1: Improve Secondary Storage

Root

Company Employee

string

company

person

works-for

c.e.o.

address

name

managed-by

name

o i d n a m e a d d r e s s c . e . o .… … … …… … … …

Company

o i d n a m e m a n a g e d - b y w o r k s - f o r… … … …… … … …

Employee

Store rest in overflow graph

Lower-bound schema

Application 2: Query Optimization

Bib

paper book

yearjournal

title

int string string

addressauthor

title

zip city street

lastname

firstname

string string string string string

string

select X.titlefrom Bib._ Xwhere X.*.zip = “12345”

select X.titlefrom Bib._ Xwhere X.*.zip = “12345”

select X.titlefrom Bib.book Xwhere X.address.zip = “12345”

select X.titlefrom Bib.book Xwhere X.address.zip = “12345”

Upper-bound schema[Fernandez, Suciu 1998]

Schema Extraction(From Data)

Problem statement

• given data instance D

• find the “most specific” schema S for D

In practice: S too large, need to relax

[Nestorov, Abiteboul, Motwani 1998]

Schema Extraction: Sample Data

&r

&p8&p1 &p2 &p3 &p4 &p5 &p6 &p7

&c

company

employeeemployee

employeeemployee employee employee

employeeemployee

worksfor

worksfor

worksforworksforworksfor

worksforworksfor

worksfor

manages

manages

manages

manages

managedby

managedbymanagedby

manages

managedby

managedby

Lower Bound Schema Extraction

Root&r

Bosses&p1,&p4,&p6

Regulars&p2,&p3,&p5,&p7,&p8

Company&c

company employee

manages

managedby

worksfor

worksfor

employee

Upper Bound Schema Extraction: Data Guides

Root&r

Employees&p1,&p1,&p3,P4

&p5,&p6,&p7,&p8

Bosses&p1,&p4,&p6

Regulars&p2,&p3,&p5,&p7,&p8

Company&c

company

employee

managesmanagedby

manages

managedby

worksfor

worksfor

worksfor

Schemas in XML

• Document Type Definition (DTD)

• XML Schema

Document Type Definition: DTD

• part of the original XML specification

• an XML document may have a DTD

• terminology for XML:– well-formed: if tags are correctly closed– valid: if it has a DTD and conforms to it

• validation is useful in data exchange

DTDs as Grammars

<!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT section ((title,section*) | text)> <!ELEMENT title (#PCDATA)> <!ELEMENT text (#PCDATA)>]>

<!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT section ((title,section*) | text)> <!ELEMENT title (#PCDATA)> <!ELEMENT text (#PCDATA)>]>

<paper> <section> <text> </text> </section> <section> <title> </title> <section> … </section> <section> … </section> </section></paper>

DTDs as Schemas

Not so well suited:• impose unwanted constraints on order

<!ELEMENT person (name,phone)>

• references cannot be constrained

• can be too vague: <!ELEMENT person ((name|phone|email)*)>

XML Schemas

• http://www.w3.org/TR/xmlschema-1/10/2000

• generalizes DTDs• uses XML syntax• two documents: structure and datatypes

– http://www.w3.org/TR/xmlschema-1– http://www.w3.org/TR/xmlschema-2

• XML-Schema is very complex– often criticized– some alternative proposals

XML Schemas<xsd:element name=“paper” type=“papertype”/>

<xsd:complexType name=“papertype”>

<xsd:sequence>

<xsd:element name=“title” type=“xsd:string”/>

<xsd:element name=“author” minOccurs=“0”/>

<xsd:element name=“year”/>

<xsd: choice> < xsd:element name=“journal”/>

<xsd:element name=“conference”/>

</xsd:choice>

</xsd:sequence>

</xsd:element>

DTD: <!ELEMENT paper (title,author*,year, (journal|conference))>

Elements v.s. Types in XML Schema

<xsd:element name=“person”> <xsd:complexType> <xsd:sequence> <xsd:element name=“name” type=“xsd:string”/> <xsd:element name=“address” type=“xsd:string”/> </xsd:sequence> </xsd:complexType></xsd:element>

<xsd:element name=“person”> <xsd:complexType> <xsd:sequence> <xsd:element name=“name” type=“xsd:string”/> <xsd:element name=“address” type=“xsd:string”/> </xsd:sequence> </xsd:complexType></xsd:element>

<xsd:element name=“person” type=“ttt”><xsd:complexType name=“ttt”> <xsd:sequence> <xsd:element name=“name” type=“xsd:string”/> <xsd:element name=“address” type=“xsd:string”/> </xsd:sequence></xsd:complexType>

<xsd:element name=“person” type=“ttt”><xsd:complexType name=“ttt”> <xsd:sequence> <xsd:element name=“name” type=“xsd:string”/> <xsd:element name=“address” type=“xsd:string”/> </xsd:sequence></xsd:complexType>

DTD: <!ELEMENT person (name,address)>

• Types:– Simple types (integers, strings, ...)– Complex types (regular expressions, like in DTDs)

• Element-type-element alternation:– Root element has a complex type– That type is a regular expression of elements– Those elements have their complex types...– ...– On the leaves we have simple types

Local and Global Types in XML Schema

• Local type: <xsd:element name=“person”>

[define locally the person’s type] </xsd:element>

• Global type: <xsd:element name=“person” name=“ttt”/>

<xsd:complexType name=“ttt”> [define here the type ttt] </xsd:complexType>

Global types: can be reused in other elements

Local v.s. Global Elements inXML Schema

• Local element: <xsd:complexType name=“ttt”>

<xsd:sequence> <xsd:element name=“address” type=“...”/>... </xsd:sequence> </xsd:complexType>

• Global element: <xsd:element name=“address” type=“...”/>

<xsd:complexType name=“ttt”> <xsd:sequence> <xsd:element ref=“address”/> ... </xsd:sequence> </xsd:complexType>

Global elements: like in DTDs

Regular Expressions in XML Schema

Recall the element-type-element alternation: <xsd:complexType name=“....”>

[regular expression on elements] </xsd:complexType>

Regular expressions:• <xsd:sequence> A B C </...> = A B C• <xsd:choice> A B C </...> = A | B | C• <xsd:group> A B C </...> = (A B C)• <xsd:... minOccurs=“0” maxOccurs=“unbounded”> ..</...> = (...)*• <xsd:... minOccurs=“0” maxOccurs=“1”> ..</...> = (...)?

Summary of XML Schema

• Formal Expressive Power:– Can express precisely the regular tree

languages (over unranked trees)

• Lots of other stuff– Some form of inheritance– A “null” value– Large collection of data types

Summary of Schemas

• in SS data: – graph theoretic– data and schema are decoupled– used in data processing

• in XML– from grammar to object-oriented– schema wired with the data– emphasis on semantics for exchange

Part 4Systems Issues


• Query languages

• Schemas

• Systems issues

• Conclusions

Systems Issues

• XML publishing (relational XML)

• XML storage (XML relational)

• XML processing (XML XML)

• XML compression (XML )

• XML indexing (XML )

XML Publishing

Today:• legacy data

– fragmented into many flat relations– 3rd normal form– schema is proprietary

• XML data– nested– un-normalized– schema designed by agreement

XML Publishing

• XML = view of Relational/OO/OR sources– Virtual view = mediator based publishing– Materialized view = warehouse based publishing

• Two systems:– SilkRoute

• Virtual [Fernandez, Suciu, Tan, 2000]

• Materialized [Fernandez, Morishima, Suciu, 2001]

– Experanto• Virtual [Shanmugasundaram, Tufte, DeWitt, Naughton, Maier, 2000]

• Materalized [Shanmugasundaram, Shekita, Barr, Carey, Lindsay, Pirahesh, Reinwald, 2000]

XML Publishing in SilkRoute

• relational database:

• virtual XML view:

<store> <name> n1 </name> <book> ... </book> <book> ... </book> ... </store> <store> <name>n2 </name> <book> ... </book> <book> ... </book> …</store>

s i d n a m e… …… …

Stores i d b i d… …… …

SBb i d t i t l e… …… …

Book


• specify mediator declaratively (a view):

from Store, SB, Bookwhere Store.sid=SB.sid and SB.bid=Book.bidconstruct <store ID=f(Store.sid)> <name> Store.name </name> <book> Book.title </book> </store>

from Store, SB, Bookwhere Store.sid=SB.sid and SB.bid=Book.bidconstruct <store ID=f(Store.sid)> <name> Store.name </name> <book> Book.title </book> </store>


• users ask XML-QL queries:– find stores who sell “The Calculus”

where <store> <name> $n </name> <book> The Calculus </book> <store>construct <result> $n </result>

where <store> <name> $n </name> <book> The Calculus </book> <store>construct <result> $n </result>


• system composes query with view:

from Store, SB, Bookwhere Store.sid=SB.sid and SB.bid=Book.bid and Book.title=“The Calculus”construct <result> Store.name </result>

from Store, SB, Bookwhere Store.sid=SB.sid and SB.bid=Book.bid and Book.title=“The Calculus”construct <result> Store.name </result>

XML Storage

• Text file (XML)

• Use generic schema– [Florescu, Kossman 1999]

• Use DTD to derive schema– [Shanmugasundaram, et al. 1999]

• Use data mining to derive schema– [Deutsch, Fernandez, Suciu 1999]

• Build special purpose repository

XML Storage: Text File

• advantages– simple– less space than database !!!– reasonable clustering

• disadvantage– no updates– require special purpose query processor

&o1

&o3

&o2

&o4 &o5

paper

title author authoryear

&o6

“The Calculus” “…” “…” “1986”

XML Stoarge: Ternary Relation

[Florescu, Kossman 1999]

S o u r c e L a b e l D e s t

& o 1 p a p e r & o 2& o 2 t i t l e & o 3& o 2 a u t h o r & o 4& o 2 a u t h o r & o 5& o 2 y e a r & o 6

N o d e V a l u e

& o 3 T h e C a l c u l u s& o 4 …& o 5 …& o 6 1 9 8 6

Ref

Val

XML Storage: DTD to Schema

• DTD:

• Relational schema:Employee(eid, name, address)

Address(aid, eid, street, city, state, zip)

<!ELEMENT employee (name, address, project*)><!ELEMENT address (street, city, state, zip)>

<!ELEMENT employee (name, address, project*)><!ELEMENT address (street, city, state, zip)>

[Christophides, Abiteboul, Cluet, Scholl 1994]

[Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton 1999]

XML Storage: Data Mining to Schema

paperpaper paper

paper

authorauthor author author author

titletitle title title

year

fn fn fn fn lnlnlnln

a u t h o r t i t l eX X

f n 1 l n 1 f n 2 l n 2 t i t l e y e a r

X X X X X -X X - - X XX X - - X -

Paper1

Paper2

[Deutsch, Fernandez, Suciu 1999]

Query Evaluation

• Stream XML processing– Xscan

• [Halevy, Ives]

– Niagara • [Shanmugasundaram, Tufte, DeWitt, Naughton, Maier 2000]• [Chen, DeWitt, Tian, Wang, 2000]

• XML updates [Halevy, Ives, Tatarinov, Weld, 2001]

• Query optimization [McHugh, Widom 99]

XScan

• Store and query approach:– Store XML on disk, then index & query – Cannot amortize storage costs

• Streaming data approach (XScan):– Read & parse – Track nodes by ID Index XML graph structure

XScan

• Illustrated here: pattern matching• XML-QL pattern:

• Becomes: $x = (root)/doc/person

$n = $x/name $a = $x/address $s = $a/street $z = $a/zip

<doc/person> <name> $n </> <address> <street> $s </> <zip> $z </> </>

</>

<doc/person> <name> $n </> <address> <street> $s </> <zip> $z </> </>

</>

XScan

• Built a tree of binding tables:$x

elm1

elm2

...

$n $a

$a$n

Saves joinsCan free space when not needed

Compression: The Problem

• XML for exchange (space or time)

• but XML is verbose

• users prefer application specific formats:– Web Server Logs– EMBL– G2

• is XML doomed to fail ?

An Example:Web Server Logs

202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I)

<apache:entry>

<apache:host> 202.239.238.16 </apache:host>

<apache:requestLine> GET / HTTP/1.0 </apache:requestLine>

<apache:contentType> text/html </apache:contentType>

<apache:statusCode> 200</apache:statusCode>

<apache:date> 1997/10/01-00:00:02</apache:date>

<apache:byteCount> 4478</apache:byteCount>

<apache:referer> http://www.net.jp/ </apache:referer>

<apache:userAgent> Mozilla/3.1$[$ja$]$(I)</apache:userAgent>

</apache:entry>

<apache:entry>

<apache:host> 202.239.238.16 </apache:host>

<apache:requestLine> GET / HTTP/1.0 </apache:requestLine>

<apache:contentType> text/html </apache:contentType>

<apache:statusCode> 200</apache:statusCode>

<apache:date> 1997/10/01-00:00:02</apache:date>

<apache:byteCount> 4478</apache:byteCount>

<apache:referer> http://www.net.jp/ </apache:referer>

<apache:userAgent> Mozilla/3.1$[$ja$]$(I)</apache:userAgent>

</apache:entry>

ASCII File 15.9 MB (gzipped 1.6MB):

XML-ized inflates to 24.2 MB (gzipped 2.1MB):

XMill

• [Liefke and Suciu’2000]

• specialized compressor for XML data

• makes XML look “small”

• www.research.att.com/sw/tools/xmill

How Xmill Works: Three Ideas

<apache:entry>

<apache:host> </apache:host>

. . .

</apache:entry>

<apache:entry>

<apache:host> </apache:host>

. . .

</apache:entry>

202.239.238.16

GET / HTTP/1.0

text/html

200

…

202.239.238.16

GET / HTTP/1.0

text/html

200

…

gzip Structure gzip Data

=1.75MB+1

<apache:entry>

. . .

</apache:entry>

<apache:entry>

. . .

</apache:entry>

202.23.23.16

224.42.24.55

…

202.23.23.16

224.42.24.55

…

gzip Structure gzip Data1

=1.33MB+2GET / HTTP/1.0

GET / HTTP/1.1

…

GET / HTTP/1.0

GET / HTTP/1.1

…

gzip Data2

+

3 gzip Structure + gzip c1(Data1) + gzip c2(Data2) + ... =0.82MB

XML Compression

Compression Tradeoff

Indexing Semistructured Data

• coercions: 1995 v.s. “1995”

• regular path expressions– data guides [Goldman, Widom, 1997]– T-indexes [Milo, Suciu, 1999]

Indexing All Paths in the Data1

2 3 4 5 6

7 8 9 10 11 12 13

t t t t t

a b a c a d a a b

Semistructured Data

1

2 3 4 5 6

7 8 10 12 13 7 13 9 11

t

ab c

d

Data Guide

1

2 3 4 5 6

7 13 8 10 12 9 11

t

aa c db

T-Index

Summary of Systems

• XML processing– With relational engines– With XML engines

• Mapping XML to relations is still not well understood

• Indexing work is mostly missing

Part 5Conclusions


• Query languages

• Schemas

• Systems issues

• Conclusions

Summary

• XML = what is out there

• semistructured data = what we can process

• paradigm shift, for both Web and db

• covered in tutorial:– data models, queries, schemas

Current and Future Technologies

• Web applications possible today:– export relational data to XML (e.g. Oracle)– import XML into applications (DOM)

• Web applications in the future:– mediator technology (XML view)– store/process native XML data– mine/analyze XML

Why This Is Cool for Database Researchers

• put to work what you teach in CS101 !– tree traversals (structural recursion, XSL)– automata theory (DTD’s, path expressions)– graph theory (simulation)

• adapt old DB techniques to novel data

• save the trees: from fax to XML

The End

Further Readings

www. w3.org/XML

http://data.cs.washington.edu/xml/

http://cm.bell-labs.com/cm/cs/who/wadler/xml/

http://www.cs.washington.edu/homes/suciu

Abiteboul, Buneman, Suciu

Data on the Web: From Relational to Semistructured to XML

Morgan Kaufmann, 1999

managing xml data dan suciu university of washington

Documents

data xml

data management slide

relational data

irregular data

data formatsmodels

data processing

semistructured data

semistructured data