1 COMP60411 Semi-structured Data and the Web Datatypes Relax NG, XML Schema, and Tree Grammars Conny Hedeler and Uli Sattler University of Manchester

Upload: nguyendang

Post on 23-Aug-2019


Page 1: COMP60411 Semi-structured Data and the Web Datatypes Relax ...studentnet.cs.manchester.ac.uk/pgt/2013/COMP60411/slides/week4.pdf · 1 COMP60411 Semi-structured Data and the Web Datatypes

COMP60411 Semi-structured Data and the Web
Datatypes
Relax NG, XML Schema, and Tree Grammars

Conny Hedeler and Uli Sattler
University of Manchester


Datatypes and Representations


SE3/M3: Evaluating Robustness

• Robustness in the face of change
  – A measure of evolvability
    • If something changes, does our system break?
    • If it breaks, do we know that it broke?
    • If it broke, can we fix it?
    • If we “fixed” it, can we tell/how hard is it?
• Robustness is an organization-wide phenomenon
  – Fragility in one area can be compensated for by another
    • E.g., by someone who never sleeps and knows the system
  – Different sorts of fragility
    • With different probabilities and costs

Robustness “is the ability of a computer system to cope with errors during execution or the ability of an algorithm to continue to operate despite abnormalities in input, calculations, etc.” [Wikipedia]

SE3/M3: XQuery, schemas, and types

[Figure: processing pipeline. Labels: XML doc., Schema, Schema-aware parser, PSVI (tree adorned with default values & types), Query, Schema-aware query processor, Query processor, Answer.]

Example queries shown on the slide:
    (0.5 * 2) cast as xs:integer
    (validate {doc("el1.xml")})//element(*,AxiomType)

Some SE3 Questions

• Which query is most robust to changes in the schema?
  1. /*/(equivalent|subsumes|...)
  2. /*/*[ssd:axiom(.)]
  3. /*/element(*,el:Axiom)
  4. They are equi-robust (and fragile)
  5. They are equi-robust (and robust)
• Which query is most widely usable?
  1. /*/(equivalent|subsumes|...)
  2. /*/*[ssd:axiom(.)]
  3. /*/element(*,el:Axiom)
  4. They are equi-usable (and not widely usable)
  5. They are equi-usable (and widely usable)

Basics of Types

• What, in the most general sense, is a datatype?
  1. A set of (data) values
  2. A description of the arguments of a function
  3. Anything derived from xs:anyType
  4. An annotation of a variable/node/element
• Anything naming or describing a set
  – ...has an associated type!
• Types are just sets (of “values”)
• The “extensional” view
• A Type System is a language for
  – describing types (the “intensional” view)
  – associating types with other linguistic entities
    • E.g., literals, variables, expressions, programs

Callout: But we may or may not be able to express this type

A Typical Type System & XSD

• some primitive or built-in or “basic” types
  – Integer, strings, etc.
  – xs:anyType, xs:string, xs:duration, ...
• some constructors to build composite types
  – Arrays, records, dictionaries, etc.
  – xs:list, xs:union, ...
• other constructors
  – To, for example, create other derived types
  – xs:restriction, xs:extension
• a syntax for associating types with variables, items, ...
  – And functions, etc.
  – <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  – <person xsi:type="LongPersonType" phone="5433">
• A set of conditions for success or failure (Type Errors)

A Brief Tour of Type Systems

• Strong vs. Weak
  – Type errors are caught/reported vs. silently succeeding/causing havoc
• Static vs. Dynamic
  – Check type at compile time vs. at run time
• Explicit/Manifest vs. Implicit/Latent
  – Type of everything (vars, functions, elements) has (not) to be declared
  – Implicit: requires type inference
  – Explicit: requires type checking
• Nominal vs. Structural
  – Nominal: type compatibility relies on features of the declaration
    • I declare two types, “miles” and “feet”, whose values are integers
    • 1 as miles != 1 as feet
  – Structural: type compatibility relies entirely on value structure
    • 1 as miles == 1 as feet (1 is the same integer!)
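The miles/feet example can be sketched in Python. This is only an illustration: the classes Miles and Feet and the two comparison helpers are hypothetical names, not part of the course material, and Python is used as a stand-in rather than XQuery or XSD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Miles:
    value: int


@dataclass(frozen=True)
class Feet:
    value: int


def nominally_equal(a, b) -> bool:
    # Nominal view: the declared type must match before values are compared.
    return type(a) is type(b) and a.value == b.value


def structurally_equal(a, b) -> bool:
    # Structural view: only the structure and content of the value matter.
    return a.value == b.value


one_mile, one_foot = Miles(1), Feet(1)
print(nominally_equal(one_mile, one_foot))
print(structurally_equal(one_mile, one_foot))
```

Under the nominal comparison the two values differ because their declarations differ; under the structural comparison they agree because both wrap the same integer.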

Some questions

• Java’s type system is primarily
  • strong, manifest, and nominal
• XQuery’s type system is primarily
  • strong, latent, and nominal


Some Expression Examples

• if ($aBool) then 1+1 else "2"
  its type depends on the value of $aBool:
• if (true()) then 1+1 else "2" is an instance of xs:integer:
    (if (true()) then 3+1 else "2") instance of xs:integer
    returns “true”
• if (false()) then 1+1 else "2" is an instance of xs:string:
    (if (false()) then 3+1 else "2") instance of xs:string
    returns “true”
• (if ($aBool) then 1+1 else "2") instance of (xs:integer | xs:string)
    Not legal XQuery
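As an analogy only, the same value-dependent typing can be observed in Python, another dynamically typed language. The function name example is hypothetical; the point is that the dynamic type of the result depends on the boolean argument, and that a check against "integer or string" succeeds either way.

```python
def example(a_bool: bool):
    # Mirrors: if ($aBool) then 1+1 else "2"
    return 1 + 1 if a_bool else "2"


print(type(example(True)).__name__)
print(type(example(False)).__name__)
# A tuple of types plays the role of the union (xs:integer | xs:string).
print(isinstance(example(True), (int, str)))
```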

Mistyped

• Obvious conflict
  – "2" + 2
• Making this conflict less obvious:
  – (if (false()) then 1+1 else "2") + 2
    • Same error as above
  – (if (true()) then 1+1 else "2") + 2
    • This is accepted!
• Making this conflict even less obvious:
  – declare function ssd:test($x as xs:boolean) as xs:integer {
      if ($x) then 1+1 else "2" + 2
    };
  – declare function ssd:test($x as xs:boolean) as xs:integer {
      if ($x) then 1+1 else "2"
    };

Callouts on this slide, in order: “My checker doesn’t flag this error”; “It does flag this one!”; “Arithmetic operator is not defined for arguments of types (xs:integer, xs:string)”

Simple Promotion

• Explicit
  – (1.0 + ("1" cast as xs:integer)) instance of xs:decimal
  – True!
• Implicit
  – ((1.0 treat as xs:decimal) + 125E2) instance of xs:double
  – Also true
  – Note that treat as and cast as are not the same:
    • ("1.0" treat as xs:decimal)
      – doesn’t work: “Required item type of value in 'treat as' expression is xs:decimal; supplied value has item type xs:string”
    • ("1.0" cast as xs:decimal)
      – This results in 1
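The difference between treat as and cast as can be sketched in Python. This is a rough analogy under stated assumptions: treat_as and cast_as are hypothetical helper names, and Python's Decimal stands in for xs:decimal. The first only asserts the dynamic type of a value, the second constructs a new value of the target type.

```python
from decimal import Decimal


def treat_as(value, expected_type):
    # Like 'treat as': check the dynamic type, never convert.
    if not isinstance(value, expected_type):
        raise TypeError(
            f"required type is {expected_type.__name__}; "
            f"supplied value has type {type(value).__name__}"
        )
    return value


def cast_as(value, target_type):
    # Like 'cast as': build a new value of the target type from the input.
    return target_type(value)


print(cast_as("1.0", Decimal))
try:
    treat_as("1.0", Decimal)
except TypeError as err:
    print("treat_as failed:", err)
```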

Complex Casting

http://msdn.microsoft.com/en-us/library/ms191231.aspx

Getting to PSVI

• Consider a very simple XQuery

    import schema default element namespace "…" at "el-typed.xsd";
    <instance-of>
      <constant name="sally"/>
      <atomic name="Person"/>
    </instance-of>/element(*, ClassExpression)

  – No results!
• Must validate!

    import schema default element namespace "…" at "el-typed.xsd";
    validate{
      <instance-of>
        <constant name="sally"/>
        <atomic name="Person"/>
      </instance-of>
    }/element(*, ClassExpression)

  – Returns: <atomic xmlns="..." name="Person"/>
  – validate generates a PSVI

import schema namespace el="http://www.cs.manchester.ac.uk/pgt/COMP60411/el" at "el-typed.xsd";
import schema namespace owl="http://www.w3.org/2002/07/owl#" at "owl2-xml.xsd";

declare namespace ex="http://ex.org";

(: Convert one typed el axiom into a typed OWL axiom. :)
declare function ex:convertAxiom($ax as element(*, el:Axiom))
    as element(*, owl:Axiom) {
  typeswitch ($ax)
    case schema-element(el:equivalent) return
      validate {
        <owl:EquivalentClasses>{
          for $expr in $ax/* return ex:convertExpression($expr)
        }</owl:EquivalentClasses>
      }
    (: Any other kind of axiom becomes a placeholder axiom. :)
    default return
      validate {
        <owl:EquivalentClasses>
          <owl:Class IRI="http://BOGUS"/>
          <owl:Class IRI="http://BOGUS"/>
        </owl:EquivalentClasses>
      }
};

(: Convert one typed el class expression; only atomic classes are handled. :)
declare function ex:convertExpression($expr as element(*, el:ClassExpression))
    as element(*, owl:ClassExpression) {
  if ($expr instance of element(el:atomic))
  then validate { <owl:Class IRI="{$expr/@name}"/> }
  else validate { <owl:Class IRI="http://BOGUS"/> }
};

(: Convert a whole typed el ontology. :)
declare function ex:convert($ont as element(*, el:Ontology))
    as element(owl:Ontology, owl:Ontology) {
  validate {
    <owl:Ontology>{
      for $e in $ont/element(*, el:Axiom) return ex:convertAxiom($e)
    }</owl:Ontology>
  }
};

ex:convert(validate { doc("el1.xml")/* })

...that was a Complex Typed “Cast”

• where input and output are all typed
• ...how do we ensure that our system works correctly with types?

    <?xml version="1.0" encoding="UTF-8"?>
    <owl:Ontology xmlns:owl="http://www.w3.org/2002/07/owl#">
      <owl:EquivalentClasses>
        <owl:Class IRI="http://BOGUS"/>
        <owl:Class IRI="http://BOGUS"/>
      </owl:EquivalentClasses>
      <owl:EquivalentClasses>
        <owl:Class IRI="http://BOGUS"/>
        <owl:Class IRI="http://BOGUS"/>
      </owl:EquivalentClasses>
      <owl:EquivalentClasses>
        <owl:Class IRI="Person"/>
        <owl:Class IRI="http://BOGUS"/>
      </owl:EquivalentClasses>
      <owl:EquivalentClasses>
        <owl:Class IRI="http://BOGUS"/>
        <owl:Class IRI="http://BOGUS"/>
      </owl:EquivalentClasses>
    </owl:Ontology>

Callout on IRI="Person": The only “proper” value

Static type check - type soundness

• A (statically verified) type safe program
  – has some guaranteed behavior
    • and thus can be transformed or optimized in aggressive ways
  – may be more brittle
    • fails hard on invalid input
    • accepts maybe less input than possible

“Type-inference rules are written in such a way that any value that can be returned by an expression is guaranteed to conform to the static type inferred for the expression. This property of a type system is called type soundness. A consequence of this property is that a query that raises no type errors during static analysis will also raise no type errors during execution on valid input data. The importance of type soundness depends somewhat on which errors are classified as "type errors," as we will see below.”

http://www.informit.com/articles/article.aspx?p=100667&seqNum=6

Data Representations

• Data and data structures have representations
  – (More or less) Physical embodiments
  – (Ultimately) Bits in a machine
• The “same” data can have distinct representations
  – 1 vs. “one”
• The “same” data structure can have distinct representations
  – At different levels of abstraction
• One key distinction
  – Internal (“in-memory”)
  – External (“on disk”)
  – “Location” doesn’t really matter
• Generally:
  – External representations are for exchange between (heterogeneous) systems

Conversion

• We can go from external to internal (e2i)
  – Parsing, reading, loading, de-serializing, unmarshalling
• We can go from internal to external (i2e)
  – Serializing, writing, printing, saving, marshalling
  – Different systems may have different internals
    • At least in detail
  – Different applications may behave differently
• There and back again: Roundtripping
  – Internal to external to internal (i2e2i)
  – External to internal to external (e2i2e)
  – Ideally preserves key properties
    • Which?
    • When is it OK not to preserve?

What is an XML “Document”?

Level | Data unit examples | Information or Property required
cognitive | |
application | |
tree adorned with... | Element/Attribute tree | namespace schema; nothing; a schema
tree | Element/Attribute tree | well-formedness
token (complex) | <foo:Name t=”8”>Bob |
token (simple) | <foo:Name t=”8”>Bob |
character | < foo:Name t=”8”>Bob | which encoding (e.g., UTF-8)
bit | 10011010 |

Callouts: “Errors here mean no XML! SAX ErrorHandler”; “Yay! XPath! XSLT! Etc.”

What is an XML “Document”? (same level table as above)

Transitions between levels: validate, erase, serialise, parse

What is an XML “Document”? (same level table as above)

“Same” inputs can have different “meanings”! (external validation)

What is an XML “Document”? (same level table as above)

...we can have many... for “the same” meaning

The Essence of XML (with WXS)

• Thesis:
  – “XML is touted as an external format for representing data.”
• Two properties
  – Self-describing
    • Destroyed by external validation
  – Round-tripping
    • Destroyed by defaults and union types

http://bit.ly/essenceOfXML2

The Essence of XML (with WXS)

• Roundtripping issues
  – Internal to external and back
    • Take an element, foo, with content {“one”, “2”, 3}
    • Its (simple) type is a list of union of integer and string
    • Serialize
      – <foo>one 2 3</foo>
    • Parse and validate
      – Content is {“one”, 2, 3}
  – External to internal and back
    • “001” to 1 to “1”

http://bit.ly/essenceOfXML2
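Both round-tripping failures can be reproduced with a small Python sketch. It rests on an assumption not stated on the slide: the union lists integer before string, so a token is typed as an integer whenever it parses as one. The helper names serialize, parse_item and parse are hypothetical.

```python
def serialize(items):
    # Internal to external: a list value becomes whitespace-separated text.
    return " ".join(str(item) for item in items)


def parse_item(token):
    # Union of integer and string; assumption: integer is tried first.
    try:
        return int(token)
    except ValueError:
        return token


def parse(text):
    # External to internal: split the list, then type each item.
    return [parse_item(token) for token in text.split()]


original = ["one", "2", 3]
external = serialize(original)
roundtripped = parse(external)
print(external)
print(roundtripped)
print(roundtripped == original)

# External to internal and back: the leading zeros are lost.
print(serialize(parse("001")))
```

The string "2" comes back as the integer 2, so the internal value is not preserved, and "001" comes back as "1", so the external text is not preserved either.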

The Essence of XML (with WXS)

• Conclusion:
  – “So the essence of XML is this: the problem it solves is not hard, and it does not solve the problem well.”
• It’s not obvious
  – That the issues are serious (enough)
  – That the problem solved is all that easy
  – That there aren’t other, worse issues

http://bit.ly/essenceOfXML2

Tree Grammars


Observations/Q3:

• Documents/trees are finite structures
• A schema/grammar can describe no/finitely/infinitely many documents/trees
• For a given set of documents/trees, we can design various schemas/grammars

[Figure: truncated snippets of a cartoon DTD (<!ELEMENT cartoon (prolog, panels)>) and of XML documents referencing cartoon.dtd, plus example trees whose nodes are addressed by positions (ε, 0, 1, ...) and labelled A and B.]

N = {Book, PA, Editor, A, Paper, F, L}
Σ = {B, Name, F, L, A, P}
S = {Book, Paper}
P = { Book → B Editor|PA,
      Paper → P PA+,
      Editor → Name F,L,
      PA → Name L,A,
      F → F ε,
      L → L ε,
      A → A ε }

Remember: Tree Grammars

๏ A set of trees is called a tree language (like sets of strings are languages)
  • A tree language can be empty, finite, or infinite
๏ A tree language TS is local / single-type / regular if there exists a local / single-type / (any) tree grammar G such that L(G) = TS.
  ‣ for one TS, there can be different tree grammars accepting exactly TS…

Properties of Local and Single-Type Tree Languages

- the following observation is an immediate consequence of the definitions of local and single-type tree languages
  ★ Every local tree language is single-type, and every single-type tree language is regular.
- the next observation is a bit more tricky:
  ★ There are regular tree languages that are not single-type, and there are single-type tree languages that are not local.

Loc ⊊ ST ⊊ Reg

[Figure: diagram with the labels Loc, ST, Reg]
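The hierarchy can be explored with a small Python sketch that classifies a grammar. It assumes the usual definitions, which this section does not restate: two distinct non-terminals compete if their rules share a terminal; a grammar is local if no non-terminals compete; it is single-type if no competing non-terminals occur together in one content model or among the start symbols. The dictionary encoding and the function names are hypothetical, and the check is on grammars, not on languages.

```python
from itertools import combinations


def competing(rules, a, b):
    # Two distinct non-terminals compete if their rules share a terminal.
    return a != b and rules[a][0] == rules[b][0]


def is_local(rules):
    return not any(competing(rules, a, b) for a, b in combinations(rules, 2))


def is_single_type(rules, start):
    # Check the start symbols and each content model separately.
    groups = [start] + [content for _, content in rules.values()]
    return not any(
        competing(rules, a, b)
        for group in groups
        for a, b in combinations(sorted(group), 2)
    )


def classify(rules, start):
    if is_local(rules):
        return "local"
    if is_single_type(rules, start):
        return "single-type"
    return "regular"


# non-terminal -> (terminal, non-terminals used in the content model)
book_grammar = {
    "Book": ("B", {"Editor", "PA"}),
    "Paper": ("P", {"PA"}),
    "Editor": ("Name", {"F", "L"}),
    "PA": ("Name", {"L", "A"}),
    "F": ("F", set()),
    "L": ("L", set()),
    "A": ("A", set()),
}
# Hypothetical grammar: NA and NB compete but never share a content model.
typed_grammar = {
    "X": ("x", {"NA"}),
    "Y": ("y", {"NB"}),
    "NA": ("N", set()),
    "NB": ("N", set()),
}

print(classify(book_grammar, {"Book", "Paper"}))
print(classify(typed_grammar, {"X", "Y"}))
```

In the Book/Paper grammar, Editor and PA both carry the terminal Name and both appear in the content model of Book, so that grammar is neither local nor single-type.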

Single-Typedness and PSVIs...

• Imagine the following XML Schema was legal,

    <xs:element name="A">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="person" type="NewPersonType" minOccurs="0" maxOccurs="1"/>
          <xs:element name="person" type="OldPersonType" minOccurs="0" maxOccurs="1"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

• and you’d ask (a schema-aware XQuery processor) to return all elements of type NewPersonType
  • ...as in //element(*, NewPersonType)
• from the little document below…

    little.xml:
    <A> <person> .... </person> </A>

• the answer would depend on the PSVI constructed for little.xml
  – what is the type of /A/person?
  – NewPersonType or OldPersonType?
• To avoid such confusion/nondeterminism, UPAc ensures single-typedness, which ensures a unique PSVI

Why & when single-type matters

★ A single-type grammar can have no more than one run on a tree.
  • a run corresponds to a PSVI
    – as it labels input tree/document nodes with non-terminals/types
  • ...hence validation against a schema that corresponds to a single-type grammar results in a unique PSVI
    – (PSVI = DOM tree adorned with default values & types)
    – hence schema-aware queries know/agree on what to return!
★ A regular grammar can have more than one run on a tree.
  • ...hence validation against a schema that does not correspond to a single-type grammar may result in one of many PSVIs
    – hence schema-aware queries may differ in their answer!
✴ Use single-type schema language for schema-aware querying!

Tree Grammars: 1 more thing

• BTW, w.l.o.g., we can assume that no two production rules have the same non-terminal on the left-hand side and the same terminal. I.e., no
    N → P PA and N → P (Editor,Editor*).
  We can also rewrite those, e.g., to
    N → P (PA | (Editor,Editor*))
• ...so, how did we get here? From DTDs and XML schemas!
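The rewriting step for rules that share both non-terminal and terminal can be sketched in Python as follows. The triple encoding and the name merge_rules are hypothetical, and content models are kept as plain strings for simplicity.

```python
def merge_rules(rules):
    # rules: list of (non_terminal, terminal, content_model) triples.
    merged = {}
    for non_terminal, terminal, content in rules:
        merged.setdefault((non_terminal, terminal), []).append(content)
    result = []
    for (non_terminal, terminal), contents in merged.items():
        if len(contents) == 1:
            content = contents[0]
        else:
            # Several rules for the same pair become one alternation.
            content = "(" + " | ".join(contents) + ")"
        result.append((non_terminal, terminal, content))
    return result


rules = [
    ("N", "P", "PA"),
    ("N", "P", "(Editor,Editor*)"),
    ("PA", "Name", "(L,A)"),
]
for rule in merge_rules(rules):
    print(rule)
```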

Tree Grammars ⇆ DTDs

• since DTDs don’t have “types”, just element names, they correspond to grammars of a peculiar, simple kind:
  ★ Tree grammars for DTDs are always local...even if the DTD has a non-deterministic content model

    <!ELEMENT T (N1,N2*)>
    <!ELEMENT N1 (M|(M,M))>
    <!ELEMENT N2 (#PCDATA)>
    <!ELEMENT M (#PCDATA)>

  Callout: <!ELEMENT N1 (M|(M,M))> is not deterministic and thus illegal (but can be replaced with <!ELEMENT N1 (M,(M|ε))>)

F = (N, Σ, S, P) with
N = {T, N1, N2, M, pcdata}
Σ = {T, N1, N2, M, pcdata}
S = {T}
P = { T → T (N1,N2*),
      N1 → N1 (M|(M,M)),
      N2 → N2 pcdata,
      M → M pcdata,
      pcdata → pcdata ε }

Example tree (position: label): ε: T; 0: N1; 0,0: M; 0,0,0: pcdata; 1: N2; 1,0: pcdata

Remember?!

• in DTDs and in WXS, content models are further restricted (for compatibility with SGML)
  – [DTD] deterministic (or 1-unambiguous),
    e.g., (M|(M,M)) is not deterministic, (M,(M|ε)) is.
    e.g., ((b, c) | (b, d)) is not deterministic, b,(c|d) is.

From http://www.w3.org/TR/REC-xml/:

“As noted in 3.2.1 Element Content, it is required that content models in element type declarations be deterministic. This requirement is for compatibility with SGML (which calls deterministic content models "unambiguous"); XML processors built using SGML systems may flag non-deterministic content models as errors.

More formally: a finite state automaton may be constructed from the content model using the standard algorithms, e.g. algorithm 3.5 in section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such algorithms, a follow set is constructed for each position in the regular expression (i.e., each leaf node in the syntax tree for the regular expression); if any position has a follow set in which more than one following position is labeled with the same element type name, then the content model is in error and may be reported as an error.”
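The follow-set test described in the quoted passage can be sketched in Python. This is a minimal position-based (Glushkov-style) check written for illustration, not the exact algorithm the specification cites. The tuple encoding of content models is an assumption: ("sym", name), ("seq", r1, r2), ("alt", r1, r2), ("star", r) and ("opt", r), where ("opt", r) stands for (r|ε). Besides the follow sets, the sketch also checks the set of possible first positions.

```python
def analyse(node, symbols, follow):
    """Return (nullable, first, last) for node; fill symbols and follow."""
    kind = node[0]
    if kind == "sym":
        pos = len(symbols)
        symbols.append(node[1])
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind in ("star", "opt"):
        _, first, last = analyse(node[1], symbols, follow)
        if kind == "star":
            # After the last position of one iteration, another may begin.
            for p in last:
                follow[p] |= first
        return True, first, last
    n1, f1, l1 = analyse(node[1], symbols, follow)
    n2, f2, l2 = analyse(node[2], symbols, follow)
    if kind == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    # kind == "seq": the end of the left part is followed by the right part.
    for p in l1:
        follow[p] |= f2
    first = f1 | (f2 if n1 else set())
    last = l2 | (l1 if n2 else set())
    return n1 and n2, first, last


def is_deterministic(model):
    symbols, follow = [], {}
    _, first, _ = analyse(model, symbols, follow)
    for positions in [first, *follow.values()]:
        names = [symbols[p] for p in positions]
        if len(names) != len(set(names)):
            return False  # two competing positions carry the same name
    return True


M = ("sym", "M")
m_or_mm = ("alt", M, ("seq", M, M))   # (M|(M,M))
m_then_opt_m = ("seq", M, ("opt", M))  # (M,(M|ε))
print(is_deterministic(m_or_mm))
print(is_deterministic(m_then_opt_m))
```

Each occurrence of a name in the expression is a separate position, which is why (M|(M,M)) fails: two different positions labelled M are both possible at the start.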

Tree Grammars and DTDs

• so, DTDs are local (and thus single-type) because they don’t have any types at all
  – and not because their content model is deterministic!
  – they are single-type even with non-deterministic content model
• hence we could extend DTDs with types and still be single-type...provided we impose suitable restrictions

Tree Grammars ⇆ WXS

• tree grammars also capture the basic, structural part of WXS:
  ✓ types (complex and anonymous)
  ‣ model groups (we ignore them)
  ‣ derivation by extension and restriction (we ignore them)
  ‣ substitution groups (we ignore them)
  ‣ integrity constraints like keys (must be ignored, don’t fit into tree grammars)
• we only deal with simple XML schemas, but the general approach works for more

Tree Grammars ⇆ WXS

• one stupid problem with this: in XSD, we can have
  – named types, e.g., <xs:complexType name="BBlist">
  – unnamed types, e.g., <xs:element name="mylist">...
• ...hence we invent a lot of type names for unnamed types,
  – e.g., MYLIST for mylist
• we use a two-stage approach:
• to transform an XML schema S into a tree grammar G,
  1. we translate S into a generalized tree grammar
  2. then flatten the generalized tree grammar into a tree grammar G
• this will be done such that T validates against S iff T is accepted by G.

Translating WXS into Tree Grammars

• take a simple XML Schema S and translate it into grammar G(S):
➡ for each top-level element in S of the form
  – <xs:element name="mylist" type="Blist"></xs:element>
  • add the following production rule to G(S):
    – MYLIST → mylist BLIST^TYPE
    – add MYLIST, BLIST^TYPE to non-terminals, add mylist to terminals
➡ for each top-level element in S of the form
  – <xs:element name="mylist">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="ename" type="Comp" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  • add the following production rules to G(S):
    – MYLIST → mylist ENAME,ENAME*
    – ENAME → ename COMP^TYPE

Callout: what is the default for minOccurs?

Translating WXS into Tree Grammars

➡ for each top-level element in S of the form
  – <xs:complexType name="Blist">
      <xs:sequence>
        <xs:element name="friend" type='Person' minOccurs='1' maxOccurs='2'/>
      </xs:sequence>
    </xs:complexType>
  • add the following production rules to G(S):
    – BLIST^TYPE → (FRIEND | (FRIEND,FRIEND))    %% generalized rule: to be expanded!
    – FRIEND → friend PERSON^TYPE
    – add BLIST^TYPE, FRIEND, PERSON^TYPE to non-terminals, add friend to terminals

Translating WXS into Tree Grammars

➡ for each top-level element in S of the form
  - <xs:complexType name="BBlist">
      <xs:choice>
        <xs:sequence>
          <xs:element name="A" type="xs:string"/>
          <xs:element name="B" type="xs:string"/>
        </xs:sequence>
        <xs:sequence>
          <xs:element name="A" type="xs:string"/>
          <xs:element name="C" type="xs:string"/>
        </xs:sequence>
      </xs:choice>
    </xs:complexType>
    %% UPA - violation:
    %% Oxygen complains!
  • add the following production rules to G(S):
    – BBLIST^TYPE → (A,B) | (A,C)    %% generalized rule -- to be expanded!
    – A → A STRING^TYPE
    – B → B STRING^TYPE
    – C → C STRING^TYPE
    – add BBLIST^TYPE, A, B, C, STRING^TYPE to non-terminals, add A, B, C to terminals

Translating WXS into Tree Grammars
• Consider the following case:
    <xs:complexType name="AT">
      <xs:sequence>
        <xs:element name="N" type="Alist" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>

    <xs:complexType name="BT">
      <xs:sequence>
        <xs:element name="N" type="Blist" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
• To handle cases like the one above we can’t always add rules
    – AT^TYPE → N*, BT^TYPE → N*
    – N → N ??LIST^TYPE
• Instead, we translate these as
    – AT^TYPE → N^AS^ALIST^TYPE*
    – BT^TYPE → N^AS^BLIST^TYPE*
    – N^AS^ALIST^TYPE → N ALIST^TYPE
    – N^AS^BLIST^TYPE → N BLIST^TYPE

Translating WXS into Tree Grammars
Our translation yields almost a tree grammar:
• it produces illegal rules of the form X → e, i.e., without a terminal
    – e.g., BLIST^TYPE → (FRIEND | (FRIEND,FRIEND))
• our grammar model doesn’t handle those (check definition of a run)
๏ hence we expand these illegal rules:
• e.g., MYLIST → mylist BLIST^TYPE would be transformed into
    – MYLIST → mylist (FRIEND | (FRIEND,FRIEND))
• ...and if we had <xs:element name="yourlist" type="Blist"/> then we would also have
    – YOURLIST → yourlist BLIST^TYPE and thus
    – YOURLIST → yourlist (FRIEND | (FRIEND,FRIEND))

repeat
    pick illegal rule X → e:
    – remove X → e from rule set
    – replace all occurrences of X in rule set with e
until no illegal rules are left in rule set
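The expansion loop above can be sketched in Python as follows. The representation is an assumption made for illustration only: the rule set is a dict from non-terminal to a pair (terminal, body), an illegal rule X → e has `None` as its terminal, and the hypothetical function `expand_illegal_rules` assumes that no illegal rule mentions its own left-hand side.

```python
import re


def expand_illegal_rules(rules):
    """Expand rules X → e (terminal is None) until none are left.

    rules maps a non-terminal to (terminal, body), with body a string
    over non-terminals. Returns a new dict; the input is not modified.
    """
    rules = dict(rules)
    while True:
        illegal = [x for x, (terminal, _) in rules.items() if terminal is None]
        if not illegal:
            return rules
        x = illegal[0]
        _, e = rules.pop(x)  # remove X → e from the rule set
        # match X only as a whole name, not as part of a longer name
        whole_name = re.compile(r"(?<![\w^-])" + re.escape(x) + r"(?![\w^-])")
        rules = {
            lhs: (terminal, whole_name.sub(lambda _m: e, body))
            for lhs, (terminal, body) in rules.items()
        }


example = {
    "MYLIST": ("mylist", "BLIST^TYPE"),
    "YOURLIST": ("yourlist", "BLIST^TYPE"),
    "BLIST^TYPE": (None, "(FRIEND | (FRIEND,FRIEND))"),
    "FRIEND": ("friend", "PERSON^TYPE"),
}
for lhs, (terminal, body) in expand_illegal_rules(example).items():
    print(lhs, "→", terminal, body)
```

Each pass removes one illegal rule and never creates a new one, so the loop stops after at most as many passes as there are illegal rules.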


Translating WXS into Tree Grammars
• Expanding illegal rules even works with cyclic type definitions - try
    <xs:complexType name="NType">
      <xs:choice>
        <xs:element name="test2" type="AType"/>
        <xs:element name="EndElement" type="xs:string"/>
      </xs:choice>
    </xs:complexType>

    <xs:complexType name="AType">
      <xs:choice>
        <xs:element name="test1" type="NType"/>
        <xs:element name="EndElement" type="xs:string"/>
      </xs:choice>
    </xs:complexType>
• This gives you these rules, including 2 illegal rules
    NType^TYPE → (TEST2 | ENDELEMENT)
    TEST2 → test2 AType^TYPE
    ENDELEMENT → EndElement STRING^TYPE
    ...

    AType^TYPE → (TEST1 | ENDELEMENT)
    TEST1 → test1 NType^TYPE
    ENDELEMENT → EndElement STRING^TYPE
    ...
• that can be expanded as follows:
    TEST2 → test2 (TEST1 | ENDELEMENT)
    ENDELEMENT → EndElement STRING^TYPE
    ...

    TEST1 → test1 (TEST2 | ENDELEMENT)
    ENDELEMENT → EndElement STRING^TYPE
    ...

Translating WXS into Tree Grammars
• So, to transform an XML schema S into a tree grammar G,
    1. we translate S into a generalized tree grammar G’
    2. then expand G’ into a tree grammar G
★ Then any tree T validates against S iff T is accepted by G.
• So, what are the tree grammars we get as results?
    – they are tree grammars
    – are they single-type?
    – are they local?
★ Tree grammars corresponding to WXS are not local.
• E.g., consider
    – N^AS^ALIST^TYPE → N ALIST^TYPE
    – N^AS^BLIST^TYPE → N BLIST^TYPE
    – .. N^AS^ALIST^TYPE and N^AS^BLIST^TYPE are competing!
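To make "competing" concrete, here is a small Python sketch that classifies an already expanded tree grammar. It assumes the usual definitions: two different non-terminals compete if their rules use the same terminal; a grammar is local if no non-terminals compete, and single-type if competing non-terminals never occur together in one content model or among the start symbols. The representation (a dict from non-terminal to terminal plus the set of non-terminals in its content model), the function names, and the example names A, B, C are illustrative assumptions.

```python
from itertools import combinations


def competing_pairs(rules):
    """Pairs of different non-terminals whose rules share a terminal."""
    return {
        frozenset((a, b))
        for a, b in combinations(rules, 2)
        if rules[a][0] == rules[b][0]
    }


def classify(rules, start):
    """Return 'local', 'single-type' or 'regular' for a tree grammar.

    rules: non-terminal -> (terminal, set of non-terminals in its content model)
    start: set of start symbols
    """
    pairs = competing_pairs(rules)
    if not pairs:
        return "local"
    scopes = [set(start)] + [set(body) for _, body in rules.values()]
    if any(pair <= scope for pair in pairs for scope in scopes):
        return "regular"  # competing non-terminals meet in one content model
    return "single-type"


# The two N rules from the slide, placed in hypothetical parents A and B.
rules = {
    "A": ("a", {"N^AS^ALIST^TYPE"}),
    "B": ("b", {"N^AS^BLIST^TYPE"}),
    "N^AS^ALIST^TYPE": ("N", set()),
    "N^AS^BLIST^TYPE": ("N", set()),
}
print(classify(rules, {"A", "B"}))
```

With the two N rules kept in different parents the grammar is not local but still single-type; putting both into one content model would make it merely regular.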

[Diagram: nested grammar classes Loc ⊂ ST ⊂ Reg]

Translating WXS into Tree Grammars
★ Tree grammars corresponding to WXS are single-type.
    – This is ensured by the Unique Particle Attribution constraint in WXS.
• Tree grammars corresponding to DTDs are local, ….hence
★ DTDs are less expressive than XML schemata.
• That is, there are tree languages that we can describe in WXS, but not in DTDs, e.g.,
    N = {Book, PA, Editor, A, Paper, F, L}
    Σ = {B,N,A,P,C}
    S = {Book, Paper}
    P = { Book → B Editor|PA,
          Paper → P PA,
          Editor → N F,L,
          PA → N L,A,
          F → F ε,
          L → L ε,
          A → A ε }

[Figure: two example trees with node positions ε, 0, 0,0 and 0,1; one has root B, the other root P, and each root has a single child N with two leaf children]


Remember:
• In XML Schema, content model is constrained as well
    – to make validation easier & for compatibility with SGML
    – e.g., through the Unique Particle Attribution constraint:

A content model must be formed such that during validation of an element information item sequence, the particle component contained directly, indirectly or implicitly therein with which to attempt to validate each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.
http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/#cos-nonambig

Rephrasing: a content model M must be formed such that, during validation of an element E’s child node sequence E1...Ek, we can, starting from i = 1 and increasing, associate each Ei with a single particle contained (possibly implicitly) in M without examining the content or attributes of Ei, and without any information about any Ej with j > i.


Translating WXS into Tree Grammars
★ Tree grammars corresponding to WXS are single-type.
    – This is ensured by the Unique Particle Attribution constraint in WXS.
• We know: validation against a schema that corresponds to a single-type grammar results in a unique PSVI
    – (PSVI = DOM tree adorned with default values & types)
    – hence schema-aware queries know/agree on what to return!
    – hence WXS can be used for schema-aware querying!

Using more than 1 schema:

[Diagram: an XML doc. is validated by a schema-aware parser for a rich schema language (e.g. RelaxNG) against rich Schema 1, and by a schema-aware parser for a single-type (s-t) schema language (e.g. XSD) against single-type Schema 2; if it doesn’t validate, an ErrorHandler takes over; if it validates, the PSVI (tree adorned with default values & types) is passed, together with a query or other input, to your application / a schema-aware query processor, which returns the query answer]

Content models & types in DTD & WXS
• (we already know that) in WXS, we have a type hierarchy
    – an element of a type X derived by restriction or extension from Y can be used in place of an element of type Y
• but you have to say so explicitly:
    – we call this ‘named’ typing:
        • sub-types are declared (restriction or extension), and not inferred (by comparing structure)

    <person phone="2">
      <Name>Peter</Name>
      <DoB>1966-05-04</DoB>
    </person>
    <person xsi:type="LongPersonType" phone="5432">
      <Name>Paul</Name>
      <DoB>1967-05-04</DoB>
      <address>Manchester</address>
    </person>

    – in DTDs, we don’t have types!
• In order to prevent difficulties in WXS caused by types, the Element Declarations Consistent constraint is imposed; it rules out, e.g.:

    <xs:complexType>
      <xs:sequence>
        <xs:element name="person" type="NewPersonType" minOccurs="0" maxOccurs="1"/>
        <xs:element name="person" type="OldPersonType" minOccurs="0" maxOccurs="1"/>
      </xs:sequence>
    </xs:complexType>

Summary
• So far, we have seen how to translate schema languages into tree grammars: we saw that
    – each DTD can be faithfully translated into a local tree grammar, and therefore into a single-type one
        • hence each DTD corresponds to a single-type grammar
        • hence there is exactly 1 PSVI for each document that validates against a DTD
    – each XML schema can be faithfully translated into a single-type tree grammar,
        • hence there is exactly 1 PSVI for each document that validates against an XML schema
• ...we also saw that parts of the UPA constraint help to generate the PSVI: do we need other parts?

Relax NG, a very powerful schema language


Relax NG: yet another schema language
• Relax NG was designed to be a simpler schema language
• (described in a readable on-line book by Eric Van der Vlist)
• and allows us to describe XML documents in terms of their tree abstractions:
    – no default attributes
    – no entity declarations
    – no key/uniqueness constraints
    – minimal datatypes: only “token” and “string” like DTDs (but a mechanism to use XSD datatypes)
• since it is so simple/flexible
    – it’s (claimed to be) easy to use
    – it doesn’t have complex constraints on description of element content like determinism/1-unambiguity
    – it’s claimed to be reliable
    – but you need other tools to do other things (like datatypes and attributes)

Relax NG: another side of Determinism
• remember that DTDs and WXS required their content models to be
    – [DTD] deterministic (and thus look-ahead-free)
    – [WXS] deterministic (EDC, every matching child node sequence matches in exactly one way only)
    – [WXS] the UPA constraint expresses both of these, and further constraints
• determinism & single-typeness have a reason:
    – some tools annotate a (valid) document while parsing:
        • type information -- to be exploited, e.g., for concise queries (remember assignment?)
        • default attribute values
    – if your schema is not single-type, then
        • tools validating the same document against the same schema may construct different PSVIs
        • this can happen with different tools or different runs of the same tool

Relax NG: another side of Validation
Reasons why one would want to validate an XML document:
• ensure that structure is ok
• ensure that values in elements/attributes are of the correct type
• generate PSVI to work with
• check constraints on co-occurrence of elements/how they are related
• check other integrity constraints, e.g., a person’s age vs. their mother’s age
• check constraints on elements/their value against external data
    – postcode correctness
    – VAT/tax/other numeric constraints
    – spell checking

...only a few of these checks can be carried out by validating against schemas...

Relax NG was designed to
    1. validate structure and
    2. link to datatype validators to type check values of elements/attributes

Relax NG: basic principles
• both DTDs and XSD allow the user to describe documents
    – by descriptions of their elements and attributes, e.g., an element “person” must have two element child nodes, name and address, and ....
• Relax NG is based on patterns (similar to XPath expressions):
    – a pattern is a description of a set of valid node sets
    – we can view our example as different combinations of different parts, and design patterns for each
    – enhanced flexibility

    <?xml version="1.0" encoding="UTF-8"?>
    <people>
      <person age="41">
        <name>
          <first>Harry</first>
          <last>Potter</last>
        </name>
        <address>4 Main Road</address>
        <project type="epsrc" id="1">DeCompO</project>
        <project type="eu" id="3">TONES</project>
      </person>
      <person>....
    </people>

Relax NG: good to know
Relax NG comes in 2 syntaxes
• the compact syntax
    – succinct
    – human readable

    grammar {
      start = element name {
        element first { text },
        element last { text } }
    }

• the XML syntax
    – verbose
    – machine readable

    <grammar xmlns="http:..." xmlns:a="http:.." datatypeLibrary="http:...">
      <start>
        <element name="name">
          <element name="first"><text/></element>
          <element name="last"><text/></element>
        </element>
      </start>
    </grammar>

Trang converts between the two, phew! (and also into/from other schema languages)
Trang can be used from Oxygen

Relax NG - structure validation:
• 3 kinds of patterns, for the 3 “central” nodes:
    – text
        <text/>
    – attribute
        <attribute name="age"/>
        <attribute name="type"/>
    – element
        <element name="name">
          <element name="first"><text/></element>
          <element name="last"><text/></element>
        </element>
      or, in compact syntax:
        element name { element first { text }, element last { text } }
• these can be combined
    – ordered groups
    – unordered groups
    – choices
• we can constrain cardinalities of patterns
• text nodes
    – can be marked as “data” and linked
• we can specify libraries of patterns

Relax NG - structure validation: ordered groups
• we can name patterns
• in strange “chains”
• we can use ?, *, and +:

    grammar {
      start = people-element
      people-element = element people { person-element+ }
      person-element = element person {
        attribute age { text },
        name-element,
        address-element+,
        project-element* }
      name-element = element name {
        element first { text },
        element middle { text }?,
        element last { text } }
      address-element = element address { text }
      project-element = element project {
        attribute type { text },
        attribute id { text },
        text }
    }

%% use “?” if optional

Relax NG - structure validation: ordered groups in XML syntax (Trang knows…)

    <?xml version="1.0" encoding="UTF-8"?>
    <grammar xmlns="http://relaxng.org/ns/structure/1.0">
      <start>
        <element name="people"><ref name="people-content"/></element>
      </start>
      <define name="people-content">
        <oneOrMore>
          <element name="person"><ref name="person-content"/></element>
        </oneOrMore>
      </define>
      <define name="person-content">
        <attribute name="age"/>
        <element name="name"><ref name="name-content"/></element>
        <oneOrMore>
          <element name="address"><text/></element>
        </oneOrMore>
        <zeroOrMore>
          <element name="project"><ref name="project-content"/></element>
        </zeroOrMore>
      </define>
      <define name="name-content">
        <element name="first"><text/></element>
        <optional><element name="middle"><text/></element></optional>
        <element name="last"><text/></element>
      </define>
      <define name="project-content">
        <attribute name="type"/><attribute name="id"/><text/>
      </define>
    </grammar>

Relax NG - structure validation: different styles
• so far, we modelled ‘element centric’...

    grammar {
      start = people-element
      people-element = element people { person-element+ }
      person-element = element person {
        attribute age { text },
        name-element,
        address-element+,
        project-element* }
      name-element = element name {
        element first { text },
        element middle { text }?,
        element last { text } }
      address-element = element address { text }
      project-element = element project {
        attribute type { text },
        attribute id { text },
        text }
    }

• ...we can model ‘content centric’:

    grammar {
      start = element people { people-content }
      people-content = element person { person-content }+
      person-content = attribute age { text },
        element name { name-content },
        element address { text }+,
        element project { project-content }*
      name-content = element first { text },
        element middle { text }?,
        element last { text }
      project-content = attribute type { text },
        attribute id { text },
        text
    }

Relax NG - structure validation: ordered groups
• we can combine patterns in fancy ways:

    grammar {
      start = element people { people-content }
      people-content = element person { person-content }+
      person-content = HR-stuff, contact-stuff
      HR-stuff = attribute age { text }, project-content
      contact-stuff = attribute phone { text },
        element name { name-content },
        element address { text }
      name-content = element first { text },
        element middle { text }?,
        element last { text }
      project-content = element project {
        attribute type { text },
        attribute id { text },
        text }+
    }

Relax NG: structure validation summary
• Relax NG’s specification of structure differs from DTDs and XSD:
    – grammar oriented
    – 2 syntaxes with automatic translation
    – flexible: we can gather different aspects of elements into different patterns
    – unconstrained: no constraints regarding unambiguity/1-unambiguity/deterministic content model/Unique Particle Attribution/Element Declarations Consistent
    – as in XSD, we have an “ALL” construct for unordered groups, “interleave” &:

here, the patterns must appear in the specified order (except for attributes, which are allowed to appear in any order in the start tag):

    element person {
      attribute age { text },
      attribute phone { text },
      name-element,
      address-element+,
      project-element* }

here, the patterns can appear in any order:

    element person {
      attribute age { text } &
      attribute phone { text } &
      name-element &
      address-element+ &
      project-element* }

Translating Relax NG into tree grammars by example 1

    grammar {
      start = AddressBook
      AddressBook = element addressBook { Card* }
      Card = element card { Inline }
      Inline = Name, Email+
      Name = element name { text }
      Email = element email { text }
    }

Translate into G = (N, Σ, S, P) with
    N = {AddressBook, Card, Inline, Name, Email, Pcdata}
    Σ = {addressBook, card, name, email, pcdata}
    S = {AddressBook}
    P = { AddressBook → addressBook Card*,
          Card → card Inline,
          Inline → Name, Email+,
          Name → name Pcdata,
          Email → email Pcdata,
          Pcdata → pcdata ϵ }

%% “element y” ➟ y ∈ Σ ...possibly also “uppercased copy” ➟ Y ∈ N
%% all other user defined symbols X ➟ X ∈ N
%% ...translating Relax NG rules is easy (depending on Relax NG style)

• ...let’s see one more

Translating Relax NG into tree grammars by example 2

    grammar {
      start = p-el
      p-el = element people { per-el+ }
      per-el = element person {
        attribute age { text },
        na-el,
        ad-el+,
        pro-el* }
      na-el = element name {
        element first { text },
        element middle { text }?,
        element last { text } }
      ad-el = element address { text }
      pro-el = element project {
        attribute type { text },
        attribute id { text },
        text }
    }

%% Ignore! (the attribute patterns are not translated)

Translate into G = (N, Σ, S, P) with
    N = {P-EL, PER-EL, NA-EL, AD-EL, PRO-EL, FIRST, MIDDLE, LAST, Pcdata}
    Σ = {people, person, name, first, middle, last, address, project, pcdata}
    S = {P-EL}
    P = { P-EL → people PER-EL, PER-EL*,
          PER-EL → person NA-EL, AD-EL, AD-EL*, PRO-EL*,
          NA-EL → name FIRST, (MIDDLE|ε), LAST,
          FIRST → first Pcdata,
          MIDDLE → middle Pcdata,
          LAST → last Pcdata,
          AD-EL → address Pcdata,
          PRO-EL → project Pcdata,
          Pcdata → pcdata ϵ }

This Relax NG style makes translation of rules easy

Translating Relax NG into tree grammars by example 3

    grammar {
      start = element people { people-content }
      people-content = element person { person-content }+
      person-content = attribute age { text },
        element name { name-content },
        element address { text }+,
        element project { project-content }*
      name-content = element first { text },
        element middle { text }?,
        element last { text }
      project-content = attribute type { text },
        attribute id { text },
        text
    }

Translate into G = (N, Σ, S, P) with
    N = {PEOPLE, P-C, PER-C, NA, NA-C, PERSON, PRO-C, ADR, PROJ, FIRST, MIDDLE, LAST, Pcdata}
    Σ = {people, person, name, first, middle, last, address, project, pcdata}
    S = {PEOPLE}
    P = { PEOPLE → people P-C,
          P-C → PERSON, PERSON*,
          PERSON → person PER-C,
          PER-C → NA, ADR, ADR*, PROJ*,
          NA → name NA-C,
          ADR → address Pcdata,
          PROJ → project PRO-C,
          PRO-C → pcdata ϵ,
          NA-C → FIRST, (MIDDLE|ϵ), LAST,
          FIRST → first Pcdata,
          MIDDLE → middle Pcdata,
          LAST → last Pcdata,
          Pcdata → pcdata ϵ }

%% expand! (rules without a terminal, such as P-C and PER-C, are generalized rules)

This Relax NG style makes translation of rules less easy… and leads to generalized rules!

Translating Relax NG into tree grammars by example 3
Two things we have already seen when translating WXS:
• “generalized” rules -- which can & need to be expanded, as for WXS:

    ...
    people-content = element person { person-content }+
    .....
    person-content = attribute age { text },
      element name { name-content },
      element address { text }+,
      element project { project-content }*

    ...
    PERSON → person PER-C,
    PER-C → NA, ADR, ADR*, PROJ*,      %% expand!
    NA → name NA-C,
    ADR → address Pcdata,
    ...

    for each illegal rule X → e:
    – remove X → e from rule set
    – replace all occurrences of X in rule set with e

• we might have to “contextualise” names and types of elements: ...

Translating Relax NG into tree grammars by example 4
2. we might have to “contextualise” names and types of elements, to handle schemas where the same element name is used in different contexts with different types:

    ...
    people-content = element person { person-content }+,
      element friend { friend-content }+
    .....
    person-content = attribute age { text }, element name { name-content }, ...
    friend-content = attribute age { text }, element name { friend-name-content }, ...

    ...
    P-C → PERSON, PERSON*, FRIEND, FRIEND*
    PERSON → person PER-C,
    FRIEND → friend FRIE-C,
    PER-C → NA^NA-C, ...
    FRIE-C → NA^FRIE-NA-C, ...
    NA^NA-C → name NA-C,
    NA^FRIE-NA-C → name FRIE-NA-C,
    ...

Translating Relax NG into tree grammars
• each Relax NG schema can be faithfully translated into a tree grammar:
    – local? no: example on previous slide leads to competing non-terminals (NA^NA-C and NA^FRIE-NA-C)

        ...
        NA^NA-C → name NA-C,
        NA^FRIE-NA-C → name FRIE-NA-C,
        ...

    – single-type? no: see example below; NA^NA-C and NA^FO-NA-C compete and occur in the same RHS

        ...
        person-content = attribute age { text },
          element name { name-content } | element name { foreign-name-content }, ...

        ...
        PER-C → NA^NA-C | NA^FO-NA-C
        NA^NA-C → name NA-C,
        NA^FO-NA-C → name FO-NA-C,
        ...

    – so is Relax NG as powerful as tree grammars?

Relax NG schema is indeed as powerful as tree grammars
★ Every tree grammar can be faithfully translated into a Relax NG schema.
• Proof (not too hard): given a tree grammar G = (N, Σ, S, P),
    1. translate each production rule N → t regexp in P into
           N = element t { regexp }
       (fortunately, the tree grammar regular expression syntax is very close to, and stricter than, Relax NG regular expression syntax)
    2. Put the resulting statements into a grammar, where N1 , ... , Nk are all start symbols, i.e., S = {N1 , ... , Nk}
           grammar {start = N1 | ... | Nk ..... }
    3. Call the resulting schema GS
★ Then T ∈ L(G) if and only if T validates against GS.
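The three proof steps can be mimicked by a small string generator. This is a sketch only: the function name `to_relax_ng` is hypothetical, non-terminal names are assumed to be valid Relax NG compact-syntax identifiers (so no `^`), each regexp is assumed to be written in Relax NG-compatible syntax already, and an empty regexp is emitted as the Relax NG pattern `empty`.

```python
def to_relax_ng(start, productions):
    """Emit a Relax NG compact-syntax grammar for a tree grammar.

    start:       list of start symbols N1, ..., Nk
    productions: list of (non_terminal, terminal, regexp) for rules N → t regexp
    """
    lines = ["grammar {", "  start = " + " | ".join(start)]
    for non_terminal, terminal, regexp in productions:
        body = regexp if regexp else "empty"
        # step 1: N → t regexp becomes N = element t { regexp }
        lines.append(f"  {non_terminal} = element {terminal} {{ {body} }}")
    lines.append("}")
    return "\n".join(lines)


print(to_relax_ng(
    ["AddressBook"],
    [
        ("AddressBook", "addressBook", "Card*"),
        ("Card", "card", "Name, Email+"),
        ("Name", "name", "text"),
        ("Email", "email", "text"),
    ],
))
```

Here the leaf rules use `text` instead of the Pcdata non-terminal of the slides, which is a simplification for readability.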

Tree Grammars and Schema Languages
• Harvest Time!

[Diagram: with our knowledge, the nested grammar classes Loc ⊂ ST ⊂ Reg correspond to DTD, WXS and Relax NG respectively]

• but then, isn’t validation of an XML document against a Relax NG schema really complicated and complex (i.e., space and/or time consuming)?
• perhaps it’s even undecidable or intractable?

How costly is validity testing?
…Does it matter against which kind of schema?
…Is Single-Type cheaper than general?

How costly is schema validation?

[Diagram: an XML doc. is validated by a schema-aware parser for a rich schema language (e.g. RelaxNG) against rich Schema 1, and by a schema-aware parser for a single-type (s-t) schema language (e.g. XSD) against single-type Schema 2; if it doesn’t validate, an ErrorHandler takes over; if it validates, the PSVI (tree adorned with default values & types) is passed, together with a query or other input, to your application / a schema-aware query processor, which returns the query answer]

Schema Languages and Tree Grammars
• We have learned about a third, flexible, liberal schema language, Relax NG
    – how to translate Relax NG schemas into tree grammars
    ➡ more liberal than single-type/XSD
• Now, we will look at:
    – the problem of
    – algorithms for
  validating a document against a schema!

algorithm: input Tree T and Grammar G; output “yes”, if T ∈ L(G), and “no”, otherwise

See the paper by Murata, Lee, Mani, Kawaguchi

• To design our “schema validator”,1. we start with the easy case: assume that G is local

(this gives us automatically a validator for structural aspect of DTDs)2. then expand algorithm to single-type

(this gives us automatically a validator for structural aspect of WXS)3. then expand to general tree grammars (...Relax NG)

– we also assume that we have a subroutine

– ...if time permits, we will see later how to build that one (it’s based on a translation of regular expressions into finite state machines (aka automata), otherwise

• remember your undergraduate studies (?)• read it up, e.g., in the textbook by Hopcroft, Ullman

75

ValAlgoTree TGrammar G

“yes”, if T ∈ L(G)

“no”, otherwise

MatchAlgoString wregular expression e

“yes”, if w ∈ L(e), (w matches e)

“no”, otherwise
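The MatchAlgo subroutine can be approximated with an off-the-shelf regular expression engine. The sketch below is only an illustration: the function name `match_algo` is made up here, it assumes single-character non-terminal names, and it relies on Python's backtracking `re` engine rather than the finite-state-machine construction mentioned on the slide.

```python
import re


def match_algo(w: str, e: str) -> bool:
    """Return True iff the string of non-terminals w is in L(e).

    Assumption: non-terminals are single characters and the content model e
    uses the slide notation (',' for sequence, '|' for choice, '*' for
    repetition, 'ϵ' for the empty string).
    """
    # Translate slide notation into Python regex syntax:
    # ',' becomes plain concatenation, 'ϵ' becomes an empty group.
    pattern = e.replace(" ", "").replace(",", "").replace("ϵ", "()")
    # fullmatch requires the whole of w to match, as w ∈ L(e) demands.
    return re.fullmatch(pattern, w) is not None
```

With this translation, `match_algo("BB", "B,B*")` holds while `match_algo("", "B,B*")` does not.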

ValAlgo (local case)
  Input: XML doc/DOM tree for T, local tree grammar G = (N, Σ, S, P)
  Output: “yes”, if T ∈ L(G); “no”, otherwise

NT is a stack of strings of non-terminals (to store the NTs of child nodes)
R is a stack of production rules

Traverse T in a depth-first, left-to-right manner
When an element E is visited on the way down,
  if there is a production rule N → a e in P with a = E’s tag name (by locality, there is at most one such rule)
  then push N → a e onto R (store the rule for E’s content)
       and push ϵ onto NT (start remembering E’s child nodes)
  else report “not accepted” and stop
When an element E is visited on the way up,
  pop a rule N → a e out of R (retrieve the rule for E’s content)
  pop a string of non-terminals w out of NT (retrieve E’s child nodes)
  if w matches e
  then pop a string w’ of non-terminals out of NT and push w’N onto NT (add E’s non-terminal to those of its preceding siblings)
       (if E is the root, NT is empty at this point: instead, check that N ∈ S)
  else report “not accepted” and stop
report “accepted” and stop

See the paper by Murata, Lee, Mani, Kawaguchi
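A minimal Python sketch of the local-case algorithm, assuming the grammar is encoded as a dictionary from tag name to (non-terminal, content model). That encoding, the names `RULES`, `START` and `validate`, and the use of Python regular expressions for the matching step are choices made for this illustration, not part of the slides. Text nodes and attributes are ignored, so only the structural aspect is checked.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical encoding of a local tree grammar: tag -> (non-terminal, content
# model as a Python regex over single-character non-terminals).
# Locality means at most one rule per tag, so a dict keyed by tag suffices.
RULES = {"a": ("S", "BB*"), "b": ("B", "(CC)|C"), "c": ("C", "()|C")}
START = {"S"}


def validate(root, rules=RULES, start=START):
    R, NT = [], []  # stack of rules, stack of strings of non-terminals

    def visit(elem):
        # Way down: look up the unique rule for this tag.
        if elem.tag not in rules:
            return False
        R.append(rules[elem.tag])
        NT.append("")  # start remembering this element's children
        for child in elem:  # depth-first, left-to-right
            if not visit(child):
                return False
        # Way up: check the children's non-terminals against the content model.
        n, e = R.pop()
        w = NT.pop()
        if re.fullmatch(e, w) is None:
            return False
        if NT:
            NT.append(NT.pop() + n)  # append N to the preceding siblings
            return True
        return n in start  # root: its non-terminal must be a start symbol

    return visit(root)
```

For example, `validate(ET.fromstring("<a><b><c/><c/></b><b><c/></b></a>"))` accepts, while a `b` element with three `c` children is rejected.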

Stacks and tree traversal, observations
• our algorithm visits a tree in a depth-first, left-to-right manner
• whenever we visit a node on our way
  – down, we push relevant information for this node onto the stacks
  – up, we pop relevant information for this node from the stacks
• hence, whenever we are at a node n during this traversal, all relevant information regarding all ancestors of n is (in reverse order) on our stacks
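The observation about stacks and ancestors can be checked with a small experiment. The sketch below (the function name `ancestor_paths` is hypothetical) pushes a tag on every start event and pops it on every end event, and records the stack contents each time a node is entered. The recorded list is written root first, so the top of the stack is its last entry.

```python
import io
import xml.etree.ElementTree as ET


def ancestor_paths(xml_text):
    """Return (tag, stack contents on entry) for every element, in document order."""
    stack, seen = [], []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # On the way down: the stack holds exactly the ancestors of elem.
            seen.append((elem.tag, list(stack)))
            stack.append(elem.tag)
        else:
            # On the way up: remove this element's entry again.
            stack.pop()
    return seen
```

For the document `<a><b><c/></b></a>`, the entry recorded for `c` lists `a` and `b`, which are exactly its ancestors.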

• Let’s see how the algorithm works:
  – G = ({S,B,C},{a,b,c},{S},P) with
    P = { S → a B,B*, B → b (C,C)|C, C → c ϵ|C}
  – T is the tree with root a; a has two b children; the first b has two c children, the second b has one c child

[Figure: the tree T, i.e. a(b(c, c), b(c)), next to the two stacks R (stack of rules) and NT (stack of NT strings), both initially empty.]

ValAlgo
  Input: XML doc/Tree T, local Grammar G
  Output: “yes”, if T ∈ L(G); “no”, otherwise
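Before running the local-case validator on this grammar, one can check that G really is local, i.e. that no two different non-terminals have rules with the same terminal (no competing non-terminals). The following sketch assumes a made-up encoding of P as (non-terminal, terminal, content model) triples; `P` and `is_local` are names chosen for this illustration.

```python
# The example grammar's production rules as (non-terminal, terminal, content model).
P = [("S", "a", "B,B*"), ("B", "b", "(C,C)|C"), ("C", "c", "ϵ|C")]


def is_local(productions):
    """True iff no two different non-terminals compete for the same terminal."""
    competing = {}
    for nonterminal, terminal, _content in productions:
        competing.setdefault(terminal, set()).add(nonterminal)
    return all(len(nts) == 1 for nts in competing.values())
```

Adding a second rule for terminal b with a different non-terminal, e.g. `("D", "b", "ϵ")`, would make `is_local` return False, and the simple lookup by tag name would no longer be enough.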

• Way down at the root a: push S → a B,B* onto R and ϵ onto NT
    R  (top first): S → a B,B*
    NT (top first): ϵ

• Way down at the first b: push B → b (C,C)|C onto R and ϵ onto NT
    R  (top first): B → b (C,C)|C ; S → a B,B*
    NT (top first): ϵ ; ϵ

• Way down at the first c (below the first b): push C → c ϵ|C onto R and ϵ onto NT
    R  (top first): C → c ϵ|C ; B → b (C,C)|C ; S → a B,B*
    NT (top first): ϵ ; ϵ ; ϵ

• Way up at the first c: pop C → c ϵ|C out of R and w = ϵ out of NT; yes, ϵ ∈ L(ϵ|C)
    R  (top first): B → b (C,C)|C ; S → a B,B*
    NT (top first): ϵ ; ϵ

• Still on the way up at the first c: pop w’ = ϵ out of NT and push w’C = C onto NT
    R  (top first): B → b (C,C)|C ; S → a B,B*
    NT (top first): C ; ϵ

• Way down at the second c (below the first b): push C → c ϵ|C onto R and ϵ onto NT
    R  (top first): C → c ϵ|C ; B → b (C,C)|C ; S → a B,B*
    NT (top first): ϵ ; C ; ϵ

• Way up at the second c: pop C → c ϵ|C out of R and w = ϵ out of NT; yes, ϵ ∈ L(ϵ|C)
    R  (top first): B → b (C,C)|C ; S → a B,B*
    NT (top first): C ; ϵ

• Still on the way up at the second c: pop w’ = C out of NT and push w’C = CC onto NT
    R  (top first): B → b (C,C)|C ; S → a B,B*
    NT (top first): CC ; ϵ

• Way up at the first b: pop B → b (C,C)|C out of R and w = CC out of NT; yes, CC ∈ L((C,C)|C)
    R  (top first): S → a B,B*
    NT (top first): ϵ

• Still on the way up at the first b: pop w’ = ϵ out of NT and push w’B = B onto NT
    R  (top first): S → a B,B*
    NT (top first): B

• Way down at the second b: push B → b (C,C)|C onto R and ϵ onto NT
    R  (top first): B → b (C,C)|C ; S → a B,B*
    NT (top first): ϵ ; B

• Way down at the c below the second b: push C → c ϵ|C onto R and ϵ onto NT
    R  (top first): C → c ϵ|C ; B → b (C,C)|C ; S → a B,B*
    NT (top first): ϵ ; ϵ ; B

• Way up at the c below the second b: pop C → c ϵ|C out of R and w = ϵ out of NT; yes, ϵ ∈ L(ϵ|C)
    R  (top first): B → b (C,C)|C ; S → a B,B*
    NT (top first): ϵ ; B

• Still on the way up at the c below the second b: pop w’ = ϵ out of NT and push w’C = C onto NT
    R  (top first): B → b (C,C)|C ; S → a B,B*
    NT (top first): C ; B

• Way up at the second b: pop B → b (C,C)|C out of R and w = C out of NT; yes, C ∈ L((C,C)|C)
    R  (top first): S → a B,B*
    NT (top first): B

• Still on the way up at the second b: pop w’ = B out of NT and push w’B = BB onto NT
    R  (top first): S → a B,B*
    NT (top first): BB

• Way up at the root a: pop S → a B,B* out of R and w = BB out of NT; yes, BB ∈ L(B,B*)
    R  (top first): (empty)
    NT (top first): (empty)

• The traversal is complete and both stacks are empty: report “accepted” and stop ☜ Check slide 74
    “accepted” (“yes”), T ∈ L(G)
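The walk-through can be reproduced mechanically. The sketch below records a snapshot of both stacks after every push and pop phase. The rule encoding, the names `RULES` and `trace`, and the use of Python regular expressions for matching are assumptions of this illustration. Stacks are printed bottom first, i.e. the top of each stack is the last list entry, and R stores only the rule's non-terminal to keep the output short.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical encoding of the example grammar: tag -> (non-terminal, regex).
RULES = {"a": ("S", "BB*"), "b": ("B", "(CC)|C"), "c": ("C", "()|C")}


def trace(root):
    """Return a list of (event, R, NT) snapshots for the local-case algorithm."""
    R, NT, steps = [], [], []

    def visit(elem):
        n, e = RULES[elem.tag]  # raises KeyError if the tag has no rule
        R.append(n)
        NT.append("")
        steps.append((f"down {elem.tag}", list(R), list(NT)))
        for child in elem:
            visit(child)
        R.pop()
        w = NT.pop()
        if re.fullmatch(e, w) is None:
            raise ValueError(f"{w!r} does not match {e!r}")
        if NT:
            NT[-1] += n  # same effect as: pop w', push w'N
        steps.append((f"up {elem.tag}", list(R), list(NT)))

    visit(root)
    return steps


if __name__ == "__main__":
    tree = ET.fromstring("<a><b><c/><c/></b><b><c/></b></a>")
    for event, rules, nts in trace(tree):
        print(f"{event:8} R={rules} NT={nts}")
```

Each "up" snapshot here merges the two phases that the walk-through shows separately (the match check and the push of w’N).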

Validating trees against tree grammars
• want to implement this algorithm?
  – walk the DOM tree in a depth-first, left-to-right way, or
  – use a SAX parser and do it in a streaming fashion
    • no need to keep the whole tree in memory
    • validate-while-u-parse!
• ...and we can use this algorithm for general DTDs!
• ...next week, we’ll see how this works for
  – single-type tree grammars (and WXS)
    • rather straightforward because we still only have at most one run of our tree grammar on the input tree
  – general tree grammars (and Relax NG)…
  – ...all validate-while-u-parse!
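A streaming variant could look like the sketch below, which runs the local-case algorithm inside SAX callbacks: the start of an element is the "way down", its end is the "way up". The grammar encoding and the names `RULES`, `START`, `NotAccepted`, `LocalValidator` and `validate_stream` are assumptions made for this illustration. Only the two stacks are kept in memory, never the whole tree. Documents that are not well-formed still raise the parser's own exception.

```python
import re
import xml.sax

# Hypothetical encoding of a local tree grammar: tag -> (non-terminal, regex).
RULES = {"a": ("S", "BB*"), "b": ("B", "(CC)|C"), "c": ("C", "()|C")}
START = {"S"}


class NotAccepted(Exception):
    """Raised by the handler to stop parsing at the first violation."""


class LocalValidator(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.R, self.NT = [], []  # stack of rules, stack of NT strings

    def startElement(self, name, attrs):
        # Way down: push the unique rule for this tag and an empty NT string.
        if name not in RULES:
            raise NotAccepted(f"no rule for <{name}>")
        self.R.append(RULES[name])
        self.NT.append("")

    def endElement(self, name):
        # Way up: match the children's non-terminals against the content model.
        n, e = self.R.pop()
        w = self.NT.pop()
        if re.fullmatch(e, w) is None:
            raise NotAccepted(f"content of <{name}> does not match")
        if self.NT:
            self.NT[-1] += n  # append N to the preceding siblings
        elif n not in START:
            raise NotAccepted(f"<{name}> is not derived from a start symbol")


def validate_stream(xml_bytes):
    try:
        xml.sax.parseString(xml_bytes, LocalValidator())
    except NotAccepted:
        return False
    return True
```

For instance, `validate_stream(b"<a><b><c/><c/></b><b><c/></b></a>")` accepts, and parsing of `<a><d/></a>` stops as soon as the `d` start tag is seen.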