sdpl 20062: document instances and grammars1 2 document instances and grammars fundamentals of...

58
SDPL 2006 2: Document Instances and G rammars 1 2 Document Instances and 2 Document Instances and Grammars Grammars Fundamentals of hierarchical Fundamentals of hierarchical document structures, or document structures, or Computer Scientist’s view of XML Computer Scientist’s view of XML 2.1 XML and XML documents 2.1 XML and XML documents 2.2 Basics of document grammars 2.2 Basics of document grammars 2.3 Basics of XML DTDs 2.3 Basics of XML DTDs 2.4 XML Namespaces 2.4 XML Namespaces 2.5 XML Schemas 2.5 XML Schemas

Upload: alvin-brown

Post on 06-Jan-2018

230 views

Category:

Documents


0 download

DESCRIPTION

SDPL 20062: Document Instances and Grammars3 What is XML? n Extensible Markup Language is not a markup language! –does not fix a tag set nor its semantics (unlike markup languages, e.g. HTML) n XML documents have no inherent (processing or presentation) semantics –even though many think that XML is semantic or self- describing; See next

TRANSCRIPT

Page 1: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 1

2 Document Instances and Grammars2 Document Instances and Grammars

Fundamentals of hierarchical document Fundamentals of hierarchical document structures, or structures, or Computer Scientist’s view of XMLComputer Scientist’s view of XML

2.1 XML and XML documents2.1 XML and XML documents2.2 Basics of document grammars 2.2 Basics of document grammars 2.3 Basics of XML DTDs2.3 Basics of XML DTDs2.4 XML Namespaces2.4 XML Namespaces2.5 XML Schemas2.5 XML Schemas

Page 2: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 2

2.1 XML and XML documents2.1 XML and XML documents

XML - Extensible Markup Language,XML - Extensible Markup Language,W3C Recommendation, February 1998W3C Recommendation, February 1998– not an official standard, but a stable industry standardnot an official standard, but a stable industry standard– 22ndnd Ed 6/10/2000, 3 Ed 6/10/2000, 3rdrd Ed 4/2/2004 Ed 4/2/2004

» editorial revisions, editorial revisions, notnot new versions of XML 1.0 new versions of XML 1.0

a simplified subset of SGML, Standard a simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1987Generalized Markup Language, ISO 8879:1987– what is said later about what is said later about validvalid XML documents applies XML documents applies

to SGML documents, tooto SGML documents, too

Page 3: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 3

What is XML?What is XML?

ExtensibleExtensible Markup Language Markup Language is is notnot a markup a markup language! language! – does not fix a tag set nor its semantics does not fix a tag set nor its semantics

(unlike markup languages, e.g. HTML)(unlike markup languages, e.g. HTML) XML documents have XML documents have no inherent no inherent (processing or (processing or

presentation) presentation) semanticssemantics– even though many think that XML is semantic or self-even though many think that XML is semantic or self-

describing; See nextdescribing; See next

Page 4: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 4

Semantics of XML MarkupSemantics of XML Markup

Meaning of this XML fragment?Meaning of this XML fragment?

– The computer cannot know it either!The computer cannot know it either!– Implementing the semantics is the topic of this courseImplementing the semantics is the topic of this course

Page 5: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 5

What is XML (2)?What is XML (2)?

XML XML isis– a way to use markup to represent informationa way to use markup to represent information– a a metalanguagemetalanguage

» supports definition of specific markup languages through XML supports definition of specific markup languages through XML DTDs or SchemasDTDs or Schemas

» E.g. XHTML a reformulation of HTML using XMLE.g. XHTML a reformulation of HTML using XML Often “XML” Often “XML” XML + XML technology XML + XML technology

– that is, processing models and languages we’re that is, processing models and languages we’re studying (and many others ...)studying (and many others ...)

Page 6: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 6

How does it look?How does it look?

<?xml version=’1.0’ encoding=”iso-8859-1” ?><?xml version=’1.0’ encoding=”iso-8859-1” ?><invoice num=”1234”><invoice num=”1234”><client clNum=”00-01”><client clNum=”00-01”> <name>Pekka Kilpeläinen</name> <name>Pekka Kilpeläinen</name> <email>[email protected]</email><email>[email protected]</email></client></client><item price=”60” unit=”EUR”><item price=”60” unit=”EUR”>XML Handbook</item> XML Handbook</item> <item price=”350” unit=”FIM”><item price=”350” unit=”FIM”>XSLT Programmer’s Ref</item>XSLT Programmer’s Ref</item>

</invoice></invoice>

Page 7: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 7

Essential Features of XMLEssential Features of XML

Overview of XML essentialsOverview of XML essentials– many details skippedmany details skipped

» some to be discussed in exercises or with other some to be discussed in exercises or with other topics when the need arisestopics when the need arises

– Learn to consult original sources (specifications, Learn to consult original sources (specifications, documentation etc) for details!documentation etc) for details!» The XML specification is easy to browseThe XML specification is easy to browse

Page 8: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 8

XML Document CharactersXML Document Characters

XML documents are made of ISO-10646 (32-bit) XML documents are made of ISO-10646 (32-bit) characterscharacters; in practice of their 16-bit Unicode ; in practice of their 16-bit Unicode subset (used, e.g., in Java)subset (used, e.g., in Java)

Three aspects or charactersThree aspects or characters::– identification as numeric identification as numeric code pointscode points– representationrepresentation by bytes by bytes– visual presentationvisual presentation

Page 9: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 9

External Aspects of CharactersExternal Aspects of Characters

Documents are stored/transmitted as a sequence Documents are stored/transmitted as a sequence of bytes (of 8 bits). An of bytes (of 8 bits). An encodingencoding determines how determines how characters are characters are representedrepresented by bytes. by bytes.– UTF-8 (UTF-8 ( 7-bit ASCII) is the XML default encoding 7-bit ASCII) is the XML default encoding– encoding="iso-8859-1"encoding="iso-8859-1"

~>~> 256 Western European chars as single bytes 256 Western European chars as single bytes A A fontfont (collection of character images called (collection of character images called

glyphsglyphs) determines the ) determines the visual presentation visual presentation of of characterscharacters

Page 10: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 10

XML Encoding of Structure 1XML Encoding of Structure 1

XML is, essentially, just a textual encoding XML is, essentially, just a textual encoding scheme of scheme of labelledlabelled, , orderedordered and and attributedattributed treestrees, in which, in which– internal nodes are internal nodes are elementselements labelled by type names labelled by type names– leaves are leaves are text nodestext nodes labelled by string values, or labelled by string values, or

empty element nodesempty element nodes– the left-to-right order of children of a node mattersthe left-to-right order of children of a node matters– element nodes may carry element nodes may carry attributesattributes

(= name-string-value pairs)(= name-string-value pairs)

Page 11: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 11

XML Encoding of Structure 2XML Encoding of Structure 2

XML encoding of a treeXML encoding of a tree– corresponds to a pre-order walkcorresponds to a pre-order walk– start of an element node with type name A start of an element node with type name A

denoted by a denoted by a start tagstart tag <A>, and its end denoted <A>, and its end denoted by by end tagend tag </A> </A>

– possible attributes written within the start tag:possible attributes written within the start tag:<A attr<A attr11=“value=“value11” … attr” … attrnn=“value=“valuenn”>”>

» names must be unique: attrnames must be unique: attrkk attrattrh h when k when k h h

– text nodes written as their string valuetext nodes written as their string value

Page 12: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 12

XML Encoding of Structure: XML Encoding of Structure: ExampleExample

<S><S>

SS

EE

<W><W> <W><W></W></W> <E A=‘1’/><E A=‘1’/>HelloHello world!world!

WW

HelloHello

WW

world!world!

</W></W> </S></S>

A=1A=1

Page 13: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 13

XML: Logical Document StructureXML: Logical Document Structure

ElementsElements – indicated by matching (case-sensitive!) tagsindicated by matching (case-sensitive!) tags<ElementTypeName> <ElementTypeName> … … </ElementTypeName></ElementTypeName>

– can contain text and/or subelementscan contain text and/or subelements– can be can be emptyempty::

<elem-type></elem-type><elem-type></elem-type> or or <elem-type/><elem-type/> (e.g. (e.g. <br/><br/> in in

XHTML)XHTML)– unique root element unique root element document a single tree document a single tree

Page 14: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 14

Logical document structure (2)Logical document structure (2)

AttributesAttributes– name-value pairs attached to elementsname-value pairs attached to elements– ‘‘‘‘metadata’’, usually not treated as contentmetadata’’, usually not treated as content– in start-tag after the element type namein start-tag after the element type name

<div class="preface" date='990126'><div class="preface" date='990126'> … … Also:Also:

– <!--<!-- commentscomments outsideoutside other markup other markup -->-->– <?note this would be passed to the <?note this would be passed to the application as a application as a processing instructionprocessing instruction named ‘note’?>named ‘note’?>

Page 15: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 15

CDATA SectionsCDATA Sections

““CDATA Sections” to include XML markup characters as CDATA Sections” to include XML markup characters as textual contenttextual content

<![CDATA[<![CDATA[ Here we could include for example code Here we could include for example code fragments: fragments: <example>if (Count < 5 && Count > 0) <example>if (Count < 5 && Count > 0) </example></example>

]]>]]>

Page 16: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 16

Two levels of correctness (1)Two levels of correctness (1)

Well-formedWell-formed documents documents – roughly: follows the syntax of XML,roughly: follows the syntax of XML,

markup correct (elements properly nested, tag markup correct (elements properly nested, tag names match, attributes of an element have names match, attributes of an element have unique names, ...)unique names, ...)

– violation is a fatal errorviolation is a fatal error ValidValid documentsdocuments

– (in addition to being well-formed) (in addition to being well-formed) obey an associated grammar (DTD/Schema)obey an associated grammar (DTD/Schema)

Page 17: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 17

XML docs and valid XML docsXML docs and valid XML docs

XML documents = well-formed XML documentsXML documents = well-formed XML documents

DTD-valid documentsDTD-valid documents Schema-valid documentsSchema-valid documents

Page 18: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 18

An XML Processor (Parser)An XML Processor (Parser)

Reads XML documents and reports their contents Reads XML documents and reports their contents to an application to an application – relieves the application from details of markup relieves the application from details of markup – XML Recommendation specifies, partly, the behaviour of XML Recommendation specifies, partly, the behaviour of

XML processors: XML processors: – recognition of characters as markup or data; what recognition of characters as markup or data; what

information to pass to applications; information to pass to applications; how to check the correctness of documents; how to check the correctness of documents;

– validation based on comparing document against its validation based on comparing document against its grammar grammar Next: Basics of document grammarsNext: Basics of document grammars

Page 19: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 19

2.2 Basics of document grammars2.2 Basics of document grammars

Regular expressionsRegular expressions describe regular languages: describe regular languages:– relatively simple relatively simple sets of stringssets of strings over a finite over a finite alphabet alphabet ofof

» characters, events, characters, events, document elementsdocument elements, …, … Used as: Used as:

– text searching patterns (e.g., emacs, ed, awk, grep, Perl)text searching patterns (e.g., emacs, ed, awk, grep, Perl)– part of grammatical formalisms (for programming part of grammatical formalisms (for programming

languages, XML, in XML DTDs and schemas, …)languages, XML, in XML DTDs and schemas, …) Let us describe accepted contents of document Let us describe accepted contents of document

componentscomponents

Page 20: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 20

Regular Expressions: SyntaxRegular Expressions: Syntax

A A regular expressionregular expression ( (säännöllinen lausekesäännöllinen lauseke) ) over an alphabetover an alphabet is eitheris either an empty set, an empty set, – lambdalambda (sometimes (sometimes epsilonepsilon ), ), – aa any alphabet symbolany alphabet symbol aa – ((R | SR | S)) choicechoice, , – ((R SR S)) concatenationconcatenation, or, or– ((RR))** Kleene closure or Kleene closure or iterationiteration,,

where where RR and and SS are regular expressions are regular expressions N.B: different syntaxes exist but the idea is same N.B: different syntaxes exist but the idea is same

Page 21: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 21

Regular Expressions: GroupingRegular Expressions: Grouping

Conventions to simplify operator expressions by Conventions to simplify operator expressions by eliminating extra parentheses:eliminating extra parentheses:– outermost parentheses may be eliminated: outermost parentheses may be eliminated:

» ((EE) = ) = EE– binary operations are associative:binary operations are associative:

» ((AA | ( | (BB | | CC)) = (()) = ((AA | | BB) | ) | CC) = () = (AA | | BB | | CC))» ((AA ( (B CB C)) = (()) = ((A BA B) ) CC) = () = (A B CA B C))

– operations have priorities: operations have priorities: » iteration first, concatenation next, choice lastiteration first, concatenation next, choice last» for example,for example,

((AA | ( | (BB ( (C)C)*)) = *)) = AA | | B CB C**

Page 22: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 22

Regular Expressions: SemanticsRegular Expressions: Semantics

Regular expressionRegular expression EE denotes a denotes a languagelanguage (set of strings) (set of strings) LL((EE), ), defined inductively as followsdefined inductively as follows::

– LL(() = {}) = {} (empty set)(empty set)– LL(( (singleton set of empty string(singleton set of empty string = ‘‘) = ‘‘)– LL((aa) = {) = {aa}, }, ( (aa , singleton set of , singleton set of word "word "a"a"– LL((((R | SR | S)) = )) = LL((RR) ) LL((SS) = {) = {w | ww | w LL((RR) ) or or ww LL((SS)} )} – LL((((R SR S)) = )) = LL((RR) ) LL((SS) = {) = {xy | xxy | x LL((RR) ) and and yy LL((SS)})}– LL((RR**) = ) = LL((RR))** = { = {xx11……xxnn | | xxkk LL((RR), ), kk = 1, …, = 1, …, nn;; n n 0}0}

Page 23: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 23

Regular Expressions: ExamplesRegular Expressions: Examples

L(L(AA | | B B ((C DC D)*) = ?)*) = ?

= L(= L(AA)) L( L( B B ((C DC D)*))*)

= {= {AA}} L(L(BB)) L((L((C DC D)*))*)

= {= {AA}} {{BB}} L(L(C DC D)*)*

= {= {AA}} {{BB}} {{CD, CDCD, CDCDCD, …}CD, CDCD, CDCDCD, …}

= {= {AA, , BB, BCD, BCDCD, BCDCDCD, …}, BCD, BCDCD, BCDCDCD, …}

= {= {AA}} {{BB}} {{CDCD}*}*

Page 24: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 24

Regular Expressions: ExamplesRegular Expressions: Examples

Simplified top-level structure of a document: Simplified top-level structure of a document: – = {= {title, auth, date, secttitle, auth, date, sect}}– titletitle followed by an optional list of followed by an optional list of authorsauthors, ,

fby an optional fby an optional datedate, fby one or more , fby one or more sectionssections::title auth* title auth* ((date |date |)) sect sect* sect sect*

Commonly used abbreviations: Commonly used abbreviations: – EE? = (? = (EE | |);); EE++ = = EE E*E* the above more compactly:the above more compactly:

title auth* datetitle auth* date?? sect+ sect+

Page 25: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 25

Context-Free Grammars (CFGs)Context-Free Grammars (CFGs)

Used widely to syntax specification (programming Used widely to syntax specification (programming languages, XML, …) and to parser/compiler languages, XML, …) and to parser/compiler generation (e.g. YACC/GNU Bison)generation (e.g. YACC/GNU Bison)

CFG CFG GG formally a 4-tuple ( formally a 4-tuple (nelikkonelikko) (V, ) (V, P, S) P, S)– V is the V is the alphabetalphabet of the grammar of the grammar GG– VViis a set of s a set of terminal symbols terminal symbols ((päätesymbolitpäätesymbolit));;

» N = VN = V is a set of is a set of nonterminal symbols nonterminal symbols ((välikesymbolitvälikesymbolit))– P set of P set of productionsproductions– S S V the V the start symbol start symbol ((lähtösymbolilähtösymboli))

Page 26: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 26

Productions and DerivationsProductions and Derivations

Productions: Productions: A A where where A A N, N, V* V*– E.g:E.g: A A aBa aBa ((production 1production 1))

LetLet V*. V*. StringString derives derives directlydirectly, , ifif– for somefor someV*V*

and and PP– E.g.E.g. A A => A => AaBaaBa (assuming production 1 above)(assuming production 1 above)

NBNB: CFGs are often given simply by listing the productions : CFGs are often given simply by listing the productions (P)(P)

Page 27: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 27

Language Generated by a CFGLanguage Generated by a CFG

derives derives , , if there’s a sequence of (0 if there’s a sequence of (0 or more) direct derivations that transformsor more) direct derivations that transforms to to

The language generated by a CFGThe language generated by a CFG GG::– L(L(GG) = {w ) = {w * | S =>* w }* | S =>* w }

NB:NB: L( L(GG) ) is a set of is a set of stringsstrings;; – To model document To model document structuresstructures, we consider , we consider

syntax trees syntax trees ((rakennepuurakennepuu))

Page 28: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 28

Syntax TreesSyntax Trees

A.k.a A.k.a parse treesparse trees ( (jäsennyspuujäsennyspuu) or ) or derivation trees derivation trees ((johtopuujohtopuu) )

CChild nodes are hild nodes are orderedordered left-to-right left-to-right Nodes are Nodes are labelledlabelled by symbols of by symbols of VV: :

– internal nodes by nonterminals, root by start symbolinternal nodes by nonterminals, root by start symbol– leaves by terminal symbols (or empty stringleaves by terminal symbols (or empty string ))

A node labelled A node labelled AA can have children labelled by can have children labelled by XX11, …, X, …, Xkk only if only if AA XX11, …, X, …, Xkk PP

Page 29: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 29

Syntax Trees: ExampleSyntax Trees: Example

CFG for simplified arithmetic expressions:CFG for simplified arithmetic expressions:V = {E, +, *, I}; V = {E, +, *, I}; = {+, *, I}; N = {E}; S = E = {+, *, I}; N = {E}; S = E

((II stands for an arbitrary integer) stands for an arbitrary integer)P = { E P = { E E+E, E+E,

E E E*E, E*E, E E I, I, E E (E) } (E) }

Syntax tree for 2*(3+4)?Syntax tree for 2*(3+4)?

Page 30: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 30

Syntax Trees: ExampleSyntax Trees: Example

2 * ( 3 + 4 )2 * ( 3 + 4 )

EE

EE EE

E E E*E E*E

II

E E I I

II IIE E I I

EEEEE E E+E E+EEE

E E ( E ) ( E )

Page 31: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 31

CFGs for Document StructuresCFGs for Document Structures

Nonterminals represent document elementsNonterminals represent document elements– E.g. model for items (E.g. model for items (Ref Ref ) of a bibliography list:) of a bibliography list:

Ref Ref AuthorList Title PublData AuthorList Title PublDataAuthorList AuthorList Author AuthorList Author AuthorList AuthorList AuthorList

Notice:Notice:– right-hand-side of a CFG production is a right-hand-side of a CFG production is a fixed stringfixed string of of

grammar symbolsgrammar symbols– Repetition simulated using recursion Repetition simulated using recursion

» e.g. e.g. AuthorList AuthorList aboveabove

Page 32: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 32

Example: List of Three AuthorsExample: List of Three Authors

AuthorListAuthorList

Author Author

Author Author

Author Author

AuthorListAuthorList

AuthorListAuthorList

AuthorListAuthorList

Aho Aho Hopcroft Hopcroft Ullman Ullman

RefRef

TitleTitle PublDataPublData

. . . . . . . . . . . .

Page 33: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 33

ProblemsProblems

"Auxiliary nonterminals" (like "Auxiliary nonterminals" (like AuthorListAuthorList) ) obscure the modelobscure the model– the last the last AuthorAuthor is several levels apart from its is several levels apart from its

intuitive parent elementintuitive parent element RefRef– clumsy to access and to count clumsy to access and to count AuthorsAuthors of a of a

referencereference– avoided by avoided by extendedextended context-free grammars context-free grammars

Page 34: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 34

Extended CFGs (ECFGs)Extended CFGs (ECFGs)

like CFGs, but right-hand-sides of like CFGs, but right-hand-sides of productions are productions are regular expressionsregular expressions over over VV– E.g:E.g: Ref Ref Author* Title PublData Author* Title PublData

Let Let V*. V*. StringString derives derives directlydirectly, , ifif– for somefor someV*V*

and and P P such thatsuch that L( L())– E.g.E.g. Ref => Author Author Author Title PublDataRef => Author Author Author Title PublData

L(Author* Title PublData)L(Author* Title PublData)

Page 35: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 35

Language Generated by an Language Generated by an EECFGCFG

L(G)L(G) defined similarly to CFGs: defined similarly to CFGs:– derivesderives , if , if

nn= = (for (for nn 0)0)– L(G) = {w L(G) = {w * | S =>* w }* | S =>* w }

Theorem:Theorem: Extended and ordinary CFGs allow to Extended and ordinary CFGs allow to generate the same (string) languages.generate the same (string) languages.

But syntax trees of ECFGs and CFGs differ! (Next)But syntax trees of ECFGs and CFGs differ! (Next)

Page 36: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 36

Syntax Trees of an ECFGSyntax Trees of an ECFG

Similar to parse trees of an ordinary CFG, Similar to parse trees of an ordinary CFG, except that ..except that ..– node with label A can have children labelled Xnode with label A can have children labelled X11, ,

…, X…, Xkk when A when A E E P such that P such that XX11…X…Xkk L(E) L(E)

an internal node may have arbitrarily many an internal node may have arbitrarily many children (e.g., children (e.g., AuthorsAuthors below a below a RefRef node) node)

Page 37: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 37

Example: Three Authors of a RefExample: Three Authors of a Ref

RefRef

Author Author Author Author Author Author TitleTitle

. . .. . .

PublDataPublData

Aho Aho Hopcroft Hopcroft Ullman Ullman The Design and Analysis ...The Design and Analysis ...

Ref Ref Author* Title PublData Author* Title PublData P, P,Author Author Author Title PublData Author Author Author Title PublData L( L(Author* Title PublDataAuthor* Title PublData))

Page 38: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 38

Terminal Symbols in PractiseTerminal Symbols in Practise

In (extended) CFGs: In (extended) CFGs: – Leaves of parse trees are labelled by single Leaves of parse trees are labelled by single

terminal symbols (terminal symbols ()) Too granular for practise; instead terminal symbols Too granular for practise; instead terminal symbols

which stand for all values of a typewhich stand for all values of a type– XML DTDs: XML DTDs: #PCDATA#PCDATA for variable length string content for variable length string content– In XML Schema:In XML Schema:

» string, byte, integer, boolean, date,string, byte, integer, boolean, date, ... ...

– Explicit string literals rare in document grammarsExplicit string literals rare in document grammars

Page 39: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 39

DTD/ECFG CorrespondenceDTD/ECFG Correspondence

DTDDTD----------------------------------------------------------------XML documentXML document

element typeelement typeelement type declarationelement type declaration#PCDATA#PCDATA

ECFGECFG------------------------------------------------parse/syntax treeparse/syntax treenonterminalnonterminalproductionproductionterminalterminal

Page 40: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 40

2. 3 Basics of XML DTDs2. 3 Basics of XML DTDs

A A Document Type DeclarationDocument Type Declaration provides a grammar provides a grammar ((document type definitiondocument type definition,, DTD DTD) for a class of ) for a class of documents [Defined in XML Rec]documents [Defined in XML Rec]

Syntax (in the prolog of a document instance):Syntax (in the prolog of a document instance):<!<!DOCTYPEDOCTYPE rootElemType rootElemType SYSTEMSYSTEM "ex.dtd" "ex.dtd"<!-- "<!-- "external subsetexternal subset" in file ex.dtd --> " in file ex.dtd --> [[ <!-- " <!-- "internal subsetinternal subset" may come here --> " may come here --> ]]>>

DTD is union of the external and internal subset; DTD is union of the external and internal subset; internal has higher precedenceinternal has higher precedence– can override entity and attribute declarations (see next)can override entity and attribute declarations (see next)

Page 41: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 41

Markup DeclarationsMarkup Declarations

DTD consists of DTD consists of markup declarationsmarkup declarations – element type declarationselement type declarations

» similar to productions of ECFGssimilar to productions of ECFGs– attribute-list declarationsattribute-list declarations

» for declared element typesfor declared element types– entity declarationsentity declarations (see later) (see later)– notation declarationsnotation declarations

» to pass information about external (binary) objects to to pass information about external (binary) objects to the applicationthe application

Page 42: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 42

How do Declarations Look Like?How do Declarations Look Like?

<!ELEMENT invoice (client, item+)><!ELEMENT invoice (client, item+)><!ATTLIST invoice num NMTOKEN #REQUIRED><!ATTLIST invoice num NMTOKEN #REQUIRED><!ELEMENT client (name, email?)> <!ELEMENT client (name, email?)> <!ATTLIST client num NMTOKEN #REQUIRED><!ATTLIST client num NMTOKEN #REQUIRED><!ELEMENT name (#PCDATA)> <!ELEMENT name (#PCDATA)> <!ELEMENT email (#PCDATA)> <!ELEMENT email (#PCDATA)> <!ELEMENT item (#PCDATA)><!ELEMENT item (#PCDATA)><!ATTLIST item <!ATTLIST item

priceprice NMTOKEN #REQUIREDNMTOKEN #REQUIREDunit (FIM | EUR) ”EUR” >unit (FIM | EUR) ”EUR” >

Page 43: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 43

Element type declarationsElement type declarations

The general form isThe general form is<!ELEMENT<!ELEMENT elementTypeName elementTypeName ((EE)>)>

where where EE is a is a content modelcontent model regular expression of element namesregular expression of element names Content model operators:Content model operators:

E | F : alternationE | F : alternation EE,, F: concatenation F: concatenationE? : optionalE? : optional E* : zero or moreE* : zero or moreE+ : one or moreE+ : one or more (E) : grouping(E) : grouping

No prioritiesNo priorities: either (A,B)|C or A,(B|C), but : either (A,B)|C or A,(B|C), but nono A,B|C A,B|C

Page 44: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 44

Attribute-List DeclarationsAttribute-List Declarations

Can declare attributes for elements:Can declare attributes for elements:– Name, data type and possible default value:Name, data type and possible default value: <!<!ATTLIST FIGATTLIST FIG

idid IDID #IMPLIED#IMPLIEDdescr CDATA #REQUIREDdescr CDATA #REQUIREDclass (a | b | c) class (a | b | c) "a">"a">

Semantics mainly up to the applicationSemantics mainly up to the application– processor checks that processor checks that IDID attributes are unique and that attributes are unique and that

targets of targets of IDREFIDREF and and IDREFSIDREFS attributes exist attributes exist

Page 45: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 45

Mixed, Empty and Arbitrary ContentMixed, Empty and Arbitrary Content

Mixed contentMixed content::<!ELEMENT P<!ELEMENT P (#PCDATA | I | IMG)*>(#PCDATA | I | IMG)*>

– may contain text (may contain text (#PCDATA#PCDATA) and elements) and elements Empty contentEmpty content::

<!ELEMENT IMG <!ELEMENT IMG EMPTYEMPTY>> Arbitrary content:Arbitrary content:

<!ELEMENT HTML <!ELEMENT HTML ANYANY>>(= (= <!ELEMENT HTML<!ELEMENT HTML (#PCDATA |(#PCDATA |

choice-of-all-declared-element-typeschoice-of-all-declared-element-types)*>)*>))

Page 46: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 46

Entities (1)Entities (1)

Named storage units or XML fragments (Named storage units or XML fragments (~~ macros macros in some languages)in some languages)– character entitiescharacter entities::

» &lt;&lt; &#60;&#60; and and &#x3C;&#x3C; all expand to ‘ all expand to ‘<<‘‘(treated as data, not as start-of-markup)(treated as data, not as start-of-markup)

» other other predefined entitiespredefined entities: : &amp; &gt; &apos; &quote;&amp; &gt; &apos; &quote;

forfor & & > > ' ' ""– general entitiesgeneral entities are shorthand notations: are shorthand notations:<!<!ENTITYENTITY UKU UKU "University of Kuopio">"University of Kuopio">

Page 47: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 47

Entities (2)Entities (2) physical storage units comprising a documentphysical storage units comprising a document

– parsed entitiesparsed entities<!ENTITY chap1 SYSTEM "http://myweb/ch1"><!ENTITY chap1 SYSTEM "http://myweb/ch1">

– document entity document entity is the starting point of processingis the starting point of processing– entities and elements must nest properly:entities and elements must nest properly:

<!DOCTYPE doc [<!DOCTYPE doc [<!ENTITY chap1 <!ENTITY chap1

((… as above …)… as above …) > ]> ]<doc><doc>

&chap1;&chap1;</doc></doc>

<sec num="1"><sec num="1">… …

</sec></sec><sec num="2"><sec num="2">

… … </sec></sec>

Page 48: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 48

Unparsed EntitiesUnparsed Entities

For connecting external binary objects to XML For connecting external binary objects to XML documents; (XML processor handles only text)documents; (XML processor handles only text)

Declarations:Declarations:<!NOTATION TIFF "bin/gimp"><!NOTATION TIFF "bin/gimp"><!ENTITY fig123 SYSTEM "figs/f123.tif" <!ENTITY fig123 SYSTEM "figs/f123.tif"

NDATANDATA TIFF> TIFF><!ATTLIST IMG<!ATTLIST IMG

filefile ENTITYENTITY #REQUIRED>#REQUIRED> Usage:Usage: <IMG file="fig123"/><IMG file="fig123"/>

– application receives information about the notationapplication receives information about the notation

Page 49: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 49

Parameter EntitiesParameter Entities

Way to parameterize and modularize DTDsWay to parameterize and modularize DTDs

<!ENTITY <!ENTITY %% tableDTD SYSTEM "dtds/tab.dtd"> tableDTD SYSTEM "dtds/tab.dtd">%tableDTD; <!-- include this sub-dtd -->%tableDTD; <!-- include this sub-dtd -->

<!ENTITY <!ENTITY %% stAttr "ready (yes | no) 'no'"> stAttr "ready (yes | no) 'no'"><!ATTLIST CHAP <!ATTLIST CHAP %%stAttr; >stAttr; ><!ATTLIST SECT <!ATTLIST SECT %%stAttr; >stAttr; >

(Parameter entities in a markup declaration allowed only in the (Parameter entities in a markup declaration allowed only in the externalexternal DTD subset) DTD subset)

Page 50: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 50

Speculations about XML ParsingSpeculations about XML Parsing

Parsing involves two things:Parsing involves two things:1. Pulling the entities together, and checking the syntactic 1. Pulling the entities together, and checking the syntactic correctness of the inputcorrectness of the input2. Building a parse tree for the input (a'la DOM), 2. Building a parse tree for the input (a'la DOM), or or otherwise informing the application about document otherwise informing the application about document content and structure (e.g. a'la SAX)content and structure (e.g. a'la SAX)

Task 2 is simple (Task 2 is simple ( simplicity of XML markup; simplicity of XML markup; see next) see next)

Slightly more difficult to implement areSlightly more difficult to implement are– checking the well-formednesschecking the well-formedness– checking the validity wrt the DTD (or a Schema)checking the validity wrt the DTD (or a Schema)

Page 51: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 51

Building an XML Parse TreeBuilding an XML Parse Tree

<S><S>

SS

</S></S><W><W> <W><W> </W></W></W></W> <E><E> </E></E>HelloHello world!world!

WW

HelloHello

EE WW

world!world!

Page 52: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 52

Sketching the validation of Sketching the validation of XMLXML Treat the document as a tree Treat the document as a tree d d Document is valid w.r.t. a grammar Document is valid w.r.t. a grammar

(DTD/Schema) G iff (DTD/Schema) G iff dd is a syntax tree over G is a syntax tree over G How to check this?How to check this?

– Check that the root is labelled by start symbol S of GCheck that the root is labelled by start symbol S of G– For each element node For each element node nn of the tree, check that of the tree, check that

» the attributes of the attributes of nn match attributes declared for the type of match attributes declared for the type of nn

» the contents of the contents of nn matches the content model for the type of matches the content model for the type of nn. That is, if . That is, if nn is of type A and its children of type B is of type A and its children of type B11, …, B, …, Bnn, , check thatcheck that

word B word B11 … B … Bn n L(E) L(E) (1)(1)for some production A for some production A E of G. E of G.

Page 53: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 53

Sketching the validation… Sketching the validation… (2)(2) How to check condition (1; matching of How to check condition (1; matching of

children with a content model)?children with a content model)?– by an automaton built from content model Eby an automaton built from content model E

Normally, validation proceeds in Normally, validation proceeds in document document order order (= depth-first order of tree)(= depth-first order of tree). . For For example, to validate the content ofexample, to validate the content of

<A><B>…</B><C>...<C/><D/></A><A><B>…</B><C>...<C/><D/></A>the content of B is checked before continuing the content of B is checked before continuing to verify that BCD matches the content model to verify that BCD matches the content model for A for A

Page 54: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 54

2.4 XML Namespaces2.4 XML Namespaces

Document often comprise parts processed by Document often comprise parts processed by different applications (and/or defined in different different applications (and/or defined in different grammars)grammars)– for example, in XSLT scripts:for example, in XSLT scripts:

<xsl:template match="doc/title"> <xsl:template match="doc/title"> <H1><H1> <xsl:apply-templates /><xsl:apply-templates /> </H1> </H1> </xsl:template></xsl:template>

– How to manage multiple sets of names?How to manage multiple sets of names?

HTML HTML elementselements

XSLT XSLT elements/elements/

instructionsinstructions

Page 55: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 55

XML Namespaces (2/5) XML Namespaces (2/5)

Solution: XML Namespaces, W3C Rec. 14/1/1999 Solution: XML Namespaces, W3C Rec. 14/1/1999 for separating possibly overlapping “vocabularies” for separating possibly overlapping “vocabularies” (sets of element type and attribute names) within a (sets of element type and attribute names) within a single documentsingle document

by introducing (arbitrary) local name by introducing (arbitrary) local name prefixesprefixes, and , and binding them to (fixed) globally unique URIsbinding them to (fixed) globally unique URIs– For example, the local prefix “For example, the local prefix “xsl:xsl:” ”

conventionally used in XSLT scriptsconventionally used in XSLT scripts

Page 56: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 56

XML Namespaces briefly (3/5)XML Namespaces briefly (3/5)

Namespace identified by a URI (through Namespace identified by a URI (through the associated local prexif) the associated local prexif) e.g.e.g. http://www.w3.org/http://www.w3.org/1999/XSL/Transform1999/XSL/Transform for XSLTfor XSLT

– conventional but not required to use URLsconventional but not required to use URLs– the identifying URI has to be unique, but it does not the identifying URI has to be unique, but it does not

have to be an existing addresshave to be an existing address Association inherited to sub-elementsAssociation inherited to sub-elements

– see the next example (of an XSLT script)see the next example (of an XSLT script)

Page 57: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 57

XML Namespaces (4/5) XML Namespaces (4/5)

<xsl:stylesheet version=<xsl:stylesheet version="1.0""1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/TR/xhtml1/strict">xmlns="http://www.w3.org/TR/xhtml1/strict">

<!-- XHTML is the ’default namespace’ --><!-- XHTML is the ’default namespace’ --><xsl:template match="doc/title"> <xsl:template match="doc/title"> <H1><H1>

<xsl:apply-templates /><xsl:apply-templates /> </H1> </H1> </xsl:template> </xsl:template>

</xsl:stylesheet></xsl:stylesheet>

Page 58: SDPL 20062: Document Instances and Grammars1 2 Document Instances and Grammars Fundamentals of hierarchical document structures, or Computer Scientist’s

SDPL 2006 2: Document Instances and Grammars 58

XML Namespaces briefly (5/5)XML Namespaces briefly (5/5)

Mechanism built on top of basic XMLMechanism built on top of basic XML– overloads attribute syntax (overloads attribute syntax (xmlns:xmlns:) to introduce ) to introduce

namespacesnamespaces– does not affect validation does not affect validation

» namespace attributes have to be declared for DTD-namespace attributes have to be declared for DTD-validityvalidity

» all element type names have to be declared (with their all element type names have to be declared (with their initial prefixes!)initial prefixes!)

– > Other schema languages (XML Schema, Relax NG) > Other schema languages (XML Schema, Relax NG) better for validating documents with Namespacesbetter for validating documents with Namespaces