sdpnotes 2: document grammars and instances 2 document grammars and instances a look at the...

42
SDP Notes 2: Document Grammars and Instances 2 Document Grammars and 2 Document Grammars and Instances Instances A look at the foundations of A look at the foundations of hierarchical document structures hierarchical document structures 2.1 Language-Theoretic Basis 2.1 Language-Theoretic Basis (a quick review) (a quick review) regular expressions regular expressions extended context-free grammars (as extended context-free grammars (as document document schemas schemas ) and ) and parse trees (as parse trees (as document instances document instances ) ) 2.2 Review of XML Basics 2.2 Review of XML Basics Practical realisation of the general model: Practical realisation of the general model: XML document instances and DTDs XML document instances and DTDs

Upload: erin-francis

Post on 05-Jan-2016

236 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

2 Document Grammars and Instances2 Document Grammars and Instances

A look at the foundations of hierarchical A look at the foundations of hierarchical document structuresdocument structures

2.1 Language-Theoretic Basis 2.1 Language-Theoretic Basis (a quick review)(a quick review)– regular expressionsregular expressions– extended context-free grammars (as extended context-free grammars (as document document

schemasschemas) and ) and – parse trees (as parse trees (as document instancesdocument instances))

2.2 Review of XML Basics2.2 Review of XML Basics– Practical realisation of the general model: Practical realisation of the general model:

XML document instances and DTDsXML document instances and DTDs

Page 2: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

2.1 Regular Expressions2.1 Regular Expressions

Formalism to describe Formalism to describe regular languages:regular languages:– relatively simple relatively simple sets of stringssets of strings over a finite over a finite alphabet alphabet ofof

» characters, events, characters, events, document elementsdocument elements, …, …

Many uses: Many uses: – text searching patterns (e.g., emacs, ed, awk, grep, text searching patterns (e.g., emacs, ed, awk, grep,

Perl)Perl)– as a part of grammatical formalisms (for programming as a part of grammatical formalisms (for programming

languages, XML, in XML DTDs and schemas, …)languages, XML, in XML DTDs and schemas, …) Relevance for document structures: what kind of structural Relevance for document structures: what kind of structural

content is allowed within various document componentscontent is allowed within various document components

Page 3: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Regular Expressions: SyntaxRegular Expressions: Syntax

A regular expression over an alphabetA regular expression over an alphabet is eitheris either an empty set, an empty set, – lambdalambda (sometimes (sometimes epsilonepsilon ), ), – aa any alphabet symbolany alphabet symbol aa – ((R | SR | S)) choicechoice; sometimes; sometimes ( (R R SS), ), – ((R SR S)) concatenationconcatenation, or, or– RR** Kleene closure or Kleene closure or iterationiteration,,

where where RR and and SS are regular expressions are regular expressions N.B: different syntaxes exist but the idea is same N.B: different syntaxes exist but the idea is same

Page 4: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Regular Expressions: GroupingRegular Expressions: Grouping

Conventions to simplify operator expressionsConventions to simplify operator expressions– outermost parentheses may be eliminated: outermost parentheses may be eliminated:

» ((EE) = ) = EE

– binary operations are associative:binary operations are associative:» ((AA | ( | (BB | | CC)) = (()) = ((AA | | BB) | ) | CC) = () = (AA | | BB | | CC))

» ((AA ( (B CB C)) = (()) = ((A BA B) ) CC) = () = (A B CA B C))

– operations have priorities: operations have priorities: » iteration first, concatenation next, choice lastiteration first, concatenation next, choice last» for example,for example,

((AA | ( | (BB ( (C D)C D)*)) = *)) = AA | | B B ((C DC D)*)*

Page 5: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Regular Expressions: SemanticsRegular Expressions: Semantics

Regular expressionRegular expression EE denotes a denotes a languagelanguage (set of strings) (set of strings) LL((EE), ), defined inductively as followsdefined inductively as follows::– LL(() = {}) = {} (empty set)(empty set)– LL(( (singleton set of empty string(singleton set of empty string = ‘‘) = ‘‘)

– LL((aa) = {) = {aa}, }, ( (aa , singleton set of , singleton set of word "word "a"a"– LL((((R | SR | S)) = )) = LL((RR) ) LL((SS) = {) = {w | ww | w LL((RR) ) or or ww LL((SS)} )}

– LL((((R SR S)) = )) = LL((RR) ) LL((SS) = {) = {xy | xxy | x LL((RR) ) and and yy LL((SS)})}

– LL((RR**) = ) = LL((RR))** = { = {xx11……xxnn | | xxkk LL((RR), ), kk = 1, …, = 1, …, nn;; n n 0}0}

Page 6: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Regular Expressions: ExamplesRegular Expressions: Examples

L(L(AA | | B B ((C DC D)*) = ?)*) = ?

= L(= L(AA)) L( L( B B ((C DC D)*))*)

= {= {AA}} L(L(BB)) L((L((C DC D)*))*)

= {= {AA}} {{BB}} L(L(C DC D)*)*

= {= {AA}} {{BB}} {{CD, CDCD, CDCDCD, …}CD, CDCD, CDCDCD, …}

= {= {AA, , BB, BCD, BCDCD, BCDCDCD, …}, BCD, BCDCD, BCDCDCD, …}

Page 7: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Regular Expressions: ExamplesRegular Expressions: Examples

Simplified top-level structure of a document: Simplified top-level structure of a document: – = {= {title, auth, date, secttitle, auth, date, sect}}– titletitle followed by an optional list of followed by an optional list of authorsauthors, ,

fby an optional fby an optional datedate, fby one or more , fby one or more sectionssections::

title auth* title auth* ((date |date |)) sect sect* sect sect* Commonly used abbreviations: Commonly used abbreviations:

– EE? = (? = (EE | |);); EE++ = = EE E*E* the above more compactly:the above more compactly:

title auth* datetitle auth* date?? sect+ sect+

Page 8: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Context-Free Grammars (CFGs)Context-Free Grammars (CFGs)

Used widely to syntax specification (programming Used widely to syntax specification (programming languages, XML, …) and to parser/compiler languages, XML, …) and to parser/compiler generation (e.g. YACC/GNU Bison)generation (e.g. YACC/GNU Bison)

CFG CFG GG formally a quadruple (V, formally a quadruple (V, P, S) P, S)– V is the V is the alphabetalphabet of the grammar of the grammar GG– VViis a set of s a set of terminal symbolsterminal symbols;;

» N = VN = V is a set of is a set of nonterminal symbolsnonterminal symbols

– P set of P set of productionsproductions– S S V the V the start symbolstart symbol

Page 9: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Productions and DerivationsProductions and Derivations

Productions: Productions: A A where where A A N, N, V* V*– E.g:E.g: A A aBa aBa ((production 1production 1))

LetLet V*. V*. StringString derives derives directlydirectly, , ifif

– for somefor someV*V*and and PP

– E.g.E.g. A A => A => AaBaaBa (assuming production 1 above)(assuming production 1 above)

NBNB: CFGs are often given simply by listing the productions : CFGs are often given simply by listing the productions (P); The start symbol (S) is then conventionally the left-hand-(P); The start symbol (S) is then conventionally the left-hand-side of the first productionside of the first production

Page 10: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Language Generated by a CFGLanguage Generated by a CFG

derives derives , , if there’s a sequence of (0 or if there’s a sequence of (0 or more) direct derivations that transformsmore) direct derivations that transforms to to

The language generated by a CFGThe language generated by a CFG GG::– L(L(GG) = {w ) = {w * | S =>* w }* | S =>* w }

NB:NB: L( L(GG) ) is a set of is a set of stringsstrings; ; – To model document To model document structuresstructures, we consider , we consider syntax treessyntax trees

Page 11: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Syntax TreesSyntax Trees

Also called Also called parse treesparse trees or or derivation treesderivation trees Ordered treesOrdered trees

– consist of nodes that may have child nodes which are consist of nodes that may have child nodes which are ordered left-to-rightordered left-to-right

nodes labelled by symbols of nodes labelled by symbols of VV: : – internal nodes by nonterminals, root by start symbolinternal nodes by nonterminals, root by start symbol– leaves by terminal symbols (or empty stringleaves by terminal symbols (or empty string ))

A node with label A node with label AA can have children labelled by can have children labelled by XX11, …, X, …, Xkk only if only if AA XX11, …, X, …, Xkk PP

Page 12: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Syntax Trees: ExampleSyntax Trees: Example

CFG for simplified arithmetic expressions:CFG for simplified arithmetic expressions:V = {E, +, *, I}; V = {E, +, *, I}; = {+, *, I}; N = {E}; S = E = {+, *, I}; N = {E}; S = E

((II stands for an arbitrary integer) stands for an arbitrary integer)

P = { E P = { E E+E, E+E, E E E*E, E*E, E E I, I, E E (E) } (E) }

Syntax tree for 2*(3+4)?Syntax tree for 2*(3+4)?

Page 13: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Syntax Trees: ExampleSyntax Trees: Example

2 * ( 3 + 4 )2 * ( 3 + 4 )

EE

EE EE

E E E*E E*E

II

E E I I

II IIE E I I

EEEE

E E E+E E+EEE

E E ( E ) ( E )

Page 14: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

CFGs for Document StructuresCFGs for Document Structures

Nonterminals represent document elementsNonterminals represent document elements– E.g. model for items (E.g. model for items (Ref Ref ) of a bibliography list:) of a bibliography list:

Ref Ref AuthorList Title PublData AuthorList Title PublDataAuthorList AuthorList Author AuthorList Author AuthorList AuthorList AuthorList

Notice:Notice:– right-hand-side of a production is a fixed string of right-hand-side of a production is a fixed string of

grammar symbolsgrammar symbols– Repetition simulated using recursion Repetition simulated using recursion

» e.g. e.g. AuthorList AuthorList aboveabove

Page 15: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Example: List of Three AuthorsExample: List of Three Authors

AuthorListAuthorList

Author Author

Author Author

Author Author

AuthorListAuthorList

AuthorListAuthorList

AuthorListAuthorList

Aho Aho Hopcroft Hopcroft Ullman Ullman

RefRef

TitleTitle PublDataPublData

. . . . . . . . . . . .

Page 16: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

ProblemsProblems

"Auxiliary nonterminals" (like "Auxiliary nonterminals" (like AuthorListAuthorList) ) obscure the modelobscure the model– the last the last AuthorAuthor several levels apart from its several levels apart from its

intuitive parent elementintuitive parent element RefRef

– awkward to access and to count awkward to access and to count AuthorsAuthors of a of a referencereference

– avoided by avoided by extendedextended context-free grammars context-free grammars

Page 17: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Extended CFGs (ECFGs)Extended CFGs (ECFGs)

like CFGs, but right-hand-sides of like CFGs, but right-hand-sides of productions are productions are regular expressionsregular expressions over over VV– E.g:E.g: Ref Ref Author* Title PublData Author* Title PublData

Let Let V*. V*. StringString derives derives directlydirectly, ,

if if

– for somefor someV*V*and and P P such thatsuch that L( L())

– E.g.E.g. Ref => Author Author Author Title PublDataRef => Author Author Author Title PublData

Page 18: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Language Generated by an Language Generated by an EECFGCFG

L(G)L(G) defined similarly to CFGs: defined similarly to CFGs:– derivesderives , if , if

nn= = (for (for nn 0)0)

– L(G) = {w L(G) = {w * | S =>* w }* | S =>* w } Theorem:Theorem: Extended and ordinary CFGs Extended and ordinary CFGs

allow to generate the same languages. allow to generate the same languages.

• Syntax trees of ECFGs and CFGs differ! (Next)Syntax trees of ECFGs and CFGs differ! (Next)

Page 19: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Syntax Trees of an ECFGSyntax Trees of an ECFG

Similar to parse trees of an ordinary CFG, Similar to parse trees of an ordinary CFG, except that ..except that ..– node with label A can have children labelled Xnode with label A can have children labelled X11, ,

…, X…, Xkk when A when A E E P such that P such that

XX11…X…Xkk L(E) L(E)

an internal node may have arbitralily many an internal node may have arbitralily many children (e.g., children (e.g., AuthorsAuthors below a below a RefRef node) node)

Page 20: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Example: Three Authors of a RefExample: Three Authors of a Ref

RefRef

Author Author Author Author Author Author TitleTitle

. . .. . .

PublDataPublData

Aho Aho Hopcroft Hopcroft Ullman Ullman The Design and Analysis ...The Design and Analysis ...

Ref Ref Author* Title PublData Author* Title PublData P, P,Author Author Author Title PublData Author Author Author Title PublData L( L(Author* Title PublDataAuthor* Title PublData))

Page 21: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Terminal Symbols in PractiseTerminal Symbols in Practise

(Extended) CFGs: (Extended) CFGs: – Leaves of parse trees are labelled by single Leaves of parse trees are labelled by single

terminal symbols (terminal symbols ()) Too granular for practise; instead terminal Too granular for practise; instead terminal

symbols which stand for all values of a typesymbols which stand for all values of a type– XML DTDs: XML DTDs: #PCDATA#PCDATA for variable length string content for variable length string content– Proposed XML schema formalisms:Proposed XML schema formalisms:

» string, byte, integer, boolean, date,string, byte, integer, boolean, date, ... ...

– Explicit string constants rare in document grammarsExplicit string constants rare in document grammars

Page 22: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

2.2 XML, eXtensible Markup Language2.2 XML, eXtensible Markup Language

W3C Recommendation 10-Feb-1998W3C Recommendation 10-Feb-1998– not an official standard, but a stable industry standardnot an official standard, but a stable industry standard– Second Edition, 6-Oct-2000Second Edition, 6-Oct-2000

» a revision, a revision, notnot a new version of XML a new version of XML

a simplified subset of SGML, Standard a simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1987Generalized Markup Language, ISO 8879:1987– what is said below about what is said below about validvalid XML documents applies XML documents applies

to SGML documents, tooto SGML documents, too

Page 23: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

What is XML?What is XML?

Extensible Markup LanguageExtensible Markup Language is is notnot a markup a markup language! language! – does not fix semantics or a tag set (like, e.g., HTML does not fix semantics or a tag set (like, e.g., HTML

does)does)

A way to use markup to represent informationA way to use markup to represent information A A metalanguagemetalanguage

– supports definition of specific markup languagessupports definition of specific markup languages– E.g. XHTML a reformulation of HTML using XMLE.g. XHTML a reformulation of HTML using XML

Page 24: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Next: Essential Features of XMLNext: Essential Features of XML

An overview of the essentials of XMLAn overview of the essentials of XML– many details skippedmany details skipped

» some to be discussed in exercises or with other some to be discussed in exercises or with other topics when the need arisestopics when the need arises

– learn to consult the original sources learn to consult the original sources (specifications, documentation etc) for details(specifications, documentation etc) for details

Page 25: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

XML Encoding of StructureXML Encoding of Structure

XML document essentially a parenthesized XML document essentially a parenthesized linear encoding of a parse tree (see next)linear encoding of a parse tree (see next)– corresponds to a pre-order walkcorresponds to a pre-order walk– start of inner node (or start of inner node (or elementelement) A denoted by a ) A denoted by a

start tagstart tag <A>, end denoted by <A>, end denoted by end tagend tag </A> </A>– leaves are strings (or leaves are strings (or empty elementsempty elements))

+ certain extensions (especially certain extensions (especially attributesattributes))

Page 26: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

XML Encoding of Structure: ExampleXML Encoding of Structure: Example

<S><S>

SS

WW WW

world!world!HelloHello

EE

</S></S><W><W> <W><W> </W></W></W></W> <E><E> </E></E>HelloHello world!world!

Page 27: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

An XML Processor (Parser)An XML Processor (Parser)

Reads XML documentsReads XML documents Passes data to an application Passes data to an application XML Recommendation XML Recommendation

– tells how to read, what to passtells how to read, what to pass

– check the XML Rec for details; quite readable!check the XML Rec for details; quite readable!

Page 28: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

XML: Logical Document StructureXML: Logical Document Structure

ElementsElements– correspond to internal nodes of the parse tree correspond to internal nodes of the parse tree – unique root element unique root element document a single parse document a single parse

treetree– indicated by matching (case-sensitive!) tagsindicated by matching (case-sensitive!) tags<ElementTypeName> <ElementTypeName> … … </ElementTypeName></ElementTypeName>

– can contain text and/or subelementscan contain text and/or subelements– can be can be emptyempty::

<elem-type></elem-type><elem-type></elem-type> OR (e.g.)OR (e.g.)<br/><br/>

Page 29: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Logical document structure (2)Logical document structure (2)

AttributesAttributes– name-value pairs attached to elementsname-value pairs attached to elements– ‘‘‘‘metadata’’, usually not treated as contentmetadata’’, usually not treated as content– in start-tag after the element type namein start-tag after the element type name

<div class="preface" date='990126'><div class="preface" date='990126'> … …

Also:Also:– <!--<!-- commentscomments outside other markup outside other markup --> -->– <?note this would be passed to the <?note this would be passed to the application as a application as a processing instructionprocessing instruction named ‘note’?>named ‘note’?>

Page 30: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

CDATA SectionsCDATA Sections

““CDATA Sections” to include XML markup CDATA Sections” to include XML markup characters as textual contentcharacters as textual content

<![CDATA[ <![CDATA[ Here we can easily include markup Here we can easily include markup characters and, for example, code characters and, for example, code fragments:fragments:

<example>if (Count < 5 && Count > 0) <example>if (Count < 5 && Count > 0) </example></example>

]]>]]>

Page 31: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Two levels of correctnessTwo levels of correctness

Well-formedWell-formed documents documents – roughly: follows the syntax of XML,roughly: follows the syntax of XML,

markup correct (elements properly nested, tag markup correct (elements properly nested, tag names match, attributes of an element have names match, attributes of an element have unique names, ...)unique names, ...)

– violation is a fatal errorviolation is a fatal error ValidValid documentsdocuments

– (in addition to being well-formed) (in addition to being well-formed) obey their obey their document type definitiondocument type definition

Page 32: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Document Type DeclarationDocument Type Declaration

Provides a grammar (Provides a grammar (document type definitiondocument type definition,, DTDDTD) for a class of documents) for a class of documents

SyntaxSyntax<!<!DOCTYPEDOCTYPE rootElemType rootElemType SYSTEMSYSTEM "ex.dtd" "ex.dtd"<!-- "<!-- "external subsetexternal subset" in file ex.dtd --> " in file ex.dtd --> [ <!-- "[ <!-- "internal subsetinternal subset" may come here --> " may come here --> ]>]>

DTD is the union of the external and internal subset; DTD is the union of the external and internal subset; internal subset has higher precedenceinternal subset has higher precedence– can override entity and attribute declarations (see next)can override entity and attribute declarations (see next)

Page 33: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Markup DeclarationsMarkup Declarations

DTD consists of DTD consists of markup declarationsmarkup declarations – element type declarationselement type declarations

» similar to productions of ECFGssimilar to productions of ECFGs

– attribute-list declarationsattribute-list declarations » for declared element typesfor declared element types

– entity declarationsentity declarations (see later) (see later)– notation declarationsnotation declarations

» to pass information about external (binary) objects to to pass information about external (binary) objects to the applicationthe application

Page 34: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Element type declarationsElement type declarations

The general form isThe general form is<!ELEMENT<!ELEMENT elementTypeName elementTypeName ((EE)>)>

where where EE is a is a content modelcontent model regular expression of element namesregular expression of element names Content model operators:Content model operators:

E | F : alternationE | F : alternation EE,, F: concatenation F: concatenationE? : optionalE? : optional E* : zero or moreE* : zero or moreE+ : one or moreE+ : one or more (E) : grouping(E) : grouping

Page 35: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Attribute-List DeclarationsAttribute-List Declarations

Can declare attributes for elements:Can declare attributes for elements:– Name, data type and possible default value Name, data type and possible default value

Example:Example:<!ATTLIST FIG<!ATTLIST FIG

idid IDID #IMPLIED#IMPLIEDdescr CDATA #REQUIREDdescr CDATA #REQUIREDclass (a | b | c) class (a | b | c) "a">"a">

Semantics mainly up to the applicationSemantics mainly up to the application– processor checks that processor checks that IDID attributes are unique and that attributes are unique and that

targets of targets of IDREFIDREF attributes exist attributes exist

Page 36: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Mixed, Empty and Arbitrary ContentMixed, Empty and Arbitrary Content

Mixed contentMixed content::<!ELEMENT P<!ELEMENT P (#PCDATA | I | IMG)*>(#PCDATA | I | IMG)*>

– may contain text (may contain text (#PCDATA#PCDATA) and elements) and elements Empty contentEmpty content::

<!ELEMENT IMG <!ELEMENT IMG EMPTY>EMPTY>

Arbitrary content:Arbitrary content:<!ELEMENT HTML <!ELEMENT HTML ANY>ANY>

(= (= <!ELEMENT HTML<!ELEMENT HTML (#PCDATA |(#PCDATA | choice-of-all-declared-element-typeschoice-of-all-declared-element-types)*>)*>))

Page 37: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Entities (1)Entities (1)

Multiple uses:Multiple uses:– character entitiescharacter entities: :

» &lt;&lt; &#60;&#60; and and &#x3C;&#x3C; all expand to ‘ all expand to ‘<<‘‘

» other other predefined entitiespredefined entities: : &amp; &gt; &apos; &quote;&amp; &gt; &apos; &quote;expand toexpand to &, >, ' &, >, ' and and ""

– general entitiesgeneral entities are shorthand notations: are shorthand notations:<!ENTITY UKU<!ENTITY UKU “University of Kuopio">“University of Kuopio">

Page 38: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Entities (2)Entities (2)

physical storage units comprising a documentphysical storage units comprising a document– parsed entitiesparsed entities

<!ENTITY chap1 SYSTEM <!ENTITY chap1 SYSTEM "http://myweb/ch1">"http://myweb/ch1">

– document entity document entity is the starting point of processingis the starting point of processing– entities and elements must nest properly:entities and elements must nest properly:

<!DOCTYPE doc [<!DOCTYPE doc [<!ENTITY chap1 <!ENTITY chap1

((… as above …)… as above …) > ]> ]<doc><doc>

&chap1;&chap1;</doc></doc>

<sec num="1"><sec num="1">… …

</sec></sec><sec num="2"><sec num="2">

… … </sec></sec>

Page 39: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Unparsed EntitiesUnparsed Entities

External (binary) filesExternal (binary) files Declarations:Declarations:

<!NOTATION TIFF "bin/xv"><!NOTATION TIFF "bin/xv"><!ENTITY fig123 SYSTEM "figs/f123.tif" <!ENTITY fig123 SYSTEM "figs/f123.tif"

NDATA TIFF>NDATA TIFF><!ATTLIST IMG<!ATTLIST IMG

filefile ENTITYENTITY #REQUIRED>#REQUIRED>

Usage:Usage: <IMG file="fig123"/><IMG file="fig123"/>– application receives information about the notationapplication receives information about the notation

Page 40: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Parameter entitiesParameter entities

Way to parameterize and modularize DTDsWay to parameterize and modularize DTDs

<!ENTITY % table-dtd SYSTEM "dtds/tab.dtd"><!ENTITY % table-dtd SYSTEM "dtds/tab.dtd">%table-dtd; <!-- include external sub-dtd -->%table-dtd; <!-- include external sub-dtd -->

<!ENTITY % stAttr "status (draft | ready) <!ENTITY % stAttr "status (draft | ready) 'draft'">'draft'"><!ATTLIST CHAP %stAttr; ><!ATTLIST CHAP %stAttr; ><!ATTLIST SECT %stAttr; ><!ATTLIST SECT %stAttr; >

(The latter, parameter entities as a part of a markup (The latter, parameter entities as a part of a markup declaration, is allowed only in the external, and not in the declaration, is allowed only in the external, and not in the internal DTD subset)internal DTD subset)

Page 41: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Speculations about XML ParsingSpeculations about XML Parsing

Parsing involves two things:Parsing involves two things:1. Checking the syntactic correctness of the input1. Checking the syntactic correctness of the input2. Building a parse tree for the input (a'la DOM), 2. Building a parse tree for the input (a'la DOM), or or otherwise passing the document content to the application otherwise passing the document content to the application (e.g. a'la SAX)(e.g. a'la SAX)

Task 2 is simple, thanks to the simplicity of XML Task 2 is simple, thanks to the simplicity of XML markup (see next) markup (see next)

Slightly more difficult (?) to implement areSlightly more difficult (?) to implement are– pulling the entities togetherpulling the entities together– checking the well-formednesschecking the well-formedness– checking the validity wrt the DTD (or a Schema)checking the validity wrt the DTD (or a Schema)

Page 42: SDPNotes 2: Document Grammars and Instances 2 Document Grammars and Instances A look at the foundations of hierarchical document structures 2.1 Language-Theoretic

SDP Notes 2: Document Grammars and Instances

Building an XML Parse TreeBuilding an XML Parse Tree

<S><S>

EE

SS

</S></S><W><W> <W><W> </W></W></W></W> <E><E> </E></E>HelloHello world!world!

WW

HelloHello

WW

world!world!