XML for Information Management
XML for Information Management – Day 4: Logical and Physical Structure of XML Documents Airi Salminen XML for Information Management University of Erlangen-Nuremberg Computational Linguistics Instructor: Professor Airi Salminen http://users.jyu.fi/~airi/ 12.1.-16.1. 2009


Page 1: XML for Information Management

XML for Information Management

University of Erlangen-Nuremberg
Computational Linguistics

Instructor: Professor Airi Salminen
http://users.jyu.fi/~airi/

12.1.-16.1. 2009

Page 2: XML for Information Management

Day 4: Logical and Physical Structure of XML Documents

Outline

1. Components of the logical structure
2. XML documents as trees
3. Entity types
4. Entity declarations and references
5. XML processor treatment of entity references
6. Motivations for the use of entities

Page 3: XML for Information Management


1. Components of the logical structure

• declarations

• elements

• comments

• processing instructions

Page 4: XML for Information Management

1. Components of the logical structure

document ::= prolog element Misc*

• prolog: declarations, comments, processing instructions
• element: elements, comments, processing instructions
• Misc*: comments, processing instructions
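The production above can be observed with a small sketch. It assumes Python's standard `xml.dom.minidom` parser; the sample document and the names `SAMPLE` and `top_level_parts` are made up for illustration. The XML declaration is consumed by the parser and does not appear as a node.

```python
from xml.dom import minidom

# Hypothetical sample: a prolog (XML declaration and a comment),
# one root element, and a trailing processing instruction (Misc*).
SAMPLE = (
    '<?xml version="1.0"?>'
    '<!-- prolog comment -->'
    '<memo><to>Reader</to></memo>'
    '<?checked yes?>'
)


def top_level_parts(xml_text):
    """Return the node names of the document's top-level children."""
    doc = minidom.parseString(xml_text)
    return [node.nodeName for node in doc.childNodes]


print(top_level_parts(SAMPLE))
```

The comment belongs to the prolog, `memo` is the single root element, and the processing instruction falls under Misc*.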

Page 5: XML for Information Management

1. Components of the logical structure

Declarations (bracketed numbers are the production numbers in the XML 1.0 specification):

‣ XML declaration [23]
‣ document type declaration [28]
‣ markup declaration [29]
  • element type declaration [45]
  • attribute list declaration [52]
  • entity declaration [70]
  • notation declaration [82]
‣ encoding declaration [80]
‣ standalone document declaration [32]
‣ text declaration [77]

Declarations are used to constrain the logical structure and to constrain the physical structure.

Page 6: XML for Information Management

1. Components of the logical structure

Typical element type declarations:

<!ELEMENT product (mfg, model, description, clock?)>
<!ELEMENT model (#PCDATA)>
<!ELEMENT description (#PCDATA | feature)*>
<!ELEMENT clock EMPTY>

• product: element content defined
• model, description: mixed content defined
• clock: empty element defined

Page 7: XML for Information Management

1. Components of the logical structure

empty element defined:

<!ELEMENT clock EMPTY>

two forms of the element allowed in a well-formed document:

<clock></clock>
<clock/>
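A minimal sketch, assuming Python's standard `xml.etree.ElementTree`, suggests that a parser treats the two forms alike. The helper name `is_empty` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The two well-formed spellings of an empty element.
with_tags = ET.fromstring("<clock></clock>")
self_closed = ET.fromstring("<clock/>")


def is_empty(element):
    """True when the element has no child elements and no character data."""
    return len(element) == 0 and not element.text


print(is_empty(with_tags), is_empty(self_closed))
```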

Page 8: XML for Information Management

1. Components of the logical structure

element content: definition by content models with metasymbols

*    iteration (none or more)
+    iteration (once or more)
|    alternatives
?    optional
,    successive
( )  grouping

#PCDATA is not accepted in the content model!

Example from XHTML 1.0 Strict DTD:

<!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
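The metasymbols behave much like regular-expression operators. The sketch below is an illustration only, not how an XML processor validates: it translates an element-content model into a Python regular expression over the sequence of child element names. The function names `model_to_regex` and `children_match` are hypothetical.

```python
import re

TABLE_MODEL = "(caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))"


def model_to_regex(model):
    """Translate an element-content model into a compiled regex.

    Each child name becomes a group matching the name plus one space,
    so "col" can never match a prefix of "colgroup".
    """
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.\-]*|[(),|?*+]", model):
        if token == "(":
            parts.append("(?:")          # grouping
        elif token == ",":
            continue                     # succession is plain concatenation
        elif token in ")|?*+":
            parts.append(token)          # same meaning in regex syntax
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def children_match(model, children):
    """Check whether a list of child element names satisfies the model."""
    sequence = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(sequence) is not None


print(children_match(TABLE_MODEL, ["caption", "col", "col", "thead", "tbody"]))
print(children_match(TABLE_MODEL, ["tbody", "caption"]))
```

The first call satisfies the model. The second does not, because caption must precede the table body.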

Page 9: XML for Information Management

1. Components of the logical structure

mixed content: definition has basically two forms

(#PCDATA)
(#PCDATA | e1 | … | en)*

#PCDATA is always included in the content specification and comes first in the list of alternatives

examples:

<!ELEMENT text (#PCDATA)>
<!ELEMENT section (#PCDATA | subsection)*>
<!ELEMENT section (#PCDATA | subsection | paragraph)*>

Page 10: XML for Information Management

1. Components of the logical structure

Attribute list declarations

• to define the set of attributes pertaining to a given element type

• to establish type constraints for these attributes

• to provide default values for attributes

Page 11: XML for Information Management

1. Components of the logical structure

<!ATTLIST poem author CDATA #REQUIRED >

• element type: poem
• attribute name: author
• attribute type: string (CDATA)
• constraint: the attribute must be specified for all elements of type poem (#REQUIRED)

Page 12: XML for Information Management

1. Components of the logical structure

Defining constraints

[60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue)

#REQUIRED: attribute must always be provided in all elements of the given type

#IMPLIED: attribute may be provided in an element; no default value is provided

AttValue: default value is given between single or double quotes

#FIXED AttValue: instances of the attribute must match the given default value
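The effect of a default value can be sketched with Python's standard `xml.parsers.expat`. With its default settings, expat is documented to report attributes derived from attribute declarations along with the specified ones. It does not validate, so #REQUIRED is not enforced here. The attributes `lang` and `source` and the helper `attributes_seen` are invented for the illustration.

```python
from xml.parsers import expat

# Hypothetical declarations: one required, one defaulted, one implied attribute.
DOC = """<!DOCTYPE poems [
<!ATTLIST poem author CDATA #REQUIRED
               lang   CDATA "fi"
               source CDATA #IMPLIED>
]>
<poems><poem author="Aale Tynni"/></poems>"""


def attributes_seen(xml_text):
    """Map each element name to the attributes reported by the parser."""
    seen = {}

    def start(name, attrs):
        seen[name] = dict(attrs)

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(xml_text, True)
    return seen


print(attributes_seen(DOC)["poem"])
```

The defaulted `lang` attribute is expected to appear although the start-tag omits it, while the #IMPLIED attribute `source` stays absent.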

Page 13: XML for Information Management

1. Components of the logical structure

Attribute types

[54] AttType ::= StringType | TokenizedType | EnumeratedType

tokenized types:

• ID: names that uniquely identify elements
• IDREF, IDREFS: references to ID type identifiers
• ENTITY, ENTITIES: entity names
• NMTOKEN, NMTOKENS: text tokens consisting of characters accepted in names

enumerated types:

• NOTATION: identifies notations
• enumeration

Page 14: XML for Information Management

1. Components of the logical structure

<?xml version="1.0"?>
<!DOCTYPE text [
<!ELEMENT text (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line
    id      ID     #REQUIRED
    seeline IDREFS #IMPLIED> ]>

<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">This is the second line, but look at the first too</line>
</text>
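How an application might follow the IDREFS links in this document can be sketched as follows. The sketch assumes Python's standard `xml.etree.ElementTree`, which does not validate, so the ID lookup is done by hand. The function name `resolve_references` is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!DOCTYPE text [
<!ELEMENT text (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line
    id      ID     #REQUIRED
    seeline IDREFS #IMPLIED> ]>
<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">This is the second line, but look at the first too</line>
</text>"""


def resolve_references(root):
    """Map each line id to the text of the lines its seeline attribute names."""
    by_id = {line.get("id"): line for line in root.iter("line")}
    resolved = {}
    for line in root.iter("line"):
        # An IDREFS value is a whitespace-separated list of ID values.
        targets = line.get("seeline", "").split()
        resolved[line.get("id")] = [by_id[target].text for target in targets]
    return resolved


print(resolve_references(ET.fromstring(DOC)))
```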

Page 15: XML for Information Management

2. XML documents as trees

XML-aware web browsers support the visualization of the hierarchic structure. Example:

<Chapter section='1'>
  <Narration narrator='Benjy'>
    <Imagery place='tree' mode='simile' sense='smell'>
      <Fragment code='1.12'>
        <Paragraph id='143'><Subject person='Caddy'>She</Subject> smelled like trees.</Paragraph>
      </Fragment>
    </Imagery>
  </Narration>
</Chapter>

Page 16: XML for Information Management

2. XML documents as trees

The XML specification defines a concrete syntax for XML documents.

W3C has defined four slightly different abstract models to describe the abstract syntax of XML documents:

• XML Information Set
• DOM model
• XPath 1.0 model
• XQuery 1.0 and XPath 2.0 data model

Analysis of differences in the models: Salminen, A., & Tompa, F.W. (2001). Requirements for XML document database systems. Proc. of the ACM Symposium on Document Engineering (DocEng '01) (pp. 85-94). New York: ACM Press.

Page 17: XML for Information Management

2. XML documents as trees

<poem author="Murasaki Shikibu" born="974">
<!-- The poem is translated from Japanese by Kenneth Rexroth -->
<line>This life of ours would not cause you sorrow</line>
<line>if you thought of it as like</line>
<line>the mountain cherry blossoms</line>
<line>which bloom and fade in a day. </line>
</poem>

Page 18: XML for Information Management

2. XML documents as trees

Node types of XPath 1.0

[Figure: tree of the poem document. The root node has one child, the element node poem. The poem element has two attribute nodes (author: Murasaki Shikibu; born: 974), a comment node (The poem is translated from Japanese by Kenneth Rexroth), and four element nodes line, each with one text node containing a line of the poem.]
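The node types can be listed programmatically. The sketch below uses Python's standard `xml.dom.minidom`, that is, the DOM model rather than the XPath 1.0 model itself; for this small document the two yield the same kinds of nodes. The labels and the function name `walk` are hypothetical, and the poem is written without whitespace between elements so that no extra whitespace-only text nodes appear.

```python
from collections import Counter
from xml.dom import Node, minidom

POEM = (
    '<poem author="Murasaki Shikibu" born="974">'
    '<!-- The poem is translated from Japanese by Kenneth Rexroth -->'
    '<line>This life of ours would not cause you sorrow</line>'
    '<line>if you thought of it as like</line>'
    '<line>the mountain cherry blossoms</line>'
    '<line>which bloom and fade in a day.</line>'
    '</poem>'
)


def walk(node, out=None):
    """Collect (node kind, label) pairs in document order."""
    out = [] if out is None else out
    if node.nodeType == Node.DOCUMENT_NODE:
        out.append(("root", ""))
    elif node.nodeType == Node.ELEMENT_NODE:
        out.append(("element", node.tagName))
        for name, value in node.attributes.items():
            out.append(("attribute", name + "=" + value))
    elif node.nodeType == Node.TEXT_NODE:
        out.append(("text", node.data))
    elif node.nodeType == Node.COMMENT_NODE:
        out.append(("comment", node.data.strip()))
    for child in node.childNodes:
        walk(child, out)
    return out


nodes = walk(minidom.parseString(POEM))
print(Counter(kind for kind, _ in nodes))
```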

Page 19: XML for Information Management

3. Entity types

The physical structure of XML documents consists of entities.

An entity is a unit recognized by the XML processor; the content of an entity is text or some other kind of data.

Page 20: XML for Information Management

3. Entity types

3-dimensional categorization:

parsed entities -- unparsed entities

internal entities -- external entities

general entities -- parameter entities

Page 21: XML for Information Management

3. Entity types

parsed entity

intended to be parsed by the XML processor; content consists of marked-up text

unparsed entity

not intended to be parsed by the XML processor; content can be any kind of data

Page 22: XML for Information Management

3. Entity types

internal entity

name and value given in an entity declaration

always a parsed entity

external entity

not internal

parsed or unparsed

Page 23: XML for Information Management

3. Entity types

general entity

used in elements and attributes

parsed or unparsed

internal or external

parameter entity

used in the document type definition

always parsed

internal or external

Page 24: XML for Information Management

3. Entity types

Alternatives

parsed:
• internal parameter
• internal general
• external parameter
• external general

unparsed:
• external general
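The five alternatives follow from the three dimensions and the two restrictions stated earlier: internal entities and parameter entities are always parsed. A small enumeration sketch, with the function name `allowed` invented for illustration:

```python
from itertools import product


def allowed(parsing, location, usage):
    """Apply the two restrictions stated for entity types."""
    if location == "internal" and parsing == "unparsed":
        return False  # internal entities are always parsed
    if usage == "parameter" and parsing == "unparsed":
        return False  # parameter entities are always parsed
    return True


combos = [
    combo
    for combo in product(
        ("parsed", "unparsed"),
        ("internal", "external"),
        ("general", "parameter"),
    )
    if allowed(*combo)
]

for combo in combos:
    print(" ".join(combo))
```

Of the eight combinations, only five remain, and the single unparsed one is external and general.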

Page 25: XML for Information Management

3. Entity types

[Figure: the XML processor, its inputs, and the information it passes to the application]

INPUT FILES for XML processing:

• root entity, external subset of DTD

• other files intended for XML processing

INTERNAL ENTITIES:

• name and textual content given in DTD

UNPARSED ENTITIES:

• files not intended for XML processing but referred to by entity references in the INPUT FILES

The XML processor passes to the application information about:

• elements and attributes
• comments
• processing instructions
• character data
• namespaces
• notations and locations of unparsed entities

Page 26: XML for Information Management

4. Entity declarations and references

EntityDecl ::= GEDecl | PEDecl

GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'

PEDecl ::= '<!ENTITY' S '%' Name S PEDef S? '>'

EntityDef ::= EntityValue | ( ExternalID NDataDecl?)

PEDef ::= EntityValue | ExternalID

In EntityDef and PEDef, EntityValue is the entity definition for an internal entity and ExternalID is the entity definition for an external entity.

Page 27: XML for Information Management

4. Entity declarations and references

internal entity

name and value ( = literal value) given

<!ENTITY % Shape "(rect | circle | poly | default )">

<!ENTITY JY "Jyväskylän yliopisto">

In both declarations the name (Shape, JY) is followed by the literal value in quotes.

Page 28: XML for Information Management

4. Entity declarations and references

external entity

name and system identifier (possibly together with public identifier) given, for an unparsed entity also notation

<!ENTITY virtuaaliyliopistouutiset SYSTEM "http://virtuaaliyliopisto.jyu.fi/kotisivut/sisalto/etusivu/newsfeed.xml">

Declarations from XHTML specification:

<!ENTITY % HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN"
    "xhtml-symbol.ent">
<!ENTITY % HTMLspecial PUBLIC "-//W3C//ENTITIES Special for XHTML//EN"
    "xhtml-special.ent">

http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html

Page 29: XML for Information Management

4. Entity declarations and references

Unparsed entity

<!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>

The name after NDATA (here gif) is the notation name.

The notation must have been declared, for example:

<!NOTATION gif PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN" >
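What a processor can report about such declarations may be sketched with Python's standard `xml.parsers.expat`, using its documented `EntityDeclHandler` and `NotationDeclHandler` callbacks. The wrapper `collect_declarations` and the enclosing sample document are assumptions made for illustration.

```python
from xml.parsers import expat

DOC = """<!DOCTYPE poem [
<!NOTATION gif PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN">
<!ENTITY image1 SYSTEM "../images/birdnest.gif" NDATA gif>
<!ENTITY JY "Jyväskylän yliopisto">
]>
<poem/>"""


def collect_declarations(xml_text):
    """Return (entities, notations) as reported by the expat parser."""
    entities, notations = {}, {}

    def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
        # value is None for external entities; notation is set only for unparsed ones.
        entities[name] = {"value": value, "system_id": system_id, "notation": notation}

    def on_notation(name, base, system_id, public_id):
        notations[name] = public_id

    parser = expat.ParserCreate()
    parser.EntityDeclHandler = on_entity
    parser.NotationDeclHandler = on_notation
    parser.Parse(xml_text, True)
    return entities, notations


entities, notations = collect_declarations(DOC)
print(entities["image1"])
print(entities["JY"])
```

The internal entity carries a value and no system identifier. The unparsed entity carries a system identifier and a notation name but no value.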

Page 30: XML for Information Management

4. Entity declarations and references

References to parameter entities:

%Shape;

%HTMLsymbol;

References to parsed general entities:

&JY;

&virtuaaliyliopistouutiset;

Reference to an unparsed general entity:

<poem image="image1">

The type of the attribute has to be ENTITY or ENTITIES

Page 31: XML for Information Management

4. Entity declarations and references

In addition to entity references, XML documents may contain character references.

A character reference refers to a specific character of Unicode.

It provides a decimal or hexadecimal representation of the character’s code point in Unicode.

Example: &#34;

One-character entity defined: <!ENTITY quot "&#34;">
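A minimal sketch, assuming Python's standard `xml.etree.ElementTree`, of how the decimal reference, its hexadecimal equivalent, and the predefined entity `quot` resolve to the same character. The element name `q` is made up.

```python
import xml.etree.ElementTree as ET

# Decimal reference, hexadecimal reference, and predefined entity
# for the quotation mark (code point 34, hexadecimal 22).
element = ET.fromstring("<q>&#34;&#x22;&quot;</q>")
print(element.text)
```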

Page 32: XML for Information Management

4. Entity declarations and references

Where can an entity or character reference occur?

A reference to a parameter entity can occur in:
‣ document type definition

A reference to a parsed general entity can occur in:
‣ element content
‣ attribute value (either in the start-tag or in the attribute definition)
‣ entity value

A reference to an unparsed general entity can occur in:
‣ attribute value (either in the start-tag or in the attribute definition)

A character reference can occur in:
‣ element content
‣ attribute value (either in the start-tag or in the attribute definition)
‣ entity value

Page 33: XML for Information Management

5. XML processor treatment of entity references

References to unparsed entities

A validating processor makes the identifiers for the entities and associated notations available to the application.

<poem image="figure1">
<!-- From a poem of Aale Tynni -->
<line>Seisoin ikkunassa ja nauroin. Ihana puu.</line>
<line>Ihana pesä.</line>
</poem>

Page 34: XML for Information Management

5. XML processor treatment of entity references

References to parsed entities

Dealing with two kinds of entity values:

literal value - the character string written between quotes in the entity definition

replacement text - derived by replacing the character references and parameter entity references in the literal value by their character values and replacement texts, respectively.

The XML processor replaces the entity reference by its replacement text.

Page 35: XML for Information Management

5. XML processor treatment of entity references

entity declaration

<!ENTITY rhyme1 "<rhyme xml:lang='fi'><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>">

replacement text = literal value

<rhyme xml:lang='fi'><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>

entity reference

<rhymecollection>&rhyme1; </rhymecollection>
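The replacement can be observed with a short sketch, assuming Python's standard `xml.etree.ElementTree`, which expands internal general entities declared in the internal DTD subset. The surrounding DOCTYPE is added here only to make the example self-contained.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE rhymecollection [
<!ENTITY rhyme1 "<rhyme xml:lang='fi'><line>Ole aina iloinen</line><line>niin kuin pikku varpunen</line></rhyme>">
]>
<rhymecollection>&rhyme1; </rhymecollection>"""

root = ET.fromstring(DOC)

# After parsing, the reference has been replaced by the entity's
# replacement text, so the markup inside it appears as ordinary elements.
lines = [line.text for line in root.iter("line")]
print(root[0].tag, lines)
```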

Page 36: XML for Information Management

5. XML processor treatment of entity references

Declarations from XHTML specification:

<!ENTITY % StyleSheet "CDATA"> <!-- style sheet data -->

<!ENTITY % Text "CDATA"> <!-- used for titles etc. -->

<!ENTITY % coreattrs "id ID #IMPLIED class CDATA #IMPLIED
    style %StyleSheet; #IMPLIED title %Text; #IMPLIED">

http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html

literal value of coreattrs:
id ID #IMPLIED class CDATA #IMPLIED
style %StyleSheet; #IMPLIED title %Text; #IMPLIED

replacement text of coreattrs:
id ID #IMPLIED class CDATA #IMPLIED
style CDATA #IMPLIED title CDATA #IMPLIED

Page 37: XML for Information Management

5. XML processor treatment of entity references

Exercise 10 (Course Text, Chapter 5)

Entity declaration from XHTML Strict-DTD:

<!ENTITY % Block "(%block; | form | %misc; )*">

What is the
(a) literal value
(b) replacement text
of entity Block?

(a) literal value: (%block; | form | %misc; )*

Page 38: XML for Information Management

5. XML processor treatment of entity references

Other entity declarations needed from the DTD:

Declarations from XHTML specification:

<!ENTITY % heading "h1| h2| h3| h4| h5| h6">
<!ENTITY % lists "ul | ol | dl">
<!ENTITY % blocktext "pre | hr | blockquote | address">
<!ENTITY % block "p | %heading; | div | %lists; | %blocktext; | fieldset | table">
<!ENTITY % misc.inline "ins | del | script">
<!ENTITY % misc "noscript | %misc.inline;">

http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html

Page 39: XML for Information Management

5. XML processor treatment of entity references

Deriving the replacement text of Block: references to parameter entities in the literal value (%block; | form | %misc;)* are replaced by their replacement texts.

Literal value of block:
p | %heading; | div | %lists; | %blocktext; | fieldset | table

Replacement text of block:
p | h1| h2| h3| h4| h5| h6 | div | ul | ol | dl | pre | hr | blockquote | address | fieldset | table

Literal value of misc:
noscript | %misc.inline;

Replacement text of misc:
noscript | ins | del | script

Replacement text of Block:
(p | h1| h2| h3| h4| h5| h6 | div | ul | ol | dl | pre | hr | blockquote | address | fieldset | table | form | noscript | ins | del | script )*
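The derivation can be sketched as a recursive substitution. This is a simplified illustration, not a DTD parser: it handles only parameter entity references of the form %name; in literal values, and the names `PARAMETER_ENTITIES` and `replacement_text` are hypothetical.

```python
import re

# Literal values of the parameter entities used in the exercise.
PARAMETER_ENTITIES = {
    "heading": "h1| h2| h3| h4| h5| h6",
    "lists": "ul | ol | dl",
    "blocktext": "pre | hr | blockquote | address",
    "block": "p | %heading; | div | %lists; | %blocktext; | fieldset | table",
    "misc.inline": "ins | del | script",
    "misc": "noscript | %misc.inline;",
    "Block": "(%block; | form | %misc; )*",
}

PE_REFERENCE = re.compile(r"%([A-Za-z_][\w.\-]*);")


def replacement_text(name, entities):
    """Replace each %name; in the literal value by that entity's replacement text."""
    literal = entities[name]
    return PE_REFERENCE.sub(
        lambda match: replacement_text(match.group(1), entities), literal
    )


print(replacement_text("misc", PARAMETER_ENTITIES))
print(replacement_text("Block", PARAMETER_ENTITIES))
```

The recursion mirrors the manual derivation: block and misc are expanded first, and their replacement texts are then substituted into the literal value of Block.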

Page 40: XML for Information Management

6. Motivations for the use of entities

The use of entities supports:

• use of non-textual data (audio, graphics, etc.) in XML documents (though such data can also be added via stylesheets)

• modularization of documents

• consistency

• multiuse of definitions

• adding semantic information by informative entity names and comments attached to entity declarations