XML Concepts
Prof. Andrea Omicini
Distributed Systems / Sistemi Distribuiti L-A
A.Y. 2007-2008
Alma Mater Studiorum – Università di Bologna a Cesena

Outline

- Introducing XML
- XML Fundamentals
- Document Type Definitions (DTDs)
- Namespaces
- Internationalisation
- XML & CSS
- DOM & SAX

Introducing XML

What is XML?

- A W3C Standard
  - http://www.w3.org/XML/
- A mark-up language for text documents
  - derived from SGML (Standard Generalized Markup Language)
    - ISO 8879, http://www.iso.ch/cate/d16387.html
  - eXtensible Markup Language
- A meta-markup language
  - to define markup languages
  - such as XHTML, XSLT, XML Schema, …
- A formally-defined, text-based language
  - verifiable for well-formedness and validity
  - usable across platforms and technologies

What XML is Not

- XML is not
  - a programming language
  - a network-transport protocol
  - a document presentation language
  - a database (manager)
- It can be used (and actually is) in all of those contexts, but it remains a markup language

Why Markup Languages?

- Markup
  - encoding embodied in the document, specifying document properties as well as properties of the information contained
    - for instance, formatting instructions
    - more generally, structural / semantic information
  - knowledge vs. data
- Marks / Markups
  - tags used to qualify / label text chunks
  - e.g., HTML tags
- XML example

    <student>
      <studentname>
        <name>Carlo</name>
        <surname>Nervo</surname>
      </studentname>
      <studentnumber>0000145678</studentnumber>
      <course>2036</course>
    </student>

XML: X for eXtensibility

- Basic idea of XML
  - a simple meta-language for humans and automata
  - to build electronic documents
  - allowing users to define ad hoc markup languages
- Then, XML is quite free in general
  - it can be "extended"
    - actually, specialised
  - to define more specific, ad hoc markup languages
- No predefined XML markups, as happens instead in HTML
  - they need to be defined
    - who defines them?
    - can we do this? how?

Hey, Too Many Languages Already!?

- Application domains are more and more
  - numerous
  - complex
  - specific
- Special / specialised languages as the engineer's tools
  - to represent, denote & express behaviours and computations
- Engineers working with computational / ICT systems will be called to use a number of different artificial languages, but also
  - to know and understand computational models and paradigms
  - to select languages and paradigms
  - to define and build new languages
- "Laurea Specialistica in Informatica"
  - "Linguaggi e modelli computazionali", "Ingegneria del SW"

XML Applications

- XML per se is "small" & simple
  - languages defined via XML are instead many and complex
- XML Applications
  - XML-defined markup languages
    - defined through a precise syntax
    - DTD or XML Schema
  - they may be either standard or custom
- Most standard XML applications are W3C
  - such as
    - XSLT
    - XML Schema
    - XHTML

XML for Portable Data

- Cross-platform, long-term data format
  - passing XML data through space and time
  - along with Unicode and a text-based standard format
- Text, text, text
  - both data and markup
  - all in the XML file
- XML document structure: simple & clear
  - easy to parse
  - well-documented
- That is why XML is already everywhere

What XML Looks Like

    <?xml version="1.0" encoding="utf-8"?>
    <docroot>
      <head>
        <title>This is my document</title>
      </head>
      <body>
        <p>A list of things I like</p>
        <list>
          <item>weekends</item>
          <item>good beer</item>
          <item>midnight snacks</item>
          <item>ice cream
            <list>
              <item>chocolate</item>
              <item>cookie dough</item>
              <item>white russian</item>
            </list>
          </item>
          <item>shade trees</item>
        </list>
      </body>
    </docroot>

What XML Looks Like from a Browser

[Figure: browser rendering of an XML document]

How to Work with XML

- XML is text
  - so any text editor is perfectly fine
- A number of XML editors are around
  - but general text editors with some programming / Web-oriented capabilities are typically good enough, and often even better
- Visualisation is a different matter
  - browsers do something
    - but XML is not a presentation language, so…
- We need to understand
  - what an XML document is
  - how XML works

What is an XML Document?

- It can be
  - a text file
  - a record in a database
  - a run-time construction in memory
  - …
- In any case, it can be handled and transmitted by any system capable of dealing with text documents

    <student>
      <studentname>
        <name>Carlo</name>
        <surname>Nervo</surname>
      </studentname>
      <studentnumber>0000145678</studentnumber>
      <course>2036</course>
    </student>

How does XML Work?

- Who handles an XML document
  - after it has been produced?
  - how? why?
- XML parsers
  - working out the structure of the XML document
  - verifying well-formedness and basic conformance to XML syntax
- XML validating parsers
  - when applicable
    - that is, when there is either a DTD or a Schema
  - checking validity
- Examples
  - web browsers, word processors, database servers, drawing programs, spreadsheets, programs in some language, etc.
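
What a non-validating parser does can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses the standard-library `xml.etree.ElementTree` module (not mentioned in the slides), and `outline` is a hypothetical helper name. The parser works out the element tree of the student document, and would raise an error if the text were not well-formed.

```python
import xml.etree.ElementTree as ET

STUDENT = """<student>
  <studentname>
    <name>Carlo</name>
    <surname>Nervo</surname>
  </studentname>
  <studentnumber>0000145678</studentnumber>
  <course>2036</course>
</student>"""


def outline(element, depth=0):
    """Return one indented line per element: tag name plus trimmed text, if any."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (": " + text if text else "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


# fromstring() raises ET.ParseError when the text is not well-formed XML.
root = ET.fromstring(STUDENT)
print("\n".join(outline(root)))
```

The parser here checks well-formedness only; it does not validate against a DTD or a Schema.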

Where is XML Actually Used?

- Everywhere, already

Some History of XML & Related

- Lot to be written still…
- SGML is where it comes from
  - HTML was the first successful application of SGML
  - but SGML had obvious limitations
    - too complex
    - more than 150 pages
    - never implemented fully
  - too complex for the Internet
- SGML "Lite" (1996, Bosak, Bray et al.)
  - XML 1.0 (February 1998)
- Then a flow
  - namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

XML Fundamentals

A Simple XML Document

    <player>
      Carlo Nervo
    </player>

XML Documents & Files

- This is a complete XML document
- It can be stored / recorded / built in the form of a number of different files, or even in other forms
  - Carlonervo.xml, player.txt
  - a record in a database
  - a memory area built by a CGI and then transmitted
  - sent by a Web server with MIME type application/xml or text/xml

    <player>
      Carlo Nervo
    </player>

XML Elements & Tags

- The document contains a single element
  - of type player
- Such an element is delimited by the tag player
  - between start tag <player> and end tag </player>
- In between the tags lies the element's content: Carlo Nervo
- Tags are markup
  - the most common form of markup, but there are other kinds
- Content is character data
  - including the white space between Carlo & Nervo

    <player>
      Carlo Nervo
    </player>

Tag Syntax

- Very similar to HTML tags
  - at least superficially
- <tag> for start tags, </tag> for end tags
- <tag /> for empty tags
  - tags with no content, like <br /> or <hr />
- XML is case sensitive
  - so <player> cannot be closed by end tag </Player>
  - NOTE: thus, pay attention to non-case-sensitive technologies when combined with XML
    - HTML, JavaScript & XHTML, …

XML Trees: A Simple Example

    player
      name
        Carlo
      surname
        Nervo
      team
        Bologna
      team
        Mantova

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

An XML Document is an XML Tree

- An XML document has a tree-like structure
  - one and only one root
    - root element, or document element
  - each node / element can have one or more child elements
    - each element except the root has exactly one parent
  - child elements of the same parent are siblings
  - leaves are either content or empty elements
- Well-formedness stems from here
  - <em><b>Wrong </em> XML</b> is not permitted
    - nesting needs to be perfect, overlapping is not allowed
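
The tree vocabulary above (root, children, parent, siblings, leaves) can be tried out on the player document. The sketch below assumes Python's standard `xml.etree.ElementTree` module; the names `parent_of`, `children` and `leaves` are illustrative only.

```python
import xml.etree.ElementTree as ET

PLAYER = """<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>"""

root = ET.fromstring(PLAYER)

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Children of the root are siblings of each other.
children = [child.tag for child in root]

# Leaves are the elements without child elements; here they hold text content.
leaves = [element.text for element in root.iter() if len(element) == 0]

print("root:", root.tag)
print("children:", children)
print("leaves:", leaves)
print("root has a parent:", root in parent_of)
```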

Narrative-Organised XML

    <biography>
      <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name>
      was born somewhere, and did nothing really meaningful before becoming
      a football player.

      After playing many years in minor teams, such as
      <football_team>Mantova</football_team>, he finally moved to
      <football_team>Bologna</football_team>, where he exploded to become one
      of the most respected leaders of the team, and also a member of the
      <football_team>Italian National Team</football_team>.
      …
    </biography>

- XML documents for written narrative, such as articles, reports, blogs, books, novels
  - elements with mixed content
  - not easy for automated processing and exchange

XML Attributes

- Elements can be labelled by attributes
  - attributes are specified in the start tag
    - and in the only tag of empty elements
  - any number of attributes can in principle be associated with an element
- An attribute is a name-value pair of the form name="value"
  - alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  - only one attribute with a given name is allowed per element
- Attributes do not change the tree structure of an XML document
  - but they are qualifiers for the nodes and leaves of the tree

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
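
A short sketch of the attribute rules, assuming Python's standard `xml.etree.ElementTree` parser: double quotes, single quotes and spaces around the equals sign are all read the same way, while a repeated attribute name makes the document ill-formed. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = """<player>
  <team current="yes">Bologna</team>
  <team current = 'no'>Mantova</team>
</player>"""

root = ET.fromstring(doc)

# Both quoting styles yield plain string values for the same attribute name.
teams = {team.text: team.get("current") for team in root.findall("team")}
print(teams)

# Two attributes with the same name in one element are rejected by the parser.
try:
    ET.fromstring('<team current="yes" current="no"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
print("duplicate attribute accepted:", duplicate_accepted)
```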

Using Elements or Attributes?

- Attributes are for meta-data about the element, and content is the information of the element
  - maybe, but then it is not easy to clearly distinguish between the two
- Element-based structure is more flexible than attribute-based
  - attributes provide a flat data structure, elements can be nested as needed
  - attributes are unique within an element, any number of elements of the same type can be used within an element
- Attributes are quite useful in narrative-based XML documents

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes" value="Bologna" />
      <team current="no" value="Mantova" />
    </player>

XML Names

- XML names are used, and follow the same rules, for the names of elements, attributes, and some other constructs
  - to increase efficiency and reduce complexity
- An XML name can include
  - any letter
    - Latin, or even non-Latin, like ideographs
  - any digit
  - underscore, hyphen and period (_ - .)
  - a colon (:) is reserved for namespaces
- An XML name may not include other punctuation signs, nor any sort of white space
  - and can begin only with a letter, an ideograph, or an underscore

Parsed Character Data

- An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  - so, for instance, < is always taken as the beginning of a tag
  - what if we need a < character in the document, as in JavaScript code?
- All characters are interpreted as character data to be parsed
  - unless the escape character & is encountered
  - character data to be parsed starts again after the ; character
- E.g., the content of the element

    <superheroes>Batman &amp; Robin</superheroes>

  becomes the parsed character data

    Batman & Robin

Entity References

- &entityreference;
  - an entity is something defined outside the normal flow of the XML document
    - out of the XML tree
  - used for constants, common values, external values, etc.
    - through an entity reference
- Users of any sort may define their own entities
  - we'll see how soon, for instance through DTDs

Pre-defined XML Entities

    Markup    Entity    Description
    &lt;      <         less-than
    &gt;      >         greater-than
    &amp;     &         ampersand
    &quot;    "         double quote
    &apos;    '         single quote

CDATA Sections

- Including code chunks from any language with < or & can be tedious
  - we need to tell the parser "do not parse this"
  - good, for instance, to include segments of XML code to show
- CDATA section
  - between <![CDATA[ and ]]>
  - can contain anything but its own closing delimiter ]]>
- After parsing, there is no way to tell whether a text came from a CDATA section or not
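
The last point can be checked directly: the same code fragment written once inside a CDATA section and once with entity references yields identical character data after parsing. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the fragment itself is made up.

```python
import xml.etree.ElementTree as ET

with_cdata = "<code><![CDATA[if (a < b && b < c) run();]]></code>"
with_escapes = "<code>if (a &lt; b &amp;&amp; b &lt; c) run();</code>"

from_cdata = ET.fromstring(with_cdata).text
from_escapes = ET.fromstring(with_escapes).text

# After parsing, both forms are the same plain string.
print(from_cdata)
print(from_cdata == from_escapes)
```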

Comments

- Easy: <!-- Comment -->
  - a comment cannot contain --, so it cannot end with --->
- Comments do not affect the document tree structure
  - they can appear anywhere, even before the root element
  - but not inside a tag or another comment
- Parsers may either drop or keep them, at will
- Comments are meant to improve human legibility of XML docs
  - to give info to a computational agent, use processing instructions

XML Processing Instructions

- Need to pass information for a given application through the parser
  - comments may disappear at any stage of the process
- Processing instructions serve this very purpose
  - <?target … ?>
- The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  - <?php … ?>
  - <?xml-stylesheet … ?>
- A processing instruction is markup, not an element
  - it can appear anywhere outside a tag, even before or after the root element
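
How comments and processing instructions reach an application can be sketched with Python's standard `xml.dom.minidom`, which happens to keep both as nodes (other parsers may drop comments). The stylesheet name `style.css` is a hypothetical placeholder.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    "<!-- a comment before the root -->"
    "<player>Carlo Nervo</player>"
)

# The XML declaration is not a node; the PI, the comment and the root element are.
for node in doc.childNodes:
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
        print("processing instruction:", node.target, "|", node.data)
    elif node.nodeType == node.COMMENT_NODE:
        print("comment:", node.data)
    elif node.nodeType == node.ELEMENT_NODE:
        print("root element:", node.tagName)
```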

The XML Declaration

- Looks like an XML processing instruction
  - but it is not: it is just the XML declaration
- It is optional
  - but if present, it must be the very first thing in the document
    - not even comments are allowed before it
- <?xml version="1.0" encoding="utf-8" standalone="no"?>
- version is the XML version (1.0, 1.1, …)
- encoding is the character encoding of the text (a Unicode encoding in the example)
  - optional, default Unicode
- standalone="yes" means that the document has no external DTD
  - optional, default no

Checking Well-Formedness

- Main rules
  - perfect match between start and end tags
  - no overlapping elements
  - one and only one root element
  - attribute values are always quoted
  - at most one attribute with a given name per element
  - neither comments nor processing instructions within tags
  - no unescaped < or & signs in the character data of elements or attributes
  - …
- Tools on the Web
  - just look around
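
Besides online tools, any XML parser can act as a well-formedness checker. The sketch below assumes Python's standard `xml.etree.ElementTree`; `is_well_formed` and the sample documents are illustrative, one per rule listed above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<player>Carlo Nervo</player>",   # fine
    "<player>Carlo Nervo</Player>",   # case mismatch between start and end tag
    "<em><b>Wrong </em> XML</b>",     # overlapping elements
    "<name/><surname/>",              # more than one root element
    "<team current=yes/>",            # unquoted attribute value
    "<team <!-- no --> />",           # comment inside a tag
    "<firm>R&D</firm>",               # unescaped ampersand
]

for sample in samples:
    print(is_well_formed(sample), sample)
```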
DTD

Flexibility or Rigidity?

- XML is flexible
  - whatever this means
  - but sometimes flexibility is not a feature within a given application scenario
- Sometimes some strict rule is required
  - some control over syntax should be enforced
    - like, a football player should have at least one team
- Document Type Definition (DTD)
  - to define which XML documents are valid
- Validity is not mandatory, unlike well-formedness
  - how to handle validity errors is optional

Validation

- A valid XML document includes a DTD that the document satisfies
- Main principle
  - everything not permitted is forbidden
  - that is, DTDs specify positive examples
- Everything in the XML document must match a DTD declaration
  - then the document is valid
  - otherwise the document is invalid
- There are many things a DTD does not say
  - we stick with what we can specify

DTD is…

- SGML-based
  - syntax a bit awkward
  - but, after all, easy to understand
  - and quite suited to short and expressive descriptions
- It allows XML designers to define a grammar for their documents
  - typical syntax-based approach
    - maybe limited, but easy to implement
- Maybe DTD is not the future of XML document validation
  - XML Schema should be that
  - but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

A Simple DTD Example

- We do not go too deep into DTD syntax
  - we just look at the example below and comment on it

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE football_player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

DTD Declaration

- In the example, the DTD is declared as internal
  - but it could be declared separately
    - <!DOCTYPE football_player SYSTEM "football_player.dtd">
  - even referring to an external, shared resource
    - <!DOCTYPE football_player SYSTEM "http://…">

DTD Declarations: Define or Use

- So, you may
  - define your own DTD, and
    - either include it in your XML document
    - or save it as an independent document and refer to it from one or more XML docs
  - or use an external DTD defined by someone else
    - like a working group you belong to, or a standardisation body of any sort
    - by referring to that externally-defined syntax in your XML docs

Element Declarations

- A player element contains one name, one surname, and one or more teams
  - in that precise order
  - and they are just parsed character data (#PCDATA)

    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>

Some Syntax

- , is for sequence
  - to define ordered lists
- | is for choice
  - to provide alternatives
- suffixes
  - * for zero or more occurrences
  - + for one or more occurrences
  - ? for zero or one occurrence
- parentheses for grouping
  - at any level of nesting
  - operators and suffixes are applicable at any level
- ANY for free-form content
- EMPTY for empty elements
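
Since sequence, choice and the occurrence suffixes are essentially regular-expression operators over child element names, a content model can be mimicked with an ordinary regular expression. The sketch below is not a DTD validator: it hand-translates the single content model (name, surname, team+) and ignores #PCDATA and attribute declarations. `PLAYER_MODEL` and `matches_player_model` are hypothetical names, and Python's bundled expat-based parsers are used only to check well-formedness.

```python
import re
import xml.etree.ElementTree as ET

# (name, surname, team+) rewritten by hand over space-terminated child names.
PLAYER_MODEL = re.compile(r"name surname (team )+")


def matches_player_model(xml_text):
    """Check only the order and number of the children of a player element."""
    root = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in root)
    return root.tag == "player" and PLAYER_MODEL.fullmatch(child_names) is not None


ok = "<player><name>Carlo</name><surname>Nervo</surname><team>Bologna</team></player>"
no_team = "<player><name>Carlo</name><surname>Nervo</surname></player>"
wrong_order = "<player><surname>Nervo</surname><name>Carlo</name><team>Bologna</team></player>"

for sample in (ok, no_team, wrong_order):
    print(matches_player_model(sample))
```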

Attribute Declarations

- A team element has a current attribute
  - which is mandatory (#REQUIRED)
    - #IMPLIED would say optional instead
  - and can be either yes or no
    - enumeration as an attribute type

    <!ATTLIST team current (yes | no) #REQUIRED>

Attribute Defaults

- #IMPLIED
  - the attribute is optional
- #REQUIRED
  - the attribute is mandatory
- #FIXED
  - whether it is explicitly specified or not, it has a given value
- literal
  - the default value is the literal, quoted string

Attribute Types

- CDATA
  - any string of text acceptable in a well-formed XML attribute value
- NMTOKEN, NMTOKENS
  - more permissive than an XML name: any name character is accepted as the first character
  - the plural form accepts more than one, separated by white space
- ENTITY, ENTITIES
  - name(s) of unparsed entities declared elsewhere in the document
- ID
  - an XML name, unique in the document, working as an identifier
- IDREF, IDREFS
  - reference(s) to IDs in the document
- NOTATION
  - name of a notation used & defined in the document (rare)
- enumeration

Other DTD Declarations etc.

- ENTITY declarations
  - <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
- NOTATION declarations
  - who cares, actually
- We stop here
  - more only for those who need it

Namespaces

What are Namespaces for?

- Distinguish
  - different XML applications may use the same names
    - at any scale, from personal to world-wide
  - a namespace allows them to be clearly distinguished
- Group
  - names of elements and attributes of the same XML application can be grouped together
    - to be more easily recognised and handled
- Example: set is an element in both SVG and MathML applications
  - what if I have to use them together?
  - namespaces can be used to disambiguate names

Syntax for Namespace Use

- Qualified names
  - prefix:local_part
- Examples of qualified names
  - or QNames, or raw names
  - rdf:description, xlink:type, xsl:template
- Used for both element and attribute names

Associating Prefixes to URIs

- Example
  - a large firm could have a number of namespaces for different purposes

      <company xmlns:local="http://www.company.it/xml"
               xmlns:euro="http://www.company.eu/xml"
               xmlns:world="http://www.company.com/xml">

  - then you can use local, euro and world everywhere as prefixes
  - typically declared in the topmost element, but could be declared anywhere
  - example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
- URIs are standardised, not prefixes
  - but usually svg, rdf and other prefixes are not re-defined
  - also, URIs are just conventional names
    - not necessarily pointing to an actual resource

Setting Default Namespaces

- xmlns attribute
  - alone, with no prefix suffix

      <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>

  - all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
  - no need to make the svg prefix explicit
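
What a namespace-aware parser makes of prefixes and default namespaces can be sketched with Python's standard `xml.etree.ElementTree`, which reports every qualified name in the form {URI}local_part. The small SVG fragment and the XLink attribute are made up for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

doc = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink" width="10" height="10">
  <a xlink:type="simple"><set/></a>
</svg>"""

root = ET.fromstring(doc)

# Every element inherits the default namespace; prefixes themselves disappear.
tags = [element.tag for element in root.iter()]
print(tags)

# Prefixed attributes carry their namespace URI, unprefixed ones carry none.
link = root.find("{%s}a" % SVG_NS)
print(link.attrib)
print(root.get("width"))
```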
Internationalisation

What does Text Mean?

- "Text" can be encoded according to many different alphabets
  - mapping between characters and integers (code points)
    - character set
  - ASCII being the most (in)famous, now Unicode
- A character encoding determines how code points are mapped onto bytes
  - so a character set can have multiple encodings
  - UTF-8 and UTF-16 are both Unicode encodings
- Any XML document is a text document
  - so its encoding should be declared
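
The difference between code points and encodings, and the reason for declaring the encoding, can be seen on a single word. This is a small sketch using only Python built-ins and the standard `xml.etree.ElementTree`; the sample word and the `uni` element are arbitrary choices.

```python
import xml.etree.ElementTree as ET

text = "Università"

# One code point per character, independent of any encoding.
code_points = [ord(ch) for ch in text]
print(code_points)

# The same code points map onto different byte sequences in different encodings.
encodings = {name: text.encode(name) for name in ("utf-8", "utf-16-be", "iso-8859-1")}
for name, data in encodings.items():
    print(name, len(data), "bytes")

# A Latin-1 document that declares its encoding is decoded correctly by the parser.
latin1_doc = ('<?xml version="1.0" encoding="iso-8859-1"?>'
              "<uni>" + text + "</uni>").encode("iso-8859-1")
parsed = ET.fromstring(latin1_doc).text
print(parsed == text)
```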

The XML Encoding Part of the XML Declaration

- <?xml version="1.0" encoding="utf-8" standalone="no"?>
- Most common values
  - utf-8, utf-16 (Unicode)
  - ISO-8859-1 (Latin-1)
- See also XML-defined character sets
  - Unicode and ISO are the most used families
- Used also for external parsed entities
  - like DTD fragments or XML chunks
  - which may have different encodings
  - there, version may be dropped
    - it is a text declaration, but no longer an XML declaration

Multi-Lingual Documents

- Example: a spell-checker or a voice-reader parsing an XML doc
- How to determine the language of a subpart?
  - for multi-lingual docs
- xml:lang attribute
  - can be associated with any element
  - determines the language of the element
- Values are to be found in ISO 639
  - standard two-letter code for each known language
  - if not there, IANA
    - prefix i-
    - such as i-navajo, i-klingon, …
  - if not there either, such as for user-defined tags
    - prefix x-
    - such as x-quenya
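
A spell-checker would need the language in force for each element. The sketch below assumes Python's standard `xml.etree.ElementTree`, which reports xml:lang under the reserved XML namespace URI, and relies on the XML rule that a declared language also applies to nested elements unless overridden. The `book` document and the `languages` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound by definition to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = """<book xml:lang="en">
  <title>XML Concepts</title>
  <quote xml:lang="it">Sistemi Distribuiti</quote>
</book>"""


def languages(element, inherited=None):
    """Return (tag, language) pairs, passing the language down to children."""
    lang = element.get(XML_LANG, inherited)
    result = [(element.tag, lang)]
    for child in element:
        result.extend(languages(child, lang))
    return result


found = languages(ET.fromstring(doc))
print(found)
```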

Encoding for Portability

- Working around encoding is not simply an "internationalisation" issue
  - it is also about portability
- When transmitting / communicating through text-based files, many errors typically occur
  - which are often not easy to catch
- XML's ability to
  - handle encoding precisely and accurately
  - embody encoding information within each document
- makes it a powerful tool for easy and hassle-free portability
  - across platforms, across applications, across time

XML amp CSS

XML on Browsers

- Different experiences with different browsers
  - when trying to visualise an XML document
- XML, however, can be transformed
  - to become easier to handle by standard browsers
- Two main approaches
  - Web-based one: XML + CSS
  - XML-based one: XSL
- In the following, we explore the XML + CSS issue

Cascading Style Sheets

- Cascading Style Sheets (CSS)
  - a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents
- Standard W3C
  - http://www.w3.org/Style/CSS/
- Goals
  - describing how to present elements of a document
    - spanning over a range of different media
  - separating style description from content and structure
- In this course we assume that you already know the basics
  - if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

[Figure: CSS example]

XML + CSS

- Any XML document can be prepared for browser visualisation via CSS
- Two things needed
  - a CSS style sheet referring to the proper element types of the XML document
  - the association between the XML document and the CSS style sheet
- Processing instruction
  - to associate CSS with XML

      <?xml-stylesheet type="text/css" href="filename.css" ?>

- CSS style sheet defining the presentation style for the XML document tags

      tagname { property1: value1; … }

- No need for DTD or Schema
  - even though the browser could complain anyway…

XML + CSS Example: The XML Doc

[Figure: the example XML document]

Example: How Mozilla Visualises It (without CSS Style Sheet)

[Figure: Mozilla screenshot]

Example: How Mozilla Visualises It (with CSS Style Sheet)

[Figure: Mozilla screenshot]

DOM amp SAX

Manipulating XML

- Representing information in an XML document
  - and presenting it somehow
  - is not enough for most non-trivial application scenarios
- We often need to manipulate
  - access, delete, modify
  - parts of an XML document
    - which may or may not be an XML file
- This is typically done through programming languages of many sorts
  - through ad hoc APIs
- The most used / hated / deprecated / widespread are
  - DOM
  - SAX

Document Object Model

- http://www.w3.org/DOM/
  - standard W3C, as usual
- "The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
- It applies to HTML as well as XML
- It is essentially an API
  - standardised for Java & ECMAScript
  - but can be extended to other languages
- There is no time here to go deep into DOM
  - we just try to understand its nature, goals and scope

DOM & Levels

- DOM views an XML tree as a data structure
  - similar to the DOM from JavaScript
- DOM loads the whole XML document in memory to manipulate it
  - possibly huge memory consumption
- It is quite large and complex…
  - Level 1 Core: W3C Recommendation, October 1998
    - primitive navigation and manipulation of XML trees
    - other Level 1 parts: HTML
  - Level 2 Core: W3C Recommendation, November 2000
    - adds Namespace support and minor new features
    - other Level 2 parts: Events, Views, Style, Traversal and Range
  - Level 3 Core: W3C Working Draft, April 2002
    - adds minor new features

DOM Nodes

- An XML document is a tree
- The tree contains nodes
  - one of them is the root node
  - nodes possibly have siblings, children, one parent, content, tag, etc.
- The DOM specification states that a node can be a
  - document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
- It also defines which kinds of child nodes each of them should / could have

Properties & Methods of DOM

- Every DOM node has properties and methods to explore and update the XML tree
- Every DOM node has a name, a value, a type
- There are general properties and methods for all kinds of nodes
  - attributes returns all the attributes of the node
  - appendChild(newChild) appends newChild after the other child nodes
- Then, any specific kind of node has its own specific properties and methods
- These properties and methods are made available by the suitable API for the language of choice
  - many solutions for Java
  - see for instance http://java.sun.com/xml/jaxp/
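
The general properties and methods named above (name, value, type, attributes, appendChild) look much the same in any DOM binding. The sketch below uses Python's standard `xml.dom.minidom` as one such binding, purely as an assumption for illustration; the document and the new team node are made up.

```python
from xml.dom import minidom

doc = minidom.parseString('<player><team current="yes">Bologna</team></player>')
player = doc.documentElement
team = player.firstChild

# Every node has a name, a value and a type; element nodes have no value.
print(team.nodeName, team.nodeType == team.ELEMENT_NODE, team.nodeValue)

# The text content is a child node of its own, named #text.
print(team.firstChild.nodeName, team.firstChild.nodeValue)

# attributes gives access to all the attributes of the node.
print(team.attributes["current"].value)

# appendChild adds the new node after the existing children.
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
player.appendChild(new_team)
print(player.toxml())
```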
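
A Simple Java DOM Example

    // Imports needed by this fragment. DOMParser is assumed here to be the
    // Apache Xerces class; the slide does not name the library.
    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
      try {
        // Parse the file named on the command line into an in-memory DOM tree
        DOMParser p = new DOMParser();
        p.parse(args[0]);
        Document doc = p.getDocument();
        // Scan the children of the root element for the first "recipe" node
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
          n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
          print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
        out.println("</collection>");
      } catch (Exception e) {
        e.printStackTrace();
      }
    }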

Main Problem of DOM

- The XML document is loaded as a whole and handled altogether in memory
  - it might be time-consuming and difficult to manage
  - wouldn't it be better if we could load only the part we are actually manipulating?
- This is the motivation behind SAX
  - which did not start as a standard
  - has problems of acceptance
  - but has indeed a long tail of followers
  - and also its good reasons to exist

Simple API for XML (SAX)

- Differently from DOM, SAX is event-based
- It sees the document not as a tree, but as a text doc
  - flowing through the SAX parser
  - and generating events as it goes: document started / ended, element started / ended, character content, etc.
- A very simple model
  - good for simple applications
  - and also to avoid memory abuse
- Not as well supported as DOM
  - in terms of standardisation
  - as well as of tools
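
The event-based model can be made concrete with a handler that just records the events it receives. The sketch assumes Python's standard `xml.sax` module as the SAX binding; `EventLog` is a hypothetical class name. No tree is built: the application sees only the stream of callbacks.

```python
import xml.sax


class EventLog(xml.sax.ContentHandler):
    """Record each SAX event as a short string, in the order it arrives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start element: " + name)

    def endElement(self, name):
        self.events.append("end element: " + name)

    def characters(self, content):
        # Character data may arrive in several chunks; skip whitespace-only ones.
        if content.strip():
            self.events.append("characters: " + content.strip())


handler = EventLog()
xml.sax.parseString(
    b"<player><name>Carlo</name><surname>Nervo</surname></player>", handler
)
print("\n".join(handler.events))
```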
Outline
Introducing XMLXML FundamentalsDocument Types Definitions (DTDs)NamespacesInternationalisationXML amp CSSDOM amp SAX
2
Introducing XML
What is XMLA W3C Standardhttpwwww3orgXML
A mark-up language for text documentsderived from SGML (Standard General Markup Language)
ISO 8879 httpwwwisochcated16387htmleXtensible Markup Language
A meta-markup languageto define markup languagessuch as XHTML XSLT XML Schemahellip
A formally-defined text-based languageverifiable for well-formedness and validityusable across platform and technologies
4
What XML is not
XML is nota programming languagea network-transport protocola document presentation languagea database (manager)
It can be used (and it is actually) in all of those contexts but it remains a markup language
5
Why Markup LanguagesMarkup
encoding embodied in the document specifying document properties as well as properties of information contained
for instance formatting instructionsmore generally structural semantic information
knowledge vs dataMarks Markups
tag used to qualify label text chunkseg HTML tags
XML example ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt0000145678ltstudentnumbergt ltcoursegt2036ltcoursegt ltstudentgt
6
XML X for eXtensibilityBasic idea of XML
a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages
ThenXML is quite free in generalit can be ldquoextended
actually specialisedto define more specific ad hoc markup languages
No predefined XML markups as it happens instead in HTMLthey need to be defined
who does define themcan we do this how
7
Hey too many Languages already
Application domains are more and morenumerouscomplexspecific
Special specialised languages as the engineers toolsto represent denote amp express behaviours and computations
Engineers working with computational ICT systems will be called to use a number of different artificial languages but also
to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages
ldquoLaurea Specialistica in InformaticardquoldquoLinguaggi e modelli computazionalirdquo ldquoIngegneria del SWrdquo
8
XML Applications
XML per se is ldquosmallrdquo amp simplelanguages defined via XML are instead so many and complex
XML ApplicationsXML-defined markup languages
defined through a precise syntaxDTD or XML Schema
they may be either standard or customMost standard XML applications are W3C
such asXSLTXML SchemaXHTML
9
XML for Portable Data
Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format
Text text textboth data and markupall in the XML file
XML document structure simple amp cleareasy to parsewell-documented
That is why XML is already everwhere
10
How XML Looks likeltxml version=10 encoding=utf-8gt
ltdocrootgt
ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt
ltbodygt
ltpgtA list of things I likeltpgt
ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt
ltbodygtltdocrootgt
11
How XML Looks like from a Browser
12
How to Work with XML
XML is textso any text-editor is perfectly fine
A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better
Visualisation is a different matterbrowsers do something
but XML is not a presentation language sohellipwe need to understand
what an XML document ishow XML works
13
What is an XML Document
It can beA text fileA record in a databaseA run-time construction in memoryhellip
In any case it can be handled and trasmitted by any system capable of dealing with text documents
ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt
14
How does XML WorkWho handles XML documents
after it has been producedhow why
XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax
XML validating parserswhen applicable
there is either a DTD or a Schemachecking validity
Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc
15
Where is XML actually used
Everywhere already
16
Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from
HTML was the first successful application of SGMLbut had obvious limitations
too complexmore than 150 pagesnever implemented fully
too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)
XML 10 (February 1998)Then a flow
namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc
17
XML Fundamentals
A Simple XML Document
ltplayergt Carlo Nervoltplayergt
19
XML Document amp Files
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp Tags
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervotags are markup
the most common form of markup but there are other kindscontent is character data
including the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

XML Trees: A Simple Example
player
  name: Carlo
  surname: Nervo
  team: Bologna
  team: Mantova
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>
An XML Document is an XML Tree
An XML Document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
  each element, except the root, has exactly one parent
  child elements from the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed

Narrative-Organised XML
<biography>
<name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
…
</biography>
XML Documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and leaves of the tree
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

Using Elements or Attributes
Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure; elements can be nested as needed
  attributes are unique within an element; any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>

XML Names
XML Names are used, and are the same, for the names of elements, attributes, and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen, and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs, or underscore

Parsed Character Data
An XML Parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

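The escaping rule above can be checked with any XML parser. The following is a minimal sketch, assuming the JAXP DOM parser bundled with the JDK; the class name ParsedCharacterData and the helper rootText are illustrative names, not part of the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustrative class (hypothetical name): shows what the parser hands back
// once entity references in element content have been resolved.
class ParsedCharacterData {

    /** Parses an XML string and returns the text content of its root element. */
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Any parse failure (e.g. an unescaped & or <) ends up here.
            throw new IllegalArgumentException("could not parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // &amp; in the document becomes a plain & in the parsed character data.
        System.out.println(rootText("<superheroes>Batman &amp; Robin</superheroes>"));
        // &lt; lets a < appear in content without starting a tag.
        System.out.println(rootText("<code>if (a &lt; b) return;</code>"));
    }
}
```

The application never sees the escapes: it only receives the resolved characters.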
Entity References
&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

Pre-defined XML Entities
Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote

CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good, for instance, to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own closing delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not

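To see that a CDATA section and the equivalent escaped text end up as the same character data, here is a small sketch, again assuming the JDK's JAXP DOM parser. The class and method names are made up for illustration; the coalescing option is what asks the parser to deliver CDATA as ordinary text nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative class (hypothetical name): CDATA versus escaped content.
class CdataDemo {
    static final String VIA_CDATA =
            "<example><![CDATA[<player>Carlo & Nervo</player>]]></example>";
    static final String VIA_ESCAPES =
            "<example>&lt;player&gt;Carlo &amp; Nervo&lt;/player&gt;</example>";

    private static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Coalescing: CDATA sections are converted to plain text nodes.
            factory.setCoalescing(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + e.getMessage(), e);
        }
    }

    /** Character data of the root element, however it was written in the source. */
    static String rootText(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    /** True if the first child of the root is an ordinary text node. */
    static boolean firstChildIsPlainText(String xml) {
        Node child = parse(xml).getDocumentElement().getFirstChild();
        return child != null && child.getNodeType() == Node.TEXT_NODE;
    }

    public static void main(String[] args) {
        System.out.println(rootText(VIA_CDATA));
        System.out.println(rootText(VIA_CDATA).equals(rootText(VIA_ESCAPES)));
        System.out.println(firstChildIsPlainText(VIA_CDATA));
    }
}
```

Without coalescing, a DOM parser may still expose a separate CDATA section node, so whether the origin is visible depends on parser configuration.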
Comments
Easy
  <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to a computational agent, use processing instructions

XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside of a tag, even before or after the root

The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it must be the very first thing in the document
    not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8, or UTF-16 with a byte-order mark)
Standalone="yes" means that the document does not depend on an external DTD
  optional, default no

Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around

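A non-validating parser is enough to check these rules. The sketch below assumes the JAXP DOM parser that ships with the JDK; WellFormedCheck and isWellFormed are illustrative names. It simply reports whether parsing succeeds.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative class (hypothetical name): well-formedness check by parsing.
class WellFormedCheck {

    /** Returns true if the string parses as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness violation was reported
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<player>Carlo Nervo</player>",           // fine
            "<player>Carlo Nervo</Player>",           // case mismatch
            "<em><b>Wrong </em> XML</b>",             // overlapping elements
            "<a/><b/>",                               // two root elements
            "<team current=yes>Bologna</team>",       // unquoted attribute value
            "<team current=\"yes\" current=\"no\"/>"  // duplicate attribute
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```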
DTD
Flexibility or Rigidity?
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  how to handle validity errors is optional

Validation
A valid XML Document includes (or refers to) a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

A Simple DTD Example
We do not go too deep into DTD syntax
  we just look at the following example and comment
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

DTD Declaration
DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

DTD Declarations: Define or Use
So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty element

Attribute Declarations
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

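The element and attribute declarations of the player DTD can be exercised with a validating parser. This is a sketch under stated assumptions: it uses the JDK's JAXP DOM parser with validation switched on, and the class DtdValidation, its helpers, and the sample bodies are illustrative names and data. Validity errors are collected rather than thrown, since a validating parser reports them as recoverable errors.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative class (hypothetical name): DTD validation of the player document.
class DtdValidation {
    static final String VALID_BODY =
            "<player><name>Carlo</name><surname>Nervo</surname>"
          + "<team current=\"yes\">Bologna</team>"
          + "<team current=\"no\">Mantova</team></player>";

    /** Prepends the internal DTD subset; doctypeName is the name after <!DOCTYPE. */
    static String withDtd(String doctypeName, String body) {
        return "<!DOCTYPE " + doctypeName + " [\n"
             + "  <!ELEMENT player (name, surname, team+)>\n"
             + "  <!ELEMENT name (#PCDATA)>\n"
             + "  <!ELEMENT surname (#PCDATA)>\n"
             + "  <!ELEMENT team (#PCDATA)>\n"
             + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
             + "]>\n" + body;
    }

    /** Returns the validity (or well-formedness) problems found; empty means valid. */
    static List<String> validate(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation is off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage()); // validity errors are recoverable
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            errors.add("not well-formed: " + e.getMessage());
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validate(withDtd("player", VALID_BODY)));
        // No team element: violates (name, surname, team+).
        System.out.println(validate(withDtd("player",
                "<player><name>Carlo</name><surname>Nervo</surname></player>")));
        // DOCTYPE name differs from the root element: also a validity error.
        System.out.println(validate(withDtd("football_player", VALID_BODY)));
    }
}
```

Note that validity also requires the name in the DOCTYPE declaration to match the root element, which is why the last call reports an error.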
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string

Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to IDs in the document
NOTATION
  name of a notation used & defined in the document (rare)
enumeration

Other DTD Declarations etc.
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually?
We stop here
  more only for those who need it

Namespaces
What are Namespaces for?
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespace Use
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes to URI
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world as prefixes anywhere inside that element
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but usually svg, rdf and other prefixes are not re-defined
  also, they are conventional names
    not necessarily pointing to an actual resource

Setting Default Namespaces
xmlns attribute
  alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit

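The disambiguation of set between SVG and MathML, and the effect of a default namespace, can be observed from a namespace-aware parser. A minimal sketch, assuming the JDK's JAXP DOM parser; NamespaceDemo, count, and the two sample documents are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustrative class (hypothetical name): counting elements per namespace.
class NamespaceDemo {
    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    // Same local name "set", two different namespaces, told apart by prefix.
    static final String MIXED =
            "<doc xmlns:svg=\"" + SVG + "\" xmlns:m=\"" + MATHML + "\">"
          + "<svg:set/><m:set/><m:set/></doc>";

    // Default namespace: no prefix needed on svg or on its children.
    static final String DEFAULTED =
            "<svg xmlns=\"" + SVG + "\"><set/></svg>";

    /** Counts elements with the given namespace URI and local name. */
    static int count(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP ignores namespaces by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SVG set:    " + count(MIXED, SVG, "set"));
        System.out.println("MathML set: " + count(MIXED, MATHML, "set"));
        System.out.println("default-namespace set: " + count(DEFAULTED, SVG, "set"));
    }
}
```

What identifies an element is the pair of namespace URI and local name; the prefix is only a local shorthand, so the URI has to match exactly.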
Internationalisation
What does Text Mean?
"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared

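The difference between a character set and its encodings can be made concrete with a single character. The sketch below uses only the standard Java charset API; EncodingDemo and byteLength are illustrative names, and the letter à (code point U+00E0) is just a convenient example.

```java
import java.nio.charset.Charset;

// Illustrative class (hypothetical name): one code point, several encodings.
class EncodingDemo {

    /** Number of bytes the text occupies in the named character encoding. */
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String aGrave = "\u00e0"; // the letter a with grave accent
        System.out.println("code point: U+00"
                + Integer.toHexString(aGrave.codePointAt(0)).toUpperCase());
        // Same character, same code point, different byte sequences.
        for (String name : new String[] {"ISO-8859-1", "UTF-8", "UTF-16BE"}) {
            System.out.println(name + ": " + byteLength(aGrave, name) + " byte(s)");
        }
    }
}
```

A reader that guesses the wrong encoding will misinterpret those bytes, which is why the declaration matters.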
The XML Encoding Part of the XML Declaration
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration

Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs?
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
    such as x-quenya

Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example [figure]

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
No need for DTD or Schema
  even though the browser could anyway complain…

XML + CSS Example: The XML Doc [figure]
Example: How Mozilla Visualises it [without CSS Style Sheet] [figure]
Example: How Mozilla Visualises it [with CSS Style Sheet] [figure]

DOM & SAX
Manipulating XML
Representing information in an XML Document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
More often than not, we need to manipulate
  access, delete, modify
  parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM
  SAX

Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
    other Level 2 parts: Events, Views, Style, Traversal and Range
  Level 3 Core: W3C Recommendation, April 2004 (first Working Drafts in 2002)
    adds minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have

Properties & Methods of DOM
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
  see for instance http://java.sun.com/xml/jaxp

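As an illustration of the general properties and methods just listed, the sketch below reads attributes and appends a child on the player document. It assumes the JAXP DOM API bundled with the JDK; the class DomDemo, its helper methods, and the added team name "Example FC" are all made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative class (hypothetical name): exploring and updating a DOM tree.
class DomDemo {
    static final String PLAYER =
            "<player><name>Carlo</name><surname>Nervo</surname>"
          + "<team current=\"yes\">Bologna</team>"
          + "<team current=\"no\">Mantova</team></player>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + e.getMessage(), e);
        }
    }

    /** Text of the team whose current attribute is "yes", or null if none. */
    static String currentTeam(Document doc) {
        NodeList teams = doc.getElementsByTagName("team");
        for (int i = 0; i < teams.getLength(); i++) {
            Node team = teams.item(i);
            // getAttributes() is the "attributes" property of the node.
            Node current = team.getAttributes().getNamedItem("current");
            if (current != null && "yes".equals(current.getNodeValue())) {
                return team.getTextContent();
            }
        }
        return null;
    }

    /** Appends a new team element after the existing children of the root. */
    static Document addTeam(Document doc, String name, String current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.setTextContent(name);
        doc.getDocumentElement().appendChild(team);
        return doc;
    }

    /** Names of all team elements, in document order. */
    static List<String> teamNames(Document doc) {
        List<String> names = new ArrayList<>();
        NodeList teams = doc.getElementsByTagName("team");
        for (int i = 0; i < teams.getLength(); i++) {
            names.add(teams.item(i).getTextContent());
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse(PLAYER);
        System.out.println(currentTeam(doc));
        System.out.println(teamNames(addTeam(doc, "Example FC", "no")));
    }
}
```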
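A Simple Java DOM
import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser;  // Xerces parser, as used on the slide
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class FirstRecipe {  // class name assumed, not shown on the slide
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      Node n = doc.getDocumentElement().getFirstChild();
      // skip siblings until the first recipe element is found
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
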
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools

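The event-based model can be sketched as a handler that logs each event as the document flows through the parser. This assumes the SAX parser bundled with the JDK; SaxDemo, events, and the sample document are illustrative. No tree is built: the handler only sees one event at a time.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative class (hypothetical name): logging SAX events in order.
class SaxDemo {
    static final String PLAYER =
            "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>";

    /** Parses the document and returns the sequence of SAX events it generated. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override public void startDocument() { log.add("startDocument"); }
                        @Override public void endDocument() { log.add("endDocument"); }
                        @Override public void startElement(String uri, String localName,
                                String qName, Attributes attributes) {
                            log.add("start " + qName);
                        }
                        @Override public void endElement(String uri, String localName,
                                String qName) {
                            log.add("end " + qName);
                        }
                        @Override public void characters(char[] ch, int start, int length) {
                            // A parser may deliver long text in several chunks;
                            // for this short sample one chunk per element is assumed.
                            String text = new String(ch, start, length).trim();
                            if (!text.isEmpty()) log.add("text " + text);
                        }
                    });
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + e.getMessage(), e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events(PLAYER)) {
            System.out.println(event);
        }
    }
}
```

Because nothing is retained between events unless the handler stores it, memory use stays flat even for large documents, at the price of having to track context by hand.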
Introducing XML
What is XMLA W3C Standardhttpwwww3orgXML
A mark-up language for text documentsderived from SGML (Standard General Markup Language)
ISO 8879 httpwwwisochcated16387htmleXtensible Markup Language
A meta-markup languageto define markup languagessuch as XHTML XSLT XML Schemahellip
A formally-defined text-based languageverifiable for well-formedness and validityusable across platform and technologies
4
What XML is not
XML is nota programming languagea network-transport protocola document presentation languagea database (manager)
It can be used (and it is actually) in all of those contexts but it remains a markup language
5
Why Markup LanguagesMarkup
encoding embodied in the document specifying document properties as well as properties of information contained
for instance formatting instructionsmore generally structural semantic information
knowledge vs dataMarks Markups
tag used to qualify label text chunkseg HTML tags
XML example ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt0000145678ltstudentnumbergt ltcoursegt2036ltcoursegt ltstudentgt
6
XML X for eXtensibilityBasic idea of XML
a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages
ThenXML is quite free in generalit can be ldquoextended
actually specialisedto define more specific ad hoc markup languages
No predefined XML markups as it happens instead in HTMLthey need to be defined
who does define themcan we do this how
7
Hey too many Languages already
Application domains are more and morenumerouscomplexspecific
Special specialised languages as the engineers toolsto represent denote amp express behaviours and computations
Engineers working with computational ICT systems will be called to use a number of different artificial languages but also
to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages
ldquoLaurea Specialistica in InformaticardquoldquoLinguaggi e modelli computazionalirdquo ldquoIngegneria del SWrdquo
8
XML Applications
XML per se is ldquosmallrdquo amp simplelanguages defined via XML are instead so many and complex
XML ApplicationsXML-defined markup languages
defined through a precise syntaxDTD or XML Schema
they may be either standard or customMost standard XML applications are W3C
such asXSLTXML SchemaXHTML
9
XML for Portable Data
Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format
Text text textboth data and markupall in the XML file
XML document structure simple amp cleareasy to parsewell-documented
That is why XML is already everwhere
10
How XML Looks likeltxml version=10 encoding=utf-8gt
ltdocrootgt
ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt
ltbodygt
ltpgtA list of things I likeltpgt
ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt
ltbodygtltdocrootgt
11
How XML Looks like from a Browser
12
How to Work with XML
XML is textso any text-editor is perfectly fine
A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better
Visualisation is a different matterbrowsers do something
but XML is not a presentation language sohellipwe need to understand
what an XML document ishow XML works
13
What is an XML Document
It can beA text fileA record in a databaseA run-time construction in memoryhellip
In any case it can be handled and trasmitted by any system capable of dealing with text documents
ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt
14
How does XML WorkWho handles XML documents
after it has been producedhow why
XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax
XML validating parserswhen applicable
there is either a DTD or a Schemachecking validity
Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc
15
Where is XML actually used
Everywhere already
16
Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from
HTML was the first successful application of SGMLbut had obvious limitations
too complexmore than 150 pagesnever implemented fully
too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)
XML 10 (February 1998)Then a flow
namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc
17
XML Fundamentals
A Simple XML Document
ltplayergt Carlo Nervoltplayergt
19
XML Document amp Files
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp Tags
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervotags are markup
the most common form of markup but there are other kindscontent is character data
including the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is an XML Tree
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed24
Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributesattributes are specified in the start tag
and in the only tag of empty elementsany number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or Attributes
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore28
Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML Entities
Markup Entity Descriptionamplt lt less-thenampgt gt grater-than
ampamp amp ampersandampquot double quoteampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
Comments
Easylt-- Comment --gt
It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Formedness
  Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or in attribute values
    …
  Tools on the Web
    just look around
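Besides the tools on the Web, any XML parser is a well-formedness checker. The sketch below assumes the non-validating JAXP DOM parser from the JDK; WellFormedCheck and isWellFormed are hypothetical names. It simply reports whether parsing succeeds, and the sample calls exercise several of the rules listed above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    /** True if the text parses as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            b.setErrorHandler(new DefaultHandler()); // rethrows fatal errors silently
            b.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser found a well-formedness error
        } catch (Exception e) {
            throw new IllegalStateException(e); // parser set-up or I/O problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<player><name>Carlo</name></player>")); // fine
        System.out.println(isWellFormed("<em><b>Wrong</em> XML</b>"));           // overlapping
        System.out.println(isWellFormed("<team current=yes>Bologna</team>"));    // unquoted value
        System.out.println(isWellFormed("<p>Batman & Robin</p>"));               // unescaped &
    }
}
```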
DTD
Flexibility or Rigidity
  XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
  Sometimes strict rules are required
    some control over syntax should be enforced
    for instance, a football player should have at least one team
  Document Type Definition (DTD)
    defines which XML documents are valid
  Validity is not mandatory, unlike well-formedness
    how to handle validity errors is left to the application
Validation
  A valid XML document includes, or refers to, a DTD that the document satisfies
  Main principle
    everything not permitted is forbidden
    that is, a DTD lists only what is allowed
  Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
  There are many things a DTD cannot say
    we stick with what we can specify
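A sketch of validation in practice, assuming the validating mode of the JAXP DOM parser from the JDK; DtdValidation, validityErrors and PLAYER_DTD are hypothetical names, and the DTD is the football player one used in this part of the course. Validity errors are recoverable, so the parser reports them to an error handler instead of stopping; an empty list means the document is valid.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class DtdValidation {
    static final String PLAYER_DTD =
            "<!DOCTYPE player [\n"
          + "  <!ELEMENT player (name, surname, team+)>\n"
          + "  <!ELEMENT name (#PCDATA)>\n"
          + "  <!ELEMENT surname (#PCDATA)>\n"
          + "  <!ELEMENT team (#PCDATA)>\n"
          + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
          + "]>\n";

    /** Returns the validity errors reported by the parser (empty list = valid). */
    static List<String> validityErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setValidating(true); // validate against the DTD in the DOCTYPE
            DocumentBuilder b = f.newDocumentBuilder();
            b.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) { // recoverable: validity error
                    errors.add(e.getMessage());
                }
            });
            b.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            errors.add("not well-formed: " + e.getMessage());
        }
        return errors;
    }

    public static void main(String[] args) {
        // A player without any team: well-formed, but not valid
        String noTeam = PLAYER_DTD + "<player><name>Carlo</name><surname>Nervo</surname></player>";
        validityErrors(noTeam).forEach(System.out::println);
    }
}
```

Note that the name in the DOCTYPE has to match the root element for the document to be valid, which is why the sketch declares player rather than any other name.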
DTD is…
  SGML-based
    the syntax is a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
  It allows XML designers to define a grammar for their documents
    a typical syntax-based approach
    maybe limited, but easy to implement
  Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them and how to write your own is likely to remain useful, or even necessary, for a while still
A Simple DTD Example
  We do not go too deep into DTD syntax
    we just look at the example below and comment on it

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
DTD Declaration
  In the example above the DTD is declared as internal
    but it could be declared separately
      <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
      <!DOCTYPE player SYSTEM "http://…">
DTD Declarations: Define or Use
  So you may
    define your own DTD and
      either include it in your XML document
      or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
      like a working group you belong to, or a standardisation body of any sort
      by referring to that externally-defined syntax from your XML docs
Element Declarations
  A player element contains one name, one surname and one or more teams
    in that precise order
    and each of them contains just parsed character data (#PCDATA)
Some Syntax
  , is for sequence
    to define ordered lists
  | is for choice
    to provide for alternatives
  suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
  parentheses for grouping
    at any level of nesting
    operators and suffixes are applicable at any level
  ANY for free-form content
  EMPTY for empty elements
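The operators above can be explored with a sketch that builds a tiny DTD around a content model and asks a validating parser whether a sequence of children conforms. It assumes the JAXP validating parser from the JDK; ContentModelDemo, conforms and the element names doc, a, b, c are invented for the purpose.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class ContentModelDemo {
    /** True if a doc element with the given children is valid against the content model. */
    static boolean conforms(String model, String children) {
        String xml = "<!DOCTYPE doc [<!ELEMENT doc " + model + ">"
                + "<!ELEMENT a EMPTY><!ELEMENT b EMPTY><!ELEMENT c EMPTY>]>"
                + "<doc>" + children + "</doc>";
        boolean[] valid = {true};
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setValidating(true);
            DocumentBuilder b = f.newDocumentBuilder();
            b.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) { // validity error
                    valid[0] = false;
                }
            });
            b.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            return false; // not even well-formed
        }
        return valid[0];
    }

    public static void main(String[] args) {
        System.out.println(conforms("(a, b+)", "<a/><b/><b/>"));                 // sequence, one or more
        System.out.println(conforms("(a, b+)", "<b/><a/>"));                     // wrong order
        System.out.println(conforms("(a, (b | c)*, a?)", "<a/><c/><b/><a/>"));   // nested groups
    }
}
```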
Attribute Declarations
  A team element has a current attribute
    which is mandatory (#REQUIRED)
      #IMPLIED would make it optional instead
    and can be either yes or no
      an enumeration as the attribute type
Attribute Defaults
  #IMPLIED
    the attribute is optional
  #REQUIRED
    the attribute is mandatory
  #FIXED
    whether or not it is explicitly specified, the attribute always has the given value
  literal
    the default value is the literal quoted string
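A sketch of what the application actually sees for each kind of default, assuming the JAXP DOM parser from the JDK, which reads the internal DTD subset even when it does not validate. AttributeDefaultsDemo, the sport and since attributes and their values are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class AttributeDefaultsDemo {
    // Hypothetical variant of the player DTD, with one attribute per kind of default
    static final String DOC =
            "<!DOCTYPE player [\n"
          + "  <!ELEMENT player (team+)>\n"
          + "  <!ELEMENT team (#PCDATA)>\n"
          + "  <!ATTLIST team current (yes | no) \"no\"\n"
          + "                 sport   CDATA      #FIXED \"football\"\n"
          + "                 since   CDATA      #IMPLIED>\n"
          + "]>\n"
          + "<player><team current=\"yes\">Bologna</team><team>Mantova</team></player>";

    /** Value of the named attribute on the index-th team element ("" if it has none). */
    static String teamAttribute(int index, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOC)));
            Element team = (Element) doc.getElementsByTagName("team").item(index);
            return team.getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(teamAttribute(0, "current")); // specified in the document
        System.out.println(teamAttribute(1, "current")); // literal default from the DTD
        System.out.println(teamAttribute(1, "sport"));   // #FIXED value from the DTD
        System.out.println(teamAttribute(1, "since").isEmpty()); // #IMPLIED and absent
    }
}
```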
Attribute Types
  CDATA
    any string of text acceptable in a well-formed XML attribute value
  NMTOKEN, NMTOKENS
    like an XML name, but any name character is also accepted as the first character
    the plural form accepts more than one, separated by white space
  ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
  ID
    an XML name, unique in the document, working as an identifier
  IDREF, IDREFS
    reference(s) to IDs in the document
  NOTATION
    name of a notation used and defined in the document (rare)
  enumeration
    a list of allowed values, such as (yes | no)
Other DTD Declarations etc
  ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
  NOTATION declarations
    who cares, actually?
  We stop here
    more only for those who need it
Namespaces
What are Namespaces for?
  Distinguish
    different XML applications may use the same names
      at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
  Group
    names of elements and attributes of the same XML application can be grouped together
      to be more easily recognised and handled
  Example: set is an element in both the SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
Syntax for Namespace Use
  Qualified names
    prefix:local_part
  Examples of qualified names
    also called QNames or raw names
    rdf:description, xlink:type, xsl:template
  Used for both element and attribute names
Associating Prefixes to URIs
  Example
    a large firm could have a number of namespaces for different purposes
      <company xmlns:local="http://www.company.it/xml"
               xmlns:euro="http://www.company.eu/xml"
               xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
  URIs are standardised, prefixes are not
    but usually svg, rdf and other well-known prefixes are not re-defined
    URIs too are just conventional names
      not necessarily pointing to an actual resource
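A sketch of how prefixes resolve to URIs, assuming the JAXP DOM parser from the JDK with namespace awareness switched on (it is off by default); NamespaceDemo, expandedNames and the doc wrapper element are hypothetical. Two set elements with the same local name end up with different expanded names, and the prefix chosen in the document plays no role in the result.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NamespaceDemo {
    /** Returns "{namespace-URI}local-name" for every element, in document order. */
    static List<String> expandedNames(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagName("*"); // every element
            List<String> names = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Node e = all.item(i);
                names.add("{" + e.getNamespaceURI() + "}" + e.getLocalName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc xmlns:svg=\"http://www.w3.org/2000/svg\""
                + " xmlns:m=\"http://www.w3.org/1998/Math/MathML\">"
                + "<svg:set/><m:set/></doc>";
        expandedNames(xml).forEach(System.out::println);
    }
}
```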
Setting Default Namespaces
  xmlns attribute
    alone, with no :prefix suffix
      <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg itself) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
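The effect of a default namespace can be checked with a short sketch, again assuming the namespace-aware JAXP DOM parser from the JDK; DefaultNamespaceDemo, namespaceOf and the rect child are chosen for illustration. Unprefixed elements inside svg inherit its namespace, while the same markup without the xmlns attribute is in no namespace at all.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DefaultNamespaceDemo {
    /** Namespace URI of the first element with the given local name (null if it has none). */
    static String namespaceOf(String xml, String localName) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // "*" matches any namespace, including no namespace
            return doc.getElementsByTagNameNS("*", localName).item(0).getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalArgumentException("no such element, or not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String svg = "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"10\" height=\"10\">"
                + "<rect width=\"5\" height=\"5\"/></svg>";
        System.out.println(namespaceOf(svg, "svg"));
        System.out.println(namespaceOf(svg, "rect"));                  // inherited default
        System.out.println(namespaceOf("<svg><rect/></svg>", "rect")); // no namespace
    }
}
```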
Internationalisation
What does Text Mean?
  “Text” can be encoded according to so many different alphabets
    character set: a mapping between characters and integers (code points)
    ASCII being the most (in)famous one; now Unicode
  A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
  Any XML document is a text document
    so its encoding should be declared
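A sketch of the difference between code points and bytes, using only java.nio.charset from the JDK; EncodingDemo, byteLengths and the sample word are chosen for illustration. The word has ten characters, the last one being U+00E0, and each encoding maps the same code points onto a different number of bytes.

```java
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    /** Number of bytes needed for the text in UTF-8, UTF-16 (big-endian, no BOM) and ISO-8859-1. */
    static int[] byteLengths(String text) {
        return new int[] {
            text.getBytes(StandardCharsets.UTF_8).length,
            text.getBytes(StandardCharsets.UTF_16BE).length,
            text.getBytes(StandardCharsets.ISO_8859_1).length
        };
    }

    public static void main(String[] args) {
        String text = "Universit\u00e0"; // 10 characters, the last one is U+00E0
        int[] n = byteLengths(text);
        System.out.println("code points: " + text.codePointCount(0, text.length()));
        System.out.println("UTF-8 bytes: " + n[0]);      // U+00E0 takes two bytes
        System.out.println("UTF-16 bytes: " + n[1]);     // two bytes per character here
        System.out.println("ISO-8859-1 bytes: " + n[2]); // one byte per character
    }
}
```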
The XML Encoding Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
    Unicode and ISO are the most used families
  Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
      it is then a text declaration, no longer an XML declaration
Multi-Lingual Documents
  Example: a spell-checker or a voice-reader parsing an XML doc
  How to determine the language of a subpart?
    for multi-lingual docs
  xml:lang attribute
    can be associated to any element
    determines the language of the element
  Values are to be found in ISO 639
    standard two-letter codes for each known language
    if not there, IANA
      prefix i-
      such as i-navajo, i-klingon, …
    if not there either, such as for user-defined tags
      prefix x-
      such as x-quenya
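A sketch of how a tool such as a spell-checker might find the language in force for an element, assuming the JAXP DOM parser from the JDK; XmlLangDemo, languageOf and the sample document are hypothetical. Since xml:lang applies to an element and to everything inside it, the lookup walks from the element up through its ancestors until a declaration is found.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XmlLangDemo {
    /**
     * Language in force for the first element with the given tag name:
     * the nearest xml:lang on the element itself or on an ancestor, else null.
     */
    static String languageOf(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            for (Node n = doc.getElementsByTagName(tagName).item(0);
                 n != null && n.getNodeType() == Node.ELEMENT_NODE;
                 n = n.getParentNode()) {
                Element e = (Element) n;
                if (e.hasAttribute("xml:lang")) return e.getAttribute("xml:lang");
            }
            return null; // no language declared
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc xml:lang=\"en\"><p>Hello</p>"
                + "<p xml:lang=\"it\"><q>Ciao</q></p></doc>";
        System.out.println(languageOf(xml, "p")); // first p: inherited from doc
        System.out.println(languageOf(xml, "q")); // inherited from the second p
    }
}
```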
Encoding for Portability
  Working around encoding is not simply an “internationalisation” issue
    it is also about portability
  When transmitting or communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML's abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML, however, can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following, we explore the XML + CSS approach
Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
  W3C standard
    http://w3c.org/Style/CSS
  Goals
    describing how to present the elements of a document
      spanning over a range of different media
    separating style description from content and structure
  In this course we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things are needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing instruction
    to associate CSS to XML
      <?xml-stylesheet type="text/css" href="filename.css"?>
  CSS style sheet defining the presentation style for the tags of the XML document
      tagname { property1: value1; … }
  No need for a DTD or a Schema
    even though the browser could complain anyway…
XML + CSS Example: The XML Doc
Example: How Mozilla Visualises It [without CSS Style Sheet]
Example: How Mozilla Visualises It [with CSS Style Sheet]
DOM & SAX
Manipulating XML
  Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
  Most often we need to manipulate (access, delete, modify)
    parts of an XML document
    which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used, hated, deprecated, widespread are
    DOM
    SAX
Document Object Model
  http://www.w3.org/DOM
    W3C standard, as usual
  “The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents”
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope
DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
    possibly huge memory consumption
  It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
      primitive navigation and manipulation of XML trees
      other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
      adds namespace support and minor new features
      other Level 2 parts: Events, Views, Style, Traversal and Range
    Level 3 Core: W3C Recommendation, April 2004
      adds minor new features
DOM Nodes
  An XML document is a tree
  The tree contains nodes
    one of them is the root node
    nodes possibly have siblings, children, one parent, content, a tag, etc.
  The DOM specification states that a node can be a
    document, document fragment, document type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
  It also defines which kinds of child nodes each of them may have
Properties & Methods of DOM
  Every DOM node has properties and methods to explore and update the XML tree
  Every DOM node has a name, a value, a type
  There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
  Then, any specific kind of node has its own specific properties and methods
  These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
    see for instance http://java.sun.com/xml/jaxp
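A sketch of these general properties and methods in Java, assuming the org.w3c.dom interfaces and the JAXP factory from the JDK; DomNodesDemo, build and describe are hypothetical names. It builds a small player tree with appendChild, then lists the type, name and value of every node; in the DOM, type 1 is an element and type 3 is a text node.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class DomNodesDemo {
    /** Builds a player element with a name child and a team child carrying an attribute. */
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element player = doc.createElement("player");
            doc.appendChild(player);
            Element name = doc.createElement("name");
            name.appendChild(doc.createTextNode("Carlo"));
            player.appendChild(name);
            Element team = doc.createElement("team");
            team.setAttribute("current", "yes");
            team.appendChild(doc.createTextNode("Bologna"));
            player.appendChild(team); // goes after the existing children
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** One line per node: type code, name and value, indented by depth. */
    static String describe(Node n, String indent) {
        StringBuilder sb = new StringBuilder();
        sb.append(indent).append(n.getNodeType()).append(' ')
          .append(n.getNodeName()).append(' ').append(n.getNodeValue()).append('\n');
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling())
            sb.append(describe(c, indent + "  "));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe(build().getDocumentElement(), ""));
    }
}
```

Attributes do not show up in the listing because they are not children of their element; they are reached through the attributes property (getAttributes in Java).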
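A Simple Java DOM
  // Imports needed: java.io.PrintStream, org.w3c.dom.Document, org.w3c.dom.Node,
  // plus the parser class; DOMParser is presumably org.apache.xerces.parsers.DOMParser.
  public static void main(String[] args) {
    try {
      // Load the whole document named on the command line into a DOM tree
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      // Scan the children of the root element for the first recipe element
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      // Write that recipe, if any, wrapped in a new collection element
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);  // print(Node, PrintStream) is defined elsewhere, not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }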
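The same navigation can be sketched with nothing but the JDK, assuming JAXP (javax.xml.parsers) in place of the parser class used on the slide and an identity Transformer in place of its print helper. This is one possible variant, not the slide's own code; FirstRecipe, extract and the sample recipes are invented.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class FirstRecipe {
    /** Returns the first recipe child of the root element, or null if there is none. */
    static Node firstRecipe(Document doc) {
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
            n = n.getNextSibling();
        return n;
    }

    /** Wraps the first recipe of the given XML text in a new collection element. */
    static String extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node n = firstRecipe(doc);
            StringWriter out = new StringWriter();
            out.write("<collection>");
            if (n != null) {
                // An identity transformation serialises the subtree rooted at n
                Transformer t = TransformerFactory.newInstance().newTransformer();
                t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                t.transform(new DOMSource(n), new StreamResult(out));
            }
            out.write("</collection>");
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot process the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract("<collection><description>test</description>"
                + "<recipe><title>Zuppa</title></recipe>"
                + "<recipe><title>Torta</title></recipe></collection>"));
    }
}
```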
Main Problem of DOM
  The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist
Simple API for XML (SAX)
  Unlike DOM, SAX is event-based
  It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as it goes: document started / ended, element started / ended, character content, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not as well supported as DOM
    in terms of standardisation
    as well as of tools
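A sketch of the event-based model, assuming the SAX parser shipped with the JDK (org.xml.sax and javax.xml.parsers); SaxEventsDemo and events are hypothetical names. The handler keeps no tree: it only records each event as the document flows by, which is why memory use stays small.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventsDemo {
    /** Returns the sequence of SAX events produced while reading the XML text. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        events("<player><name>Carlo</name><surname>Nervo</surname></player>")
                .forEach(System.out::println);
    }
}
```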
Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
Comments
Easylt-- Comment --gt
It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-FormednessMain rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
Flexibility or Rigidity
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one teamDocument Type Definition (DTD)
to define which XML documents are validValidity is not mandatory as well-formedness
how to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces

- xmlns attribute
  - alone, no suffix
      <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  - all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  - no need to make the svg prefix explicit

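As a small illustration of the default namespace, the sketch below parses an SVG fragment with Python's xml.etree.ElementTree. The rect child and the numeric sizes are assumptions made up for the example. It also shows a detail the slide does not spell out: the default namespace applies to unprefixed elements, not to unprefixed attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical SVG fragment; only the namespace URI comes from the slide.
SVG = ('<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">'
       '<rect width="5" height="5"/></svg>')

root = ET.fromstring(SVG)

# Every unprefixed element, svg included, lands in the default namespace.
element_names = [el.tag for el in root.iter()]
print(element_names)

# Unprefixed attributes stay in no namespace at all.
attribute_names = sorted(root.attrib)
print(attribute_names)
```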
Internationalisation

What does "Text" Mean?

- "Text" can be encoded according to many different alphabets
  - mapping between characters and integers (code points)
    - character set
  - ASCII being the most (in)famous, now Unicode
- A character encoding determines how code points are mapped onto bytes
  - so a character set can have multiple encodings
  - UTF-8 and UTF-16 are both Unicode encodings
- Any XML document is a text document
  - so encoding should be declared

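The distinction between a character set (characters to code points) and an encoding (code points to bytes) can be sketched in a few lines of Python. The sample word is just an illustration.

```python
# One string, one sequence of Unicode code points...
text = "Università"
code_points = [ord(ch) for ch in text]

# ...but a different byte sequence for each encoding.
utf8 = text.encode("utf-8")        # "à" takes two bytes
utf16 = text.encode("utf-16-be")   # every character here takes two bytes
latin1 = text.encode("iso-8859-1") # every character here takes one byte

print(hex(code_points[-1]), len(utf8), len(utf16), len(latin1))
```

Decoding bytes with the wrong encoding silently yields different characters, which is why the encoding has to travel with the document.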
The XML Encoding: Part of the XML Declaration

    <?xml version="1.0" encoding="utf-8" standalone="no"?>

- Most common values
  - utf-8, utf-16 (Unicode)
  - ISO-8859-1 (Latin-1)
- See also XML-Defined Character Sets
  - Unicode and ISO are the most used families
- Used also for external parsed entities
  - like DTD fragments or XML chunks
  - which may have different encodings
  - there, version may be dropped
    - it is a text declaration, but no longer an XML declaration

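A hedged sketch of why the declared encoding matters, again with xml.etree.ElementTree. The city element and its content are hypothetical. The same bytes parse correctly when the declaration matches them, and fail when they are mislabelled.

```python
import xml.etree.ElementTree as ET

# A document serialised as ISO-8859-1 bytes, with a matching declaration.
declared = "<?xml version='1.0' encoding='ISO-8859-1'?><city>Forlì</city>"
latin1_bytes = declared.encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes accordingly.
city = ET.fromstring(latin1_bytes).text
print(city)

# Mislabelling the very same bytes as UTF-8 makes the document unreadable:
# the byte for "ì" is not a valid UTF-8 sequence.
mislabelled = latin1_bytes.replace(b"ISO-8859-1", b"utf-8")
try:
    ET.fromstring(mislabelled)
    mislabelled_ok = True
except ET.ParseError:
    mislabelled_ok = False
print(mislabelled_ok)
```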
Multi-Lingual Documents

- Example: a spell-checker or a voice-reader parsing an XML doc
- How to determine the language of a subpart?
  - for multi-lingual docs
- xml:lang attribute
  - can be associated to any element
  - determines the language of the element
- Values are to be found in ISO 639
  - standard two letters for each language known
  - if not there: IANA
    - prefix i-
    - such as i-navajo, i-klingon, …
  - if not there too, such as for user-defined tags
    - prefix x-
    - such as x-quenya

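How a tool such as a spell-checker might resolve the language of each subpart can be sketched as follows. The document and the helper name languages are assumptions for illustration. The sketch relies on the XML rule that xml:lang applies to an element and to its content, unless a nested element overrides it.

```python
import xml.etree.ElementTree as ET

# The xml prefix is bound by definition to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical multi-lingual document.
DOC = """<doc xml:lang="en">
  <p>Hello</p>
  <p xml:lang="it">Ciao <em>mondo</em></p>
</doc>"""

def languages(element, inherited=None):
    """Yield (tag, language) pairs, inheriting xml:lang from the ancestors."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)

result = list(languages(ET.fromstring(DOC)))
print(result)
```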
Encoding for Portability

- Working around encoding is not simply an "internationalisation" issue
  - it is also about portability
- When transmitting / communicating through text-based files, many errors typically occur
  - which are often not easy to catch
- XML abilities to
  - handle encoding precisely and accurately
  - embody encoding information within each document
- make it a powerful tool for easy and hassle-free portability
  - across platforms, across applications, across time

XML & CSS

XML on Browsers

- Different experiences with different browsers
  - when trying to visualise an XML document
- XML however can be transformed
  - to become easier to handle by standard browsers
- Two main approaches
  - Web-based one: XML + CSS
  - XML-based one: XSL
- In the following, we explore the XML + CSS issue

Cascading Style Sheets

- Cascading Style Sheets (CSS)
  - a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
- Standard W3C
  - http://www.w3.org/Style/CSS
- Goals
  - describing how to present elements of a document
    - spanning over a range of different media
  - separating style description from content and structure
- In this course, we assume that you already know the basics
  - if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

[example shown on the original slide, not reproduced in this transcript]

XML + CSS

- Any XML document can be prepared for browser visualisation via CSS
- Two things needed
  - a CSS style sheet referring to the proper element types of the XML document
  - the association between the XML document and the CSS style sheet
- Processing instruction
  - to associate CSS to XML
      <?xml-stylesheet type="text/css" href="filename.css"?>
- CSS style sheet defining presentation style for the XML document tags
      tagname { property1: value1; … }
- No need for DTD or Schema
  - even though the browser could anyway complain…

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it [without CSS Style Sheet]

Example: How Mozilla Visualises it [with CSS Style Sheet]

DOM & SAX

Manipulating XML

- Representing information in an XML Document
  - and presenting it somehow
  - is not enough for most non-trivial application scenarios
- Mostly, we need to manipulate (access, delete, modify)
  - parts of an XML document
  - which may or may not be an XML file
- This is typically done through programming languages of many sorts
  - through ad hoc APIs
- The most used / hated / deprecated / widespread are
  - DOM
  - SAX

Document Object Model

- http://www.w3.org/DOM
  - standard W3C, as usual
- "The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
- It applies to HTML as well as XML
- It is essentially an API
  - standardised for Java & ECMAScript
  - but can be extended to other languages
- There is no time here to go deep into DOM
  - we just try to understand its nature, goals, and scope

DOM & Levels

- DOM views an XML tree as a data structure
  - similar to the DOM from JavaScript
- DOM loads the whole XML document in memory to manipulate it
  - maybe huge memory consumption
- It is quite large and complex…
  - Level 1 Core: W3C Recommendation, October 1998
    - primitive navigation and manipulation of XML trees
    - other Level 1 parts: HTML
  - Level 2 Core: W3C Recommendation, November 2000
    - adds Namespace support and minor new features
    - other Level 2 parts: Events, Views, Style, Traversal and Range
  - Level 3 Core: W3C Recommendation, April 2004
    - adds minor new features

DOM Nodes

- An XML document is a tree
- The tree contains nodes
  - one of them is a root node
  - nodes possibly have siblings, children, one parent, content, tag, etc.
- The DOM specification states that a node can be one of
  - document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
- It also defines which kinds of child nodes each of them should / could have

Properties & Methods of DOM

- Every DOM node has properties and methods to explore and update the XML tree
- Every DOM node has a name, a value, a type
- There are general properties and methods for all kinds of nodes
  - attributes returns all the attributes of the node
  - appendChild(newChild) appends newChild after the other child nodes
- Then, any specific kind of node has its own specific properties and methods
- These properties and methods are made available by the suitable API for the language of choice
  - many solutions for Java
  - see for instance http://java.sun.com/xml/jaxp

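The node properties and methods listed above can be tried out with Python's xml.dom.minidom, which implements a subset of the W3C DOM. This is only a sketch: the player document is the running example of these slides, and the Python binding is an assumption of this example, not the Java or ECMAScript binding the slides refer to.

```python
from xml.dom import minidom, Node

doc = minidom.parseString('<player><team current="yes">Bologna</team></player>')
team = doc.documentElement.firstChild

# Every node has a name, a value and a type.
name, kind = team.nodeName, team.nodeType
text_node = team.firstChild
text_name, text_value = text_node.nodeName, text_node.nodeValue

# "attributes" gives access to all the attributes of the node.
current = team.attributes["current"].value

# appendChild(newChild) appends newChild after the other child nodes.
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
doc.documentElement.appendChild(new_team)

serialised = doc.documentElement.toxml()
print(serialised)
```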
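A Simple Java DOM

    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser; // assumption: the Xerces parser
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
      try {
        DOMParser p = new DOMParser();
        p.parse(args[0]);
        Document doc = p.getDocument();
        // scan the children of the root up to the first <recipe> element
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
          n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
          print(n, out); // print is a helper not shown on the slide
        out.println("</collection>");
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
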
Main Problem of DOM

- The XML document is loaded as a whole and handled altogether in memory
  - it might be time-consuming and difficult to manage
  - wouldn't it be better if we could load only the part we are actually manipulating?
- This is the motivation behind SAX
  - which did not start as a standard
  - has problems of acceptance
  - but has indeed a long tail of followers
  - and also its good reasons to exist

Simple API for XML (SAX)

- Differently from DOM, SAX is event-based
- It sees the document not as a tree, but as a text doc
  - flowing through the SAX parser
  - and generating events as soon as: document started / ended, elements started / ended, character content, etc.
- A very simple model
  - good for simple applications
  - and also to avoid memory abuse
- Not so well-supported as DOM is
  - in terms of standardisation
  - as well as of tools

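The event-based model can be sketched with Python's xml.sax module. The handler class name EventLog is made up for this example. The parser calls the handler's methods while the document flows through it, and no tree is ever built.

```python
import xml.sax

class EventLog(xml.sax.ContentHandler):
    """Records the events raised by the parser, in order of arrival."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.text = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

    def characters(self, content):
        # Character data may arrive in several chunks, so just collect it.
        self.text.append(content)

handler = EventLog()
xml.sax.parseString(
    b"<player><name>Carlo</name><surname>Nervo</surname></player>", handler)
print(handler.events)
```

Memory use depends on what the handler chooses to keep, not on the size of the document, which is the main point made against DOM above.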
Hey, too many Languages already!

- Application domains are more and more
  - numerous
  - complex
  - specific
- Special / specialised languages as the engineer's tools
  - to represent, denote & express behaviours and computations
- Engineers working with computational / ICT systems will be called to use a number of different artificial languages, but also
  - to know and understand computational models and paradigms
  - to select languages and paradigms
  - to define and build new languages
- "Laurea Specialistica in Informatica"
  - "Linguaggi e modelli computazionali", "Ingegneria del SW"

XML Applications

- XML per se is "small" & simple
  - languages defined via XML are instead so many and complex
- XML Applications
  - XML-defined markup languages
    - defined through a precise syntax
    - DTD or XML Schema
  - they may be either standard or custom
- Most standard XML applications are W3C
  - such as
    - XSLT
    - XML Schema
    - XHTML

XML for Portable Data

- Cross-platform, long-term data format
  - passing XML data through space and time
  - along with Unicode and text-based standard formats
- Text, text, text
  - both data and markup
  - all in the XML file
- XML document structure: simple & clear
  - easy to parse
  - well-documented
- That is why XML is already everywhere

How XML Looks like

    <?xml version="1.0" encoding="utf-8"?>
    <docroot>
      <head>
        <title>This is my document</title>
      </head>
      <body>
        <p>A list of things I like</p>
        <list>
          <item>weekends</item>
          <item>good beer</item>
          <item>midnight snacks</item>
          <item>ice cream
            <list>
              <item>chocolate</item>
              <item>cookie dough</item>
              <item>white russian</item>
            </list>
          </item>
          <item>shade trees</item>
        </list>
      </body>
    </docroot>

How XML Looks like from a Browser

How to Work with XML

- XML is text
  - so any text-editor is perfectly fine
- A number of XML editors around
  - but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
- Visualisation is a different matter
  - browsers do something
  - but XML is not a presentation language, so…
- we need to understand
  - what an XML document is
  - how XML works

What is an XML Document?

- It can be
  - a text file
  - a record in a database
  - a run-time construction in memory
  - …
- In any case, it can be handled and transmitted by any system capable of dealing with text documents

    <student>
      <studentname>
        <name>Carlo</name>
        <surname>Nervo</surname>
      </studentname>
      <studentnumber> 0000145678 </studentnumber>
      <course>2036</course>
    </student>

How does XML Work?

- Who handles XML documents
  - after they have been produced?
  - how? why?
- XML parsers
  - working out the structure of the XML document
  - verifying well-formedness and basic respect of XML syntax
- XML validating parsers
  - when applicable
    - there is either a DTD or a Schema
  - checking validity
- Examples
  - web browsers, word processors, database servers, drawing programs, spreadsheets, programs in some language, etc.

Where is XML actually used?

- Everywhere, already

Some History of XML & Related

- Lot to be written, still…
- SGML is where it comes from
  - HTML was the first successful application of SGML
  - but SGML had obvious limitations
    - too complex
    - more than 150 pages
    - never implemented fully
    - too complex for the Internet
- SGML "Lite" (1996, Bosak, Bray et al.)
  - XML 1.0 (February 1998)
- Then a flow
  - namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

XML Fundamentals

A Simple XML Document

    <player> Carlo Nervo</player>

XML Document & Files

- This is a complete XML document
- It can be stored / recorded / built in the form of a number of different files, or even in other forms
  - Carlonervo.xml, player.txt
  - a record in a database
  - a memory area built by a CGI and then transmitted
  - sent by a Web server with MIME type application/xml or text/xml

    <player> Carlo Nervo</player>

XML Elements & Tags

- The document contains a single element
  - of type player
- Such an element is delimited by the tag player
  - between start tag <player> and end tag </player>
- In between the tags lies the element's content: Carlo Nervo
  - tags are markup
    - the most common form of markup, but there are other kinds
  - content is character data
    - including the white space between Carlo & Nervo

    <player> Carlo Nervo</player>

Tag Syntax

- Very similar to HTML tags
  - at least superficially
- <tag> for start tags, </tag> for end tags
- <tag /> for empty tags
  - tags with no content, like <br /> or <hr />
- XML is case sensitive
  - so <player> can not be closed by end tag </Player>
  - NOTE: thus pay attention to non-case sensitive technologies when combined with XML
    - HTML, JavaScript & XHTML, …

XML Trees: A Simple Example

    player
      name     Carlo
      surname  Nervo
      team     Bologna
      team     Mantova

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

An XML Document is an XML Tree

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

- An XML Document has a tree-like structure
  - one and only one root
    - root element, or document element
  - each node / element can have one or more child elements
  - each element other than the root has exactly one parent
  - child elements from the same parent are siblings
  - leaves are either content or empty elements
- Well-formedness stems from here
  - <em><b>Wrong </em> XML</b> is not permitted
  - nesting needs to be perfect, overlapping not allowed

Narrative-Organised XML

    <biography>
      <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
      After playing many years in minor teams, such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
      …
    </biography>

- XML Documents for written narrative, such as articles, reports, blogs, books, novels
  - elements with mixed content
  - not easy for automated processing and exchange

XML Attributes

- Elements can be labelled by attributes
  - attributes are specified in the start tag
    - and in the only tag of empty elements
  - any number of attributes can be in principle associated to an element
- An attribute is a name-value pair of the form name="value"
  - alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  - only one attribute with a given name allowed per element
- Attributes do not change the tree structure of an XML document
  - but they are qualifiers for the nodes and leaves of the tree

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

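The attribute rules above can be checked quickly with xml.etree.ElementTree. This is a sketch on the player example; the variable names are made up here.

```python
import xml.etree.ElementTree as ET

DOC = """<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>"""

# Attributes qualify nodes: select the team whose "current" attribute is "yes".
root = ET.fromstring(DOC)
current_teams = [t.text for t in root.findall("team") if t.get("current") == "yes"]

# Single quotes and spaces around "=" are alternative spellings of the same pair.
same = (ET.fromstring("<team current='yes'/>").attrib
        == ET.fromstring('<team current = "yes"/>').attrib)

# Two attributes with the same name on one element are not well-formed.
try:
    ET.fromstring('<team current="yes" current="no"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

print(current_teams, same, duplicate_accepted)
```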
Using Elements or Attributes?

- "Attributes are for meta-data about the element, and content is information of the element"
  - maybe, but then it is not easy to clearly distinguish between the two
- Element-based structure is more flexible than attribute-based
  - attributes provide for a flat data structure, elements can be nested as needed
  - attributes are unique within an element, any number of elements of the same type can be used within an element
- Attributes are quite useful in narrative-based XML documents

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes" value="Bologna" />
      <team current="no" value="Mantova" />
    </player>

XML Names

- XML Names are used, and are the same, for the names of elements, attributes, and some other constructs
  - to increase efficiency and abate complexity
- An XML name can include
  - any letter
    - latin, or even non-latin like ideographs
  - any digit
  - underscore, hyphen and period (_ - .)
  - a colon (:) is reserved to namespaces
- An XML name may not include other punctuation signs, nor any sort of white space
  - and can begin only with letters, ideographs or underscore

Parsed Character Data

- An XML Parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  - so for instance < is always taken as the beginning of a tag
  - what if we need a < character in the document, as in JavaScript code?
- All characters are interpreted as character data to be parsed
  - unless an escape character & is encountered
  - character data to parse start again after the ; char
- E.g., the content of the element
      <superheroes>Batman &amp; Robin</superheroes>
  - becomes the parsed character data
      Batman & Robin

Entity References

- &entityreference;
  - an entity is something defined outside the normal flow of the XML document
    - out of the XML tree
  - used for constants, common values, external values, etc.
    - through an entity reference
- Users of any sort may define their own entities
  - we'll see how soon, for instance through DTDs

Pre-defined XML Entities

    Markup   Entity   Description
    &lt;     <        less-than
    &gt;     >        greater-than
    &amp;    &        ampersand
    &quot;   "        double quote
    &apos;   '        single quote

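As a hedged illustration of the pre-defined entities, the sketch below escapes a code-like string with Python's xml.sax.saxutils before embedding it in a document, then parses it back. The sample string and the code element are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Character data containing markup characters must be escaped before embedding.
raw = "if (a < b && b > c)"
escaped = escape(raw)
print(escaped)

# The parser turns the entity references back into the original characters.
parsed = ET.fromstring("<code>" + escaped + "</code>").text
roundtrip = unescape(escaped)

# The example from the slides.
heroes = ET.fromstring("<superheroes>Batman &amp; Robin</superheroes>").text
print(parsed, heroes)
```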
CDATA Sections

- Including code chunks from any language with < or & can be tedious
  - we need to tell the parser: do not parse this
  - good for instance to include segments of XML code to show
- CDATA Section
  - between <![CDATA[ and ]]>
  - can contain anything but its own end delimiter
- After parsing, no way to tell whether a text came from a CDATA section or not

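The last point can be verified directly. In this sketch (the code element is hypothetical), a CDATA section and the equivalent escaped text yield exactly the same character data once parsed.

```python
import xml.etree.ElementTree as ET

# The same content written twice: once in a CDATA section, once escaped.
with_cdata = ET.fromstring("<code><![CDATA[<b>bold</b> & more]]></code>").text
with_entities = ET.fromstring(
    "<code>&lt;b&gt;bold&lt;/b&gt; &amp; more</code>").text

# After parsing, nothing tells the two apart.
print(with_cdata == with_entities, with_cdata)
```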
Comments

- Easy
  - <!-- Comment -->
- A comment cannot contain --, nor can it end with --->
- Comments do not affect the document tree-structure
  - they can appear anywhere, even before the root element
  - but not inside a tag or a comment
- Parsers may either drop or keep them, at their will
- Comments are meant to improve human legibility of XML docs
  - to give info to computational agents: processing instructions

XML Processing Instructions

- Need to pass information for a given application through the parser
  - comments may disappear at any stage of the process
- Processing instructions have this very end
  - <?target … ?>
- The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  - <?php … ?>
  - <?xml-stylesheet … ?>
- A processing instruction is markup, not an element
  - it can appear everywhere out of a tag, even before or after the root

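A small sketch of how an application can pick up a processing instruction through the parser, here with Python's xml.dom.minidom. The style sheet name style.css and the doc root element are hypothetical.

```python
from xml.dom import minidom, Node

# A processing instruction placed before the root element.
doc = minidom.parseString(
    '<?xml-stylesheet type="text/css" href="style.css"?><doc/>')

pi = doc.firstChild
is_pi = pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE

# The target names the intended application; the rest is opaque data for it.
target, data = pi.target, pi.data
print(target, data)
```

The parser hands the data over as a plain string: what looks like attributes inside the instruction is a convention of the target application, not XML syntax.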
The XML Declaration

- Looks like an XML processing instruction
  - but it is not: it is just the XML declaration
- It is optional
  - but if there, it should be the first thing in the document, absolutely
    - not even comments allowed before
      <?xml version="1.0" encoding="utf-8" standalone="no"?>
- Version is the XML version (1.0, 1.1, …)
- Encoding is the form of the text (Unicode in the example)
  - optional, default Unicode
- Standalone means that it has no external DTD
  - optional, default no

Checking Well-Formedness

- Main rules
  - perfect match between start and end tags
  - no overlapping elements
  - one and only one root element
  - attribute values are always quoted
  - at most one attribute with a given name per element
  - neither comments nor processing instructions within tags
  - no unescaped < or & signs in the character data of elements or attributes
  - …
- Tools on the Web
  - just look around

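Any non-validating parser can serve as a well-formedness checker. A minimal sketch with xml.etree.ElementTree follows; the helper name is_well_formed and the sample fragments are assumptions of this example, each fragment breaking one of the rules above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True when the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

checks = {
    "ok": is_well_formed("<player><name>Carlo</name></player>"),
    "overlap": is_well_formed("<em><b>Wrong </em> XML</b>"),
    "two roots": is_well_formed("<name>Carlo</name><surname>Nervo</surname>"),
    "unquoted value": is_well_formed("<team current=yes>Bologna</team>"),
    "bare ampersand": is_well_formed("<s>Batman & Robin</s>"),
}
print(checks)
```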
DTD

Flexibility or Rigidity?

- XML is flexible
  - whatever this means
  - but sometimes flexibility is not a feature within a given application scenario
- Sometimes some strict rule is required
  - some control over syntax should be enforced
    - like: a football player should have at least one team
- Document Type Definition (DTD)
  - to define which XML documents are valid
- Validity is not mandatory, unlike well-formedness
  - how to handle errors is optional

Validation

- A valid XML Document includes a DTD that the document satisfies
- Main principle
  - everything not permitted is forbidden
  - that is, DTDs specify positive examples
- Everything in the XML document must match a DTD declaration
  - then the document is valid
  - otherwise the document is invalid
- Many things a DTD does not say
  - we stick with what we can specify

DTD is…

- SGML-based
  - syntax a bit awkward
  - but after all easy to understand
  - and quite suited for short and expressive descriptions
- It allows XML designers to define a grammar for their documents
  - typical syntax-based approach
    - maybe limited, but easy to implement
- Maybe DTD is not the future of XML document validation
  - XML Schema should be that
  - but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still

A Simple DTD Example

- We do not go too deep into DTD syntax
  - we just look at the example below and comment

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

DTD Declaration

- DTD is declared here as internal
  - but could be declared separately
      <!DOCTYPE player SYSTEM "football_player.dtd">
  - even referring to an external shared resource
      <!DOCTYPE player SYSTEM "http://…">

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

DTD Declarations: Define or Use

- So, you may
  - define your own DTD, and
    - either include it in your XML document
    - or save it as an independent document, and refer to it from one or more XML docs
  - or, use an external DTD defined by someone else
    - like a working group you belong to, or a standardisation body of any sort
    - by referring to that externally-defined syntax for your XML docs

Element Declarations

- A player element contains one name, one surname and one or more teams
  - in that precise order
  - and they are just parsed character data (#PCDATA)

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Some Syntax

- , is for sequence
  - to define ordered lists
- | is for choice
  - to provide for alternatives
- suffixes
  - * for zero or more occurrences
  - + for one or more occurrences
  - ? for zero or one occurrence
- parentheses for grouping
  - at any level of nesting
  - operators and suffixes applicable to any level
- ANY for free-form content
- EMPTY for empty element

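The content-model operators read much like regular-expression operators. As an illustration only, the sketch below hand-translates the declaration of player, (name, surname, team+), into a regular expression over the sequence of child names. It is not a DTD validator and does not use a validating parser; the names PLAYER_MODEL and matches_player_model are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated from <!ELEMENT player (name, surname, team+)>:
# "," becomes sequence, "+" stays "one or more".
PLAYER_MODEL = re.compile(r"name surname( team)+")

def matches_player_model(xml_text):
    """Check only the order and number of the children of a player element."""
    root = ET.fromstring(xml_text)
    children = " ".join(child.tag for child in root)
    return root.tag == "player" and PLAYER_MODEL.fullmatch(children) is not None

valid = matches_player_model(
    "<player><name>Carlo</name><surname>Nervo</surname>"
    "<team>Bologna</team><team>Mantova</team></player>")
no_team = matches_player_model(
    "<player><name>Carlo</name><surname>Nervo</surname></player>")
wrong_order = matches_player_model(
    "<player><surname>Nervo</surname><name>Carlo</name>"
    "<team>Bologna</team></player>")
print(valid, no_team, wrong_order)
```

A real validating parser would also check attribute declarations and the content of every nested element, which this sketch leaves out.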
Attribute Declarations

- A team element has a current attribute
  - which is mandatory
    - #IMPLIED would say optional instead
  - and can be either yes or no
    - enumeration as an attribute type

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Attribute Defaults

- #IMPLIED
  - the attribute is optional
- #REQUIRED
  - the attribute is mandatory
- #FIXED
  - whether it is explicitly specified or not, it has a given value
- literal
  - the default value is the literal quoted string

Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tagsprefix x-
such as x-quenya58
Encoding for Portability
Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability
When transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time
59
XML amp CSS
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip
No need for DTD or Schemaeven though the browser could anyway complainhellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it[without CSS Style Sheet]
66
Example How Mozilla Visualises it[with CSS Style Sheet]
67
DOM amp SAX
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sortsthrough ad hoc API
The most used hated deprecated widespread areDOMSAX
69
Document Object Model
httpwwww3orgDOMstandard W3C as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML
Why Markup Languages?
  Markup
    encoding embodied in the document, specifying document properties as well as properties of the information contained
      for instance, formatting instructions
      more generally, structural and semantic information
      knowledge vs. data
  Marks, markups
    tags used to qualify / label text chunks
    e.g., HTML tags
  XML example
    <student>
      <studentname>
        <name>Carlo</name>
        <surname>Nervo</surname>
      </studentname>
      <studentnumber>0000145678</studentnumber>
      <course>2036</course>
    </student>
XML: X for eXtensibility
  Basic idea of XML
    a simple meta-language for humans and automata
    to build electronic documents
    allowing users to define ad hoc markup languages
  Then
    XML is quite free in general
    it can be "extended"
      actually, specialised
    to define more specific ad hoc markup languages
  No predefined XML markups, unlike in HTML
    they need to be defined
      who defines them?
      can we do this? how?
Hey, too many Languages already!
  Application domains are more and more
    numerous
    complex
    specific
  Special / specialised languages as the engineer's tools
    to represent, denote & express behaviours and computations
  Engineers working with computational / ICT systems will be called to use a number of different artificial languages, but also
    to know and understand computational models and paradigms
    to select languages and paradigms
    to define and build new languages
  "Laurea Specialistica in Informatica"
    "Linguaggi e modelli computazionali", "Ingegneria del SW"
XML Applications
  XML per se is "small" & simple
    languages defined via XML are instead many and complex
  XML applications
    XML-defined markup languages
      defined through a precise syntax: DTD or XML Schema
    they may be either standard or custom
  Most standard XML applications are W3C standards
    such as XSLT, XML Schema, XHTML
XML for Portable Data
  Cross-platform, long-term data format
    passing XML data through space and time
    along with Unicode, a text-based standard format
  Text, text, text
    both data and markup
    all in the XML file
  XML document structure is simple & clear
    easy to parse
    well-documented
  That is why XML is already everywhere
How XML Looks like
  <?xml version="1.0" encoding="utf-8"?>
  <docroot>
    <head>
      <title>This is my document</title>
    </head>
    <body>
      <p>A list of things I like</p>
      <list>
        <item>weekends</item>
        <item>good beer</item>
        <item>midnight snacks</item>
        <item>ice cream
          <list>
            <item>chocolate</item>
            <item>cookie dough</item>
            <item>white russian</item>
          </list>
        </item>
        <item>shade trees</item>
      </list>
    </body>
  </docroot>
How XML Looks like from a Browser
  [browser screenshot in the original slides, not reproduced in this transcript]
How to Work with XML
  XML is text
    so any text editor is perfectly fine
  A number of XML editors are around
    but general text editors with some programming / Web-oriented capabilities are typically good enough, and often even better
  Visualisation is a different matter
    browsers do something
      but XML is not a presentation language, so…
  We need to understand
    what an XML document is
    how XML works
What is an XML Document?
  It can be
    a text file
    a record in a database
    a run-time construction in memory
    …
  In any case, it can be handled and transmitted by any system capable of dealing with text documents
  <student>
    <studentname>
      <name>Carlo</name>
      <surname>Nervo</surname>
    </studentname>
    <studentnumber>
      0000145678
    </studentnumber>
    <course>2036</course>
  </student>
How does XML Work?
  Who handles XML documents
    after they have been produced? how? why?
  XML parsers
    working out the structure of the XML document
    verifying well-formedness and basic respect of XML syntax
  XML validating parsers
    when applicable
      that is, when there is either a DTD or a Schema
    checking validity
  Examples
    web browsers, word processors, database servers, drawing programs, spreadsheets, programs in some language, etc.
Where is XML actually used?
  Everywhere, already
Some History of XML & Related
  Lot to be written still…
  SGML is where it comes from
    HTML was the first successful application of SGML
    but SGML had obvious limitations
      too complex
      more than 150 pages of specification
      never implemented fully
        too complex for the Internet
  SGML "Lite" (1996, Bosak, Bray et al.)
    XML 1.0 (February 1998)
  Then a flow
    namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
XML Fundamentals
A Simple XML Document
  <player>
    Carlo Nervo
  </player>
XML Documents & Files
  This is a complete XML document
  It can be stored / recorded / built in the form of a number of different files, or even in other forms
    Carlonervo.xml, player.txt
    a record in a database
    a memory area built by a CGI and then transmitted
    sent by a Web server with MIME type application/xml or text/xml
  <player>
    Carlo Nervo
  </player>
XML Elements & Tags
  The document contains a single element
    of type player
  Such an element is delimited by the player tags
    the start tag <player> and the end tag </player>
  In between the tags lies the element's content: Carlo Nervo
  Tags are markup
    the most common form of markup, but there are other kinds
  Content is character data
    including the white space between Carlo & Nervo
  <player>
    Carlo Nervo
  </player>
Tag Syntax
  Very similar to HTML tags
    at least superficially
  <tag> for start tags, </tag> for end tags
  <tag /> for empty-element tags
    tags with no content, like <br /> or <hr />
  XML is case sensitive
    so <player> cannot be closed by the end tag </Player>
    NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
      HTML, JavaScript & XHTML, …
XML Trees: A Simple Example
  player
    name
      Carlo
    surname
      Nervo
    team
      Bologna
    team
      Mantova
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
An XML Document is an XML Tree
  An XML document has a tree-like structure
    one and only one root
      the root element, or document element
    each element can have one or more child elements
    each element except the root has exactly one parent
    child elements of the same parent are siblings
    leaves are either content or empty elements
  Well-formedness stems from here
    <em><b>Wrong </em> XML</b> is not permitted
      nesting needs to be perfect: overlapping is not allowed
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
Narrative-Organised XML
  <biography>
    <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
    After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
    …
  </biography>
  XML documents for written narrative, such as articles, reports, blogs, books, novels
    elements with mixed content
    not easy for automated processing and exchange
XML Attributes
  Elements can be labelled by attributes
    attributes are specified in the start tag
      and in the only tag of empty elements
    any number of attributes can in principle be associated to an element
  An attribute is a name-value pair of the form name="value"
    alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
    only one attribute with a given name is allowed per element
  Attributes do not change the tree structure of an XML document
    but they are qualifiers for the nodes and leaves of the tree
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
Using Elements or Attributes?
  Attributes are for meta-data about the element, and content is the information of the element
    maybe, but then it is not easy to clearly distinguish between the two
  An element-based structure is more flexible than an attribute-based one
    attributes provide a flat data structure, whereas elements can be nested as needed
    attribute names are unique within an element, whereas any number of elements of the same type can be used within an element
  Attributes are quite useful in narrative-based XML documents
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>
XML Names
  XML names are used, and follow the same rules, for the names of elements, attributes, and some other constructs
    to increase efficiency and reduce complexity
  An XML name can include
    any letter
      Latin or even non-Latin, like ideographs
    any digit
    underscore, hyphen, and period (_ - .)
    a colon (:), which is reserved for namespaces
  An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with letters, ideographs, or underscore
Parsed Character Data
  An XML parser interprets the character sequence it is fed with, trying to work out its tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
  All characters are interpreted as character data to be parsed
    unless the escape character & is encountered
    character data to parse starts again after the ; character
  E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
  becomes the parsed character data
    Batman & Robin
Entity References
  &entityreference;
    an entity is something defined outside the normal flow of the XML document
      out of the XML tree
    used for constants, common values, external values, etc.
      through an entity reference
  Users of any sort may define their own entities
    we'll see how soon, for instance through DTDs
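As a minimal sketch of a user-defined entity, the Java snippet below declares a hypothetical entity named club in an internal DTD subset and lets the standard JAXP DOM parser expand the reference. The class name, the entity name and its value are assumptions made for illustration; they do not come from the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class InternalEntityDemo {
    // Parses the document and returns the text content of its root element,
    // with entity references already replaced by their declared values.
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // "club" is a hypothetical entity declared in the internal DTD subset.
        String xml = "<!DOCTYPE team [ <!ENTITY club \"Bologna F.C.\"> ]>"
                   + "<team>&club; plays in Bologna</team>";
        System.out.println(rootText(xml));
    }
}
```

The entity is defined once, outside the element tree, and every &club; reference in the content is replaced by its value when the document is parsed.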
Pre-defined XML Entities
  Markup    Entity   Description
  &lt;      <        less-than
  &gt;      >        greater-than
  &amp;     &        ampersand
  &quot;    "        double quote
  &apos;    '        single quote
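The five predefined entities can be exercised with a short Java sketch: a hand-written escape helper (an illustrative assumption, not a library call) turns the special characters into entity references, and the standard JAXP parser turns them back into plain character data.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class PredefinedEntities {
    // Replaces the five characters that have predefined XML entities.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wraps the escaped text in an element, parses it, and returns the parsed character data.
    static String roundTrip(String text) throws Exception {
        String xml = "<superheroes>" + escape(text) + "</superheroes>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(escape("Batman & Robin"));    // the markup form
        System.out.println(roundTrip("Batman & Robin")); // what the application sees
    }
}
```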
CDATA Sections
  Including code chunks from any language with < or & can be tedious
    we need to tell the parser: "do not parse this"
    good, for instance, to include segments of XML code to show
  CDATA section
    between <![CDATA[ and ]]>
    can contain anything but its own end delimiter ]]>
  After parsing, there is no way to tell whether a piece of text came from a CDATA section or not
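A small Java sketch of the same idea, assuming the standard JAXP DOM parser: the character data is identical whether it arrives through a CDATA section or through escaped text, and with the factory's coalescing option switched on the DOM no longer records a separate CDATA node at all. The class and method names are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class CdataDemo {
    static Document parse(String xml, boolean coalescing) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // When true, CDATA sections are converted to ordinary text nodes.
        factory.setCoalescing(coalescing);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Text content of the root element.
    static String text(String xml, boolean coalescing) throws Exception {
        return parse(xml, coalescing).getDocumentElement().getTextContent();
    }

    // DOM node type of the first child of the root element.
    static short firstChildType(String xml, boolean coalescing) throws Exception {
        return parse(xml, coalescing).getDocumentElement().getFirstChild().getNodeType();
    }

    public static void main(String[] args) throws Exception {
        String withCdata = "<code><![CDATA[if (a < b && b < c) ok();]]></code>";
        String escaped   = "<code>if (a &lt; b &amp;&amp; b &lt; c) ok();</code>";
        System.out.println(text(withCdata, true));
        System.out.println(text(withCdata, true).equals(text(escaped, true)));
        System.out.println(firstChildType(withCdata, true) == Node.TEXT_NODE);
    }
}
```

Note that a DOM built without coalescing may still keep a distinct CDATA section node; the content delivered to the application is the same either way.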
Comments
  Easy
    <!-- Comment -->
    it cannot contain --, nor can it end with --->
  Comments do not affect the document tree structure
    they can appear anywhere, even before the root element
    but not inside a tag or another comment
  Parsers may either drop or keep them at will
  Comments are meant to improve human legibility of XML docs
    to give information to computational agents, use processing instructions instead
XML Processing Instructions
  Need to pass information to a given application through the parser
    comments may disappear at any stage of the process
  Processing instructions serve this very purpose
    <?target … ?>
  The target may be the application that has to handle the instruction, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
  A processing instruction is markup, but not an element
    it can appear anywhere outside a tag, even before or after the root element
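To illustrate how a processing instruction reaches the application, here is a hedged Java sketch using the standard DOM API: it scans the top-level nodes of a parsed document and reports the target and data of the first processing instruction it finds. The class name and the file name players.css are assumptions for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

final class ProcessingInstructionDemo {
    // Returns "target -> data" for the first processing instruction among the
    // top-level nodes of the document, or null if there is none.
    static String firstInstruction(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                return pi.getTarget() + " -> " + pi.getData();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // players.css is a hypothetical style sheet name.
        String xml = "<?xml-stylesheet type=\"text/css\" href=\"players.css\"?>"
                   + "<player>Carlo Nervo</player>";
        System.out.println(firstInstruction(xml));
    }
}
```

The parser does not interpret the data part: it hands the raw string to the application named by the target, which decides what to do with it.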
The XML Declaration
  Looks like an XML processing instruction
    but it is not: it is just the XML declaration
  It is optional
    but if present, it must be the very first thing in the document
      not even comments are allowed before it
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  version is the XML version (1.0, 1.1, …)
  encoding is the encoding of the text (UTF-8, a Unicode encoding, in the example)
    optional; default: Unicode (UTF-8, or UTF-16 with a byte-order mark)
  standalone="yes" means that the document does not rely on an external DTD
    optional; default: no
Checking Well-Formedness
  Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
  Tools on the Web
    just look around
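Besides online tools, any XML parser can serve as a well-formedness checker. The following Java sketch, assuming only the JAXP classes shipped with the JDK, reports whether a string parses without a fatal error; the class and method names are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    // Returns true when the text is well-formed XML; no validation is attempted.
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness violation is a fatal parse error
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<player><name>Carlo</name></player>")); // matching tags
        System.out.println(isWellFormed("<em><b>Wrong </em> XML</b>"));          // overlapping elements
        System.out.println(isWellFormed("<team current=yes>Bologna</team>"));    // unquoted attribute value
        System.out.println(isWellFormed("<a/><b/>"));                            // two root elements
    }
}
```

Only the first document passes; each of the others breaks one of the rules listed above.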
DTD
Flexibility or Rigidity?
  XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
  Sometimes some strict rules are required
    some control over syntax should be enforced
      e.g., a football player should have at least one team
  Document Type Definition (DTD)
    to define which XML documents are valid
  Validity is not mandatory, unlike well-formedness
    how to handle validity errors is up to the application
Validation
  A valid XML document includes a DTD that the document satisfies
  Main principle
    everything not permitted is forbidden
    that is, a DTD specifies positive examples
  Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
  There are many things a DTD does not say
    we stick with what we can specify
DTD is…
  SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
  It allows XML designers to define a grammar for their documents
    typical syntax-based approach
      maybe limited, but easy to implement
  Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
A Simple DTD Example
  We do not go too deep into DTD syntax
    we just look at the example below and comment on it
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
DTD Declaration
  The DTD is declared here as internal
    but it could be declared separately
      <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external, shared resource
      <!DOCTYPE player SYSTEM "http://…">
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
DTD Declarations: Define or Use
  So, you may
    define your own DTD, and
      either include it in your XML document
      or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
      like a working group you belong to, or a standardisation body of any sort
      by referring to that externally-defined syntax from your XML docs
Element Declarations
  A player element contains one name, one surname, and one or more teams
    in that precise order
    and each of them contains just parsed character data (#PCDATA)
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
Some Syntax
  , is for sequence
    to define ordered lists
  | is for choice
    to provide for alternatives
  Suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
  Parentheses for grouping
    at any level of nesting
    operators and suffixes are applicable at any level
  ANY for free-form content
  EMPTY for empty elements
Attribute Declarations
  A team element has a current attribute
    which is mandatory (#REQUIRED)
      #IMPLIED would say optional instead
    and can be either yes or no
      enumeration as an attribute type
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
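The element and attribute declarations above can be tried out with a validating parser. The Java sketch below is one possible way to do it with JAXP: it prepends the player DTD as an internal subset and collects the validity errors reported for a document body. The DOCTYPE name is written as player here because a validating parser requires it to match the root element; the class and method names are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class DtdValidationDemo {
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // Validates the given document body against the DTD; an empty list means "valid".
    static List<String> validate(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // turn on DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors.add(e.getMessage()); // validity errors are recoverable
            }
        });
        builder.parse(new InputSource(new StringReader(DTD + body)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        String ok = "<player><name>Carlo</name><surname>Nervo</surname>"
                  + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        String noAttr = "<player><name>Carlo</name><surname>Nervo</surname>"
                      + "<team>Bologna</team></player>";
        System.out.println(validate(ok).isEmpty());
        System.out.println(validate(noTeam));
        System.out.println(validate(noAttr));
    }
}
```

All three bodies are well-formed; only the first is valid, since the second lacks the mandatory team element and the third omits the required current attribute.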
Attribute Defaults
  #IMPLIED
    the attribute is optional
  #REQUIRED
    the attribute is mandatory
  #FIXED
    whether it is explicitly specified or not, it has a given value
  literal
    the default value is the literal, quoted string
Attribute Types
  CDATA
    any string of text acceptable in a well-formed XML attribute value
  NMTOKEN, NMTOKENS
    more permissive than an XML name: any name character is accepted as the first character
    the plural form accepts more than one token, separated by white space
  ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
  ID
    an XML name, unique in the document, working as an identifier
  IDREF, IDREFS
    reference(s) to IDs in the document
  NOTATION
    name of a notation used & defined in the document (rare)
  enumeration
Other DTD Declarations, etc.
  ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
  NOTATION declarations
    who cares, actually
  We stop here
    more only for those who need it
Namespaces
What are Namespaces for?
  Distinguish
    different XML applications may use the same names
      at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
  Group
    names of elements and attributes of the same XML application can be grouped together
      to be more easily recognised and handled
  Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
Syntax for Namespace Use
  Qualified names
    prefix:local_part
  Examples of qualified names
    or QNames, or raw names
      rdf:description, xlink:type, xsl:template
  Used for both element and attribute names
Associating Prefixes to URIs
  Example
    a large firm could have a number of namespaces for different purposes
      <company xmlns:local="http://www.company.it/xml"
               xmlns:euro="http://www.company.eu/xml"
               xmlns:world="http://www.company.com/xml">
    then you can use local, euro, and world as prefixes everywhere inside the company element
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
  URIs are standardised, not prefixes
    but usually svg, rdf, and other well-known prefixes are not re-defined
    also, URIs are conventional names
      not necessarily pointing to an actual resource
Setting Default Namespaces
  xmlns attribute
    alone, with no :prefix suffix
      <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
      no need to make the svg prefix explicit
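How prefixes and default namespaces resolve can be checked with a namespace-aware parser. In the Java sketch below, the two namespace URIs under example.org and the class name are hypothetical; the point is that two elements with the same local name set end up in different namespaces, one through a prefix and one through the default declaration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NamespaceDemo {
    // Lists the child elements of the root as "{namespace URI}local name".
    static List<String> childNames(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> names = new ArrayList<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add("{" + n.getNamespaceURI() + "}" + n.getLocalName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Both URIs are made-up identifiers used only for this example.
        String xml = "<doc xmlns=\"http://example.org/doc\" xmlns:m=\"http://example.org/math\">"
                   + "<m:set/><set/>"
                   + "</doc>";
        System.out.println(childNames(xml));
    }
}
```

The prefix m is only a local shorthand: what identifies the namespace is the URI it is bound to.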
Internationalisation
What does Text Mean?
  "Text" can be encoded according to many different alphabets
    mapping between characters and integers (code points)
      character set
    ASCII being the most (in)famous, now Unicode
  A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
  Any XML document is a text document
    so its encoding should be declared
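The difference between a character set and its encodings can be made concrete with a few lines of Java. This sketch, with illustrative class and method names, takes one string and counts the bytes produced by three encodings; the word used is just a sample containing one non-ASCII letter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    // Number of bytes needed to store the text in the given encoding.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "Universita" with a grave accent on the final a (code point U+00E0).
        String word = "Universit\u00e0";
        System.out.println(word.length());                                  // 10 characters
        System.out.println(byteLength(word, StandardCharsets.ISO_8859_1));  // 10 bytes
        System.out.println(byteLength(word, StandardCharsets.UTF_8));       // 11 bytes
        System.out.println(byteLength(word, StandardCharsets.UTF_16BE));    // 20 bytes
    }
}
```

The code points are the same in all three cases; only the mapping onto bytes changes, which is why a reader must know the encoding to decode the document correctly.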
The XML Encoding Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
    Unicode and ISO are the most used families
  Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
      it is then a text declaration, no longer an XML declaration
Multi-Lingual Documents
  Example: a spell-checker or a voice-reader parsing an XML doc
  How to determine the language of a subpart?
    for multi-lingual docs
  xml:lang attribute
    can be associated to any element
    determines the language of the element
  Values are to be found in ISO 639
    standard two-letter codes for each known language
    if not there, IANA
      prefix i-
      such as i-navajo, i-klingon, …
    if not there either, such as for user-defined tags
      prefix x-
      such as x-quenya
Encoding for Portability
  Working around encoding is not simply an "internationalisation" issue
    it is also about portability
  When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML's abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML, however, can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    the Web-based one: XML + CSS
    the XML-based one: XSL
  In the following, we explore the XML + CSS approach
Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents
  W3C standard
    http://www.w3.org/Style/CSS/
  Goals
    describing how to present the elements of a document
      spanning over a range of different media
    separating style description from content and structure
  In this course we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
  [example shown as an image in the original slides, not reproduced in this transcript]
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things are needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing instruction
    to associate CSS to XML
      <?xml-stylesheet type="text/css" href="filename.css"?>
  CSS style sheet
    defining the presentation style for the XML document tags
      tagname { property1: value1; … }
  No need for a DTD or a Schema
    even though the browser could complain anyway…
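XML + CSS Example: The XML Doc
  [XML source shown as an image in the original slides, not reproduced in this transcript]
Example: How Mozilla Visualises it [without CSS Style Sheet]
  [browser screenshot in the original slides]
Example: How Mozilla Visualises it [with CSS Style Sheet]
  [browser screenshot in the original slides]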
DOM & SAX
Manipulating XML
  Representing information in an XML document
    and presenting it somehow
  is not enough for most non-trivial application scenarios
  Most often we need to manipulate
    access / delete / modify
      parts of an XML document
      which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used / hated / deprecated / widespread are
    DOM
    SAX
Document Object Model
  http://www.w3.org/DOM/
    W3C standard, as usual
  The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals, and scope
DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
    possibly huge memory consumption
  It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
      primitive navigation and manipulation of XML trees
      other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
      adds Namespace support and minor new features
      other Level 2 parts: Events, Views, Style, Traversal and Range
    Level 3 Core: W3C Working Draft, April 2002
      adds minor new features
DOM Nodes
  An XML document is a tree
  The tree contains nodes
    one of them is the root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
  The DOM specification states what a node can be
    document, document fragment, document type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
  It also defines which kinds of child nodes each of them should / could have
Properties & Methods of DOM
  Every DOM node has properties and methods to explore and update the XML tree
  Every DOM node has a name, a value, a type
  There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
  Then, any specific kind of node has its own specific properties and methods
  These properties and methods are made available by a suitable API for the language of choice
    many solutions for Java
    see for instance http://java.sun.com/xml/jaxp
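The two general members named above, attributes and appendChild, map onto getAttributes() and appendChild() in the Java binding of DOM. The sketch below, assuming the JAXP parser bundled with the JDK and using illustrative class and method names, adds a team to a player document and then reads every team back together with its current attribute.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomEditDemo {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Appends <team current="yes|no">name</team> after the other children of the root.
    static void addTeam(Document doc, String name, boolean current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current ? "yes" : "no");
        team.appendChild(doc.createTextNode(name));
        doc.getDocumentElement().appendChild(team);
    }

    // Lists every team element as "name (current=value)", in document order.
    static List<String> teams(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("team");
        for (int i = 0; i < nodes.getLength(); i++) {
            Node team = nodes.item(i);
            String current = team.getAttributes().getNamedItem("current").getNodeValue();
            result.add(team.getTextContent() + " (current=" + current + ")");
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<player><name>Carlo</name><surname>Nervo</surname>"
                           + "<team current=\"yes\">Bologna</team></player>");
        addTeam(doc, "Mantova", false);
        System.out.println(teams(doc));
    }
}
```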
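A Simple Java DOM
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      // parse the file named on the command line with the Xerces DOM parser
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      // scan the children of the root element for the first one named "recipe"
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }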
Main Problem of DOM
  The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has had problems of acceptance
    but indeed has a long tail of followers
    and also good reasons to exist
Simple API for XML (SAX)
  Unlike DOM, SAX is event-based
  It sees the document not as a tree, but as a text document
    flowing through the SAX parser
    and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not as well-supported as DOM
    in terms of standardisation
    as well as of tools
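The event-based model can be sketched in Java with the SAX classes bundled with the JDK. The handler below, whose class and method names are illustrative, simply records the order in which document and element events arrive; no tree is ever built. Character events are left out on purpose, because a parser is allowed to deliver one run of text in several chunks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventsDemo {
    // Parses the text and returns the sequence of SAX events that were fired.
    static List<String> events(String xml) throws Exception {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument()   { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes attributes) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(events("<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>"));
    }
}
```

Since each event is handled and then forgotten, memory use stays roughly constant regardless of document size, at the price of the application having to keep track of any context it needs.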
XML X for eXtensibilityBasic idea of XML
a simple meta-language for humans and automatato build electronic documentsallowing users to define ad hoc markup languages
ThenXML is quite free in generalit can be ldquoextended
actually specialisedto define more specific ad hoc markup languages
No predefined XML markups as it happens instead in HTMLthey need to be defined
who does define themcan we do this how
7
Hey too many Languages already
Application domains are more and morenumerouscomplexspecific
Special specialised languages as the engineers toolsto represent denote amp express behaviours and computations
Engineers working with computational ICT systems will be called to use a number of different artificial languages but also
to know and understand computational models and paradigmsto select languages and paradigmsto define and build new languages
ldquoLaurea Specialistica in InformaticardquoldquoLinguaggi e modelli computazionalirdquo ldquoIngegneria del SWrdquo
8
XML Applications
XML per se is ldquosmallrdquo amp simplelanguages defined via XML are instead so many and complex
XML ApplicationsXML-defined markup languages
defined through a precise syntaxDTD or XML Schema
they may be either standard or customMost standard XML applications are W3C
such asXSLTXML SchemaXHTML
9
XML for Portable Data
Cross-platform long-term data formatpassing XML data through space and timealong with Unicode and text-base standard format
Text text textboth data and markupall in the XML file
XML document structure simple amp cleareasy to parsewell-documented
That is why XML is already everwhere
10
How XML Looks likeltxml version=10 encoding=utf-8gt
ltdocrootgt
ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt
ltbodygt
ltpgtA list of things I likeltpgt
ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt
ltbodygtltdocrootgt
11
How XML Looks like from a Browser
12
How to Work with XML
XML is textso any text-editor is perfectly fine
A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better
Visualisation is a different matterbrowsers do something
but XML is not a presentation language sohellipwe need to understand
what an XML document ishow XML works
13
What is an XML Document
It can beA text fileA record in a databaseA run-time construction in memoryhellip
In any case it can be handled and trasmitted by any system capable of dealing with text documents
ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt
14
How does XML WorkWho handles XML documents
after it has been producedhow why
XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax
XML validating parserswhen applicable
there is either a DTD or a Schemachecking validity
Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc
15
Where is XML actually used
Everywhere already
16
Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from
HTML was the first successful application of SGMLbut had obvious limitations
too complexmore than 150 pagesnever implemented fully
too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)
XML 10 (February 1998)Then a flow
namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc
17
XML Fundamentals
A Simple XML Document
ltplayergt Carlo Nervoltplayergt
19
XML Document amp Files
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp Tags
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervotags are markup
the most common form of markup but there are other kindscontent is character data
including the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is an XML Tree
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed24
Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML Attributes
  Elements can be labelled by attributes
    attributes are specified in the start tag
      and in the only tag of empty elements
    any number of attributes can in principle be associated with an element
  An attribute is a name-value pair of the form name="value"
    alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
    only one attribute with a given name is allowed per element
  Attributes do not change the tree structure of an XML document
    but they are qualifiers for the nodes and leaves of the tree
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
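The attribute rules above can be checked with a small sketch in Python's xml.etree.ElementTree. The second team deliberately uses single quotes and spaces around the equals sign; the helper duplicate_rejected is an assumption for the illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<player><team current="yes">Bologna</team>'
    "<team current = 'no'>Mantova</team></player>"
)
# Attributes qualify the team nodes without adding nodes to the tree.
current = {team.text: team.get("current") for team in root.iter("team")}

def duplicate_rejected() -> bool:
    """Two attributes with the same name on one element are not well-formed."""
    try:
        ET.fromstring('<team current="yes" current="no">Bologna</team>')
    except ET.ParseError:
        return True
    return False
```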
Using Elements or Attributes?
  Attributes are for meta-data about the element, and content is information of the element
    maybe, but then it is not easy to clearly distinguish between the two
  Element-based structure is more flexible than attribute-based
    attributes provide for a flat data structure, elements can be nested as needed
    attributes are unique within an element, any number of elements of the same type can be used within an element
  Attributes are quite useful in narrative-based XML documents
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>
XML Names
  XML Names are used, and are the same, for the names of elements, attributes, and some other constructs
    to increase efficiency and abate complexity
  An XML name can include
    any letter
      latin, or even non-latin like ideographs
    any digit
    underscore, hyphen and period (_ - .)
    a colon (:) is reserved for namespaces
  An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with letters, ideographs or underscore
Parsed Character Data
  An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
  All characters are interpreted as character data to be parsed
    unless an escape character & is encountered
    character data to parse starts again after the ; char
  E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
  becomes the parsed character data
    Batman & Robin
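The superheroes example can be run through a parser to see both directions, parsing and serialising. A minimal sketch with Python's xml.etree.ElementTree; the helper bare_less_than_rejected is an assumption for the illustration.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring("<superheroes>Batman &amp; Robin</superheroes>")
parsed_text = element.text                             # what the application sees
serialised = ET.tostring(element, encoding="unicode")  # escaped again when written out

def bare_less_than_rejected() -> bool:
    """An unescaped < in content is taken as the start of a tag."""
    try:
        ET.fromstring("<code>a < b</code>")
    except ET.ParseError:
        return True
    return False
```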
Entity References
  &entityreference;
    an entity is something defined outside the normal flow of the XML document
      out of the XML tree
    used for constants, common values, external values, etc.
      through an entity reference
  Users of any sort may define their own entities
    we'll see how soon, for instance through DTDs
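A user-defined entity can be sketched as follows, assuming an entity named club declared in an internal DTD subset (both the name and its value are made up for the illustration). Python's xml.etree.ElementTree expands such internal entities while parsing.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE page [
  <!ENTITY club "Bologna FC">
]>
<page>Carlo Nervo plays for &club;.</page>"""

page = ET.fromstring(DOC)   # &club; is replaced by its declared value

def undefined_rejected() -> bool:
    """A reference to an entity that was never declared is an error."""
    try:
        ET.fromstring("<page>&club;</page>")
    except ET.ParseError:
        return True
    return False
```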
Pre-defined XML Entities
  Markup    Entity   Description
  &lt;      <        less-than
  &gt;      >        greater-than
  &amp;     &        ampersand
  &quot;    "        double quote
  &apos;    '        single quote
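When XML is produced by a program, the pre-defined entities are usually inserted by a library rather than by hand. A minimal sketch with the escape and unescape helpers of Python's xml.sax.saxutils; the sample strings are made up.

```python
from xml.sax.saxutils import escape, unescape

source = "if (a < b && b > c)"
escaped = escape(source)            # handles &, < and > by default
restored = unescape(escaped)

# Quotes are only escaped on request, e.g. for use inside attribute values.
quotes = {'"': "&quot;", "'": "&apos;"}
escaped_attr = escape("O'Neil says \"hi\"", quotes)
```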
CDATA Sections
  Including code chunks from any language with < or & can be tedious
    we need to tell the parser "do not parse this"
    good, for instance, to include segments of XML code to show
  CDATA section
    between <![CDATA[ and ]]>
    can contain anything but its own delimiters
  After parsing, there is no way to tell whether a text came from a CDATA section or not
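The last point can be verified directly: the same code chunk, written once in a CDATA section and once with entity references, yields identical parsed text. A minimal sketch with Python's xml.etree.ElementTree; the code element is made up for the illustration.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<code><![CDATA[if (a < b && b > c) run();]]></code>"
)
with_entities = ET.fromstring(
    "<code>if (a &lt; b &amp;&amp; b &gt; c) run();</code>"
)
```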
Comments
  Easy
    <!-- Comment -->
  It cannot contain --, nor can it end with --->
  Comments do not affect the document tree structure
    they can appear anywhere, even before the root element
    but not inside a tag or a comment
  Parsers may either drop or keep them, at their will
  Comments are meant to improve human legibility of XML docs
    to give info to computational agents, use processing instructions
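That parsers may drop or keep comments can be observed with two parsers from the Python standard library: with default settings, xml.etree.ElementTree discards comments while xml.dom.minidom keeps them as nodes. This is a sketch of that default behaviour, not a statement about parsers in general.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

DOC = "<!-- before the root --><player><!-- inside -->Carlo Nervo</player>"

et_root = ET.fromstring(DOC)        # comments are dropped
dom = minidom.parseString(DOC)      # comments are kept as comment nodes
top_level_comments = [n.data for n in dom.childNodes
                      if n.nodeType == n.COMMENT_NODE]
inner_comments = [n.data for n in dom.documentElement.childNodes
                  if n.nodeType == n.COMMENT_NODE]
```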
XML Processing Instructions
  Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
  Processing instructions have this very end
    <?target … ?>
  The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
  A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root
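How a processing instruction reaches the application can be sketched with Python's xml.dom.minidom, which exposes it as a node with a target and a data string. The file name player.css is a made-up placeholder.

```python
from xml.dom import minidom

DOC = ('<?xml-stylesheet type="text/css" href="player.css"?>'
       "<player>Carlo Nervo</player>")

dom = minidom.parseString(DOC)
# The instruction sits before the root element, outside the element tree.
instructions = [(n.target, n.data) for n in dom.childNodes
                if n.nodeType == n.PROCESSING_INSTRUCTION_NODE]
```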
The XML Declaration
  Looks like an XML processing instruction
    but it is not: it is just the XML declaration
  It is optional
    but, if present, it must be the very first thing in the document
      not even comments are allowed before it
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Version is the XML version (1.0, 1.1, …)
  Encoding is the form of the text (Unicode in the example)
    optional, default Unicode
  Standalone means that it has no external DTD
    optional, default no
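The positional rule for the XML declaration can be checked with a short sketch in Python's xml.etree.ElementTree; the helper name accepted is an assumption for the illustration.

```python
import xml.etree.ElementTree as ET

def accepted(document: bytes) -> bool:
    """Return True if the parser accepts `document`."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

declared_first = accepted(
    b'<?xml version="1.0" encoding="utf-8" standalone="no"?><player/>')
omitted = accepted(b"<player/>")                                    # optional
comment_first = accepted(b'<!-- note --><?xml version="1.0"?><player/>')
space_first = accepted(b' <?xml version="1.0"?><player/>')
```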
Checking Well-Formedness
  Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
  Tools on the Web
    just look around
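Any non-validating parser already works as a well-formedness checker. The sketch below, based on Python's xml.etree.ElementTree, reports where the first violation was found; the function name first_error and the sample documents are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET

def first_error(text: str):
    """Return None for well-formed input, else (line, column) of the first error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return err.position
    return None

samples = {
    "well-formed": "<player><name>Carlo</name></player>",
    "overlapping": "<em><b>Wrong </em> XML</b>",
    "two roots": "<name>Carlo</name><surname>Nervo</surname>",
    "unquoted value": "<team current=yes>Bologna</team>",
    "unescaped &": "<superheroes>Batman & Robin</superheroes>",
}
report = {label: first_error(text) for label, text in samples.items()}
```

The parser stops at the first violation, so a document with several problems has to be fixed and checked again.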
DTD
Flexibility or Rigidity?
  XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
  Sometimes some strict rule is required
    some control over syntax should be enforced
      like "a football player should have at least one team"
  Document Type Definition (DTD)
    to define which XML documents are valid
  Validity is not mandatory, as well-formedness is
    how to handle errors is optional
Validation
  A valid XML document includes a DTD that the document satisfies
  Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
  Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
  Many things a DTD does not say
    we stick with what we can specify
DTD is…
  SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
  It allows XML designers to define a grammar for their documents
    typical syntax-based approach
      maybe limited, but easy to implement
  Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
A Simple DTD Example
  We do not go too deep into DTD syntax
    we just look at the example below and comment
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
DTD Declaration
  DTD is declared here as internal
    but could be declared separately
      <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
      <!DOCTYPE player SYSTEM "http://…">
DTD Declarations: Define or Use
  So, you may
    define your own DTD, and
      either include it in your XML document
      or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
      like a working group you belong to, or a standardisation body of any sort
      by referring to that externally-defined syntax for your XML docs
Element Declarations
  A player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)
Some Syntax
  , is for sequence
    to define ordered lists
  | is for choice
    to provide for alternatives
  suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
  parentheses for grouping
    at any level of indentation
    operators and suffixes applicable at any level
  ANY for free-form content
  EMPTY for empty element
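The operators above are those of regular expressions, so a content model can be sketched as a regular expression over the sequence of child element names. The code below is only an illustration of that idea in Python, not a validating parser: the functions model_to_regex and children_match are hypothetical helpers, and #PCDATA, ANY and EMPTY are not handled.

```python
import re
import xml.etree.ElementTree as ET

def model_to_regex(model: str) -> "re.Pattern[str]":
    """Turn a content model such as "(name, surname, team+)" into a regex
    matching a sequence of child names, each followed by one space."""
    compact = re.sub(r"\s+", "", model)                 # "(name,surname,team+)"
    named = re.sub(r"[A-Za-z_][\w.-]*",
                   lambda m: "(?:" + re.escape(m.group(0)) + " )",
                   compact)
    return re.compile(named.replace(",", ""))           # sequence = concatenation

def children_match(element, model: str) -> bool:
    """Check the children of `element` against the content model."""
    names = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(names) is not None

MODEL = "(name, surname, team+)"
ok = ET.fromstring("<player><name>Carlo</name><surname>Nervo</surname>"
                   "<team>Bologna</team><team>Mantova</team></player>")
no_team = ET.fromstring("<player><name>Carlo</name><surname>Nervo</surname></player>")
wrong_order = ET.fromstring("<player><surname>Nervo</surname><name>Carlo</name>"
                            "<team>Bologna</team></player>")
```

A real validating parser also checks attributes, entities and the declared root element type; this sketch covers only the ordering and occurrence rules of one element declaration.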
Attribute Declarations
  A team element has a current attribute
    which is mandatory
      #IMPLIED would say optional instead
    and can be either yes or no
      enumeration as an attribute type
Attribute Defaults
  #IMPLIED
    the attribute is optional
  #REQUIRED
    the attribute is mandatory
  #FIXED
    whether it is explicitly specified or not, it has a given value
  literal
    the default value is the literal quoted string
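Default values declared in an internal DTD subset are supplied by the parser even without validation. The sketch below assumes a variant of the team declaration with a literal default for current and a made-up #FIXED attribute sport, and reads the result with Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE player [
  <!ATTLIST team current (yes | no) "no"
                 sport   CDATA      #FIXED "football">
]>
<player><team current="yes">Bologna</team><team>Mantova</team></player>"""

bologna, mantova = ET.fromstring(DOC).findall("team")
```

The second team carries no attributes in the document text, yet the application sees both current and sport on it.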
Attribute Types
  CDATA
    any string of text acceptable in a well-formed XML attribute value
  NMTOKEN, NMTOKENS
    like an XML name, but any name character is accepted as the first character
    the plural form accepts more than one, separated by white space
  ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
  ID
    an XML name unique in the document, working as an identifier
  IDREF, IDREFS
    reference(s) to IDs in the document
  NOTATION
    name of a notation used & defined in the document (rare)
  enumeration
Other DTD Declarations etc.
  ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
  NOTATION declarations
    who cares, actually
  We stop here
    more only for those who need it
Namespaces
What are Namespaces for?
  Distinguish
    different XML applications may use the same names
      at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
  Group
    names of elements and attributes of the same XML application can be grouped together
      to be more easily recognised and handled
  Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
Syntax for Namespace Use
  Qualified names
    prefix:local_part
  Examples of qualified names
    or QNames, or raw names
    rdf:description, xlink:type, xsl:template
  Used for both element and attribute names
Associating Prefixes to URI
  Example
    a large firm could have a number of namespaces for different purposes
      <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3c.org/TR/REC-rdf-syntax">
  URI are standardised, not prefixes
    but usually svg, rdf, and other prefixes are not re-defined
    also, they are conventional names
      not necessarily pointing to an actual resource
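That the URI, not the prefix, identifies the namespace becomes visible once a document is parsed. In the sketch below, based on Python's xml.etree.ElementTree, the office elements and their content are made up; the query deliberately uses a different prefix (it) from the one in the document (local).

```python
import xml.etree.ElementTree as ET

DOC = """<company xmlns:local="http://www.company.it/xml"
         xmlns:world="http://www.company.com/xml">
  <local:office>Cesena</local:office>
  <world:office>Boston</world:office>
</company>"""

root = ET.fromstring(DOC)
tags = [child.tag for child in root]       # names are expanded to {URI}local_part

# Prefixes are local shorthands: any prefix bound to the same URI will do.
query_prefixes = {"it": "http://www.company.it/xml"}
local_offices = [e.text for e in root.findall("it:office", query_prefixes)]
```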
Setting Default Namespaces
  xmlns attribute
    alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
      no need for the svg prefix to be made explicit
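The effect of a default namespace can be inspected after parsing. This sketch uses Python's xml.etree.ElementTree on a tiny SVG fragment whose sizes and rect child are made up for the illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
svg = ET.fromstring(
    '<svg xmlns="' + SVG_NS + '" width="10" height="10">'
    '<rect width="5" height="5"/></svg>'
)
child_tags = [child.tag for child in svg]
```

Unprefixed elements end up in the default namespace, while unprefixed attributes such as width do not.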
Internationalisation
What does “Text” Mean?
  “Text” can be encoded according to so many different alphabets
    mapping between characters and integers (code points)
      character set
    ASCII being the most (in)famous, now Unicode
  A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
  Any XML document is a text document
    so encoding should be declared
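The difference between a character set (code points) and an encoding (bytes) can be shown in a few lines of Python; the sample word is arbitrary.

```python
text = "Università"

code_points = [ord(ch) for ch in text]    # character set: one integer per character
utf8 = text.encode("utf-8")               # same code points, different byte layouts
utf16 = text.encode("utf-16-le")
latin1 = text.encode("iso-8859-1")
```

The ten characters take 11 bytes in UTF-8 (the final à needs two), 20 bytes in UTF-16 and 10 bytes in Latin-1, which is why a reader has to be told which encoding was used.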
The XML Encoding Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
    Unicode and ISO are the most used families
  Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
      it is a text declaration, but no longer an XML declaration
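What the encoding declaration buys can be sketched by feeding raw bytes to a parser, here Python's xml.etree.ElementTree. The word element and the helper undeclared_rejected are made up for the illustration.

```python
import xml.etree.ElementTree as ET

BODY = "<word>Università</word>"

latin1_doc = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + BODY).encode("iso-8859-1")
utf8_doc = ('<?xml version="1.0" encoding="utf-8"?>' + BODY).encode("utf-8")

from_latin1 = ET.fromstring(latin1_doc).text   # decoded as declared
from_utf8 = ET.fromstring(utf8_doc).text

def undeclared_rejected() -> bool:
    """Latin-1 bytes without a declaration are read as UTF-8 and fail."""
    try:
        ET.fromstring(BODY.encode("iso-8859-1"))
    except ET.ParseError:
        return True
    return False
```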
Multi-Lingual Documents
  Example: a spell-checker or a voice-reader parsing an XML doc
  How to determine the language of a subpart?
    for multi-lingual docs
  xml:lang attribute
    can be associated with any element
    determines the language of the element
  Values are to be found in ISO 639
    standard: two letters for each language known
    if not there, IANA
      prefix i-
      such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
      prefix x-
      such as x-quenya
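A tool such as a spell-checker could resolve the language of each element as sketched below with Python's xml.etree.ElementTree. The sketch relies on the XML rule that xml:lang also applies to the content of the element unless overridden further down; the sample document and the languages helper are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this fixed namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = """<biography xml:lang="en">
  <p>He played for Bologna.</p>
  <p xml:lang="it">Ha giocato nel Bologna.</p>
</biography>"""

def languages(element, inherited=None):
    """Yield (element, language) pairs, inheriting xml:lang from ancestors."""
    lang = element.get(XML_LANG, inherited)
    yield element, lang
    for child in element:
        yield from languages(child, lang)

root = ET.fromstring(DOC)
resolved = [(element.tag, lang) for element, lang in languages(root)]
```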
Encoding for Portability
  Working around encoding is not simply an “internationalisation” issue
    it is also about portability
  When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML, however, can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following, we explore the XML + CSS issue
Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
  Standard W3C
    http://w3c.org/Style/CSS
  Goals
    describing how to present elements of a document
      spanning over a range of different media
    separating style description from content and structure
  In this course we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing directive
    to associate CSS to XML
      <?xml-stylesheet type="text/css" href="filename.css" ?>
  CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
  No need for DTD or Schema
    even though the browser could anyway complain…
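The association between an XML document and its style sheet can also be added programmatically. This is a minimal sketch with Python's xml.dom.minidom; the file name player.css and the sample rules in the comment are made up.

```python
from xml.dom import minidom

dom = minidom.parseString("<player><name>Carlo</name></player>")

# A style sheet for this document could contain rules such as
#   player { display: block; }   name { font-weight: bold; }
directive = dom.createProcessingInstruction(
    "xml-stylesheet", 'type="text/css" href="player.css"')
dom.insertBefore(directive, dom.documentElement)   # placed before the root element

xml_text = dom.toxml()
```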
XML + CSS Example: The XML Doc
Example: How Mozilla Visualises it [without CSS Style Sheet]
Example: How Mozilla Visualises it [with CSS Style Sheet]
DOM & SAX
Manipulating XML
  Representing information in an XML document
    and presenting it somehow
  is not enough for most non-trivial application scenarios
  Mostly, we need to manipulate
    access, delete, modify
  parts of an XML document
    which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used / hated / deprecated / widespread are
    DOM
    SAX
Document Object Model
  http://www.w3.org/DOM/
    standard W3C, as usual
  The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope
DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
  It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
      primitive navigation and manipulation of XML trees
      other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
      adds Namespace support and minor new features
      other Level 2 parts: Events, Views, Style, Traversal and Range
    Level 3 Core: W3C Working Draft, April 2002
      adds minor new features
DOM Nodes
  An XML document is a tree
  The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
  The DOM specification states that a node can be a
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
  It also defines which kind of child nodes they should / could have
Properties & Methods of DOM
  Every DOM node has properties and methods to explore and update the XML tree
  Every DOM node has a name, a value, a type
  There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
  Then, any specific kind of node has its own specific properties and methods
  These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
    see for instance http://java.sun.com/xml/jaxp
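The generic node properties and methods listed above can be tried out with Python's xml.dom.minidom, one of the language bindings of the DOM API. The document and the added team are taken from the running player example; this is a sketch, not a tour of the whole API.

```python
from xml.dom import minidom

dom = minidom.parseString('<player><team current="yes">Bologna</team></player>')
player = dom.documentElement

team = player.firstChild
team_info = (team.nodeName, team.nodeType == team.ELEMENT_NODE, team.nodeValue)
text_node = team.firstChild                    # the content is a node of its own
current = team.attributes["current"].value     # attributes: all attributes of the node

# appendChild adds the new node after the existing children.
new_team = dom.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(dom.createTextNode("Mantova"))
player.appendChild(new_team)
```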
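A Simple Java DOM
  import java.io.PrintStream;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;
  // DOMParser is assumed to be the Apache Xerces parser class
  import org.apache.xerces.parsers.DOMParser;

  public static void main(String[] args) {
    try {
      // parse the file named on the command line into a DOM tree
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      // look for the first child of the root named "recipe"
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }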
Main Problem of DOM
  The XML document is loaded as a whole, and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist
Simple API for XML (SAX)
  Differently from DOM, SAX is event-based
  It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as soon as: document started / ended, elements started / ended, character content, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools
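The event-based model can be sketched with Python's xml.sax module. The handler class EventLog is an assumption for the illustration: it only records the events it receives, which is enough to see that no tree is ever built.

```python
import xml.sax

class EventLog(xml.sax.ContentHandler):
    """Record each SAX event as it flows out of the parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("start-document",))

    def endDocument(self):
        self.events.append(("end-document",))

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs)))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        if content.strip():                      # skip whitespace-only chunks
            self.events.append(("text", content))

log = EventLog()
xml.sax.parseString(b'<player><team current="yes">Bologna</team></player>', log)
```

An application keeps only what it needs from each event, so memory use does not grow with the size of the document.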
Hey, too many Languages already!
  Application domains are more and more
    numerous
    complex
    specific
  Special / specialised languages as the engineers' tools
    to represent, denote & express behaviours and computations
  Engineers working with computational / ICT systems will be called to use a number of different artificial languages, but also
    to know and understand computational models and paradigms
    to select languages and paradigms
    to define and build new languages
  “Laurea Specialistica in Informatica”
    “Linguaggi e modelli computazionali”, “Ingegneria del SW”
XML Applications
  XML per se is “small” & simple
    languages defined via XML are instead so many and complex
  XML Applications
    XML-defined markup languages
      defined through a precise syntax
      DTD or XML Schema
    they may be either standard or custom
  Most standard XML applications are W3C
    such as
      XSLT
      XML Schema
      XHTML
XML for Portable Data
  Cross-platform, long-term data format
    passing XML data through space and time
    along with Unicode and text-based standard format
  Text, text, text
    both data and markup
    all in the XML file
  XML document structure simple & clear
    easy to parse
    well-documented
  That is why XML is already everywhere
How XML Looks like
  <?xml version="1.0" encoding="utf-8"?>
  <docroot>
    <head> <title>This is my document</title> </head>
    <body>
      <p>A list of things I like</p>
      <list>
        <item>weekends</item>
        <item>good beer</item>
        <item>midnight snacks</item>
        <item>ice cream
          <list>
            <item>chocolate</item>
            <item>cookie dough</item>
            <item>white russian</item>
          </list>
        </item>
        <item>shade trees</item>
      </list>
    </body>
  </docroot>
How XML Looks like from a Browser
How to Work with XML
  XML is text
    so any text editor is perfectly fine
  A number of XML editors around
    but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
  Visualisation is a different matter
    browsers do something
    but XML is not a presentation language, so…
  We need to understand
    what an XML document is
    how XML works
What is an XML Document?
  It can be
    a text file
    a record in a database
    a run-time construction in memory
    …
  In any case, it can be handled and transmitted by any system capable of dealing with text documents
  <student>
    <studentname>
      <name>Carlo</name>
      <surname>Nervo</surname>
    </studentname>
    <studentnumber> 0000145678 </studentnumber>
    <course>2036</course>
  </student>
How does XML Work?
  Who handles XML documents
    after they have been produced
    how, why?
  XML parsers
    working out the structure of the XML document
    verifying well-formedness and basic respect of XML syntax
  XML validating parsers
    when applicable
      there is either a DTD or a Schema
    checking validity
  Examples
    web browsers, word processors, database servers, drawing programs, spreadsheets, programs in some language, etc.
Where is XML actually used?
  Everywhere already
Some History of XML & Related
  Lot to be written still…
  SGML is where it comes from
    HTML was the first successful application of SGML
    but had obvious limitations
      too complex
        more than 150 pages
        never implemented fully
      too complex for the Internet
  SGML “Lite” (1996, Bosak, Bray et al.)
    XML 1.0 (February 1998)
  Then a flow
    namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
XML Fundamentals
A Simple XML Document
  <player> Carlo Nervo</player>
XML Document & Files
  This is a complete XML document
  It can be stored / recorded / built in the form of a number of different files, or even in other forms
    Carlonervo.xml, player.txt
    a record in a database
    a memory area built by a CGI and then transmitted
    sent by a Web server with MIME type application/xml or text/xml
  <player> Carlo Nervo</player>
XML Elements & Tags
  The document contains a single element
    of type player
  Such an element is delimited by the tag player
    between start tag <player> and end tag </player>
  In between the tags lies the element's content: Carlo Nervo
    tags are markup
      the most common form of markup, but there are other kinds
    content is character data
      including the white space between Carlo & Nervo
  <player> Carlo Nervo</player>
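The split between markup and character data can be seen by parsing the one-element document, here with Python's xml.etree.ElementTree as a minimal sketch.

```python
import xml.etree.ElementTree as ET

player = ET.fromstring("<player> Carlo Nervo</player>")
element_type = player.tag     # from the markup
content = player.text         # the character data, white space included
```

The leading space and the space between the two words are part of the content: the parser does not trim them.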
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is an XML Tree
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed24
Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributesattributes are specified in the start tag
and in the only tag of empty elementsany number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or Attributes
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore28
Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML Entities
Markup Entity Descriptionamplt lt less-thenampgt gt grater-than
ampamp amp ampersandampquot double quoteampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
Comments
Easylt-- Comment --gt
It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-FormednessMain rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
Flexibility or Rigidity
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one teamDocument Type Definition (DTD)
to define which XML documents are validValidity is not mandatory as well-formedness
how to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43

Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty elements
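The operators above behave much like regular-expression operators over the names of child elements. The following is only a sketch under that assumption: the content model (name, surname, team+) is translated by hand into a Python regular expression, and `matches_player_model` is a hypothetical helper, not a DTD validator (Python's standard library does not validate against DTDs).

```python
import re
import xml.etree.ElementTree as ET

# Hand translation (an assumption of this sketch, not a general DTD compiler):
# (name, surname, team+) becomes a regular expression over the
# space-terminated names of the child elements.
PLAYER_MODEL = re.compile(r"name surname (team )+")


def matches_player_model(xml_text):
    """Return True if the root's children follow (name, surname, team+)."""
    root = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in root)
    return PLAYER_MODEL.fullmatch(child_names) is not None


GOOD = ("<player><name>Carlo</name><surname>Nervo</surname>"
        "<team>Bologna</team><team>Mantova</team></player>")
NO_TEAM = "<player><name>Carlo</name><surname>Nervo</surname></player>"
WRONG_ORDER = ("<player><surname>Nervo</surname><name>Carlo</name>"
               "<team>Bologna</team></player>")

for label, text in [("good", GOOD), ("no team", NO_TEAM), ("wrong order", WRONG_ORDER)]:
    print(label, matches_player_model(text))
```

The sequence operator fixes the order, and the + suffix demands at least one team, so only the first document matches.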

Attribute Declarations
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether or not it is explicitly specified, it has a given value
literal
  the default value is the literal quoted string
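Defaults declared in an internal DTD are visible even to a non-validating parser. A minimal sketch, assuming Python's `xml.etree.ElementTree` (built on expat, which reads the internal DTD subset); the `sport` attribute and the default value "no" are made up for illustration and are not part of the original example.

```python
import xml.etree.ElementTree as ET

# Internal DTD with a literal default and a #FIXED value.
# "sport" and the default "no" are illustrative assumptions.
DOC = """<!DOCTYPE player [
  <!ATTLIST team current (yes | no) "no">
  <!ATTLIST team sport CDATA #FIXED "football">
]>
<player>
  <team current="yes">Bologna</team>
  <team>Mantova</team>
</player>"""

teams = ET.fromstring(DOC).findall("team")
for team in teams:
    # Attributes missing from the start tag are filled in from the DTD.
    print(team.text, team.get("current"), team.get("sport"))
```

The second team carries no attributes in the document, yet the parser reports both the literal default and the fixed value.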

Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to IDs in the document
NOTATION
  name of a notation used & defined in the document (rare)
enumeration

Other DTD Declarations etc.
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces

What are Namespaces for
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespace Use
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes to URI
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local:, euro: and world: everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but usually svg, rdf and other prefixes are not re-defined
  also, they are conventional names
    not necessarily pointing to an actual resource
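That the URI, not the prefix, identifies a name can be checked with a namespace-aware parser. A small sketch, assuming Python's `xml.etree.ElementTree`, which reports every qualified name as `{URI}local_part`; the `office` elements and the alternative prefixes `it` and `eu` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Same URIs, different prefixes: "office" and the prefixes it/eu are
# illustrative assumptions, the company URIs come from the slide.
DOC_A = """<company xmlns:local="http://www.company.it/xml"
         xmlns:euro="http://www.company.eu/xml">
  <local:office>Cesena</local:office>
  <euro:office>Bruxelles</euro:office>
</company>"""

DOC_B = """<company xmlns:it="http://www.company.it/xml"
         xmlns:eu="http://www.company.eu/xml">
  <it:office>Cesena</it:office>
  <eu:office>Bruxelles</eu:office>
</company>"""


def qualified_tags(text):
    """Return the namespace-qualified names of the root's children."""
    return [child.tag for child in ET.fromstring(text)]


print(qualified_tags(DOC_A))
print(qualified_tags(DOC_A) == qualified_tags(DOC_B))
```

Both documents yield the same qualified names, while the two `office` elements inside each document stay distinct.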

Setting Default Namespaces
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
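The effect of a default namespace can be observed in the same way. In this sketch (again assuming `xml.etree.ElementTree`), the child elements `rect`, `g` and `circle` and the numeric sizes are placeholders chosen for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Child elements and sizes are illustrative placeholders.
DOC = ('<svg xmlns="' + SVG_NS + '" width="10" height="10">'
       "<rect/><g><circle/></g></svg>")

root = ET.fromstring(DOC)
names = [element.tag for element in root.iter()]  # document order
print(names)
print(root.attrib)
```

Every element inherits the default namespace; unprefixed attributes such as width and height do not, so they are reported with plain names.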

Internationalisation

What does Text Mean
“Text” can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
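The difference between code points and encodings can be made concrete in a few lines. A sketch using only Python built-ins; the sample word is arbitrary.

```python
# One string, one sequence of code points, several byte encodings.
text = "Università"  # arbitrary sample word ending in U+00E0

code_points = [ord(ch) for ch in text]
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")   # big-endian, no byte-order mark
latin1 = text.encode("iso-8859-1")

print(hex(code_points[-1]))
print(len(text), len(utf8), len(utf16), len(latin1))
```

The last character is always code point E0 (hexadecimal), but it occupies two bytes in UTF-8, two in UTF-16 and one in Latin-1.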

The XML Encoding Part of the XML Declaration
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
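The encoding declaration is what lets a parser turn bytes back into the right characters. A minimal sketch, assuming `xml.etree.ElementTree` fed with raw bytes; the `city` element and its content are invented for the example.

```python
import xml.etree.ElementTree as ET

# The same document serialised with two encodings, each declared correctly.
# The <city> element is an illustrative assumption.
LATIN1 = '<?xml version="1.0" encoding="ISO-8859-1"?><city>Forlì</city>'.encode("iso-8859-1")
UTF8 = '<?xml version="1.0" encoding="utf-8"?><city>Forlì</city>'.encode("utf-8")

# Latin-1 bytes wrongly labelled as UTF-8.
MISLABELLED = LATIN1.replace(b"ISO-8859-1", b"utf-8")


def city(document_bytes):
    """Parse the bytes and return the text of the root element."""
    return ET.fromstring(document_bytes).text


def parses(document_bytes):
    """Return True if the parser accepts the bytes."""
    try:
        ET.fromstring(document_bytes)
        return True
    except ET.ParseError:
        return False


print(city(LATIN1) == city(UTF8))
print(parses(MISLABELLED))
```

With a correct declaration both byte sequences yield the same text; with the wrong label the parser rejects the document instead of guessing.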

Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each known language
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
    such as x-quenya
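A tool such as the spell-checker mentioned above has to work out the language in force for each element. The sketch below assumes `xml.etree.ElementTree`, which reports xml:lang under the reserved XML namespace URI, and relies on the XML rule that xml:lang also applies to nested elements until overridden; the `languages` helper and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved "xml" namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(element, inherited=None):
    """Yield (tag, language) pairs; children inherit xml:lang unless they override it."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)


# Illustrative document, not taken from the slides.
DOC = """<doc xml:lang="en">
  <p>Good morning</p>
  <p xml:lang="it">Buongiorno <em>a tutti</em></p>
</doc>"""

result = list(languages(ET.fromstring(DOC)))
print(result)
```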

Encoding for Portability
Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example
[example shown as an image on the slide]

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
No need for DTD or Schema
  even though the browser could anyway complain…
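The association between document and style sheet is ordinary markup that any parser hands over to the application. A sketch assuming Python's `xml.dom.minidom`; the file name `player.css` is made up.

```python
from xml.dom import minidom

# "player.css" is a hypothetical style sheet name.
DOC = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="player.css"?>
<player><name>Carlo</name></player>"""

doc = minidom.parseString(DOC)

# Processing instructions before the root are children of the document node.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

The parser does not interpret the instruction; it only reports target and data, and it is up to the browser (or any other application) to fetch and apply the style sheet.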

XML + CSS Example: The XML Doc
[shown as an image on the slide]

Example: How Mozilla Visualises it [without CSS Style Sheet]
[shown as an image on the slide]

Example: How Mozilla Visualises it [with CSS Style Sheet]
[shown as an image on the slide]

DOM & SAX

Manipulating XML
Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
We often need to manipulate
  access, delete, modify
parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM
  SAX

Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
    other Level 2 parts: Events, Views, Style, Traversal and Range
  Level 3 Core: W3C Recommendation, April 2004
    adds minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have

Properties & Methods of DOM
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
  see for instance http://java.sun.com/xml/jaxp
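The general node properties and methods listed above can be tried out in any DOM binding. The sketch below assumes Python's `xml.dom.minidom`, a lightweight DOM implementation from the standard library, rather than the Java APIs the slides point to.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<player><name>Carlo</name><team current="yes">Bologna</team></player>'
)
player = doc.documentElement
team = player.getElementsByTagName("team")[0]
text = team.firstChild

# Every node has a name, a value and a type.
print(team.nodeName, team.nodeValue, team.nodeType == team.ELEMENT_NODE)
print(text.nodeName, text.nodeValue, text.nodeType == text.TEXT_NODE)

# attributes gives access to all the attributes of the node.
current = team.attributes["current"].value

# appendChild(newChild) appends newChild after the other child nodes.
old_team = doc.createElement("team")
old_team.setAttribute("current", "no")
old_team.appendChild(doc.createTextNode("Mantova"))
player.appendChild(old_team)

serialized = player.toxml()
print(serialized)
```

Element nodes have no value of their own; the character data lives in a separate text node, which is why `nodeValue` is `None` for `team` and "Bologna" for its first child.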
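
A Simple Java DOM
// Imports assume the Apache Xerces parser; the slide does not name the library.
import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public static void main(String[] args) {
  try {
    // Parse the file named on the command line into a DOM tree.
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    // Scan the children of the root until the first recipe element.
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);  // helper that serialises a node; not shown on the slide
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}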

Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree but as a text doc
  flowing through the SAX parser
  and generating events such as document started / ended, element started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well-supported as DOM is
  in terms of standardisation
  as well as of tools
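The event-based model is easiest to see by logging the events. A sketch assuming Python's `xml.sax` module; the `EventLogger` class is a hypothetical handler written for this example.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record the SAX events in the order the parser fires them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

    def characters(self, content):
        if content.strip():  # skip whitespace-only chunks
            self.events.append("text " + content.strip())


handler = EventLogger()
xml.sax.parseString(
    b"<player><name>Carlo</name><team current='yes'>Bologna</team></player>",
    handler,
)
print("\n".join(handler.events))
```

No tree is ever built: the handler sees each piece once, in document order, and keeps only what it chooses to keep, which is where the memory savings come from.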

XML Applications
XML per se is “small” & simple
  languages defined via XML are instead many and complex
XML Applications
  XML-defined markup languages
    defined through a precise syntax
    DTD or XML Schema
  they may be either standard or custom
Most standard XML applications are W3C
  such as
    XSLT
    XML Schema
    XHTML

XML for Portable Data
Cross-platform, long-term data format
  passing XML data through space and time
  along with Unicode and text-based standard formats
Text, text, text
  both data and markup
  all in the XML file
XML document structure: simple & clear
  easy to parse
  well-documented
That is why XML is already everywhere

How XML Looks Like
<?xml version="1.0" encoding="utf-8"?>
<docroot>
  <head>
    <title>This is my document</title>
  </head>
  <body>
    <p>A list of things I like:</p>
    <list>
      <item>weekends</item>
      <item>good beer</item>
      <item>midnight snacks</item>
      <item>ice cream
        <list>
          <item>chocolate</item>
          <item>cookie dough</item>
          <item>white russian</item>
        </list>
      </item>
      <item>shade trees</item>
    </list>
  </body>
</docroot>

How XML Looks Like from a Browser
[shown as an image on the slide]

How to Work with XML
XML is text
  so any text editor is perfectly fine
A number of XML editors are around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works

What is an XML Document
It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text documents

<student>
  <studentname>
    <name>Carlo</name>
    <surname>Nervo</surname>
  </studentname>
  <studentnumber>0000145678</studentnumber>
  <course>2036</course>
</student>

How does XML Work
Who handles XML documents
  after they have been produced?
  how? why?
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets, programs in some language, etc.

Where is XML actually used
Everywhere already

Some History of XML & Related
Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
  but SGML had obvious limitations
    too complex
      more than 150 pages
      never implemented fully
    too complex for the Internet
SGML “Lite” (1996, Bosak, Bray et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

XML Fundamentals

A Simple XML Document
<player>
  Carlo Nervo
</player>

XML Document & Files
This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

<player>
  Carlo Nervo
</player>

XML Elements & Tags
The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
tags are markup
  the most common form of markup, but there are other kinds
content is character data
  including the white space between Carlo & Nervo

<player>
  Carlo Nervo
</player>

Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

XML Trees: A Simple Example
player
  name: Carlo
  surname: Nervo
  team: Bologna
  team: Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

An XML Document is an XML Tree
An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
  each element except the root has exactly one parent
  child elements of the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping not allowed

Narrative-Organised XML
<biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
  After playing many years in minor teams, such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
  …
</biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and leaves of the tree

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

Using Elements or Attributes
“Attributes are for meta-data about the element, and content is information of the element”
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>

XML Names
XML Names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and reduce complexity
An XML name can include
  any letter
    latin or even non-latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore

Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
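Escaping and parsing are inverse operations, which a short round trip makes visible. A sketch assuming `xml.etree.ElementTree` for parsing and `xml.sax.saxutils.escape` for escaping; the `code` element and the code snippet are arbitrary.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing turns the entity reference back into the character it stands for.
parsed = ET.fromstring("<superheroes>Batman &amp; Robin</superheroes>").text

# Escaping prepares arbitrary text for inclusion as character data.
snippet = "if (a < b && b > c)"  # arbitrary code-like text
escaped = escape(snippet)
round_trip = ET.fromstring("<code>" + escaped + "</code>").text

print(parsed)
print(escaped)
print(round_trip == snippet)
```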

Entity References
&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
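A user-defined entity can be sketched with an internal DTD. The example assumes `xml.etree.ElementTree`, which expands internal entities declared in the document's own DTD; the entity name `club` and its value are invented for illustration.

```python
import xml.etree.ElementTree as ET

# "club" is a hypothetical entity holding a constant used in the document.
DOC = """<!DOCTYPE player [
  <!ENTITY club "Bologna FC 1909">
]>
<player><name>Carlo</name><team>&club;</team></player>"""

team = ET.fromstring(DOC).find("team").text
print(team)
```

After parsing, the reference has been replaced by the entity's value; the application sees only the resulting text.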

Pre-defined XML Entities
Markup    Entity   Description
&lt;      <        less-than
&gt;      >        greater-than
&amp;     &        ampersand
&quot;    "        double quote
&apos;    '        single quote

CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser “do not parse this”
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not
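That a CDATA section leaves no trace after parsing can be checked directly. A minimal sketch, assuming `xml.etree.ElementTree`; the `code` element is arbitrary.

```python
import xml.etree.ElementTree as ET

# The same character data, once escaped and once inside a CDATA section.
plain = ET.fromstring("<code>a &lt; b &amp;&amp; c</code>").text
cdata = ET.fromstring("<code><![CDATA[a < b && c]]></code>").text

print(plain)
print(plain == cdata)
```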

Comments
Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents: processing instructions
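Both behaviours, dropping and keeping comments, can be observed in Python's standard library. The sketch assumes that `xml.etree.ElementTree` discards comments with its default settings, while `xml.dom.minidom` keeps them as nodes; `dom_comments` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

DOC = "<!-- before the root --><player><!-- inside -->Carlo Nervo</player>"

# ElementTree (default settings): comments are dropped.
et_root = ET.fromstring(DOC)
et_children = list(et_root)


def dom_comments(node):
    """Collect the text of all comment nodes below the given DOM node."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.COMMENT_NODE:
            found.append(child.data)
        found.extend(dom_comments(child))
    return found


# minidom: comments are kept as nodes of the tree.
kept = dom_comments(minidom.parseString(DOC))

print(et_root.text, et_children)
print(kept)
```

An application that needs information to survive parsing therefore cannot rely on comments.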

XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root

The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it must be the very first thing in the document
    not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
version is the XML version (1.0, 1.1, …)
encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8 or UTF-16)
standalone means that it has no external DTD
  optional, default no
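The "first thing in the document" rule is enforced by parsers. A sketch assuming `xml.etree.ElementTree`; `parses` is a hypothetical helper that only reports whether the parser accepts the text.

```python
import xml.etree.ElementTree as ET

DECLARATION = '<?xml version="1.0" standalone="no"?>'


def parses(text):
    """Return True if the parser accepts the text as a document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses(DECLARATION + "<player/>"))                   # declaration first
print(parses("<player/>"))                                 # declaration omitted
print(parses("<!-- note -->" + DECLARATION + "<player/>")) # comment before it
print(parses(" " + DECLARATION + "<player/>"))             # white space before it
```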

Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around
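Any XML parser doubles as a well-formedness checker. The sketch below assumes `xml.etree.ElementTree` and exercises several of the rules above; `is_well_formed` is a hypothetical helper, and the sample fragments are made up.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


SAMPLES = {
    "properly nested": "<em><b>Right</b> XML</em>",
    "overlapping elements": "<em><b>Wrong </em> XML</b>",
    "case mismatch": "<player>Carlo</Player>",
    "two roots": "<a/><b/>",
    "unquoted attribute": "<team current=yes/>",
    "duplicate attribute": '<team current="yes" current="no"/>',
    "unescaped <": "<p>a < b</p>",
    "unescaped &": "<p>Batman & Robin</p>",
}

for label, text in SAMPLES.items():
    print(label, is_well_formed(text))
```

Only the first sample is accepted; each of the others breaks exactly one rule of the list.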

DTD

Flexibility or Rigidity
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  how to handle validity errors is optional

Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tagsprefix x-
such as x-quenya58
Encoding for Portability
Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability
When transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time
59
XML amp CSS
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip
No need for DTD or Schemaeven though the browser could anyway complainhellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it[without CSS Style Sheet]
66
Example How Mozilla Visualises it[with CSS Style Sheet]
67
DOM amp SAX
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sortsthrough ad hoc API
The most used hated deprecated widespread areDOMSAX
69
Document Object Model
httpwwww3orgDOMstandard W3C as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range
Level 3 Core W3C Working Draft April 2002adds minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties & Methods of DOM
- Every DOM node has properties and methods to explore and update the XML tree
- Every DOM node has a name, a value, a type
- There are general properties and methods for all kinds of nodes
  - attributes returns all the attributes of the node
  - appendChild(newChild) appends newChild after the other child nodes
- Then, any specific kind of node has its own specific properties and methods
- These properties and methods are made available by the suitable API for the language of choice
  - many solutions for Java
  - see for instance http://java.sun.com/xml/jaxp

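The general node properties listed above (name, value, type, attributes, appendChild) can be tried out with the JAXP classes shipped with the JDK. What follows is only a minimal sketch: the class name DomNodeTour and its helper methods are hypothetical, chosen for illustration, and the player document is the running example of these slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: the names here are illustrative, not from the slides.
class DomNodeTour {

    // Parse an XML string into a DOM tree held entirely in memory.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse: " + xml, e);
        }
    }

    // Every DOM node has a name, a type (a numeric code) and a value.
    static String describe(Node n) {
        return n.getNodeName() + " type=" + n.getNodeType() + " value=" + n.getNodeValue();
    }

    // Append <team current="...">name</team> after the other children of the root;
    // returns the number of children the root has afterwards.
    static int addTeam(Document doc, String name, String current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.appendChild(doc.createTextNode(name));
        Element root = doc.getDocumentElement();
        root.appendChild(team);
        return root.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        Document doc = parse("<player><name>Carlo</name>"
                + "<team current=\"yes\">Bologna</team></player>");
        Element player = doc.getDocumentElement();
        Node team = player.getLastChild();
        System.out.println(describe(player));                // element node: its value is null
        System.out.println(describe(team.getFirstChild()));  // text node: its value is the text
        // attributes: all the attributes of the node, as a NamedNodeMap
        System.out.println(team.getAttributes().getNamedItem("current").getNodeValue());
        System.out.println(addTeam(doc, "Mantova", "no") + " children after appendChild");
    }
}
```

Element nodes report a null value, while the text they contain lives in separate child text nodes named #text: this is why the content of an element is reached through its children.
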
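A Simple Java DOM Example

    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;  // assumed: the Xerces class of that name
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
        try {
            // Load the whole document named on the command line into memory
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // Scan the children of the root element for the first <recipe>
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);  // print(Node, PrintStream): helper not shown on the slide
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
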
Main Problem of DOM
- The XML document is loaded as a whole and handled altogether in memory
  - it might be time-consuming and difficult to manage
  - wouldn't it be better if we could load only the part we are actually manipulating?
- This is the motivation behind SAX
  - which did not start as a standard
  - has problems of acceptance
  - but has indeed a long tail of followers
  - and also its good reasons to exist

Simple API for XML (SAX)
- Differently from DOM, SAX is event-based
- It sees the document not as a tree, but as a text document
  - flowing through the SAX parser
  - and generating events as it goes: document started / ended, element started / ended, character content, etc.
- A very simple model
  - good for simple applications
  - and also to avoid memory abuse
- Not so well-supported as DOM is
  - in terms of standardisation
  - as well as of tools

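The event-based model can be made concrete with the SAX classes that come with the JDK. The sketch below simply logs each event as the document flows through the parser; the class name SaxEventLog and the log format are hypothetical, chosen only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records SAX events instead of building a tree.
class SaxEventLog {

    // Run a SAX parser over the XML string and record each event as it fires.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName,
                                               Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks;
                // this sketch logs each non-blank chunk separately.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

Nothing is kept in memory beyond what the handler itself decides to store, which is the point of SAX when documents are large.
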
XML for Portable Data
- Cross-platform, long-term data format
  - passing XML data through space and time
  - along with Unicode, a text-based standard format
- Text, text, text
  - both data and markup
  - all in the XML file
- XML document structure: simple & clear
  - easy to parse
  - well-documented
- That is why XML is already everywhere

How XML Looks Like

    <?xml version="1.0" encoding="utf-8"?>
    <docroot>
      <head>
        <title>This is my document</title>
      </head>
      <body>
        <p>A list of things I like</p>
        <list>
          <item>weekends</item>
          <item>good beer</item>
          <item>midnight snacks</item>
          <item>ice cream
            <list>
              <item>chocolate</item>
              <item>cookie dough</item>
              <item>white russian</item>
            </list>
          </item>
          <item>shade trees</item>
        </list>
      </body>
    </docroot>

How XML Looks Like from a Browser
[Figure not preserved in the transcript]

How to Work with XML
- XML is text
  - so any text editor is perfectly fine
- A number of XML editors are around
  - but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
- Visualisation is a different matter
  - browsers do something
  - but XML is not a presentation language, so…
- We need to understand
  - what an XML document is
  - how XML works

What is an XML Document?
- It can be
  - a text file
  - a record in a database
  - a run-time construction in memory
  - …
- In any case, it can be handled and transmitted by any system capable of dealing with text documents

    <student>
      <studentname>
        <name>Carlo</name>
        <surname>Nervo</surname>
      </studentname>
      <studentnumber>0000145678</studentnumber>
      <course>2036</course>
    </student>

How Does XML Work?
- Who handles XML documents
  - after they have been produced?
  - how? why?
- XML parsers
  - working out the structure of the XML document
  - verifying well-formedness and basic respect of XML syntax
- XML validating parsers
  - when applicable, that is, when there is either a DTD or a Schema
  - checking validity
- Examples
  - web browsers, word processors, database servers, drawing programs, spreadsheets, programs in some language, etc.

Where is XML Actually Used?
- Everywhere, already

Some History of XML & Related
- Lot to be written still…
- SGML is where it comes from
  - HTML was the first successful application of SGML
  - but SGML had obvious limitations
    - too complex: more than 150 pages of specification
    - never implemented fully
    - too complex for the Internet
- SGML "Lite" (1996, Bosak, Bray et al.)
  - XML 1.0 (February 1998)
- Then a flow
  - namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

XML Fundamentals

A Simple XML Document

    <player>
      Carlo Nervo
    </player>

XML Document & Files
- This is a complete XML document
- It can be stored / recorded / built in the form of a number of different files, or even in other forms
  - Carlonervo.xml, player.txt
  - a record in a database
  - a memory area built by a CGI and then transmitted
  - sent by a Web server with MIME type application/xml or text/xml

    <player>
      Carlo Nervo
    </player>

XML Elements & Tags
- The document contains a single element
  - of type player
- Such an element is delimited by the tag player
  - between start tag <player> and end tag </player>
- In between the tags lies the element's content: Carlo Nervo
- Tags are markup
  - the most common form of markup, but there are other kinds
- Content is character data
  - including the white space between Carlo & Nervo

    <player>
      Carlo Nervo
    </player>

Tag Syntax
- Very similar to HTML tags
  - at least superficially
- <tag> for start tags, </tag> for end tags
- <tag /> for empty tags
  - tags with no content, like <br /> or <hr />
- XML is case sensitive
  - so <player> cannot be closed by end tag </Player>
  - NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    - HTML, JavaScript & XHTML, …

XML Trees: A Simple Example

    player
      name      surname    team       team
      Carlo     Nervo      Bologna    Mantova

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

An XML Document is an XML Tree
- An XML document has a tree-like structure
  - one and only one root
    - root element, or document element
  - each element can have one or more child elements
  - each element, except the root, has exactly one parent
  - child elements of the same parent are siblings
  - leaves are either content or empty elements
- Well-formedness stems from here
  - <em><b>Wrong </em> XML</b> is not permitted
  - nesting needs to be perfect, overlapping is not allowed

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Narrative-Organised XML

    <biography>
      <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
      After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
      …
    </biography>

- XML documents for written narrative, such as articles, reports, blogs, books, novels
  - elements with mixed content
  - not easy for automated processing and exchange

XML Attributes
- Elements can be labelled by attributes
  - attributes are specified in the start tag
    - and in the only tag of empty elements
  - any number of attributes can in principle be associated to an element
- An attribute is a name-value pair of the form name="value"
  - alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  - only one attribute with a given name is allowed per element
- Attributes do not change the tree structure of an XML document
  - but they are qualifiers for the nodes and leaves of the tree

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Using Elements or Attributes?
- Attributes are for meta-data about the element, and content is the information of the element
  - maybe, but then it is not easy to clearly distinguish between the two
- Element-based structure is more flexible than attribute-based
  - attributes provide for a flat data structure, elements can be nested as needed
  - attributes are unique within an element, any number of elements of the same type can be used within an element
- Attributes are quite useful in narrative-based XML documents

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes" value="Bologna" />
      <team current="no" value="Mantova" />
    </player>

XML Names
- XML names are used, and are the same, for the names of elements, attributes and some other constructs
  - to increase efficiency and abate complexity
- An XML name can include
  - any letter
    - latin, or even non-latin like ideographs
  - any digit
  - underscore, hyphen and period (_ - .)
  - a colon (:), which is reserved to namespaces
- An XML name may not include other punctuation signs, nor any sort of white space
  - and can begin only with letters, ideographs or underscore

Parsed Character Data
- An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  - so, for instance, < is always taken as the beginning of a tag
  - what if we need a < character in the document, as in JavaScript code?
- All characters are interpreted as character data to be parsed
  - unless an escape character & is encountered
  - character data to parse starts again after the ; character
- E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
  becomes the parsed character data
    Batman & Robin

Entity References
- &entityreference;
- An entity is something defined outside the normal flow of the XML document
  - out of the XML tree
  - used for constants, common values, external values, etc.
  - through an entity reference
- Users of any sort may define their own entities
  - we'll see how soon, for instance through DTDs

Pre-defined XML Entities

    Markup    Entity    Description
    &lt;      <         less-than
    &gt;      >         greater-than
    &amp;     &         ampersand
    &quot;    "         double quote
    &apos;    '         single quote

CDATA Sections
- Including code chunks from any language with < or & can be tedious
  - we need to tell the parser "do not parse this"
  - good, for instance, to include segments of XML code to show
- CDATA section
  - between <![CDATA[ and ]]>
  - can contain anything but its own end delimiter ]]>
- After parsing, there is no way to tell whether a text came from a CDATA section or not

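Entity references and CDATA sections are two ways of writing the same character data, and a parser hands back identical text in both cases. The following sketch, which assumes only the JAXP classes of the JDK and uses a hypothetical class name CharacterDataDemo, reads the text content of a root element written either way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows what character data looks like after parsing.
class CharacterDataDemo {

    // Text content of the root element, once the parser has resolved
    // entity references and unwrapped CDATA sections.
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // Escaped with a pre-defined entity reference
        System.out.println(rootText("<superheroes>Batman &amp; Robin</superheroes>"));
        // Same text, protected by a CDATA section instead
        System.out.println(rootText("<superheroes><![CDATA[Batman & Robin]]></superheroes>"));
        // Markup-looking text inside CDATA is plain character data, not elements
        System.out.println(rootText("<code><![CDATA[<em>not markup</em>]]></code>"));
    }
}
```

The first two calls yield the same string, which is the sense in which the origin of a piece of text is no longer visible in its content after parsing.
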
Comments
- Easy: <!-- Comment -->
  - a comment cannot contain --, nor can it end with --->
- Comments do not affect the document tree structure
  - they can appear anywhere, even before the root element
  - but not inside a tag or another comment
- Parsers may either drop or keep them, at their will
- Comments are meant to improve human legibility of XML docs
  - to give info to a computational agent, use processing instructions instead

XML Processing Instructions
- Need to pass information for a given application through the parser
  - comments may disappear at any stage of the process
- Processing instructions have this very end
  - <?target … ?>
- The target may be the application that has to handle the instruction, or just an identifier for the particular processing instruction
  - <?php … ?>
  - <?xml-stylesheet … ?>
- A processing instruction is markup, not an element
  - it can appear anywhere outside a tag, even before or after the root element

The XML Declaration
- Looks like an XML processing instruction
  - but it is not: it is just the XML declaration
- It is optional
  - but if present, it must be the very first thing in the document
  - not even comments are allowed before it
- <?xml version="1.0" encoding="utf-8" standalone="no"?>
- version is the XML version (1.0, 1.1, …)
- encoding is the character encoding of the text (UTF-8, a Unicode encoding, in the example)
  - optional, default Unicode (UTF-8 or UTF-16)
- standalone="yes" means that the document does not rely on an external DTD
  - optional, default no

Checking Well-Formedness
- Main rules
  - perfect match between start and end tags
  - no overlapping elements
  - one and only one root element
  - attribute values are always quoted
  - at most one attribute with a given name per element
  - neither comments nor processing instructions within tags
  - no unescaped < or & signs in the character data of elements or attributes
  - …
- Tools on the Web
  - just look around

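Besides the tools on the Web, any XML parser can act as a well-formedness checker, since it must reject a document that breaks the rules above. A minimal sketch with the JAXP parser of the JDK follows; the class name WellFormedCheck is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: uses a non-validating parser as a well-formedness checker.
class WellFormedCheck {

    // True if the parser accepts the text as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser hit a well-formedness error
        } catch (Exception e) {
            throw new IllegalStateException("parser could not be set up or input not readable", e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<player>Carlo Nervo</player>",        // fine
            "<em><b>Wrong </em> XML</b>",          // overlapping elements
            "<name>Carlo</name><name>Nervo</name>", // two root elements
            "<team current=yes/>",                 // unquoted attribute value
            "<p>a < b</p>"                         // unescaped < in character data
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```
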
DTD

Flexibility or Rigidity?
- XML is flexible
  - whatever this means
  - but sometimes flexibility is not a feature within a given application scenario
- Sometimes some strict rule is required
  - some control over syntax should be enforced
  - like "a football player should have at least one team"
- Document Type Definition (DTD)
  - to define which XML documents are valid
- Validity is not mandatory, whereas well-formedness is
  - how to handle validity errors is optional

Validation
- A valid XML document includes a DTD that the document satisfies
- Main principle
  - everything not permitted is forbidden
  - that is, DTDs specify positive examples
- Everything in the XML document must match a DTD declaration
  - then the document is valid
  - otherwise the document is invalid
- There are many things a DTD does not say
  - we stick with what we can specify

DTD is…
- SGML-based
  - syntax a bit awkward
  - but after all easy to understand
  - and quite suited for short and expressive descriptions
- It allows XML designers to define a grammar for their documents
  - typical syntax-based approach
  - maybe limited, but easy to implement
- Maybe DTD is not the future of XML document validation
  - XML Schema should be that
  - but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still

A Simple DTD Example
- We do not go too deep into DTD syntax
  - we just look at the example below and comment

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

DTD Declaration
- The DTD is declared here as internal
- but it could be declared separately
  - <!DOCTYPE player SYSTEM "football_player.dtd">
- even referring to an external shared resource
  - <!DOCTYPE player SYSTEM "http://…">

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

DTD Declarations: Define or Use
- So you may
  - define your own DTD, and
    - either include it in your XML document
    - or save it as an independent document, and refer to it from one or more XML docs
  - or use an external DTD defined by someone else
    - like a working group you belong to, or a standardisation body of any sort
    - by referring to that externally-defined syntax for your XML docs

Element Declarations
- A player element contains one name, one surname and one or more teams
  - in that precise order
  - and they are just parsed character data (#PCDATA)

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Some Syntax
- , is for sequence
  - to define ordered lists
- | is for choice
  - to provide for alternatives
- suffixes
  - * for zero or more occurrences
  - + for one or more occurrences
  - ? for zero or one occurrence
- parentheses for grouping
  - at any level of nesting
  - operators and suffixes applicable to any level
- ANY for free-form content
- EMPTY for empty elements

Attribute Declarations
- A team element has a current attribute
  - which is mandatory (#REQUIRED)
    - #IMPLIED would say optional instead
  - and can be either yes or no
    - enumeration as an attribute type

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

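The element and attribute declarations discussed so far can be checked mechanically by a validating parser. The sketch below assumes the JAXP parser of the JDK with DTD validation switched on; the class name DtdValidation is hypothetical, and the DOCTYPE name is written as player here so that it matches the root element, as a validating parser requires.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: validates player documents against an internal DTD.
class DtdValidation {

    // The DTD of the football player example, as an internal subset.
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // True if the given root element is valid with respect to the DTD above.
    static boolean isValid(String playerXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity violations are reported as recoverable errors: make them fail the parse
            builder.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + playerXml)));
            return true;
        } catch (SAXException e) {
            return false; // not valid (or not even well-formed)
        } catch (Exception e) {
            throw new IllegalStateException("parser could not be set up or input not readable", e);
        }
    }

    public static void main(String[] args) {
        String ok = "<player><name>Carlo</name><surname>Nervo</surname>"
                  + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(isValid(ok));      // one name, one surname, at least one team
        System.out.println(isValid(noTeam));  // team+ requires at least one team
    }
}
```
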
Attribute Defaults
- #IMPLIED
  - the attribute is optional
- #REQUIRED
  - the attribute is mandatory
- #FIXED
  - whether it is explicitly specified or not, it has a given value
- literal
  - the default value is the literal quoted string

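The effect of attribute defaults is easiest to see from what the parser reports for attributes that the document does not write out. The following sketch uses the JAXP parser of the JDK; the class name AttributeDefaults, the DTD and the attribute names sport and city are all hypothetical, made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows attribute values supplied by DTD defaults.
class AttributeDefaults {

    // Illustrative DTD: a literal default, a #FIXED value and an #IMPLIED attribute.
    static final String DTD =
          "<!DOCTYPE team [\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team\n"
        + "    current (yes | no) \"no\"\n"
        + "    sport   CDATA      #FIXED \"football\"\n"
        + "    city    CDATA      #IMPLIED>\n"
        + "]>\n";

    // Value of the named attribute on the root element, as seen after parsing;
    // the empty string means the attribute is absent.
    static String attribute(String teamXml, String name) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DTD + teamXml)))
                    .getDocumentElement().getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse: " + teamXml, e);
        }
    }

    public static void main(String[] args) {
        String team = "<team>Mantova</team>";
        System.out.println("current = " + attribute(team, "current")); // literal default applies
        System.out.println("sport   = " + attribute(team, "sport"));   // fixed value applies
        System.out.println("city    = " + attribute(team, "city"));    // implied: simply absent
    }
}
```
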
Attribute Types
- CDATA
  - any string of text acceptable in a well-formed XML attribute value
- NMTOKEN, NMTOKENS
  - more than an XML name: any name character is accepted as the first character
  - the plural form accepts more than one, separated by white space
- ENTITY, ENTITIES
  - name(s) of unparsed entities declared elsewhere in the document
- ID
  - an XML name, unique in the document, working as an identifier
- IDREF, IDREFS
  - reference(s) to IDs in the document
- NOTATION
  - name of a notation used & defined in the document (rare)
- enumeration

Other DTD Declarations etc.
- ENTITY declarations
  - <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
- NOTATION declarations
  - who cares, actually
- We stop here
  - more only for those who need it

Namespaces

What are Namespaces for?
- Distinguish
  - different XML applications may use the same names
    - at any scale, from personal to world-wide
  - a namespace allows them to be clearly distinguished
- Group
  - names of elements and attributes of the same XML application can be grouped together
    - to be more easily recognised and handled
- Example: set is an element in both SVG and MathML applications
  - what if I have to use them together?
  - namespaces can be used to disambiguate names

Syntax for Namespace Use
- Qualified names
  - prefix:local_part
- Examples of qualified names
  - or QNames, or raw names
  - rdf:description, xlink:type, xsl:template
- Used for both element and attribute names

Associating Prefixes to URIs
- Example
  - a large firm could have a number of namespaces for different purposes

        <company xmlns:local="http://www.company.it/xml"
                 xmlns:euro="http://www.company.eu/xml"
                 xmlns:world="http://www.company.com/xml">

  - then you can use local, euro and world everywhere as prefixes
  - typically declared in the topmost element, but could be declared anywhere
  - example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
- URIs are standardised, not prefixes
  - but usually svg, rdf and other prefixes are not re-defined
  - also, namespace URIs are just conventional names
    - not necessarily pointing to an actual resource

Setting Default Namespaces
- xmlns attribute
  - alone, with no :prefix suffix
- <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  - all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  - no need to make the svg prefix explicit

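How prefixes and default namespaces resolve to URIs can be observed with a namespace-aware parser. The sketch below uses the JAXP parser of the JDK; the class name NamespaceLookup and the office elements are hypothetical, while the company URIs are the ones from the example on the previous slide.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: looks elements up by namespace URI rather than by prefix.
class NamespaceLookup {

    // Two elements share the local name "office" but live in different namespaces:
    // one through the default namespace, one through the euro prefix.
    static final String XML =
          "<company xmlns=\"http://www.company.com/xml\"\n"
        + "         xmlns:euro=\"http://www.company.eu/xml\">\n"
        + "  <office>Cesena</office>\n"
        + "  <euro:office>Bologna</euro:office>\n"
        + "</company>";

    // Qualified name and text of the first office element in the given namespace.
    static String office(String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            Element e = (Element) doc.getElementsByTagNameNS(namespaceUri, "office").item(0);
            return e.getNodeName() + "=" + e.getTextContent();
        } catch (Exception ex) {
            throw new IllegalStateException("namespace lookup failed", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(office("http://www.company.com/xml")); // via the default namespace
        System.out.println(office("http://www.company.eu/xml"));  // via the euro prefix
    }
}
```

The lookup is driven by the URI, so renaming the euro prefix in the document would not change which element is found: this is the sense in which URIs, not prefixes, identify a namespace.
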
Internationalisation

What does "Text" Mean?
- "Text" can be encoded according to so many different alphabets
  - mapping between characters and integers (code points)
    - character set
  - ASCII being the most (in)famous, now Unicode
- A character encoding determines how code points are mapped onto bytes
  - so a character set can have multiple encodings
  - UTF-8 and UTF-16 are both Unicode encodings
- Any XML document is a text document
  - so its encoding should be declared

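The difference between a code point and its encodings can be shown in a few lines. This is only an illustrative sketch with the standard Java charset classes; the class name CodePointsAndBytes is hypothetical, and the sample character is the à of "Università".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: one code point, several byte sequences.
class CodePointsAndBytes {

    // Hex dump of the bytes a string occupies under a given character encoding.
    static String hexBytes(String s, Charset encoding) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(encoding)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u00E0"; // LATIN SMALL LETTER A WITH GRAVE
        // The character set assigns the character an integer: its code point
        System.out.println("code point: U+" + String.format("%04X", s.codePointAt(0)));
        // Each encoding maps that same code point onto different bytes
        System.out.println("UTF-8:      " + hexBytes(s, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:   " + hexBytes(s, StandardCharsets.UTF_16BE));
        System.out.println("ISO-8859-1: " + hexBytes(s, StandardCharsets.ISO_8859_1));
    }
}
```

A reader that guesses the wrong encoding will turn the two UTF-8 bytes into two unrelated Latin-1 characters, which is why an XML document should declare its encoding.
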
The XML Encoding Part of the XML Declaration
- <?xml version="1.0" encoding="utf-8" standalone="no"?>
- Most common values
  - utf-8, utf-16 (Unicode)
  - ISO-8859-1 (Latin-1)
- See also XML-Defined Character Sets
  - Unicode and ISO are the most used families
- Used also for external parsed entities
  - like DTD fragments or XML chunks
  - which may have different encodings
  - there, version may be dropped
    - it is a text declaration, but no longer an XML declaration

Multi-Lingual Documents
- Example: a spell-checker or a voice-reader parsing an XML doc
- How to determine the language of a subpart?
  - for multi-lingual docs
- xml:lang attribute
  - can be associated to any element
  - determines the language of the element
- Values are to be found in ISO 639
  - standard two letters for each language known
  - if not there, IANA
    - prefix i-
    - such as i-navajo, i-klingon, …
  - if not there too, such as for user-defined tags
    - prefix x-
    - such as x-quenya

Encoding for Portability
- Working around encoding is not simply an "internationalisation" issue
  - it is also about portability
- When transmitting / communicating through text-based files, many errors typically occur
  - which are often not easy to catch
- XML abilities to
  - handle encoding precisely and accurately
  - embody encoding information within each document
- make it a powerful tool for easy and hassle-free portability
  - across platforms, across applications, across time

XML & CSS

XML on Browsers
- Different experiences with different browsers
  - when trying to visualise an XML document
- XML, however, can be transformed
  - to become easier to handle by standard browsers
- Two main approaches
  - Web-based one: XML + CSS
  - XML-based one: XSL
- In the following we explore the XML + CSS issue

Cascading Style Sheets
- Cascading Style Sheets (CSS)
  - a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
- W3C standard
  - http://www.w3.org/Style/CSS/
- Goals
  - describing how to present elements of a document
    - spanning over a range of different media
  - separating style description from content and structure
- In this course, we assume that you already know the basics
  - if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example
[Figure not preserved in the transcript]

XML + CSS
- Any XML document can be prepared for browser visualisation via CSS
- Two things needed
  - a CSS style sheet referring to the proper element types of the XML document
  - the association between the XML document and the CSS style sheet
- Processing instruction
  - to associate CSS to XML
  - <?xml-stylesheet type="text/css" href="filename.css" ?>
- CSS style sheet defining the presentation style for the XML document tags
  - tagname { property1: value1; … }
- No need for DTD or Schema
  - even though the browser could complain anyway…

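XML + CSS Example: The XML Doc
[Figure not preserved in the transcript]

Example: How Mozilla Visualises it [without CSS Style Sheet]
[Figure not preserved in the transcript]

Example: How Mozilla Visualises it [with CSS Style Sheet]
[Figure not preserved in the transcript]
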
How XML Looks likeltxml version=10 encoding=utf-8gt
ltdocrootgt
ltheadgt lttitlegtThis is my documentlttitlegt ltheadgt
ltbodygt
ltpgtA list of things I likeltpgt
ltlistgt ltitemgtweekendsltitemgt ltitemgtgood beerltitemgt ltitemgtmidnight snacksltitemgt ltitemgtice cream ltlistgt ltitemgtchocolateltitemgt ltitemgtcookie doughltitemgt ltitemgtwhite russianltitemgt ltlistgt ltitemgt ltitemgtshade treesltitemgt ltlistgt
ltbodygtltdocrootgt
11
How XML Looks like from a Browser
12
How to Work with XML
XML is textso any text-editor is perfectly fine
A number of XML editors aroundbut typically general text editors with some programming Web-oriented capabilities are good enough and often even better
Visualisation is a different matterbrowsers do something
but XML is not a presentation language sohellipwe need to understand
what an XML document ishow XML works
13
What is an XML Document
It can beA text fileA record in a databaseA run-time construction in memoryhellip
In any case it can be handled and trasmitted by any system capable of dealing with text documents
ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt
14
How does XML WorkWho handles XML documents
after it has been producedhow why
XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax
XML validating parserswhen applicable
there is either a DTD or a Schemachecking validity
Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc
15
Where is XML actually used
Everywhere already
16
Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from
HTML was the first successful application of SGMLbut had obvious limitations
too complexmore than 150 pagesnever implemented fully
too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)
XML 10 (February 1998)Then a flow
namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc
17
XML Fundamentals
A Simple XML Document
ltplayergt Carlo Nervoltplayergt
19
XML Document amp Files
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp Tags
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervotags are markup
the most common form of markup but there are other kindscontent is character data
including the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is an XML Tree
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed24
Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributesattributes are specified in the start tag
and in the only tag of empty elementsany number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or Attributes
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore28
Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References

&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

Pre-defined XML Entities

Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote

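A minimal sketch of how the five pre-defined entities are used in practice: text is escaped before being placed inside an element, and a parser turns the references back into plain characters. The class and method names (EntityEscapeSketch, escape, parsedText) are hypothetical; the parsing relies on the standard JAXP DocumentBuilder.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class EntityEscapeSketch {

    // Replaces the five characters that have pre-defined entities.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Parses the XML text and returns the character data of the root element.
    static String parsedText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String raw = "Batman & Robin";
        String xml = "<superheroes>" + escape(raw) + "</superheroes>";
        System.out.println(xml);             // markup with the entity reference
        System.out.println(parsedText(xml)); // parsed character data
    }
}
```

Escaping and parsing are inverse steps here: whatever goes through escape comes back unchanged from parsedText.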
CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser: do not parse this
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own closing delimiter ]]>
After parsing, there is no way to tell whether a text came from a CDATA section or not

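As a hedged illustration of the last point, the sketch below parses the same code chunk twice, once escaped with entity references and once wrapped in a CDATA section, and compares the resulting character data. CdataSketch and textOf are illustrative names; the parser is the standard JAXP DocumentBuilder.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CdataSketch {

    // Returns the character data of the root element, whatever its origin.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<code>if (a &lt; b &amp;&amp; c) run();</code>";
        String cdata = "<code><![CDATA[if (a < b && c) run();]]></code>";
        System.out.println(textOf(escaped));
        System.out.println(textOf(cdata));
        System.out.println("same text: " + textOf(escaped).equals(textOf(cdata)));
    }
}
```

The CDATA form is easier to write and read, but once parsed the application sees the same text.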
Comments

Easy: <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents, use processing instructions

XML Processing Instructions

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions serve this very purpose
  <?target … ?>
The target may be the application that has to handle the instruction, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root

The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it must be the very first thing in the document
  not even comments are allowed before it
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
version is the XML version (1.0, 1.1, …)
encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8, or UTF-16 when a byte-order mark is present)
standalone="yes" means that the document has no external DTD
  optional, default no

Checking Well-Formedness

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around

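Any XML parser can serve as a well-formedness checker, since it must reject documents that break the rules above. A minimal sketch, assuming the standard JAXP parser shipped with the JDK; WellFormedSketch and isWellFormed are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedSketch {

    // True if the parser accepts the text as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness rule was violated
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<player>Carlo Nervo</player>",
            "<em><b>Wrong </em> XML</b>", // overlapping elements
            "<a/><b/>",                   // two root elements
            "<team current=yes/>",        // unquoted attribute value
            "<a>1 < 2</a>"                // unescaped < in character data
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```

Only the first sample is accepted; each of the others violates one of the listed rules.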
DTD
Flexibility or Rigidity

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
  like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, whereas well-formedness is
  how to handle validity errors is optional

Validation

A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

A Simple DTD Example

We do not go too deep into DTD syntax
  we just look at the example below and comment

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

DTD Declaration

The DTD is declared here as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

DTD Declarations: Define or Use

So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax from your XML docs

Element Declarations

A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements

Attribute Declarations

A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

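To see the element and attribute declarations at work, the player DTD can be handed to a validating parser. The sketch below is an assumption-laden illustration rather than a reference implementation: DtdValidationSketch and isValid are made-up names, the parser is the JAXP one bundled with the JDK, and the DOCTYPE is named player because a validating parser requires the DOCTYPE name to match the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class DtdValidationSketch {

    // Internal DTD subset for the player example; the DOCTYPE name matches the root element.
    static final String DOCTYPE =
            "<!DOCTYPE player [\n"
          + "  <!ELEMENT player (name, surname, team+)>\n"
          + "  <!ELEMENT name (#PCDATA)>\n"
          + "  <!ELEMENT surname (#PCDATA)>\n"
          + "  <!ELEMENT team (#PCDATA)>\n"
          + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
          + "]>\n";

    // True if the given root element is valid with respect to the DTD above.
    static boolean isValid(String rootElementXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // ask for DTD validation, off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // validity errors are recoverable by default: treat them as failures
                }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + rootElementXml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ok = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        String badValue = "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"maybe\">Bologna</team></player>";
        System.out.println(isValid(ok));       // satisfies every declaration
        System.out.println(isValid(noTeam));   // team+ requires at least one team
        System.out.println(isValid(badValue)); // current must be yes or no
    }
}
```

All three documents are well-formed; only the first one is also valid.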
Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string

Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to IDs in the document
NOTATION
  name of a notation used & defined in the document (rare)
enumeration

Other DTD Declarations etc.

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces
What are Namespaces for

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespace Use

Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes to URI

Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro, and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
URIs are standardised, not prefixes
  but usually svg, rdf, and other prefixes are not re-defined
  also, they are conventional names
    not necessarily pointing to an actual resource

Setting Default Namespaces

xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit

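The set example can be checked with a namespace-aware parser: the same local name ends up in two different namespaces, whether the namespace is bound through a prefix or set as the default. This is a sketch under assumptions: NamespaceSketch and expandedNameOfFirstChild are illustrative names, and the {URI}local notation is just a convenient way to print the pair.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NamespaceSketch {

    // Returns "{namespace-URI}local-part" for the first child element of the root,
    // or null if the root has no child element.
    static String expandedNameOfFirstChild(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces unless asked
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            Node n = root.getFirstChild();
            while (n != null && n.getNodeType() != Node.ELEMENT_NODE) {
                n = n.getNextSibling();
            }
            if (n == null) {
                return null;
            }
            return "{" + n.getNamespaceURI() + "}" + n.getLocalName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // default namespace: no prefix needed on svg or on its children
        String svg = "<svg xmlns=\"http://www.w3.org/2000/svg\"><set/></svg>";
        // explicit prefix bound to the MathML namespace
        String math = "<m:math xmlns:m=\"http://www.w3.org/1998/Math/MathML\"><m:set/></m:math>";
        System.out.println(expandedNameOfFirstChild(svg));
        System.out.println(expandedNameOfFirstChild(math));
    }
}
```

The prefix itself (m, or none at all) plays no role in the comparison: only the URI and the local part identify the name.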
Internationalisation
What does Text Mean

"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared

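The distinction between code points and bytes can be made concrete with a few lines of Java. In this sketch (EncodingSketch, codePoints and byteLength are illustrative names) the same ten-character word takes a different number of bytes depending on the encoding chosen.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSketch {

    // Number of characters (Unicode code points) in the text.
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    // Number of bytes needed to store the text in the given encoding.
    static int byteLength(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String word = "Universit\u00e0"; // the final letter is a with grave accent
        System.out.println("code points: " + codePoints(word));
        System.out.println("UTF-8:       " + byteLength(word, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:    " + byteLength(word, StandardCharsets.UTF_16BE));
        System.out.println("ISO-8859-1:  " + byteLength(word, StandardCharsets.ISO_8859_1));
    }
}
```

A reader that guesses the wrong encoding would split these bytes into the wrong characters, which is why an XML document should declare its encoding.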
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
    such as x-quenya

Encoding for Portability

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting or communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
No need for DTD or Schema
  even though the browser could anyway complain…

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it [without CSS Style Sheet]

Example: How Mozilla Visualises it [with CSS Style Sheet]

DOM & SAX

Manipulating XML

Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
  access, delete, modify
  parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM
  SAX

Document Object Model

http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals, and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
    other Level 2 parts: Events, Views, Style, Traversal and Range
  Level 3 Core: W3C Working Draft, April 2002
    adds minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have

Properties & Methods of DOM

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
  see for instance http://java.sun.com/xml/jaxp

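A Simple Java DOM

  // DOMParser as provided, e.g., by Apache Xerces (org.apache.xerces.parsers.DOMParser);
  // Document and Node come from org.w3c.dom, PrintStream from java.io
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                     // args[0] is the XML document to load
      Document doc = p.getDocument();
      Node n = doc.getDocumentElement().getFirstChild();
      // skip siblings until the first node named "recipe" (or the end of the list)
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);                      // helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) { e.printStackTrace(); }
  }
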
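The same search for the first recipe can be sketched with JAXP alone, with no parser-specific class. This is a variant written for illustration, not the code from the slide: the class name, the helper methods, and the sample document with its id attributes are all assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class FirstRecipeJaxpSketch {

    // Loads the whole document in memory and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Walks the sibling list until an element with the given name is found; null if none.
    static Element firstChildNamed(Element parent, String name) {
        Node n = parent.getFirstChild();
        while (n != null
                && !(n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals(name))) {
            n = n.getNextSibling();
        }
        return (Element) n;
    }

    // Returns the id attribute of the first recipe element, or null if there is no recipe.
    static String firstRecipeId(String xml) {
        Element recipe = firstChildNamed(parseRoot(xml), "recipe");
        return recipe == null ? null : recipe.getAttribute("id");
    }

    public static void main(String[] args) {
        // hypothetical sample document
        String xml = "<collection><description>Some recipes</description>"
                + "<recipe id=\"r1\"><title>Risotto</title></recipe>"
                + "<recipe id=\"r2\"/></collection>";
        System.out.println(firstRecipeId(xml));
    }
}
```

Navigation uses only the generic Node properties (first child, next sibling, name, type), which is what makes the DOM API language- and parser-neutral.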
Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML (SAX)

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools

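The event-based model can be observed by logging the callbacks a SAX parser issues while the document flows through it. A minimal sketch using the JAXP SAX parser; SaxEventsSketch, events, and the log format are illustrative choices, and text is buffered because a parser is allowed to deliver character content in several chunks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEventsSketch {

    // Parses the document and returns the sequence of events seen by the handler.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // emits the buffered character content, ignoring white space only
            private void flushText() {
                String t = text.toString().trim();
                if (!t.isEmpty()) {
                    log.add("text:" + t);
                }
                text.setLength(0);
            }
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                flushText();
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                flushText();
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: the handler sees each event once, in document order, and keeps only what it chooses to keep.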
How XML Looks like from a Browser

How to Work with XML

XML is text
  so any text editor is perfectly fine
A number of XML editors around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
We need to understand
  what an XML document is
  how XML works

What is an XML Document

It can be
  a text file
  a record in a database
  a run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text documents

<student>
  <studentname>
    <name>Carlo</name>
    <surname>Nervo</surname>
  </studentname>
  <studentnumber> 0000145678 </studentnumber>
  <course>2036</course>
</student>

How does XML Work

Who handles XML documents
  after they have been produced
  how, why
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets, programs in some language, etc.

Where is XML actually used
Everywhere already
Some History of XML & Related

Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
  but SGML had obvious limitations
    too complex
      more than 150 pages
      never implemented fully
    too complex for the Internet
SGML "Lite" (1996, Bosak, Bray, et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

XML Fundamentals
A Simple XML Document

<player>
  Carlo Nervo
</player>

XML Document & Files

This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

<player>
  Carlo Nervo
</player>

XML Elements & Tags

The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
Tags are markup
  the most common form of markup, but there are other kinds
Content is character data
  including the white space between Carlo & Nervo

<player>
  Carlo Nervo
</player>

Tag Syntax

Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …

XML Trees: A Simple Example

player
  name
    Carlo
  surname
    Nervo
  team
    Bologna
  team
    Mantova

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

An XML Document is an XML Tree

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
    each element but the root has exactly one parent
  child elements from the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
  nesting needs to be perfect, overlapping is not allowed

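The tree view of a document can be reproduced with a short recursive walk. The following is a sketch under assumptions: TreeOutlineSketch and outline are illustrative names, the parser is the JAXP one bundled with the JDK, and white-space-only text between elements is skipped to keep the outline readable.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class TreeOutlineSketch {

    // Returns an indented outline of the element tree: one line per element or text leaf.
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder sb = new StringBuilder();
            walk(doc.getDocumentElement(), 0, sb);
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static void walk(Node node, int depth, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String t = node.getNodeValue().trim();
            if (t.isEmpty()) {
                return; // skip white-space-only text between elements
            }
            indent(depth, sb);
            sb.append('"').append(t).append('"').append('\n');
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // comments and processing instructions are left out
        }
        indent(depth, sb);
        sb.append(node.getNodeName()).append('\n');
        // attributes are not child nodes, so they never show up in this loop
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, sb);
        }
    }

    private static void indent(int depth, StringBuilder sb) {
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
    }

    public static void main(String[] args) {
        System.out.print(outline("<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team><team current=\"no\">Mantova</team></player>"));
    }
}
```

The current attribute does not appear in the outline, since attributes qualify a node without adding children to it.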
Narrative-Organised XML

<biography>
<name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.

After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.

…

</biography>

XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and leaves of the tree

<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

Using Elements or Attributes
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore28
Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML Entities
Markup Entity Descriptionamplt lt less-thenampgt gt grater-than
ampamp amp ampersandampquot double quoteampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
Comments
Easylt-- Comment --gt
It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-FormednessMain rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
Flexibility or Rigidity
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one teamDocument Type Definition (DTD)
to define which XML documents are validValidity is not mandatory as well-formedness
how to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element and of its content, unless overridden by a nested xml:lang
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
    such as x-quenya
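A tool such as the spell-checker mentioned above has to find out which language applies to a given element. A possible sketch, assuming the JAXP DOM API from the JDK: walk up from the element until an xml:lang attribute is found. The class name LangDemo, the helper methods and the sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LangDemo {
    // Hypothetical multi-lingual document.
    static final String DOC =
        "<doc xml:lang='en'><p>Hello</p><p xml:lang='it'><em>Ciao</em></p></doc>";

    // Returns the closest xml:lang value found on the node or its ancestors.
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n instanceof Element && ((Element) n).hasAttribute("xml:lang")) {
                return ((Element) n).getAttribute("xml:lang");
            }
        }
        return ""; // no language information available
    }

    // Convenience helper: language of the first element with the given tag name.
    static String languageOfFirst(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return languageOf(doc.getElementsByTagName(tagName).item(0));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("first p: " + languageOfFirst(DOC, "p"));
        System.out.println("em:      " + languageOfFirst(DOC, "em"));
    }
}
```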
Encoding for Portability
Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
XML & CSS
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML however can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://www.w3.org/Style/CSS/
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example [slide image]
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
No need for DTD or Schema
  even though the browser could anyway complain…
XML + CSS Example: The XML Doc [slide image]
Example: How Mozilla Visualises it [without CSS Style Sheet]
Example: How Mozilla Visualises it [with CSS Style Sheet]
DOM & SAX
Manipulating XML
Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
Most often we need to manipulate (access, delete, modify) parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM
  SAX
Document Object Model
http://www.w3.org/DOM/
  W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
    other Level 2 parts: Events, Views, Style, Traversal and Range
  Level 3 Core: W3C Recommendation, April 2004 (Working Draft in April 2002)
    adds minor new features
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have
Properties & Methods of DOM
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
  see for instance http://java.sun.com/xml/jaxp
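A Simple Java DOM
    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
      try {
        DOMParser p = new DOMParser();            // Xerces DOM parser
        p.parse(args[0]);                         // loads the whole document in memory
        Document doc = p.getDocument();
        // look for the first recipe element among the children of the root
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
          n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
          print(n, out);                          // print(...) is defined elsewhere, not shown on the slide
        out.println("</collection>");
      } catch (Exception e) { e.printStackTrace(); }
    }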
Main Problem of DOM
The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
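The event-based model can be made concrete with a handler that simply records the events it receives. This is a minimal sketch, assuming the SAX API shipped with the JDK (javax.xml.parsers and org.xml.sax); the class name SaxDemo and the event labels are made up for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxDemo {
    // Parses the document and returns the sequence of events seen by the handler.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument()   { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: the handler sees each event once, in document order, and keeps only what it chooses to keep.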
How to Work with XML
XML is text
  so any text editor is perfectly fine
A number of XML editors around
  but typically general text editors with some programming / Web-oriented capabilities are good enough, and often even better
Visualisation is a different matter
  browsers do something
  but XML is not a presentation language, so…
we need to understand
  what an XML document is
  how XML works
What is an XML Document?
It can be
  A text file
  A record in a database
  A run-time construction in memory
  …
In any case, it can be handled and transmitted by any system capable of dealing with text documents
  <student>
    <studentname>
      <name>Carlo</name>
      <surname>Nervo</surname>
    </studentname>
    <studentnumber> 0000145678 </studentnumber>
    <course>2036</course>
  </student>
How does XML Work?
Who handles an XML document
  after it has been produced?
  how? why?
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets, programs in some language, etc.
Where is XML actually used?
Everywhere, already
Some History of XML & Related
A lot still to be written…
SGML is where it comes from
  HTML was the first successful application of SGML
  but SGML had obvious limitations
    too complex
      more than 150 pages
      never implemented fully
    too complex for the Internet
SGML “Lite” (1996, Bosak, Bray et al.)
XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
XML Fundamentals
A Simple XML Document
  <player>
    Carlo Nervo
  </player>
XML Document & Files
This is a complete XML document
It can be stored / recorded / built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml
XML Elements & Tags
The document contains a single element
  of type player
Such an element is delimited by the tag player
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo
Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to case when XML is combined with other technologies
    HTML (case-insensitive tag names), JavaScript & XHTML (case-sensitive), …
XML Trees: A Simple Example
player
  name: Carlo
  surname: Nervo
  team: Bologna
  team: Mantova

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
An XML Document is an XML Tree
An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
    each element except the root has exactly one parent
    child elements from the same parent are siblings
    leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping not allowed
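The tree view of the player document can be reproduced by walking the parsed document recursively. A minimal sketch, assuming the JAXP DOM API from the JDK; the class name TreeDemo and the outline format are hypothetical choices for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TreeDemo {
    static final String PLAYER =
        "<player><name>Carlo</name><surname>Nervo</surname>"
      + "<team current='yes'>Bologna</team><team current='no'>Mantova</team></player>";

    // Appends the element name, indented by depth, then visits the children.
    static void outline(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) return; // skip text, comments, ...
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(node.getNodeName()).append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            outline(children.item(i), depth + 1, out);
        }
    }

    // Parses the text and returns the element tree as an indented outline.
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            outline(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(PLAYER));
    }
}
```

Note that the current attributes do not show up as nodes of the outline: they qualify the team elements without changing the shape of the tree.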
Narrative-Organised XML
  <biography>
    <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
    After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
    …
  </biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and leaves of the tree
Using Elements or Attributes?
Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>
XML Names
XML Names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    latin, or even non-latin like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore
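The naming rules above can be turned into a small checker. The following is only an approximation of the rules as stated on the slide, not the full Name production of the XML specification (which also admits some combining characters and extenders); the class and method names are hypothetical, and the colon is rejected because it is reserved to namespaces.

```java
class NameDemo {
    // Approximate check: letter or underscore first, then letters,
    // digits, underscore, hyphen or period. No colon, no white space.
    static boolean isSimpleName(String s) {
        if (s.isEmpty()) return false;
        char first = s.charAt(0);
        if (!(Character.isLetter(first) || first == '_')) return false;
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean allowed = Character.isLetterOrDigit(c)
                           || c == '_' || c == '-' || c == '.';
            if (!allowed) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String[] candidates = { "player", "first_name", "_tmp", "2nd", "-x", "a b", "svg:set" };
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isSimpleName(candidate));
        }
    }
}
```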
Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out its tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse start again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
Entity References
&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
Pre-defined XML Entities
Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote
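When text is written into an XML document by a program, the five pre-defined entities are what the program has to emit in place of the raw characters. A minimal sketch in plain Java; the class name EscapeDemo and the method escape are hypothetical.

```java
class EscapeDemo {
    // Replaces the five special characters with the pre-defined entity references.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Batman & Robin"));
        System.out.println(escape("if (a < b) say(\"hi\");"));
    }
}
```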
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser: “do not parse this”
  good for instance to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not
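That escaped text and a CDATA section carry the same character data can be checked by parsing both forms and comparing the text read back. A minimal sketch, assuming the JAXP DOM API from the JDK; the class name CdataDemo, the code element and the sample snippet are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataDemo {
    static final String ESCAPED =
        "<code>if (a &lt; b &amp;&amp; b &lt; c) ok();</code>";
    static final String CDATA =
        "<code><![CDATA[if (a < b && b < c) ok();]]></code>";

    // Returns the text content of the root element after parsing.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(ESCAPED));
        System.out.println(textOf(CDATA));
        System.out.println("same text: " + textOf(ESCAPED).equals(textOf(CDATA)));
    }
}
```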
Comments
Easy
  <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them, at their will
Comments are meant to improve human legibility of XML docs
  to give info to a computational agent: processing instructions
XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear everywhere out of a tag, even before or after the root
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it should be the first thing in the document, absolutely
    not even comments allowed before
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8, or UTF-16 with a byte order mark)
Standalone means that it has no external DTD
  optional, default no
Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around
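Any XML parser can serve as a well-formedness checker, since a violation of these rules is a fatal error. The following is a minimal sketch, assuming the SAX parser shipped with the JDK; the class name WellFormedDemo and the method isWellFormed are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedDemo {
    // Returns true if the text parses without fatal errors.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // the parser reported a well-formedness violation
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<player><name>Carlo</name></player>",  // fine
            "<em><b>Wrong </em> XML</b>",           // overlapping elements
            "<a/><b/>",                             // two root elements
            "<team current=yes/>",                  // unquoted attribute value
            "<team current='yes' current='no'/>",   // duplicated attribute
            "<p>1 < 2</p>"                          // unescaped < in character data
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```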
DTD
Flexibility or Rigidity?
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like “a football player should have at least one team”
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle validity errors is optional
Validation
A valid XML document includes a DTD the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still
A Simple DTD Example
We do not go too deep into DTD syntax
  we just look at the example below and comment
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
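Validation against an internal DTD can be tried with a validating parser. This is a minimal sketch, assuming the JAXP DOM API from the JDK; the class name DtdDemo and its helpers are hypothetical. The name in the DOCTYPE is player here because a validating parser requires it to match the root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class DtdDemo {
    // Internal DTD subset for the football player example.
    static final String DTD =
        "<!DOCTYPE player [\n"
      + "  <!ELEMENT player (name, surname, team+)>\n"
      + "  <!ELEMENT name (#PCDATA)>\n"
      + "  <!ELEMENT surname (#PCDATA)>\n"
      + "  <!ELEMENT team (#PCDATA)>\n"
      + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
      + "]>\n";

    // Parses with validation switched on and collects the validity errors.
    static List<String> validationErrors(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation is off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // not even well-formed, or I/O problem
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = DTD + "<player><name>Carlo</name><surname>Nervo</surname>"
                           + "<team current='yes'>Bologna</team></player>";
        String noTeam = DTD + "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println("valid document errors:   " + validationErrors(valid));
        System.out.println("missing team, errors:    " + validationErrors(noTeam));
    }
}
```

A document that breaks the DTD is still well-formed, so the parse completes: the violations are reported as recoverable errors, which is why they are collected rather than thrown.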
DTD Declaration
DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
DTD Declarations: Define or Use
So, you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document, and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty element
Attribute Declarations
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to IDs in the document
NOTATION
  name of a notation used & defined in the document (rare)
enumeration
Other DTD Declarations etc.
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it
What is an XML Document
It can beA text fileA record in a databaseA run-time construction in memoryhellip
In any case it can be handled and trasmitted by any system capable of dealing with text documents
ltstudentgt ltstudentnamegt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltstudentnamegt ltstudentnumbergt 0000145678 ltstudentnumbergt ltcoursegt2036ltcoursegtltstudentgt
14
How does XML WorkWho handles XML documents
after it has been producedhow why
XML parsersdevising out the structure of the XML documentverifying well-formedness and basic respect of XML syntax
XML validating parserswhen applicable
there is either a DTD or a Schemachecking validity
Examplesweb browsers word processors database servers drawing programs spreadsheets programs in some language etc
15
Where is XML actually used
Everywhere already
16
Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from
HTML was the first successful application of SGMLbut had obvious limitations
too complexmore than 150 pagesnever implemented fully
too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)
XML 10 (February 1998)Then a flow
namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc
17
XML Fundamentals
A Simple XML Document
ltplayergt Carlo Nervoltplayergt
19
XML Document amp Files
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp Tags
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervotags are markup
the most common form of markup but there are other kindscontent is character data
including the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is an XML Tree
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed24
Narrative-Organised XML

  <biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
  After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
  …
  </biography>

XML Documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

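Mixed content can still be processed, but text and elements have to be picked apart. The following is a rough sketch, assuming Python's standard xml.etree.ElementTree; the shortened biography text is made up for the example.

```python
import xml.etree.ElementTree as ET

bio = ET.fromstring(
    "<biography><name>Carlo Nervo</name> played for "
    "<football_team>Mantova</football_team> and then "
    "<football_team>Bologna</football_team>.</biography>"
)

# The marked-up chunks are easy to extract...
teams = [team.text for team in bio.iter("football_team")]

# ...while the narrative is scattered across text between the elements.
plain_text = "".join(bio.itertext())

print(teams)
print(plain_text)
```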
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and leaves of the tree

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

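As an illustration of the attribute rules above, the sketch below (assuming Python's standard xml.etree.ElementTree; the helper name parses is hypothetical) reads the current attribute written in two of the allowed forms and shows that a repeated attribute name is rejected.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<player><name>Carlo</name>"
    "<team current='yes'>Bologna</team>"       # single quotes
    '<team current = "no">Mantova</team>'      # spaces around the equals sign
    "</player>"
)
current_by_team = {team.text: team.get("current") for team in doc.iter("team")}


def parses(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Two attributes with the same name on one element are not allowed.
duplicate_accepted = parses('<team current="yes" current="no"/>')

print(current_by_team, duplicate_accepted)
```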
Using Elements or Attributes?
Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure; elements can be nested as needed
  attributes are unique within an element; any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>

XML Names
XML Names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    latin or even non-latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore

Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; char
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

Entity References
&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

Pre-defined XML Entities
  Markup    Entity   Description
  &lt;      <        less-than
  &gt;      >        greater-than
  &amp;     &        ampersand
  &quot;    "        double quote
  &apos;    '        single quote

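The pre-defined entities can be seen at work in both directions. This is a small sketch, assuming Python's standard xml.sax.saxutils.escape and xml.etree.ElementTree; the JavaScript-like fragment is invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing replaces the entity reference with the character it stands for.
heroes = ET.fromstring("<superheroes>Batman &amp; Robin</superheroes>").text

# Going the other way: escape() replaces &, < and > before the text is embedded.
raw = "if (a < b && b > 0)"
escaped = escape(raw)
round_trip = ET.fromstring("<script>" + escaped + "</script>").text

print(heroes)
print(escaped)
```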
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser: do not parse this
  good, for instance, to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not

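A quick way to see the last point is to parse the same text written once as a CDATA section and once with entity references. The sketch assumes Python's standard xml.etree.ElementTree, which hands back plain text in both cases.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<code><![CDATA[<player>Carlo & Nervo</player>]]></code>"
)
with_entities = ET.fromstring(
    "<code>&lt;player&gt;Carlo &amp; Nervo&lt;/player&gt;</code>"
)

# Inside the CDATA section nothing was parsed: <player> is text, not a child element.
child_count = len(with_cdata)

print(with_cdata.text)
print(with_cdata.text == with_entities.text)
```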
Comments
Easy
  <!-- Comment -->
  It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
  they can appear anywhere, even before the root element
    but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to a computational agent: processing instructions

XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear everywhere out of a tag, even before or after the root

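To see how a processing instruction reaches the application, here is a minimal sketch assuming Python's standard xml.dom.minidom; the style sheet name player.css is a made-up example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml-stylesheet type="text/css" href="player.css"?>'
    "<!-- a comment before the root -->"
    "<player>Carlo Nervo</player>"
)

# Processing instructions sit beside the root element, outside the element tree.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
root_name = doc.documentElement.tagName

print(instructions, root_name)
```

The parser only splits the instruction into its target and the rest; what the remaining data means is left to the application named by the target.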
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but, if there, it should be the first thing in the document, absolutely
    not even comments allowed before
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8, or UTF-16 with a byte order mark)
Standalone says whether the document does without an external DTD (yes) or may depend on one (no)
  optional, default no

Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  Just look around

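Any XML parser can double as a well-formedness checker. The sketch below assumes Python's standard xml.etree.ElementTree and tries one deliberately broken sample per rule; the function name and the sample labels are illustrative only.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(text):
    """Return None if text is well-formed XML, else the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as error:
        return str(error)


samples = {
    "ok": '<player team="Bologna">Carlo</player>',
    "mismatched tags": "<player>Carlo</Player>",
    "overlapping elements": "<em><b>Wrong</em> XML</b>",
    "two roots": "<name>Carlo</name><surname>Nervo</surname>",
    "unquoted attribute": "<team current=yes>Bologna</team>",
    "duplicate attribute": '<team current="yes" current="no"/>',
    "unescaped ampersand": "<heroes>Batman & Robin</heroes>",
}

report = {label: well_formedness_error(text) is None for label, text in samples.items()}

for label, accepted in report.items():
    print(label, "->", "well-formed" if accepted else "rejected")
```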
DTD
Flexibility or Rigidity?
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle validity errors is optional

Validation
A valid XML Document includes a DTD the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still

A Simple DTD Example
We do not go too deep into DTD syntax
  we just look at the example below and comment

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

DTD Declaration
DTD is declared here as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
The name after DOCTYPE has to match the root element (here, player)

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

DTD Declarations: Define or Use
So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty element

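A content model reads much like a regular expression over the sequence of child element names. The sketch below illustrates that reading for (name, surname, team+) only; it is a hand translation and not a DTD validator (Python's standard library, assumed here, does not validate against DTDs), and the names PLAYER_MODEL and matches_player_model are made up.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the content model (name, surname, team+).
# Each child name is followed by a space so that names cannot run together.
PLAYER_MODEL = re.compile(r"name surname (team )+")


def matches_player_model(xml_text):
    """Check that the root is a player whose children follow the model."""
    root = ET.fromstring(xml_text)
    children = "".join(child.tag + " " for child in root)
    return root.tag == "player" and PLAYER_MODEL.fullmatch(children) is not None


valid = "<player><name>C</name><surname>N</surname><team>B</team><team>M</team></player>"
no_team = "<player><name>C</name><surname>N</surname></player>"
wrong_order = "<player><surname>N</surname><name>C</name><team>B</team></player>"

print(matches_player_model(valid), matches_player_model(no_team), matches_player_model(wrong_order))
```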
Attribute Declarations
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string

Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to IDs in the document
NOTATION
  name of a notation used & defined in the document (rare)
enumeration

Other DTD Declarations etc.
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually?
We stop here
  more only for those who need it

Namespaces
What are Namespaces for?
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespace Use
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes to URI
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world as prefixes everywhere inside that element
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URI are standardised, not prefixes
  but usually svg, rdf and other prefixes are not re-defined
  also, they are conventional names
    not necessarily pointing to an actual resource

Setting Default Namespaces
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need for the svg prefix to be made explicit

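That the URI matters and the prefix does not can be checked with a namespace-aware parser. This is a sketch assuming Python's standard xml.etree.ElementTree, which reports each name as {URI}local_part; the prefix s is an arbitrary choice for the example.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"
MATHML = "http://www.w3.org/1998/Math/MathML"

prefixed = ET.fromstring(f'<svg:svg xmlns:svg="{SVG}"><svg:set/></svg:svg>')
default = ET.fromstring(f'<svg xmlns="{SVG}"><set/></svg>')
other_prefix = ET.fromstring(f'<s:svg xmlns:s="{SVG}"><s:set/></s:svg>')

# All three spellings yield the same namespace-qualified names.
names = [[element.tag for element in doc.iter()] for doc in (prefixed, default, other_prefix)]

# A set element from MathML has the same local part but a different full name.
mathml_set = ET.fromstring(f'<set xmlns="{MATHML}"/>').tag

print(names[0])
print(mathml_set)
```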
Internationalisation
What does "Text" Mean?
"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared

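The difference between code points and bytes shows up as soon as a non-ASCII character is involved. A small sketch, assuming Python and its standard xml.etree.ElementTree; the word and the city element are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET

text = "Universit\u00e0"                     # ends with LATIN SMALL LETTER A WITH GRAVE

code_points = [ord(ch) for ch in text]       # one integer per character
utf8 = text.encode("utf-8")                  # the last character takes 2 bytes
utf16 = text.encode("utf-16-be")             # every character here takes 2 bytes
latin1 = text.encode("iso-8859-1")           # every character takes 1 byte

# The declared encoding tells the parser how to turn bytes back into characters.
document = b'<?xml version="1.0" encoding="iso-8859-1"?><city>' + latin1 + b"</city>"
parsed = ET.fromstring(document).text

print(len(text), len(utf8), len(utf16), len(latin1))
```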
The XML Encoding Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration

Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
    such as x-quenya

Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML however can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following we explore the XML + CSS issue

Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
No need for DTD or Schema
  even though the browser could anyway complain…

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it [without CSS Style Sheet]

Example: How Mozilla Visualises it [with CSS Style Sheet]

DOM & SAX

Manipulating XML
Representing information in an XML Document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
  access, delete, modify
  parts of an XML document
    which either may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc API
The most used, hated, deprecated, widespread are
  DOM
  SAX

Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
    other Level 2 parts: Events, Views, Style, Traversal and Range
  Level 3 Core: W3C Working Draft, April 2002
    adds minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes each of them should / could have

Properties & Methods of DOM
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
  see for instance http://java.sun.com/xml/jaxp

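The general node properties and the two members named above can be tried out directly. The sketch assumes Python's standard xml.dom.minidom, one of the language bindings of the DOM API; the variable names are illustrative.

```python
from xml.dom import minidom

doc = minidom.parseString('<player><team current="yes">Bologna</team></player>')
team = doc.documentElement.firstChild

# Every node has a name, a value and a type.
element_facts = (team.nodeName, team.nodeValue, team.nodeType == team.ELEMENT_NODE)
text_node = team.firstChild
text_facts = (text_node.nodeName, text_node.nodeValue)

# attributes gives access to all the attributes of the node.
attributes = dict(team.attributes.items())

# appendChild adds the new node after the existing children.
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
doc.documentElement.appendChild(new_team)

result = doc.documentElement.toxml()
print(result)
```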
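A Simple Java DOM

  import java.io.PrintStream;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;
  // Assumption: DOMParser is the Apache Xerces parser class.
  import org.apache.xerces.parsers.DOMParser;

  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      // Scan the children of the root for the first recipe element.
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);  // print(Node, PrintStream) is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
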
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools

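The event-based model can be illustrated by a handler that simply records what the parser tells it. This is a minimal sketch assuming Python's standard xml.sax module; the class name EventLog is made up for the example.

```python
import xml.sax


class EventLog(xml.sax.ContentHandler):
    """Record every event the SAX parser reports, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)

    def endElement(self, name):
        self.events.append("end " + name)

    def characters(self, content):
        if content.strip():
            self.events.append("text " + content.strip())


handler = EventLog()
xml.sax.parseString(b"<player><name>Carlo</name><surname>Nervo</surname></player>", handler)

for event in handler.events:
    print(event)
```

No tree is ever built: once an event has been handled, the corresponding part of the document can be forgotten, which is where the memory savings come from.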
How does XML Work?
Who handles XML documents?
  after they have been produced
  how? why?
XML parsers
  working out the structure of the XML document
  verifying well-formedness and basic respect of XML syntax
XML validating parsers
  when applicable
    there is either a DTD or a Schema
  checking validity
Examples
  web browsers, word processors, database servers, drawing programs, spreadsheets, programs in some language, etc.

Where is XML actually used?
Everywhere already

Some History of XML & Related
Lot to be written still…
SGML is where it comes from
  HTML was the first successful application of SGML
  but SGML had obvious limitations
    too complex
      more than 150 pages
      never implemented fully
    too complex for the Internet
SGML "Lite" (1996, Bosak, Bray et al.)
  XML 1.0 (February 1998)
Then a flow
  namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.

XML Fundamentals
A Simple XML Document

  <player> Carlo Nervo </player>

XML Document amp Files
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp Tags
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervotags are markup
the most common form of markup but there are other kindscontent is character data
including the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is an XML Tree
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed24
Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributesattributes are specified in the start tag
and in the only tag of empty elementsany number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or Attributes
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore28
Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML Entities
Markup Entity Descriptionamplt lt less-thenampgt gt grater-than
ampamp amp ampersandampquot double quoteampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
Comments
Easylt-- Comment --gt
It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-FormednessMain rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
Flexibility or Rigidity
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one teamDocument Type Definition (DTD)
to define which XML documents are validValidity is not mandatory as well-formedness
how to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc.
  ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
  NOTATION declarations
    who cares, actually?
  We stop here
    more only for those who need it
Namespaces
What are Namespaces for?
  Distinguish
    different XML applications may use the same names
      at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
  Group
    names of elements and attributes of the same XML application can be grouped together
      to be more easily recognised and handled
  Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
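As an illustration of the disambiguation, the sketch below parses a document mixing the two set elements with Python's standard xml.etree.ElementTree module, which expands each qualified name into the form {namespace URI}local name. The page wrapper element and the m prefix are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"
MATHML = "http://www.w3.org/1998/Math/MathML"

# Two elements with the same local name, told apart by their namespaces.
page = ET.fromstring(
    f'<page xmlns:svg="{SVG}" xmlns:m="{MATHML}">'
    "<svg:set/><m:set/>"
    "</page>"
)
tags = [child.tag for child in page]
```

After parsing, the two elements carry different names, so an application can route each one to the right processor.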
Syntax for Namespace Use
  Qualified names
    prefix:local_part
    also called QNames, or raw names
  Examples of qualified names
    rdf:description, xlink:type, xsl:template
  Used for both element and attribute names
Associating Prefixes to URIs
  Example
    a large firm could have a number of namespaces for different purposes
      <company xmlns:local="http://www.company.it/xml"
               xmlns:euro="http://www.company.eu/xml"
               xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
  URIs are standardised, not prefixes
    but usually svg, rdf and other well-known prefixes are not re-defined
    also, URIs are conventional names
      not necessarily pointing to an actual resource
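That the URI, and not the prefix, identifies the namespace can be checked with a short Python sketch using the standard xml.etree.ElementTree module. The order element name and the shorter e prefix are hypothetical; the URI is the one from the company example.

```python
import xml.etree.ElementTree as ET

URI = "http://www.company.eu/xml"

# Same namespace URI bound to two different prefixes.
with_long_prefix = ET.fromstring(f'<euro:order xmlns:euro="{URI}"/>')
with_short_prefix = ET.fromstring(f'<e:order xmlns:e="{URI}"/>')
```

Both documents yield the same expanded element name, so a namespace-aware application treats them as the same element type.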
Setting Default Namespaces
  xmlns attribute
    alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
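A minimal Python sketch of the effect of a default namespace, using the standard xml.etree.ElementTree module. The rect child and the numeric attribute values are assumptions added for the example. Note that, according to the Namespaces in XML recommendation, a default namespace applies to element names but not to unprefixed attribute names.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
svg = ET.fromstring(
    f'<svg xmlns="{SVG_NS}" width="100" height="50"><rect/></svg>'
)

root_tag = svg.tag                     # svg itself is in the namespace
child_tag = svg[0].tag                 # and so is the unprefixed rect child
attribute_names = sorted(svg.attrib)   # unprefixed attributes stay unqualified
```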
Internationalisation
What does Text Mean?
  "Text" can be encoded according to so many different alphabets
    mapping between characters and integers (code points)
      character set
    ASCII being the most (in)famous, now Unicode
  A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
  Any XML document is a text document
    so encoding should be declared
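The distinction between a character set (characters to code points) and an encoding (code points to bytes) can be made concrete with a few lines of Python. The sample character is an arbitrary choice for the example.

```python
text = "\u00e0"   # the letter a with grave accent, as in "Università"

code_point = ord(text)                      # position in the Unicode character set
utf8_bytes = text.encode("utf-8")           # one code point, two bytes
utf16_bytes = text.encode("utf-16-be")      # same code point, two different bytes
latin1_bytes = text.encode("iso-8859-1")    # Latin-1 needs a single byte
```

One code point, three different byte sequences: a reader who does not know which encoding was used cannot reliably recover the character.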
The XML Encoding Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
    Unicode and ISO are the most used families
  Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
      it is a text declaration, but no longer an XML declaration
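A sketch of why the declared encoding matters, using Python's standard xml.etree.ElementTree module. The city element and its content are assumptions made for the example: the same bytes are parsed once with the right declaration and once with a wrong one.

```python
import xml.etree.ElementTree as ET

# The document body as it would be stored on disk in Latin-1.
body = "<city>Forl\u00ec</city>".encode("iso-8859-1")

declared_right = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
declared_wrong = b'<?xml version="1.0" encoding="utf-8"?>' + body

city = ET.fromstring(declared_right).text   # decoded with the declared encoding

try:
    ET.fromstring(declared_wrong)
    wrong_declaration_accepted = True
except ET.ParseError:
    # The Latin-1 byte for the accented letter is not valid UTF-8 here.
    wrong_declaration_accepted = False
```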
Multi-Lingual Documents
  Example: a spell-checker or a voice-reader parsing an XML doc
  How to determine the language of a subpart?
    for multi-lingual docs
  xml:lang attribute
    can be associated to any element
    determines the language of the element
  Values are to be found in ISO 639
    standard two letters for each known language
    if not there, IANA
      prefix i-
      such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
      prefix x-
      such as x-quenya
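A minimal Python sketch of how a tool such as a spell-checker might resolve the language of each element. It assumes the rule of the XML specification that an xml:lang value also applies to the content of the element until a descendant overrides it; the helper name languages and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def languages(element, inherited=None):
    """Yield (tag, language) pairs, inheriting xml:lang from ancestors."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)

doc = ET.fromstring(
    '<doc xml:lang="en"><p>Hello</p><p xml:lang="it">Ciao</p></doc>'
)
resolved = list(languages(doc))
```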
Encoding for Portability
  Working around encoding is not simply an "internationalisation" issue
    it is also about portability
  When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML, however, can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following, we explore the XML + CSS issue
Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
  W3C Standard
    http://www.w3.org/Style/CSS
  Goals
    describing how to present elements of a document
      spanning over a range of different media
    separating style description from content and structure
  In this course we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example [figure]
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing instruction
    to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css" ?>
  CSS style sheet defining the presentation style for the XML document tags
    tagname { property1: value1; … }
  No need for DTD or Schema
    even though the browser could complain anyway…
XML + CSS Example: The XML Doc [figure]
Example: How Mozilla Visualises It, without CSS Style Sheet [figure]
Example: How Mozilla Visualises It, with CSS Style Sheet [figure]
DOM & SAX
Manipulating XML
  Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
  Most often, we need to manipulate (access, delete, modify)
    parts of an XML document
    which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used / hated / deprecated / widespread are
    DOM
    SAX
Document Object Model
  http://www.w3.org/DOM
    W3C standard, as usual
  The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope
DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
  It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
      primitive navigation and manipulation of XML trees
      other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
      adds Namespace support and minor new features
      other Level 2 parts: Events, Views, Style, Traversal and Range
    Level 3 Core: W3C Working Draft, April 2002 (W3C Recommendation since April 2004)
      adds minor new features
DOM Nodes
  An XML document is a tree
  The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
  The DOM specification states that a node can be a
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
  It also defines which kind of child nodes each of them should / could have
Properties & Methods of DOM
  Every DOM node has properties and methods to explore and update the XML tree
  Every DOM node has a name, a value, a type
  There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
  Then, any specific kind of node has its own specific properties and methods
  These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
    see for instance http://java.sun.com/xml/jaxp
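The same general properties and methods can be tried out with xml.dom.minidom, the small DOM implementation in Python's standard library. This is a sketch on the player example: the Mantova team added at the end is an assumption made to show appendChild at work.

```python
from xml.dom import Node, minidom

doc = minidom.parseString('<player><team current="yes">Bologna</team></player>')
player = doc.documentElement
team = player.firstChild          # an element node
text = team.firstChild            # the text node inside it

# Every node has a name, a value and a type.
element_info = (team.nodeName, team.nodeValue, team.nodeType)
text_info = (text.nodeName, text.nodeValue, text.nodeType)

# attributes gives access to all the attributes of the node.
current = team.attributes["current"].value

# appendChild adds the new node after the existing children.
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
player.appendChild(new_team)
serialised = player.toxml()
```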
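A Simple Java DOM Example
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;  // assumption: the Xerces DOM parser
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                        // loads the whole document in memory
      Document doc = p.getDocument();
      Node n = doc.getDocumentElement().getFirstChild();
      // skip siblings until the first recipe element (or the end of the list)
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);                         // print is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }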
Main Problem of DOM
  The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist
Simple API for XML (SAX)
  Differently from DOM, SAX is event-based
  It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as they occur: document started / ended, elements started / ended, character content, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools
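The event-based model can be sketched with xml.sax from Python's standard library. The handler class EventLogger is hypothetical: it simply records each event the parser reports while the document flows through it, without ever building a tree.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record the SAX events in the order in which the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append(f"start {name}")

    def endElement(self, name):
        self.events.append(f"end {name}")

    def characters(self, content):
        if content.strip():                      # ignore white-space-only chunks
            self.events.append(f"text {content.strip()}")

logger = EventLogger()
xml.sax.parseString(
    b"<player><name>Carlo</name><surname>Nervo</surname></player>", logger
)
```

The application keeps only what it needs from each event, which is why memory use stays low even for large documents.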
Where is XML actually used?
  Everywhere already
Some History of XML & Related
  Lot to be written still…
  SGML is where it comes from
    HTML was the first successful application of SGML
    but SGML had obvious limitations
      too complex
        more than 150 pages
        never implemented fully
      too complex for the Internet
  SGML "Lite" (1996, Bosak, Bray et al.)
    XML 1.0 (February 1998)
  Then a flow
    namespaces, XSL (then XSLT + XSL-FO), XHTML, CSS integration, XLink + XPointer, XML Schema, DOM, etc.
XML Fundamentals
A Simple XML Document
  <player>
    Carlo Nervo
  </player>
XML Document & Files
  This is a complete XML document
  It can be stored / recorded / built in the form of a number of different files, or even in other forms
    Carlonervo.xml, player.txt
    a record in a database
    a memory area built by a CGI and then transmitted
    sent by a Web server with MIME type application/xml or text/xml
XML Elements & Tags
  The document contains a single element
    of type player
  Such an element is delimited by the tag player
    between start tag <player> and end tag </player>
  In between the tags lies the element's content: Carlo Nervo
  tags are markup
    the most common form of markup, but there are other kinds
  content is character data
    including the white space between Carlo & Nervo
Tag Syntax
  Very similar to HTML tags
    at least superficially
  <tag> for start tags, </tag> for end tags
  <tag /> for empty tags
    tags with no content, like <br /> or <hr />
  XML is case sensitive
    so <player> cannot be closed by end tag </Player>
    NOTE: thus, pay attention to non-case-sensitive technologies when combined with XML
      HTML, JavaScript & XHTML, …
XML Trees: A Simple Example
  player
    name
      Carlo
    surname
      Nervo
    team
      Bologna
    team
      Mantova

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
An XML Document is an XML Tree
  An XML document has a tree-like structure
    one and only one root
      root element, or document element
    each node / element can have one or more child elements
      each element, except the root, has exactly one parent
      child elements from the same parent are siblings
      leaves are either content or empty elements
  Well-formedness stems from here
    <em><b>Wrong </em> XML</b> is not permitted
      nesting needs to be perfect, overlapping not allowed
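The tree structure of the player example can be explored with Python's standard xml.etree.ElementTree module. The sketch below builds a child-to-parent map (ElementTree itself keeps no parent pointers, so the map is an assumption of this example) and lists the leaves.

```python
import xml.etree.ElementTree as ET

player = ET.fromstring(
    "<player><name>Carlo</name><surname>Nervo</surname>"
    '<team current="yes">Bologna</team><team current="no">Mantova</team></player>'
)

# Map every child element to its parent; the root never appears as a key.
parent_of = {child: parent for parent in player.iter() for child in parent}

children = [child.tag for child in player]              # siblings, in document order
parents = {parent.tag for parent in parent_of.values()}
leaves = [element.text for element in player.iter() if len(element) == 0]
```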
Narrative-Organised XML
  <biography>
    <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
    After playing many years in minor teams, such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
    …
  </biography>
  XML documents for written narrative, such as articles, reports, blogs, books, novels
    elements with mixed content
    not easy for automated processing and exchange
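A short Python sketch of why mixed content is less convenient to process, using the standard xml.etree.ElementTree module on a shortened, hypothetical version of the biography. Text that follows a child element is not part of any child: ElementTree keeps it in the tail attribute of the preceding element.

```python
import xml.etree.ElementTree as ET

biography = ET.fromstring(
    "<biography><name>Carlo Nervo</name> moved from "
    "<football_team>Mantova</football_team> to "
    "<football_team>Bologna</football_team>.</biography>"
)

# Each child carries its own text plus the narrative text that follows it.
pieces = [(child.tag, child.text, child.tail) for child in biography]

# itertext() walks text and tails in document order, recovering the narrative.
plain_text = "".join(biography.itertext())
```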
XML Attributes
  Elements can be labelled by attributes
    attributes are specified in the start tag
      and in the only tag of empty elements
    any number of attributes can in principle be associated to an element
  An attribute is a name-value pair of the form name="value"
    alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
    only one attribute with a given name allowed per element
  Attributes do not change the tree structure of an XML document
    but they are qualifiers for the nodes and leaves of the tree
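The alternative attribute forms and the uniqueness rule can be checked with Python's standard xml.etree.ElementTree module. This is a sketch on the player example; the second team deliberately uses single quotes and spaces around the equals sign.

```python
import xml.etree.ElementTree as ET

player = ET.fromstring(
    "<player>"
    '<team current="yes">Bologna</team>'
    "<team current = 'no'>Mantova</team>"
    "</player>"
)
current_by_team = {team.text: team.get("current") for team in player}

# Two attributes with the same name on one element are rejected by the parser.
try:
    ET.fromstring('<team current="yes" current="no"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False
```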
Using Elements or Attributes?
  Attributes are for meta-data about the element, and content is information of the element
    maybe, but then it is not easy to clearly distinguish between the two
  Element-based structure is more flexible than attribute-based
    attributes provide for a flat data structure, elements can be nested as needed
    attributes are unique within an element, any number of elements of the same type can be used within an element
  Attributes are quite useful in narrative-based XML documents

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>
XML Names
  XML names follow the same rules for the names of elements, attributes and some other constructs
    to increase efficiency and abate complexity
  An XML name can include
    any letter
      Latin or even non-Latin, like ideographs
    any digit
    underscore, hyphen and period (_ - .)
    a colon (:) is reserved to namespaces
  An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with letters, ideographs or underscore
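The rules above can be approximated with a regular expression. The Python sketch below is only an approximation, not the exact Name production of the XML specification: it relies on the Unicode-aware \w class of Python's re module, and the helper name looks_like_xml_name is hypothetical.

```python
import re

# \w covers Unicode letters, digits and underscore;
# [^\W\d] means "a word character that is not a digit", i.e. a letter or underscore.
# The colon is left out on purpose, since it is reserved to namespaces.
NAME = re.compile(r"[^\W\d][\w.-]*")

def looks_like_xml_name(candidate):
    """Approximate check of the naming rules listed above."""
    return NAME.fullmatch(candidate) is not None
```

For instance, looks_like_xml_name("first-name.v2") holds, while "2nd_team" (leading digit), "first name" (white space) and "-team" (leading hyphen) are rejected.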
Parsed Character Data
  An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
  All characters are interpreted as character data to be parsed
    unless an escape character & is encountered
    character data to parse starts again after the ; character
  E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
  becomes the parsed character data
    Batman & Robin
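Both directions, unescaping on input and escaping on output, can be observed with Python's standard xml.etree.ElementTree module. The script element and its JavaScript-like content are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Parsing: the entity reference is replaced by the character it stands for.
heroes = ET.fromstring("<superheroes>Batman &amp; Robin</superheroes>")
parsed = heroes.text

# Serialising: special characters in character data are escaped again.
script = ET.Element("script")
script.text = "if (a < b && ready) run();"
serialised = ET.tostring(script, encoding="unicode")
```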
Entity References
  &entityreference;
    an entity is something defined outside the normal flow of the XML document
      out of the XML tree
    used for constants, common values, external values, etc.
      through an entity reference
  Users of any sort may define their own entities
    we'll see how soon, for instance through DTDs
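A minimal Python sketch of a user-defined entity used as a constant, with the standard xml.etree.ElementTree module. The entity name club and its value are hypothetical; the entity is declared in an internal DTD subset and expanded by the parser wherever it is referenced.

```python
import xml.etree.ElementTree as ET

document = (
    '<!DOCTYPE player [ <!ENTITY club "Bologna FC"> ]>'
    "<player><team>&club;</team><team>&club; Youth</team></player>"
)

# The parser replaces each &club; reference with the declared value.
teams = [team.text for team in ET.fromstring(document)]
```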
Pre-defined XML Entities
  Markup    Entity   Description
  &lt;      <        less-than
  &gt;      >        greater-than
  &amp;     &        ampersand
  &quot;    "        double quote
  &apos;    '        single quote
CDATA Sections
  Including code chunks from any language with < or & can be tedious
    we need to tell the parser "do not parse this"
    good for instance to include segments of XML code to show
  CDATA section
    between <![CDATA[ and ]]>
    can contain anything but its own end delimiter ]]>
  After parsing, there is no way to tell whether a text came from a CDATA section or not
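The last point can be verified with Python's standard xml.etree.ElementTree module: the sketch below parses the same (hypothetical) content once as a CDATA section and once with escapes, and compares the results.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<code><![CDATA[<b>bold</b> & more]]></code>")
with_escapes = ET.fromstring("<code>&lt;b&gt;bold&lt;/b&gt; &amp; more</code>")

cdata_text = with_cdata.text       # markup inside CDATA is plain character data
escaped_text = with_escapes.text   # same characters, obtained through escapes
```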
Comments
  Easy
    <!-- Comment -->
    it cannot contain --, nor can it end with --->
  Comments do not affect the document tree structure
    they can appear anywhere, even before the root element
    but not inside a tag or a comment
  Parsers may either drop or keep them at will
  Comments are meant to improve human legibility of XML docs
    to give info to a computational agent: processing instructions
XML Processing Instructions
  Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
  Processing instructions serve this very purpose
    <?target … ?>
  The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
  A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root
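How a processing instruction reaches the application can be sketched with xml.dom.minidom from Python's standard library. The style sheet name player.css is hypothetical. The instruction appears as a node of its own, placed before the root element, with a target and a data part.

```python
from xml.dom import Node, minidom

doc = minidom.parseString(
    '<?xml-stylesheet type="text/css" href="player.css"?>'
    "<player>Carlo Nervo</player>"
)

# The document node has two children: the instruction, then the root element.
instruction, root = doc.childNodes
target = instruction.target   # identifies who should handle the instruction
data = instruction.data       # everything after the target, left uninterpreted
```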
The XML Declaration
  Looks like an XML processing instruction
    but it is not: it is just the XML declaration
  It is optional
    but if present, it must be the very first thing in the document
      not even comments allowed before
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Version is the XML version (1.0, 1.1, …)
  Encoding is the form of the text (Unicode in the example)
    optional, default Unicode
  Standalone means that it has no external DTD
    optional, default no
Checking Well-Formedness
  Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
  Tools on the Web
    just look around
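Any XML parser doubles as a well-formedness checker, since it must reject documents that break these rules. The following is a minimal Python sketch with the standard xml.etree.ElementTree module; the helper name is_well_formed and the sample documents are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Each sample is paired with the verdict expected from the rules above.
checks = {
    "<player>Carlo Nervo</player>": True,
    "<player>Carlo Nervo</Player>": False,                 # end tag does not match
    "<em><b>Wrong </em> XML</b>": False,                   # overlapping elements
    "<name>Carlo</name><surname>Nervo</surname>": False,   # two root elements
    "<team current=yes>Bologna</team>": False,             # unquoted attribute value
    "<duo>Batman & Robin</duo>": False,                    # unescaped ampersand
}
results = {text: is_well_formed(text) for text in checks}
```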
DTD
Flexibility or Rigidity?
  XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
  Sometimes some strict rule is required
    some control over syntax should be enforced
      like: a football player should have at least one team
  Document Type Definition (DTD)
    to define which XML documents are valid
  Validity is not mandatory, unlike well-formedness
    how to handle validity errors is optional
Validation
  A valid XML document includes a DTD that the document satisfies
  Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
  Everything in the XML document must match a DTD declaration
    then, the document is valid
    otherwise, the document is invalid
  Many things a DTD does not say
    we stick with what we can specify
DTD is…
  SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
  It allows XML designers to define a grammar for their documents
    typical syntax-based approach
      maybe limited, but easy to implement
  Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still
A Simple DTD Example
  We do not go too deep into DTD syntax
    we just look at the example below and comment

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
DTD Declaration
  In the example, the DTD is declared as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
DTD Declarations: Define or Use
  So, you may
    define your own DTD, and
      either include it in your XML document
      or save it as an independent document and refer to it from one or more XML docs
    or, use an external DTD defined by someone else
      like a working group you belong to, or a standardisation body of any sort
      by referring to that externally-defined syntax for your XML docs
Element Declarations
  A player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)
Some History of XML amp RelatedLot to be written stillhellipSGML is where it comes from
HTML was the first successful application of SGMLbut had obvious limitations
too complexmore than 150 pagesnever implemented fully
too complex for the InternetSGML ldquoLiterdquo (1996 Bosak Bray et al)
XML 10 (February 1998)Then a flow
namespaces XSL (then XSLT + XSL-FO) XHTML CSS integration XLink + XPointer XML Schema DOM etc
17
XML Fundamentals
A Simple XML Document
ltplayergt Carlo Nervoltplayergt
19
XML Document amp Files
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp Tags
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervotags are markup
the most common form of markup but there are other kindscontent is character data
including the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
An XML Document is an XML Tree
  An XML document has a tree-like structure
    one and only one root
      root element, or document element
    each element can have one or more child elements
      each element except the root has exactly one parent
    child elements of the same parent are siblings
    leaves are either content or empty elements
  Well-formedness stems from here
    <em><b>Wrong </em> XML</b> is not permitted
      nesting needs to be perfect, overlapping is not allowed
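The tree view of a document can be explored with any XML library. What follows is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (the slides do not prescribe a language at this point); the document is the player example used throughout.

```python
import xml.etree.ElementTree as ET

DOC = """<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>"""

root = ET.fromstring(DOC)        # the single root (document) element
children = list(root)            # its child elements, in document order
teams = root.findall("team")     # two sibling elements of the same type

print(root.tag)
print([child.tag for child in children])
print([(t.text, t.get("current")) for t in teams])

# Leaves carry character data and have no child elements of their own.
leaf_texts = [child.text for child in children if len(child) == 0]
```

Note that the attributes are read from the team nodes themselves: they qualify nodes but do not add nodes to the tree.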
Narrative-Organised XML
    <biography>
      <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
      After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
      …
    </biography>
  XML documents for written narrative, such as articles, reports, blogs, books, novels
    elements with mixed content
    not easy for automated processing and exchange
XML Attributes
  Elements can be labelled by attributes
    attributes are specified in the start tag
      and in the only tag of empty elements
    any number of attributes can in principle be associated to an element
  An attribute is a name-value pair of the form name="value"
    alternative forms use single quotes instead of double quotes, and spaces before or after the equals (=) sign
    only one attribute with a given name is allowed per element
  Attributes do not change the tree structure of an XML document
    but they are qualifiers for the nodes and leaves of the tree
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
Using Elements or Attributes
  Attributes are for meta-data about the element, and content is information of the element
    maybe, but then it is not easy to clearly distinguish between the two
  Element-based structure is more flexible than attribute-based
    attributes provide for a flat data structure, elements can be nested as needed
    attributes are unique within an element, any number of elements of the same type can be used within an element
  Attributes are quite useful in narrative-based XML documents
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes" value="Bologna" />
      <team current="no" value="Mantova" />
    </player>
XML Names
  XML names are used, and follow the same rules, for the names of elements, attributes, and some other constructs
    to increase efficiency and reduce complexity
  An XML name can include
    any letter
      Latin or even non-Latin, like ideographs
    any digit
    underscore, hyphen, and period (_ - .)
    a colon (:), which is reserved for namespaces
  An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with a letter, an ideograph, or an underscore
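The naming rules can be checked empirically by asking a parser. This is only a sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_accepted_as_element_name and the candidate names are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_accepted_as_element_name(name: str) -> bool:
    """Return True if the parser accepts `name` as the name of an empty element."""
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False

candidates = ["player", "_player", "first-name.2", "città",
              "2nd_team", "-name", "first name"]
results = {name: is_accepted_as_element_name(name) for name in candidates}

for name, accepted in results.items():
    print(f"{name!r}: {'accepted' if accepted else 'rejected'}")
```

Names with a colon are left out on purpose: a namespace-aware parser expects the prefix before the colon to be declared.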
Parsed Character Data
  An XML parser interprets the character sequences it is fed, trying to work out the tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
  All characters are interpreted as character data to be parsed
    unless an escape character & is encountered
    parsing of character data starts again after the ; character
  E.g., the content of the element
      <superheroes>Batman &amp; Robin</superheroes>
    becomes the parsed character data
      Batman & Robin
Entity References
  &entityreference;
    an entity is something defined outside the normal flow of the XML document
      out of the XML tree
    used for constants, common values, external values, etc.
      through an entity reference
  Users of any sort may define their own entities
    we'll see how soon: for instance, through DTDs
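As a hedged illustration of a user-defined entity, the sketch below declares one in an internal DTD and lets a parser expand it. It assumes Python's standard xml.etree.ElementTree module; the entity name club and its value are invented for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!DOCTYPE player [
  <!ENTITY club "Bologna FC">
]>
<player>
  <name>Carlo</name>
  <team>&club;</team>
</player>"""

root = ET.fromstring(DOC)
team_text = root.find("team").text   # the reference is replaced by the entity value
print(team_text)

# A reference to an entity that was never declared is rejected by the parser.
try:
    ET.fromstring("<team>&unknown;</team>")
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
```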
Pre-defined XML Entities
  Markup    Character   Description
  &lt;      <           less-than
  &gt;      >           greater-than
  &amp;     &           ampersand
  &quot;    "           double quote
  &apos;    '           single quote
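In practice the pre-defined entities are rarely typed by hand: libraries insert them when text is written into a document. A minimal sketch, assuming Python's standard xml.sax.saxutils helpers; the sample string is made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr, unescape

raw = 'if (a < b && b > 0) print("Batman & Robin");'

escaped = escape(raw)      # replaces &, < and > with entity references
print(escaped)

# Round trip: once embedded in an element, the parser gives back the raw text.
round_trip = ET.fromstring(f"<code>{escaped}</code>").text
restored = unescape(escaped)

# quoteattr() escapes a value and wraps it in suitable quotes for an attribute.
attr = quoteattr('Batman & "Robin"')
print(attr)
```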
CDATA Sections
  Including code chunks from any language with < or & can be tedious
    we need to tell the parser "do not parse this"
    good, for instance, to include segments of XML code to show
  CDATA section
    between <![CDATA[ and ]]>
    can contain anything but its own end delimiter ]]>
  After parsing, there is no way to tell whether a text came from a CDATA section or not
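The last point can be observed directly. The sketch below, assuming Python's standard xml.etree.ElementTree module and an invented script element, parses the same text once from a CDATA section and once from escaped character data.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<script><![CDATA[if (a < b && b < c) go();]]></script>")
with_escapes = ET.fromstring(
    "<script>if (a &lt; b &amp;&amp; b &lt; c) go();</script>")

print(with_cdata.text)
same_text = with_cdata.text == with_escapes.text

# Writing the element back out shows that the CDATA section was not remembered.
serialised = ET.tostring(with_cdata, encoding="unicode")
print(serialised)
```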
Comments
  Easy
    <!-- Comment -->
    it cannot contain --, nor can it end with --->
  Comments do not affect the tree structure of the document
    they can appear anywhere, even before the root element
    but not inside a tag or another comment
  Parsers may either drop or keep them at will
  Comments are meant to improve human legibility of XML docs
    to give information to a computational agent, use processing instructions
XML Processing Instructions
  Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
  Processing instructions serve this very purpose
    <?target … ?>
  The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
  A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root
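A parser that keeps comments and processing instructions exposes them as nodes next to the root element. This is a sketch, assuming Python's standard xml.dom.minidom module, which happens to keep both; the style sheet name player.css is invented.

```python
from xml.dom import Node, minidom

DOC = (
    '<?xml-stylesheet type="text/css" href="player.css"?>'
    "<!-- a note for human readers -->"
    "<player>Carlo Nervo</player>"
)

document = minidom.parseString(DOC)

# Classify the top-level nodes of the document, in order of appearance.
kinds = []
for node in document.childNodes:
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        kinds.append(("pi", node.target, node.data))
    elif node.nodeType == Node.COMMENT_NODE:
        kinds.append(("comment", node.data))
    elif node.nodeType == Node.ELEMENT_NODE:
        kinds.append(("element", node.tagName))

print(kinds)
```

The processing instruction arrives as a target plus an uninterpreted data string: making sense of the data is left to the application named by the target.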
The XML Declaration
  Looks like an XML processing instruction
    but it is not: it is just the XML declaration
  It is optional
    but if present, it must be the very first thing in the document
      not even comments are allowed before it
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Version is the XML version (1.0, 1.1, …)
  Encoding is the form of the text (Unicode in the example)
    optional, default Unicode (UTF-8 or UTF-16)
  Standalone set to yes means that the document does not rely on an external DTD
    optional, default no
Checking Well-Formedness
  Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
  Tools on the Web
    just look around
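Any conforming parser doubles as a well-formedness checker, since it must reject documents that break these rules. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper is_well_formed and the test strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Each document is paired with the verdict the rules above predict.
cases = {
    "<player><name>Carlo</name></player>": True,
    "<em><b>Wrong </em> XML</b>": False,              # overlapping elements
    "<player></Player>": False,                       # names are case sensitive
    "<a/><b/>": False,                                # more than one root
    "<team current=yes/>": False,                     # unquoted attribute value
    '<team current="yes" current="no"/>': False,      # duplicated attribute
    "<p>Batman & Robin</p>": False,                   # unescaped ampersand
    "<!-- note --><?xml version='1.0'?><a/>": False,  # declaration not first
}
results = {text: is_well_formed(text) for text in cases}

for text, verdict in results.items():
    print("ok " if verdict else "bad", text)
```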
DTD
Flexibility or Rigidity
  XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
  Sometimes some strict rule is required
    some control over syntax should be enforced
      e.g., a football player should have at least one team
  Document Type Definition (DTD)
    to define which XML documents are valid
  Validity is not mandatory, as well-formedness is
    how to handle validity errors is optional
Validation
  A valid XML document includes a DTD that the document satisfies
  Main principle
    everything not permitted is forbidden
    that is, a DTD specifies what is allowed
  Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
  Many things a DTD does not say
    we stick with what we can specify
DTD is…
  SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
  It allows XML designers to define a grammar for their documents
    typical syntax-based approach
      maybe limited, but easy to implement
  Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
A Simple DTD Example
  We do not go too deep into DTD syntax
    we just look at the example below and comment
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
DTD Declaration
  In the example, the DTD is declared as internal
    but it could be declared separately
      <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
      <!DOCTYPE player SYSTEM "http://…">
DTD Declarations: Define or Use
  So you may
    define your own DTD, and
      either include it in your XML document
      or save it as an independent document, and refer to it from one or more XML docs
    or use an external DTD defined by someone else
      like a working group you belong to, or a standardisation body of any sort
      by referring to that externally-defined syntax for your XML docs
Element Declarations
  A player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
Some Syntax
  , is for sequence
    to define ordered lists
  | is for choice
    to provide for alternatives
  suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
  parentheses for grouping
    at any level of nesting
    operators and suffixes applicable at any level
  ANY for free-form content
  EMPTY for empty elements
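Content models read much like regular expressions over the sequence of child elements. The sketch below makes that analogy concrete for the model (name, surname, team+); it is a hand translation for illustration only, not a DTD validator, and the helper children_match and the "tag;" encoding are assumptions of this example. It uses Python's standard re and xml.etree.ElementTree modules.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of (name, surname, team+): every child tag is written
# as "tag;", sequence becomes concatenation, and + keeps its meaning.
PLAYER_MODEL = re.compile(r"name;surname;(team;)+")

def children_match(xml_text: str, model: re.Pattern) -> bool:
    """Check the child-element sequence of the root against the model."""
    root = ET.fromstring(xml_text)
    child_tags = "".join(child.tag + ";" for child in root)
    return model.fullmatch(child_tags) is not None

ok = children_match(
    "<player><name>Carlo</name><surname>Nervo</surname>"
    "<team>Bologna</team><team>Mantova</team></player>", PLAYER_MODEL)
no_team = children_match(
    "<player><name>Carlo</name><surname>Nervo</surname></player>", PLAYER_MODEL)
wrong_order = children_match(
    "<player><surname>Nervo</surname><name>Carlo</name>"
    "<team>Bologna</team></player>", PLAYER_MODEL)

print(ok, no_team, wrong_order)
```

In the same spirit, a choice (a | b) would become the alternation (a;|b;), and the suffixes * and ? carry over unchanged.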
Attribute Declarations
  A team element has a current attribute
    which is mandatory
      #IMPLIED would say optional instead
    and can be either yes or no
      enumeration as an attribute type
      <!ATTLIST team current (yes | no) #REQUIRED>
Attribute Defaults
  #IMPLIED
    the attribute is optional
  #REQUIRED
    the attribute is mandatory
  #FIXED
    whether it is explicitly specified or not, it has a given value
  literal
    the default value is the literal quoted string
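Defaults declared in an internal DTD are applied by the parser even when it does not validate. The following sketch assumes Python's standard xml.etree.ElementTree module (built on the Expat parser); the attributes sport and city are invented for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE player [
  <!ATTLIST team current (yes | no) "no">
  <!ATTLIST team sport   CDATA      #FIXED "football">
  <!ATTLIST team city    CDATA      #IMPLIED>
]>
<player>
  <team current="yes">Bologna</team>
  <team>Mantova</team>
</player>"""

teams = ET.fromstring(DOC).findall("team")
attributes = [dict(team.attrib) for team in teams]

for team, attrs in zip(teams, attributes):
    print(team.text, attrs)
```

The literal default and the #FIXED value show up on the second team although the document does not write them; the #IMPLIED attribute is simply absent.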
Attribute Types
  CDATA
    any string of text acceptable in a well-formed XML attribute value
  NMTOKEN, NMTOKENS
    like an XML name, but any name character is accepted as the first character
    the plural form accepts more than one, separated by white space
  ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
  ID
    an XML name, unique in the document, working as an identifier
  IDREF, IDREFS
    reference(s) to IDs in the document
  NOTATION
    name of a notation used & defined in the document (rare)
  enumeration
Other DTD Declarations etc.
  ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
  NOTATION declarations
    who cares, actually
  We stop here
    more only for those who need it
Namespaces
What are Namespaces for?
  Distinguish
    different XML applications may use the same names
      at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
  Group
    names of elements and attributes of the same XML application can be grouped together
      to be more easily recognised and handled
  Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
Syntax for Namespace Use
  Qualified names
    prefix:local_part
  Examples of qualified names
    also called QNames, or raw names
    rdf:description, xlink:type, xsl:template
  Used for both element and attribute names
Associating Prefixes to URI
  Example
    a large firm could have a number of namespaces for different purposes
      <company xmlns:local="http://www.company.it/xml"
               xmlns:euro="http://www.company.eu/xml"
               xmlns:world="http://www.company.com/xml">
    then you can use local, euro, and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
  URIs are standardised, not prefixes
    but usually svg, rdf, and other such prefixes are not re-defined
    also, URIs are conventional names
      not necessarily pointing to an actual resource
Setting Default Namespaces
  xmlns attribute
    alone, with no :prefix suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
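That the URI, not the prefix, identifies a name can be seen in how a namespace-aware library reports element names. A minimal sketch, assuming Python's standard xml.etree.ElementTree module, which writes qualified names as {URI}local_part; the three tiny documents are invented.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

prefixed = ET.fromstring(f'<svg:svg xmlns:svg="{SVG_NS}"><svg:set/></svg:svg>')
other_prefix = ET.fromstring(f'<s:svg xmlns:s="{SVG_NS}"><s:set/></s:svg>')
defaulted = ET.fromstring(f'<svg xmlns="{SVG_NS}"><set/></svg>')

print(prefixed.tag)
tags = [[el.tag for el in doc.iter()]
        for doc in (prefixed, other_prefix, defaulted)]

# An element called set outside any namespace is a different name altogether.
plain = ET.fromstring("<set/>")
print(plain.tag)
```

All three documents yield the same expanded names, whichever prefix (or none) was used to write them.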
Internationalisation
What does Text Mean?
  “Text” can be encoded according to so many different alphabets
    mapping between characters and integers (code points)
      character set
    ASCII being the most (in)famous, now Unicode
  A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
  Any XML document is a text document
    so encoding should be declared
The XML Encoding Part of the XML Declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
    Unicode and ISO are the most used families
  Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
      it is a text declaration, no longer an XML declaration
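The difference between a code point and its encodings, and the role of the encoding declaration, can be sketched as follows. The example assumes Python's standard xml.etree.ElementTree module; the element name u and the sample word are made up.

```python
import xml.etree.ElementTree as ET

word = "Università"
code_point = ord("à")                    # one code point in the character set

as_utf8 = word.encode("utf-8")           # à takes two bytes
as_latin1 = word.encode("iso-8859-1")    # à takes one byte

doc_utf8 = b'<?xml version="1.0" encoding="utf-8"?><u>' + as_utf8 + b"</u>"
doc_latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?><u>' + as_latin1 + b"</u>"

# Different bytes, same text, because each document declares its encoding.
text_utf8 = ET.fromstring(doc_utf8).text
text_latin1 = ET.fromstring(doc_latin1).text
print(text_utf8, text_latin1)

# A wrong declaration: these Latin-1 bytes are not valid UTF-8.
try:
    ET.fromstring(b'<?xml version="1.0" encoding="utf-8"?><u>' + as_latin1 + b"</u>")
    mismatch_detected = False
except ET.ParseError:
    mismatch_detected = True
```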
Multi-Lingual Documents
  Example: a spell-checker or a voice-reader parsing an XML doc
  How to determine the language of a subpart
    for multi-lingual docs
  xml:lang attribute
    can be associated to any element
    determines the language of the element
  Values are to be found in ISO 639
    standard two letters for each known language
    if not there, IANA
      prefix i-
      such as i-navajo, i-klingon, …
    if not there either, such as for user-defined tags
      prefix x-
      such as x-quenya
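A tool such as the spell-checker mentioned above needs to know which language is in effect for each element. The sketch below assumes Python's standard xml.etree.ElementTree module and the usual reading that an xml:lang value also covers the element's content unless a descendant overrides it; the helper languages and the sample document are invented.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = """<book xml:lang="en">
  <title>Football stories</title>
  <quote xml:lang="it">Forza Bologna <note>sempre</note></quote>
</book>"""

def languages(element, inherited=None, found=None):
    """Map each element tag to the language in effect for it."""
    found = {} if found is None else found
    lang = element.get(XML_LANG, inherited)   # own value, else the parent's
    found[element.tag] = lang
    for child in element:
        languages(child, lang, found)
    return found

result = languages(ET.fromstring(DOC))
print(result)
```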
Encoding for Portability
  Working around encoding is not simply an “internationalisation” issue
    it is also about portability
  When transmitting or communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML, however, can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following, we explore the XML + CSS issue
Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
  Standard W3C
    http://w3c.org/Style/CSS
  Goals
    describing how to present elements of a document
      spanning over a range of different media
    separating style description from content and structure
  In this course we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
  [figure not reproduced in the transcript]
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing instruction
    to associate CSS to XML
      <?xml-stylesheet type="text/css" href="filename.css" ?>
  CSS style sheet defining presentation style for the XML document tags
      tagname { attribute1: value1; … }
  No need for DTD or Schema
    even though the browser could complain anyway…
XML + CSS Example: The XML Doc
Example: How Mozilla Visualises it [without CSS Style Sheet]
Example: How Mozilla Visualises it [with CSS Style Sheet]
DOM & SAX
Manipulating XML
  Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
  Mostly, we need to manipulate
    access, delete, modify
    parts of an XML document
      which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used, hated, deprecated, widespread are
    DOM
    SAX
Document Object Model
  http://www.w3.org/DOM
    standard W3C, as usual
  “The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents”
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals, and scope
DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
  It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
      primitive navigation and manipulation of XML trees
      other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
      adds Namespace support and minor new features
      other Level 2 parts: Events, Views, Style, Traversal and Range
    Level 3 Core: W3C Working Draft, April 2002 (W3C Recommendation since April 2004)
      adds minor new features
DOM Nodes
  An XML document is a tree
  The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
  The DOM specification states that a node can be a
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
  It also defines which kinds of child nodes each can or must have
Properties & Methods of DOM
  Every DOM node has properties and methods to explore and update the XML tree
  Every DOM node has a name, a value, a type
  There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
  Then, any specific kind of node has its own specific properties and methods
  These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
    see for instance http://java.sun.com/xml/jaxp
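A Simple Java DOM
    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;  // assumed: the Xerces DOMParser
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public class FirstRecipe {  // class name chosen here; the slide shows only main
      public static void main(String[] args) {
        try {
          // Parse the file named on the command line into a DOM tree
          DOMParser p = new DOMParser();
          p.parse(args[0]);
          Document doc = p.getDocument();
          // Scan the children of the root for the first "recipe" node
          Node n = doc.getDocumentElement().getFirstChild();
          while (n != null && !n.getNodeName().equals("recipe"))
            n = n.getNextSibling();
          PrintStream out = System.out;
          out.println("<?xml version=\"1.0\"?>");
          out.println("<collection>");
          if (n != null)
            print(n, out);  // serialisation helper, not shown on the slide
          out.println("</collection>");
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }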
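For readers without a Java DOM parser at hand, the same walk over the tree can be sketched with Python's standard xml.dom.minidom module, which offers the DOM properties under the same names (firstChild, nextSibling, nodeName). The input document is made up, and toxml() stands in for the print helper that the slide does not show.

```python
from xml.dom import minidom

# Made-up input; the Java program reads its document from a file instead.
DOC = """<collection>
  <description>Some recipes</description>
  <recipe><title>Tagliatelle</title></recipe>
  <recipe><title>Tortellini</title></recipe>
</collection>"""

doc = minidom.parseString(DOC)     # the whole document is loaded in memory

# Walk the children of the root until the first node named "recipe".
n = doc.documentElement.firstChild
while n is not None and n.nodeName != "recipe":
    n = n.nextSibling

print('<?xml version="1.0"?>')
print("<collection>")
if n is not None:
    print(n.toxml())
print("</collection>")
```

The loop has to skip text nodes too: the white space between elements is part of the tree, so the first child of the root is not an element.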
Main Problem of DOM
  The XML document is loaded as a whole, and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist
Simple API for XML (SAX)
  Differently from DOM, SAX is event-based
  It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events along the way: document started / ended, element started / ended, character content, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools
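The event-based model can be sketched with Python's standard xml.sax module, used here as a stand-in for whatever SAX implementation the reader prefers. The handler class EventLog and the input document are invented; the handler only records the events it receives, and no tree is ever built.

```python
import xml.sax

class EventLog(xml.sax.ContentHandler):
    """Records the events reported while the document flows through the parser."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.text = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.text.append(content)   # text may arrive in several chunks

log = EventLog()
xml.sax.parseString(
    b'<player><name>Carlo</name><team current="yes">Bologna</team></player>',
    log)

for event in log.events:
    print(event)
```

An application that only needs, say, the current team can stop keeping state as soon as the matching event has gone by, which is where the memory savings come from.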
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tagsprefix x-
such as x-quenya58
Encoding for Portability
Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability
When transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time
59
XML amp CSS
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip
No need for DTD or Schemaeven though the browser could anyway complainhellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it[without CSS Style Sheet]
66
Example How Mozilla Visualises it[with CSS Style Sheet]
67
DOM amp SAX
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sortsthrough ad hoc API
The most used hated deprecated widespread areDOMSAX
69
Document Object Model
httpwwww3orgDOMstandard W3C as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range
Level 3 Core W3C Working Draft April 2002adds minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Javasee for instance httpjavasuncomxmljaxp
73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML (SAX)
Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
A Simple XML Document

  <player>
    Carlo Nervo
  </player>

XML Document & Files

This is a complete XML document
It can be stored, recorded or built in the form of a number of different files, or even in other forms
  Carlonervo.xml, player.txt
  a record in a database
  a memory area built by a CGI and then transmitted
  sent by a Web server with MIME type application/xml or text/xml

XML Elements & Tags

The document contains a single element
  of type player
Such an element is delimited by the player tags
  between start tag <player> and end tag </player>
In between the tags lies the element's content: Carlo Nervo
Tags are markup
  the most common form of markup, but there are other kinds
Content is character data
  including the white space between Carlo & Nervo

Tag Syntax

Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus pay attention to case rules when combining XML with other technologies
    HTML (not case sensitive), JavaScript & XHTML (case sensitive), …

XML Trees: A Simple Example

  player
    name: Carlo
    surname: Nervo
    team: Bologna
    team: Mantova

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

An XML Document is an XML Tree

An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
  each element except the root has exactly one parent
  child elements of the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping is not allowed

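The tree view above can be checked with a few lines of Java. This is only a sketch, assuming the JAXP DOM parser bundled with the JDK; the class name TreeWalk and the method childNames are made up for the illustration. It parses the player document and lists the element children of the root, i.e. the siblings under player.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class TreeWalk {
    // The player document used throughout the slides.
    static final String PLAYER =
        "<player><name>Carlo</name><surname>Nervo</surname>"
        + "<team current=\"yes\">Bologna</team>"
        + "<team current=\"no\">Mantova</team></player>";

    /** Names of the element children of the root, in document order. */
    static List<String> childNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            Node root = doc.getDocumentElement();
            // Siblings are reached by following getNextSibling from the first child.
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(n.getNodeName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(childNames(PLAYER));
    }
}
```

Text and white space between elements also show up as child nodes, which is why the loop filters on the node type.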
Narrative-Organised XML

  <biography>
    <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
    After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
    …
  </biography>

XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

XML Attributes

Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name is allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and leaves of the tree

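Reading attributes from a program may look as follows. The sketch assumes the JAXP DOM parser of the JDK; the class CurrentTeam and its method are hypothetical names. It also shows that single quotes and spaces around the equals sign are accepted by the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class CurrentTeam {
    /** Text of the first team element whose current attribute is "yes", or null. */
    static String currentTeam(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList teams = doc.getElementsByTagName("team");
            for (int i = 0; i < teams.getLength(); i++) {
                Element team = (Element) teams.item(i);
                // The attribute qualifies the element without adding a tree node.
                if ("yes".equals(team.getAttribute("current"))) {
                    return team.getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Single quotes and spaces around '=' are alternative attribute forms.
        String xml = "<player>"
            + "<team current = 'yes'>Bologna</team>"
            + "<team current=\"no\">Mantova</team>"
            + "</player>";
        System.out.println(currentTeam(xml));
    }
}
```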
Using Elements or Attributes

Attributes are for meta-data about the element, and content is the information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>

XML Names

XML names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore

Parsed Character Data

An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

Entity References

&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

Pre-defined XML Entities

  Markup    Entity    Description
  &lt;      <         less-than
  &gt;      >         greater-than
  &amp;     &         ampersand
  &quot;    "         double quote
  &apos;    '         single quote

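How the pre-defined entities turn into plain characters once parsed can be seen with a small sketch. It assumes the JAXP DOM parser of the JDK; the class EntityText and the q element are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class EntityText {
    /** Parsed character data of the root element. */
    static String text(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // &amp; in the markup becomes a single & in the character data.
        System.out.println(text("<superheroes>Batman &amp; Robin</superheroes>"));
        // &lt; and &gt; let markup characters appear as content.
        System.out.println(text("<q>1 &lt; 2 &amp;&amp; 3 &gt; 2</q>"));
    }
}
```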
CDATA Sections

Including code chunks from any language with < or & can be tedious
  we need to tell the parser: do not parse this
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, there is in general no way to tell whether a text came from a CDATA section or not

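A short sketch, again assuming the JAXP DOM parser of the JDK (CdataText is a hypothetical class name), compares the same code chunk written with escapes and within a CDATA section. The resulting character data is the same. Note that a DOM builder may still expose CDATA sections as distinct nodes unless asked to coalesce them; the sketch only compares the text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class CdataText {
    /** Character data of the root element, whatever markup produced it. */
    static String text(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<code>if (a &lt; b &amp;&amp; b &gt; 0) return;</code>";
        String cdata = "<code><![CDATA[if (a < b && b > 0) return;]]></code>";
        // Both forms carry the same character data.
        System.out.println(text(cdata));
        System.out.println(text(escaped).equals(text(cdata)));
    }
}
```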
Comments

Easy
  <!-- Comment -->
  it cannot contain --, nor can it end with --->
Comments do not affect the document tree structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents: processing instructions

XML Processing Instructions

Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very purpose
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root

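As a sketch of how an application gets at a processing instruction, the following Java code (JAXP DOM parser of the JDK assumed; the class StylesheetPi and the file name player.css are invented for the example) scans the document level for the first processing instruction and returns its target and data.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class StylesheetPi {
    /** "target|data" of the first processing instruction outside the root, or null. */
    static String firstPi(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Processing instructions and comments may sit beside the root element.
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    return pi.getTarget() + "|" + pi.getData();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
            + "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>"
            + "<!-- a comment before the root -->"
            + "<player>Carlo Nervo</player>";
        System.out.println(firstPi(xml));
    }
}
```

The XML declaration on the first line is not reported as a processing instruction, in line with the next slide.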
The XML Declaration

Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it should be the first thing in the document, absolutely
    not even comments allowed before
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8, or UTF-16 when signalled by a byte order mark)
Standalone set to yes means that the document has no external DTD
  optional, default no

Checking Well-Formedness

Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around

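Any XML parser is also a well-formedness checker. A minimal sketch, assuming the JAXP DOM parser of the JDK (the class WellFormedCheck is a made-up name): a document is well-formed exactly when parsing it raises no fatal error.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the slides.
class WellFormedCheck {
    /** True when the parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser met a well-formedness error
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<player> Carlo Nervo</player>"));    // matching tags
        System.out.println(isWellFormed("<em><b>Wrong </em> XML</b>"));       // overlapping elements
        System.out.println(isWellFormed("<player>Carlo Nervo</Player>"));     // case mismatch
        System.out.println(isWellFormed("<team current=yes>Bologna</team>")); // unquoted attribute
        System.out.println(isWellFormed("<a/><b/>"));                         // two roots
    }
}
```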
DTD
Flexibility or Rigidity

XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle validity errors is optional

Validation

A valid XML document includes, or refers to, a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…

SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
    maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still

A Simple DTD Example

We do not go too deep into DTD syntax
  we just look at the example and comment on it
  note that the name in the DOCTYPE must match the root element (player) for the document to be valid

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

DTD Declaration

The DTD is declared here as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

DTD Declarations: Define or Use

So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations

  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>

A player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

Some Syntax

, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
Suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
Parentheses for grouping
  at any level of nesting
  operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty elements

Attribute Declarations

  <!ATTLIST team current (yes | no) #REQUIRED>

A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

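The element and attribute declarations above can be put to work with a validating parser. The following is only a sketch, assuming the JAXP parser of the JDK with validation switched on; the class DtdValidation and its method are hypothetical names, and the DOCTYPE is named player so that it matches the root element. Validity errors are collected instead of stopping the parse, since handling them is up to the application.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the slides.
class DtdValidation {
    // Internal DTD taken from the slides (DOCTYPE named after the root element).
    static final String DTD =
        "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    /** Parses DTD + body with validation on; returns the validity errors found. */
    static List<String> validityErrors(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            final List<String> errors = new ArrayList<>();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage()); // validity errors are recoverable
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return errors;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
            + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        String badValue = "<player><name>Carlo</name><surname>Nervo</surname>"
            + "<team current=\"maybe\">Bologna</team></player>";
        System.out.println(validityErrors(valid));    // no errors expected
        System.out.println(validityErrors(noTeam));   // team+ requires at least one team
        System.out.println(validityErrors(badValue)); // current must be yes or no
    }
}
```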
Attribute Defaults

#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether or not it is explicitly specified, it has the given value
literal
  the default value is the literal quoted string

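The effect of attribute defaults can be observed after parsing. The sketch below assumes the JAXP DOM parser of the JDK, which supplies default values declared in an internal DTD; the class AttributeDefaults and the attributes sport and since are invented for the illustration, they do not appear in the slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class AttributeDefaults {
    static final String XML =
        "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (team+)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST player sport CDATA #FIXED \"football\">\n" // fixed value
        + "  <!ATTLIST team current (yes | no) \"no\">\n"         // literal default
        + "  <!ATTLIST team since CDATA #IMPLIED>\n"              // optional, no default
        + "]>\n"
        + "<player><team current=\"yes\">Bologna</team><team>Mantova</team></player>";

    /** Value of an attribute on the index-th element named tag, or null if absent. */
    static String attribute(String xml, String tag, int index, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element e = (Element) doc.getElementsByTagName(tag).item(index);
            // hasAttribute is also true for values supplied by the DTD.
            return e.hasAttribute(name) ? e.getAttribute(name) : null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attribute(XML, "player", 0, "sport")); // #FIXED value
        System.out.println(attribute(XML, "team", 0, "current")); // explicitly specified
        System.out.println(attribute(XML, "team", 1, "current")); // literal default
        System.out.println(attribute(XML, "team", 1, "since"));   // #IMPLIED and absent
    }
}
```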
Attribute Types

CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more permissive than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by white space
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to IDs in the document
NOTATION
  name of a notation used & defined in the document (rare)
enumeration

Other DTD Declarations etc.

ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces
What are Namespaces for?

Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespace Use

Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes to URI

Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
URIs are standardised, not prefixes
  but usually svg, rdf and other prefixes are not re-defined
  also, they are conventional names
    not necessarily pointing to an actual resource

Setting Default Namespaces

xmlns attribute
  alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need for the svg prefix to be made explicit

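Prefixes and default namespaces can be seen at work with a namespace-aware parser. This is a sketch under a few assumptions: the JAXP DOM parser of the JDK, a hypothetical class NamespaceDemo, and an invented office element; the company URIs are the ones of the previous example, one of them used here as default namespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class NamespaceDemo {
    static final String XML =
        "<company xmlns=\"http://www.company.com/xml\""
        + " xmlns:local=\"http://www.company.it/xml\">"
        + "<local:office/><office/></company>";

    /** Every element as {namespace URI}local-name, in document order. */
    static List<String> expandedNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagNameNS("*", "*");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Node n = all.item(i);
                names.add("{" + n.getNamespaceURI() + "}" + n.getLocalName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The two office elements share a local name but not a namespace.
        for (String name : expandedNames(XML)) {
            System.out.println(name);
        }
    }
}
```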
Internationalisation
What does Text Mean?

"Text" can be encoded according to so many different alphabets
A character set is a mapping between characters and integers (code points)
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so its encoding should be declared

The XML Encoding Part of the XML Declaration

<?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
  See also: XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration

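The difference between code points and bytes, and the role of the encoding declaration, can be illustrated by a sketch like the following. It assumes the JAXP DOM parser of the JDK; the class EncodingDemo, its methods and the name element are hypothetical. The same word takes a different number of bytes in different encodings, and the parser recovers it as long as the declaration says which encoding was used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of the slides.
class EncodingDemo {
    /** Number of bytes needed for the text in the given encoding. */
    static int byteLength(String text, String encoding) {
        return text.getBytes(Charset.forName(encoding)).length;
    }

    /** Encodes a small document as bytes, parses the bytes, returns the root text. */
    static String roundTrip(String word, String encoding) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>"
            + "<name>" + word + "</name>";
        try {
            // The parser reads raw bytes and relies on the declared encoding.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(Charset.forName(encoding))));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String word = "Universit\u00e0"; // 10 characters, the last one is U+00E0
        System.out.println(byteLength(word, "ISO-8859-1")); // one byte per character
        System.out.println(byteLength(word, "UTF-8"));      // U+00E0 takes two bytes
        System.out.println(byteLength(word, "UTF-16BE"));   // two bytes per character
        System.out.println(roundTrip(word, "ISO-8859-1").equals(word));
        System.out.println(roundTrip(word, "UTF-8").equals(word));
    }
}
```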
Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard: two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
    such as x-quenya

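A spell-checker could find the language of a subpart along these lines. It is a sketch only: it assumes the JAXP DOM parser of the JDK, and the class LangLookup as well as the p and motto elements are invented for the example. It further assumes that the language set on an element also applies to its content unless overridden, so the lookup walks up the tree to the nearest xml:lang.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class LangLookup {
    static final String XML =
        "<biography xml:lang=\"en\">"
        + "<p>He finally moved to Bologna.</p>"
        + "<motto xml:lang=\"it\">Forza Bologna</motto>"
        + "</biography>";

    /** Language in effect for a node: the nearest xml:lang on it or on an ancestor. */
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                if (e.hasAttribute("xml:lang")) {
                    return e.getAttribute("xml:lang");
                }
            }
        }
        return ""; // no language declared anywhere up the tree
    }

    /** Language in effect for the first element with the given tag name. */
    static String languageOfFirst(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return languageOf(doc.getElementsByTagName(tag).item(0));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(languageOfFirst(XML, "p"));     // inherited from biography
        System.out.println(languageOfFirst(XML, "motto")); // overridden locally
    }
}
```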
Encoding for Portability

Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example
  [figure not preserved in the transcript]

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
No need for DTD or Schema
  even though the browser could anyway complain…

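XML + CSS Example: The XML Doc
  [figure not preserved in the transcript]

Example: How Mozilla Visualises it [without CSS Style Sheet]
  [figure not preserved in the transcript]

Example: How Mozilla Visualises it [with CSS Style Sheet]
  [figure not preserved in the transcript]
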
DOM & SAX

Manipulating XML

Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
We often need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM
  SAX

Document Object Model

http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels

DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
    other Level 2 parts: Events, Views, Style, Traversal and Range
  Level 3 Core: W3C Working Draft, April 2002 (W3C Recommendation since April 2004)
    adds minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes they should / could have

Properties & Methods of DOM

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
  see for instance http://java.sun.com/xml/jaxp

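The two general members mentioned above, attributes and appendChild, map to getAttributes() and appendChild() in the Java binding. The following sketch assumes the JAXP DOM parser of the JDK; the class DomEdit, its helper methods and the team name Perugia are made up for the illustration. It adds a third team to the player document and reads the new node back.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the slides.
class DomEdit {
    static final String PLAYER =
        "<player><name>Carlo</name><surname>Nervo</surname>"
        + "<team current=\"yes\">Bologna</team>"
        + "<team current=\"no\">Mantova</team></player>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends a team element with the given text and current value to the root. */
    static void addTeam(Document doc, String name, String current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.appendChild(doc.createTextNode(name));
        doc.getDocumentElement().appendChild(team); // goes after the other children
    }

    /** Renders all the attributes of an element as name=value pairs. */
    static String attributesOf(Element e) {
        NamedNodeMap attrs = e.getAttributes();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i); // every node has a name and a value
            if (sb.length() > 0) sb.append(' ');
            sb.append(a.getNodeName()).append('=').append(a.getNodeValue());
        }
        return sb.toString();
    }

    /** Adds a made-up team and describes the last child of the root. */
    static String demo() {
        Document doc = parse(PLAYER);
        addTeam(doc, "Perugia", "no");
        Element last = (Element) doc.getDocumentElement().getLastChild();
        return last.getNodeName() + " " + attributesOf(last) + " " + last.getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```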
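A Simple Java DOM

  // DOMParser is presumably the Xerces class; the imports are not shown on the slide
  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;
  import java.io.PrintStream;

  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                       // load the whole document in memory
      Document doc = p.getDocument();
      Node n = doc.getDocumentElement().getFirstChild();
      // skip siblings until the first recipe element
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);                        // print(Node, PrintStream): helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) { e.printStackTrace(); }
  }
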
Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML (SAX)

Differently from DOM, SAX is event-based
It sees the document not as a tree but as a text doc
  flowing through the SAX parser
  and generating events as soon as document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools

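The event-based model can be sketched with the SAX parser that ships with the JDK through JAXP. The class SaxEvents and the log format are made up for the example. The handler records each event as the document flows through the parser, without ever building a tree; character content may arrive in several chunks, so it is buffered until the next tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the slides.
class SaxEvents {
    /** Parses the text and returns the sequence of SAX events as strings. */
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            // Emits buffered character content, if any, as one text event.
            private void flushText() {
                String t = text.toString().trim();
                if (!t.isEmpty()) log.add("text " + t);
                text.setLength(0);
            }

            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                flushText();
                log.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flushText();
                log.add("end " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may be called more than once per text
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name><surname>Nervo</surname></player>")) {
            System.out.println(event);
        }
    }
}
```

Only the current event and the small text buffer are kept in memory, which is the point of SAX with respect to DOM.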
XML Document amp Files
This is a complete XML documentIt can be stored recorded built in the form of a number of different files or even in other forms
Carlonervoxml playertxta record in a databasea memory area built by a CGI and then transmittedsent by a Web server with MIME type applicationxml or textxml
ltplayergt Carlo Nervoltplayergt
20
XML Elements amp Tags
The document contains a single elementof type player
Such an element is delimited by the tag playerbetween start tag ltplayergt and end tag ltplayergt
In between the tags lays the elementrsquos content Carlo Nervotags are markup
the most common form of markup but there are other kindscontent is character data
including the white space between Carlo amp Nervo
ltplayergt Carlo Nervoltplayergt
21
Tag Syntax
Very similar to HTML tagsat least superficially
lttaggt for start tags lttaggt for end tagslttag gt for empty tags
tags with no content like ltbr gt or lthr gtXML is case sensitive
so ltplayergt can not be closed by end tag ltPlayergtNOTE thus pay attention to non-case sensitive technologies when combined with XML
HTML JavaScript amp XHTML hellip
22
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is an XML Tree
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed24
Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributesattributes are specified in the start tag
and in the only tag of empty elementsany number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or Attributes
  Attributes are for meta-data about the element, and content is the information of the element
    maybe, but then it is not easy to clearly distinguish between the two
  Element-based structure is more flexible than attribute-based
    attributes provide for a flat data structure, elements can be nested as needed
    attributes are unique within an element, any number of elements of the same type can be used within an element
  Attributes are quite useful in narrative-based XML documents
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>
XML Names
  XML Names are used, and are the same, for the names of elements, attributes, and some other constructs
    to increase efficiency and abate complexity
  An XML name can include
    any letter: latin, or even non-latin like ideographs
    any digit
    underscore, hyphen and period (_ - .)
    a colon (:) is reserved to namespaces
  An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with letters, ideographs or underscore
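The naming rules above can be approximated with a regular expression. This is only a sketch in Python: the XML specification lists the allowed characters precisely, while the pattern below relies on Python's Unicode-aware \w class as a stand-in, and looks_like_xml_name is a hypothetical helper name.

```python
import re

# Approximation of the slide's rules, not the exact character tables of the XML spec:
# the first character is a letter, an ideograph or an underscore;
# the following ones may also be digits, hyphens and periods.
# Colons are left out on purpose, since they are reserved to namespaces.
NAME = re.compile(r"[^\W\d][\w.\-]*")


def looks_like_xml_name(candidate):
    """Hypothetical helper: True if candidate follows the approximate rules above."""
    return NAME.fullmatch(candidate) is not None


for candidate in ["player", "first_name", "_id", "team-2.0", "2nd_team", "first name"]:
    print(candidate, looks_like_xml_name(candidate))
```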
Parsed Character Data
  An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
  All characters are interpreted as character data to be parsed
    unless an escape character & is encountered
    character data to parse starts again after the ; character
  E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
  becomes the parsed character data
    Batman & Robin
Entity References
  &entityreference;
    an entity is something defined outside the normal flow of the XML document, out of the XML tree
    used for constants, common values, external values, etc., through an entity reference
  Users of any sort may define their own entities
    we'll see how soon: for instance, through DTDs
Pre-defined XML Entities
  Markup    Entity   Description
  &lt;      <        less-than
  &gt;      >        greater-than
  &amp;     &        ampersand
  &quot;    "        double quote
  &apos;    '        single quote
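The five pre-defined entities are what a program has to produce when it writes character data, and what a parser resolves when it reads it. A minimal sketch, assuming Python's standard xml.sax.saxutils and xml.etree.ElementTree modules (the sample sentence is made up for illustration):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = 'Batman & Robin say "1 < 2"'

# escape() handles &, < and > by default; quotes need an explicit mapping.
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})

# unescape() mirrors this: &quot; and &apos; need an explicit mapping as well.
restored = unescape(escaped, {"&quot;": '"', "&apos;": "'"})

# A parser resolves the entity reference while building the character data.
parsed = ET.fromstring("<superheroes>Batman &amp; Robin</superheroes>").text

print(escaped)
print(restored)
print(parsed)
```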
CDATA Sections
  Including code chunks from any language with < or & can be tedious
    we need to tell the parser "do not parse this"
    good for instance to include segments of XML code to show
  CDATA section
    between <![CDATA[ and ]]>
    can contain anything but its own closing delimiter ]]>
  After parsing, there is no way to tell whether a text came from a CDATA section or not
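The last point can be observed directly. In this sketch (Python's standard xml.etree.ElementTree is an assumption, and the script element and its content are made up), the same JavaScript-like text is written once inside a CDATA section and once with entity references, and the parser hands back identical character data:

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<script><![CDATA[if (a < b && b < c) run();]]></script>"
)
with_escapes = ET.fromstring(
    "<script>if (a &lt; b &amp;&amp; b &lt; c) run();</script>"
)

# Both forms yield the same text: the CDATA origin is not recorded.
print(with_cdata.text)
print(with_cdata.text == with_escapes.text)
```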
Comments
  Easy: <!-- Comment -->
    it cannot contain --, nor can it end with --->
  Comments do not affect the document tree-structure
    they can appear anywhere, even before the root element
    but not inside a tag or another comment
  Parsers may either drop or keep them, at their will
  Comments are meant to improve human legibility of XML docs
    to give info to computational agents, use processing instructions
XML Processing Instructions
  Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
  Processing instructions serve this very purpose
    <?target … ?>
  The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
  A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root
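Whether comments survive parsing depends on the parser, while processing instructions are meant to reach the application. The sketch below assumes two parsers from the Python standard library: xml.dom.minidom, which keeps comments and processing instructions as nodes, and the default xml.etree.ElementTree parser, which drops comments. The document text and the style.css file name are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.dom import Node, minidom

SOURCE = (
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    "<!-- a comment before the root -->"
    "<player><!-- and one inside --><name>Carlo</name></player>"
)

# minidom keeps both kinds of markup as nodes of the document.
dom = minidom.parseString(SOURCE)
top_level = [node.nodeType for node in dom.childNodes]
instruction = dom.childNodes[0]
print(instruction.target, "|", instruction.data)

# ElementTree's default parser silently drops the comments instead.
tree_children = [child.tag for child in ET.fromstring(SOURCE)]
print(tree_children)
```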
The XML Declaration
  Looks like an XML processing instruction
    but it is not: it is just the XML declaration
  It is optional
    but if present, it must be the very first thing in the document
    not even comments are allowed before it
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  version is the XML version (1.0, 1.1, …)
  encoding is the form of the text (Unicode in the example)
    optional, default Unicode (UTF-8)
  standalone="yes" means that the document has no external DTD
    optional, default no
Checking Well-Formedness
  Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
  Tools on the Web
    just look around
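Any non-validating parser can serve as a well-formedness checker. The sketch below assumes Python's standard xml.etree.ElementTree (built on the expat parser); well_formedness_error is a hypothetical helper, and each sample breaks exactly one of the rules listed above.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(text):
    """Hypothetical helper: None if the text parses, else the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as error:
        return str(error)


VIOLATIONS = {
    "mismatched tags": "<player><name>Carlo</player></name>",
    "two root elements": "<player/><player/>",
    "unquoted attribute value": "<team current=yes>Bologna</team>",
    "duplicate attribute": '<team current="yes" current="no"/>',
    "comment inside a tag": "<team <!-- c --> />",
    "unescaped ampersand": "<pair>Batman & Robin</pair>",
    "declaration not first": '<!-- c --><?xml version="1.0"?><player/>',
}

for rule, text in VIOLATIONS.items():
    print(rule, "->", well_formedness_error(text))

print(well_formedness_error('<team current="yes">Bologna</team>'))
```

The exact wording of the messages comes from the underlying parser and may differ between parsers; only the accept / reject outcome is fixed by the rules.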
DTD
Flexibility or Rigidity?
  XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
  Sometimes some strict rule is required
    some control over syntax should be enforced
    like "a football player should have at least one team"
  Document Type Definition (DTD)
    to define which XML documents are valid
  Validity is not mandatory, unlike well-formedness
    how to handle validity errors is optional
Validation
  A valid XML document includes a DTD that the document satisfies
  Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
  Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
  There are many things a DTD does not say
    we stick with what we can specify
DTD is…
  SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
  It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
  Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
A Simple DTD Example
  We do not go too deep into DTD syntax
    we just look at the example below and comment on it
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
DTD Declaration
  The DTD is declared here as internal
    but it could be declared separately
      <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
      <!DOCTYPE player SYSTEM "http://…">
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
DTD Declarations: Define or Use
  So you may
    define your own DTD, and
      either include it in your XML document
      or save it as an independent document, and refer to it from one or more XML docs
    or use an external DTD defined by someone else
      like a working group you belong to, or a standardisation body of any sort
      by referring to that externally-defined syntax for your XML docs
Element Declarations
  A player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
Some Syntax
  , is for sequence
    to define ordered lists
  | is for choice
    to provide for alternatives
  suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
  parentheses for grouping
    at any level of nesting
    operators and suffixes applicable to any level
  ANY for free-form content
  EMPTY for empty elements
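A content model built from these operators is essentially a regular expression over the sequence of child element names. The sketch below makes that reading explicit in Python. It is an illustration of the idea, not a validator: it ignores #PCDATA, ANY, EMPTY and mixed content, and the names content_model_to_regex and children_match are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# An element name, a comma, or a run of white space inside a content model.
TOKEN = re.compile(r"[A-Za-z_][\w.\-]*|,|\s+")


def content_model_to_regex(model):
    """Translate e.g. '(name, surname, team+)' into a regex over child names."""
    def translate(match):
        token = match.group(0)
        if token == "," or token.isspace():
            return ""  # a sequence is plain concatenation in a regex
        return "(?:" + re.escape(token) + " )"  # one child name plus a separator
    # Parentheses, |, *, + and ? already mean the same thing in a regex.
    return re.compile(TOKEN.sub(translate, model))


def children_match(element, model):
    """True if the child elements of element follow the content model."""
    sequence = "".join(child.tag + " " for child in element)
    return content_model_to_regex(model).fullmatch(sequence) is not None


player = ET.fromstring(
    "<player><name>Carlo</name><surname>Nervo</surname>"
    "<team>Bologna</team><team>Mantova</team></player>"
)
no_team = ET.fromstring("<player><name>Carlo</name><surname>Nervo</surname></player>")

print(children_match(player, "(name, surname, team+)"))
print(children_match(no_team, "(name, surname, team+)"))
print(children_match(no_team, "(name, surname, team*)"))
```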
Attribute Declarations
  A team element has a current attribute
    which is mandatory (#REQUIRED)
      #IMPLIED would say optional instead
    and can be either yes or no
      enumeration as an attribute type
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
Attribute Defaults
  #IMPLIED
    the attribute is optional
  #REQUIRED
    the attribute is mandatory
  #FIXED
    whether it is explicitly specified or not, it has a given value
  literal
    the default value is the literal quoted string
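The four kinds of default can be read as a small decision procedure applied to the attributes of each element. The sketch below spells it out in Python under a made-up representation: declarations are a dictionary from attribute name to (allowed values or None, kind of default, value). Neither the representation nor the check_attributes name comes from the slides or from any DTD library.

```python
def check_attributes(given, declarations):
    """Apply DTD-style attribute declarations to the attributes of one element.

    Returns the effective attributes (defaults filled in) and a list of errors.
    """
    effective = dict(given)
    errors = []
    for name in given:
        if name not in declarations:
            errors.append(name + ": not declared")  # not permitted, hence forbidden
    for name, (allowed, kind, value) in declarations.items():
        if name in given:
            if allowed is not None and given[name] not in allowed:
                errors.append(name + ": value not in enumeration")
            if kind == "#FIXED" and given[name] != value:
                errors.append(name + ": differs from fixed value")
        elif kind == "#REQUIRED":
            errors.append(name + ": required but missing")
        elif kind in ("#FIXED", "literal"):
            effective[name] = value  # the declared value is supplied
        # "#IMPLIED": the attribute simply stays absent
    return effective, errors


# <!ATTLIST team current (yes | no) #REQUIRED>, in the representation above
TEAM = {"current": (("yes", "no"), "#REQUIRED", None)}

print(check_attributes({"current": "yes"}, TEAM))
print(check_attributes({}, TEAM))
print(check_attributes({"current": "maybe"}, TEAM))
```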
Attribute Types
  CDATA
    any string of text acceptable in a well-formed XML attribute value
  NMTOKEN, NMTOKENS
    more than an XML name: any name character is accepted as the first character
    the plural form accepts more than one, separated by whitespace
  ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
  ID
    an XML name, unique in the document, working as an identifier
  IDREF, IDREFS
    reference(s) to IDs in the document
  NOTATION
    name of a notation used & defined in the document (rare)
  enumeration
Other DTD Declarations, etc.
  ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
  NOTATION declarations
    who cares, actually?
  We stop here
    more only for those who need it
Namespaces
What are Namespaces for?
  Distinguish
    different XML applications may use the same names, at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
  Group
    names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
  Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
Syntax for Namespace Use
  Qualified names
    prefix:local_part
  Examples of qualified names, or QNames, or raw names
    rdf:description, xlink:type, xsl:template
  Used for both element and attribute names
Associating Prefixes to URI
  Example
    a large firm could have a number of namespaces for different purposes
      <company xmlns:local="http://www.company.it/xml"
               xmlns:euro="http://www.company.eu/xml"
               xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
  URIs are standardised, not prefixes
    but usually svg, rdf, and other well-known prefixes are not re-defined
    also, URIs are conventional names
      not necessarily pointing to an actual resource
Setting Default Namespaces
  xmlns attribute
    alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
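A namespace-aware parser makes the role of prefixes visible: what identifies a name is the pair (URI, local part), not the prefix. The sketch below assumes Python's standard xml.etree.ElementTree, which reports qualified names in the form {URI}local_part. The office child elements and the width / height values are made up for illustration, and the SVG namespace is assumed to be http://www.w3.org/2000/svg.

```python
import xml.etree.ElementTree as ET

company = ET.fromstring(
    '<company xmlns:local="http://www.company.it/xml"'
    '         xmlns:world="http://www.company.com/xml">'
    "<local:office/><world:office/>"
    "</company>"
)
# Same local part, different namespaces: the two names are distinct.
office_names = [child.tag for child in company]

# Default namespace: no prefix is written, yet every element is qualified.
svg = ET.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="50"><set/></svg>'
)

# The prefix itself does not matter, only the URI it is bound to.
prefixed = ET.fromstring('<s:svg xmlns:s="http://www.w3.org/2000/svg"/>')

print(office_names)
print(svg.tag, svg[0].tag, svg.attrib)
print(prefixed.tag == svg.tag)
```

Note that the unprefixed width and height attributes stay unqualified: a default namespace applies to element names, not to attribute names.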
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
    Unicode and ISO are the most used families
  Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
      it is a text declaration, but no longer an XML declaration
Multi-Lingual Documents
  Example: a spell-checker or a voice-reader parsing an XML doc
  How to determine the language of a subpart?
    for multi-lingual docs
  xml:lang attribute
    can be associated to any element
    determines the language of the element
  Values are to be found in ISO 639
    standard: two letters for each known language
    if not there, IANA
      prefix i-
      such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
      prefix x-
      such as x-quenya
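A tool such as the spell-checker mentioned above has to know the language in force at each element: the value of xml:lang on the element itself or, failing that, on its closest ancestor. The sketch below computes it with Python's standard xml.etree.ElementTree, which reports the attribute under the reserved xml namespace URI. The languages generator and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(element, inherited=None):
    """Yield (tag, language) pairs; a missing xml:lang is inherited from the parent."""
    language = element.get(XML_LANG, inherited)
    yield element.tag, language
    for child in element:
        yield from languages(child, language)


doc = ET.fromstring(
    '<doc xml:lang="en">'
    "<p>Hello</p>"
    '<p xml:lang="it">Ciao <em>bello</em></p>'
    "</doc>"
)

for tag, language in languages(doc):
    print(tag, language)
```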
Encoding for Portability
  Working around encoding is not simply an "internationalisation" issue
    it is also about portability
  When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML however can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following, we explore the XML + CSS issue
Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
  Standard W3C
    http://w3c.org/Style/CSS
  Goals
    describing how to present elements of a document, spanning over a range of different media
    separating style description from content and structure
  In this course, we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing instruction to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css" ?>
  CSS style sheet defining the presentation style for the XML document tags
    tagname { attribute1: value1; … }
  No need for DTD or Schema
    even though the browser could complain anyway…
XML + CSS Example: The XML Doc
Example: How Mozilla Visualises it [without CSS Style Sheet]
Example: How Mozilla Visualises it [with CSS Style Sheet]
DOM & SAX
Manipulating XML
  Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
  Most often we need to manipulate (access, delete, modify) parts of an XML document
    which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used / hated / deprecated / widespread are
    DOM
    SAX
Document Object Model
  http://www.w3.org/DOM
    standard W3C, as usual
  "The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope
DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
  It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
      primitive navigation and manipulation of XML trees
      other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
      adds Namespace support and minor new features
      other Level 2 parts: Events, Views, Style, Traversal and Range
    Level 3 Core: W3C Working Draft, April 2002 (W3C Recommendation since April 2004)
      adds minor new features
DOM Nodes
  An XML document is a tree
  The tree contains nodes
    one of them is the root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
  The DOM specification states that a node can be one of
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
  It also defines which kinds of child nodes they should / could have
Properties & Methods of DOM
  Every DOM node has properties and methods to explore and update the XML tree
  Every DOM node has a name, a value, a type
  There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
  Then, any specific kind of node has its own specific properties and methods
  These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
    see for instance http://java.sun.com/xml/jaxp
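The same properties and methods exist, with the same names, in DOM bindings for other languages. As a sketch, here they are in xml.dom.minidom from the Python standard library (the choice of binding and the sample document are assumptions; the slides refer to Java):

```python
from xml.dom import Node, minidom

dom = minidom.parseString(
    '<player><name>Carlo</name><team current="yes">Bologna</team></player>'
)
player = dom.documentElement
team = player.getElementsByTagName("team")[0]
text = team.firstChild

# Every node has a name, a value and a type.
element_info = (team.nodeName, team.nodeValue, team.nodeType == Node.ELEMENT_NODE)
text_info = (text.nodeName, text.nodeValue, text.nodeType == Node.TEXT_NODE)

# attributes: all the attributes of the node
attribute_names = list(team.attributes.keys())

# appendChild(newChild): newChild goes after the other child nodes
old_team = dom.createElement("team")
old_team.setAttribute("current", "no")
old_team.appendChild(dom.createTextNode("Mantova"))
player.appendChild(old_team)

print(element_info, text_info, attribute_names)
print(player.toxml())
```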
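A Simple Java DOM
  // Assumption: DOMParser is the Apache Xerces parser; print(Node, PrintStream)
  // is a helper that is not shown on the slide.
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                       // load the whole document in memory
      Document doc = p.getDocument();
      // look for the first recipe element among the children of the root
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }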
Main Problem of DOM
  The XML document is loaded as a whole, and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist
Simple API for XML (SAX)
  Differently from DOM, SAX is event-based
  It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not as well-supported as DOM
    in terms of standardisation
    as well as of tools
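The event-based model can be tried out with the xml.sax module of the Python standard library (an assumption: the slides do not name a language for SAX). In this sketch the hypothetical TeamCollector handler records the events as they flow by and keeps only the text of team elements, without ever building a tree.

```python
import xml.sax


class TeamCollector(xml.sax.ContentHandler):
    """Hypothetical handler: logs the events and collects the text of team elements."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.teams = []
        self._buffer = None          # text pieces of the team being read, if any

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append("start " + name)
        if name == "team":
            self._buffer = []

    def characters(self, content):
        # Character content may arrive in several chunks: accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        self.events.append("end " + name)
        if name == "team":
            self.teams.append("".join(self._buffer))
            self._buffer = None


handler = TeamCollector()
xml.sax.parseString(
    b'<player><name>Carlo</name><team current="yes">Bologna</team>'
    b'<team current="no">Mantova</team></player>',
    handler,
)
print(handler.events)
print(handler.teams)
```

The handler only ever holds the team currently being read, which is the memory advantage over DOM mentioned above.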
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is an XML Tree
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed24
Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributesattributes are specified in the start tag
and in the only tag of empty elementsany number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or Attributes
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore28
Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML Entities
Markup Entity Descriptionamplt lt less-thenampgt gt grater-than
ampamp amp ampersandampquot double quoteampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
Comments
Easylt-- Comment --gt
It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-FormednessMain rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
Flexibility or Rigidity
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one teamDocument Type Definition (DTD)
to define which XML documents are validValidity is not mandatory as well-formedness
how to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
    such as x-quenya
Encoding for Portability
Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML's abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
XML & CSS
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS approach
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
[slide figure: example style sheet, not captured in the transcript]
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet
  defining presentation style for the XML document tags
  tagname { property1: value1; … }
No need for DTD or Schema
  even though the browser could complain anyway…
XML + CSS Example: The XML Doc
Example: How Mozilla Visualises It [without CSS Style Sheet]
Example: How Mozilla Visualises It [with CSS Style Sheet]
DOM & SAX
Manipulating XML
Representing information in an XML document
  and presenting it somehow
is not enough for most non-trivial application scenarios
Often we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM
  SAX
Document Object Model
http://www.w3.org/DOM
  W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope
DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
    other Level 2 parts: Events, Views, Style, Traversal and Range
  Level 3 Core: W3C Working Draft, April 2002; W3C Recommendation since April 2004
    adds minor new features
DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes each of them should / could have
Properties & Methods of DOM
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
  see for instance http://java.sun.com/xml/jaxp
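The general node properties and methods listed above map directly onto the org.w3c.dom interfaces available in the JDK. The sketch below is only an illustration: the class name DomNodeDemo and the helper addTeam are hypothetical, and the document is the player example used elsewhere in these slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative class name, not part of the original slides.
class DomNodeDemo {
    // Loads the whole document into an in-memory DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: appends <team current="no">teamName</team>
    // after the other children of the root element.
    static Element addTeam(Document doc, String teamName) {
        Element team = doc.createElement("team");
        team.setAttribute("current", "no");
        team.setTextContent(teamName);
        doc.getDocumentElement().appendChild(team);
        return team;
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>");
        Node team = doc.getDocumentElement().getLastChild();
        // Every node has a name, a value and a type.
        System.out.println(team.getNodeName() + " / " + team.getNodeValue() + " / " + team.getNodeType());
        // "attributes" property: all the attributes of the node.
        NamedNodeMap attrs = team.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            System.out.println(attrs.item(i).getNodeName() + "=" + attrs.item(i).getNodeValue());
        }
        addTeam(doc, "Mantova");
        System.out.println("children of root: " + doc.getDocumentElement().getChildNodes().getLength());
    }
}
```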
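A Simple Java DOM
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;  // Apache Xerces parser class
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  // print(Node, PrintStream) is a helper whose definition is not shown on the slide
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                          // load the whole file into memory
      Document doc = p.getDocument();
      // scan the children of the root until the first recipe element
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);
      out.println("</collection>");
    } catch (Exception e) { e.printStackTrace(); }
  }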
Main Problem of DOM
The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist
Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as it goes: document started / ended, elements started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
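The event-based model can be seen at work with the SAX parser included in the JDK. The following is a minimal sketch: the class name SaxEventsDemo and the textual event log are illustrative choices. No tree is ever built; the handler is simply called back while the text flows through the parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative class name, not part of the original slides.
class SaxEventsDemo {
    // Returns the sequence of SAX events generated while reading the document.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser is allowed to deliver long text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name><surname>Nervo</surname></player>")) {
            System.out.println(event);
        }
    }
}
```

Memory use depends on what the handler chooses to keep, not on the size of the document.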
Tag Syntax
Very similar to HTML tags
  at least superficially
<tag> for start tags, </tag> for end tags
<tag /> for empty tags
  tags with no content, like <br /> or <hr />
XML is case sensitive
  so <player> cannot be closed by end tag </Player>
  NOTE: thus, pay attention to non-case-sensitive technologies when combined with XML
    HTML, JavaScript & XHTML, …
XML Trees: A Simple Example
                 player
  name    surname    team       team
  Carlo   Nervo      Bologna    Mantova

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
An XML Document is an XML Tree
An XML document has a tree-like structure
  one and only one root
    root element, or document element
  each element can have one or more child elements
  each element other than the root has exactly one parent
  child elements of the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
  nesting needs to be perfect, overlapping is not allowed
Narrative-Organised XML
  <biography>
    <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere, and did nothing really meaningful before becoming a football player.
    After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
    …
  </biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and leaves of the tree
  e.g. <team current="yes">Bologna</team> and <team current="no">Mantova</team> in the player example
Using Elements or Attributes?
Attributes are for meta-data about the element, and content is information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents

  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>
XML Names
XML Names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    latin, or even non-latin like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore
Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse start again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin
Entity References
&entityreference;
An entity is something defined outside the normal flow of the XML document
  out of the XML tree
  used for constants, common values, external values, etc.
  through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs
Pre-defined XML Entities
  Markup    Entity   Description
  &lt;      <        less-than
  &gt;      >        greater-than
  &amp;     &        ampersand
  &quot;    "        double quote
  &apos;    '        single quote
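The five pre-defined entities are all that is needed to put arbitrary text into character data. The sketch below (class and method names are hypothetical) escapes a string by hand and then lets the JDK parser turn the entity references back into characters, as in the Batman & Robin example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Illustrative class name, not part of the original slides.
class EscapeDemo {
    // Replaces the five special characters with the pre-defined entity references.
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wraps the content in a <t> element and returns the character data the parser sees.
    static String parsedText(String elementContent) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<t>" + elementContent + "</t>")))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(escape("Batman & Robin"));
        System.out.println(parsedText("Batman &amp; Robin"));
        String code = "if (a < b && c > \"d\")";
        // Escaping followed by parsing gives back the original text.
        System.out.println(parsedText(escape(code)).equals(code));
    }
}
```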
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser “do not parse this”
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own closing delimiter ]]>
After parsing, no way to tell whether a text came from a CDATA section or not
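A quick way to see that a CDATA section is only a writing convenience is to compare it with the escaped form of the same text. This is a sketch under the assumption that the JDK's DOM parser is used with coalescing switched on, so that CDATA sections are merged into ordinary text; the class name CdataDemo is illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Illustrative class name, not part of the original slides.
class CdataDemo {
    // Returns the character data of the root element.
    static String textOf(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setCoalescing(true);  // merge CDATA sections into adjacent text
            return f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String withCdata = "<code><![CDATA[if (a < b && b < c) {}]]></code>";
        String withEntities = "<code>if (a &lt; b &amp;&amp; b &lt; c) {}</code>";
        // Two different spellings, same character data for the application.
        System.out.println(textOf(withCdata));
        System.out.println(textOf(withCdata).equals(textOf(withEntities)));
    }
}
```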
Comments
Easy
  <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents, use processing instructions
XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if present, it must be the very first thing in the document
    not even comments allowed before
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the encoding of the text (Unicode UTF-8 in the example)
  optional, default Unicode (UTF-8, or UTF-16 with a byte-order mark)
Standalone "yes" means that the document does not depend on an external DTD
  optional, default no
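The three parts of the XML declaration are exposed by the DOM Level 3 Document interface, which gives an easy way to inspect them. A minimal sketch with the JDK parser follows; the class name XmlDeclarationDemo is illustrative.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Illustrative class name, not part of the original slides.
class XmlDeclarationDemo {
    // Parses from bytes, so that the parser itself has to work out the encoding.
    static Document parse(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?><player/>");
        System.out.println("version:    " + doc.getXmlVersion());
        System.out.println("encoding:   " + doc.getXmlEncoding());
        System.out.println("standalone: " + doc.getXmlStandalone());
    }
}
```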
Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
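Besides the tools on the Web, any XML parser is a well-formedness checker, since it must reject a document that breaks the rules. A minimal sketch with the JDK's DOM parser; the class name WellFormedDemo and the sample strings are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative class name, not part of the original slides.
class WellFormedDemo {
    // True if the parser accepts the text, false if it reports a fatal error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            b.setErrorHandler(new DefaultHandler());
            b.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;  // a well-formedness rule was violated
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<player><name>Carlo</name></player>",  // fine
            "<em><b>Wrong </em> XML</b>",           // overlapping elements
            "<player></Player>",                    // case mismatch
            "<a/><b/>",                             // two roots
            "<team current=yes/>",                  // unquoted attribute value
            "<team current='yes' current='no'/>",   // duplicate attribute
            "<t>Batman & Robin</t>"                 // unescaped &
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```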
DTD
Flexibility or Rigidity?
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
    like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  handling validity errors is optional
Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify
DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still
A Simple DTD Example
We do not go too deep into DTD syntax
  we just look at the example below and comment
  (the DOCTYPE name must match the root element, player, for the document to be valid)

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
DTD Declaration
In the example, the DTD is declared as internal
  but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external, shared resource
    <!DOCTYPE player SYSTEM "http://…">
DTD Declarations: Define or Use
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document, and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs
Element Declarations
A player element contains one name, one surname, and one or more teams
  in that precise order
    <!ELEMENT player (name, surname, team+)>
  and they are just parsed character data (#PCDATA)
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
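The declarations of the player example can be tried out with a validating parser. The sketch below uses the JDK's DOM parser with validation switched on; the class name DtdValidationDemo is illustrative, and the DOCTYPE name is assumed to be player, since a validating parser requires it to match the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative class name, not part of the original slides.
class DtdValidationDemo {
    // Internal DTD of the player example (DOCTYPE name = root element name).
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // True if the document body is valid with respect to the DTD above.
    static boolean isValid(String body) {
        final boolean[] valid = {true};
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setValidating(true);
            DocumentBuilder b = f.newDocumentBuilder();
            b.setErrorHandler(new DefaultHandler() {
                // Validity violations are reported as recoverable errors.
                @Override public void error(SAXParseException e) { valid[0] = false; }
            });
            b.parse(new InputSource(new StringReader(DTD + body)));
        } catch (SAXException e) {
            return false;  // not even well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        System.out.println(isValid("<player><name>Carlo</name><surname>Nervo</surname>"
            + "<team current=\"yes\">Bologna</team></player>"));
        // No team at all: team+ asks for at least one.
        System.out.println(isValid("<player><name>Carlo</name><surname>Nervo</surname></player>"));
    }
}
```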
Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements
Attribute Declarations
A team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type
  <!ATTLIST team current (yes | no) #REQUIRED>
Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal, quoted string
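Attribute defaults can be observed from a program: when an attribute is omitted, the parser supplies the value declared in the DTD. This is a sketch with the JDK's DOM parser; the class name AttributeDefaultsDemo and the attributes league and nickname are hypothetical, added only to show a #FIXED and an #IMPLIED case next to a literal default.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Illustrative class name, not part of the original slides.
class AttributeDefaultsDemo {
    // A team element written with no attributes at all.
    static final String DOC =
          "<!DOCTYPE team [\n"
        + "  <!ELEMENT team EMPTY>\n"
        + "  <!ATTLIST team current  (yes | no) \"no\"\n"              // literal default
        + "                 league   CDATA      #FIXED \"Serie A\"\n"  // fixed value
        + "                 nickname CDATA      #IMPLIED>\n"           // optional, no default
        + "]>\n"
        + "<team/>";

    // Value of the attribute as seen by the application ("" if absent).
    static String attr(String name) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOC)))
                    .getDocumentElement().getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("current  = '" + attr("current") + "'");
        System.out.println("league   = '" + attr("league") + "'");
        System.out.println("nickname = '" + attr("nickname") + "'");
    }
}
```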
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  more permissive than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to IDs in the document
NOTATION
  name of a notation used & defined in the document (rare)
enumeration
Other DTD Declarations etc.
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually?
We stop here
  more only for those who need it
Namespaces
What are Namespaces for?
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names
XML Trees A Simple Example
player
name surname team team
Carlo Nervo Bologna Mantova
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
23
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
An XML Document is an XML Tree
An XML Document has a tree-like structureone and only one root
root element or document elementeach node element can have one or more child elements
each element has at least one parentchild elements from the same parent are siblingsleaves are either content or empty elements
Well-formedness stems from hereltemgtltbgtWrong ltemgt XMLltbgt is not permitted
nesting needs to be perfect overlapping not allowed24
Narrative-Organised XMLltbiographygtltnamegtltfirst_namegtCarloltfirst_namegt ltlast_namegtNervoltlast_namegtltnamegt was born somewhere and did nothing really meaningful before becoming a football player
After playing many years in minor teams such as ltfootball_teamgtMantovaltfootball_teamgt he finally moved to ltfootball_teamgtBolognaltfootball_teamgt where he exploded to become one of the most respected leaders of the team and also a member of the ltfootball_teamgtItalian National Teamltfootball_teamgt
hellip
ltbiographygt
XML Documents for written narrative such as articles reports blogs books novels
elements with mixed contentnot easy for automated processing and exchange
25
XML Attributes
Elements can be labelled by attributesattributes are specified in the start tag
and in the only tag of empty elementsany number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or Attributes
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore28
Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML Entities
Markup Entity Descriptionamplt lt less-thenampgt gt grater-than
ampamp amp ampersandampquot double quoteampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
Comments
Easylt-- Comment --gt
It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around

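Any XML parser doubles as a well-formedness checker. The sketch below, assuming Python's standard xml.etree.ElementTree, tries one sample per rule; the sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "good": '<player><team current="yes">Bologna</team></player>',
    "overlapping": "<em><b>Wrong </em> XML</b>",
    "two roots": "<a/><b/>",
    "unquoted attribute": "<team current=yes/>",
    "duplicate attribute": '<team current="yes" current="no"/>',
    "unescaped ampersand": "<pair>Batman & Robin</pair>",
}
results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, ok)
```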
DTD
Flexibility or Rigidity?
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
  like: a football player should have at least one team
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle validity errors is optional

Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still

A Simple DTD Example
We do not go too deep into DTD syntax
  we just look at the example below and comment

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

DTD Declaration
In the example, the DTD is declared as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

DTD Declarations: Define or Use
So you may
  define your own DTD and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations
A player element contains one name, one surname and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
Suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
Parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements

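The DTD operators read much like regular-expression operators. The Python sketch below hand-translates the content model (name, surname, team+) into a regular expression over the sequence of child element names. It is only an illustration of the idea, not a DTD validator; CONTENT_MODEL and children_match are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of <!ELEMENT player (name, surname, team+)>:
# "," becomes concatenation, "+" stays "+"; each child name ends with a space.
CONTENT_MODEL = re.compile(r"name surname (team )+")

def children_match(player: ET.Element) -> bool:
    """Check the order and number of the direct children of a player element."""
    child_names = "".join(child.tag + " " for child in player)
    return CONTENT_MODEL.fullmatch(child_names) is not None

valid = ET.fromstring(
    "<player><name>Carlo</name><surname>Nervo</surname>"
    "<team>Bologna</team><team>Mantova</team></player>"
)
no_team = ET.fromstring(
    "<player><name>Carlo</name><surname>Nervo</surname></player>"
)
wrong_order = ET.fromstring(
    "<player><surname>Nervo</surname><name>Carlo</name>"
    "<team>Bologna</team></player>"
)
print(children_match(valid), children_match(no_team), children_match(wrong_order))
```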
Attribute Declarations
A team element has a current attribute
  which is mandatory
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string

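A hedged sketch of how defaults show up after parsing: the expat-based parser in Python's standard library does not validate, but it does supply default and fixed values declared in the internal DTD subset. The sport attribute is invented for this example, and current is given a literal default here instead of #REQUIRED.

```python
import xml.etree.ElementTree as ET

source = """<!DOCTYPE player [
  <!ATTLIST team current (yes | no) "no">
  <!ATTLIST team sport CDATA #FIXED "football">
]>
<player>
  <team current="yes">Bologna</team>
  <team>Mantova</team>
</player>"""

teams = ET.fromstring(source).findall("team")
# The second team never mentions "current"; neither team mentions "sport".
attributes = [(t.text, t.get("current"), t.get("sport")) for t in teams]
for row in attributes:
    print(row)
```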
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  like an XML name, but any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name, unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to IDs in the document
NOTATION
  name of a notation used & defined in the document (rare)
enumeration

Other DTD Declarations etc
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually?
We stop here
  more only for those who need it

Namespaces
What are Namespaces for?
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespace Use
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes to URIs
Example
  a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but usually svg, rdf and other prefixes are not re-defined
  also, namespace URIs are conventional names
    not necessarily pointing to an actual resource

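A short Python sketch of how a namespace-aware parser sees prefixed names: the prefix disappears and each name is tied to its namespace URI. The document below is invented; it reuses the set example, with the SVG and MathML namespace URIs.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"
MATHML = "http://www.w3.org/1998/Math/MathML"

source = f'<doc xmlns:svg="{SVG}" xmlns:m="{MATHML}"><svg:set/><m:set/></doc>'

# ElementTree reports each name as "{namespace URI}local_part".
tags = [child.tag for child in ET.fromstring(source)]
for tag in tags:
    print(tag)
```

The two set elements now have different names, whatever prefixes the author picked.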
Setting Default Namespaces
xmlns attribute
  alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit

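A minimal Python check, under the same assumptions as before, that a default namespace and an explicit prefix lead to the same names. The circle child and the prefix s are illustrative choices.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"

with_default = ET.fromstring(f'<svg xmlns="{SVG}"><circle r="1"/></svg>')
with_prefix = ET.fromstring(f'<s:svg xmlns:s="{SVG}"><s:circle r="1"/></s:svg>')

# iter() walks the element itself and then its descendants.
default_tags = [el.tag for el in with_default.iter()]
prefix_tags = [el.tag for el in with_prefix.iter()]
print(default_tags == prefix_tags)
```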
Internationalisation
What does Text Mean?
“Text” can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
    character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared

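The difference between a code point and its encodings can be seen directly in Python; the character chosen here is just an example.

```python
char = "à"  # LATIN SMALL LETTER A WITH GRAVE

# One character, one code point in the Unicode character set...
code_point = ord(char)

# ...but different byte sequences depending on the encoding.
encodings = {
    name: char.encode(name) for name in ("utf-8", "utf-16-be", "iso-8859-1")
}
print(hex(code_point))
for name, data in encodings.items():
    print(name, data)
```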
The XML Encoding Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration

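A hedged Python sketch of why the declared encoding matters: the same text stored in two encodings parses to the same characters, as long as each file declares its encoding truthfully. The element name and sample word are illustrative.

```python
import xml.etree.ElementTree as ET

latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>Università</name>'
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="utf-8"?><name>Università</name>'
).encode("utf-8")

same_bytes = latin1_doc == utf8_doc
same_text = ET.fromstring(latin1_doc).text == ET.fromstring(utf8_doc).text

# Latin-1 bytes wrongly labelled as utf-8 are rejected by the parser.
try:
    ET.fromstring(latin1_doc.replace(b"ISO-8859-1", b"utf-8"))
    mislabelled_ok = True
except ET.ParseError:
    mislabelled_ok = False

print(same_bytes, same_text, mislabelled_ok)
```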
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard two letters for each known language
if not there, IANA
  prefix i-
  such as i-navajo, i-klingon, …
if not there too, such as for user-defined tags
  prefix x-
  such as x-quenya

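A sketch of how a tool such as a spell-checker might work out the language of each element, assuming (as the XML specification states) that xml:lang also applies to descendants unless overridden. The function name languages and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# The xml prefix is bound to this namespace by definition.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def languages(element, inherited=None):
    """Yield (tag, language) pairs; xml:lang is inherited by descendants."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)

doc = ET.fromstring(
    '<doc xml:lang="en"><p>Hello</p><p xml:lang="it">Ciao</p>'
    '<p xml:lang="x-quenya">Aiya</p></doc>'
)
result = list(languages(doc))
for tag, lang in result:
    print(tag, lang)
```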
Encoding for Portability
Working around encoding is not simply an “internationalisation” issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing directive
  to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { attribute1: value1; … }
No need for DTD or Schema
  even though the browser could anyway complain…

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises It [without CSS Style Sheet]

Example: How Mozilla Visualises It [with CSS Style Sheet]

DOM & SAX
Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Most often we need to manipulate
  access, delete, modify
parts of an XML document
  which either may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM
  SAX

Document Object Model
http://www.w3.org/DOM/
  standard W3C, as usual
“The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents”
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
Level 1 Core: W3C Recommendation, October 1998
  primitive navigation and manipulation of XML trees
  other Level 1 parts: HTML
Level 2 Core: W3C Recommendation, November 2000
  adds Namespace support and minor new features
  other Level 2 parts: Events, Views, Style, Traversal and Range
Level 3 Core: W3C Recommendation, April 2004 (Working Draft in April 2002)
  adds minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have

Properties & Methods of DOM
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
  see for instance http://java.sun.com/xml/jaxp

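The general node properties above can be tried out with Python's xml.dom.minidom, one of the language bindings of the DOM API. This is a small sketch on the football player example; the variable names are illustrative.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<player><name>Carlo</name><team current="yes">Bologna</team></player>'
)
root = doc.documentElement
team = root.getElementsByTagName("team")[0]

# name, value and type of an element node, and of its text child
element_info = (team.nodeName, team.nodeValue, team.nodeType == team.ELEMENT_NODE)
text_info = (team.firstChild.nodeName, team.firstChild.nodeValue)

# attributes: a map of all the attributes of the node
current = team.attributes["current"].value

# appendChild(newChild): the new node becomes the last child
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
root.appendChild(new_team)
last_child = root.lastChild.toxml()

print(element_info, text_info, current)
print(last_child)
```

Note that an element node has no value of its own: the text Bologna lives in a separate text node, child of the team element.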
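A Simple Java DOM Example
    // assumes the Apache Xerces DOMParser; print(Node, PrintStream) is a helper not shown here
    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
      try {
        DOMParser p = new DOMParser();
        p.parse(args[0]);                       // load the whole document in memory
        Document doc = p.getDocument();
        Node n = doc.getDocumentElement().getFirstChild();
        // skip the children of the root until the first recipe element
        while (n != null && !n.getNodeName().equals("recipe"))
          n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
          print(n, out);
        out.println("</collection>");
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
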
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as they occur: document started / ended, element started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools

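The event-based model can be sketched with Python's xml.sax module. The handler below records the events as they arrive and collects the team names, without ever building a tree; the class name TeamHandler is an assumption made for this example.

```python
import xml.sax

class TeamHandler(xml.sax.ContentHandler):
    """Records the events seen and collects the names of the teams."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.teams = []
        self._in_team = False
        self._buffer = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start " + name)
        if name == "team":
            self._in_team = True
            self._buffer = []

    def characters(self, content):
        # Character data may arrive in several chunks, so it is buffered.
        if self._in_team:
            self._buffer.append(content)

    def endElement(self, name):
        self.events.append("end " + name)
        if name == "team":
            self.teams.append("".join(self._buffer))
            self._in_team = False

handler = TeamHandler()
xml.sax.parseString(
    b'<player><name>Carlo</name><team current="yes">Bologna</team>'
    b'<team current="no">Mantova</team></player>',
    handler,
)
print(handler.events)
print(handler.teams)
```

Only the text of the team currently being read is kept in memory, which is the point of the approach for large documents.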
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
An XML Document is an XML Tree
An XML document has a tree-like structure
  one and only one root
    root element or document element
  each element can have one or more child elements
  each element except the root has exactly one parent
  child elements from the same parent are siblings
  leaves are either content or empty elements
Well-formedness stems from here
  <em><b>Wrong </em> XML</b> is not permitted
    nesting needs to be perfect, overlapping not allowed

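A minimal Python sketch that makes the tree visible by printing one element per line, indented by depth. The function name outline is hypothetical.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return the element names of the tree, one per line, indented by depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

player = ET.fromstring(
    "<player><name>Carlo</name><surname>Nervo</surname>"
    '<team current="yes">Bologna</team><team current="no">Mantova</team></player>'
)
tree_lines = outline(player)
print("\n".join(tree_lines))
```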
Narrative-Organised XML
    <biography>
    <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
    After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
    …
    </biography>
XML documents for written narrative, such as articles, reports, blogs, books, novels
  elements with mixed content
  not easy for automated processing and exchange

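A hedged illustration of why mixed content is less convenient for programs: with Python's ElementTree, the narrative text ends up scattered across the tail of each child element. The shortened biography below is made up for the example.

```python
import xml.etree.ElementTree as ET

bio = ET.fromstring(
    "<biography><name>Carlo Nervo</name> played for "
    "<football_team>Mantova</football_team> and then "
    "<football_team>Bologna</football_team>.</biography>"
)

# Marked-up pieces are easy to pick out...
teams = [t.text for t in bio.iter("football_team")]

# ...but the running text lives in the "tail" following each child element.
tails = [child.tail for child in bio]

# itertext() stitches all the character data back together, in order.
full_text = "".join(bio.itertext())

print(teams)
print(tails)
print(full_text)
```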
XML Attributes
Elements can be labelled by attributes
  attributes are specified in the start tag
    and in the only tag of empty elements
  any number of attributes can in principle be associated to an element
An attribute is a name-value pair of the form name="value"
  alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
  only one attribute with a given name allowed per element
Attributes do not change the tree structure of an XML document
  but they are qualifiers for the nodes and leaves of the tree

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

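A quick Python check that the alternative attribute forms are equivalent once parsed; the three variants are written out just for this example.

```python
import xml.etree.ElementTree as ET

variants = [
    '<team current="yes">Bologna</team>',    # double quotes
    "<team current='yes'>Bologna</team>",    # single quotes
    '<team current = "yes">Bologna</team>',  # spaces around the equals sign
]
parsed = [ET.fromstring(v).attrib for v in variants]
print(parsed)
```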
Using Elements or Attributes?
Attributes are for meta-data about the element, and content is the information of the element
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents

    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes" value="Bologna" />
      <team current="no" value="Mantova" />
    </player>

XML Names
XML names are used, and follow the same rules, for the names of elements, attributes and some other constructs
  to increase efficiency and reduce complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:) is reserved for namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore

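The naming rules above can be approximated with a regular expression. This Python sketch relies on Python's own notion of letters and digits, which is close to, but not the same as, the character classes of the XML specification; looks_like_xml_name is a hypothetical helper.

```python
import re

# Approximation: the first character is a letter or an underscore; the rest
# are letters, digits, underscores, hyphens or periods. Colons are left out
# because they are reserved for namespaces.
NAME = re.compile(r"[^\W\d][\w.\-]*")

def looks_like_xml_name(text: str) -> bool:
    return NAME.fullmatch(text) is not None

candidates = ["player", "_id", "first-name", "v1.0", "università",
              "1st", "-x", "first name", "rdf:RDF"]
verdicts = {c: looks_like_xml_name(c) for c in candidates}
for name, ok in verdicts.items():
    print(name, ok)
```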
Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
  becomes the parsed character data
    Batman & Robin

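A short Python sketch of both directions: the parser turns escapes into characters, and xml.sax.saxutils.escape does the reverse when text has to be written into a document. The code element and the sample string are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing: the escape sequence becomes a plain character.
parsed = ET.fromstring("<superheroes>Batman &amp; Robin</superheroes>").text

# Writing: special characters must be escaped before being embedded.
raw = "if (a < b && b < c)"
escaped = escape(raw)
round_trip = ET.fromstring(f"<code>{escaped}</code>").text

print(parsed)
print(escaped)
print(round_trip == raw)
```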
Entity References
&entityreference;
  an entity is something defined outside the normal flow of the XML document
    out of the XML tree
  used for constants, common values, external values, etc.
    through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

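A hedged Python sketch of a user-defined entity: it is declared in the internal DTD subset and expanded by the parser wherever it is referenced. The entity name club and its value are invented; a reference to an undeclared entity is rejected.

```python
import xml.etree.ElementTree as ET

source = """<!DOCTYPE player [
  <!ENTITY club "Bologna FC">
]>
<player><team>&club;</team></player>"""

# The reference is replaced by the entity value during parsing.
team = ET.fromstring(source).find("team").text

try:
    ET.fromstring("<player><team>&unknown;</team></player>")
    undefined_ok = True
except ET.ParseError:
    undefined_ok = False

print(team, undefined_ok)
```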
Pre-defined XML Entities
Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote
CDATA Sections
Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
Comments
Easylt-- Comment --gt
It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-FormednessMain rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
Flexibility or Rigidity
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one teamDocument Type Definition (DTD)
to define which XML documents are validValidity is not mandatory as well-formedness
how to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tagsprefix x-
such as x-quenya58
Encoding for Portability
Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability
When transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time
59
XML amp CSS
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip
No need for DTD or Schemaeven though the browser could anyway complainhellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it[without CSS Style Sheet]
66
Example How Mozilla Visualises it[with CSS Style Sheet]
67
DOM amp SAX
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sortsthrough ad hoc API
The most used hated deprecated widespread areDOMSAX
69
Document Object Model
httpwwww3orgDOMstandard W3C as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range
Level 3 Core W3C Working Draft April 2002adds minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Javasee for instance httpjavasuncomxmljaxp
73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of DOM
  The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has had problems of acceptance
    but has indeed a long tail of followers
    and also has its good reasons to exist
Simple API for XML (SAX)
  Differently from DOM, SAX is event-based
  It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as they occur: document started / ended, elements started / ended, character content, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not as well supported as DOM
    in terms of standardisation
    as well as of tools
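The event-based model can be sketched with the SAX API bundled with the JDK. The class `SaxEventSketch` and its `events` method are hypothetical names chosen for this illustration; the handler simply records each event as the document flows through the parser, without ever building a tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class SaxEventSketch {
    // Returns the sequence of events fired while parsing the given document.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name><surname>Nervo</surname></player>"))
            System.out.println(event);
    }
}
```

Only the current event is visible to the handler, so any state the application needs (for instance, which element it is inside) has to be tracked by the application itself.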
Narrative-Organised XML
  <biography>
  <name><first_name>Carlo</first_name> <last_name>Nervo</last_name></name> was born somewhere and did nothing really meaningful before becoming a football player.
  After playing many years in minor teams such as <football_team>Mantova</football_team>, he finally moved to <football_team>Bologna</football_team>, where he exploded to become one of the most respected leaders of the team, and also a member of the <football_team>Italian National Team</football_team>.
  …
  </biography>
  XML documents for written narrative such as articles, reports, blogs, books, novels
    elements with mixed content
    not easy for automated processing and exchange
XML Attributes
  Elements can be labelled by attributes
    attributes are specified in the start tag
      and in the only tag of empty elements
    any number of attributes can in principle be associated to an element
  An attribute is a name-value pair of the form name="value"
    alternative forms use single quotes instead of double quotes, and spaces before / after the equals (=) sign
    only one attribute with a given name is allowed per element
  Attributes do not change the tree structure of an XML document
    but they are qualifiers for the nodes and leaves of the tree
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
Using Elements or Attributes
  Attributes are for meta-data about the element, and content is information of the element
    maybe, but then it is not easy to clearly distinguish between the two
  Element-based structure is more flexible than attribute-based
    attributes provide for a flat data structure; elements can be nested as needed
    attributes are unique within an element; any number of elements of the same type can be used within an element
  Attributes are quite useful in narrative-based XML documents
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes" value="Bologna" />
    <team current="no" value="Mantova" />
  </player>
XML Names
  XML Names are used, and are the same, for the names of elements, attributes and some other constructs
    to increase efficiency and abate complexity
  An XML name can include
    any letter
      Latin or even non-Latin, like ideographs
    any digit
    underscore, hyphen and period (_ - .)
    a colon (:) is reserved to namespaces
  An XML name may not include other punctuation signs, nor any sort of white space
    and can begin only with letters, ideographs or underscore
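The naming rules above can be approximated with a regular expression. This is a deliberately simplified sketch, not the full Name production of the XML specification: it accepts any Unicode letter (ideographs included), decimal digits, underscore, hyphen and period, and leaves the colon out because it is reserved to namespaces. The class and method names are assumptions made for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class XmlNameSketch {
    // First character: a letter or underscore; then letters, digits, '_', '.', '-'.
    private static final Pattern NAME =
            Pattern.compile("[\\p{L}_][\\p{L}\\p{Nd}_.\\-]*");

    static boolean looksLikeXmlName(String candidate) {
        return NAME.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] candidates = { "first_name", "team-2.0", "_private", "2team", "first name", "\u540d\u524d" };
        for (String c : candidates)
            System.out.println(c + " -> " + looksLikeXmlName(c));
    }
}
```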
Parsed Character Data
  An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
    so, for instance, < is always taken as the beginning of a tag
    what if we need a < character in the document, as in JavaScript code?
  All characters are interpreted as character data to be parsed
    unless an escape character & is encountered
    character data to parse starts again after the ; character
  E.g., the content of the element
    <superheroes>Batman &amp; Robin</superheroes>
    becomes the parsed character data
    Batman & Robin
Entity References
  &entityreference;
    an entity is something defined outside the normal flow of the XML document
      out of the XML tree
    used for constants, common values, external values, etc.
      through an entity reference
  Users of any sort may define their own entities
    we'll see how soon, for instance through DTDs
Pre-defined XML Entities
  Markup    Entity   Description
  &lt;      <        less-than
  &gt;      >        greater-than
  &amp;     &        ampersand
  &quot;    "        double quote
  &apos;    '        single quote
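When a program generates XML text, the five pre-defined entities are what it needs in order to write reserved characters safely. The following sketch shows one way to do the replacement by hand; `EntityEscapeSketch` and `escape` are hypothetical names, and real applications would normally let an XML library do the serialisation.

```java
// Hypothetical helper class, for illustration only.
class EntityEscapeSketch {
    // Replaces each reserved character with its pre-defined entity reference.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("<superheroes>" + escape("Batman & Robin") + "</superheroes>");
        System.out.println(escape("if (a < b) say \"hi\""));
    }
}
```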
CDATA Sections
  Including code chunks from any language with < or & can be tedious
    we need to tell the parser "do not parse this"
    good, for instance, to include segments of XML code to show
  CDATA Section
    between <![CDATA[ and ]]>
    can contain anything but its own closing delimiter ]]>
  After parsing, there is no way to tell whether a text came from a CDATA section or not
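That a CDATA section and escaped character data carry the same text can be checked with the JDK's DOM parser. In this sketch the factory is set to coalescing mode, which asks the parser to turn CDATA sections into ordinary text nodes; the class name `CdataSketch` and the sample strings are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class CdataSketch {
    // Returns the text content of the root element of the given document.
    static String text(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(true); // merge CDATA sections into plain text nodes
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String withCdata = "<code><![CDATA[if (a < b && c) run();]]></code>";
        String withEntities = "<code>if (a &lt; b &amp;&amp; c) run();</code>";
        System.out.println(text(withCdata));
        System.out.println(text(withCdata).equals(text(withEntities)));
    }
}
```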
Comments
  Easy
    <!-- Comment -->
  A comment cannot contain --, nor can it end with --->
  Comments do not affect the document tree structure
    they can appear anywhere, even before the root element
    but not inside a tag or a comment
  Parsers may either drop or keep them at will
  Comments are meant to improve human legibility of XML docs
    to give info to computational agents, use processing instructions
XML Processing Instructions
  Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
  Processing instructions have this very end
    <?target … ?>
  The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
  A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root
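A DOM parser hands processing instructions over to the application as nodes of their own, with a target and a data part. The sketch below lists the instructions found at the top level of a document, before and after the root element; the class name, the method name and the `app` target are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class ProcessingInstructionSketch {
    // Lists "target -> data" for every processing instruction outside the root element.
    static List<String> instructions(String xml) {
        List<String> found = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList top = doc.getChildNodes();
            for (int i = 0; i < top.getLength(); i++) {
                Node n = top.item(i);
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    found.add(pi.getTarget() + " -> " + pi.getData());
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<?xml-stylesheet type=\"text/css\" href=\"style.css\"?>"
                   + "<player/>"
                   + "<?app after-root?>";
        for (String pi : instructions(xml)) System.out.println(pi);
    }
}
```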
The XML Declaration
  Looks like an XML processing instruction
    but it is not: it is just the XML declaration
  It is optional
    but if present, it must be the very first thing in the document
      not even comments are allowed before it
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Version is the XML version (1.0, 1.1, …)
  Encoding is the form of the text (Unicode in the example)
    optional, default Unicode
  Standalone="yes" means that the document has no external DTD
    optional, default no
Checking Well-Formedness
  Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
  Tools on the Web
    just look around
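Any XML parser can act as a well-formedness checker, since it must reject documents that break these rules. A minimal sketch with the JDK's DOM parser follows; `WellFormedSketch` and `isWellFormed` are names invented for the example, and the sample strings each violate one of the rules listed above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class WellFormedSketch {
    // True if the parser accepts the text as a well-formed XML document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<player><name>Carlo</name></player>",   // fine
            "<a><b></a></b>",                        // overlapping elements
            "<a/><b/>",                              // two root elements
            "<team current=yes/>",                   // unquoted attribute value
            "<team current=\"yes\" current=\"no\"/>", // duplicate attribute
            "<s>Batman & Robin</s>"                  // unescaped ampersand
        };
        for (String s : samples)
            System.out.println(isWellFormed(s) + "  " + s);
    }
}
```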
DTD
Flexibility or Rigidity
  XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
  Sometimes some strict rule is required
    some control over syntax should be enforced
      like: a football player should have at least one team
  Document Type Definition (DTD)
    to define which XML documents are valid
  Validity is not mandatory, as well-formedness is
    how to handle validity errors is optional
Validation
  A valid XML document includes a DTD that the document satisfies
  Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
  Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
  There are many things a DTD does not say
    we stick with what we can specify
DTD is…
  SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
  It allows XML designers to define a grammar for their documents
    typical syntax-based approach
      maybe limited, but easy to implement
  Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still
A Simple DTD Example
  We do not go too deep into DTD syntax
    we just look at the example below and comment
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
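Validation against an internal DTD like the one in this example can be tried with the JDK's validating DOM parser. The sketch below is an illustration under assumptions: the class `DtdValidationSketch` and its method are hypothetical, and the DOCTYPE name is written as `player` so that it matches the root element, which a validating parser requires. Validity problems are reported as non-fatal errors, so the sketch collects them instead of stopping at the first one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class DtdValidationSketch {
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // Returns the validity errors found when the body is checked against DTD.
    static List<String> validityErrors(String body) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (Exception e) {
            errors.add("not well-formed: " + e.getMessage());
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
                     + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(validityErrors(valid));
        System.out.println(validityErrors(noTeam));
    }
}
```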
DTD Declaration
  In the example above, the DTD is declared as internal
    but it could be declared separately
      <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external, shared resource
      <!DOCTYPE player SYSTEM "http://…">
DTD Declarations: Define or Use
  So, you may
    define your own DTD, and
      either include it in your XML document
      or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
      like a working group you belong to, or a standardisation body of any sort
      by referring to that externally-defined syntax for your XML docs
Element Declarations
  In the example above, a player element contains one name, one surname and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)
Some Syntax
  , is for sequence
    to define ordered lists
  | is for choice
    to provide for alternatives
  suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
  parentheses for grouping
    at any level of nesting
    operators and suffixes applicable to any level
  ANY for free-form content
  EMPTY for empty element
Attribute Declarations
  In the example above, a team element has a current attribute
    which is mandatory (#REQUIRED)
      #IMPLIED would say optional instead
    and can be either yes or no
      enumeration as an attribute type
Attribute Defaults
  #IMPLIED
    the attribute is optional
  #REQUIRED
    the attribute is mandatory
  #FIXED
    whether it is explicitly specified or not, it has a given value
  literal
    the default value is the literal quoted string
Attribute Types
  CDATA
    any string of text acceptable in a well-formed XML attribute value
  NMTOKEN, NMTOKENS
    more permissive than an XML name: any name character is accepted as the first character
    the plural form accepts more than one, separated by white space
  ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
  ID
    an XML name, unique in the document, working as an identifier
  IDREF, IDREFS
    reference(s) to IDs in the document
  NOTATION
    name of a notation used & defined in the document (rare)
  enumeration
Other DTD Declarations etc.
  ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
  NOTATION declarations
    who cares, actually
  We stop here
    more only for those who need it
Namespaces
What are Namespaces for?
  Distinguish
    different XML applications may use the same names
      at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
  Group
    names of elements and attributes of the same XML application can be grouped together
      to be more easily recognised and handled
  Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
Syntax for Namespace Use
  Qualified names
    prefix:local_part
  Examples of qualified names
    or QNames, or raw names
    rdf:description, xlink:type, xsl:template
  Used for both element and attribute names
Associating Prefixes to URI
  Example
    a large firm could have a number of namespaces for different purposes
      <company xmlns:local="http://www.company.it/xml"
               xmlns:euro="http://www.company.eu/xml"
               xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
  URI are standardised, not prefixes
    but usually svg, rdf and other prefixes are not re-defined
    also, URI are conventional names
      not necessarily pointing to an actual resource
Setting Default Namespaces
  xmlns attribute
    alone, with no prefix after it
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need for the svg prefix to be made explicit
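How namespaces disambiguate names can be seen with a namespace-aware DOM parser: two elements with the same local name `set` end up with different namespace URIs, one through the default namespace and one through a prefix. The class name, the method name and the prefix `m` are assumptions made for this sketch; the two URIs are the SVG and MathML namespaces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class NamespaceSketch {
    // Lists "{namespace URI}local name" for each child element of the root.
    static List<String> childNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            NodeList kids = root.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                Node kid = kids.item(i);
                if (kid.getNodeType() == Node.ELEMENT_NODE)
                    names.add("{" + kid.getNamespaceURI() + "}" + kid.getLocalName());
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<svg xmlns=\"http://www.w3.org/2000/svg\""
                   + " xmlns:m=\"http://www.w3.org/1998/Math/MathML\">"
                   + "<set/><m:set/></svg>";
        for (String name : childNames(xml)) System.out.println(name);
    }
}
```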
Internationalisation
What does Text Mean?
  "Text" can be encoded according to so many different alphabets
  A character set is a mapping between characters and integers (code points)
    ASCII being the most (in)famous, now Unicode
  A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
  Any XML document is a text document
    so encoding should be declared
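The difference between a code point and its encodings can be made concrete by counting bytes. The sketch below uses the word "Università", whose last letter has code point U+00E0; the class and method names are illustrative assumptions, and the big-endian variant of UTF-16 is used so that no byte-order mark is added to the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingSketch {
    // Number of bytes needed to store the text in the given encoding.
    static int byteLength(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String word = "Universit\u00e0"; // 10 characters, the last one is a-grave
        System.out.println("code point of the last letter: " + word.codePointAt(9));
        System.out.println("ISO-8859-1: " + byteLength(word, StandardCharsets.ISO_8859_1) + " bytes");
        System.out.println("UTF-8:      " + byteLength(word, StandardCharsets.UTF_8) + " bytes");
        System.out.println("UTF-16BE:   " + byteLength(word, StandardCharsets.UTF_16BE) + " bytes");
    }
}
```

The same ten code points take 10, 11 and 20 bytes respectively, which is why a reader of the bytes has to be told which encoding was used.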
The XML Encoding Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
    Unicode and ISO are the most used families
  Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
      it is then a text declaration, no longer an XML declaration
Multi-Lingual Documents
  Example: a spell-checker or a voice-reader parsing an XML doc
  How to determine the language of a subpart
    for multi-lingual docs
  xml:lang attribute
    can be associated to any element
    determines the language of the element
  Values are to be found in ISO 639
    standard two letters for each language known
    if not there, IANA
      prefix i-
      such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
      prefix x-
      such as x-quenya
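A tool such as the spell-checker mentioned above needs to find out which language applies to a given element. According to the XML specification, xml:lang also covers the content of the element it is set on unless a nested element overrides it, so the lookup walks up the tree. The sketch below assumes a parser that is not namespace-aware, so the attribute is looked up by its full name; all class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class XmlLangSketch {
    // Language of an element: its own xml:lang, else that of the nearest ancestor, else null.
    static String languageOf(Element element) {
        for (Node n = element; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute("xml:lang")) return e.getAttribute("xml:lang");
        }
        return null;
    }

    // Convenience for the example: language of the first element with the given tag name.
    static String languageOfFirst(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return languageOf((Element) doc.getElementsByTagName(tag).item(0));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc xml:lang=\"en\"><p>Hello</p>"
                   + "<p xml:lang=\"it\"><q>Ciao</q></p></doc>";
        System.out.println(languageOfFirst(xml, "p"));
        System.out.println(languageOfFirst(xml, "q"));
    }
}
```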
Encoding for Portability
  Working around encoding is not simply an "internationalisation" issue
    it is also about portability
  When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML's abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML, however, can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following, we explore the XML + CSS issue
Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
  Standard W3C
    http://www.w3.org/Style/CSS
  Goals
    describing how to present elements of a document
      spanning over a range of different media
    separating style description from content and structure
  In this course we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing directive
    to associate CSS to XML
      <?xml-stylesheet type="text/css" href="filename.css"?>
  CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
  No need for DTD or Schema
    even though the browser could anyway complain…
XML + CSS Example: The XML Doc
Example: How Mozilla Visualises it [without CSS Style Sheet]
Example: How Mozilla Visualises it [with CSS Style Sheet]
DOM & SAX
Manipulating XML
  Representing information in an XML document
    and presenting it somehow
  is not enough for most non-trivial application scenarios
  Mostly, we need to manipulate
    access, delete, modify
  parts of an XML document
    which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used / hated / deprecated / widespread are
    DOM
    SAX
Document Object Model
  http://www.w3.org/DOM
    standard W3C, as usual
  "The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
XML Attributes
Elements can be labelled by attributesattributes are specified in the start tag
and in the only tag of empty elementsany number of attributes can be in principle associated to an element
An attribute is a name-value pair of the form name=valuealternative forms use single quotes instead of double quotes and spaces before after the equals (=) signonly one attribute with a given name allowed per element
Attributes do not change the tree structures of an XML documentbut they are qualifiers for the nodes and leaves of the tree
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
26
Using Elements or Attributes
Attributes are for meta-data about the element and content is information of the element
maybe but then it is not easy to clearly distinguish between the twoElement-based structure is more flexible than attribute-based
attributes provide for a flat data structure elements can be nested as neededattributes are unique within an element any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents
ltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yes value=Bologna gt ltteam current=no value=Mantova gtltplayergt
27
XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore28
Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML Entities
Markup Entity Descriptionamplt lt less-thenampgt gt grater-than
ampamp amp ampersandampquot double quoteampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
Comments
Easylt-- Comment --gt
It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-FormednessMain rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
Flexibility or Rigidity
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one teamDocument Type Definition (DTD)
to define which XML documents are validValidity is not mandatory as well-formedness
how to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element and of its content, unless overridden by a nested xml:lang
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
    such as x-quenya

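A tool such as the spell-checker mentioned above has to find the language in force for a given node. A minimal sketch, assuming a DOM tree and hypothetical class and method names: walk up from the node until an element carrying xml:lang is found.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LangDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the value of the nearest xml:lang on the node or its ancestors,
    // or null when no language is declared at all.
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                if (e.hasAttribute("xml:lang")) {
                    return e.getAttribute("xml:lang");
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = parse("<doc xml:lang='en'><p>Hello</p><p xml:lang='it'>Ciao</p></doc>");
        Node first = doc.getDocumentElement().getFirstChild();
        Node second = doc.getDocumentElement().getLastChild();
        System.out.println(languageOf(first));  // inherited from doc
        System.out.println(languageOf(second)); // overridden locally
    }
}
```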
Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML however can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following we explore the XML + CSS issue

Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://www.w3.org/Style/CSS/
Goals
  describing how to present elements of a document
  spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
No need for DTD or Schema
  even though the browser could anyway complain…

XML + CSS Example: The XML Doc
Example: How Mozilla Visualises It [without CSS Style Sheet]
Example: How Mozilla Visualises It [with CSS Style Sheet]

DOM & SAX
Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Often we need to manipulate
  access, delete, modify
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM
  SAX

Document Object Model
http://www.w3.org/DOM/
  W3C standard, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
    other Level 2 parts: Events, Views, Style, Traversal and Range
  Level 3 Core: W3C Working Draft, April 2002 (W3C Recommendation since April 2004)
    adds minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is the root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kinds of child nodes each of them should / could have

Properties & Methods of DOM
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
  see for instance http://java.sun.com/xml/jaxp

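The general properties and methods listed above can be tried out with the JAXP DOM implementation that ships with Java. This is a sketch only: the class and helper names are hypothetical, and the player document is borrowed from the other slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomNodeDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Name, type and value: the three things every DOM node exposes.
    static String describe(Node n) {
        return n.getNodeName() + " type=" + n.getNodeType() + " value=" + n.getNodeValue();
    }

    // Builds a new team element and appends it after the existing children of the root.
    static Element addTeam(Document doc, String name, String current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.setTextContent(name);
        doc.getDocumentElement().appendChild(team);
        return team;
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>");
        Element root = doc.getDocumentElement();
        System.out.println(describe(root));                                 // an element node
        System.out.println(describe(root.getFirstChild().getFirstChild())); // the text inside name
        NamedNodeMap attrs = root.getLastChild().getAttributes();           // attributes of team
        for (int i = 0; i < attrs.getLength(); i++) {
            System.out.println(describe(attrs.item(i)));
        }
        addTeam(doc, "Mantova", "no");
        System.out.println("children of player: " + root.getChildNodes().getLength());
    }
}
```

Elements have no value of their own (getNodeValue returns null), while text and attribute nodes do; this is one of the per-node-type differences the DOM specification spells out.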
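A Simple Java DOM
  // Needs java.io.PrintStream, org.w3c.dom.Document and org.w3c.dom.Node;
  // DOMParser is assumed to be org.apache.xerces.parsers.DOMParser (Apache Xerces).
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      Node n = doc.getDocumentElement().getFirstChild();
      // skip siblings until the first recipe element (or the end of the list)
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out); // print(Node, PrintStream) is a helper not shown on the slide
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
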
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools

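The event-based model can be sketched with the SAX classes bundled with Java. The handler below only records the events it receives; the class name and the event labels are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    // Runs a SAX parse and returns the sequence of events as readable labels.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser is free to deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name></player>")) {
            System.out.println(event);
        }
    }
}
```

Nothing of the document is kept once an event has been handled, which is what keeps memory use low; the price is that the application must keep track of any context it needs by itself.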
Using Elements or Attributes
"Attributes are for meta-data about the element, and content is information of the element"
  maybe, but then it is not easy to clearly distinguish between the two
Element-based structure is more flexible than attribute-based
  attributes provide for a flat data structure, elements can be nested as needed
  attributes are unique within an element, any number of elements of the same type can be used within an element
Attributes are quite useful in narrative-based XML documents
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes" value="Bologna" />
  <team current="no" value="Mantova" />
</player>

XML Names
XML Names are used, and are the same, for the names of elements, attributes and some other constructs
  to increase efficiency and abate complexity
An XML name can include
  any letter
    Latin or even non-Latin, like ideographs
  any digit
  underscore, hyphen and period (_ - .)
  a colon (:), which is reserved to namespaces
An XML name may not include other punctuation signs, nor any sort of white space
  and can begin only with letters, ideographs or underscore

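The rules above can be turned into a rough Java check. This is a simplified sketch that follows the slide, not the exact character tables of the XML specification: it assumes that Java's Character.isLetter and Character.isDigit are a good enough approximation, and all names are hypothetical.

```java
class XmlNameCheck {
    // Characters allowed at the beginning of a name, according to the slide.
    static boolean isNameStart(int cp) {
        return Character.isLetter(cp) || cp == '_';
    }

    // Characters allowed after the first one.
    static boolean isNameChar(int cp) {
        return isNameStart(cp) || Character.isDigit(cp)
                || cp == '-' || cp == '.' || cp == ':';
    }

    // Approximate test: true when the string follows the naming rules listed above.
    static boolean looksLikeXmlName(String s) {
        if (s.isEmpty()) return false;
        int first = s.codePointAt(0);
        if (!isNameStart(first)) return false;
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isNameChar(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String[] candidates = { "student_name", "rdf:description", "2036", "student name" };
        for (String c : candidates) {
            System.out.println(c + " -> " + looksLikeXmlName(c));
        }
    }
}
```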
Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse start again after the ; char
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

Entity References
&entityreference;
An entity is something defined outside the normal flow of the XML document
  out of the XML tree
  used for constants, common values, external values, etc.
  through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

Pre-defined XML Entities
Markup    Entity   Description
&lt;      <        less-than
&gt;      >        greater-than
&amp;     &        ampersand
&quot;    "        double quote
&apos;    '        single quote

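A small Java sketch of the five pre-defined entities at work: one hypothetical helper escapes text before it is placed in a document, and a JAXP DOM parser turns the references back into plain characters.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class EntityEscapeDemo {
    // Replaces the five special characters with their pre-defined entity references.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Wraps already-escaped content in an element and returns what the parser makes of it.
    static String parsedContent(String escapedContent) {
        String xml = "<superheroes>" + escapedContent + "</superheroes>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String escaped = escape("Batman & Robin");
        System.out.println(escaped);                // the markup form
        System.out.println(parsedContent(escaped)); // the parsed character data
    }
}
```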
CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good, for instance, to include segments of XML code to show
CDATA section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, in general no way to tell whether a text came from a CDATA section or not

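As a quick check of the last point, the Java sketch below parses the same code chunk twice, once written with entity references and once inside a CDATA section, and compares the resulting text. The class name and the sample chunk are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class CdataDemo {
    // Returns the character data of the root element, however it was written.
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same code chunk, escaped by hand...
        String escaped = "<code>if (a &lt; b &amp;&amp; b &lt; c) run();</code>";
        // ...and left untouched inside a CDATA section.
        String cdata = "<code><![CDATA[if (a < b && b < c) run();]]></code>";
        System.out.println(textOf(escaped));
        System.out.println(textOf(cdata));
        System.out.println("same text: " + textOf(escaped).equals(textOf(cdata)));
    }
}
```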
Comments
Easy
  <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents, use processing instructions

XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear anywhere outside a tag, even before or after the root

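Since processing instructions survive parsing, an application can read them back. A minimal Java sketch, assuming a DOM parser and looking only at the instructions placed outside the root element; the class name and the style.css file name are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class ProcessingInstructionDemo {
    // Lists "target -> data" for every processing instruction outside the root element.
    static List<String> instructions(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        List<String> found = new ArrayList<>();
        NodeList top = doc.getChildNodes(); // children of the document node itself
        for (int i = 0; i < top.getLength(); i++) {
            Node n = top.item(i);
            if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                found.add(pi.getTarget() + " -> " + pi.getData());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                   + "<?xml-stylesheet type=\"text/css\" href=\"style.css\"?>"
                   + "<doc/>";
        for (String pi : instructions(xml)) {
            System.out.println(pi);
        }
    }
}
```

The XML declaration on the first line does not show up in the list: it only looks like a processing instruction.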
The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if there, it should be the first thing in the document, absolutely
  not even comments allowed before
<?xml version="1.0" encoding="utf-8" standalone="no"?>
version is the XML version (1.0, 1.1, …)
encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8)
standalone="yes" means that the document has no external DTD
  optional, default no

Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around

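Any XML parser is also a well-formedness checker, since it has to reject documents that break the rules above. A minimal Java sketch with JAXP, where the class and method names are hypothetical:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // True when the parser accepts the text, false when it reports a fatal error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler throws on fatal errors without printing anything.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the document breaks a well-formedness rule
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a><b/></a>",        // fine
            "<a><b></a></b>",     // overlapping elements
            "<a/><b/>",           // two root elements
            "<a x=1/>",           // unquoted attribute value
            "<a x='1' x='2'/>",   // repeated attribute
            "<a>1 < 2</a>"        // unescaped <
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isWellFormed(s));
        }
    }
}
```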
DTD
Flexibility or Rigidity?
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
  like "a football player should have at least one team"
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, as well-formedness is
  how to handle validity errors is optional

Validation
A valid XML document includes a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still

A Simple DTD Example
We do not go too deep into DTD syntax
  we just look at the example below and comment
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

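A sketch of how such a document could be checked for validity from Java, using the DTD validation built into JAXP. The class and method names are hypothetical; the DOCTYPE is named player here so that it matches the root element, which validity requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdValidationDemo {
    // Internal DTD subset for football players, as in the example above.
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // Parses with DTD validation on and returns the validity errors (empty list = valid).
    static List<String> validationErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // validate against the DTD in the document
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                // Validity violations are reported as recoverable errors.
                @Override public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // not even well-formed, or setup problem
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = DTD + "<player><name>Carlo</name><surname>Nervo</surname>"
                     + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = DTD + "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println("valid document:   " + validationErrors(valid));
        System.out.println("player w/o team:  " + validationErrors(noTeam));
    }
}
```

Note the design of the API: a document that is well-formed but invalid still gets parsed, and it is up to the application to decide what to do with the reported errors.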
DTD Declaration
In the example above the DTD is declared as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

DTD Declarations: Define or Use
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations
In the example above, a player element contains one name, one surname, and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)

Some Syntax
, is for sequence
  to define ordered lists
| is for choice
  to provide for alternatives
suffixes
  * for zero or more occurrences
  + for one or more occurrences
  ? for zero or one occurrence
parentheses for grouping
  at any level of nesting
  operators and suffixes applicable at any level
ANY for free-form content
EMPTY for empty elements

Attribute Declarations
In the example above, a team element has a current attribute
  which is mandatory (#REQUIRED)
    #IMPLIED would say optional instead
  and can be either yes or no
    enumeration as an attribute type

Attribute Defaults
#IMPLIED
  the attribute is optional
#REQUIRED
  the attribute is mandatory
#FIXED
  whether it is explicitly specified or not, it has a given value
literal
  the default value is the literal quoted string

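The effect of the different defaults can be observed through DOM. The Java sketch below builds a tiny document around a team element with a given ATTLIST declaration and reads back the current attribute; the class name, the helper and the declarations used in main are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class AttributeDefaultDemo {
    // Parses a team element under the given ATTLIST declaration and returns the
    // value of its current attribute as seen by the application ("" when absent).
    static String currentOf(String attlist, String teamElement) {
        String xml = "<!DOCTYPE team [<!ELEMENT team (#PCDATA)>" + attlist + "]>" + teamElement;
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getAttribute("current");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String literal = "<!ATTLIST team current (yes | no) \"no\">";
        String implied = "<!ATTLIST team current (yes | no) #IMPLIED>";
        String fixed   = "<!ATTLIST team current CDATA #FIXED \"yes\">";
        // Attribute omitted in the document: the declaration decides what the application sees.
        System.out.println("literal default: '" + currentOf(literal, "<team>Mantova</team>") + "'");
        System.out.println("#IMPLIED:        '" + currentOf(implied, "<team>Mantova</team>") + "'");
        System.out.println("#FIXED:          '" + currentOf(fixed, "<team>Bologna</team>") + "'");
        // Attribute given explicitly: it wins over a literal default.
        System.out.println("explicit value:  '"
                + currentOf(literal, "<team current='yes'>Bologna</team>") + "'");
    }
}
```

Defaults declared in the internal DTD subset are filled in even though validation is not switched on here, so an application may see attributes that were never typed in the document.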
Attribute Types
CDATA
  any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
  looser than an XML name: any name character is accepted as the first character
  the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
  name(s) of unparsed entities declared elsewhere in the document
ID
  an XML name unique in the document, working as an identifier
IDREF, IDREFS
  reference(s) to IDs in the document
NOTATION
  name of a notation used & defined in the document (rare)
enumeration

Other DTD Declarations etc.
ENTITY declarations
  <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
  who cares, actually
We stop here
  more only for those who need it

Namespaces
What are Namespaces for?
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespace Use
Qualified names
  prefix:local_part
Examples of qualified names (also called QNames or raw names)
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

XML NamesXML Names are used and are the same for the names of elements attributes and some other constructs
to increase efficiency and abate complexityAn XML name can include
any letterlatin or even non-latin like ideographs
any digitunderscore hyphen and period (_ - )a colon () is reserved to namespaces
An XML name may not include other punctuation signs nor any sort of white spaces
and can begin only with letters ideographs or underscore28
Parsed Character DataAn XML Parser interprets the character sequences it is fed with trying to devise out its tree-like structure
so for instance lt always taken as the beginning of a tagwhat if we need a lt character in the document as in a JavaScript code
All characters are interpreted as character data to be parsedunless an escape character amp is encounteredcharacter data to parse start again after char
Eg the content of the elementltsuperheroesgtBatman ampamp Robinltsuperheroesgt
becomes the parsed character dataBatman amp Robin
29
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML Entities
Markup Entity Descriptionamplt lt less-thenampgt gt grater-than
ampamp amp ampersandampquot double quoteampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
Comments
Easylt-- Comment --gt
It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-FormednessMain rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
Flexibility or Rigidity
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one teamDocument Type Definition (DTD)
to define which XML documents are validValidity is not mandatory as well-formedness
how to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it

Namespaces

What are Namespaces for?
Distinguish
  different XML applications may use the same names
    at any scale, from personal to world-wide
  a namespace allows them to be clearly distinguished
Group
  names of elements and attributes of the same XML application can be grouped together
    to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
  what if I have to use them together?
  namespaces can be used to disambiguate names

Syntax for Namespace Use
Qualified names
  prefix:local_part
Examples of qualified names
  or QNames, or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes to URIs
Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml"
           xmlns:euro="http://www.company.eu/xml"
           xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world as prefixes everywhere inside that element
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but usually svg, rdf and other prefixes are not re-defined
  also, URIs are conventional names
    not necessarily pointing to an actual resource

Setting Default Namespaces
xmlns attribute
  alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
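
The set example can be tried out with a namespace-aware DOM parser. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name, the prefix m, and the toy document (well-formed, but not valid SVG) are invented here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: two elements called "set", told apart by namespace.
final class NamespaceDemo {

    // SVG is the default namespace; MathML is bound to the (arbitrary) prefix "m".
    static final String XML =
          "<svg xmlns=\"http://www.w3.org/2000/svg\"\n"
        + "     xmlns:m=\"http://www.w3.org/1998/Math/MathML\">\n"
        + "  <set/>\n"
        + "  <m:set/>\n"
        + "</svg>";

    /** Returns "{namespace URI}local name" for every element whose local name is "set". */
    static List<String> expandedNamesOfSets() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces unless asked
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList sets = doc.getElementsByTagNameNS("*", "set"); // "*" matches any namespace
            List<String> names = new ArrayList<>();
            for (int i = 0; i < sets.getLength(); i++) {
                Node set = sets.item(i);
                names.add("{" + set.getNamespaceURI() + "}" + set.getLocalName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        expandedNamesOfSets().forEach(System.out::println);
    }
}
```

The unprefixed set inherits the default namespace of svg, while m:set lands in the MathML namespace; the prefix itself plays no role in the comparison.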

Internationalisation

What does "Text" Mean?
"Text" can be encoded according to so many different alphabets
  character set: a mapping between characters and integers (code points)
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
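
The difference between a code point and its encodings can be checked directly. A minimal sketch using only the standard Java charsets; the class name and the two sample characters are chosen here for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: one code point, several byte-level encodings.
final class EncodingDemo {

    /** Number of bytes needed to store the text in the given encoding. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String aGrave = "\u00E0"; // LATIN SMALL LETTER A WITH GRAVE
        String euro = "\u20AC";   // EURO SIGN

        System.out.println("code point of a-grave: " + aGrave.codePointAt(0));
        System.out.println("a-grave, UTF-8:      " + encodedLength(aGrave, StandardCharsets.UTF_8));
        System.out.println("a-grave, UTF-16BE:   " + encodedLength(aGrave, StandardCharsets.UTF_16BE));
        System.out.println("a-grave, ISO-8859-1: " + encodedLength(aGrave, StandardCharsets.ISO_8859_1));
        System.out.println("euro, UTF-8:         " + encodedLength(euro, StandardCharsets.UTF_8));
        System.out.println("euro, UTF-16BE:      " + encodedLength(euro, StandardCharsets.UTF_16BE));
    }
}
```

UTF_16BE is used rather than UTF_16 so that no byte-order mark is added to the count.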

The XML Encoding Part of the XML Declaration
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments, or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
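
How a parser uses the declared encoding can be sketched as follows, assuming the JAXP parser bundled with the JDK. The class name and the city element are invented here; the document is deliberately stored as Latin-1 bytes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class: the parser decodes bytes according to the XML declaration.
final class DeclaredEncodingDemo {

    // \u00EC is "i with grave"; in ISO-8859-1 it is the single byte 0xEC.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><city>Forl\u00EC</city>";

    /** Stores the document as Latin-1 bytes and lets the parser decode them. */
    static Document parseLatin1() {
        try {
            byte[] bytes = XML.getBytes(StandardCharsets.ISO_8859_1);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseLatin1();
        System.out.println("declared encoding: " + doc.getXmlEncoding());
        System.out.println("declared version:  " + doc.getXmlVersion());
        System.out.println("content:           " + doc.getDocumentElement().getTextContent());
    }
}
```

The parser is handed raw bytes, not characters, so the encoding declaration is the only in-document hint it has for decoding the accented letter.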

Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element and of its content, unless a nested xml:lang overrides it
Values are to be found in ISO 639
  standard two letters for each language known
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
    such as x-quenya
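
A spell-checker would look up the language of a node by walking towards the root until it meets an xml:lang attribute. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name, the helper languageOf, and the sample document are invented here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: resolves the language in force for a given node.
final class XmlLangDemo {

    static final String XML =
        "<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\">Ciao</p></doc>";

    /** Nearest xml:lang value on the node or its ancestors, or null if none is set. */
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                if (e.hasAttribute("xml:lang")) {
                    return e.getAttribute("xml:lang");
                }
            }
        }
        return null;
    }

    /** Language of each <p> element of the sample document, in document order. */
    static List<String> paragraphLanguages() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList paragraphs = doc.getElementsByTagName("p");
            List<String> languages = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                languages.add(languageOf(paragraphs.item(i)));
            }
            return languages;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(paragraphLanguages());
    }
}
```

The first paragraph inherits en from doc, the second overrides it with it.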

Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time

XML & CSS

XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course, we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css" ?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
No need for DTD or Schema
  even though the browser could anyway complain…
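
The association is an ordinary processing instruction, so any application can read it, not only a browser. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name and the file name style.css are invented here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical demo class: finds the xml-stylesheet instruction in the document prolog.
final class StylesheetPiDemo {

    static final String XML =
        "<?xml-stylesheet type=\"text/css\" href=\"style.css\"?><player/>";

    /** Data of the first xml-stylesheet processing instruction, or null if there is none. */
    static String stylesheetData(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Processing instructions before the root are children of the document node.
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    if ("xml-stylesheet".equals(pi.getTarget())) {
                        return pi.getData();
                    }
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stylesheetData(XML));
    }
}
```

The parser hands over the data as one uninterpreted string; the type and href pseudo-attributes are a convention the consuming application has to parse itself.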

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises It [without CSS Style Sheet]

Example: How Mozilla Visualises It [with CSS Style Sheet]

DOM & SAX

Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
  access, delete, modify
  parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM
  SAX

Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  maybe huge memory consumption
It is quite large and complex…
  Level 1 Core: W3C Recommendation, October 1998
    primitive navigation and manipulation of XML trees
    other Level 1 parts: HTML
  Level 2 Core: W3C Recommendation, November 2000
    adds Namespace support and minor new features
    other Level 2 parts: Events, Views, Style, Traversal and Range
  Level 3 Core: W3C Recommendation, April 2004
    adds minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
  one of them is a root node
  nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
  document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have

Properties & Methods of DOM
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
  attributes returns all the attributes of the node
  appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
  many solutions for Java
  see for instance http://java.sun.com/xml/jaxp
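
The general node properties and appendChild look as follows in the Java binding of DOM. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name and the helper describe are invented here, and the data reuses the football player example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: name / value / type of nodes, plus appendChild.
final class DomNodeDemo {

    /** Parses a one-child player document and appends a team element to it. */
    static Document buildPlayer() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<player><name>Carlo</name></player>")));
            Element team = doc.createElement("team");
            team.setAttribute("current", "yes");
            team.appendChild(doc.createTextNode("Bologna"));
            doc.getDocumentElement().appendChild(team); // goes after the existing <name> child
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not build the sample document", e);
        }
    }

    /** The three properties every DOM node has. */
    static String describe(Node n) {
        return "name=" + n.getNodeName() + " type=" + n.getNodeType() + " value=" + n.getNodeValue();
    }

    public static void main(String[] args) {
        Element player = buildPlayer().getDocumentElement();
        Node name = player.getFirstChild();
        Node team = player.getLastChild();
        System.out.println(describe(name));                 // an element node: value is null
        System.out.println(describe(name.getFirstChild())); // the text node holding "Carlo"
        System.out.println(describe(team.getAttributes().getNamedItem("current")));
    }
}
```

Element nodes have a null value; the text lives in a child text node, and attributes are nodes reached through the attributes map rather than through the child list.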
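
A Simple Java DOM
  // DOMParser is presumably the Xerces class imported below;
  // print(Node, PrintStream) is a helper that is not shown on the slide.
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                      // parse the file named on the command line
      Document doc = p.getDocument();
      // scan the children of the root for the first recipe element
      Node n = doc.getDocumentElement().getFirstChild();
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);                       // copy the recipe found, if any
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }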

Main Problem of DOM
The XML document is loaded as a whole, and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
  flowing through the SAX parser
  and generating events as it goes: document started / ended, element started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not so well-supported as DOM is
  in terms of standardisation
  as well as of tools
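
The event flow can be made visible by logging every callback. A minimal sketch, assuming the SAX parser bundled with the JDK; the class name and the log format are invented here. Character data is buffered because a parser is allowed to deliver one run of text in several chunks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records the SAX events produced while a document streams by.
final class SaxEventsDemo {

    /** Parses the document and returns the sequence of events seen, as readable strings. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                text.setLength(0);
                log.add("start " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may be called more than once per text run
            }
            @Override public void endElement(String uri, String localName, String qName) {
                if (text.length() > 0) {
                    log.add("text " + text);
                    text.setLength(0);
                }
                log.add("end " + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
        return log;
    }

    public static void main(String[] args) {
        events("<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>")
                .forEach(System.out::println);
    }
}
```

No tree is ever built: once an event has been handled, the parser keeps nothing about it, which is where the memory saving over DOM comes from.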

Parsed Character Data
An XML parser interprets the character sequences it is fed with, trying to work out their tree-like structure
  so, for instance, < is always taken as the beginning of a tag
  what if we need a < character in the document, as in JavaScript code?
All characters are interpreted as character data to be parsed
  unless an escape character & is encountered
  character data to parse starts again after the ; character
E.g., the content of the element
  <superheroes>Batman &amp; Robin</superheroes>
becomes the parsed character data
  Batman & Robin

Entity References
&entityreference;
An entity is something defined outside the normal flow of the XML document
  out of the XML tree
  used for constants, common values, external values, etc.
  through an entity reference
Users of any sort may define their own entities
  we'll see how soon, for instance through DTDs

Pre-defined XML Entities
  Markup    Entity   Description
  &lt;      <        less-than
  &gt;      >        greater-than
  &amp;     &        ampersand
  &quot;    "        double quote
  &apos;    '        single quote
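
Both the pre-defined entities and a user-defined one can be seen being replaced by the parser. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name, the note element, and the course entity are invented here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class: entity references are resolved before the application sees the text.
final class EntityDemo {

    // An internal entity declared in the DTD, used next to two pre-defined ones.
    static final String WITH_OWN_ENTITY =
          "<!DOCTYPE note [ <!ENTITY course \"Distributed Systems\"> ]>"
        + "<note>&course; &lt;L-A&gt;</note>";

    /** Parsed character data of the root element. */
    static String text(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(text("<superheroes>Batman &amp; Robin</superheroes>"));
        System.out.println(text(WITH_OWN_ENTITY));
    }
}
```

The application never sees the references themselves, only the characters they stand for.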

CDATA Sections
Including code chunks from any language with < or & can be tedious
  we need to tell the parser "do not parse this"
  good for instance to include segments of XML code to show
CDATA Section
  between <![CDATA[ and ]]>
  can contain anything but its own end delimiter ]]>
After parsing, the text itself does not tell whether it came from a CDATA section or not
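
That a CDATA section and escaped text carry the same characters can be checked from Java. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name and the code element are invented here. Note that this DOM implementation keeps a separate CDATA node type unless the factory is asked to coalesce it into plain text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: the same text written as a CDATA section and with escapes.
final class CdataDemo {

    static final String AS_CDATA = "<code><![CDATA[if (a < b && b > c) {}]]></code>";
    static final String ESCAPED = "<code>if (a &lt; b &amp;&amp; b &gt; c) {}</code>";

    /** First child of the root element; coalescing turns CDATA sections into text nodes. */
    static Node firstChild(String xml, boolean coalescing) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(coalescing);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getFirstChild();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstChild(AS_CDATA, false).getNodeValue());
        System.out.println(firstChild(ESCAPED, false).getNodeValue());
        System.out.println("coalesced node is plain text: "
                + (firstChild(AS_CDATA, true).getNodeType() == Node.TEXT_NODE));
    }
}
```

With coalescing switched on, the application can no longer distinguish the two spellings at all.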

Comments
Easy
  <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
  they can appear anywhere, even before the root element
  but not inside a tag or a comment
Parsers may either drop or keep them at their will
Comments are meant to improve human legibility of XML docs
  to give info to computational agents, use processing instructions

XML Processing Instructions
Need to pass information for a given application through the parser
  comments may disappear at any stage of the process
Processing instructions have this very end
  <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
  <?php … ?>
  <?xml-stylesheet … ?>
A processing instruction is markup, not an element
  it can appear everywhere out of a tag, even before or after the root

The XML Declaration
Looks like an XML processing instruction
  but it is not: it is just the XML declaration
It is optional
  but if there, it should be the first thing in the document, absolutely
  not even comments allowed before
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
  optional, default Unicode (UTF-8)
Standalone means that it has no external DTD
  optional, default no

Checking Well-Formedness
Main rules
  perfect match between start and end tags
  no overlapping elements
  one and only one root element
  attribute values are always quoted
  at most one attribute with a given name per element
  neither comments nor processing instructions within tags
  no unescaped < or & signs in the character data of elements or attributes
  …
Tools on the Web
  just look around
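
Any XML parser doubles as a well-formedness checker, since a violation of these rules is a fatal error. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name and the sample strings are invented here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: well-formed means "the parser gets through without a fatal error".
final class WellFormednessDemo {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and stays silent otherwise.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness rule was broken
        } catch (Exception e) {
            throw new IllegalStateException("parser could not be set up or input not readable", e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a><b></b></a>",        // fine
            "<a><b></a></b>",        // overlapping elements
            "<a/><b/>",              // two root elements
            "<a x=1/>",              // unquoted attribute value
            "<a x=\"1\" x=\"2\"/>",  // same attribute twice
            "<a>1 < 2</a>"           // unescaped < in character data
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```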

DTD

Flexibility or Rigidity?
XML is flexible
  whatever this means
  but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
  some control over syntax should be enforced
  like "a football player should have at least one team"
Document Type Definition (DTD)
  to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
  how to handle validity errors is left to the application

Validation
A valid XML document includes (or refers to) a DTD that the document satisfies
Main principle
  everything not permitted is forbidden
  that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
  then the document is valid
  otherwise the document is invalid
Many things a DTD does not say
  we stick with what we can specify

DTD is…
SGML-based
  syntax a bit awkward
  but after all easy to understand
  and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
  typical syntax-based approach
  maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
  XML Schema should be that
  but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still

A Simple DTD Example
We do not go too deep into DTD syntax
  we just look at the example below and comment

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
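
Validation against this DTD can be tried with a validating parser. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name and the sample bodies are invented here. The DOCTYPE is named player because validity requires the DOCTYPE name to match the root element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: collects the validity errors a validating parser reports.
final class DtdValidationDemo {

    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    static final String VALID =
        "<player><name>Carlo</name><surname>Nervo</surname>"
      + "<team current=\"yes\">Bologna</team><team current=\"no\">Mantova</team></player>";

    // Well-formed, but the content model asks for at least one team.
    static final String NO_TEAM =
        "<player><name>Carlo</name><surname>Nervo</surname></player>";

    /** Validates DTD + body and returns the validity error messages (empty if valid). */
    static List<String> validate(String body) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation is off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e; // not well-formed: give up
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
        } catch (Exception e) {
            throw new IllegalStateException("document is not well-formed or not readable", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("valid document:  " + validate(VALID));
        System.out.println("missing team:    " + validate(NO_TEAM));
    }
}
```

Validity errors are recoverable: the parser reports them through error() and carries on, whereas a well-formedness violation ends the parse. This is one way to read "how to handle errors is optional".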

DTD Declaration
In the example, the DTD is declared as internal
  but could be declared separately
  <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external, shared resource
  <!DOCTYPE player SYSTEM "http://…">

DTD Declarations: Define or Use
So you may
  define your own DTD, and
    either include it in your XML document
    or save it as an independent document, and refer to it from one or more XML docs
  or use an external DTD defined by someone else
    like a working group you belong to, or a standardisation body of any sort
    by referring to that externally-defined syntax for your XML docs

Element Declarations
A player element contains one name, one surname and one or more teams
  in that precise order
  and they are just parsed character data (#PCDATA)
Entity References
ampentityreferencean entity is something defined outside the normal flow of the XML document
out of the XML treeused for constants common values external values etc
through an entity referenceUsers of any sort may define their own entities
well see how soon for instance through DTDs
30
Pre-defined XML Entities
Markup Entity Descriptionamplt lt less-thenampgt gt grater-than
ampamp amp ampersandampquot double quoteampapos single quote
31
CDATA Sections
Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
Comments
Easylt-- Comment --gt
It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-FormednessMain rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
Flexibility or Rigidity
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one teamDocument Type Definition (DTD)
to define which XML documents are validValidity is not mandatory as well-formedness
how to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for?
Distinguish
    different XML applications may use the same names
        at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
Group
    names of elements and attributes of the same XML application can be grouped together
        to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
Syntax for Namespace Use
Qualified names
    prefix:local_part
Examples of qualified names
    or QNames, or raw names
    rdf:description, xlink:type, xsl:template
Used for both element and attribute names
Associating Prefixes to URIs
Example
    a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
    but usually svg, rdf and other prefixes are not re-defined
    also, the URIs are just conventional names
        not necessarily pointing to an actual resource
Setting Default Namespaces
xmlns attribute
    alone, with no :prefix suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
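The SVG / MathML clash on set can be reproduced in a few lines. This is a sketch under stated assumptions: it uses the JAXP parser bundled with the JDK, a default namespace for SVG and a prefix m (chosen arbitrarily here) for MathML; the document itself is a made-up fragment, not valid SVG.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class NamespaceSketch {

    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    // Illustrative fragment: one set element per namespace.
    static final String XML =
          "<svg xmlns=\"" + SVG + "\" xmlns:m=\"" + MATHML + "\">"
        + "<set/><m:set/>"
        + "</svg>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts the elements with the given local name in the given namespace.
    static int count(Document doc, String namespaceUri, String localName) {
        return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println("SVG set elements:    " + count(doc, SVG, "set"));
        System.out.println("MathML set elements: " + count(doc, MATHML, "set"));
        System.out.println("root namespace:      "
            + doc.getDocumentElement().getNamespaceURI());
    }
}
```

The lookup goes through the namespace URI rather than the prefix, which is why re-binding m to another prefix in the document would not change the result.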
Internationalisation
What does Text Mean?
“Text” can be encoded according to so many different alphabets
    mapping between characters and integers (code points)
        character set
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
    so its encoding should be declared
The XML Encoding Part of the XML Declaration
<?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
    Unicode and ISO are the most used families
Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
        it is then a text declaration, no longer an XML declaration
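To see the declaration at work, the same text can be serialised in two encodings and parsed back. A minimal sketch, assuming the JAXP parser bundled with the JDK; the word element and its content are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class EncodingSketch {

    // Builds the bytes of a tiny document whose XML declaration names the
    // same encoding that is used to turn the characters into bytes.
    static byte[] document(Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
                   + "<word>Universit\u00E0</word>";
        return xml.getBytes(charset);
    }

    // Parses raw bytes; the parser reads the encoding from the declaration.
    static String textOf(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (Charset cs : new Charset[] {StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8}) {
            byte[] bytes = document(cs);
            System.out.println(cs.name() + ": " + bytes.length + " bytes, text = "
                + textOf(bytes));
        }
    }
}
```

The byte sequences differ, yet both documents yield the same characters once parsed, because each one carries its own encoding information.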
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
    for multi-lingual docs
xml:lang attribute
    can be associated to any element
    determines the language of the element
Values are to be found in ISO 639
    standard two letters for each known language
    if not there, IANA
        prefix i-
        such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
        prefix x-
        such as x-quenya
Encoding for Portability
Working around encoding is not simply an “internationalisation” issue
    it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
XML abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
Different experiences with different browsers
    when trying to visualise an XML document
XML however can be transformed
    to become easier to handle by standard browsers
Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
In the following, we explore the XML + CSS issue
Cascading Style Sheets
Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
    http://www.w3.org/Style/CSS/
Goals
    describing how to present elements of a document
        spanning over a range of different media
    separating style description from content and structure
In this course, we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
Processing instruction
    to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining the presentation style for the XML document tags
    tagname { property1: value1; … }
No need for DTD or Schema
    even though the browser could complain anyway…
XML + CSS Example: The XML Doc
Example: How Mozilla Visualises it [without CSS Style Sheet]
Example: How Mozilla Visualises it [with CSS Style Sheet]
DOM & SAX
Manipulating XML
Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
    access, delete, modify
    parts of an XML document
        which may or may not be an XML file
This is typically done through programming languages of many sorts
    through ad hoc APIs
The most used / hated / deprecated / widespread are
    DOM
    SAX
Document Object Model
http://www.w3.org/DOM/
    W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope
DOM & Levels
DOM views an XML document as a tree data structure
    similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
        primitive navigation and manipulation of XML trees
        other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
        adds Namespace support and minor new features
        other Level 2 parts: Events, Views, Style, Traversal and Range
    Level 3 Core: W3C Recommendation, April 2004 (Working Draft in April 2002)
        adds minor new features
DOM Nodes
An XML document is a tree
The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have
Properties & Methods of DOM
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
    see for instance http://java.sun.com/xml/jaxp/
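A Simple Java DOM Example
    // needs: java.io.PrintStream, org.w3c.dom.*, and a DOMParser class
    // (presumably org.apache.xerces.parsers.DOMParser);
    // print(Node, PrintStream) is a helper not shown on the slide
    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            Node n = doc.getDocumentElement().getFirstChild();
            // skip siblings until the first recipe element is found
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }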
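The same kind of navigation and update can be tried without any external library. What follows is a sketch, assuming the JAXP DocumentBuilder bundled with the JDK rather than the parser class used on the slide; it reuses the football player document, and the helper names (currentTeam, addTeam) and the team "Example FC" are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomSketch {

    static final String XML =
          "<player><name>Carlo</name><surname>Nervo</surname>"
        + "<team current=\"yes\">Bologna</team>"
        + "<team current=\"no\">Mantova</team></player>";

    // Loads the whole document in memory as a tree of nodes.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Explores the tree: returns the text of the team marked current="yes".
    static String currentTeam(Document doc) {
        NodeList teams = doc.getElementsByTagName("team");
        for (int i = 0; i < teams.getLength(); i++) {
            Element team = (Element) teams.item(i);
            if ("yes".equals(team.getAttribute("current"))) {
                return team.getTextContent();
            }
        }
        return null;
    }

    // Updates the tree: appends a new team after the other child nodes.
    static void addTeam(Document doc, String name, boolean current) {
        Element team = doc.createElement("team");
        team.setAttribute("current", current ? "yes" : "no");
        team.setTextContent(name);
        doc.getDocumentElement().appendChild(team);
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println("current team: " + currentTeam(doc));
        addTeam(doc, "Example FC", false); // hypothetical team name
        System.out.println("teams now: " + doc.getElementsByTagName("team").getLength());
    }
}
```

Note that both reading and writing operate on the in-memory tree only; nothing is written back to a file unless the tree is serialised explicitly.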
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist
Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree but as a text doc
    flowing through the SAX parser
    and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
A very simple model
    good for simple applications
    and also to avoid memory abuse
Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools
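The event flow can be made visible by logging each callback. A minimal sketch, assuming the SAX parser bundled with the JDK; the event labels ("start:", "end:", "text:") are a convention invented here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxSketch {

    // Parses the document and returns the sequence of events, in order.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser is allowed to deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text:" + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name>"
                   + "<team current=\"yes\">Bologna</team></player>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: once a callback returns, the parser keeps nothing about that part of the document, which is where the memory saving comes from.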
Pre-defined XML Entities
Markup    Entity    Description
&lt;      <         less-than
&gt;      >         greater-than
&amp;     &         ampersand
&quot;    "         double quote
&apos;    '         single quote
CDATA Sections
Including code chunks from any language with < or & can be tedious
    we need to tell the parser “do not parse this”
    good for instance to include segments of XML code to show
CDATA Section
    between <![CDATA[ and ]]>
    can contain anything but its own closing delimiter ]]>
After parsing, in general there is no way to tell whether a text came from a CDATA section or not
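A CDATA section and the equivalent escaped text can be compared after parsing. This is a sketch assuming the JAXP DOM parser bundled with the JDK, with coalescing switched on so that CDATA sections are merged into ordinary text nodes; without that option a DOM tree does keep separate CDATA section nodes. The code element is invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class CdataSketch {

    static final String WITH_CDATA =
        "<code><![CDATA[<b>bold & free</b>]]></code>";
    static final String WITH_ENTITIES =
        "<code>&lt;b&gt;bold &amp; free&lt;/b&gt;</code>";

    // Parses the document with coalescing on and returns its root element.
    static Element root(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(true); // turn CDATA sections into plain text nodes
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String xml : new String[] {WITH_CDATA, WITH_ENTITIES}) {
            Element code = root(xml);
            System.out.println("text = " + code.getTextContent()
                + ", first child node type = " + code.getFirstChild().getNodeType());
        }
    }
}
```

Both documents carry the same character data; the CDATA section is only a more convenient way of typing it.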
Comments
Easy
    <!-- Comment -->
It cannot contain --, nor can it end with --->
Comments do not affect the document tree-structure
    they can appear anywhere, even before the root element
    but not inside a tag or a comment
Parsers may either drop or keep them, at their will
Comments are meant to improve human legibility of XML docs
    to give info to computational agents: processing instructions
XML Processing Instructions
Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
Processing instructions have this very end
    <?target … ?>
The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root
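Processing instructions do reach the application, target and data kept apart. A minimal sketch, assuming the JAXP DOM parser bundled with the JDK; the audit target and the style.css file name are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

final class ProcessingInstructionSketch {

    // One instruction before the root element, one after it.
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<?xml-stylesheet type=\"text/css\" href=\"style.css\"?>\n"
        + "<doc/>\n"
        + "<?audit checked?>";

    // Lists the processing instructions found outside the root element.
    static List<String> topLevelInstructions(String xml) {
        List<String> found = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (node.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) node;
                    found.add(pi.getTarget() + " -> " + pi.getData());
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        for (String instruction : topLevelInstructions(XML)) {
            System.out.println(instruction);
        }
    }
}
```

The first line of the document is not in the list: the XML declaration only looks like a processing instruction.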
The XML Declaration
Looks like an XML processing instruction
    but it is not: it is just the XML declaration
It is optional
    but if present, it should be the first thing in the document, absolutely
    not even comments allowed before
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
Version is the XML version (1.0, 1.1, …)
Encoding is the form of the text (Unicode in the example)
    optional, default Unicode (UTF-8, or UTF-16 with a byte-order mark)
Standalone means that it has no external DTD
    optional, default no
Checking Well-Formedness
Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
Tools on the Web
    just look around
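Any XML parser is also a well-formedness checker, since it must stop at the first violation. A minimal sketch, assuming the JAXP DOM parser bundled with the JDK; the sample documents are invented, one per rule.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedSketch {

    // Returns true if the text parses without a fatal error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps stderr quiet.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // some well-formedness rule is violated
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a><b></b></a>",        // matching tags
            "<a><b></a></b>",        // overlapping elements
            "<a/><b/>",              // two root elements
            "<a x=1/>",              // unquoted attribute value
            "<a x=\"1\" x=\"2\"/>",  // duplicate attribute
            "<a>1 < 2</a>",          // unescaped < in character data
            "<a>1 &lt; 2</a>"        // escaped, fine
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```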
DTD
Flexibility or Rigidity?
XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is required
    some control over syntax should be enforced
        like “a football player should have at least one team”
Document Type Definition (DTD)
    to define which XML documents are valid
Validity is not mandatory, unlike well-formedness
    how to handle validity errors is optional
Validation
A valid XML document includes a DTD that the document satisfies
Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
Many things a DTD does not say
    we stick with what we can specify
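The "everything not permitted is forbidden" principle can be checked with a validating parser. A sketch under stated assumptions: it uses the JAXP DOM parser bundled with the JDK with validation switched on, and a DTD for a player element with a name, a surname and at least one team; the DOCTYPE name is set to player so that it matches the root element, as validity requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class DtdValidationSketch {

    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // Validates the given root element against DTD; returns the validity errors.
    static List<String> validate(String playerXml) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation is off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                // Validity violations are recoverable errors, not fatal ones.
                @Override public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + playerXml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<player><name>Carlo</name><surname>Nervo</surname>"
                     + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println("valid document:   " + validate(valid));
        System.out.println("player, no team:  " + validate(noTeam));
    }
}
```

A document that is well-formed but invalid still parses; it is up to the application to decide what to do with the reported errors.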
DTD is…
SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
    typical syntax-based approach
        maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still
A Simple DTD Example
We do not go too deep into DTD syntax
    we just look at the example below and comment
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
DTD Declaration
The DTD is declared here as internal
    but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
DTD Declarations: Define or Use
So you may
    define your own DTD and
        either include it in your XML document
        or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
        like a working group you belong to, or a standardisation body of any sort
        by referring to that externally-defined syntax for your XML docs
Element Declarations
A player element contains one name, one surname and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)
CDATA Sections
Including code chunks from any language with lt or can be tediouswe need to say the parser do not parse thisgood for instance to include segments of XML code to show
CDATA Sectionbetween lt[CDATA[ and ]]gtcan contain anything but its own delimiters
After parsing no way to tell where a text came from a CDATA section or not
32
Comments
Easylt-- Comment --gt
It cannot contain -- nor it can end with ---gtComments do not affect the document tree-structure
they can appear anywhere even before the root elementbut not inside a tag or a comment
Parsers may either drop or keep them at their willComments are meant to improve human legibility of XML docs
to give info to a computational agents processing instructions
33
XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-FormednessMain rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
Flexibility or Rigidity
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one teamDocument Type Definition (DTD)
to define which XML documents are validValidity is not mandatory as well-formedness
how to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
  xmlns attribute
    alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit

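As a small illustration of the default namespace, the sketch below (Python standard library, chosen here as an assumption; the child elements g and rect and the numeric sizes are illustrative) parses an svg element declared with a bare xmlns attribute. Every element, svg included, ends up in the SVG namespace, while unprefixed attributes stay outside any namespace.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Illustrative content: sizes and child elements are made up.
DOC = (
    f'<svg xmlns="{SVG_NS}" width="10" height="10">'
    '<g><rect width="5" height="5"/></g>'
    "</svg>"
)

root = ET.fromstring(DOC)

# All elements inherit the default namespace, with no prefix written.
element_tags = [el.tag for el in root.iter()]

# Unprefixed attributes are not placed in the default namespace.
attribute_names = sorted(root.attrib)

print(element_tags)
print(attribute_names)
```
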
Internationalisation
What does "Text" Mean?
  "Text" can be encoded according to many different alphabets
    character set: a mapping between characters and integers (code points)
      ASCII being the most (in)famous, now Unicode
  A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
  Any XML document is a text document
    so encoding should be declared

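The distinction between code point and encoding can be made concrete with one character. A minimal sketch in Python (an assumption of this example; the character è is an arbitrary choice): the code point is fixed by the character set, while the bytes depend on the encoding.

```python
text = "\u00e8"  # LATIN SMALL LETTER E WITH GRAVE

# The character set (Unicode) assigns one integer, the code point.
code_point = ord(text)

# Each encoding maps that same code point onto different bytes.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")
latin1_bytes = text.encode("iso-8859-1")

print(hex(code_point), utf8_bytes, utf16_bytes, latin1_bytes)
```
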
The XML Encoding Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
    Unicode and ISO are the most used families
  Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
      it is then a text declaration, no longer an XML declaration

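Why the encoding should be declared can be sketched as follows, using Python's standard library as an assumption of this example (the city element and its content are illustrative). The same Latin-1 bytes parse correctly when the declaration names ISO-8859-1, and are rejected when the parser has to fall back on UTF-8.

```python
import xml.etree.ElementTree as ET

# A Latin-1 encoded document that declares its encoding.
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><city>Forl\u00ec</city>'
).encode("iso-8859-1")
city = ET.fromstring(declared).text

# The same bytes without a declaration: the parser assumes UTF-8,
# and the lone byte 0xEC is not valid UTF-8.
undeclared = "<city>Forl\u00ec</city>".encode("iso-8859-1")
try:
    ET.fromstring(undeclared)
    undeclared_parses = True
except ET.ParseError:
    undeclared_parses = False

print(city, undeclared_parses)
```
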
Multi-Lingual Documents
  Example: a spell-checker or a voice-reader parsing an XML doc
  How to determine the language of a subpart?
    for multi-lingual docs
  xml:lang attribute
    can be associated to any element
    determines the language of the element
  Values are to be found in ISO 639
    standard two-letter code for each known language
    if not there, IANA
      prefix i-
      such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
      prefix x-
      such as x-quenya

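A tool such as the spell-checker mentioned above has to work out the language in force for each element. The sketch below (Python standard library, an assumption of this example; the doc, p and em elements are illustrative) relies on the XML rule that xml:lang applies to the element's content too, unless a descendant overrides it.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative multi-lingual document.
DOC = """<doc xml:lang="en">
  <p>Hello</p>
  <p xml:lang="it">Ciao <em>mondo</em></p>
</doc>"""


def languages(element, inherited=None):
    """Yield (tag, language) pairs, propagating xml:lang to descendants."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)


result = list(languages(ET.fromstring(DOC)))
print(result)
```
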
Encoding for Portability
  Working around encoding is not simply an "internationalisation" issue
    it is also about portability
  When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML's abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time

XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML, however, can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following, we explore the XML + CSS approach

Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
  W3C standard
    http://www.w3.org/Style/CSS/
  Goals
    describing how to present elements of a document
      spanning over a range of different media
    separating style description from content and structure
  In this course, we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing instruction
    to associate CSS to XML
      <?xml-stylesheet type="text/css" href="filename.css"?>
  CSS style sheet
    defining presentation style for the XML document tags
      tagname { property1: value1; … }
  No need for DTD or Schema
    even though the browser could complain anyway…

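The association step can also be done programmatically. A minimal sketch, assuming Python's standard xml.dom.minidom module (the recipe document and the style.css file name are hypothetical): it inserts an xml-stylesheet processing instruction before the root element and serialises the result.

```python
from xml.dom import minidom

# Hypothetical document to be styled.
doc = minidom.parseString("<recipe><title>Tagliatelle</title></recipe>")

# Build the processing instruction: target first, then its data.
pi = doc.createProcessingInstruction(
    "xml-stylesheet", 'type="text/css" href="style.css"'
)

# It must come before the root element to apply to the whole document.
doc.insertBefore(pi, doc.documentElement)

xml_text = doc.toxml()
print(xml_text)
```
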
XML + CSS Example: The XML Doc

Example: How Mozilla Visualises It [without CSS Style Sheet]

Example: How Mozilla Visualises It [with CSS Style Sheet]

DOM & SAX
Manipulating XML
  Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
  Most often, we need to manipulate
    access, delete, modify
    parts of an XML document
      which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used / hated / deprecated / widespread are
    DOM
    SAX

Document Object Model
  http://www.w3.org/DOM/
    W3C standard, as usual
  "The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope

DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
    possibly huge memory consumption
  It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
      primitive navigation and manipulation of XML trees
      other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
      adds Namespace support and minor new features
      other Level 2 parts: Events, Views, Style, Traversal and Range
    Level 3 Core: W3C Working Draft, April 2002
      adds minor new features

DOM Nodes
  An XML document is a tree
  The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
  The DOM specification states that a node can be a
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
  It also defines which kind of child nodes they should / could have

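The node kinds listed above can be seen by walking a small tree. The sketch below assumes Python's standard xml.dom.minidom module, which follows the W3C DOM naming; the sample document and the app target are illustrative. It records depth, node kind and node name for every node.

```python
from xml.dom import Node, minidom

# Illustrative document with a comment, elements, text and a processing instruction.
DOC = (
    '<?xml version="1.0"?><!-- a comment -->'
    "<player><name>Carlo</name><?app do-it?></player>"
)

# Human-readable labels for the node types used in this example.
KINDS = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}


def walk(node, depth=0):
    """Yield (depth, kind, name) for the node and all its descendants."""
    yield depth, KINDS.get(node.nodeType, "other"), node.nodeName
    for child in node.childNodes:
        yield from walk(child, depth + 1)


nodes = list(walk(minidom.parseString(DOC)))
for depth, kind, name in nodes:
    print("  " * depth + f"{kind}: {name}")
```
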
Properties & Methods of DOM
  Every DOM node has properties and methods to explore and update the XML tree
  Every DOM node has a name, a value, a type
  There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
  Then, any specific kind of node has its own specific properties and methods
  These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
    see for instance http://java.sun.com/xml/jaxp/

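The general properties and the two members named above (attributes, appendChild) can be tried on a tiny tree. This is a sketch under the assumption that Python's xml.dom.minidom is an acceptable stand-in for the language bindings mentioned in the slides; the player document is taken from the DTD example of these slides.

```python
from xml.dom import minidom

doc = minidom.parseString('<player><team current="yes">Bologna</team></player>')
team = doc.documentElement.firstChild

# Every node has a name, a value and a type (the value of an element is None).
name, value, kind = team.nodeName, team.nodeValue, team.nodeType

# "attributes" returns all the attributes of the node.
attrs = {attr.name: attr.value for attr in team.attributes.values()}

# appendChild puts the new node after the existing children.
new_team = doc.createElement("team")
new_team.setAttribute("current", "no")
new_team.appendChild(doc.createTextNode("Mantova"))
doc.documentElement.appendChild(new_team)

result = doc.documentElement.toxml()
print(name, value, kind, attrs)
print(result)
```
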
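A Simple Java DOM
  // Assumptions: DOMParser is taken to be the Apache Xerces parser;
  // print(Node, PrintStream) is a helper not shown on the slide.
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);                  // loads the whole document in memory
      Document doc = p.getDocument();
      Node n = doc.getDocumentElement().getFirstChild();
      // skip siblings until the first recipe element is found
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);
      out.println("</collection>");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
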
Main Problem of DOM
  The XML document is loaded as a whole, and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist

Simple API for XML (SAX)
  Differently from DOM, SAX is event-based
  It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events such as: document started / ended, element started / ended, character content, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools

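The event-based model can be sketched with a handler that simply logs what the parser reports. The example assumes Python's standard xml.sax module; the EventLogger class name and the sample document are made up for illustration. No tree is built: the handler only sees the events as the text flows through the parser.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Collects the events generated while the document flows through the parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append(f"start {name}")

    def endElement(self, name):
        self.events.append(f"end {name}")

    def characters(self, content):
        # Ignore whitespace-only chunks to keep the log short.
        if content.strip():
            self.events.append(f"text {content!r}")


handler = EventLogger()
xml.sax.parseString(b"<player><name>Carlo</name></player>", handler)

for event in handler.events:
    print(event)
```
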
Comments
  Easy
    <!-- Comment -->
  It cannot contain --, nor can it end with --->
  Comments do not affect the document tree structure
    they can appear anywhere, even before the root element
    but not inside a tag or a comment
  Parsers may either drop or keep them at their will
  Comments are meant to improve human legibility of XML docs
    to give info to a computational agent, use processing instructions

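These rules can be checked against a parser. A minimal sketch, assuming Python's standard xml.dom.minidom module (this particular parser happens to keep comments; others may drop them): it reads back the comments of a small document, then tries three misplaced or malformed comments.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Comments are allowed before the root element and inside elements.
doc = minidom.parseString("<!-- before the root --><a><!-- inside --></a>")
top_level_comments = [
    node.data for node in doc.childNodes if node.nodeType == node.COMMENT_NODE
]


def parses(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


double_hyphen_inside = parses("<a><!-- a -- b --></a>")
ends_with_three_hyphens = parses("<a><!-- end ---></a>")
comment_inside_tag = parses("<a <!-- no --> />")

print(top_level_comments)
print(double_hyphen_inside, ends_with_three_hyphens, comment_inside_tag)
```
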
XML Processing Instructions
  Need to pass information for a given application through the parser
    comments may disappear at any stage of the process
  Processing instructions have this very purpose
    <?target … ?>
  The target may be the application that has to handle it, or just an identifier for the particular processing instruction
    <?php … ?>
    <?xml-stylesheet … ?>
  A processing instruction is markup, not an element
    it can appear anywhere outside a tag, even before or after the root

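How an application picks up the processing instructions addressed to it can be sketched as follows, assuming Python's standard xml.dom.minidom module. The document content, the style.css name and the app target are hypothetical. The instructions before, inside and after the root element are all delivered as nodes with a target and a data string.

```python
from xml.dom import minidom

# Hypothetical document with processing instructions in three positions.
DOC = (
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    "<doc><?php echo 1;?></doc>"
    "<?app after-root?>"
)


def processing_instructions(node):
    """Yield (target, data) for every processing instruction under node."""
    for child in node.childNodes:
        if child.nodeType == child.PROCESSING_INSTRUCTION_NODE:
            yield child.target, child.data
        else:
            yield from processing_instructions(child)


found = list(processing_instructions(minidom.parseString(DOC)))
for target, data in found:
    print(target, "->", data)
```
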
The XML Declaration
  Looks like an XML processing instruction
    but it is not: it is just the XML declaration
  It is optional
    but if present, it must be absolutely the first thing in the document
      not even comments allowed before
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
  version is the XML version (1.0, 1.1, …)
  encoding is the form of the text (Unicode in the example)
    optional, default Unicode
  standalone="yes" means that the document does not rely on an external DTD
    optional, default no

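The "first thing in the document" rule is easy to test. The sketch below assumes Python's standard xml.etree.ElementTree module and uses a trivial document with a single element a: the declaration is accepted at the very start, and rejected when a comment or even a single space precedes it.

```python
import xml.etree.ElementTree as ET


def parses(data):
    """Return True if the parser accepts the bytes as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


declaration_first = parses(
    b'<?xml version="1.0" encoding="utf-8" standalone="no"?><a/>'
)
no_declaration = parses(b"<a/>")  # the declaration is optional
comment_before = parses(b'<!-- hi --><?xml version="1.0"?><a/>')
space_before = parses(b' <?xml version="1.0"?><a/>')

print(declaration_first, no_declaration, comment_before, space_before)
```
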
Checking Well-Formedness
  Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
  Tools on the Web
    just look around

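Any non-validating parser can serve as a well-formedness checker. A minimal sketch, assuming Python's standard Expat binding (the sample documents are made up, one per rule): each string is parsed, and the outcome is compared with what the rules above predict.

```python
from xml.parsers.expat import ExpatError, ParserCreate


def well_formed(text):
    """Return True if text is a well-formed XML document."""
    parser = ParserCreate()
    try:
        parser.Parse(text, True)
        return True
    except ExpatError:
        return False


# Sample document -> expected outcome according to the rules.
expected = {
    "<a><b></b></a>": True,             # tags match and nest properly
    "<a><b></a></b>": False,            # overlapping elements
    "<a/><b/>": False,                  # more than one root element
    "<a x=1/>": False,                  # unquoted attribute value
    '<a x="1" x="2"/>': False,          # attribute name repeated
    "<a>1 < 2</a>": False,              # unescaped < in character data
    "<a>fish & chips</a>": False,       # unescaped & in character data
    "<a>1 &lt; 2 &amp; 3</a>": True,    # escaped versions are fine
}

results = {doc: well_formed(doc) for doc in expected}
for doc, ok in results.items():
    print("well-formed" if ok else "rejected   ", doc)
```
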
DTD
Flexibility or Rigidity?
  XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
  Sometimes some strict rule is required
    some control over syntax should be enforced
      like "a football player should have at least one team"
  Document Type Definition (DTD)
    to define which XML documents are valid
  Validity is not mandatory, unlike well-formedness
    how to handle validity errors is optional

Validation
  A valid XML document includes a DTD that the document satisfies
  Main principle
    everything not permitted is forbidden
    that is, a DTD specifies positive examples
  Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
  There are many things a DTD does not say
    we stick with what we can specify

DTD is…
  SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
  It allows XML designers to define a grammar for their documents
    typical syntax-based approach
      maybe limited, but easy to implement
  Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still

A Simple DTD Example
  We do not go too deep into DTD syntax
    we just look at the example below and comment

  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>

DTD Declaration
  In the example, the DTD is declared as internal
  but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
  even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

DTD Declarations: Define or Use
  So you may
    define your own DTD, and
      either include it in your XML document
      or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
      like a working group you belong to, or a standardisation body of any sort
      by referring to that externally-defined syntax for your XML docs

Element Declarations
  In the example, a player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)

Some Syntax
  , is for sequence
    to define ordered lists
  | is for choice
    to provide for alternatives
  suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
  parentheses for grouping
    at any level of nesting
    operators and suffixes applicable to any level
  ANY for free-form content
  EMPTY for empty elements

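Element content models behave much like regular expressions over the sequence of child element names. The sketch below makes that reading explicit in Python (an assumption of this example; Python's standard library has no validating parser, so this is a hand-written illustration and not a DTD validator). The function names are made up, and only the operators , | * + ? and parentheses are handled: #PCDATA, ANY and EMPTY are out of scope.

```python
import re
import xml.etree.ElementTree as ET


def content_model_to_regex(model):
    """Translate a content model such as "(name, surname, team+)" into a regex.

    Each child name becomes a group matching "name "; sequence is plain
    concatenation, while | * + ? and parentheses keep their regex meaning.
    """
    tokens = re.findall(r"[A-Za-z_][\w.-]*|[(),|*+?]", model)
    parts = []
    for token in tokens:
        if token == ",":
            continue  # sequence: nothing to emit
        if token == "(":
            parts.append("(?:")
        elif token in ")|*+?":
            parts.append(token)
        else:
            parts.append(f"(?:{re.escape(token)} )")
    return re.compile("".join(parts))


def children_match(element, model):
    """Check the names of the element's children against the content model."""
    names = "".join(child.tag + " " for child in element)
    return content_model_to_regex(model).fullmatch(names) is not None


MODEL = "(name, surname, team+)"
valid = ET.fromstring(
    "<player><name>Carlo</name><surname>Nervo</surname>"
    "<team>Bologna</team><team>Mantova</team></player>"
)
no_team = ET.fromstring("<player><name>Carlo</name><surname>Nervo</surname></player>")
wrong_order = ET.fromstring(
    "<player><surname>Nervo</surname><name>Carlo</name><team>Bologna</team></player>"
)

print(children_match(valid, MODEL))
print(children_match(no_team, MODEL))
print(children_match(wrong_order, MODEL))
```
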
Attribute Declarations
  In the example, a team element has a current attribute
    which is mandatory
      #IMPLIED would say optional instead
    and can be either yes or no
      enumeration as an attribute type

Attribute Defaults
  #IMPLIED
    the attribute is optional
  #REQUIRED
    the attribute is mandatory
  #FIXED
    whether it is explicitly specified or not, it has a given value
  literal
    the default value is the literal quoted string

XML Processing Instructions
Need to pass information for a given application through the parsercomments may disappear at any stage of the process
Processing instructions have this very endlttarget hellip gt
The target may be the application that has to handle or just an identifier for the particular processing instructionltphp hellip gtltxml-stylesheet hellip gt
A processing instruction is markup not an elementit can appear everywhere out of a tag even before or after the root
34
The XML Declaration
Looks like an XML processing instructionbut it is not just the XML declaration
It is optionalbut if there should be the first thing in the document absolutely
not even comments allowed beforeltxml version=10 encoding=utf-8 standalone=nogt
Version is the XML version (10 11 hellip)Encoding is the form of the text (Unicode in the example)
optional default UnicodeStandalone means that it has no external DTD
optional default no
35
Checking Well-FormednessMain rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
Flexibility or Rigidity
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one teamDocument Type Definition (DTD)
to define which XML documents are validValidity is not mandatory as well-formedness
how to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tagsprefix x-
such as x-quenya58
Encoding for Portability
Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability
When transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time
59
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML, however, can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following we explore the XML + CSS issue
Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
  Standard W3C
    http://www.w3.org/Style/CSS
  Goals
    describing how to present elements of a document
      spanning over a range of different media
    separating style description from content and structure
  In this course we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing instruction
    to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
  CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
  No need for DTD or Schema
    even though the browser could anyway complain…
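The association between document and style sheet is an ordinary processing instruction in the document prolog, so any DOM parser can read it. The following Java sketch shows this; the class name, the sample document and the style sheet name style.css are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

final class StylesheetLink {
    // Returns the data of the first top-level xml-stylesheet processing
    // instruction, or null when the document has none.
    static String stylesheetData(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof ProcessingInstruction) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                if ("xml-stylesheet".equals(pi.getTarget())) {
                    return pi.getData();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String xml = "<?xml-stylesheet type=\"text/css\" href=\"style.css\"?>"
                + "<note>hi</note>";
        System.out.println(stylesheetData(xml));
    }
}
```

Note that the pseudo-attributes type and href are just text inside the instruction; the DOM hands them over as one string.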
XML + CSS Example: The XML Doc
Example: How Mozilla Visualises it [without CSS Style Sheet]
Example: How Mozilla Visualises it [with CSS Style Sheet]
DOM & SAX
Manipulating XML
  Representing information in an XML document
    and presenting it somehow
  is not enough for most non-trivial application scenarios
  Mostly, we need to manipulate
    access, delete, modify
  parts of an XML document
    which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used / hated / deprecated / widespread are
    DOM
    SAX
Document Object Model
  http://www.w3.org/DOM
    standard W3C, as usual
  "The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope
DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
  It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
      primitive navigation and manipulation of XML trees
      other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
      adds Namespace support and minor new features
      other Level 2 parts: Events, Views, Style, Traversal and Range
    Level 3 Core: W3C Recommendation, April 2004
      adds minor new features
DOM Nodes
  An XML document is a tree
  The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
  The DOM specification states that a node can be one of
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
  It also defines which kind of child nodes they should / could have
Properties & Methods of DOM
  Every DOM node has properties and methods to explore and update the XML tree
  Every DOM node has a name, a value, a type
  There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
  Then, any specific kind of node has its own specific properties and methods
  These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
    see for instance http://java.sun.com/xml/jaxp
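The general node properties listed above map onto the Java DOM binding as getNodeName(), getNodeValue(), getNodeType(), getAttributes() and appendChild(). The sketch below uses the JAXP parser shipped with the JDK; the class name, the helper appendTextElement and the sample document are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NodeTour {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    // Creates <name>text</name> and appends it after the existing children.
    static Element appendTextElement(Element parent, String name, String text) {
        Document doc = parent.getOwnerDocument();
        Element child = doc.createElement(name);
        child.appendChild(doc.createTextNode(text));
        parent.appendChild(child);
        return child;
    }

    public static void main(String[] args) {
        Document doc = parse("<player><name>Carlo</name>"
                + "<team current=\"yes\">Bologna</team></player>");
        Element player = doc.getDocumentElement();
        appendTextElement(player, "team", "Mantova").setAttribute("current", "no");

        for (Node n = player.getFirstChild(); n != null; n = n.getNextSibling()) {
            // Every node exposes a name, a value and a type.
            System.out.println(n.getNodeName() + " type=" + n.getNodeType()
                    + " value=" + n.getNodeValue());
            NamedNodeMap attrs = n.getAttributes(); // null for non-element nodes
            for (int i = 0; attrs != null && i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                System.out.println("  @" + a.getNodeName() + "=" + a.getNodeValue());
            }
        }
    }
}
```

For element nodes the value is null; the text lives in a child text node, which is why the helper creates one explicitly.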
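A Simple Java DOM
  // Prints the first recipe element of the document named by args[0].
  // DOMParser is the Xerces parser (org.apache.xerces.parsers.DOMParser);
  // Document and Node come from org.w3c.dom, PrintStream from java.io.
  // print(Node, PrintStream) is a serialisation helper not shown on the slide.
  public static void main(String[] args) {
    try {
      DOMParser p = new DOMParser();
      p.parse(args[0]);
      Document doc = p.getDocument();
      Node n = doc.getDocumentElement().getFirstChild();
      // skip siblings until the first recipe element
      while (n != null && !n.getNodeName().equals("recipe"))
        n = n.getNextSibling();
      PrintStream out = System.out;
      out.println("<?xml version=\"1.0\"?>");
      out.println("<collection>");
      if (n != null)
        print(n, out);
      out.println("</collection>");
    } catch (Exception e) { e.printStackTrace(); }
  }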
Main Problem of DOM
  The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist
Simple API for XML (SAX)
  Differently from DOM, SAX is event-based
  It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as soon as they occur: document started / ended, elements started / ended, character content, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools
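The event-based model can be sketched with the SAX API bundled with the JDK. The class name and the logging format below are illustrative assumptions. A handler receives callbacks while the text flows through the parser, and no tree is ever built.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEvents {
    // Parses the document and returns the sequence of events it generated.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one text run in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text " + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return log;
    }

    public static void main(String[] args) {
        events("<a><b>hi</b></a>").forEach(System.out::println);
    }
}
```

The handler keeps only what it chooses to keep (here a log), which is where the memory advantage over DOM comes from.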
The XML Declaration
  Looks like an XML processing instruction
    but it is not: it is just the XML declaration
  It is optional
    but if there, it should be the first thing in the document, absolutely
    not even comments allowed before
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Version is the XML version (1.0, 1.1, …)
  Encoding is the form of the text (Unicode in the example)
    optional, default Unicode (UTF-8, or UTF-16 when signalled by a byte-order mark)
  Standalone ("yes") means that the document does not depend on an external DTD
    optional, default no
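The three parts of the declaration are visible to programs through the DOM Level 3 methods getXmlVersion(), getXmlEncoding() and getXmlStandalone(). Here is a minimal sketch using the JDK parser; the class name and the sample document are illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class DeclarationInfo {
    // Parses from bytes so that the declared encoding is actually used.
    static Document parse(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\"?><note/>");
        System.out.println("version:    " + doc.getXmlVersion());
        System.out.println("encoding:   " + doc.getXmlEncoding());
        System.out.println("standalone: " + doc.getXmlStandalone());
    }
}
```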
Checking Well-Formedness
  Main rules
    perfect match between start and end tags
    no overlapping elements
    one and only one root element
    attribute values are always quoted
    at most one attribute with a given name per element
    neither comments nor processing instructions within tags
    no unescaped < or & signs in the character data of elements or attributes
    …
  Tools on the Web
    Just look around
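Any XML parser doubles as a well-formedness checker, since it must reject documents that break these rules. The sketch below uses the JDK's JAXP parser; the class and method names are illustrative assumptions. Parsing is attempted and a fatal parse error is treated as "not well-formed".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedness {
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness rule was violated
        } catch (Exception e) {
            throw new IllegalStateException("parser could not be used", e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a><b></b></a>",   // fine
            "<a><b></a></b>",   // overlapping elements
            "<a/><b/>",         // two root elements
            "<a x=1/>",         // unquoted attribute value
            "<a>1 < 2</a>"      // unescaped < in character data
        };
        for (String s : samples) {
            System.out.println(isWellFormed(s) + "  " + s);
        }
    }
}
```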
DTD
Flexibility or Rigidity?
  XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
  Sometimes some strict rule is required
    some control over syntax should be enforced
    like "a football player should have at least one team"
  Document Type Definition (DTD)
    to define which XML documents are valid
  Validity is not mandatory, as well-formedness is
    handling validity errors is optional
Validation
  A valid XML document includes a DTD that the document satisfies
  Main principle
    everything not permitted is forbidden
    that is, DTDs specify positive examples
  Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
  Many things a DTD does not say
    we stick with what we can specify
DTD is…
  SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
  It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
  Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, how to write your own ones is likely to be useful, or maybe necessary, for a while still
A Simple DTD Example
  We do not go too deep into DTD syntax
    we just look at the example below and comment
  <?xml version="1.0" standalone="yes"?>
  <!DOCTYPE player [
    <!ELEMENT player (name, surname, team+)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT surname (#PCDATA)>
    <!ELEMENT team (#PCDATA)>
    <!ATTLIST team current (yes | no) #REQUIRED>
  ]>
  <player>
    <name>Carlo</name>
    <surname>Nervo</surname>
    <team current="yes">Bologna</team>
    <team current="no">Mantova</team>
  </player>
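Validation against an internal DTD can be tried with the JDK's JAXP parser. The sketch below is one possible setup, not the only one; the class and method names are illustrative assumptions. It switches DTD validation on and turns reported validity errors into a boolean. The DOCTYPE name is set to player, because a validating parser requires it to match the root element name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class DtdValidation {
    static final String DTD =
            "<!DOCTYPE player [\n"
          + "  <!ELEMENT player (name, surname, team+)>\n"
          + "  <!ELEMENT name (#PCDATA)>\n"
          + "  <!ELEMENT surname (#PCDATA)>\n"
          + "  <!ELEMENT team (#PCDATA)>\n"
          + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
          + "]>\n";

    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // enable DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity errors are only reported, so rethrow them explicitly.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored */ }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // invalid or not well-formed
        } catch (Exception e) {
            throw new IllegalStateException("parser could not be used", e);
        }
    }

    public static void main(String[] args) {
        String ok = DTD + "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = DTD + "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println("with a team:  " + isValid(ok));
        System.out.println("without team: " + isValid(noTeam));
    }
}
```

A document with no DTD at all is also reported as invalid by this check, in line with the principle that a valid document includes a DTD.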
DTD Declaration
  In the example, the DTD is declared as internal
    but could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">
DTD Declarations: Define or Use
  So you may
    define your own DTD and
      either include it in your XML document
      or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
      like a working group you belong to, or a standardisation body of any sort
      by referring to that externally-defined syntax for your XML docs
Element Declarations
  In the example, a player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)
Some Syntax
  , is for sequence
    to define ordered lists
  | is for choice
    to provide for alternatives
  suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
  parentheses for grouping
    at any level of nesting
    operators and suffixes applicable to any level
  ANY for free-form content
  EMPTY for empty element
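The operators above can be tried out by validating a few documents against one content model. The following Java sketch assumes the JDK's JAXP parser. The book DTD, the class name and the method names are invented for illustration. The model (title, (author | editor)+, note?) combines sequence, choice, grouping and two suffixes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class ContentModels {
    // Hypothetical DTD: one title, then one or more authors or editors
    // in any mix, then an optional note.
    static final String DTD =
            "<!DOCTYPE book [\n"
          + "  <!ELEMENT book (title, (author | editor)+, note?)>\n"
          + "  <!ELEMENT title (#PCDATA)>\n"
          + "  <!ELEMENT author (#PCDATA)>\n"
          + "  <!ELEMENT editor (#PCDATA)>\n"
          + "  <!ELEMENT note (#PCDATA)>\n"
          + "]>\n";

    // Validates <book>children</book> against the DTD above.
    static boolean accepts(String children) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored */ }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            String xml = DTD + "<book>" + children + "</book>";
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("parser could not be used", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("<title>t</title><author>a</author>"));
        System.out.println(accepts("<title>t</title><editor>e</editor><author>a</author><note>n</note>"));
        System.out.println(accepts("<author>a</author><title>t</title>")); // wrong order
        System.out.println(accepts("<title>t</title>"));                   // + needs at least one
    }
}
```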
Attribute Declarations
  In the example, a team element has a current attribute
    which is mandatory
      #IMPLIED would say optional instead
    and can be either yes or no
      enumeration as an attribute type
Attribute Defaults
  #IMPLIED
    the attribute is optional
  #REQUIRED
    the attribute is mandatory
  #FIXED
    whether it is explicitly specified or not, it has a given value
  literal
    the default value is the literal quoted string
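A literal default is filled in by the parser itself, so an application sees the attribute even when the document omits it. The Java sketch below assumes the JDK's JAXP parser. The class name and the variant of the team declaration, with a default of "no" instead of #REQUIRED, are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class AttributeDefaults {
    // Assumed variant of the team declaration with a literal default.
    static final String DTD =
            "<!DOCTYPE team [\n"
          + "  <!ELEMENT team (#PCDATA)>\n"
          + "  <!ATTLIST team current (yes | no) \"no\">\n"
          + "]>\n";

    // Returns the value of the current attribute as the application sees it.
    static String currentOf(String teamXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DTD + teamXml)));
            return doc.getDocumentElement().getAttribute("current");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Attribute omitted: the default from the DTD is supplied.
        System.out.println(currentOf("<team>Mantova</team>"));
        // Attribute given: the explicit value wins.
        System.out.println(currentOf("<team current=\"yes\">Bologna</team>"));
    }
}
```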
Attribute Types
  CDATA
    any string of text acceptable in a well-formed XML attribute value
  NMTOKEN, NMTOKENS
    more permissive than an XML name: any name character is accepted as the first character
    the plural form accepts more than one, separated by whitespace
  ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
  ID
    an XML name, unique in the document, working as an identifier
  IDREF, IDREFS
    reference(s) to IDs in the document
  NOTATION
    name of a notation used & defined in the document (rare)
  enumeration
Other DTD Declarations etc.
  ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
  NOTATION declarations
    who cares, actually
  We stop here
    more only for those who need it
Namespaces
What are Namespaces for?
  Distinguish
    different XML applications may use the same names
      at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
  Group
    names of elements and attributes of the same XML application can be grouped together
      to be more easily recognised and handled
  Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
Syntax for Namespace Use
  Qualified names
    prefix:local_part
  Examples of qualified names
    or QNames, or raw names
    rdf:description, xlink:type, xsl:template
  Used for both element and attribute names
Associating Prefixes to URI
  Example
    a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3c.org/TR/REC-rdf-syntax">
  URIs are standardised, not prefixes
    but usually svg, rdf and other prefixes are not re-defined
    also, they are conventional names
      not necessarily pointing to an actual resource
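A namespace-aware parser resolves each prefix to its URI, and it is the URI, not the prefix, that identifies the name. The Java sketch below uses the JDK's JAXP parser with namespace awareness switched on. The class name and the office elements are illustrative assumptions built on the company example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NamespacePrefixes {
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<company xmlns:local=\"http://www.company.it/xml\""
              + " xmlns:world=\"http://www.company.com/xml\">"
              + "<local:office/><world:office/></company>");
        Node child = doc.getDocumentElement().getFirstChild();
        for (; child != null; child = child.getNextSibling()) {
            // Same local part, different namespaces.
            System.out.println(child.getPrefix() + " : " + child.getLocalName()
                    + " -> " + child.getNamespaceURI());
        }
    }
}
```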
Setting Default Namespaces
  xmlns attribute
    alone, no suffix
  <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
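The effect of a default namespace can be observed with a namespace-aware DOM parser. The sketch below rests on assumptions: the class and method names are illustrative, the SVG namespace URI is taken to be http://www.w3.org/2000/svg and the MathML one http://www.w3.org/1998/Math/MathML. It counts the set elements of each vocabulary in a single document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DefaultNamespace {
    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    // Counts the elements with the given local name in the given namespace.
    static int count(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed elements take the default (SVG) namespace;
        // the m: prefix selects MathML for one element.
        String xml = "<svg xmlns=\"" + SVG + "\" xmlns:m=\"" + MATHML + "\">"
                + "<g><set/></g><m:set/></svg>";
        System.out.println("SVG set elements:    " + count(xml, SVG, "set"));
        System.out.println("MathML set elements: " + count(xml, MATHML, "set"));
    }
}
```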
Checking Well-FormednessMain rules
perfect match between start and end tagsno overlapping elementsone and only one root elementsattribute values are always quotedat most one attribute with a given name per elementneither comments nor processing instructions within tagsno unescaped gt or amp signs in the character data of elements or attributeshellip
Tools on the WebJust look around
36
DTD
Flexibility or Rigidity
XML is flexiblewhatever this meansbut sometimes flexibility is not a feature within a given application scenario
Sometimes some strict rule is requiredsome control over syntax should be enforced
like a football player should have at least one teamDocument Type Definition (DTD)
to define which XML documents are validValidity is not mandatory as well-formedness
how to handle errors is optional
38
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tagsprefix x-
such as x-quenya58
Encoding for Portability
Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability
When transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time
59
XML amp CSS
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip
No need for DTD or Schemaeven though the browser could anyway complainhellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it[without CSS Style Sheet]
66
Example How Mozilla Visualises it[with CSS Style Sheet]
67
DOM amp SAX
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sortsthrough ad hoc API
The most used hated deprecated widespread areDOMSAX
69
Document Object Model
httpwwww3orgDOMstandard W3C as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range
Level 3 Core W3C Working Draft April 2002adds minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Javasee for instance httpjavasuncomxmljaxp
73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of DOM
  The XML document is loaded as a whole and handled entirely in memory
    this can be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has had problems of acceptance
    but has a long tail of followers
    and good reasons to exist
Simple API for XML (SAX)
  Unlike DOM, SAX is event-based
  It sees the document not as a tree, but as a text document
    flowing through the SAX parser
    and generating events along the way: document started / ended, element started / ended, character content, etc.
  A very simple model
    good for simple applications
    and also to avoid excessive memory use
  Not as well supported as DOM
    in terms of standardisation
    as well as of tools
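The event-based model can be illustrated with the SAX parser shipped with the JDK. The sketch below simply records the events in the order the parser fires them; the class name SaxEventsSketch and the method events are assumptions made for this example, and no tree is ever built.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class SaxEventsSketch {

    // Returns a log of the SAX events generated while the document flows through the parser.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name><surname>Nervo</surname></player>")) {
            System.out.println(event);
        }
    }
}
```

The handler only sees one event at a time, so any state the application needs (here, the log) has to be kept by the application itself. This is the price paid for not holding the whole document in memory.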
DTD
Flexibility or Rigidity?
  XML is flexible
    whatever this means
    but sometimes flexibility is not a feature within a given application scenario
  Sometimes strict rules are required
    some control over syntax should be enforced
    e.g., a football player should have at least one team
  Document Type Definition (DTD)
    defines which XML documents are valid
  Validity is not mandatory, unlike well-formedness
    how to handle validity errors is optional
Validation
  A valid XML document includes a DTD that the document satisfies
  Main principle
    everything not permitted is forbidden
    that is, a DTD specifies only what is allowed
  Everything in the XML document must match a DTD declaration
    then the document is valid
    otherwise the document is invalid
  There are many things a DTD does not say
    we stick with what we can specify
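As a rough illustration of "everything not permitted is forbidden", a validating JAXP parser can be asked to report validity errors against an internal DTD. The class name DtdValidationSketch, the DTD constant and the method validityErrors are hypothetical names chosen for this sketch; the DTD itself describes a football player with at least one team.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class DtdValidationSketch {

    // Internal DTD subset: a player has one name, one surname, one or more teams.
    static final String DTD =
            "<!DOCTYPE player [\n"
          + "  <!ELEMENT player (name, surname, team+)>\n"
          + "  <!ELEMENT name (#PCDATA)>\n"
          + "  <!ELEMENT surname (#PCDATA)>\n"
          + "  <!ELEMENT team (#PCDATA)>\n"
          + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
          + "]>\n";

    // Returns the validity errors reported by the parser (empty if the document is valid).
    static List<String> validityErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation is off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                // Validity errors are recoverable: collect them and let parsing continue.
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // e.g. the document is not even well-formed
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = DTD + "<player><name>Carlo</name><surname>Nervo</surname>"
                + "<team current=\"yes\">Bologna</team></player>";
        String noTeam = DTD + "<player><name>Carlo</name><surname>Nervo</surname></player>";
        System.out.println(validityErrors(valid).isEmpty());  // true
        System.out.println(validityErrors(noTeam).isEmpty()); // false: team+ needs at least one team
    }
}
```

Note that the name after DOCTYPE has to match the root element (player) for the document to be valid, and that a document failing validation is still parsed, since only well-formedness errors are fatal.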
DTD is...
  SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
  It allows XML designers to define a grammar for their documents
    typical syntax-based approach
    maybe limited, but easy to implement
  Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, and how to write your own is likely to be useful, or maybe necessary, for a while still
A Simple DTD Example
  We do not go too deep into DTD syntax
    we just look at the example below and comment

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
DTD Declaration
  In the example, the DTD is declared as internal
    but it could be declared separately
      <!DOCTYPE player SYSTEM "football_player.dtd">
    even referring to an external shared resource
      <!DOCTYPE player SYSTEM "http://...">
DTD Declarations: Define or Use
  So you may
    define your own DTD, and
      either include it in your XML document
      or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
      like a working group you belong to, or a standardisation body of any sort
      by referring to that externally-defined syntax from your XML docs
Element Declarations
  A player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)
Some Syntax
  , is for sequence
    to define ordered lists
  | is for choice
    to provide for alternatives
  suffixes
    * for zero or more occurrences
    + for one or more occurrences
    ? for zero or one occurrence
  parentheses for grouping
    at any level of nesting
    operators and suffixes applicable at any level
  ANY for free-form content
  EMPTY for empty elements
Attribute Declarations
  A team element has a current attribute
    which is mandatory (#REQUIRED)
      #IMPLIED would make it optional instead
    and can be either yes or no
      enumeration as an attribute type
Attribute Defaults
  #IMPLIED
    the attribute is optional
  #REQUIRED
    the attribute is mandatory
  #FIXED
    whether or not it is explicitly specified, the attribute has the given value
  literal
    the default value is the quoted literal string
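The effect of a literal default and of #FIXED can be observed from Java: the parser fills in the declared values when the attribute is missing from the document. In this sketch the class AttributeDefaultSketch and the method attributesOf are made-up names, and the sport attribute is a hypothetical addition used only to show #FIXED.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class AttributeDefaultSketch {

    // "current" has the literal default "no"; "sport" (hypothetical) is fixed to "football".
    static final String DTD =
            "<!DOCTYPE team [\n"
          + "  <!ELEMENT team (#PCDATA)>\n"
          + "  <!ATTLIST team current (yes | no) \"no\"\n"
          + "                 sport CDATA #FIXED \"football\">\n"
          + "]>\n";

    // Parses a single team element and returns "current/sport" as seen through the DOM.
    static String attributesOf(String teamElement) {
        try {
            Element team = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DTD + teamElement)))
                    .getDocumentElement();
            return team.getAttribute("current") + "/" + team.getAttribute("sport");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attributesOf("<team>Mantova</team>"));                 // no/football
        System.out.println(attributesOf("<team current=\"yes\">Bologna</team>")); // yes/football
    }
}
```

The defaults are applied here even though validation is not switched on, because the declarations sit in the internal subset, which the parser always reads.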
Attribute Types
  CDATA
    any string of text acceptable in a well-formed XML attribute value
  NMTOKEN, NMTOKENS
    like an XML name, but any name character is accepted as the first character too
    the plural form accepts more than one token, separated by whitespace
  ENTITY, ENTITIES
    name(s) of unparsed entities declared elsewhere in the document
  ID
    an XML name, unique in the document, working as an identifier
  IDREF, IDREFS
    reference(s) to IDs in the document
  NOTATION
    name of a notation used & defined in the document (rare)
  enumeration
Other DTD Declarations, etc.
  ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
  NOTATION declarations
    who cares, actually?
  We stop here
    more only for those who need it
Namespaces
What are Namespaces for?
  Distinguish
    different XML applications may use the same names
      at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
  Group
    names of elements and attributes of the same XML application can be grouped together
      to be more easily recognised and handled
  Example: set is an element in both SVG and MathML
    what if I have to use them together?
    namespaces can be used to disambiguate names
Syntax for Namespace Use
  Qualified names
    prefix:local_part
    also called QNames, or raw names
  Examples of qualified names
    rdf:description, xlink:type, xsl:template
  Used for both element and attribute names
Associating Prefixes to URIs
  Example
    a large firm could have a number of namespaces for different purposes
      <company xmlns:local="http://www.company.it/xml"
               xmlns:euro="http://www.company.eu/xml"
               xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
  URIs are standardised, not prefixes
    but usually svg, rdf and other well-known prefixes are not re-defined
    also, URIs are conventional names
      not necessarily pointing to an actual resource
Setting Default Namespaces
  xmlns attribute
    alone, with no :prefix suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="..." height="...">...</svg>
    all the elements inside (including svg) are implicitly associated with the http://www.w3.org/2000/svg namespace
    no need to make the svg prefix explicit
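How a parser resolves prefixes and default namespaces can be checked with a namespace-aware JAXP parser. This is a sketch under stated assumptions: NamespaceSketch, parse and expandedName are illustrative names, the office element is invented, and the {uri}local notation is just a convenient way to print a name together with its namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class NamespaceSketch {

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // namespace processing is off by default in JAXP
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Prints a name as {namespace URI}local_part, or just local_part if there is no namespace.
    static String expandedName(Node n) {
        String uri = n.getNamespaceURI();
        return (uri == null ? "" : "{" + uri + "}") + n.getLocalName();
    }

    public static void main(String[] args) {
        // Default namespace: svg and everything inside it belong to the SVG namespace.
        Node svg = parse("<svg xmlns=\"http://www.w3.org/2000/svg\"><rect/></svg>")
                .getDocumentElement();
        System.out.println(expandedName(svg));                 // {http://www.w3.org/2000/svg}svg
        System.out.println(expandedName(svg.getFirstChild())); // {http://www.w3.org/2000/svg}rect

        // Prefixed namespace: only the names carrying the prefix are in the namespace.
        Node company = parse("<company xmlns:local=\"http://www.company.it/xml\">"
                + "<local:office/></company>").getDocumentElement();
        System.out.println(expandedName(company));                 // company
        System.out.println(expandedName(company.getFirstChild())); // {http://www.company.it/xml}office
    }
}
```

The prefix itself never appears in the expanded name: two documents using different prefixes for the same URI refer to the same names.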
Internationalisation
What does "Text" Mean?
  "Text" can be encoded according to many different alphabets
  A character set is a mapping between characters and integers (code points)
    ASCII being the most (in)famous, now Unicode
  A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
  Any XML document is a text document
    so its encoding should be declared
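The difference between a code point and its encodings is easy to see in Java, where the same character yields a different number of bytes depending on the charset. The class name EncodingSketch and the method byteLength are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class EncodingSketch {

    // Number of bytes needed to store the string in the given encoding.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8"; // LATIN SMALL LETTER E WITH GRAVE
        String euro = "\u20AC";   // EURO SIGN

        // One code point each, whatever the encoding.
        System.out.println(eGrave.codePointAt(0)); // 232
        System.out.println(euro.codePointAt(0));   // 8364

        // The same code points mapped onto bytes by different encodings.
        System.out.println(byteLength(eGrave, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(byteLength(eGrave, StandardCharsets.UTF_8));      // 2
        System.out.println(byteLength(eGrave, StandardCharsets.UTF_16BE));   // 2
        System.out.println(byteLength(euro, StandardCharsets.UTF_8));        // 3
        System.out.println(byteLength(euro, StandardCharsets.UTF_16BE));     // 2
    }
}
```

A reader who guesses the wrong encoding will split these bytes differently and see different characters, which is why the document should say which encoding it uses.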
The XML Encoding: Part of the XML Declaration
    <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
    Unicode and ISO are the most used families
  Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
      it is then a text declaration, no longer an XML declaration
Multi-Lingual Documents
  Example: a spell-checker or a voice-reader parsing an XML doc
  How to determine the language of a subpart
    for multi-lingual docs?
  xml:lang attribute
    can be associated to any element
    determines the language of the element
  Values are to be found in ISO 639
    standard two-letter code for each known language
    if not there, IANA
      prefix i-
      such as i-navajo, i-klingon, ...
    if not there either, such as for user-defined tags
      prefix x-
      such as x-quenya
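A tool such as the spell-checker mentioned above has to find out which language applies to a given piece of the document. Since xml:lang set on an element also covers its content unless overridden, one possible approach is to walk up the tree until an xml:lang attribute is found. The names XmlLangSketch, parse and languageOf are assumptions made for this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class XmlLangSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the nearest xml:lang value found on the node or its ancestors, or "" if none.
    static String languageOf(Node node) {
        for (Node cur = node; cur != null; cur = cur.getParentNode()) {
            if (cur instanceof Element) {
                Element e = (Element) cur;
                if (e.hasAttribute("xml:lang")) {
                    return e.getAttribute("xml:lang");
                }
            }
        }
        return "";
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\">Ciao</p></doc>");
        System.out.println(languageOf(doc.getElementsByTagName("p").item(0))); // en
        System.out.println(languageOf(doc.getElementsByTagName("p").item(1))); // it
    }
}
```

The first paragraph has no xml:lang of its own and takes the language of the enclosing doc element; the second one overrides it.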
Encoding for Portability
  Working around encoding is not simply an "internationalisation" issue
    it is also about portability
  When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML's abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML, however, can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following, we explore the XML + CSS approach
Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
  W3C standard
    http://w3c.org/Style/CSS
  Goals
    describing how to present elements of a document
      spanning over a range of different media
    separating style description from content and structure
  In this course we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
  [example shown as a figure on the slide]
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing instruction
    to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
  CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; ... }
  No need for DTD or Schema
    even though the browser could complain anyway...
XML + CSS Example: The XML Doc
  [document shown as a figure on the slide]
Example: How Mozilla Visualises it [without CSS Style Sheet]
  [screenshot on the slide]
Example: How Mozilla Visualises it [with CSS Style Sheet]
  [screenshot on the slide]
DOM & SAX
Manipulating XML
  Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
  Often we need to manipulate (access, delete, modify)
    parts of an XML document
    which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used / hated / deprecated / widespread are
    DOM
    SAX
Document Object Model
  http://www.w3.org/DOM
    W3C standard, as usual
  The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope
DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
Validation
A valid XML Document includes a DTD the document satisfiesMain principle
everything not permitted is forbiddenthat is DTDs specifies positive examples
Everything in the XML document must match a DTD declarationthen the document is validotherwise the document is invalid
Many things a DTD does not saywe stick with what we can specify
39
DTD ishellip
SGML-basedsyntax a bit awkwardbut after all easy to understandand quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documentstypical syntax-based approach
maybe limited but easy to implementMaybe DTD is not the future of XML document validation
XML Schema should be thatbut understanding DTDs how to modify them how to write your own ones is likely to be useful or maybe necessary for a while still
40
A Simple DTD Example
We do not go too deep into DTD syntaxwe just look at the example above and comment
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
41
DTD Declaration
DTD is declared here as internalbut could be declared separately
ltDOCTYPE football_player SYSTEM football_playerdtdgteven referring to an external shared resource
ltDOCTYPE football_player SYSTEM httphellipgt
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
42
DTD Declarations Define or Use
So you maydefine your own DTD and
either include it in your XML documentor save it as an independent document and refer from one or more XML docs
or use an external DTD defined by someone elselike a working group you belong to or a standardisation body of any sortby referring to that externally-defined syntax for your XML docs
43
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for?
Distinguish
    different XML applications may use the same names
        at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
Group
    names of elements and attributes of the same XML application can be grouped together
        to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names

Syntax for Namespace Use
Qualified names
    prefix:local_part
Examples of qualified names
    or QNames, or raw names
    rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes to URI
Example
    a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world everywhere as prefixes
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
URIs are standardised, prefixes are not
    but usually svg, rdf and other well-known prefixes are not re-defined
    also, URIs are conventional names
        not necessarily pointing to an actual resource

Setting Default Namespaces
xmlns attribute
    alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
        no need to make the svg prefix explicit

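To see how prefixes and default namespaces disambiguate names, here is a small Java sketch using the JDK's namespace-aware DOM parser. It mixes an SVG set and a MathML set in one document, as in the earlier example. The class name NamespaceSketch, the prefix m and the tiny sample document are assumptions made for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class NamespaceSketch {

    static final String SVG = "http://www.w3.org/2000/svg";
    static final String MATHML = "http://www.w3.org/1998/Math/MathML";

    // SVG is the default namespace; the prefix "m" (an arbitrary choice)
    // is bound to MathML. Both namespaces define an element called "set".
    static final String MIXED =
        "<svg xmlns=\"" + SVG + "\" xmlns:m=\"" + MATHML + "\">"
        + "<set/><m:set/></svg>";

    // Parses the text with namespace processing switched on.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts the "set" elements that belong to the given namespace.
    static int countSets(String xml, String namespaceUri) {
        return parseRoot(xml).getElementsByTagNameNS(namespaceUri, "set").getLength();
    }

    // Namespace of the root element, here assigned by the default xmlns.
    static String rootNamespace(String xml) {
        return parseRoot(xml).getNamespaceURI();
    }

    public static void main(String[] args) {
        System.out.println("root namespace: " + rootNamespace(MIXED));
        System.out.println("SVG set elements: " + countSets(MIXED, SVG));
        System.out.println("MathML set elements: " + countSets(MIXED, MATHML));
    }
}
```

The two set elements share a local name, but each is found only under its own namespace URI; the prefix itself plays no role in the lookup.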
Internationalisation
What does "Text" Mean?
"Text" can be encoded according to so many different alphabets
    mapping between characters and integers (code points)
        character set
    ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
    so encoding should be declared

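The difference between a code point and its encoded bytes can be made concrete with a few lines of Java. This is only an illustrative sketch: the class name EncodingSketch and the sample character (è, U+00E8) are choices made here, not part of the slides.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSketch {

    // The code point is a property of the character set (Unicode here).
    static int codePoint(String s) {
        return s.codePointAt(0);
    }

    // The number of bytes depends on the chosen character encoding.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String eGrave = "\u00e8"; // LATIN SMALL LETTER E WITH GRAVE
        System.out.println("code point: U+00" + Integer.toHexString(codePoint(eGrave)).toUpperCase());
        System.out.println("UTF-8 bytes: " + encodedLength(eGrave, StandardCharsets.UTF_8));
        // UTF_16BE is used so that no byte-order mark is counted.
        System.out.println("UTF-16 bytes: " + encodedLength(eGrave, StandardCharsets.UTF_16BE));
        System.out.println("ISO-8859-1 bytes: " + encodedLength(eGrave, StandardCharsets.ISO_8859_1));
    }
}
```

The same code point takes two bytes in UTF-8, two in UTF-16 and one in ISO-8859-1, which is why a reader of the bytes has to know the encoding in advance.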
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
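As a rough illustration of why the declaration matters, the Java sketch below serialises the same small document with two different encodings and lets the JDK's DOM parser decode the bytes by reading the encoding declaration. The class name EncodingDeclarationSketch, the method rootText and the sample content are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

final class EncodingDeclarationSketch {

    // Sample text with one non-ASCII character (o with grave accent).
    static final String NAME = "Nicol\u00f2";

    // Writes the document as bytes in the given encoding, then parses the
    // raw bytes; the parser picks the decoder from the XML declaration.
    static String rootText(String declaredEncoding, Charset actualEncoding) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
            + "<name>" + NAME + "</name>";
        try {
            byte[] bytes = xml.getBytes(actualEncoding);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes))
                .getDocumentElement()
                .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Different bytes on disk, same characters after parsing.
        System.out.println(rootText("ISO-8859-1", StandardCharsets.ISO_8859_1));
        System.out.println(rootText("utf-8", StandardCharsets.UTF_8));
    }
}
```

In both calls the declared encoding matches the bytes actually written; when the two disagree, the parser may reject the document or silently produce the wrong characters.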
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
    for multi-lingual docs
xml:lang attribute
    can be associated to any element
    determines the language of the element
Values are to be found in ISO 639
    standard: two letters for each language known
    if not there, IANA
        prefix i-
        such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
        prefix x-
        such as x-quenya

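A tool such as the spell-checker mentioned above needs the language in force for each element. Since an xml:lang value applies to the element and its content until a descendant overrides it, one possible approach is to walk up the tree to the nearest declaration. The Java sketch below does this with the JDK DOM API; the class name XmlLangSketch, its methods and the sample document are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XmlLangSketch {

    // Hypothetical two-language document.
    static final String SAMPLE =
        "<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\">Ciao</p></doc>";

    // Returns the nearest xml:lang value, looking at the element itself
    // and then at its ancestors; empty string if none is declared.
    static String languageOf(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute("xml:lang")) {
                return e.getAttribute("xml:lang");
            }
        }
        return "";
    }

    // Convenience wrapper: language of the index-th element named tag.
    static String languageOf(String xml, String tag, int index) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return languageOf((Element) doc.getElementsByTagName(tag).item(index));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(languageOf(SAMPLE, "p", 0)); // inherited from doc
        System.out.println(languageOf(SAMPLE, "p", 1)); // overridden locally
    }
}
```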
Encoding for Portability
Working around encoding is not simply an "internationalisation" issue
    it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
XML abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time

XML & CSS

XML on Browsers
Different experiences with different browsers
    when trying to visualise an XML document
XML, however, can be transformed
    to become easier to handle by standard browsers
Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets
Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
    http://w3c.org/Style/CSS
Goals
    describing how to present elements of a document
        spanning over a range of different media
    separating style description from content and structure
In this course we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example
[slide image not reproduced in this transcript]

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
Processing instruction
    to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
No need for DTD or Schema
    even though the browser could anyway complain…

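XML + CSS Example: The XML Doc
[slide image not reproduced in this transcript]

Example: How Mozilla Visualises it [without CSS Style Sheet]
[slide image not reproduced in this transcript]

Example: How Mozilla Visualises it [with CSS Style Sheet]
[slide image not reproduced in this transcript]
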
DOM & SAX

Manipulating XML
Representing information in an XML document
    and presenting it somehow
is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
    access, delete, modify
parts of an XML document
    which may or may not be an XML file
This is typically done through programming languages of many sorts
    through ad hoc APIs
The most used / hated / deprecated / widespread are
    DOM
    SAX

Document Object Model
http://www.w3.org/DOM/
    standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
        primitive navigation and manipulation of XML trees
        other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
        adds Namespace support and minor new features
        other Level 2 parts: Events, Views, Style, Traversal and Range
    Level 3 Core: W3C Working Draft, April 2002
        adds minor new features

DOM Nodes
An XML document is a tree
The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be one of
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes each of them should / could have

Properties & Methods of DOM
Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
    see for instance http://java.sun.com/xml/jaxp

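The general node properties and methods listed above map directly onto the org.w3c.dom interfaces available in the JDK. The following Java sketch appends a new team to the football player document and then reads back the node name, an attribute and the text of the appended node. The class name DomNodeSketch, the method appendTeam and the team name "Example FC" are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomNodeSketch {

    // The player document from the DTD slides, without the DTD.
    static final String PLAYER =
        "<player><name>Carlo</name><surname>Nervo</surname>"
        + "<team current=\"yes\">Bologna</team>"
        + "<team current=\"no\">Mantova</team></player>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Appends a team element to the root and describes the new last child.
    static String appendTeam(String xml, String teamName, String current) {
        Document doc = parse(xml);

        // Nodes are created by the document, then attached to the tree.
        Element team = doc.createElement("team");
        team.setAttribute("current", current);
        team.appendChild(doc.createTextNode(teamName));
        doc.getDocumentElement().appendChild(team); // goes after the other children

        // Generic Node properties: name, attributes, text content.
        Node last = doc.getDocumentElement().getLastChild();
        String currentValue = last.getAttributes().getNamedItem("current").getNodeValue();
        int teams = doc.getElementsByTagName("team").getLength();
        return last.getNodeName() + " current=" + currentValue
            + " text=" + last.getTextContent() + " teams=" + teams;
    }

    public static void main(String[] args) {
        System.out.println(appendTeam(PLAYER, "Example FC", "no"));
    }
}
```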
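A Simple Java DOM

// Assumption: DOMParser is the Apache Xerces parser class.
import java.io.PrintStream;
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public static void main(String[] args) {
  try {
    // Load the whole document named on the command line into memory.
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    // Scan the children of the root for the first recipe element.
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out); // print is a helper not shown on the slide
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}
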
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist

Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as soon as document started / ended, elements started / ended, character content, etc.
A very simple model
    good for simple applications
    and also to avoid memory abuse
Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools

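The event-based model can be sketched in Java with the SAX classes bundled in the JDK. The handler below only records the events it receives, in order, without ever building a tree. The class name SaxSketch and the event labels written to the log are choices made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxSketch {

    // Parses the text and returns the sequence of SAX events observed.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() {
                log.add("startDocument");
            }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks; whitespace-only chunks are skipped.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text:" + text);
                }
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void endDocument() {
                log.add("endDocument");
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<player><name>Carlo</name><team current=\"yes\">Bologna</team></player>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

Because each event is handled and then forgotten, memory use stays roughly constant however large the document is; the price is that the application has to keep track of any context it needs by itself.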
DTD is…
SGML-based
    syntax a bit awkward
    but after all easy to understand
    and quite suited for short and expressive descriptions
It allows XML designers to define a grammar for their documents
    typical syntax-based approach
        maybe limited, but easy to implement
Maybe DTD is not the future of XML document validation
    XML Schema should be that
    but understanding DTDs, how to modify them, how to write your own ones, is likely to be useful, or maybe necessary, for a while still

A Simple DTD Example
We do not go too deep into DTD syntax
    we just look at the example below and comment

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE football_player [
  <!ELEMENT player (name, surname, team+)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>
  <!ELEMENT team (#PCDATA)>
  <!ATTLIST team current (yes | no) #REQUIRED>
]>
<player>
  <name>Carlo</name>
  <surname>Nervo</surname>
  <team current="yes">Bologna</team>
  <team current="no">Mantova</team>
</player>

DTD Declaration
DTD is declared here as internal
    but could be declared separately
        <!DOCTYPE football_player SYSTEM "football_player.dtd">
    even referring to an external shared resource
        <!DOCTYPE football_player SYSTEM "http://…">
DTD Declarations: Define or Use
So you may
    define your own DTD, and
        either include it in your XML document
        or save it as an independent document and refer to it from one or more XML docs
    or use an external DTD defined by someone else
        like a working group you belong to, or a standardisation body of any sort
        by referring to that externally-defined syntax for your XML docs

Element Declarations
A player element contains one name, one surname, and one or more teams
    in that precise order
    and they are just parsed character data (#PCDATA)
DTD Declaration

The DTD is declared here as internal
- but it could be declared separately
    <!DOCTYPE player SYSTEM "football_player.dtd">
- even referring to an external shared resource
    <!DOCTYPE player SYSTEM "http://…">

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

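A minimal sketch of how a document carrying an internal DTD might be checked for validity, assuming the JAXP parser bundled with the JDK. The class and method names are hypothetical, and the document type is named player here so that it matches the root element, which validation requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class InternalDtdSketch {
    // The player document, with its DTD in the internal subset.
    static final String PLAYER_XML =
        "<?xml version=\"1.0\" standalone=\"yes\"?>\n"
        + "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n"
        + "<player><name>Carlo</name><surname>Nervo</surname>"
        + "<team current=\"yes\">Bologna</team>"
        + "<team current=\"no\">Mantova</team></player>";

    // Parses the text with DTD validation on and returns the problems reported.
    static List<String> validityErrors(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            errors.add("not parsed: " + e.getMessage());
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("as given: " + validityErrors(PLAYER_XML));
        // Dropping the mandatory attribute from the second team makes the document invalid.
        String broken = PLAYER_XML.replace(" current=\"no\"", "");
        System.out.println("attribute removed: " + validityErrors(broken));
    }
}
```

Validity errors are collected rather than thrown, since a document that is not valid can still be well-formed and fully parsed.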
DTD Declarations: Define or Use

So you may
- define your own DTD, and
  - either include it in your XML document
  - or save it as an independent document and refer to it from one or more XML docs
- or use an external DTD defined by someone else
  - like a working group you belong to, or a standardisation body of any sort
  - by referring to that externally-defined syntax from your XML docs

Element Declarations

A player element contains one name, one surname, and one or more teams
- in that precise order
- and they are just parsed character data (#PCDATA)

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Some Syntax

, is for sequence
- to define ordered lists
| is for choice
- to provide for alternatives
suffixes
- * for zero or more occurrences
- + for one or more occurrences
- ? for zero or one occurrence
parentheses for grouping
- at any level of nesting
- operators and suffixes applicable to any level
ANY for free-form content
EMPTY for empty elements

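Content models read much like regular expressions over child element names. The sketch below is only an analogy, not how a validating parser works: the content model (name, surname, team+) is rewritten by hand as a java.util.regex pattern, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ContentModelSketch {
    // Hand-written regular expression mirroring the content model (name, surname, team+).
    // Every child name is written with a trailing space so that names cannot run together.
    static final Pattern PLAYER_MODEL = Pattern.compile("name surname (team )+");

    // True when the sequence of child element names fits the player content model.
    static boolean allowedInPlayer(List<String> childNames) {
        StringBuilder sequence = new StringBuilder();
        for (String child : childNames) {
            sequence.append(child).append(' ');
        }
        return PLAYER_MODEL.matcher(sequence).matches();
    }

    public static void main(String[] args) {
        // Correct order, two teams.
        System.out.println(allowedInPlayer(Arrays.asList("name", "surname", "team", "team")));
        // Wrong order: the sequence operator fixes name before surname.
        System.out.println(allowedInPlayer(Arrays.asList("surname", "name", "team")));
        // No team at all: + demands at least one occurrence.
        System.out.println(allowedInPlayer(Arrays.asList("name", "surname")));
    }
}
```

Replacing + with * in the pattern would also accept a player with no team, and ? would accept at most one.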
Attribute Declarations

A team element has a current attribute
- which is mandatory (#REQUIRED)
  - #IMPLIED would say optional instead
- and can be either yes or no
  - enumeration as an attribute type

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>

Attribute Defaults

#IMPLIED
- the attribute is optional
#REQUIRED
- the attribute is mandatory
#FIXED
- whether it is explicitly specified or not, it has a given value
literal
- the default value is the literal quoted string

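A small sketch of how literal and #FIXED defaults surface to an application, assuming the JAXP parser bundled with the JDK. The DTD below is a hypothetical variant of the player example (the sport attribute and the "no" default are made up for illustration), and the class and method names are assumptions too.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeDefaultSketch {
    // Hypothetical variant of the player DTD: "current" gets a literal default,
    // and "sport" is fixed to a single value.
    static final String DTD =
        "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (team+)>\n"
        + "  <!ATTLIST player sport CDATA #FIXED \"football\">\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) \"no\">\n"
        + "]>\n";

    // Returns the value of the attribute on the first element with the given tag name.
    static String attributeOf(String body, String tag, String attribute) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(DTD + body)));
            Element element = (Element) doc.getElementsByTagName(tag).item(0);
            return element.getAttribute(attribute);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Neither attribute is written in the document itself.
        String body = "<player><team>Mantova</team></player>";
        System.out.println(attributeOf(body, "team", "current"));
        System.out.println(attributeOf(body, "player", "sport"));
    }
}
```

The parser supplies the declared defaults, so the application reads the same values whether or not the author wrote them out.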
Attribute Types

CDATA
- any string of text acceptable in a well-formed XML attribute value
NMTOKEN, NMTOKENS
- more permissive than an XML name: any name character is accepted as the first character
- the plural form accepts more than one, separated by whitespace
ENTITY, ENTITIES
- name(s) of unparsed entities declared elsewhere in the document
ID
- an XML name, unique in the document, working as an identifier
IDREF, IDREFS
- reference(s) to IDs in the document
NOTATION
- name of a notation used & defined in the document (rare)
enumeration

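A sketch of ID and IDREF at work, assuming the JAXP parser bundled with the JDK with validation switched on. The league document, its element names and the code / plays-for attributes are hypothetical, as are the class and method names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class IdRefSketch {
    // Hypothetical document: teams carry an ID, players point to a team through an IDREF.
    static final String XML =
        "<!DOCTYPE league [\n"
        + "  <!ELEMENT league (team+, player+)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team code ID #REQUIRED>\n"
        + "  <!ELEMENT player (#PCDATA)>\n"
        + "  <!ATTLIST player plays-for IDREF #REQUIRED>\n"
        + "]>\n"
        + "<league><team code=\"bol\">Bologna</team><team code=\"man\">Mantova</team>"
        + "<player plays-for=\"bol\">Carlo Nervo</player></league>";

    // Follows the IDREF of the first player to the team element it identifies.
    static String teamOfFirstPlayer() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // ignores validity errors, rethrows fatal ones
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            Element player = (Element) doc.getElementsByTagName("player").item(0);
            // getElementById relies on the DTD declaring "code" as an ID attribute.
            Element team = doc.getElementById(player.getAttribute("plays-for"));
            return team.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(teamOfFirstPlayer());
    }
}
```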
Other DTD Declarations, etc.

ENTITY declarations
    <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
NOTATION declarations
- who cares, actually?
We stop here
- more only for those who need it

Namespaces
What are Namespaces for?

Distinguish
- different XML applications may use the same names
  - at any scale, from personal to world-wide
- a namespace allows them to be clearly distinguished
Group
- names of elements and attributes of the same XML application can be grouped together
  - to be more easily recognised and handled
Example: set is an element in both SVG and MathML applications
- what if I have to use them together?
- namespaces can be used to disambiguate names

Syntax for Namespace Use

Qualified names
- prefix:local_part
Examples of qualified names (or QNames, or raw names)
- rdf:description, xlink:type, xsl:template
Used for both element and attribute names

Associating Prefixes to URIs

Example
- a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml"
             xmlns:euro="http://www.company.eu/xml"
             xmlns:world="http://www.company.com/xml">
- then you can use local, euro and world everywhere as prefixes
- typically declared in the topmost element, but could be declared anywhere
- example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
- but usually svg, rdf and other prefixes are not re-defined
- also, they are conventional names
  - not necessarily pointing to an actual resource

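A minimal sketch of how an application might look elements up by namespace URI rather than by prefix, assuming the JAXP parser bundled with the JDK. The office elements and their content are hypothetical additions to the company example, and so are the class and method names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespacePrefixSketch {
    // The company element with its three prefixes, plus hypothetical child elements.
    static final String XML =
        "<company xmlns:local=\"http://www.company.it/xml\""
        + " xmlns:euro=\"http://www.company.eu/xml\""
        + " xmlns:world=\"http://www.company.com/xml\">"
        + "<local:office>Cesena</local:office>"
        + "<world:office>Boston</world:office>"
        + "</company>";

    // Returns the text of the first office element in the given namespace, or null if none.
    static String officeIn(String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces unless asked
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            // The lookup uses the URI and the local part; the prefix plays no role.
            NodeList offices = doc.getElementsByTagNameNS(namespaceUri, "office");
            return offices.getLength() == 0 ? null : offices.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(officeIn("http://www.company.it/xml"));
        System.out.println(officeIn("http://www.company.com/xml"));
        System.out.println(officeIn("http://www.company.eu/xml"));
    }
}
```

Two office elements with the same local part stay distinct because their namespace URIs differ, whatever prefixes the author happened to choose.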
Setting Default Namespaces

xmlns attribute
- alone, with no :prefix suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
- all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  - no need for the svg prefix to be made explicit

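A short sketch of what a default namespace declaration does and does not cover, assuming the JAXP parser bundled with the JDK. The sizes and the rect child are illustrative values, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DefaultNamespaceSketch {
    static final String SVG_NS = "http://www.w3.org/2000/svg";
    static final String XML =
        "<svg xmlns=\"" + SVG_NS + "\" width=\"10\" height=\"10\">"
        + "<rect width=\"4\" height=\"4\"/></svg>";

    // Parses the sample with namespace support and returns its root element.
    static Element root() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element svg = root();
        Element rect = (Element) svg.getFirstChild();
        // Unprefixed elements inherit the default namespace.
        System.out.println(svg.getNamespaceURI());
        System.out.println(rect.getNamespaceURI());
        // Unprefixed attributes do not: they are in no namespace.
        System.out.println(svg.getAttributeNode("width").getNamespaceURI());
    }
}
```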
Internationalisation
What does Text Mean?

“Text” can be encoded according to so many different alphabets
- mapping between characters and integers (code points)
  - character set
- ASCII being the most (un)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
- so a character set can have multiple encodings
- UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
- so encoding should be declared

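The difference between a code point and its encodings can be sketched in a few lines, using only the standard library. The class and method names are hypothetical, and the accented letter is written as a Unicode escape so that the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointSketch {
    // Number of bytes the text occupies once encoded.
    static int byteCount(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String aGrave = "\u00E0"; // "a with grave accent", one code point
        System.out.println("code point: " + aGrave.codePointAt(0));
        // One character set (Unicode), several encodings, different byte counts.
        System.out.println("UTF-8 bytes: " + byteCount(aGrave, StandardCharsets.UTF_8));
        System.out.println("UTF-16 (big endian) bytes: " + byteCount(aGrave, StandardCharsets.UTF_16BE));
        // Latin-1 is a different character set that also contains this letter.
        System.out.println("ISO-8859-1 bytes: " + byteCount(aGrave, StandardCharsets.ISO_8859_1));
    }
}
```

UTF_16BE is used instead of UTF_16 because the latter prepends a byte-order mark when encoding, which would inflate the count.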
The XML Encoding Part of the XML Declaration

    <?xml version="1.0" encoding="utf-8" standalone="no"?>

Most common values
- utf-8, utf-16 (Unicode)
- ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
- Unicode and ISO are the most used families
Used also for external parsed entities
- like DTD fragments or XML chunks
- which may have different encodings
- there, version may be dropped
  - it is a text declaration, but no longer an XML declaration

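A sketch of why the declared encoding matters, assuming the JAXP parser bundled with the JDK: the same text is turned into bytes with one charset and announced with a possibly different one. The word element is hypothetical, as are the class and method names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EncodingDeclarationSketch {
    // Encodes a tiny document with the actual charset while declaring declaredEncoding,
    // then parses the bytes. Returns the decoded text, or null if parsing fails.
    static String decode(String declaredEncoding, Charset actual) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
            + "<word>Universit\u00E0</word>";
        byte[] bytes = xml.getBytes(actual);
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            return null; // the bytes do not match the declared encoding
        }
    }

    public static void main(String[] args) {
        // Declaration and bytes agree: the accented letter survives.
        System.out.println(decode("ISO-8859-1", StandardCharsets.ISO_8859_1));
        System.out.println(decode("UTF-8", StandardCharsets.UTF_8));
        // Declaration says UTF-8 but the bytes are Latin-1: the parser gives up.
        System.out.println(decode("UTF-8", StandardCharsets.ISO_8859_1));
    }
}
```

The parser is handed raw bytes on purpose: when it receives already-decoded characters, the encoding declaration no longer drives the decoding.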
Multi-Lingual Documents

Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
- for multi-lingual docs
xml:lang attribute
- can be associated to any element
- determines the language of the element
Values are to be found in ISO 639
- standard two letters for each language known
- if not there, IANA
  - prefix i-
  - such as i-navajo, i-klingon, …
- if not there too, such as for user-defined tags
  - prefix x-
  - such as x-quenya

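A sketch of how a tool such as a spell-checker might work out the language in force for a given element, by looking for the nearest xml:lang on the element or its ancestors. It assumes the JAXP parser bundled with the JDK; the sample document and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlLangSketch {
    // Hypothetical two-language document.
    static final String XML =
        "<doc xml:lang=\"en\"><p>Hello</p>"
        + "<p xml:lang=\"it\">Ciao <em>mondo</em></p></doc>";

    // Walks from the node up to the root and returns the first xml:lang found, or null.
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n instanceof Element && ((Element) n).hasAttribute("xml:lang")) {
                return ((Element) n).getAttribute("xml:lang");
            }
        }
        return null;
    }

    // Language in force for the first element with the given tag name.
    static String languageOfFirst(String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return languageOf(doc.getElementsByTagName(tagName).item(0));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(languageOfFirst("p"));  // first paragraph, inherits from doc
        System.out.println(languageOfFirst("em")); // inside the Italian paragraph
    }
}
```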
Encoding for Portability

Working around encoding is not simply an “internationalisation” issue
- it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
- which are often not easy to catch
XML abilities to
- handle encoding precisely and accurately
- embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
- across platforms, across applications, across time

XML & CSS

XML on Browsers

Different experiences with different browsers
- when trying to visualise an XML document
XML, however, can be transformed
- to become easier to handle by standard browsers
Two main approaches
- Web-based one: XML + CSS
- XML-based one: XSL
In the following, we explore the XML + CSS issue

Cascading Style Sheets

Cascading Style Sheets (CSS)
- a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
Standard W3C
- http://www.w3.org/Style/CSS
Goals
- describing how to present elements of a document
  - spanning over a range of different media
- separating style description from content and structure
In this course, we assume that you already know the basics
- if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS

Any XML document can be prepared for browser visualisation via CSS
Two things needed
- a CSS style sheet referring to the proper element types of the XML document
- the association between the XML document and the CSS style sheet
Processing instruction
- to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
No need for DTD or Schema
- even though the browser could anyway complain…

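The association between document and style sheet is an ordinary processing instruction in the prolog, so any XML tool can read it. A minimal sketch, assuming the JAXP parser bundled with the JDK; the player.css file name and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class StylesheetPiSketch {
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<?xml-stylesheet type=\"text/css\" href=\"player.css\"?>\n"
        + "<player><name>Carlo</name></player>";

    // Returns the data part of the first xml-stylesheet instruction, or null if absent.
    static String stylesheetData() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            // The instruction is a child of the document node, before the root element.
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE
                        && "xml-stylesheet".equals(((ProcessingInstruction) n).getTarget())) {
                    return ((ProcessingInstruction) n).getData();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stylesheetData());
    }
}
```

The type and href parts look like attributes but reach the application as one plain string; it is up to the browser, or to the tool, to pick them apart.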
XML + CSS Example: The XML Doc

Example: How Mozilla Visualises it [without CSS Style Sheet]

Example: How Mozilla Visualises it [with CSS Style Sheet]

DOM & SAX

Manipulating XML

Representing information in an XML Document
- and presenting it somehow
is not enough for most non-trivial application scenarios
Mostly, we need to manipulate
- access, delete, modify
parts of an XML document
- which either may or may not be an XML file
This is typically done through programming languages of many sorts
- through ad hoc APIs
The most used / hated / deprecated / widespread are
- DOM
- SAX

Document Object Model

http://www.w3.org/DOM/
- standard W3C, as usual
“The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents”
It applies to HTML as well as XML
It is essentially an API
- standardised for Java & ECMAScript
- but can be extended to other languages
There is no time here to go deep into DOM
- we just try to understand its nature, goals, and scope

DOM & Levels

DOM views an XML tree as a data structure
- similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
- maybe huge memory consumption
It is quite large and complex…
- Level 1 Core: W3C Recommendation, October 1998
  - primitive navigation and manipulation of XML trees
  - other Level 1 parts: HTML
- Level 2 Core: W3C Recommendation, November 2000
  - adds Namespace support and minor new features
  - other Level 2 parts: Events, Views, Style, Traversal and Range
- Level 3 Core: W3C Recommendation, April 2004
  - adds minor new features

DOM Nodes

An XML document is a tree
The tree contains nodes
- one of them is a root node
- nodes possibly have siblings, children, one parent, content, tag, etc.
The DOM specification states that a node can be a
- document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
It also defines which kind of child nodes they should / could have

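A small sketch that walks a DOM tree and names the kind of each node met, assuming the JAXP parser bundled with the JDK with its default settings. The sample document and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeKindsSketch {
    static final String XML =
        "<player><!-- a comment --><name>Carlo</name>"
        + "<team current=\"yes\"><![CDATA[Bologna]]></team></player>";

    // Maps the numeric node type onto a readable label.
    static String kind(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.ELEMENT_NODE: return "element";
            case Node.TEXT_NODE: return "text";
            case Node.CDATA_SECTION_NODE: return "CDATA section";
            case Node.COMMENT_NODE: return "comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default: return "other";
        }
    }

    // Depth-first visit: the node itself, then each child in order.
    static void collect(Node node, List<String> kinds) {
        kinds.add(kind(node));
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, kinds);
        }
    }

    static List<String> kindsInDocumentOrder() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            List<String> kinds = new ArrayList<>();
            collect(doc, kinds);
            return kinds;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(kindsInDocumentOrder());
    }
}
```

The current attribute never shows up in the walk: attributes belong to their element but are not among its children.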
Properties & Methods of DOM

Every DOM node has properties and methods to explore and update the XML tree
Every DOM node has a name, a value, a type
There are general properties and methods for all kinds of nodes
- attributes returns all the attributes of the node
- appendChild(newChild) appends newChild after the other child nodes
Then, any specific kind of node has its own specific properties and methods
These properties and methods are made available by the suitable API for the language of choice
- many solutions for Java
- see for instance http://java.sun.com/xml/jaxp

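In the Java binding, the general properties become getter methods on org.w3c.dom.Node. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeMethodsSketch {
    // Parses a one-team player document and returns its root element.
    static Element parsePlayer() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(
                    "<player><team current=\"yes\">Bologna</team></player>")));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Name, value and type: the three properties every node has.
    static String describe(Node node) {
        return node.getNodeName() + " | " + node.getNodeValue() + " | type " + node.getNodeType();
    }

    // Builds a new team element and appends it after the existing children.
    static Element appendTeam(Element player, String teamName) {
        Document doc = player.getOwnerDocument();
        Element team = doc.createElement("team");
        team.appendChild(doc.createTextNode(teamName));
        player.appendChild(team);
        return team;
    }

    public static void main(String[] args) {
        Element player = parsePlayer();
        Node team = player.getFirstChild();
        System.out.println(describe(team));                 // an element has no value of its own
        System.out.println(describe(team.getFirstChild())); // its text lives in a child node
        System.out.println(team.getAttributes().getNamedItem("current").getNodeValue());
        appendTeam(player, "Mantova");
        System.out.println(player.getLastChild().getTextContent());
    }
}
```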
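A Simple Java DOM

    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser; // assumed: the Xerces parser class
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
      try {
        DOMParser p = new DOMParser();
        p.parse(args[0]);                 // loads the whole document in memory
        Document doc = p.getDocument();
        // look for the first recipe child of the root element
        Node n = doc.getDocumentElement().getFirstChild();
        while (n != null && !n.getNodeName().equals("recipe"))
          n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
          print(n, out);                  // print: helper method not shown on the slide
        out.println("</collection>");
      } catch (Exception e) { e.printStackTrace(); }
    }
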
Main Problem of DOM

The XML document is loaded as a whole and handled altogether in memory
- it might be time-consuming and difficult to manage
- wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
- which did not start as a standard
- has problems of acceptance
- but has indeed a long tail of followers
- and also its good reasons to exist

Simple API for XML (SAX)

Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text doc
- flowing through the SAX parser
- and generating events as soon as document started / ended, elements started / ended, character content, etc.
A very simple model
- good for simple applications
- and also to avoid memory abuse
Not so well-supported as DOM is
- in terms of standardisation
- as well as of tools

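The event-based model can be sketched with a handler that simply logs what the parser reports, assuming the SAX parser bundled with the JDK. The class and method names and the log format are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsSketch {
    // Parses the text and returns the events in the order the parser raised them.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // Character content may arrive in several chunks, so it is buffered
            // and logged once the next tag is reached.
            private void flushText() {
                if (text.length() > 0) {
                    log.add("text " + text);
                    text.setLength(0);
                }
            }
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes attributes) {
                flushText();
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                flushText();
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<player><name>Carlo</name><surname>Nervo</surname></player>")) {
            System.out.println(event);
        }
    }
}
```

Nothing of the document is kept once an event has been handled, which is where the memory savings over DOM come from; the price is that the application must keep track of context by itself.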
Element Declarations
A player element contain one name one surname and one or more teams
in that precise orderand they are just parsed character data (PCDATA)
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
44
Some Syntax is for sequence
to define ordered lists| is for choice
to provide for alternativessuffixes
for zero or more occurrences+ for one or more occurrences for zero or one occurrence
parenthesis for groupingat any level of indentationoperators and suffixes applicable to any level
ANY for free-form contentEMPTY for empty element
45
Attribute Declarations
A team element has a current attributewhich is mandatory
IMPLIED would say optional insteadand can be either yes or no
enumeration as an attribute type
ltxml version=10 standalone=yesgtltDOCTYPE football_player [ ltELEMENT player (name surname team+)gt ltELEMENT name (PCDATA)gt ltELEMENT surname (PCDATA)gt ltELEMENT team (PCDATA)gt ltATTLIST team current (yes | no) REQUIREDgt]gtltplayergt ltnamegtCarloltnamegt ltsurnamegtNervoltsurnamegt ltteam current=yesgtBolognaltteamgt ltteam current=nogtMantovaltteamgtltplayergt
46
Attribute Defaults
IMPLIEDthe attribute is optional
REQUIREDthe attribute is mandatory
FIXEDeither it is explicitly specified or not it has a given value
literalthe default value is the literal quoted string
47
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tagsprefix x-
such as x-quenya58
Encoding for Portability
Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability
When transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time
59
XML amp CSS
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip
No need for DTD or Schemaeven though the browser could anyway complainhellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it[without CSS Style Sheet]
66
Example How Mozilla Visualises it[with CSS Style Sheet]
67
DOM amp SAX
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sortsthrough ad hoc API
The most used hated deprecated widespread areDOMSAX
69
Document Object Model
httpwwww3orgDOMstandard W3C as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range
Level 3 Core W3C Working Draft April 2002adds minor new features
71
DOM Nodes
- An XML document is a tree
- The tree contains nodes
  - one of them is the root node
  - nodes possibly have siblings, children, one parent, content, tag, etc.
- The DOM specification states what a node can be
  - document, document fragment, document type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
- It also defines which kinds of child nodes each of them should / could have
Properties & Methods of DOM
- Every DOM node has properties and methods to explore and update the XML tree
- Every DOM node has a name, a value, a type
- There are general properties and methods for all kinds of nodes
  - attributes returns all the attributes of the node
  - appendChild(newChild) appends newChild after the other child nodes
- Then, any specific kind of node has its own specific properties and methods
- These properties and methods are made available by the suitable API for the language of choice
  - many solutions for Java
  - see for instance http://java.sun.com/xml/jaxp/
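The generic node properties and the two members named above can be tried out with the DOM implementation bundled with the JDK (JAXP). This is only a minimal sketch: the class name DomNodeDemo, the helper names parse and describe, and the sample document are assumptions made for illustration, not part of the original slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomNodeDemo {

    // Parses an XML string into a DOM tree with the JAXP DocumentBuilder.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Describes a node through the three generic properties: type, name, value.
    static String describe(Node n) {
        return n.getNodeType() + " " + n.getNodeName() + " " + n.getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document, with no whitespace between elements.
        Document doc = parse("<player><team current=\"yes\">Bologna</team></player>");
        Element player = doc.getDocumentElement();
        Element team = (Element) player.getFirstChild();

        // getAttributes() returns all the attributes of the node as a NamedNodeMap.
        NamedNodeMap attrs = team.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            System.out.println(describe(attrs.item(i)));
        }

        // appendChild() adds the new node after the existing child nodes.
        Element second = doc.createElement("team");
        second.setAttribute("current", "no");
        second.appendChild(doc.createTextNode("Mantova"));
        player.appendChild(second);
        System.out.println(player.getChildNodes().getLength());
    }
}
```

Note that getNodeValue() is null for element nodes, so describe is mostly informative on attribute and text nodes.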
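A Simple Java DOM
    // DOMParser is assumed to be the Xerces parser (org.apache.xerces.parsers.DOMParser);
    // print(Node, PrintStream) is a helper that is not shown on the slide.
    public static void main(String[] args) {
        try {
            DOMParser p = new DOMParser();
            p.parse(args[0]);
            Document doc = p.getDocument();
            // Skip the children of the root until the first recipe element
            Node n = doc.getDocumentElement().getFirstChild();
            while (n != null && !n.getNodeName().equals("recipe"))
                n = n.getNextSibling();
            PrintStream out = System.out;
            out.println("<?xml version=\"1.0\"?>");
            out.println("<collection>");
            if (n != null)
                print(n, out);
            out.println("</collection>");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }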
Main Problem of DOM
- The XML document is loaded as a whole and handled altogether in memory
  - it might be time-consuming and difficult to manage
  - wouldn't it be better if we could load only the part we are actually manipulating?
- This is the motivation behind SAX
  - which did not start as a standard
  - has problems of acceptance
  - but has indeed a long tail of followers
  - and also its good reasons to exist
Simple API for XML (SAX)
- Differently from DOM, SAX is event-based
- It sees the document not as a tree, but as a text document
  - flowing through the SAX parser
  - and generating events as it goes: document started / ended, element started / ended, character content, etc.
- A very simple model
  - good for simple applications
  - and also to avoid memory abuse
- Not as well supported as DOM is
  - in terms of standardisation
  - as well as of tools
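To make the event-based model concrete, the sketch below records the callbacks fired by the SAX parser shipped with the JDK while a small document flows through it. The class name SaxEventsDemo, the events helper and the sample document are illustrative assumptions; trimming and skipping blank character data is a simplification chosen here, not something SAX does by itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventsDemo {

    // Returns the SAX callbacks fired for the given document, in firing order.
    static List<String> events(String xml) throws Exception {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one text run in several chunks; each chunk is logged.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        for (String e : events("<player><name>Carlo</name></player>")) {
            System.out.println(e);
        }
    }
}
```

No tree is ever built here: once a callback returns, the parser keeps nothing about that part of the document, which is why memory use stays low even on large inputs.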
Some Syntax
- , is for sequence
  - to define ordered lists
- | is for choice
  - to provide for alternatives
- suffixes
  - * for zero or more occurrences
  - + for one or more occurrences
  - ? for zero or one occurrence
- parentheses for grouping
  - at any level of nesting
  - operators and suffixes applicable to any level
- ANY for free-form content
- EMPTY for empty element
Attribute Declarations
- A team element has a current attribute
  - which is mandatory
    - #IMPLIED would say optional instead
  - and can be either yes or no
    - enumeration as an attribute type
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE player [
      <!ELEMENT player (name, surname, team+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT surname (#PCDATA)>
      <!ELEMENT team (#PCDATA)>
      <!ATTLIST team current (yes | no) #REQUIRED>
    ]>
    <player>
      <name>Carlo</name>
      <surname>Nervo</surname>
      <team current="yes">Bologna</team>
      <team current="no">Mantova</team>
    </player>
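The effect of the #REQUIRED enumerated attribute can be checked with a validating parser. The following is a rough sketch using the JAXP parser bundled with the JDK; the class name DtdValidationDemo and the validate helper are hypothetical, and the internal DTD subset is written on the assumption that the DOCTYPE name matches the root element player, as validity requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class DtdValidationDemo {

    // Internal DTD subset modelled on the slide (DOCTYPE name assumed to be "player").
    static final String DTD =
          "<!DOCTYPE player [\n"
        + "  <!ELEMENT player (name, surname, team+)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT surname (#PCDATA)>\n"
        + "  <!ELEMENT team (#PCDATA)>\n"
        + "  <!ATTLIST team current (yes | no) #REQUIRED>\n"
        + "]>\n";

    // Returns the validity errors reported for the body; an empty list means valid.
    static List<String> validate(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // turn on DTD validation
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) { errors.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(DTD + body)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        String head = "<player><name>Carlo</name><surname>Nervo</surname>";
        // Valid: current is present and is one of the enumerated values.
        System.out.println(validate(head + "<team current=\"yes\">Bologna</team></player>"));
        // Invalid: the mandatory current attribute is missing.
        System.out.println(validate(head + "<team>Bologna</team></player>"));
        // Invalid: the value is not in the enumeration (yes | no).
        System.out.println(validate(head + "<team current=\"maybe\">Bologna</team></player>"));
    }
}
```

Validity errors are recoverable, so the parser reports them through the error handler and keeps going; only well-formedness problems reach fatalError.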
Attribute Defaults
- #IMPLIED
  - the attribute is optional
- #REQUIRED
  - the attribute is mandatory
- #FIXED
  - whether or not it is explicitly specified, it has a given value
- literal
  - the default value is the literal quoted string
Attribute Types
- CDATA
  - any string of text acceptable in a well-formed XML attribute value
- NMTOKEN, NMTOKENS
  - more permissive than an XML name: any name character is accepted as the first character
  - the plural form accepts more than one, separated by whitespace
- ENTITY, ENTITIES
  - name(s) of unparsed entities declared elsewhere in the document
- ID
  - an XML name unique in the document, working as an identifier
- IDREF, IDREFS
  - reference(s) to IDs in the document
- NOTATION
  - name of a notation used & defined in the document (rare)
- enumeration
Other DTD Declarations etc.
- ENTITY declarations
  - <!ENTITY footer SYSTEM "http://lia.deis.unibo.it/~ao/footer">
- NOTATION declarations
  - who cares, actually?
- We stop here
  - more only for those who need it
Namespaces
What are Namespaces for?
- Distinguish
  - different XML applications may use the same names
    - at any scale, from personal to world-wide
  - a namespace allows them to be clearly distinguished
- Group
  - names of elements and attributes of the same XML application can be grouped together
    - to be more easily recognised and handled
- Example: set is an element in both SVG and MathML applications
  - what if I have to use them together?
  - namespaces can be used to disambiguate names
Syntax for Namespace Use
- Qualified names
  - prefix:local_part
- Examples of qualified names
  - or QNames, or raw names
  - rdf:description, xlink:type, xsl:template
- Used for both element and attribute names
Associating Prefixes to URI
- Example
  - a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
  - then you can use local, euro and world everywhere as prefixes
  - typically declared in the topmost element, but could be declared anywhere
  - example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
- URIs are standardised, prefixes are not
  - but usually svg, rdf and other well-known prefixes are not re-defined
  - also, namespace URIs are conventional names
    - not necessarily pointing to an actual resource
Setting Default Namespaces
- xmlns attribute
  - alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
- all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  - no need to make the svg prefix explicit
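How a default namespace and a prefixed one disambiguate the same local name can be observed with a namespace-aware DOM parser. This is a small sketch under stated assumptions: the class name NamespaceDemo, the helpers parse and expandedName, and the prefix m are made up for illustration; the two namespace URIs are the ones published by the W3C for SVG and MathML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NamespaceDemo {

    // Parses with namespace support switched on (JAXP leaves it off by default).
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Renders a node name as {namespace URI}local_part, independently of the prefix used.
    static String expandedName(Node n) {
        return "{" + n.getNamespaceURI() + "}" + n.getLocalName();
    }

    public static void main(String[] args) throws Exception {
        // svg is the default namespace; MathML is bound to the (arbitrary) prefix m.
        Document doc = parse(
              "<svg xmlns=\"http://www.w3.org/2000/svg\""
            + " xmlns:m=\"http://www.w3.org/1998/Math/MathML\">"
            + "<set/><m:set/></svg>");
        Element svg = doc.getDocumentElement();
        Node svgSet = svg.getFirstChild();   // unprefixed: falls in the default namespace
        Node mathSet = svg.getLastChild();   // prefixed: falls in the MathML namespace

        System.out.println(expandedName(svgSet) + " prefix=" + svgSet.getPrefix());
        System.out.println(expandedName(mathSet) + " prefix=" + mathSet.getPrefix());
    }
}
```

Both elements have the local name set, yet their expanded names differ because the namespace URI, not the prefix, is what identifies the vocabulary.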
Internationalisation
What does "Text" Mean?
- "Text" can be encoded according to so many different alphabets
  - mapping between characters and integers (code points)
    - character set
  - ASCII being the most (in)famous, now Unicode
- A character encoding determines how code points are mapped onto bytes
  - so a character set can have multiple encodings
  - UTF-8 and UTF-16 are both Unicode encodings
- Any XML document is a text document
  - so encoding should be declared
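The difference between a code point and its encodings can be seen by serialising the same character with different charsets. The sketch below uses only the standard Java charset API; the class name EncodingDemo and the hex helper are illustrative assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Encodes the text with the given charset and renders the bytes as hexadecimal.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "\u00E0"; // a single character, LATIN SMALL LETTER A WITH GRAVE
        // One character set (Unicode), one code point ...
        System.out.println("code point: U+" + String.format("%04X", text.codePointAt(0)));
        // ... but different byte sequences depending on the encoding.
        System.out.println("UTF-8:      " + hex(text, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE:   " + hex(text, StandardCharsets.UTF_16BE));
        System.out.println("ISO-8859-1: " + hex(text, StandardCharsets.ISO_8859_1));
    }
}
```

A reader that assumes the wrong encoding would decode these bytes into different characters, which is why the declaration matters.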
The XML Encoding Part of the XML Declaration
- <?xml version="1.0" encoding="utf-8" standalone="no"?>
- Most common values
  - utf-8, utf-16 (Unicode)
  - ISO-8859-1 (Latin-1)
- See also XML-Defined Character Sets
  - Unicode and ISO are the most used families
- Used also for external parsed entities
  - like DTD fragments or XML chunks
  - which may have different encodings
  - there, version may be dropped
    - it is a text declaration, but no longer an XML declaration
Multi-Lingual Documents
- Example: a spell-checker or a voice-reader parsing an XML doc
- How to determine the language of a subpart?
  - for multi-lingual docs
- xml:lang attribute
  - can be associated to any element
  - determines the language of the element
- Values are to be found in ISO 639
  - standard two letters for each language known
  - if not there, IANA
    - prefix i-
    - such as i-navajo, i-klingon, …
  - if not there too, such as for user-defined tags
    - prefix x-
    - such as x-quenya
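A tool such as the spell-checker mentioned above needs the language in force for each subpart. Since xml:lang applies to an element and to its content unless overridden, one possible approach is to walk up the ancestors until an xml:lang attribute is found. The sketch below assumes that approach; the class name LangDemo, the helpers parse and languageOf, and the sample document are hypothetical, and an empty xml:lang value is treated simply as absent, which is a simplification.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class LangDemo {

    // Parses an XML string into a DOM tree with the JAXP parser of the JDK.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the nearest xml:lang value on the node or its ancestors, or "" if none.
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n instanceof Element) {
                String lang = ((Element) n).getAttribute("xml:lang");
                if (!lang.isEmpty()) {
                    return lang;
                }
            }
        }
        return "";
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
              "<doc xml:lang=\"en\">"
            + "<p>Hello</p>"
            + "<p xml:lang=\"it\">Ciao <em>mondo</em></p>"
            + "</doc>");
        Node firstP = doc.getElementsByTagName("p").item(0);
        Node em = doc.getElementsByTagName("em").item(0);
        System.out.println(languageOf(firstP)); // inherited from doc
        System.out.println(languageOf(em));     // inherited from the second p
    }
}
```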
Encoding for Portability
- Working around encoding is not simply an "internationalisation" issue
  - it is also about portability
- When transmitting / communicating through text-based files, many errors typically occur
  - which are often not easy to catch
- XML abilities to
  - handle encoding precisely and accurately
  - embody encoding information within each document
- make it a powerful tool for easy and hassle-free portability
  - across platforms, across applications, across time
XML & CSS
XML on Browsers
- Different experiences with different browsers
  - when trying to visualise an XML document
- XML, however, can be transformed
  - to become easier to handle by standard browsers
- Two main approaches
  - Web-based one: XML + CSS
  - XML-based one: XSL
- In the following, we explore the XML + CSS issue
Cascading Style Sheets
- Cascading Style Sheets (CSS)
  - a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
- W3C standard
  - http://www.w3.org/Style/CSS/
- Goals
  - describing how to present elements of a document
    - spanning over a range of different media
  - separating style description from content and structure
- In this course, we assume that you already know the basics
  - if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example [slide image not included in the transcript]
XML + CSS
- Any XML document can be prepared for browser visualisation via CSS
- Two things needed
  - a CSS style sheet referring to the proper element types of the XML document
  - the association between the XML document and the CSS style sheet
- Processing instruction
  - to associate CSS to XML
  - <?xml-stylesheet type="text/css" href="filename.css"?>
- CSS style sheet
  - defining presentation style for the XML document tags
  - tagname { attribute1: value1; … }
- No need for a DTD or Schema
  - even though the browser could complain anyway…
Attribute TypesCDATA
any string of text acceptable in a well-formed XML attribute valueNMTOKEN NMTOKENS
more than an XML name anything accepted as the first characterthe plural form accepts more than one separated by whitespaces
ENTITY ENTITIESname(s) of unparsed entities declared elsewhere in the document
IDan XML name unique in the document working as an identifier
IDREF IDREFSreference(s) to IDs in the documents
NOTATIONname of a notation used amp defined in the document (rare)
enumeration
48
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for?
  Distinguish
    different XML applications may use the same names
      at any scale, from personal to world-wide
    a namespace allows them to be clearly distinguished
  Group
    names of elements and attributes of the same XML application can be grouped together
      to be more easily recognised and handled
  Example: set is an element in both SVG and MathML applications
    what if I have to use them together?
    namespaces can be used to disambiguate names
Syntax for Namespace Use
  Qualified names
    prefix:local_part
  Examples of qualified names
    or QNames, or raw names
      rdf:description, xlink:type, xsl:template
  Used for both element and attribute names
Associating Prefixes to URI
  Example
    a large firm could have a number of namespaces for different purposes
    <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
    then you can use local, euro and world as prefixes everywhere within the company element
    typically declared in the topmost element, but could be declared anywhere
    example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax">
  URIs are standardised, not prefixes
    but usually svg, rdf and other prefixes are not re-defined
    also, namespace URIs are just conventional names
      not necessarily pointing to an actual resource
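That the URI, and not the prefix, identifies a namespace can be checked with a namespace-aware DOM parser. This is a minimal sketch using the JAXP classes shipped with the JDK; the class name `PrefixSketch`, the `office` element and the short prefix `e` are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class PrefixSketch {
    /** Returns "{namespaceURI}localName" for the root element of the given document. */
    static String expandedRootName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces unless asked
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return "{" + root.getNamespaceURI() + "}" + root.getLocalName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two different prefixes bound to the same URI: the expanded names coincide.
        System.out.println(expandedRootName("<euro:office xmlns:euro='http://www.company.eu/xml'/>"));
        System.out.println(expandedRootName("<e:office xmlns:e='http://www.company.eu/xml'/>"));
    }
}
```

Both calls yield the same expanded name, which is why applications should compare namespace URIs and local names rather than prefixes.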
Setting Default Namespaces
  xmlns attribute
    alone, no suffix
    <svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
    all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
      no need to make the svg prefix explicit
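A quick way to see the default namespace at work is to ask a namespace-aware parser for the namespace of an unprefixed descendant. The sketch below assumes the JDK's JAXP parser; the class name `DefaultNamespaceSketch` and the `rect` child are illustrative choices.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DefaultNamespaceSketch {
    /** Returns the namespace URI of the first element with the given local name, or null if absent. */
    static String namespaceOf(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            // "*" matches any namespace, so the lookup goes by local name only.
            Node first = doc.getElementsByTagNameNS("*", localName).item(0);
            return first == null ? null : first.getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<svg xmlns='http://www.w3.org/2000/svg'><rect/></svg>";
        System.out.println(namespaceOf(xml, "svg"));
        System.out.println(namespaceOf(xml, "rect")); // unprefixed child inherits the default namespace
    }
}
```

Keep in mind that a default namespace applies to unprefixed elements only; unprefixed attributes stay in no namespace.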
Internationalisation
What does Text Mean?
  “Text” can be encoded according to so many different alphabets
    mapping between characters and integers (code points)
      character set
    ASCII being the most (in)famous, now Unicode
  A character encoding determines how code points are mapped onto bytes
    so a character set can have multiple encodings
    UTF-8 and UTF-16 are both Unicode encodings
  Any XML document is a text document
    so encoding should be declared
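The difference between a code point and its encoded bytes can be made concrete in a few lines of Java. This is just an illustrative sketch (the class name `CodePointsSketch` and the sample word are arbitrary): the same ten characters take a different number of bytes depending on the encoding.

```java
import java.nio.charset.Charset;

class CodePointsSketch {
    /** Number of bytes needed to store the text in the named encoding. */
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // "università": 10 characters, the last one being à (written as an escape
        // so that this source file does not depend on its own encoding).
        String word = "universit\u00E0";
        System.out.println(String.format("last code point: U+%04X", word.codePointAt(9)));
        // UTF-16BE is used instead of plain UTF-16 to leave the byte-order mark out of the count.
        for (String cs : new String[] {"ISO-8859-1", "UTF-8", "UTF-16BE"}) {
            System.out.println(cs + ": " + byteLength(word, cs) + " bytes");
        }
    }
}
```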
The XML Encoding Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
  Most common values
    utf-8, utf-16 (Unicode)
    ISO-8859-1 (Latin-1)
  See also XML-Defined Character Sets
    Unicode and ISO are the most used families
  Used also for external parsed entities
    like DTD fragments or XML chunks
    which may have different encodings
    there, version may be dropped
      it is a text declaration, but no longer an XML declaration
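Why the declared encoding matters can be shown by parsing the same document from raw bytes. The sketch below is an assumption-laden illustration (class name `DeclaredEncodingSketch`, the `city` element and its content are invented): when the declaration matches the bytes the text is recovered, when it does not the parser is expected to reject the input.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclaredEncodingSketch {
    /** Builds a tiny document that declares one encoding and is stored in another (or the same). */
    static byte[] encode(String declaredEncoding, String actualEncoding) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
                   + "<city>Forl\u00EC</city>"; // "Forlì", with a non-ASCII last letter
        return xml.getBytes(Charset.forName(actualEncoding));
    }

    /** Returns the text of the root element, or null if the bytes cannot be parsed. */
    static String textOf(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            return null; // e.g. a byte sequence that is invalid in the declared encoding
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(encode("ISO-8859-1", "ISO-8859-1")));
        System.out.println(textOf(encode("UTF-8", "UTF-8")));
        // Declared UTF-8 but stored as Latin-1: the lone byte for ì is not valid UTF-8.
        System.out.println(textOf(encode("UTF-8", "ISO-8859-1")));
    }
}
```

The parser reads from a byte stream on purpose: when it is handed a character stream instead, decoding has already happened and the declaration is not used for it.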
Multi-Lingual Documents
  Example: a spell-checker or a voice-reader parsing an XML doc
  How to determine the language of a subpart?
    for multi-lingual docs
  xml:lang attribute
    can be associated to any element
    determines the language of the element
  Values are to be found in ISO 639
    standard two letters for each language known
    if not there, IANA
      prefix i-
      such as i-navajo, i-klingon, …
    if not there too, such as for user-defined tags
      prefix x-
      such as x-quenya
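A tool such as the spell-checker mentioned above could resolve the language of an element by looking for the nearest xml:lang on the element itself or on its ancestors, since the XML specification lets the attribute apply to an element's content unless overridden further down. The following is one possible sketch with the JDK's DOM API; the class and method names and the sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LangSketch {
    /** Walks up from the element and returns the nearest xml:lang value, or "" if none is set. */
    static String languageOf(Element element) {
        for (Node cur = element; cur instanceof Element; cur = cur.getParentNode()) {
            Element e = (Element) cur;
            // The xml prefix is always bound to this reserved namespace URI.
            if (e.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                return e.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
            }
        }
        return "";
    }

    /** Convenience wrapper: language of the first element with the given tag name. */
    static String languageOfFirst(String xml, String tagName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return languageOf((Element) doc.getElementsByTagName(tagName).item(0));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc xml:lang='en'><p>Hello</p><p xml:lang='it'>Ciao <em>mondo</em></p></doc>";
        System.out.println(languageOfFirst(xml, "p"));  // language set on the root
        System.out.println(languageOfFirst(xml, "em")); // language set on the enclosing p
    }
}
```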
Encoding for Portability
  Working around encoding is not simply an “internationalisation” issue
    it is also about portability
  When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML's abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML, however, can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following, we explore the XML + CSS issue
Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
  Standard W3C
    http://www.w3.org/Style/CSS/
  Goals
    describing how to present elements of a document
      spanning over a range of different media
    separating style description from content and structure
  In this course, we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
  [example shown as a figure on the original slide]
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing instruction
    to associate CSS to XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
  CSS style sheet defining presentation style for the XML document tags
    tagname { property1: value1; … }
  No need for DTD or Schema
    even though the browser could complain anyway…
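The association is carried by an ordinary processing instruction placed before the root element, so any XML tool can see it. As a small sketch, here is how a program might read it back with the JDK's DOM API; the class name `StylesheetPiSketch` and the file name `recipes.css` are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class StylesheetPiSketch {
    /** Returns the data of the first xml-stylesheet processing instruction, or null if there is none. */
    static String stylesheetData(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // The instruction sits in the prolog, i.e. among the children of the document node.
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE
                        && "xml-stylesheet".equals(((ProcessingInstruction) n).getTarget())) {
                    return ((ProcessingInstruction) n).getData();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                   + "<?xml-stylesheet type=\"text/css\" href=\"recipes.css\"?>"
                   + "<collection/>";
        System.out.println(stylesheetData(xml));
    }
}
```

The `type` and `href` parts look like attributes but are plain text to the XML parser; it is up to the consuming application, here the browser, to interpret them.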
XML + CSS Example: The XML Doc
Example: How Mozilla Visualises it [without CSS Style Sheet]
Example: How Mozilla Visualises it [with CSS Style Sheet]
DOM & SAX
Manipulating XML
  Representing information in an XML Document
    and presenting it somehow
  is not enough for most non-trivial application scenarios
  Often, we need to manipulate (access, delete, modify)
    parts of an XML document
    which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used / hated / deprecated / widespread are
    DOM
    SAX
Document Object Model
  http://www.w3.org/DOM/
    standard W3C, as usual
  The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope
DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
    maybe huge memory consumption
  It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
      primitive navigation and manipulation of XML trees
      other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
      adds Namespace support and minor new features
      other Level 2 parts: Events, Views, Style, Traversal and Range
    Level 3 Core: W3C Recommendation, April 2004 (Working Draft in April 2002)
      adds minor new features
DOM Nodes
  An XML document is a tree
  The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
  The DOM specification states that a node can be a
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
  It also defines which kind of child nodes they should / could have
Properties & Methods of DOM
  Every DOM node has properties and methods to explore and update the XML tree
  Every DOM node has a name, a value, a type
  There are general properties and methods for all kinds of nodes
    attributes returns all the attributes of the node
    appendChild(newChild) appends newChild after the other child nodes
  Then any specific kind of node has its own specific properties and methods
  These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
    see for instance http://java.sun.com/xml/jaxp
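In the Java binding, the general properties listed above surface as getter methods on `org.w3c.dom.Node`. The sketch below uses the JAXP parser from the JDK; the class name `NodeBasicsSketch`, the helper methods and the recipe document are illustrative and not part of the original slides.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeBasicsSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Name, type, value and attribute count of the root element. */
    static String describeRoot(String xml) {
        Element root = parse(xml).getDocumentElement();
        return root.getNodeName()
             + " type=" + root.getNodeType()      // Node.ELEMENT_NODE for elements
             + " value=" + root.getNodeValue()    // null for elements
             + " attributes=" + root.getAttributes().getLength();
    }

    /** Appends a new empty element to the root and returns the names of all its children. */
    static String appendAndListChildren(String xml, String newChildName) {
        Document doc = parse(xml);
        Element root = doc.getDocumentElement();
        root.appendChild(doc.createElement(newChildName)); // goes after the existing children
        List<String> names = new ArrayList<>();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            names.add(n.getNodeName());
        }
        return String.join(",", names);
    }

    public static void main(String[] args) {
        String xml = "<recipe id='r1'><title>Pasta</title></recipe>";
        System.out.println(describeRoot(xml));
        System.out.println(appendAndListChildren(xml, "note"));
    }
}
```

New nodes are created through the owning document (`createElement`) before being attached, which is a design choice of the DOM itself rather than of this example.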
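A Simple Java DOM
  import java.io.PrintStream;
  import org.apache.xerces.parsers.DOMParser;  // Xerces parser, needs Xerces on the classpath
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;

  public class FirstRecipe {
      public static void main(String[] args) {
          try {
              DOMParser p = new DOMParser();
              p.parse(args[0]);
              Document doc = p.getDocument();
              // Skip the children of the root until the first recipe element is found.
              Node n = doc.getDocumentElement().getFirstChild();
              while (n != null && !n.getNodeName().equals("recipe"))
                  n = n.getNextSibling();
              PrintStream out = System.out;
              out.println("<?xml version=\"1.0\"?>");
              out.println("<collection>");
              if (n != null)
                  print(n, out);  // helper that serialises a node; not shown on the slide
              out.println("</collection>");
          } catch (Exception e) {
              e.printStackTrace();
          }
      }
  }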
Main Problem of DOM
  The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist
Simple API for XML (SAX)
  Differently from DOM, SAX is event-based
  It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as soon as the document starts / ends, elements start / end, character content is found, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not so well-supported as DOM is
    in terms of standardisation
    as well as of tools
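The event-based model can be sketched with the SAX support included in the JDK: the application supplies a handler, and the parser calls it back while the text flows through. The class name `SaxEventsSketch` and the event labels are invented for this example; nothing here builds a tree in memory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsSketch {
    /** Parses the document and returns a log of the events received, in order. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        events("<collection><recipe>Pasta</recipe></collection>").forEach(System.out::println);
    }
}
```

Because each event is handled and then forgotten, memory use stays roughly constant regardless of document size; the price is that the application has to keep track of any context it needs by itself.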
Other DTD Declarations etc
ENTITY declarationsltENTITY footer SYSTEM httpliadeisuniboit~aofootergt
NOTATION declarationswho cares actually
We stop heremore only for those who need it
49
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tagsprefix x-
such as x-quenya58
Encoding for Portability
Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability
When transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time
59
XML amp CSS
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip
No need for DTD or Schemaeven though the browser could anyway complainhellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it[without CSS Style Sheet]
66
Example How Mozilla Visualises it[with CSS Style Sheet]
67
DOM amp SAX
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sortsthrough ad hoc API
The most used hated deprecated widespread areDOMSAX
69
Document Object Model
httpwwww3orgDOMstandard W3C as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range
Level 3 Core W3C Working Draft April 2002adds minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Javasee for instance httpjavasuncomxmljaxp
73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML (SAX)
Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
Namespaces
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tagsprefix x-
such as x-quenya58
Encoding for Portability
Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability
When transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time
59
XML amp CSS
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip
No need for DTD or Schemaeven though the browser could anyway complainhellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it[without CSS Style Sheet]
66
Example How Mozilla Visualises it[with CSS Style Sheet]
67
DOM amp SAX
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sortsthrough ad hoc API
The most used hated deprecated widespread areDOMSAX
69
Document Object Model
httpwwww3orgDOMstandard W3C as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range
Level 3 Core W3C Working Draft April 2002adds minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Javasee for instance httpjavasuncomxmljaxp
73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
it might be time-consuming and difficult to managewouldnt it be better if we could load only the part we are actually manipulating
This is the motivation behind SAXwhich is not started as a standardhas problems of acceptancebut has indeed a long tail of followersand also its good reasons to exist
75
Simple API for XML (SAX)
Differently from DOM SAX is event-basedIt sees the document not as a tree but as a text doc
flowing through the SAX parserand generating events as soon as document started ended elements started ended character content etc
A very simple modelgood for simple applicationsand also to avoid memory abuse
Not so well-supported as DOM isin terms of standardisationas well as of tools
76
What are Namespaces for
Distinguishdifferent XML applications may use the same names
at any scale from personal to world-widea namespace allows them to be clearly distinguished
Groupnames of elements and attributes of the same XML application can be grouped together
to be more easily recognised and handledExample set is an element in both SVG and MathML applications
what if I have to use them togethernamespaces can be used to disambiguate names
51
Syntax for Namespace Use
Qualified namesprefix local_part
Examples of qualified namesor QNames or raw names
rdfdescription xlinktype xsltemplateUsed for both element and attribute names
52
Associating Prefixes to URIExample
a large firm could have a number of namespaces for different purposesltcompany xmlnslocal=httpwwwcompanyitxml xmlnseuro =httpwwwcompanyeuxml xmlnsworld=httpwwwcompanycomxmlgt
then you can use local euro and world everywhere as prefixestypically declared in the topmost element but could be declared anywhereexample ltrdfRDF xmlnsrdf=httpwwww3corgTRREC-rdf-syntaxgt
URI are standardised not prefixesbut usually svg rdf and other prefixes are not re-definedalso they are conventional names
not necessarily pointing to an actually resource53
Setting Default Namespaces
xmlns attributealone no suffix
ltsvg xmlns=httpwwww3corg2000svg width=hellip height=hellipgthellipltsvggt
all the elements inside (including svg) are implicitly associated to the httpwwww3corg2000svg namespace
no need for the svg prefix made explicity
54
Internationalisation
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tagsprefix x-
such as x-quenya58
Encoding for Portability
Working around encoding is not simply an ldquointernationalisationrdquo issueit is also about portability
When transmitting communicating through text-based files many errors typically occur
which are often not easy to catchXML abilities to
handle encoding precisely and accuratelyembody encoding information within each document
make it a powerful tool for easy and hassle-free portabilityacross platforms across applications across time
59
XML amp CSS
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip
No need for DTD or Schemaeven though the browser could anyway complainhellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it[without CSS Style Sheet]
66
Example How Mozilla Visualises it[with CSS Style Sheet]
67
DOM amp SAX
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sortsthrough ad hoc API
The most used hated deprecated widespread areDOMSAX
69
Document Object Model
httpwwww3orgDOMstandard W3C as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range
Level 3 Core W3C Working Draft April 2002adds minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Javasee for instance httpjavasuncomxmljaxp
73
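A Simple Java DOM
// Imports assumed: java.io.PrintStream, org.w3c.dom.Document, org.w3c.dom.Node,
// and a DOMParser class (presumably org.apache.xerces.parsers.DOMParser from Apache Xerces).
public static void main(String[] args) {
  try {
    // Parse the file named on the command line into an in-memory DOM tree.
    DOMParser p = new DOMParser();
    p.parse(args[0]);
    Document doc = p.getDocument();
    // Scan the children of the root element for the first "recipe" node.
    Node n = doc.getDocumentElement().getFirstChild();
    while (n != null && !n.getNodeName().equals("recipe"))
      n = n.getNextSibling();
    // Wrap the recipe found (if any) in a new <collection> document.
    PrintStream out = System.out;
    out.println("<?xml version=\"1.0\"?>");
    out.println("<collection>");
    if (n != null)
      print(n, out);  // print(Node, PrintStream) is a helper not shown on this slide
    out.println("</collection>");
  } catch (Exception e) {
    e.printStackTrace();
  }
}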
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also good reasons to exist
Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text document
  flowing through the SAX parser
  and generating events as the document starts and ends, elements start and end, character content is found, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well supported as DOM
  in terms of standardisation
  as well as of tools
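The event-based model described above can be sketched with the SAX parser bundled with the JDK. This is an illustration only: the class name SaxEventsDemo, the events helper and the sample document are assumptions, and the handler merely records the events it receives instead of building any tree.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventsDemo {
    // Returns the sequence of events the parser generates while the text flows through it.
    static List<String> events(String xml) throws Exception {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text:" + text);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document.
        for (String event : events("<student><name>Carlo</name></student>")) {
            System.out.println(event);
        }
    }
}
```

Nothing is kept in memory beyond what the handler chooses to store, which is the point of the approach: an application that only counts or filters elements never needs the whole document at once.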
Syntax for Namespace Use
Qualified names
  prefix:local_part
Examples of qualified names
  also called QNames or raw names
  rdf:description, xlink:type, xsl:template
Used for both element and attribute names
Associating Prefixes to URI
Example
  a large firm could have a number of namespaces for different purposes
  <company xmlns:local="http://www.company.it/xml" xmlns:euro="http://www.company.eu/xml" xmlns:world="http://www.company.com/xml">
  then you can use local, euro and world everywhere as prefixes
  typically declared in the topmost element, but could be declared anywhere
  example: <rdf:RDF xmlns:rdf="http://www.w3.org/TR/REC-rdf-syntax#">
URIs are standardised, not prefixes
  but usually svg, rdf and other prefixes are not re-defined
  also, namespace URIs are conventional names
    not necessarily pointing to an actual resource
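How a parser resolves a prefix to its URI can be observed with a namespace-aware JAXP parser. The sketch below reuses two of the company namespaces from the example; the class name PrefixDemo and the office child elements are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class PrefixDemo {
    // The <local:office/> and <world:office/> children are made up for this sketch.
    static final String XML =
        "<company xmlns:local=\"http://www.company.it/xml\" "
      + "xmlns:world=\"http://www.company.com/xml\">"
      + "<local:office/><world:office/></company>";

    // Lists "prefix | local part | namespace URI" for each child element of the root.
    static List<String> describeChildren(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // namespace processing is off by default in JAXP
        Element root = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
            .getDocumentElement();
        List<String> result = new ArrayList<>();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                result.add(n.getPrefix() + " | " + n.getLocalName() + " | " + n.getNamespaceURI());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String line : describeChildren(XML)) {
            System.out.println(line);
        }
        // local | office | http://www.company.it/xml
        // world | office | http://www.company.com/xml
    }
}
```

Both children have the same local part, office, and are told apart only by the URI their prefix is bound to; the parser never tries to fetch that URI.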
Setting Default Namespaces
xmlns attribute
  alone, no suffix
<svg xmlns="http://www.w3.org/2000/svg" width="…" height="…">…</svg>
  all the elements inside (including svg) are implicitly associated to the http://www.w3.org/2000/svg namespace
  no need to make the svg prefix explicit
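The effect of a default namespace can be checked in the same way. In this sketch the class name DefaultNamespaceDemo, the rect child and the concrete width and height values are assumptions; only the svg element and its namespace URI come from the example above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

public class DefaultNamespaceDemo {
    // Hypothetical content: the rect child and the numeric sizes are made up.
    static final String SVG =
        "<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"10\" height=\"10\">"
      + "<rect width=\"5\" height=\"5\"/></svg>";

    static Element root(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
            .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element svg = root(SVG);
        Element rect = (Element) svg.getFirstChild();

        // No prefix is written, yet both elements belong to the SVG namespace.
        System.out.println(svg.getPrefix());         // null
        System.out.println(svg.getNamespaceURI());   // http://www.w3.org/2000/svg
        System.out.println(rect.getNamespaceURI());  // http://www.w3.org/2000/svg

        // Side note: the default namespace applies to elements, not to unprefixed attributes.
        System.out.println(rect.getAttributeNode("width").getNamespaceURI()); // null
    }
}
```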
Internationalisation
What does "Text" Mean?
"Text" can be encoded according to so many different alphabets
  mapping between characters and integers (code points)
  character set
  ASCII being the most (in)famous, now Unicode
A character encoding determines how code points are mapped onto bytes
  so a character set can have multiple encodings
  UTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text document
  so encoding should be declared
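The difference between a code point and its encodings can be made concrete with a few lines of Java. This is a small sketch: the class name EncodingDemo and the byteLength helper are assumptions, and UTF-16BE is used so that no byte-order mark is counted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Number of bytes the given text occupies in the given encoding.
    static int byteLength(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String eGrave = "\u00e8"; // the letter e with grave accent, code point U+00E8

        // One character, one code point...
        System.out.println(eGrave.codePointAt(0));                            // 232

        // ...but a different number of bytes depending on the encoding.
        System.out.println(byteLength(eGrave, StandardCharsets.UTF_8));       // 2
        System.out.println(byteLength(eGrave, StandardCharsets.UTF_16BE));    // 2
        System.out.println(byteLength(eGrave, StandardCharsets.ISO_8859_1));  // 1

        // Plain ASCII letters take one byte in UTF-8 but two in UTF-16.
        System.out.println(byteLength("A", StandardCharsets.UTF_8));          // 1
        System.out.println(byteLength("A", StandardCharsets.UTF_16BE));       // 2
    }
}
```

A reader of the bytes cannot tell these cases apart without being told which encoding was used, which is why an XML document should declare it.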
The XML Encoding
Part of the XML Declaration
  <?xml version="1.0" encoding="utf-8" standalone="no"?>
Most common values
  utf-8, utf-16 (Unicode)
  ISO-8859-1 (Latin-1)
See also XML-Defined Character Sets
  Unicode and ISO are the most used families
Used also for external parsed entities
  like DTD fragments or XML chunks
  which may have different encodings
  there, version may be dropped
    it is a text declaration, but no longer an XML declaration
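That a parser relies on the declared encoding can be seen by handing it raw bytes rather than a string. In this sketch the class name DeclaredEncodingDemo, the parseText helper and the sample city element are assumptions; the document is deliberately stored as Latin-1 bytes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DeclaredEncodingDemo {
    // Parses raw bytes; the parser reads the XML declaration to decide how to decode them.
    static String parseText(byte[] bytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document whose text ends with a non-ASCII letter (a with grave accent).
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<city>Universit\u00e0</city>";

        // Store it as Latin-1 bytes, as announced in the declaration.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);

        // The text comes back intact because declaration and bytes agree.
        System.out.println(parseText(latin1).equals("Universit\u00e0")); // true
    }
}
```

Without the declaration the parser would assume a Unicode encoding, and the single byte used by Latin-1 for the accented letter would not be valid UTF-8.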
Multi-Lingual Documents
Example: a spell-checker or a voice-reader parsing an XML doc
How to determine the language of a subpart?
  for multi-lingual docs
xml:lang attribute
  can be associated to any element
  determines the language of the element
Values are to be found in ISO 639
  standard: two letters for each known language
  if not there, IANA
    prefix i-
    such as i-navajo, i-klingon, …
  if not there too, such as for user-defined tags
    prefix x-
    such as x-quenya
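A tool such as the spell-checker mentioned above could determine the language of a subpart roughly as follows. This is a sketch under stated assumptions: the class name XmlLangDemo, the languageOf helper and the sample document are hypothetical, and the lookup relies on the XML rule that xml:lang also covers the content of an element unless a descendant overrides it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class XmlLangDemo {
    // Hypothetical multi-lingual document.
    static final String XML =
        "<doc xml:lang=\"en\"><p>Hello</p><p xml:lang=\"it\"><b>Ciao</b></p></doc>";

    // Walks from the node up to the root and returns the nearest xml:lang value, or null.
    static String languageOf(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                if (e.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                    return e.getAttributeNS(XMLConstants.XML_NS_URI, "lang");
                }
            }
        }
        return null;
    }

    static Element root(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to look up xml:lang by namespace
        return factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
            .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element doc = root(XML);
        Node firstP = doc.getFirstChild();
        Node bold = doc.getLastChild().getFirstChild();

        System.out.println(languageOf(firstP)); // en (taken from the doc element)
        System.out.println(languageOf(bold));   // it (taken from the enclosing p)
    }
}
```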
Encoding for Portability
Dealing with encoding is not simply an "internationalisation" issue
  it is also about portability
When transmitting / communicating through text-based files, many errors typically occur
  which are often not easy to catch
XML abilities to
  handle encoding precisely and accurately
  embody encoding information within each document
make it a powerful tool for easy and hassle-free portability
  across platforms, across applications, across time
XML & CSS
XML on Browsers
Different experiences with different browsers
  when trying to visualise an XML document
XML, however, can be transformed
  to become easier to handle by standard browsers
Two main approaches
  Web-based one: XML + CSS
  XML-based one: XSL
In the following, we explore the XML + CSS approach
Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents
Standard W3C
  http://www.w3.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
  tagname { property1: value1; … }
No need for DTD or Schema
  even though the browser could complain anyway…
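The association between document and style sheet travels inside the document itself, as a processing instruction placed before the root element. The sketch below shows how a program could find it; the class name StylesheetPiDemo, the stylesheetData helper, the file name filename.css and the sample document are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;

public class StylesheetPiDemo {
    // Hypothetical document pointing at a hypothetical style sheet file.
    static final String XML =
        "<?xml-stylesheet type=\"text/css\" href=\"filename.css\"?>"
      + "<student><name>Carlo</name></student>";

    // Returns the data of the first xml-stylesheet processing instruction, or null.
    static String stylesheetData(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // The instruction is a child of the document node, a sibling of the root element.
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE
                    && "xml-stylesheet".equals(n.getNodeName())) {
                return ((ProcessingInstruction) n).getData();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(stylesheetData(XML)); // type="text/css" href="filename.css"
    }
}
```

The type and href parts look like attributes but reach the program as one plain string, so extracting the href is left to the application, which here would be the browser.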
XML + CSS Example: The XML Doc
Example: How Mozilla Visualises It [without CSS Style Sheet]
Example: How Mozilla Visualises It [with CSS Style Sheet]
DOM & SAX
Manipulating XML
Representing information in an XML Document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Mostly, we need to manipulate / access / delete / modify
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM
  SAX
Document Object Model
http://www.w3.org/DOM
  standard W3C, as usual
"The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents"
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
What does Text Mean
ldquoTextrdquo can be encoded according so many different alphabetsmapping between characters and integers (code points)
character setASCII being the most (un)famous now Unicode
A character encoding determines how code points are mapped onto bytes
so a character set can have multiple encodingsUTF-8 and UTF-16 are both Unicode encodings
Any XML document is a text documentso encoding should be declared
56
The XML Encoding Part of the XML Declaration
ltxml version=10 encoding=utf-8 standalone=nogtMost common valuesutf-8 utf-16 (Unicode)ISO-8859-1 (Latin-1)
See also XML-Defined Character SetsUnicode and ISO are the most used families
Used also for external parsed entitieslike DTD fragments or XML chunkswhich may have different encodingsthere version may be dropped
it is a text declaration but no longer a XML declaration
57
Multi-Lingual DocumentsExample a spell-checker or a voice-reader parsing an XML docHow to determine the language of a subpart
for multi-lingual docsxmllang attribute
can be associated to any elementdetermines the language of the element
Values are to be found in ISO 639standard two letters for each language knownif not there IANA
prefix i-such as i-navajo i-klingon hellip
if not there too such as for user-defined tagsprefix x-
such as x-quenya58
Encoding for Portability
  Working around encoding is not simply an "internationalisation" issue
    it is also about portability
  When transmitting / communicating through text-based files, many errors typically occur
    which are often not easy to catch
  XML's abilities to
    handle encoding precisely and accurately
    embody encoding information within each document
  make it a powerful tool for easy and hassle-free portability
    across platforms, across applications, across time
XML & CSS
XML on Browsers
  Different experiences with different browsers
    when trying to visualise an XML document
  XML however can be transformed
    to become easier to handle by standard browsers
  Two main approaches
    Web-based one: XML + CSS
    XML-based one: XSL
  In the following we explore the XML + CSS issue
Cascading Style Sheets
  Cascading Style Sheets (CSS)
    a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
  W3C standard
    http://w3c.org/Style/CSS
  Goals
    describing how to present elements of a document
    spanning over a range of different media
    separating style description from content and structure
  In this course we assume that you already know the basics
    if not, look at http://www.w3.org/Style/CSS/learning
CSS: An Example
XML + CSS
  Any XML document can be prepared for browser visualisation via CSS
  Two things needed
    a CSS style sheet referring to the proper element types of the XML document
    the association between the XML document and the CSS style sheet
  Processing instruction
    to associate CSS with XML
    <?xml-stylesheet type="text/css" href="filename.css"?>
  CSS style sheet defining the presentation style for the XML document tags
    tagname { attribute1: value1; … }
  No need for DTD or Schema
    even though the browser could complain anyway…
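The association between document and style sheet is an ordinary processing instruction in the document prolog, so any XML parser exposes it. As a rough sketch, the following Java code reads it back through DOM; the class name StylesheetPiDemo, the helper stylesheetData and the file name recipes.css are hypothetical and only serve the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical demo class: locate the xml-stylesheet processing instruction.
final class StylesheetPiDemo {
    // Returns the data of the first xml-stylesheet instruction in the prolog,
    // or null when the document declares no style sheet.
    static String stylesheetData(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Prolog instructions are direct children of the Document node.
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof ProcessingInstruction
                        && "xml-stylesheet".equals(((ProcessingInstruction) n).getTarget())) {
                    return ((ProcessingInstruction) n).getData();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<?xml-stylesheet type=\"text/css\" href=\"recipes.css\"?>"
                + "<collection/>";
        System.out.println(stylesheetData(xml));
    }
}
```

Note that type and href are not real attributes: they are plain text inside the instruction, which the browser interprets by convention.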
XML + CSS Example: The XML Doc
Example: How Mozilla Visualises It [without CSS Style Sheet]
Example: How Mozilla Visualises It [with CSS Style Sheet]
DOM & SAX
Manipulating XML
  Representing information in an XML document
    and presenting it somehow
    is not enough for most non-trivial application scenarios
  Often we need to manipulate (access / delete / modify)
    parts of an XML document
    which may or may not be an XML file
  This is typically done through programming languages of many sorts
    through ad hoc APIs
  The most used / hated / deprecated / widespread are
    DOM
    SAX
Document Object Model
  http://www.w3.org/DOM/
    W3C standard, as usual
  The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
  It applies to HTML as well as XML
  It is essentially an API
    standardised for Java & ECMAScript
    but can be extended to other languages
  There is no time here to go deep into DOM
    we just try to understand its nature, goals and scope
DOM & Levels
  DOM views an XML tree as a data structure
    similar to the DOM from JavaScript
  DOM loads the whole XML document in memory to manipulate it
    possibly huge memory consumption
  It is quite large and complex…
    Level 1 Core: W3C Recommendation, October 1998
      primitive navigation and manipulation of XML trees
      other Level 1 parts: HTML
    Level 2 Core: W3C Recommendation, November 2000
      adds Namespace support and minor new features
      other Level 2 parts: Events, Views, Style, Traversal and Range
    Level 3 Core: W3C Recommendation, April 2004
      adds minor new features
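To make the namespace support of Level 2 concrete, here is a minimal sketch using the JAXP DOM parser. The whole document is loaded into memory first, then queried by namespace URI and local name. The class name DomNamespaceDemo, the helper countInNamespace and the example.org namespace URIs are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: namespace-aware lookup on an in-memory DOM tree.
final class DomNamespaceDemo {
    // Counts the elements with the given local name in the given namespace.
    static int countInNamespace(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            // parse() builds the complete tree before returning.
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<r:collection xmlns:r=\"http://example.org/recipes\""
                + " xmlns:o=\"http://example.org/other\">"
                + "<r:recipe/><o:recipe/><r:recipe/></r:collection>";
        // Only the two recipe elements in the "recipes" namespace are counted.
        System.out.println(countInNamespace(xml, "http://example.org/recipes", "recipe"));
    }
}
```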
DOM Nodes
  An XML document is a tree
  The tree contains nodes
    one of them is a root node
    nodes possibly have siblings, children, one parent, content, tag, etc.
  The DOM specification states that a node can be a
    document, doc fragment, doc type, element, attribute, processing instruction, comment, text, CDATA section, entity, notation
  It also defines which kinds of child nodes they should / could have
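The node kinds listed above can be observed by walking a small tree and printing each node's type and name. This is just a sketch with the JAXP DOM parser; the class name NodeTypeDemo, its helpers and the sample document are assumptions, and only a few of the node types are given a readable label.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: print the type and name of every node in a tree.
final class NodeTypeDemo {
    static String typeName(Node n) {
        switch (n.getNodeType()) {
            case Node.DOCUMENT_NODE:               return "DOCUMENT";
            case Node.ELEMENT_NODE:                return "ELEMENT";
            case Node.TEXT_NODE:                   return "TEXT";
            case Node.CDATA_SECTION_NODE:          return "CDATA_SECTION";
            case Node.COMMENT_NODE:                return "COMMENT";
            case Node.PROCESSING_INSTRUCTION_NODE: return "PROCESSING_INSTRUCTION";
            default:                               return "OTHER(" + n.getNodeType() + ")";
        }
    }

    // Depth-first walk; one output line per node, indented by depth.
    static void walk(Node n, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(typeName(n)).append(' ').append(n.getNodeName()).append('\n');
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    static String describe(String xml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(describe("<a><!--note--><b>hi</b><![CDATA[x<y]]></a>"));
    }
}
```

Attributes do not show up in this walk: in DOM they are nodes, but they are not children of their element.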
Properties & Methods of DOM
  Every DOM node has properties and methods to explore and update the XML tree
  Every DOM node has a name, a value, a type
  There are general properties and methods for all kinds of nodes
    attributes: returns all the attributes of the node
    appendChild(newChild): appends newChild after the other child nodes
  Then, any specific kind of node has its own specific properties and methods
  These properties and methods are made available by the suitable API for the language of choice
    many solutions for Java
    see for instance http://java.sun.com/xml/jaxp/
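In the Java binding the attributes property becomes getAttributes() and appendChild keeps its name. A minimal sketch of both follows; the class name DomPropertiesDemo, its helpers and the recipe document are assumptions made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical demo class: general node properties and methods in Java.
final class DomPropertiesDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    // Names of all attributes of the element, sorted because
    // a NamedNodeMap does not promise any particular order.
    static List<String> attributeNames(Element e) {
        NamedNodeMap attrs = e.getAttributes();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            names.add(attrs.item(i).getNodeName());
        }
        Collections.sort(names);
        return names;
    }

    // Creates an <ingredient name="..."/> element and appends it
    // after the existing children of `recipe`.
    static Element appendIngredient(Element recipe, String name) {
        Element ingredient = recipe.getOwnerDocument().createElement("ingredient");
        ingredient.setAttribute("name", name);
        recipe.appendChild(ingredient);
        return ingredient;
    }

    public static void main(String[] args) {
        Element recipe = parse("<recipe id=\"r1\" title=\"Pesto\"><ingredient name=\"basil\"/></recipe>")
                .getDocumentElement();
        System.out.println(recipe.getNodeName() + " " + attributeNames(recipe));
        appendIngredient(recipe, "garlic");
        System.out.println("children: " + recipe.getChildNodes().getLength());
    }
}
```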
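A Simple Java DOM
    // Prints the first <recipe> child of the root element inside a new <collection>.
    // DOMParser is assumed to be the Apache Xerces class; print(Node, PrintStream)
    // is a helper that is not shown on the slide. The enclosing class is omitted.
    import java.io.PrintStream;
    import org.apache.xerces.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public static void main(String[] args) {
      try {
        DOMParser p = new DOMParser();
        p.parse(args[0]);                 // args[0]: URI of the XML document
        Document doc = p.getDocument();
        Node n = doc.getDocumentElement().getFirstChild();
        // skip siblings until the first <recipe> element is found
        while (n != null && !n.getNodeName().equals("recipe"))
          n = n.getNextSibling();
        PrintStream out = System.out;
        out.println("<?xml version=\"1.0\"?>");
        out.println("<collection>");
        if (n != null)
          print(n, out);
        out.println("</collection>");
      } catch (Exception e) {
        e.printStackTrace();
      }
    }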
Main Problem of DOM
  The XML document is loaded as a whole and handled altogether in memory
    it might be time-consuming and difficult to manage
    wouldn't it be better if we could load only the part we are actually manipulating?
  This is the motivation behind SAX
    which did not start as a standard
    has problems of acceptance
    but has indeed a long tail of followers
    and also its good reasons to exist
Simple API for XML (SAX)
  Differently from DOM, SAX is event-based
  It sees the document not as a tree, but as a text doc
    flowing through the SAX parser
    and generating events as they occur: document started / ended, elements started / ended, character content, etc.
  A very simple model
    good for simple applications
    and also to avoid memory abuse
  Not so well supported as DOM is
    in terms of standardisation
    as well as of tools
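The event-based model can be sketched with the SAX parser that ships with Java: the application registers a handler, and the parser calls it back while the text flows through, without ever building a tree. The class name SaxEventsDemo, the helper events and the sample document are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: record the SAX events raised for a document.
final class SaxEventsDemo {
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes attrs) {
                log.add("start " + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one text run in several chunks;
                // whitespace-only chunks are skipped here.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("chars " + text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<collection><recipe>Pesto</recipe></collection>")) {
            System.out.println(event);
        }
    }
}
```

Only the handler's own state is kept in memory, which is what makes this model attractive for very large documents.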
XML on Browsers
Different experiences with different browserswhen trying to visualise an XML document
XML however can be transformedto become easier to handle by standard browsers
Two main approachesWeb-based one XML + CSSXML-based one XSL
In the following we explore the XML + CSS issue
61
Cascading Style Sheets
Cascading Style Sheets (CSS)a simple mechanism for adding style (eg fonts colors spacing) to Web documents
Standard W3Chttpw3corgStyleCSS
Goalsdescribing how to present elements of a document
spanning over a range of different mediaseparating style description from content and structure
In this course we assume that you already know the basicsif not look at httpwwww3orgStyleCSSlearning
62
CSS An Example
63
XML + CSSAny XML documents can be prepared for browser visualisation via CSSTwo things needed
a CSS style sheet referring to the proper elements types of the XML documentthe association between the XML document and the CSS style sheet
Processing directiveto associate CSS to XML
ltxml-stylesheet type=textcss href=nomefilecss gt
CSS style sheet defining presentation style for the XML document tagsnometag attributo1 valore1 hellip
No need for DTD or Schemaeven though the browser could anyway complainhellip
64
XML + CSS Example The XML Doc
65
Example How Mozilla Visualises it[without CSS Style Sheet]
66
Example How Mozilla Visualises it[with CSS Style Sheet]
67
DOM amp SAX
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sortsthrough ad hoc API
The most used hated deprecated widespread areDOMSAX
69
Document Object Model
httpwwww3orgDOMstandard W3C as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range
Level 3 Core W3C Working Draft April 2002adds minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Javasee for instance httpjavasuncomxmljaxp
73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()

Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  it might be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has problems of acceptance
  but has indeed a long tail of followers
  and also its good reasons to exist

Simple API for XML (SAX)
Differently from DOM, SAX is event-based
It sees the document not as a tree, but as a text document
  flowing through the SAX parser
  and generating events as it goes: document started / ended, element started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well supported as DOM is
  in terms of standardisation
  as well as of tools
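
The event-based model described above can be sketched with the SAX classes shipped in the JDK: the parser reads the document as a stream and calls back a handler, and no tree is built. The class SaxEvents and its Recorder handler are assumptions made for illustration; a real application would react to the events instead of just recording them.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxEvents {
    /** Records each callback as a short string, in the order the parser fires them. */
    static class Recorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override public void startDocument() { events.add("startDocument"); }

        @Override public void endDocument() { events.add("endDocument"); }

        @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start:" + qName);
        }

        @Override public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }

        @Override public void characters(char[] ch, int start, int length) {
            // A parser may deliver one text run in several chunks; whitespace-only chunks are skipped
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("text:" + text);
            }
        }
    }

    /** Streams the document through a SAX parser and returns the recorded events. */
    static List<String> events(String xml) {
        try {
            Recorder recorder = new Recorder();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), recorder);
            return recorder.events;
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        events("<student><name>Carlo</name></student>").forEach(System.out::println);
    }
}
```

Memory use here depends on what the handler chooses to keep, not on the size of the document, which is the point made in the comparison with DOM.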

Cascading Style Sheets
Cascading Style Sheets (CSS)
  a simple mechanism for adding style (e.g. fonts, colors, spacing) to Web documents
W3C standard
  http://w3c.org/Style/CSS
Goals
  describing how to present elements of a document
    spanning over a range of different media
  separating style description from content and structure
In this course we assume that you already know the basics
  if not, look at http://www.w3.org/Style/CSS/learning

CSS: An Example

XML + CSS
Any XML document can be prepared for browser visualisation via CSS
Two things needed
  a CSS style sheet referring to the proper element types of the XML document
  the association between the XML document and the CSS style sheet
Processing instruction
  to associate CSS to XML
  <?xml-stylesheet type="text/css" href="filename.css"?>
CSS style sheet defining presentation style for the XML document tags
  tagname { attribute1: value1; … }
No need for DTD or Schema
  even though the browser could complain anyway…

XML + CSS Example: The XML Doc

Example: How Mozilla Visualises It [without CSS Style Sheet]

Example: How Mozilla Visualises It [with CSS Style Sheet]

DOM & SAX

Manipulating XML
Representing information in an XML document
  and presenting it somehow
  is not enough for most non-trivial application scenarios
Most often, we need to manipulate (access, delete, modify)
  parts of an XML document
  which may or may not be an XML file
This is typically done through programming languages of many sorts
  through ad hoc APIs
The most used / hated / deprecated / widespread are
  DOM
  SAX

Document Object Model
http://www.w3.org/DOM
  W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
Manipulating XML Representing information in an XML Document
and presenting it somehowis not enough for most non-trivial application scenarios
Mostly we often need to manipulateaccess delete modify
parts of an XML documentwhich either may or may not be and XML file
This is typically dome through programming language of many sortsthrough ad hoc API
The most used hated deprecated widespread areDOMSAX
69
Document Object Model
httpwwww3orgDOMstandard W3C as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content structure and style of documents
It applies to HTML as well as XMLIt is essentially an API
standardised for Java amp ECMAScriptbut can be extended to other languages
There is no time here to go deep into DOMwe just try to understand its nature goals and scope
70
DOM amp LevelsDOM views an XML tree as a data structure
similar to the DOM from JavascriptDOM loads the whole XML document in memory to manipulate it
maybe huge memory consumptionIt is quite large and complexhellip
Level 1 Core W3C Recommendation October 1998primitive navigation and manipulation of XML treesother Level 1 parts HTML
Level 2 Core W3C Recommendation November 2000adds Namespace support and minor new featuresother Level 2 parts Events Views Style Traversal and Range
Level 3 Core W3C Working Draft April 2002adds minor new features
71
DOM Nodes
An XML document is a treeThe tree contains nodes
one of them is a root nodenodes possibly have siblings children one parent content tag etc
The DOM specification states that a node can containdocument doc fragment doc type element attribute processing instruction comment text CDATA section entity notation
It also defines which kind of child nodes they should could have
72
Properties amp Methods of DOM Every DOM node has properties and methods to explore and update the XML treeEvery DOM node has a name a value a typeThere are general properties and methods for all kinds of nodesattributes returns all the attributes of the nodeappendChild(newChild) appends newChild after the other child nodes
Then any specific kind of node has its own specific properties and methodsThese properties and methods are made available by the suitable API for the language of choice
many solutions for Javasee for instance httpjavasuncomxmljaxp
73
A Simpe Java DOM public static void main(String[] args) try DOMParser p = new DOMParser() pparse(args[0]) Document doc = pgetDocument() Node n = docgetDocumentElement()getFirstChild() while (n=null ampamp ngetNodeName()equals(recipe)) n = ngetNextSibling() PrintStream out = Systemout outprintln(ltxml version=10gt) outprintln(ltcollectiongt) if (n=null) print(n out) outprintln(ltcollectiongt) catch (Exception e) eprintStackTrace()
74
Main Problem of DOM
The XML document is loaded as a whole and handled altogether in memory
  this may be time-consuming and difficult to manage
  wouldn't it be better if we could load only the part we are actually manipulating?
This is the motivation behind SAX
  which did not start as a standard
  has had problems of acceptance
  but does have a long tail of followers
  and also good reasons to exist

Simple API for XML (SAX)
Unlike DOM, SAX is event-based
It sees the document not as a tree, but as a text document
  flowing through the SAX parser
  and generating events as they occur: document started / ended, element started / ended, character content, etc.
A very simple model
  good for simple applications
  and also to avoid memory abuse
Not as well supported as DOM
  in terms of standardisation
  as well as of tools

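The event-based model can be sketched as follows with the SAX classes shipped with Java. The class SaxEvents and the labels put in the log are hypothetical names for this illustration; DefaultHandler and its callbacks are the real SAX API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original slides.
class SaxEvents {
    // Parses the XML text and returns the events in the order they were fired.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may deliver one text run in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text:" + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<student><name>Carlo</name></student>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built here: the handler only sees one event at a time, so memory use does not grow with the size of the document unless the application itself stores what it sees.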
Document Object Model
http://www.w3.org/DOM/
  a W3C standard, as usual
The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents
It applies to HTML as well as XML
It is essentially an API
  standardised for Java & ECMAScript
  but can be extended to other languages
There is no time here to go deep into DOM
  we just try to understand its nature, goals and scope

DOM & Levels
DOM views an XML tree as a data structure
  similar to the DOM from JavaScript
DOM loads the whole XML document in memory to manipulate it
  possibly huge memory consumption
It is quite large and complex…