Introduction to XML and its processing techniques

XML Model and Processing Transparency No. 1 Introduction to XML and its processing techniques Cheng-Chia Chen 4/22 2003


Page 1: Introduction to XML and its processing techniques

XML Model and Processing

Transparency No. 1

Introduction to XML and its processing techniques

Cheng-Chia Chen

4/22 2003

Page 2: Introduction to XML and its processing techniques


Outline

What is XML?
A glimpse of XML
Why do we need XML?
Some XML applications
XML and related core specifications
APIs for XML
Combining XML technology with traditional language processing technology
Other important XML programming technology
Summary and information for further study

Page 3: Introduction to XML and its processing techniques


What is XML?

The eXtensible Markup Language:
  a data format (syntax) used for the representation, storage and transmission of data whose format is defined in XML.
  a data-structure definition language: lets you define the structure and format of your own data.
  a text-based markup language: lets you define your own HTML-like markup languages.
  recommended by the World Wide Web Consortium (W3C) in Feb. 1998.
  intended to be used as a new message format over the Internet, to make up for the inadequacies of HTML.

Page 4: Introduction to XML and its processing techniques


The idea of XML

Existing student information:

S9010  張得功  資科系 (Computer Science)      三年級 (3rd year)  [email protected]
S9021  王德財  應數系 (Applied Mathematics)  二年級 (2nd year)  null

Page 5: Introduction to XML and its processing techniques


HTML's concerns

How to present the data:

<TABLE BORDER=1 bgcolor="yellow">
  <TR><TH> 學號 </TH> <TH> 姓名 </TH> <TH> 科系 </TH> <TH> 年級 </TH> <TH> 電郵 </TH></TR>
  <TR><TD> S9010 </TD> <TD> 張得功 </TD> <TD> 資科系 </TD> <TD> 三年級 </TD> <TD> [email protected] </TD></TR>
  <TR><TD> S9021 </TD> <TD> 王德財 </TD> <TD> 應數系 </TD> <TD> 二年級 </TD></TR>
</TABLE>

Rendered result (column headers: 學號 = student ID, 姓名 = name, 科系 = department, 年級 = year, 電郵 = email):

學號   姓名    科系    年級    電郵
S9010  張得功  資科系  三年級  [email protected]
S9021  王德財  應數系  二年級

Page 6: Introduction to XML and its processing techniques


XML's concerns

XML uses markup tags as well, but they describe the content rather than the presentation of that content.

The same example coded in XML (tag names: 學號 = student ID, 姓名 = name, 科系 = department, 年級 = year, 電郵 = email):

<students>
  <student>
    <學號>S9010</學號> <姓名>張得功</姓名> <科系>資科系</科系> <年級>三年級</年級> <電郵>[email protected]</電郵>
  </student>
  <student>
    <學號>S9021</學號> <姓名>王德財</姓名> <科系>應數系</科系> <年級>二年級</年級> <電郵/>
  </student>
  …
</students>

Notes:
1. Only contents are encoded in the XML text.
2. All data are annotated by tags indicating their roles or functions in the message.

Page 7: Introduction to XML and its processing techniques


Where does XML come from?

A simplified subset of the Standard Generalized Markup Language (SGML), which was standardized in 1986.

Simplified for more general use on the Web and as a data interchange format, without losing extensibility:
  easier for anyone to write valid XML.
  easier to write a parser.
  easier for the parser to quickly verify that documents are well-formed and/or valid.

Recommended by the W3C in Feb. 1998.

Page 8: Introduction to XML and its processing techniques


A Glimpse of XML

Page 9: Introduction to XML and its processing techniques


An example XML document

<?xml version="1.0"?>

<note>

<to>Wang</to>

<from>Chen</from>

<heading>Reminder</heading>

<body>Don't forget me this weekend!</body>

</note>

Notes:

1. The XML declaration should always be included (XML 1.0 makes it optional, but it is recommended).

2. <note>…</note> is the root element, which has 4 children.

Page 10: Introduction to XML and its processing techniques

<!-- the structure of the document element -->

<department>
  <employee id="s8931">
    <name> 張德治 </name>
  </employee>
  <employee id="s9017" id-no="L12345678">
    <name> 李大春 </name>
    <url href="http://www.xml.com.tw/~lee/"/>
  </employee>
</department>

Page 11: Introduction to XML and its processing techniques


Key terminology

Element
  Element type (or element name)
  Start tag
  End tag
  [Element] content
    child element
    character data [PCDATA]
Attribute
  Attribute name
  Attribute value
DTD
Comment
Processing instructions
  <?target data?>

Page 12: Introduction to XML and its processing techniques

<!-- the structure of the document element -->

<department>                    (start-tag)
  <employee id="s8931">
    <name> 張德治 </name>
  </employee>
  <employee id="s9017" id-no="L12345678">
    <name> 李大春 </name>
    <url href="http://www.xml.com.tw/~lee/"/>
  </employee>
</department>                   (end-tag)

[Figure callouts on this document: the root [document] element; element type (or name); attributes; attribute name; attribute value; PCDATA.]

Page 13: Introduction to XML and its processing techniques


Containment Hierarchy of XML Documents

Page 14: Introduction to XML and its processing techniques


All XML elements must have an end tag

In HTML some elements do not have to have a closing tag. The following code is legal in HTML:

<p>This is a paragraph

<p>This is another paragraph

In XML all elements must have a closing tag like this:

<p>This is a paragraph</p>

<p>This is another paragraph</p> 

Page 15: Introduction to XML and its processing techniques


XML tags are case sensitive

<Letter> != <letter>

Opening and closing tags must match with the same case:

<Message>This is incorrect</message>
<message>This is correct</message>

Page 16: Introduction to XML and its processing techniques


All XML elements must be properly nested

HTML browsers tolerate overlapping elements:

<b><i>bold and italic</b> italic only</i>

In XML, all elements must be properly nested:

<b><i>bold and italic</i> bold only</b>
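
The three syntax rules above (every element closed, matching case, proper nesting) can be checked mechanically, because a conforming parser must reject any text that breaks them. The following is a minimal sketch using the JAXP parser that ships with the JDK; the class and method names (`WellFormedCheck`, `isWellFormed`) are made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a piece of text is well-formed XML.
public class WellFormedCheck {

    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness violation is a fatal parse error
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>This is a paragraph</p>"));
        System.out.println(isWellFormed("<p>This is a paragraph"));                  // no end tag
        System.out.println(isWellFormed("<Message>This is incorrect</message>"));   // case mismatch
        System.out.println(isWellFormed("<b><i>bold and italic</b> italic only</i>")); // overlap
    }
}
```

Only the first call in `main` is expected to report a well-formed document; the other three each break one of the rules.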

Page 17: Introduction to XML and its processing techniques


Single root [document] element

A document contains exactly one root element. All other elements must be nested within the root element.

Elements can have sub-elements (children), sub-elements can have their own sub-elements, and so on.

The elements and text data that can appear as children of an element, together with their order and multiplicity, are definable [by DTD/XML Schema].

<root>
  <child>
    <subchild>…</subchild>
    or text data
    <subchild>…</subchild>
  </child>
</root>

Page 18: Introduction to XML and its processing techniques


XML Attributes

Attributes appear within the start tag of an element.
The attributes that can appear in the start tag of an element are definable [by DTD or XML Schema].
ID attributes are for identification; no two of them can have the same value in a document instance.

HTML examples:
  <img src="computer.gif">
  <a href=demo.asp>

XML examples:
  <file type="gif">
  <person id='3344'>

Note: in XML, an attribute value must be quoted with ' or ".

Page 19: Introduction to XML and its processing techniques


Well-formed vs. valid XML documents

Well-formed XML documents:
  Essentially any document conforming to the XML syntax rules that we have described.
  All texts/documents must be well-formed to be XML documents.

Example:

<?xml version="1.0"?>

<note>

<to>Wang</to>

<from>Chen</from>

<heading>Reminder</heading>

<body>Don't forget me this weekend!</body>

</note>

Page 20: Introduction to XML and its processing techniques


Valid XML documents

A valid XML document is a well-formed XML document that also conforms to the grammar attached to it.

The grammar attached to an XML document is called a DTD [document type definition].

A document with a reference to an external DTD:

<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "Note.dtd">
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
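
To make the difference between well-formed and valid concrete, here is a small sketch that validates note documents with a JAXP validating parser. The contents of Note.dtd are not shown in the slides, so the sketch assumes a plausible DTD and supplies it as an internal subset; the class name `DtdValidation` and the DTD itself are illustrative assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class DtdValidation {

    // Assumed grammar for <note>; the real Note.dtd is not part of the slides.
    static final String DOCTYPE =
        "<!DOCTYPE note [\n"
      + "  <!ELEMENT note (to, from, heading, body)>\n"
      + "  <!ELEMENT to (#PCDATA)>\n"
      + "  <!ELEMENT from (#PCDATA)>\n"
      + "  <!ELEMENT heading (#PCDATA)>\n"
      + "  <!ELEMENT body (#PCDATA)>\n"
      + "]>\n";

    /** Returns the validity errors reported while parsing DOCTYPE + rootElement. */
    public static List<String> validityErrors(String rootElement) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage()); // validity errors are recoverable
                }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + rootElement)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = "<note><to>Tove</to><from>Jani</from>"
            + "<heading>Reminder</heading><body>Don't forget me this weekend!</body></note>";
        String invalid = "<note><to>Tove</to></note>"; // well-formed, but children are missing
        System.out.println(validityErrors(valid));
        System.out.println(validityErrors(invalid));
    }
}
```

Both inputs are well-formed; only the second one violates the assumed grammar, so only it should yield a non-empty error list.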

Page 21: Introduction to XML and its processing techniques


DTD

Document Type Definition:
  a grammar for a class of XML documents.
  used to define the legal building blocks of an XML document.

Document Type Declaration:
  declares the DTD for an XML document.

  External subset: // defined at external places
    <!DOCTYPE note SYSTEM "note.dtd">

  Internal subset: // inline declarations
    <!DOCTYPE note SYSTEM "externSubset.dtd" [ …… inline markup declarations …… ]>

Page 22: Introduction to XML and its processing techniques


DTD: markup Declarations

Element type declarations

Attribute list declarations

Entity declarations
  declare macro-like abbreviations.
  <!ENTITY chencc "Cheng-Chia Chen">
  <!ENTITY chapter1 SYSTEM "chapter1.xml">
  <!ENTITY % subDTD SYSTEM "dtd1.dtd">

Notation declarations
  define types of non-XML data.
  <!NOTATION png SYSTEM "http://www.w3.org/png">

Page 23: Introduction to XML and its processing techniques


DTD: Element Type Declaration

Specifies the element type and content:

<!ELEMENT Name contentSpec>

Element's content:
  Empty:
    <!ELEMENT homepage EMPTY>
  Any:
    <!ELEMENT container ANY>
  Only elements (element content):
    no character data
  Mixed:
    character data mixed with elements

Page 24: Introduction to XML and its processing techniques


DTD: Element content model

Basically represented by a regular expression over element types.

Building blocks:
  Choice:       (p | list | table | form)
  Sequence:     (street, zip, city, country)
  Occurrences:  ?  +  *

Example:
  <!ELEMENT person (name, address+, homepage?, (email | telephone)+, note*)>
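
Since a content model is essentially a regular expression over element types, the person model above can be mimicked with an ordinary regex over the sequence of child element names. This is only an illustration of the idea, not how a validating parser is implemented; the class name `ContentModelRegex` and the trailing-space encoding of names are choices made for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ContentModelRegex {

    // DTD model: (name, address+, homepage?, (email | telephone)+, note*)
    // Each child name is written with a trailing space so names cannot run together.
    private static final Pattern PERSON = Pattern.compile(
        "name (address )+(homepage )?((email|telephone) )+(note )*");

    /** Checks a list of child element names against the person content model. */
    public static boolean matchesPerson(List<String> childNames) {
        StringBuilder sequence = new StringBuilder();
        for (String name : childNames) {
            sequence.append(name).append(' ');
        }
        return PERSON.matcher(sequence).matches();
    }

    public static void main(String[] args) {
        // Satisfies the model: one name, one address, one email.
        System.out.println(matchesPerson(Arrays.asList("name", "address", "email")));
        // Violates the model: address+ requires at least one address.
        System.out.println(matchesPerson(Arrays.asList("name", "email")));
    }
}
```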

Page 25: Introduction to XML and its processing techniques


DTD: Mixed element content

can contain either other elements and character data or only character data

Examples:

<!ELEMENT para (#PCDATA |em | strong | abbr )* >

<!ELEMENT p (#PCDATA |em | i | b | a| ul)*>

<!ELEMENT street (#PCDATA)>

<!ELEMENT city (#PCDATA)>

Page 26: Introduction to XML and its processing techniques


DTD: Attribute List Declaration

Define attributes that can appear in an element type.

format:

<!ATTLIST elName

attrName1 attrType1 attrDefault1

attrName2 attrType2 attrDefault2

………………………………… >

Attribute types:
  String type
  Tokenized type
  Enumerated type

Page 27: Introduction to XML and its processing techniques


DTD: ATTLIST Attribute Type

String type:

<!ATTLIST person age CDATA #IMPLIED>

Tokenized types: ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, NMTOKENS

<!ATTLIST person id       ID     #REQUIRED
                 father   IDREF  #REQUIRED
                 children IDREFS #IMPLIED>

Enumerated type:

<!ATTLIST person gender (Male|Female) #REQUIRED>

Page 28: Introduction to XML and its processing techniques


DTD:ATTLIST Attribute defaults

Provide information about the attribute's presence:

#REQUIRED
  The attribute must appear in the associated element.
  <!ATTLIST person gender (Male|Female) #REQUIRED>

#IMPLIED
  The attribute may be absent; there is no default value.
  <!ATTLIST person age CDATA #IMPLIED>

Default/constant value
  <!ATTLIST list type (ol|ul) "ul">
  <!ATTLIST list type (ol|ul) #FIXED "ul">
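
The effect of a default value can be observed from a program: when the attribute is absent, the parser supplies the default declared in the DTD. The sketch below assumes the `list` declaration from this slide placed in an internal subset, and reads the attribute back through DOM; the class name `AttributeDefaults` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class AttributeDefaults {

    // Internal subset declaring "ul" as the default value of list/@type.
    private static final String DOCTYPE =
        "<!DOCTYPE list [\n"
      + "  <!ELEMENT list EMPTY>\n"
      + "  <!ATTLIST list type (ol|ul) \"ul\">\n"
      + "]>\n";

    /** Parses DOCTYPE + listElement and returns the value of its type attribute. */
    public static String typeOf(String listElement) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new InputSource(new StringReader(DOCTYPE + listElement)));
            return doc.getDocumentElement().getAttribute("type");
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(typeOf("<list/>"));             // attribute omitted in the document
        System.out.println(typeOf("<list type=\"ol\"/>")); // attribute given explicitly
    }
}
```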

Page 29: Introduction to XML and its processing techniques


Why do we need XML ?

Page 30: Introduction to XML and its processing techniques


XML unifies the syntax of information

Layers of information (data):
  bit
  byte
  character: BCD, EBCDIC, ASCII, BIG5, ISO-8859 ==> Unicode
  syntax (form): XML
  semantics (ontology): Semantic Web
  application

Semantic Web: an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

--- Tim Berners-Lee et al.

Page 31: Introduction to XML and its processing techniques


New desired requirements in the internet age

Easy retrieval of information over the net
  realized by current Web/internet technology:
    good browsers, web servers
    HTTP, DNS, search engines
    HTML, URI, hypertext, MIME

Easy/cheap interoperation of existing software on the internet
  also the old goal of distributed systems/computing: RPC, RMI, CORBA, ...
  a prerequisite for eCommerce

Issues:
  data transmission ==> solved by the existing internet infrastructure
  data representation ?

Page 32: Introduction to XML and its processing techniques


Why do we need a unifying format for data?

Case: 10 word processors, each of which needs to be able to process documents generated by any other.

1st approach: write a converter A --> B for every ordered pair of distinct word processors A and B.
  #converters = n x (n-1) = 10 x 9 = 90 (bad!)

2nd approach: invent a common format (C).
  Write a pair of converters (A --> C, C --> A) for each word processor.
  To process in B a document generated by A, simply chain two converters: A == (A --> C) ==> C == (C --> B) ==> B
  required converters: 2 x n = 20 (much better!)
  prerequisite: a common format. This is the role XML plays.
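
The two converter counts follow directly from the formulas n x (n-1) and 2 x n. As a quick arithmetic check (the class and method names are invented for this sketch):

```java
public class ConverterCount {

    /** One converter per ordered pair of distinct formats. */
    public static int pairwise(int n) {
        return n * (n - 1);
    }

    /** One converter to and one from the common format, per word processor. */
    public static int viaCommonFormat(int n) {
        return 2 * n;
    }

    public static void main(String[] args) {
        int n = 10;
        System.out.println(pairwise(n));        // prints 90
        System.out.println(viaCommonFormat(n)); // prints 20
    }
}
```

The gap widens quickly: the pairwise count grows quadratically in n, while the common-format count grows linearly.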

Page 33: Introduction to XML and its processing techniques


Example: XML in EDA (Electronic Design Automation)

Page 34: Introduction to XML and its processing techniques


Additional benefits of XML (as a common format)

Enables the interoperation of internet/intranet/extranet software and services.

Free (or cheap) software for processing XML is readily available:
  no need to reinvent the wheel.
  developers can focus on value-added software built on this underlying software.

Decoupling of tightly coupled distributed systems into loosely coupled ones:
  less monopolization of software by vendors
  more combinations for buyers to choose from
  more chances for small companies to contribute software
  less investment for buyers

Page 35: Introduction to XML and its processing techniques


Comparison of XML with Other formats

HTML
Text-based non-markup formats
  .c .cpp .java .ini …
Binary formats
  .dll .exe .o .swf .class .png .jpeg …

Page 36: Introduction to XML and its processing techniques


Advantages of XML over HTML

XML lets you define your own tags.
XML tags describe the content, rather than the presentation of that content:
  easier content search (no annoying presentation data).
  easier page development (separating content from view).
  easy for devices to render the content according to their environments (single model/multiple views).

Page 37: Introduction to XML and its processing techniques


Advantage of XML over text formats

Examples:
  JavaML vs. Java; CppML vs. C++
  XMI vs. Rational's proprietary format
  web.xml, plugin.xml vs. ***.ini (for configuration)
  build.xml vs. makefile
  XQuery: XML format vs. plain text format
  RelaxNG: XML format vs. plain text format

Advantages:
  structure is explicitly represented in the XML format.
  (free and) standard tools (and APIs) exist for quick parsing of the XML format => front-end processing avoided/reduced.

Disadvantage: too verbose
  for storage and transmission: can be overcome by compression.
  for human generation: requires a smarter editor (not a problem for machine generation).
  for human reading/comprehension: a real problem!

Page 38: Introduction to XML and its processing techniques


Advantage of XML over binary formats

Examples:
  ASN.1: XER encoding rules vs. BER/CER/DER/PER
  classML vs. the .class file format
  swfml vs. swf (Flash file format)

Advantages:
  readable; editable
  (free and) open software and APIs available

Disadvantage:
  takes longer to parse.

The trend: one data model / multiple representation formats + converters among the formats.

Page 39: Introduction to XML and its processing techniques


Some XML Applications

Page 40: Introduction to XML and its processing techniques


Some XML applications

An XML application is a language adopting the XML syntax [which is usually defined by a DTD/Schema].

XML as an alternative representation format:
  SVG (Scalable Vector Graphics): for vector graphics
  MathML: for mathematical expressions
  SMIL (Synchronized Multimedia Integration Language)
  RDF (Resource Description Framework): an XML language for describing web resources and their relationships
  CML (Chemical Markup Language): for chemical molecules
  JavaML: for Java programs
  CppML: XML format for C++
  Ant: a replacement for make for Java
  Maven: a Java project management and project comprehension tool
  OOML: an OO programming language in XML
  UIML: User Interface Markup Language
  WAP WML (Wireless Markup Language)

See The XML Cover Pages for an extensive listing.

Page 41: Introduction to XML and its processing techniques


Mathematical Markup Language

<?xml version="1.0"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" "http://www.w3.org/TR/MathML2/dtd/xhtml-math11-f.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org/1998/Math/MathML" >

<head> <title>Fiat Lux</title> </head>

<body>

<p> And God said, </p>

<m:math> <m:mrow> <m:msub> <m:mi>&delta;</m:mi> <m:mi>&alpha;</m:mi> </m:msub> <m:msup> <m:mi>F</m:mi> <m:mi>&alpha;&beta;</m:mi> </m:msup> <m:mi> </m:mi> <m:mo>=</m:mo> <m:mi></m:mi> <m:mfrac> <m:mrow> <m:mn>4</m:mn> <m:mi>&pi;</m:mi> </m:mrow> <m:mi>c</m:mi> </m:mfrac> <m:mi> </m:mi> <m:msup> <m:mi>J</m:mi> <m:mrow> <m:mi>&beta;</m:mi> <m:mo> </m:mo> </m:mrow> </m:msup> </m:mrow> </m:math>

<p> and there was light </p>

</body>

</html>

Page 42: Introduction to XML and its processing techniques


Page 43: Introduction to XML and its processing techniques


Vector Graphics

Scalable Vector Graphics (SVG)
  Adobe SVG Viewer
  Apache Batik SVG toolkit

Vector Markup Language (VML)
  Internet Explorer 5.0 or above
  Microsoft Office 2000

Page 44: Introduction to XML and its processing techniques


Example

Page 45: Introduction to XML and its processing techniques


Ant

A make-like build tool.

Sample build.xml:

<project default="echoFoo" name="ant-test" basedir=".">

<property name="foo5.1" value="${foo5}"/>

<target name="writeFoo3Bar3">

<echo message="foo3 = bar3" file="test.properties"/> </target>

<target name="readWriteFoo4.1Foo4">

<echo message="foo4.1 = ${foo4}" file="test.properties"/> </target>

<target name="readWriteFoo5.1Foo5InStart">

<echo message="foo5.1 = ${foo5.1}" file="test.properties"/> </target>

<target name="echoFoo">

<echo message="${foo}"/> </target>

</project>

Page 46: Introduction to XML and its processing techniques


XML and related Core Specifications

Page 47: Introduction to XML and its processing techniques


Major W3C XML Technologies

Page 48: Introduction to XML and its processing techniques


Related technologies

XML is a key technology to ensure interoperability.
But XML, by itself, is not really useful... we need to:
  have datatypes, validation (DTDs, Schemas, ...)
  mix XML applications (Namespaces)
  link (XLink, XBase, ...)
  compose/decompose (XInclude, Fragments, ...)
  refer to XML data content (XPath, Query, ...)
  transform (XSLT)
  encrypt, decrypt, sign (Signature, Encryption, ...)
  interact, script (DOM, Events, ...)
  etc.

Page 49: Introduction to XML and its processing techniques


Core specifications for XML

XML 1.0
XML Namespaces
XML Path Language (XPath)
Extensible Stylesheet Language (XSL)
  XSL Transformations (XSLT)
  XSL Formatting Objects (XSL-FO)
XML Linking Language (XLink)
XML Pointer Language (XPointer)
XML Schema (; RelaxNG)
XHTML
XML Signature / Canonicalization
XML Protocol
XForms
XQuery (XML language for querying XML documents)

Page 50: Introduction to XML and its processing techniques


Core Specifications for XML

XML
  document type definition (DTD): a utility used to define the formats and contents of valid XML documents.
  a specification that defines which texts are well-formed XML documents.

XML Namespaces
  Define a mechanism to avoid collisions of element and/or attribute names in documents that use multiple sets of DTDs.

XLink
  Defines the mechanism for linking to web resources from an XML document.

XPointer
  Defines a mechanism for linking to locations inside an XML document.

XPath
  Defines a mechanism to refer to parts of an XML document.
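
As a small illustration of how XPath refers to parts of a document, the sketch below evaluates path expressions against the note document shown earlier, using the `javax.xml.xpath` API from the JDK. The class name `XPathDemo` and the chosen expressions are only examples.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathDemo {

    // The note document from the "An example XML document" slide.
    static final String NOTE =
        "<?xml version=\"1.0\"?>"
      + "<note><to>Wang</to><from>Chen</from><heading>Reminder</heading>"
      + "<body>Don't forget me this weekend!</body></note>";

    /** Evaluates an XPath expression against NOTE and returns its string value. */
    public static String evaluate(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(NOTE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("/note/to"));   // the <to> child of the root
        System.out.println(evaluate("/note/body")); // the <body> child of the root
    }
}
```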

Page 51: Introduction to XML and its processing techniques


XSL (Extensible Stylesheet Language)

a language for expressing stylesheets. consists of two parts:

XSLT : a language (in XML format) used to describe how to transform an XML document into one in XML or non-XML format.

XSLFO: an XML vocabulary for specifying formatting semantics.

An XSL stylesheet specifies the presentation of a class of XML documents by describing how an instance of the class is transformed into an XML document that uses the formatting vocabulary.
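
The transformation half (XSLT) can be tried from Java through the JAXP `javax.xml.transform` API. The sketch below is deliberately simpler than the formatting-objects pipeline described above: its made-up stylesheet turns the note document into one line of plain text rather than into the formatting vocabulary. The stylesheet and the class name `XsltDemo` are assumptions of this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltDemo {

    // Illustrative stylesheet: renders a <note> as "from -> to: heading".
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/note\">"
      + "<xsl:value-of select=\"from\"/> -> <xsl:value-of select=\"to\"/>"
      + ": <xsl:value-of select=\"heading\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies XSL to the given XML text and returns the result as a string. */
    public static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String note = "<note><to>Wang</to><from>Chen</from>"
            + "<heading>Reminder</heading><body>Don't forget me this weekend!</body></note>";
        System.out.println(transform(note));
    }
}
```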

Page 52: Introduction to XML and its processing techniques


Page 53: Introduction to XML and its processing techniques


XML Schema
  A planned replacement for DTD.
  Used to define the structures and formats of various messages encoded in XML format.
  Another competing alternative: RelaxNG.

Consists of three documents:
  Part 0: Primer
    an easy-to-understand introduction
  Part 1: Structures
    specifies the XML Schema definition language, which offers facilities for describing the structure and constraining the contents of XML documents
  Part 2: Datatypes
    defines tens of frequently used built-in datatypes
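
One thing XML Schema adds over DTD is built-in datatypes. The following sketch validates a tiny person document against a made-up schema using the JDK's `javax.xml.validation` API; the schema, the element names and the class name `SchemaValidation` are all illustrative assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class SchemaValidation {

    // Hypothetical schema: <person> holds a string name and a non-negative integer age.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"person\"><xs:complexType><xs:sequence>"
      + "<xs:element name=\"name\" type=\"xs:string\"/>"
      + "<xs:element name=\"age\" type=\"xs:nonNegativeInteger\"/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true if the XML text is valid with respect to XSD. */
    public static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // Without a custom error handler, the first validity error is thrown.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<person><name>Ann</name><age>30</age></person>"));
        System.out.println(isValid("<person><name>Ann</name><age>thirty</age></person>"));
    }
}
```

The second document has the right structure but a value that is not a number, a constraint a DTD could not express because it only knows character data.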

Page 54: Introduction to XML and its processing techniques


Programming Interfaces for XML

Page 55: Introduction to XML and its processing techniques


Major APIs for Processing XML documents

DOM (Level 0, 1, 2 & 3): Document Object Model
  tree-based XML API
  language independent

SAX (version 1 & 2): Simple API for XML
  event-based XML API

JAXP: Java API for XML Processing (J2SE)

JDOM, dom4j (XML APIs for Java)
  tree-based, simpler alternatives to DOM for Java
  easier to use than DOM, but suitable for Java only

Page 56: Introduction to XML and its processing techniques


General Model for Processing XML Documents

[Figure: layered processing model. Recoverable labels: Applications; JAXP (parser independent); implementation dependent.]

Page 57: Introduction to XML and its processing techniques


DOM

DOM defines:
  a tree-based logical model for XML documents
  platform- and language-independent APIs for model manipulation

DOM allows:
  accessing document content
  modifying document content
  creating new documents and contents in memory

DOM homepage: http://www.w3.org/DOM/

DOM APIs:
  defined in the Interface Definition Language (IDL)
  language bindings provided for Java, JavaScript, C++, Python, etc.

Page 58: Introduction to XML and its processing techniques


DOM logical model

<TABLE>

<TBODY>

<TR>

<TD> 紅樓夢 </TD>

<TD> 曹雪芹 </TD>

</TR>

<TR>

<TD> 三國演義 </TD>

<TD> 羅貫中 </TD>

</TR>

</TBODY>

</TABLE>

[Figure: the tree for this document, with the document node (root) at the top, element nodes below it, and the text nodes 紅樓夢, 曹雪芹, 三國演義, 羅貫中 as leaves.]

An XML document is a set of nodes that form a tree structure.

There are different node types: for documents, elements, attributes, text content, etc.

Page 59: Introduction to XML and its processing techniques


DOM : interface Hierarchy

Most important interfaces defined in Java package org.w3c.dom

Page 60: Introduction to XML and its processing techniques


Containment Hierarchy

Page 61: Introduction to XML and its processing techniques


DOM provides two groups of interfaces:
  Generic: Node, NodeList, NamedNodeMap
  Specialized: Node subinterfaces for elements, attributes, text nodes, etc.

Interfaces:
  Node, Document, Element, Attr, Text, …

Page 62: Introduction to XML and its processing techniques


Selected Node access and modification methods

Method Name Description

appendChild Appends a child node.

cloneNode Duplicates the node.

getAttributes Returns the node’s attributes.

getChildNodes Returns the node’s child nodes.

getNodeName Returns the node’s name.

getNodeType Returns the node’s type (e.g., element, attribute, text, etc.). Node types are described in greater detail in Fig. 8.9.

getNodeValue Returns the node’s value.

getParentNode Returns the node’s parent.

hasChildNodes Returns true if the node has child nodes.

removeChild Removes a child node from the node.

replaceChild Replaces a child node with another node.

setNodeValue Sets the node’s value.

insertBefore Inserts a child node before an existing child node.

Page 63: Introduction to XML and its processing techniques


Node Relatives and access methods

getParentNode()
getPreviousSibling()
getChildNodes()
getFirstChild()
…

[Figure: a node ("this") and its relatives: parentNode, previousSibling, nextSibling, childNodes, firstChild, lastChild.]

Page 64: Introduction to XML and its processing techniques


Node creation methods

All defined in the Document interface.

Method Name Description

createElement Creates an element node.

createAttribute Creates an attribute node.

createTextNode Creates a text node.

createComment Creates a comment node.

createProcessingInstruction Creates a processing instruction node.

createCDATASection Creates a CDATA section node.

getDocumentElement Returns the document’s root element.

appendChild Appends a child node.

getChildNodes Returns the child nodes.
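
A short sketch of these creation methods in use: it builds a small note document entirely in memory. The document content, the `priority` attribute and the class name `BuildDocument` are invented for the illustration; the empty document itself is obtained through JAXP, since DOM Level 1 does not say how to create one.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BuildDocument {

    /** Builds <note priority="high"><!--built in memory--><to>Wang</to></note>. */
    public static Document buildNote() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

            Element note = doc.createElement("note");
            doc.appendChild(note); // the root element

            Attr priority = doc.createAttribute("priority");
            priority.setValue("high");
            note.setAttributeNode(priority);

            note.appendChild(doc.createComment("built in memory"));

            Element to = doc.createElement("to");
            to.appendChild(doc.createTextNode("Wang"));
            note.appendChild(to);

            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element root = buildNote().getDocumentElement();
        System.out.println(root.getNodeName());                // note
        System.out.println(root.getAttribute("priority"));     // high
        System.out.println(root.getChildNodes().getLength());  // 2 (comment and <to>)
    }
}
```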

Page 65: Introduction to XML and its processing techniques


Some code snippets

Importing packages:

import org.w3c.dom.*;
import org.xml.sax.*;
import javax.xml.parsers.*;
import com.sun.xml.tree.XmlDocument; // Sun parser-specific class, needed only for XmlDocument.write()

Instantiation of the parser. DOM does not specify parser instantiation, so use JAXP for implementation-independent code:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating( true );
DocumentBuilder builder = factory.newDocumentBuilder();

Page 66: Introduction to XML and its processing techniques


Loading and parsing the XML file:
  Document document = builder.parse( new File( "intro.xml" ) );

Getting the root element (myMessage):
  Node root = document.getDocumentElement();

Casting the root to Element type:
  Element myMessageNode = ( Element ) root;

Finding the message elements:
  NodeList messageNodes = myMessageNode.getElementsByTagName( "message" );

Getting the first message element:
  Node message = messageNodes.item( 0 );

Page 67: Introduction to XML and its processing techniques


Creating a new text content and replacing the old one:

  Text newText = document.createTextNode( "New Changed Message!!" );
  Text oldText = ( Text ) message.getChildNodes().item( 0 );
  message.replaceChild( newText, oldText );

Writing the changed document to a new file. DOM does not specify how to save the DOM structure; this is an implementation-specific detail:
  One can use the JAXP XSLT API to transform the document into a stream result.
  With Sun's parser: ( ( XmlDocument ) document ).write( new FileOutputStream( "intro1.xml" ) );
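
The snippets above depend on a parser-specific class for the final write. As a portable variant, the sketch below performs the same load, replace and save steps but serializes through the JAXP transformation API, as the slide suggests. The content of intro.xml is not shown in the slides, so the input here is an assumed `<myMessage><message>…</message></myMessage>` document passed in as a string; the class name `ReplaceMessage` is likewise hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

public class ReplaceMessage {

    /**
     * Replaces the text of the first <message> element and returns the
     * serialized document. Assumes that element exists and starts with text.
     */
    public static String replaceFirstMessage(String xml, String newMessage) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            Node message = document.getDocumentElement()
                .getElementsByTagName("message").item(0);
            Text newText = document.createTextNode(newMessage);
            message.replaceChild(newText, message.getFirstChild());

            // An identity transformer copies the DOM tree to the output stream.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String intro = "<myMessage><message>Welcome to XML!</message></myMessage>";
        System.out.println(replaceFirstMessage(intro, "New Changed Message!!"));
    }
}
```

To write to a file instead, the `StreamResult` can wrap a `java.io.File` or an output stream rather than a `StringWriter`.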

Page 68: Introduction to XML and its processing techniques


DOM Levels

DOM Level 0 (pre-XML)
DOM Level 1 (discussed here)
DOM Level 2:
  namespace support
  stylesheets interface
  model for events
  Views, Range and Traversal interfaces
DOM Level 3 (working drafts):
  loading and saving documents
  XPath
  model for DTD and Schema

Page 69: Introduction to XML and its processing techniques


SAX

Simple API for XML
  Developed by the members of the XML-DEV list in 1998.

SAX is event based:
  The parser reports parsing events: start and end of the document, start and end of an element, errors, etc.
  When an event occurs, the parser invokes a method on an event handler.
  The application handles the events accordingly.

SAX home page: http://www.saxproject.org/

Types of SAX:
  Event-based SAX (push technology) (versions 1.0, 2.0)
  Pull SAX

Page 70: Introduction to XML and its processing techniques


What is an Event-Based Interface?

Two major types of XML APIs:

Tree-based APIs ==> DOM
  compile an XML document into an internal tree structure, then allow an application to navigate that tree.

Event-based APIs ==> SAX
  report parsing events (such as the start and end of elements) directly to the application through callbacks, and usually do not build an internal tree.
  The application implements handlers to deal with the different events, much like handling events in a graphical user interface.

Page 71: Introduction to XML and its processing techniques


How an event-based API works

Consider the following sample document:

<?xml version="1.0"?>
<doc>
  <para>Hello, world!</para>
</doc>

An event-based interface will break the structure of this document down into a sequence of SAX events:

start document
start element: doc
start element: para
characters: Hello, world!
end element: para
end element: doc
end document
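
The event sequence can be reproduced with a handler that simply records every callback. This sketch uses the SAX 2 `DefaultHandler` with the JAXP parser; the class names `SaxEvents` and `Recorder` are made up. The sample document is passed without whitespace between tags, because whitespace would otherwise be reported as additional character events, and consecutive character callbacks are merged because a parser may deliver text in several chunks.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEvents {

    /** Records each parsing event as one line of text. */
    private static final class Recorder extends DefaultHandler {
        final List<String> log = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        private void flushText() {
            if (text.length() > 0) {
                log.add("characters: " + text);
                text.setLength(0);
            }
        }

        @Override public void startDocument() { log.add("start document"); }
        @Override public void endDocument() { log.add("end document"); }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flushText();
            log.add("start element: " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flushText();
            log.add("end element: " + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // buffer until the next element event
        }
    }

    public static List<String> events(String xml) {
        Recorder recorder = new Recorder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), recorder);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return recorder.log;
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\"?><doc><para>Hello, world!</para></doc>";
        for (String event : events(doc)) {
            System.out.println(event);
        }
    }
}
```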

Page 72: Introduction to XML and its processing techniques


SAX: Parser architecture

[Figure: SAX parser architecture. Recoverable labels: the SAX driver's parser class name (supplied by the application writer); the implementation of Parser, AttributeList and Locator (supplied by the driver writer).]

Page 73: Introduction to XML and its processing techniques


SAX: DocumentHandler interface

Java package org.xml.sax
DocumentHandler interface

More important methods:

public abstract void startDocument()
public abstract void endDocument()
public abstract void startElement(String name, AttributeList atts)
public abstract void endElement(String name)
public abstract void characters(char ch[], int start, int length)
public abstract void processingInstruction(String target, String data)

Page 74: Introduction to XML and its processing techniques


Example: (MyHandler.java)

Prints a message each time an element starts or ends:

import org.xml.sax.HandlerBase;
import org.xml.sax.AttributeList;

// HandlerBase is the SAX 1 convenience class with empty default callbacks.
public class MyHandler extends HandlerBase {

    public void startElement (String name, AttributeList atts)
    {
        System.out.println("Start element: " + name);
    }

    public void endElement (String name)
    {
        System.out.println("End element: " + name);
    }
}

Page 75: Introduction to XML and its processing techniques


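The main program (SAXApp.java)

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.DocumentHandler;
    import org.xml.sax.Parser;

    public class SAXApp {
        // Driver class name; only needed when the parser is created with
        // org.xml.sax.helpers.ParserFactory.makeParser(parserClass).
        // The JAXP factory used below selects the implementation itself.
        static final String parserClass = "com.microstar.xml.SAXDriver";
        // or org.apache.xerces.parsers.SAXParser for xerces

        public static void main(String[] args) throws Exception {
            SAXParserFactory fac = SAXParserFactory.newInstance();
            SAXParser jaxpParser = fac.newSAXParser();
            // JAXP's SAXParser has no setDocumentHandler();
            // the underlying SAX 1.0 Parser does.
            Parser parser = jaxpParser.getParser();
            DocumentHandler handler = new MyHandler();
            parser.setDocumentHandler(handler);
            for (int i = 0; i < args.length; i++) {
                parser.parse(args[i]);
            }
        }
    }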

Page 76: Introduction to XML and its processing techniques

The input

The input XML document (roses.xml):

    <?xml version="1.0"?>
    <poem>
      <line>Roses are red,</line>
      <line>Violets are blue.</line>
      <line>Sugar is sweet,</line>
      <line>and I love you.</line>
    </poem>

Commands:

    java SAXApp file://localhost/tmp/roses.xml

or

    java SAXApp file:///tmp/roses.xml

Page 77: Introduction to XML and its processing techniques

The output

The output should be as follows:

    Start element: poem
    Start element: line
    End element: line
    Start element: line
    End element: line
    Start element: line
    End element: line
    Start element: line
    End element: line
    End element: poem

Page 78: Introduction to XML and its processing techniques

SAX: Error Handler

Three error types:
    Fatal errors: usually violations of well-formedness constraints. The parser must stop processing.
    Errors: usually violations of validity rules.
    Warnings: related to the DTD.

Signatures: fatalError(SAXParseException), error(…), warning(…)

Errors are handled by implementing the ErrorHandler interface. The same mechanism is used with DOM parsers.
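A minimal sketch of the ErrorHandler mechanism described above. The names ErrorReport, Collector, and check are hypothetical, and the handler is registered on a SAX 2.0 XMLReader obtained through JAXP. It collects what the parser reports instead of printing it, and rethrows fatal errors because parsing cannot continue after one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: gathers parser diagnostics in a list.
public class ErrorReport {

    static class Collector implements ErrorHandler {
        final List<String> problems = new ArrayList<>();

        @Override public void warning(SAXParseException e) {
            problems.add("warning at line " + e.getLineNumber());
        }
        @Override public void error(SAXParseException e) {
            problems.add("error at line " + e.getLineNumber());
        }
        @Override public void fatalError(SAXParseException e) throws SAXException {
            problems.add("fatal error at line " + e.getLineNumber());
            throw e; // the document is not well-formed, so stop parsing
        }
    }

    public static List<String> check(String xml) {
        Collector collector = new Collector();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler());
            reader.setErrorHandler(collector);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Expected for ill-formed input; the collector has already recorded it.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return collector.problems;
    }

    public static void main(String[] args) {
        // The <para> element is never closed, which violates well-formedness.
        System.out.println(check("<doc><para>Hello</doc>"));
    }
}
```

Validity errors reach error() only when validation is switched on; a non-validating parse of a well-formed document reports nothing.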

Page 79: Introduction to XML and its processing techniques

SAX 2.0

Main Changes:

Namespace support;

Get/set features and properties;

Introduction of Filter mechanism;

Interface DocumentHandler is replaced by ContentHandler;

New exception classes;
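Two of the SAX 2.0 changes listed above, namespace support and the ContentHandler interface, can be sketched together. The class name NamespaceDemo, the method names, and the namespace URI urn:example:poem are assumptions made for illustration. DefaultHandler is the SAX 2.0 convenience base class that implements ContentHandler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: lists each element as {namespace-URI}local-name.
public class NamespaceDemo {

    public static List<String> names(String xml) {
        List<String> out = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Turns on namespace processing, i.e. the SAX 2.0 feature
            // http://xml.org/sax/features/namespaces
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                // SAX 2.0 passes the namespace URI and local name separately;
                // SAX 1.0's DocumentHandler received only the raw name.
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    out.add("{" + uri + "}" + localName);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<p:poem xmlns:p='urn:example:poem'><p:line/><note/></p:poem>";
        names(xml).forEach(System.out::println);
    }
}
```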

Page 80: Introduction to XML and its processing techniques

SAX and DOM: Comparison

DOM:
    maintains an internal structure for the document;
    possibly high memory usage for large documents;
    enables traversing.

SAX:
    doesn't maintain an internal structure;
    enables building of a custom structure;
    low memory usage;
    usually faster than DOM;
    traversing is impossible without an internal structure.

DOM implementations are usually built on top of a SAX parser.
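To illustrate the traversal point in the comparison above, a small DOM sketch (the class name DomTraversal and method linesReversed are hypothetical): because the whole document is held in memory, the line elements can be visited in any order, here last to first. A SAX handler only sees them once, in document order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example: random access over a DOM tree.
public class DomTraversal {

    public static List<String> linesReversed(String xml) {
        try {
            // Build the complete in-memory tree first.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList lines = doc.getElementsByTagName("line");
            List<String> out = new ArrayList<>();
            // Walk the node list backwards.
            for (int i = lines.getLength() - 1; i >= 0; i--) {
                out.add(lines.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String poem = "<poem><line>Roses are red,</line><line>Violets are blue.</line></poem>";
        linesReversed(poem).forEach(System.out::println);
    }
}
```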

Page 81: Introduction to XML and its processing techniques

Pull SAX

Pull APIs:
    XMLPULL: designed for J2ME.
    StAX (Streaming API for XML): javax.xml.stream, JSR-173, proposed by BEA Systems.
    NekoPull for Apache Xerces 2
    .NET

SAX 1.0, 2.0 vs. pull SAX:
    For SAX, user code (the various event-handling code) is used as subroutines of the SAX parser; i.e., user code is the slave while the parser is the master.
    For pull SAX, on the contrary, user code is the master while the parser serves as a slave.
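The master/slave reversal can be sketched with StAX (javax.xml.stream), which ships with current JDKs. The class name StaxPull and method events are hypothetical. The loop belongs to the user code, which asks the parser for the next event whenever it is ready for one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: user code drives the parser by pulling events.
public class StaxPull {

    public static List<String> events(String xml) {
        List<String> out = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Report adjacent text as a single CHARACTERS event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {          // the application pulls the next event
                    case XMLStreamConstants.START_ELEMENT:
                        out.add("start " + reader.getLocalName());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        out.add("end " + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (!reader.isWhiteSpace()) {
                            out.add("text " + reader.getText());
                        }
                        break;
                    default:
                        break;                    // ignore comments, PIs, end of document, ...
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        events("<doc><para>Hello, world!</para></doc>").forEach(System.out::println);
    }
}
```

Since the application owns the loop, it can stop early, skip ahead, or hand the reader to another method, none of which is natural with callbacks.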

Page 82: Introduction to XML and its processing techniques

Sample code snippet

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.xmlpull.v1.XmlPullParser;
    import org.xmlpull.v1.XmlPullParserFactory;

    XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
    XmlPullParser parser = factory.newPullParser();
    InputStream in = new FileInputStream(args[0]);
    parser.setInput(in, null);   // null: no input encoding specified
    while (true) {               // reading the document...
        int event = parser.nextToken();
        if (event == XmlPullParser.START_TAG) {
            System.out.println("Start tag");
        } else if (event == XmlPullParser.END_TAG) {
            System.out.println("End tag");
        } else if (event == XmlPullParser.START_DOCUMENT) {
            System.out.println("Start document");
        } else if (event == XmlPullParser.TEXT) {
            System.out.println("Text");
        } … else if (event == XmlPullParser.END_DOCUMENT) {
            System.out.println("End Document");
            break;
        }
    }
    // If we get here there are no exceptions
    System.out.println(args[0] + " is well-formed");

Page 83: Introduction to XML and its processing techniques

Pull SAX: Only four classes

XmlPullParser: an interface that represents the parser.

XmlPullParserFactory: the factory class that instantiates an implementation-dependent XmlPullParser.

XmlPullParserException: the generic class for everything other than an IOException that might go wrong when parsing an XML document, particularly well-formedness errors and tokens that don't have the expected type.

XmlSerializer: defines an interface for serialization of the XML Infoset.

Page 84: Introduction to XML and its processing techniques

Event codes

Returned by next()/nextToken()/nextTag() to inform you of what the parser read.

11 event codes:
    1. XmlPullParser.START_DOCUMENT
    2. XmlPullParser.END_DOCUMENT
    3. XmlPullParser.START_TAG
    4. XmlPullParser.END_TAG
    5. XmlPullParser.TEXT
    6. XmlPullParser.CDSECT
    7. XmlPullParser.ENTITY_REF
    8. XmlPullParser.IGNORABLE_WHITESPACE
    9. XmlPullParser.PROCESSING_INSTRUCTION
    10. XmlPullParser.COMMENT
    11. XmlPullParser.DOCDECL

Depending on what the event is, different methods are available on the XmlPullParser.

Events reported:
    next(): only 3, 4, 5, 2.
    nextTag(): 3, 4.
    nextToken(): all.

Page 85: Introduction to XML and its processing techniques

Pros and Cons of Pull SAX

Fast
Memory efficient
Streamable
Read-only

Page 86: Introduction to XML and its processing techniques

Combine XML technology with traditional language processing technology: comparison and reuse of existing technology

Page 87: Introduction to XML and its processing techniques

Traditional language processor frontend

[Figure: traditional language processor. A character string goes through lexical analysis to produce tokens; parsing turns the tokens into parse trees or their equivalents; code generation produces object code, or evaluation of the parse trees produces results (syntax-directed translation/interpretation).]

Available tools: parser generator, lexical generator, etc.

Page 88: Introduction to XML and its processing techniques

Comparison

    XML technology      Language technology
    Pull SAX            lexical analyzer (scanner)
    ?                   syntax-directed translation scheme
    DOM                 parse tree (or AST)

Issues: Is it possible to
    reuse existing language processing technology to process XML documents?
    process legacy data (formats) using XML technology?
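The pairing of pull SAX with a lexical analyzer in the comparison above can be sketched as a hand-written recursive-descent parser that uses a pull reader as its scanner. The class name PoemParser and the toy grammar (poem ::= line*, line ::= text) are assumptions for illustration, and StAX stands in for XMLPULL because it is available in the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: one method per grammar rule, tokens come from a pull parser.
public class PoemParser {

    // poem ::= line*
    public static List<String> parsePoem(String xml) {
        try {
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            in.nextTag();                                              // advance to the root start tag
            in.require(XMLStreamConstants.START_ELEMENT, null, "poem");
            List<String> lines = new ArrayList<>();
            // nextTag() skips whitespace and returns the next start or end tag.
            while (in.nextTag() == XMLStreamConstants.START_ELEMENT) {
                lines.add(parseLine(in));
            }
            in.require(XMLStreamConstants.END_ELEMENT, null, "poem");
            return lines;
        } catch (XMLStreamException e) {
            // Input does not match the grammar or is not well-formed.
            throw new IllegalArgumentException(e);
        }
    }

    // line ::= text
    private static String parseLine(XMLStreamReader in) throws XMLStreamException {
        in.require(XMLStreamConstants.START_ELEMENT, null, "line");
        return in.getElementText();                                    // leaves the reader on </line>
    }

    public static void main(String[] args) {
        String poem = "<poem>\n  <line>Roses are red,</line>\n  <line>Violets are blue.</line>\n</poem>";
        parsePoem(poem).forEach(System.out::println);
    }
}
```

The require() calls play the role of a parser's "expect token" step: they throw when the current event is not the one the grammar rule demands.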

Page 89: Introduction to XML and its processing techniques

Use existing XML technology/tools to process legacy data.

[Figure: pipeline diagram. Labels: legacy text, query results, run-time objects; lexical analysis; tokens; T2SE (programmable by lexical generator); parsing; parse tree; PE2SE (programmable by parser generator); pull SAX / push SAX; SAX events; .xml; DOM builder; DOM tree; XML application; results. Sample data: three name entries (John Smith, D. Warwick, M. Douglas).]

Page 90: Introduction to XML and its processing techniques

Use language technology to process XML data.

[Figure: pipeline diagram. Labels: .xml; pull SAX or push SAX; SAX events; SE2T (programmable by lexical generator); tokens; lexical analysis; legacy text; parsing + actions; code embedding; DTD; DTD2Grammar; DOM builder; DOM tree; XML application.]

Page 91: Introduction to XML and its processing techniques

Syntax-Directed XML processing

Page 92: Introduction to XML and its processing techniques

Other important XML programming techniques

Data binding:
    Automatic generation of mapping programs between XML data and run-time objects.
    Object views are used for programming.
    XML views are used for storage/transmission/interoperation.
    Implementations: Castor, Zeus, JAXB, JXQuick

Rule-based XML processing:
    Implementation: Apache/Jakarta Digester

Transform:
    Retrieve data from existing XML documents and compose it into other documents in XML or non-XML formats.
    Standards: XSLT, TrAX, XQuery, XPath, XPointer

Executable XML:
    Not just passive data, XML can also be executable!
    Attach executable code to XML tags.
    Implementations: Ant, Simkon, Jelly.
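The Transform item above can be sketched with TrAX (javax.xml.transform) and a small XSLT stylesheet that turns a poem document into plain text, i.e. a non-XML format. The class name PoemToText and the embedded stylesheet are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: XML in, plain text out, via XSLT run through TrAX.
public class PoemToText {

    // Emits the text of every <line> child of <poem>, each followed by a newline.
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/poem'>"
        + "<xsl:for-each select='line'>"
        + "<xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    public static String transform(String xml) {
        try {
            // Compile the stylesheet, then apply it to the source document.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String poem = "<poem><line>Roses are red,</line><line>Violets are blue.</line></poem>";
        System.out.print(transform(poem));
    }
}
```

The select and match attributes are XPath expressions, which is how XPath from the standards list fits in.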

Page 93: Introduction to XML and its processing techniques

Summary and Further study

A brief introduction to XML and its processing techniques:
    What is XML, why XML, XML and related specifications.
    XML APIs: DOM, SAX, pull SAX.
    How to combine XML with traditional language processing technology.
    XML data binding, transformation, executable XML.

For XML:
    http://www.w3.org/
    http://xml.coverpage.org/

For XML programming with Java:
    Processing XML with Java, Elliotte Rusty Harold, Addison-Wesley, 2002. Online.
    Java™ Web Services Developer Pack 1.1 (and tutorials)