XML Tutorial

Walter Underwood
Senior Staff Engineer, Infoseek
[email protected]

Outline

I. XML: Why? What is it?

II. Document Types: representing content

III. Stylesheets: representing presentation

Part I. Why? What is it?

What is XML?

- Extensible Markup Language
- Structured markup
- Simplified SGML
- Next-generation HTML
- W3C Recommendation (spec)
- Easy to use, easy to implement
- A buzzword the press can spell

What is XML not?

- A programming language
- A single document type (memo, paper)
- Replacement for MS Word or FrameMaker
- An ANSI or ISO standard

Family Tree

- GML (1969)
- SGML (1985)
- HTML (1993)
- XML (1998)

Dates are first publication of draft specification

Why not SGML?

- Tools are hard to write
- Tools are expensive
- Depends on environment (interchange is difficult)
- If it did the job, we'd already be using it

Why not HTML?

- Backward compatibility, old browsers
- Hard to extend (still no formulas, figures)
- Based on SGML (see previous slide)
- Too much illegal HTML in use, need clean slate

An HTML example

    <html><body>
    <h1>The Purple Cow</h1>
    I never saw a purple cow,<br>
    I never hope to see one;<br>
    But I can tell you, anyhow,<br>
    I'd rather see than be one.<br>
    </body></html>

Same thing in XML

    <?xml version="1.0"?>
    <!DOCTYPE TEI.2 SYSTEM "tei.dtd">
    <TEI.2><text><body><div1 type="poem">
    <head>The Purple Cow</head>
    <lg>
    <l>I never saw a purple cow,</l>
    <l>I never hope to see one;</l>
    <l>But I can tell you, anyhow,</l>
    <l>I'd rather see than be one.</l>
    </lg>
    </div1></body></text></TEI.2>

Same thing formatted

The Purple Cow

I never saw a purple cow,
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.

Basic Syntax

- Starts with XML declaration
    <?xml version="1.0"?>
- Rest of document inside the "root element"
    <TEI.2>…</TEI.2>
- All text contained in some element
    <head>The Purple Cow</head>
- Start and end tags must match exactly

Well-formed vs. Valid

- XML must be well-formed
  - correct syntax
  - tags match, tags nest, all characters legal
  - parser must reject if not well-formed
- XML may be valid with respect to a DTD (Document Type Definition)
  - tags are used correctly
  - tags are all declared
  - attributes are declared
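The well-formedness rule can be tried out with any conforming parser. The following is a minimal sketch, assuming Python's standard-library `xml.etree.ElementTree` (a non-validating parser, so it checks well-formedness only and never consults a DTD). The helper name `is_well_formed` and the sample strings are illustrative, not from the slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False if it rejects it."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


good = "<lg><l>I never saw a purple cow,</l></lg>"
bad_nesting = "<lg><l>I never saw a purple cow,</lg></l>"  # tags overlap
bad_case = "<head>The Purple Cow</Head>"  # end tag differs in case

print(is_well_formed(good))
print(is_well_formed(bad_nesting))
print(is_well_formed(bad_case))
```

Checking validity needs a validating parser and a DTD, which this sketch does not attempt.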

Validity Checking

- Checks everything specified in a DTD
- Can't check text (currency, spelling)
- Checks against DTD: this is a valid memo, book, bibliography, ...
- XML editors usually require validity
- Other tools (search engines) might not

XML Syntax

- The XML declaration
- Elements
- Entities
- Text
- Declarations and Notations
- Processing Instructions
- Comments

The XML Declaration

- At very beginning of file
- Officially optional, but always use it
- Can declare version, encoding, standalone
  - Must be in that order
  - version is required; encoding and standalone are optional
- Must declare encodings other than UTF-8 and UTF-16
    <?xml version="1.0" encoding="Big5"?>
    <?xml version="1.0" encoding="ISO-8859-1"?>
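To see why the encoding declaration matters, here is a small sketch, again assuming Python's `xml.etree.ElementTree`. It stores the same element as ISO-8859-1 bytes with and without a declaration. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Latin-1 bytes with a matching encoding declaration.
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<Straße>Kurfürstendamm 175</Straße>"
).encode("iso-8859-1")

# The same Latin-1 bytes, but the declaration does not name the encoding,
# so the parser assumes UTF-8 and hits an illegal byte sequence.
undeclared = (
    '<?xml version="1.0"?>'
    "<Straße>Kurfürstendamm 175</Straße>"
).encode("iso-8859-1")

root = ET.fromstring(declared)
print(root.tag, root.text)

try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)
```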

Elements

- Containing: <person>Nico</person>
- Empty: <br/>
- Attributes: <date format="iso8601">…
- Names can contain any Unicode letter, digit, or '.', '-', '_', or ':' (':' is reserved); they cannot start with a digit, '.', or '-'
    <Straße>Kurfürstendamm 175</Straße>

Elements Express Structure

- Heading is inside poem element
    <div1 type="poem">
    <head>The Purple Cow</head>
- Shows the lines of the poem, not the line breaks on the page
    I never saw a purple cow<br>       (HTML)
    <l>I never saw a purple cow</l>    (XML)
- Space between elements is ignored

The Document Tree

    <TEI.2>
      <text>
        <body>
          <div1>
            <head></head>
            <lg>
              <l></l>
              <l></l>
            </lg>
          </div1>
        </body>
      </text>
    </TEI.2>
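The tree above is what a parser hands to an application. As a rough sketch, assuming Python's `xml.etree.ElementTree`, the nesting can be recovered by walking the parsed document. The `outline` helper is a made-up name for illustration.

```python
import xml.etree.ElementTree as ET

poem = (
    '<TEI.2><text><body><div1 type="poem">'
    "<head>The Purple Cow</head>"
    "<lg><l>I never saw a purple cow,</l><l>I never hope to see one;</l></lg>"
    "</div1></body></text></TEI.2>"
)


def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(poem))
print("\n".join(tree_lines))
```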

Elements and Attributes

- Attributes can parameterize an element
    <div1 type="poem">
    <div1 type="abstract">
    <div1 type="chapter">
    <date format="iso8601">
    <subject scheme="LCSH">
- Not as flexible as elements
- Don't use to save bytes, compress instead
    <author first="Fred" last="Flintstone"/>    (not good)

Attribute Syntax

- Name can contain any Unicode letter, digit, or '.', '-', '_', or ':' (':' is reserved)
- Cannot repeat
- Order doesn't matter
- Values must be quoted (single or double)
- Values may not contain "<"
- Values may have defaults in DTD
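Several of these attribute rules are well-formedness rules, so a parser enforces them. The sketch below, assuming Python's `xml.etree.ElementTree`, tries each one. The `parses` helper and the sample elements are illustrative.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the document is accepted by the parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Order does not matter, and either quote style is accepted.
a = ET.fromstring('<subject scheme="LCSH" id="s1"/>')
b = ET.fromstring("<subject id='s1' scheme='LCSH'/>")
same_attributes = a.attrib == b.attrib

repeated = parses('<date format="iso8601" format="rfc822"/>')  # repeated name
unquoted = parses("<date format=iso8601/>")                    # missing quotes
raw_less_than = parses('<range limit="a<b"/>')                 # "<" in value
escaped_less_than = parses('<range limit="a&lt;b"/>')          # escaped "<"

print(same_attributes, repeated, unquoted, raw_less_than, escaped_less_than)
```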

Special Attributes

- xml:lang for language
- id has unique identifier for element
- idref references an id
- xml:* is reserved

Entities

- Just like HTML, but better
- Five predefined entities
    &amp; &apos; &lt; &gt; &quot;
- Define your own in DTD
    <!ENTITY euro "&#x20AC;">
- Use numeric character references
    &#x20AC; &#8364;
- Use Unicode directly
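A short sketch of how the entity forms above come out of a parser, assuming Python's `xml.etree.ElementTree`. The `price` element is invented for the example; the `euro` entity is the one declared on the slide.

```python
import xml.etree.ElementTree as ET

# Predefined entities and numeric character references need no declarations.
builtin = ET.fromstring("<p>&lt;l&gt; &amp; &quot;&apos; &#x20AC; &#8364;</p>")

# An entity declared in the internal DTD subset.
declared = ET.fromstring(
    '<!DOCTYPE price [<!ENTITY euro "&#x20AC;">]>'
    "<price>&euro;10</price>"
)

# Without the declaration, the same reference is an error.
try:
    ET.fromstring("<price>&euro;10</price>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(builtin.text)
print(declared.text)
print(undeclared_ok)
```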

Text

- Unicode 2.0, see www.unicode.org
- Use predefined entities (&lt; &amp; …)
    XML Example: &lt;char>&amp;amp;&lt;/char>
- CDATA ("character data") section for raw text without using entities
    <![CDATA[ XML example: <char>&amp;</char> ]]>
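The two ways of writing literal markup give the application the same text. A minimal check, assuming Python's `xml.etree.ElementTree` and an `example` wrapper element made up for the test:

```python
import xml.etree.ElementTree as ET

# Escaped with predefined entities.
escaped = ET.fromstring("<example>&lt;char>&amp;amp;&lt;/char></example>")

# The same text inside a CDATA section, with no escaping.
cdata = ET.fromstring("<example><![CDATA[<char>&amp;</char>]]></example>")

print(escaped.text)
print(cdata.text)
```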

Declarations

- Allow validity checking
- Optional
- May be internal (in document), external, or both
- DTD (Document Type Definition) is all active declarations
- Use existing DTDs when possible

External DTD

- Most common
- Use DOCTYPE declaration before root element
    <!DOCTYPE greeting SYSTEM "hello.dtd">
    <greeting>Hello, world!</greeting>

Internal (standalone) DTD

- For custom documents
- Also uses DOCTYPE declaration
    <!DOCTYPE greeting [
      <!ELEMENT greeting (#PCDATA)>
    ]>
    <greeting>Hello, world!</greeting>
- Specify in XML declaration
    <?xml version="1.0" standalone="yes"?>

External plus Internal DTD

- Usually to declare entities
- Use DOCTYPE declaration before root element
    <!DOCTYPE greeting SYSTEM "hello.dtd" [
      <!ENTITY excl "&#x21;">
    ]>
    <greeting>Hello, world&excl;</greeting>

Element Type Declarations

- Declare name
- Declare allowed content
    <!ELEMENT a EMPTY>
    <!ELEMENT b ANY>
    <!ELEMENT either (one | theother)>
    <!ELEMENT ordered (first, second)>
    <!ELEMENT list (item+)>
    <!ELEMENT dl ((dt?, dd?)*)>
    <!ELEMENT text (#PCDATA)>
    <!ELEMENT mixed (#PCDATA | b | i | em)*>

Attribute List Declarations

- Declare attributes for an element
- Declare value types
- Declare defaults
    <!ATTLIST termdef
              id   ID    #REQUIRED
              name CDATA #IMPLIED>
    <!ATTLIST list type (bullets|ordered|glossary) "ordered">
    <!ATTLIST form method CDATA #FIXED "POST">

Entity Declarations

- Pretty names for characters
    <!ENTITY copy "&#x00A9;">
- Boilerplate
    <!ENTITY copyright "&copy; Infoseek Corp. 1999, All rights reserved">
- Used extensively in complex DTDs

Notations

- A name of something outside of XML
  - an unparsed entity
  - target of a processing instruction
- Mostly useful to applications
    <!NOTATION WunderFormatter SYSTEM "http://wunderco.com/formatter/">

Processing Instructions

- Instructions to applications
  - fonts? security? correctness checks?
- Linking to a style sheet
    <?xml-stylesheet href="mystyle.css" type="text/css"?>
- Instructions to indexing robots
    <?robots index="no" follow="yes"?>
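Processing instructions are passed through to the application as a target plus a data string. A small sketch, assuming Python's standard-library `xml.dom.minidom`, shows how an application might pick them up. The `greeting` document is reused from the DTD slides, and the parser leaves the meaning of the data entirely to the application.

```python
from xml.dom import minidom

document = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet href="mystyle.css" type="text/css"?>'
    '<?robots index="no" follow="yes"?>'
    "<greeting>Hello, world!</greeting>"
)

# Collect (target, data) for every processing instruction before the root.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

for target, data in instructions:
    print(target, "->", data)
```

Note that the XML declaration itself is not reported as a processing instruction.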

Comments

- Like HTML and SGML
    <!-- a comment -->
- Anything is OK inside a comment, except "--"
    <!-- <head> & <tail> are elements -->
    <!-- <?xml?> declaration goes here -->
- But don't use structured comments, use processing instructions instead
    <!-- Font: Treefrog -->                  (wrong)
    <?WunderFormatter font="Treefrog"?>      (right)

Unicode and Encodings

- Unicode in programs
  - UCS-2: two-byte characters
  - UCS-4: four-byte characters (future)
- Unicode in files
  - UTF-8: ASCII is ASCII, other characters take 2 to 4 bytes
  - UTF-16: two octets per character, initial
- ASCII with numeric character references works, too (&#x00A9; for ©)
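The byte counts are easy to confirm. This sketch uses only Python's built-in string encoders, and the sample characters are chosen for illustration. The last step shows the "ASCII with numeric character references" option using the `xmlcharrefreplace` error handler, which emits decimal references.

```python
samples = {"A": "A", "copyright sign": "\u00a9", "euro sign": "\u20ac"}

for name, char in samples.items():
    utf8 = char.encode("utf-8")
    utf16 = char.encode("utf-16-be")  # big-endian, no byte order mark
    print(f"{name}: UTF-8 {len(utf8)} byte(s), UTF-16 {len(utf16)} byte(s)")

# Pure ASCII output: anything outside ASCII becomes a character reference.
ascii_only = "\u00a9 Infoseek Corp. 1999".encode("ascii", "xmlcharrefreplace")
print(ascii_only)
```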

Part II. Document Types: representing content

What is a "document type"?

- Technical report
- Specification
- Bug report
- Experiment run summary
- Software manual
- Novel
- Poem
- Play

What is a DTD?

- "Document Type Definition"
- Bunch of XML declarations
- Usually external to document
- Designed for some purpose (use one that matches your needs)
- Best left to experts

Types of Document Types

- Text
  - TEI (scholarly editions)
  - DocBook (software documentation)
  - NITF (news articles)
- Data
  - CML (Chemical Markup Language)
  - AIML (Astronomical Instrument ML)
- Mixed
  - often custom (bug reports)

A Bug Report Document

    <?xml version="1.0"?>
    <bugreport>
      <product>xmltron</product>
      <version>1.1</version>
      <os>RTE</os>
      <osversion>4.0</osversion>
      <date scheme="ISO8601">1999-11-03</date>
      <report>
        <summary>doesn’t work</summary>
        <detail>at all</detail>
      </report>
      <solution>none yet</solution>
    </bugreport>

Make a Document Type

    <!DOCTYPE bugreport [
      <!-- declarations go here -->
    ]>
    <bugreport> ...

Doctype and root element must match

Declarations for Elements

    <!DOCTYPE bugreport [
    <!-- ELEMENT bugreport: wait 'til next slide -->
    <!ELEMENT product (#PCDATA)>
    <!ELEMENT version (#PCDATA)>
    <!ELEMENT os (#PCDATA)>
    <!ELEMENT osversion (#PCDATA)>
    <!ELEMENT date (#PCDATA)>
    <!ELEMENT report (summary, detail)>
    <!ELEMENT summary (#PCDATA)>
    <!ELEMENT detail (#PCDATA)>
    <!ELEMENT solution (#PCDATA)>
    ]>

Declaration for Root Element

    <!DOCTYPE bugreport [
    <!ELEMENT bugreport (product, version, os, osversion,
                         date, report, solution?)>

<solution> is optional, others are required and must be in this order.
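To make the content model concrete, here is a rough sketch of the check a validating parser performs for the root element. It is not a DTD validator: Python's standard library does not include one, so the content model is translated by hand into a regular expression over the child element names. The names `BUGREPORT_MODEL` and `children_match_model` are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the content model
#   (product, version, os, osversion, date, report, solution?)
# into a regular expression over the comma-joined child element names.
BUGREPORT_MODEL = re.compile(
    r"product,version,os,osversion,date,report(,solution)?"
)


def children_match_model(xml_text: str) -> bool:
    """Check only the order and presence of the root's direct children."""
    root = ET.fromstring(xml_text)
    child_names = ",".join(child.tag for child in root)
    return (
        root.tag == "bugreport"
        and BUGREPORT_MODEL.fullmatch(child_names) is not None
    )


with_solution = (
    "<bugreport><product>xmltron</product><version>1.1</version>"
    "<os>RTE</os><osversion>4.0</osversion>"
    '<date scheme="ISO8601">1999-11-03</date>'
    "<report><summary>doesn't work</summary><detail>at all</detail></report>"
    "<solution>none yet</solution></bugreport>"
)
without_solution = with_solution.replace("<solution>none yet</solution>", "")
wrong_order = with_solution.replace(
    "<product>xmltron</product><version>1.1</version>",
    "<version>1.1</version><product>xmltron</product>",
)

print(children_match_model(with_solution))
print(children_match_model(without_solution))
print(children_match_model(wrong_order))
```

A real validating parser would also check the children of `<report>`, the attribute declarations, and everything else in the DTD.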

Declarations for Attributes

    <!ATTLIST date scheme CDATA #IMPLIED>

- "CDATA" instead of "PCDATA" means the value is plain character data: it cannot contain element markup, though entity and character references are still expanded
- #IMPLIED means optional (value implied by document)
- separate ATTLIST declarations for the same element are OK
- internal ATTLIST declarations override external

Reusing Element Declarations

    <product>
      <name>xmltron</name>
      <version>1.1</version>
    </product>
    <os>
      <name>RTE</name>
      <version>4.0</version>
    </os>

Use the same elements for product and OS info.

New Declarations for Elements

    <!ELEMENT product (name, version)>
    <!ELEMENT os (name, version)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT version (#PCDATA)>

Customizing Existing DTDs

- Add attributes
- Add entities
- Rarely change elements
  - Can't override element declarations
  - Can add new child elements to those that allow ANY
- Some DTDs are designed for extensions

Part III. Stylesheets: representing presentation

documents = contents + style

- Extensible Stylesheet Language (XSL)
- Specifications still in draft
- But implementations keeping pace

XSL is in Three Parts

- XSLT: transformation
- XPath: addressing parts of an XML document
- FO: formatting objects

We will cover only XSLT today

Client-side XSL

[Diagram: XML, XSLT, FO]

Server-side XSL

[Diagram: XML and XSLT go into the XSLT engine, HTML comes out]

XML into HTML

- XSLT can transform into (called "output method"):
  - XML
  - HTML
  - text
- Server-side XSLT engine
  - content in XML
  - served as HTML
  - browser never knows

Transforming The Purple Cow

- Add HTML intro and outro
- convert <head> to <h1>
- convert <lg> to <p> (at beginning of stanza)
- convert <l> to <br> (at end of line)

The Purple Cow (XML)

    <?xml version="1.0"?>
    <!DOCTYPE TEI.2 SYSTEM "tei.dtd">
    <?xml-stylesheet href="purple.xsl" type="text/xml"?>
    <TEI.2><text><body><div1 type="poem">
    <head>The Purple Cow</head>
    <lg>
    <l>I never saw a purple cow,</l>
    <l>I never hope to see one;</l>
    <l>But I can tell you, anyhow,</l>
    <l>I'd rather see than be one.</l>
    </lg>
    </div1></body></text></TEI.2>

The Purple Cow (HTML)

    <html><body>
    <h1>The Purple Cow</h1>
    I never saw a purple cow,<br>
    I never hope to see one;<br>
    But I can tell you, anyhow,<br>
    I'd rather see than be one.<br>
    </body></html>

Intro and Outro

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="TEI.2">
        <html>
          <body>
            <xsl:apply-templates/>
          </body>
        </html>
      </xsl:template>
    </xsl:stylesheet>

XSLT So Far

- It is XML
- Uses XML Namespaces: no name conflicts
- Defaults to the xml output method
- Uses the html output method if <html> is the root element of the output
- Applies templates to input

A Template for Text Content

    <xsl:template match="head">
      <h1>
        <xsl:apply-templates/>
      </h1>
    </xsl:template>

- Default element rule applies templates
- Default text rule copies to output
- IE5 doesn’t implement the default rules

Default Templates

    <!-- Default template for elements, applies to children -->
    <xsl:template match="*|/">
      <xsl:apply-templates/>
    </xsl:template>

    <!-- Default template for text and attribute nodes, copies content to output -->
    <xsl:template match="text()|@*">
      <xsl:value-of select="."/>
    </xsl:template>

Line Groups and Lines

    <!-- wrap each stanza in a <p> (the stylesheet itself must be well-formed XML) -->
    <xsl:template match="lg">
      <p>
        <xsl:apply-templates/>
      </p>
    </xsl:template>

    <!-- put a <br> after each line -->
    <xsl:template match="l">
      <xsl:apply-templates/>
      <br/>
    </xsl:template>

A Complete Stylesheet

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="TEI.2">
        <html><body><xsl:apply-templates/></body></html>
      </xsl:template>
      <xsl:template match="head">
        <h1><xsl:apply-templates/></h1>
      </xsl:template>
      <xsl:template match="lg">
        <p><xsl:apply-templates/></p>
      </xsl:template>
      <xsl:template match="l">
        <xsl:apply-templates/><br/>
      </xsl:template>
    </xsl:stylesheet>
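How the templates and default rules interact can be easier to follow in ordinary code. The following is only a rough Python analogue of the stylesheet, not an XSLT engine: the `TEMPLATES` table stands in for the four `xsl:template` rules, and `apply_templates` and `transform` are names invented here. It assumes Python's `xml.etree.ElementTree` and uses a two-line version of the poem.

```python
import xml.etree.ElementTree as ET
from html import escape

POEM = (
    '<TEI.2><text><body><div1 type="poem"><head>The Purple Cow</head>'
    "<lg><l>I never saw a purple cow,</l><l>I never hope to see one;</l></lg>"
    "</div1></body></text></TEI.2>"
)

# One entry per template rule; "{}" marks where <xsl:apply-templates/> sits.
TEMPLATES = {
    "TEI.2": "<html><body>{}</body></html>",
    "head": "<h1>{}</h1>",
    "lg": "<p>{}</p>",
    "l": "{}<br>",
}


def apply_templates(element):
    """Process the text and child elements of `element` in document order."""
    # Default text rule: copy text to the output (escaped for HTML).
    parts = [escape(element.text or "", quote=False)]
    for child in element:
        parts.append(transform(child))
        parts.append(escape(child.tail or "", quote=False))
    return "".join(parts)


def transform(element):
    """Apply the matching template, or the default element rule if none."""
    # Default element rule: no wrapper, just process the children.
    template = TEMPLATES.get(element.tag, "{}")
    return template.format(apply_templates(element))


html = transform(ET.fromstring(POEM))
print(html)
```

The `<text>`, `<body>`, and `<div1>` elements have no entry in the table, so they fall through to the default rule, which is why only their content reaches the output.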

Other XSL Features

- Cascading stylesheets
- Including stylesheets
- Conditionals (if/else), variables
- Relative selectors, XPath selectors
- Counting, sorting
- String and number manipulation
- Template modes (e.g. table-of-contents and full)

Why do it?

- Different HTML for different browsers (make sure the default works!)
- Index only content with search engine
- Generate RTF or TEX with text output method
- Analyze XML files (all meta data defined?)
- Convert between DTDs

Questions?

Resources and URLs

XML Information

- XML at W3C
  - www.w3.org/XML
  - www.w3.org/TR/REC-xml
- The Annotated XML Spec
  - www.xml.com/pub/axml/axmlintro.html
- The Robin Cover SGML/XML page (encyclopedic!)
  - www.oasis-open.org/cover/
- The XML Bible, Elliotte Rusty Harold
  - updates at: metalab.unc.edu/xml/books/bible/
- www.xml.com (articles and directory)

XML Software

- SAX (Simple API for XML)
  - www.megginson.com/SAX
  - www.jclark.com/XML (C and Java parsers)
- DOM (Document Object Model)
  - www.w3c.org/DOM (specs)
  - www.alphaworks.ibm.com (XML4J parser)
  - developer.java.sun.com/developer/products/xml/ (Project X)
- Parser conformance testing
  - www.xml.com/pub/1999/09/conformance/
  - www.oasis-open.org/cover/xmlConformance.html
- Avoid MSXML (Microsoft), non-standard and buggy

General DTD Resources

- Structuring XML Documents, David Megginson
- The XML and SGML Cookbook: Recipes for Structured Information, Rick Jelliffe
  - more an SGML book, but excellent on internationalization

Specific DTD Resources

- Inside XML DTDs: Scientific and Technical, Simon St. Laurent
- DocBook: The Definitive Guide, Norman Walsh and Leonard Muellner
- TEI Lite and Bare Bones TEI (SGML)
  - www.tei-c.org (TEI Consortium)
  - www-tei.uic.edu/orgs/tei/intros/teiu5.html
  - www-tei.uic.edu/orgs/tei/intros/teiu6.html
- Chemical Markup Language: www.xml-cml.org
- MathML: www.w3.org/TR/REC-MathML

XSL Resources

- Warning: XSL changed in August 1999!
- W3C Style Activity
  - www.w3c.org/Style
- Updated XSL chapter from The XML Bible
  - metalab.unc.edu/xml/books/bible/updates/14.html
- James Clark's XT (XSLT implementation)
  - www.jclark.com/xml/xt.html