BASIC XML SYNTAX

XML markup describes and provides structure to the content of an XML document or data packet.

The tag markup syntax of XML is very similar to HTML (both are based upon SGML), with angle brackets used to delimit tags.

All tags begin with a less-than sign (<) and end with a greater-than sign (>).

Unlike HTML, XML is case-sensitive, and this applies to element tags and attribute values alike. That is:

<Invoice> ≠ <INVOICE> ≠ <invoice> ≠ <INvoice>
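As a small illustration of case-sensitivity, the sketch below feeds a few strings to the parser in Python's standard library (our choice for these examples, not something the text prescribes). The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the standard-library parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Start-tag and end-tag names agree exactly, including case.
matching = is_well_formed("<Invoice>42</Invoice>")

# The names differ only in case, so the end-tag does not match the start-tag.
mismatched = is_well_formed("<Invoice>42</INVOICE>")

# Differently cased names are simply three different element types.
root = ET.fromstring("<orders><Invoice/><INVOICE/><invoice/></orders>")
distinct_names = {child.tag for child in root}
```

Here matching is True, mismatched is False, and distinct_names holds three separate names.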

Characters

Because XML is intended for worldwide use, characters are not limited to the 7-bit ASCII character set. XML uses most of the characters that are defined in the 16-bit Unicode character set (currently congruent with ISO/IEC 10646). There are two Unicode formats that are used as the basis of XML characters: UTF-8 and UTF-16. XML allows the use of almost any character encoding that can be mapped to Unicode (such as EBCDIC, Big5, etc.). There are numerous other character encodings that can be used with some XML tools, but UTF-8 and UTF-16 support is required of all XML processors.

The current Unicode specification can be found at http://www.unicode.org, and ISO/IEC 10646 documentation can be ordered at http://www.iso.ch. The UTF acronym can mean "Unicode Transformation Format" (according to Unicode) or "UCS Transformation Format" (in ISO/IEC or IETF documents). Essentially they mean the same thing, since Unicode and ISO/IEC 10646 are nearly identical.

UTF-8 is commonly used in North America and Europe, since the first 128 character values map directly to 7-bit US-ASCII (conversely, any 7-bit ASCII string is valid UTF-8). UTF-8 is a multi-byte encoding, with character values represented in one to six bytes under RFC 2279 (the later RFC 3629 limits this to four). This encoding is less popular in Asia, since most Asian characters and ideographs require longer encoded forms, typically three bytes each against two in UTF-16.

UTF-8 is described at: http://www.ietf.org/rfc/rfc2279.txt

The UTF-16 encoding uses 16-bit values for characters, with the full range of 65,536 possible 16-bit values being split into two parts. There are 63,486 values available to represent single 16-bit character values. The other 2,048 values are reserved to provide paired 16-bit code values for an additional 1,048,544 character values. These are called surrogate pairs. When this text was written none of these values had yet been assigned, although later versions of Unicode do assign characters in this range.

UTF-16 is described at: http://www.ietf.org/rfc/rfc2781.txt

These are relatively new standards, and so much of the world's text isn't yet stored in Unicode. However, Unicode was designed to be a superset of most existing character encodings, and so the conversion of legacy data to Unicode is straightforward. For example, converting ASCII to the UTF-16 form of Unicode merely requires stuffing a zero into the high-order byte of the 16-bit character, and simply preserving the low-order byte as is. Of course, this means that twice the storage space is required, compared to the same text in ASCII. As noted above, 7-bit ASCII doesn't even need conversion to be treated as the UTF-8 encoding.
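The byte-level claims about ASCII, UTF-8, and UTF-16 can be checked with a short sketch. It assumes Python's built-in codecs and uses the big-endian form of UTF-16 without a byte-order mark so that the layout is easy to read. The sample text and the variable names are ours.

```python
text = "Invoice"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
# Big-endian UTF-16 with no byte-order mark (an assumption made for readability).
utf16_bytes = text.encode("utf-16-be")

# 7-bit ASCII text is already valid UTF-8, byte for byte.
same_as_ascii = ascii_bytes == utf8_bytes

# ASCII to UTF-16: a zero high-order byte followed by the original byte.
padded = b"".join(b"\x00" + bytes([value]) for value in ascii_bytes)
zero_padded = utf16_bytes == padded

# A character beyond the 16-bit range is stored as a surrogate pair in UTF-16.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
surrogate_units = len(clef.encode("utf-16-be")) // 2
utf8_length = len(clef.encode("utf-8"))
```

The UTF-16 form of the ASCII text is exactly twice as long as the original, while the clef character takes two 16-bit units in UTF-16 and four bytes in UTF-8.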

SPECIAL MARKUP CHARACTERS

Five characters have special meaning in XML mark-up:

< - Less-than sign (left angle bracket)

> - Greater-than sign (right angle bracket)

& - Ampersand

' - Apostrophe (single quotation mark)

" - Quotation mark (double quotation mark)

Use &lt; for <

Use &gt; for >

Use &amp; for &

Use &apos; for '

Use &quot; for "
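In practice these substitutions are rarely done by hand. As a minimal sketch, assuming Python's standard xml.sax.saxutils module, the escape and unescape helpers handle &, < and > by default, and the two quote characters when given an extra mapping. The QUOTES table and the sample string are ours.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > on its own; the two quote characters are opt-in.
QUOTES = {"'": "&apos;", '"': "&quot;"}

raw = "AT&T says \"1 < 2\" & 'ok' > nothing"

# Replace all five special characters with their entity references.
escaped = escape(raw, QUOTES)

# Reverse the mapping to turn the entity references back into characters.
restored = unescape(escaped, {ref: char for char, ref in QUOTES.items()})
```

The escaped string contains no literal markup characters, and restored is identical to raw.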

ELEMENTS

An element is XML's basic container for content - it may contain character data, other elements, and/or other markup (comments, PIs, entity references, etc.). Since they represent discrete objects, elements can be thought of as the "nouns" of XML.

Elements are delimited with a start-tag and an end-tag. If an element has no content, it is known as an empty element, and may be represented with either a start-tag/end-tag pair or using an abbreviation: the empty-element tag. Unlike the looser syntax of HTML and SGML, the end-tag cannot be omitted, except when using an empty-element tag.

All three types of tags are shown in this example:

<html> 

<img src="logo.png" /> 

</html> 

Each of these tags consists of the element type name (this must be a valid XML name) enclosed within a pair of angle brackets (< >). Let's look at XML tags in more detail.

TAGS

The opening delimiter of an element is called the start-tag. Start-tags are comprised of an element type name, and perhaps some attributes (which we'll look at later in this chapter), enclosed within a pair of angle brackets.

We can think of start-tags as "opening" a container - which is then "closed" with an end-tag. End-tags are comprised of a forward slash (/) followed by an element type name, enclosed within the usual angle brackets.

The name in an end-tag must match the element name in a corresponding start-tag. Everything between the start-tag and the end-tag of an element is contained within that element. The following are legal pairs of start- and end-tags:

<Invoice> ... </Invoice>

<INVOICE> ... </INVOICE>

<INVOICE > ... </INVOICE >

<Wrox:Invoice> ... </Wrox:Invoice>

EMPTY-ELEMENT TAGS

Empty elements are those that have no content, though there may be associated attributes. Let's say that we wanted to explicitly indicate certain points within our XML data (see the next section). We could just add a start- and end-tag pair without any text between, for example:

<point></point>

The abbreviated empty-element tag says the same thing more compactly:

<point/>
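A parser treats an empty start-tag/end-tag pair and the abbreviated empty-element tag as the same thing. The sketch below checks this with Python's standard xml.etree.ElementTree. The is_empty helper and the x and y attributes are hypothetical.

```python
import xml.etree.ElementTree as ET

pair = ET.fromstring("<point></point>")
abbreviated = ET.fromstring("<point/>")
# An empty element may still carry attributes.
with_attributes = ET.fromstring('<point x="3" y="4"/>')


def is_empty(element: ET.Element) -> bool:
    """An element is empty when it has no child elements and no text."""
    return len(element) == 0 and not element.text


# Both spellings produce the same serialized form.
same_output = ET.tostring(pair) == ET.tostring(abbreviated)
```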

THE STRUCTURE OF XML DATA

All XML data must conform to both syntax requirements and a simple container structure. Such data is known as well formed (see the relevant section later in this chapter for more details). All well-formed XML documents consist of one to three parts:

An optional prolog, which may contain important information about the rest of the data.

The body, which consists of one or more elements in the form of a hierarchical tree.

An optional "miscellaneous" epilog that follows the element tree.

These parts, and the unfamiliar syntax in the following illustration, will be described in greater detail later in this chapter.

Prolog

<?xml version="1.0"?>

<!DOCTYPE textfile SYSTEM "http://www.mySite.com/MyDTDs/Textfile.dtd">

<textfile>

<line>A Simple Example</line>

<line> by Yours Truly</line>

<line>This is the 3rd line of a simple 5-line text file.</line>

<line>..the middle line..</line>

<line>And lastly, a final line of text.</line>

<EOF/>

</textfile>

The body sub-tree always has a single root node called the document element (sometimes referred to as the root element) - if not, the data is not well-formed XML!

Any well-formed XML document must be a simple hierarchical tree with a single root node, called the "document root". This document tree contains a secondary tree of elements, with its own singular root node, called the "document element".

The document root of each XML document is also the main point of attachment for the document's description using a DTD or Schema (see Chapters 5 and 6 for more about these). A Processing Instruction (PI - more about these later) is often used to attach a stylesheet as well (see Chapter 9).

Since well-formed XML data has a tree structure, it can be modeled and manipulated as a tree. A standard model for this approach is the W3C Document Object Model (DOM), which will be discussed in Chapter 11.

Now let's look at the body of the XML document in greater depth.
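The single-root rule is easy to observe with a parser. This is a minimal sketch using Python's standard xml.etree.ElementTree, with a shortened version of the text-file example. The parses helper is hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One document element containing everything else.
one_root = "<textfile><line>A Simple Example</line><EOF/></textfile>"
# Two sibling elements at the top level, so there is no single document element.
two_roots = "<line>first</line><line>second</line>"

root = ET.fromstring(one_root)
child_names = [child.tag for child in root]
```

The first string parses and yields a tree rooted at textfile, while the second is rejected.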

The Document Element

This element is the parent of all other elements in the tree, and thus it may not be contained in any other element. Because the document root and the document element are not the same thing, it is better not to refer to the document element as the "root element" (even though it is the root of the element sub-tree).

String Literals

String literals are used for the values of attributes, internal entities, and external identifiers. All string literals in XML are enclosed by delimiter pairs, using either an apostrophe (') or a quotation mark ("). The one restriction upon these literals is that the character used for the delimiters may not appear within the literal - if an apostrophe appears in the literal, the quotation mark delimiter must be used, and vice versa.

"string"

'string'

"..Jack's cow said &quot;moo&quot;"

'..Jack&apos;s cow said "moo"'

ATTRIBUTES

If elements are the "nouns" of XML, then attributes are its "adjectives".

Often there is some information about an element that we wish to attach to it, as opposed to including it as a string inside the element or one of its children. This can be done using attributes, each of which is a name-value pair. Both start-tags and empty-element tags may include attributes within the tag. Attribute values must always be string literals, so the attribute value can use either of the two delimiters, the apostrophe or the quotation mark.
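To see both delimiters at work, here is a minimal sketch using Python's standard xml.etree.ElementTree. The cow element and its attribute names are hypothetical, loosely borrowed from the string-literal examples.

```python
import xml.etree.ElementTree as ET

# One attribute value contains an apostrophe, so it is delimited with quotation
# marks; another contains quotation marks, so it is delimited with apostrophes.
source = """<cow name="Jack's cow" says='"moo"' legs="4"/>"""
cow = ET.fromstring(source)
attributes = dict(cow.attrib)

# The delimiter itself can still appear in a value if it is escaped.
escaped = ET.fromstring('<cow says="&quot;moo&quot;"/>')
escaped_value = escaped.get("says")
```

Whichever delimiter is used, the application sees only the value: every attribute arrives as a plain string, including legs.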

ELEMENTS VS. ATTRIBUTES

The decision to use an element versus an attribute is not a simple one. Much discussion and argument has occurred about this topic on both the XML-L and XML-DEV lists. Some argue that attributes should never be used - that they add unnecessary processing complexity, and that anything that can be represented as an attribute would be better contained within a child element. Others extol the advantage of being able to validate attribute values and assign default values using a DTD. Experiments with generic data compression (such as gzip, zlib, or LZW) have shown that, despite superficial appearances, neither form has an inherent advantage for data storage or transmission.

CHARACTER DATA

Character data is plain text that contains no element tags or other markup, except perhaps character and entity references. Remember too that, because XML is intended for worldwide use, text means Unicode, not just ASCII (see the "Characters" section earlier in this chapter).

The ampersand (&) and less-than (<) characters are used as XML's opening delimiters, and thus may never appear in their literal form (except in CDATA sections, which are discussed later). If these characters are needed within character data, they must be escaped using the entity references &lt; or &amp;. It is not necessary to escape the other markup characters (like >), but they may be escaped (using &gt; in this case), if only for the sake of consistency within the character data.

These escape sequences are part of the set of five such strings defined by the XML specification, and implemented in all compliant XML parsers.
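The difference between the characters that must be escaped and those that merely may be escaped shows up as soon as the data is parsed. A minimal sketch, again assuming Python's standard xml.etree.ElementTree, with a hypothetical rule element and parses helper:

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Entity references are resolved before the text reaches the application.
text = ET.fromstring("<rule>1 &lt; 2 &amp; 3 &gt; 2</rule>").text

literal_gt = parses("<rule>3 > 2</rule>")    # a literal > is tolerated
literal_lt = parses("<rule>1 < 2</rule>")    # a literal < is not
literal_amp = parses("<rule>R & D</rule>")   # nor is a literal &
```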

WHITESPACE

Whitespace is an important linguistic concept for both human and computer languages. Only four characters are treated as whitespace in XML data:

Horizontal tab (HT, U+0009)

Line feed (LF, U+000A)

Carriage return (CR, U+000D)

Space (U+0020)

XML's rule for handling whitespace is very simple: all whitespace characters within the content (except for the CR character, which is normalized as part of end-of-line handling) are preserved by the parser and passed unmodified to the application, while whitespace within element tags and attribute values may be removed or normalized. This is unlike the rampant removal of whitespace carried out in HTML browsers.
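The whitespace rules can be observed directly. The sketch below assumes Python's standard xml.etree.ElementTree, which is built on the Expat parser. The line element and the note attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

# Whitespace in content is preserved, except that the CR-LF pair
# is normalized to a single LF.
content = ET.fromstring("<line>  two\r\n\tlines  </line>").text

# Inside an attribute value, a tab or a line feed is normalized to a space.
note = ET.fromstring('<line note="a\tb\nc"/>').get("note")

# Extra whitespace inside the tags themselves is not significant.
spaced = ET.fromstring("<line   note = 'x'  ></line  >")
spaced_note = spaced.get("note")
```

The leading and trailing spaces and the tab survive in content, while the attribute value comes back with single spaces in place of the tab and line feed.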

SPECIAL-PURPOSE MARKUP

We've already discussed just about every aspect of XML syntax that is necessary to create well-formed XML data (elements, attributes, and character/entity references). There are three additional syntactic constructs that deviate from the familiar syntax of tags (<tagname>) or entity references (&ref;). These are:

Comments

Processing Instructions (PIs)

CDATA sections

COMMENTS

It is often useful to insert notes, or comments, into a document. These comments might provide a revision log, historical notes, or any other sort of meta-data that would be meaningful to the creator and editors of a document (serving to enhance its human readability), but aren't truly part of the document's content. Comments may appear anywhere in a document outside of other markup (that is, you can't put a comment in the middle of a start- or end-tag).

The basic syntax of an XML comment is:

<!-- ...comment text... -->
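As a rough illustration of how parsers treat comments, the sketch below reads the same data with two standard-library Python parsers. The memo document is hypothetical. The DOM-based parser keeps the comment as a node of its own, while ElementTree's default parser silently drops it, which fits the idea that comments aren't truly part of the content.

```python
import xml.etree.ElementTree as ET
from xml.dom import Node, minidom

source = "<memo><!-- draft note --><body>Ship it</body></memo>"

# minidom keeps the comment as a separate node in the tree.
dom = minidom.parseString(source)
first_child = dom.documentElement.firstChild
is_comment = first_child.nodeType == Node.COMMENT_NODE
comment_text = first_child.data

# ElementTree's default parser discards comments while building the tree.
tree_children = [child.tag for child in ET.fromstring(source)]
```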

PROCESSING INSTRUCTIONS (PIS)

XML, like SGML, is a descriptive markup language, and so it does not presume to try to explain how to actually process an element or its contents. This is a powerful advantage in that it provides presentation flexibility, and OS- and application-independence. However, there are times when it is desirable to pass processing hints (or perhaps some script code) to the application along with the document. The Processing Instruction (PI) is the mechanism that XML provides for this purpose.
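A PI consists of a target name followed by free-form data, written between <? and ?>. The sketch below uses the xml-stylesheet PI, a common way of attaching a stylesheet, and reads it back with Python's standard xml.dom.minidom. The style.css file name is hypothetical.

```python
from xml.dom import Node, minidom

source = '<?xml-stylesheet type="text/css" href="style.css"?><doc>text</doc>'
dom = minidom.parseString(source)

# The PI sits in the prolog, so it is a child of the document node
# rather than of the document element.
pi = dom.firstChild
is_pi = pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE
target = pi.target   # names the application the hint is meant for
data = pi.data       # everything after the target, passed through as-is
```

The parser does not interpret the data at all. It only separates the target from the rest and hands both to the application.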

CDATA SECTIONS

CDATA sections are a method of including text that contains characters that would otherwise be interpreted as markup. This feature is primarily useful to authors who wish to include examples of XML markup in their documents (like the examples in this book). This is probably the only good reason to include CDATA sections in a document, since almost all advantages of XML are lost when using these sections.

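The basic syntax of a CDATA section is:

<![CDATA[...]]>

<![CDATA[&Warn; - &Disclaimer; <© 2001 &USCG; & &USN; > ]]>

Outside a CDATA section, the same character data would need each & and < escaped:

<example>&amp;Warn; - &amp;Disclaimer; &lt;© 2001 &amp;USCG; &amp; &amp;USN; &gt;</example>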
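From the application's point of view, a CDATA section and the equivalent escaped text are indistinguishable. A minimal sketch, assuming Python's standard xml.etree.ElementTree and a shortened, hypothetical sample string:

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, & and < are plain characters, not markup.
from_cdata = ET.fromstring(
    "<example><![CDATA[<b>R&D</b> &Warn;]]></example>"
).text

# The same content written with entity references instead.
from_escapes = ET.fromstring(
    "<example>&lt;b>R&amp;D&lt;/b> &amp;Warn;</example>"
).text

same_content = from_cdata == from_escapes
```

Both forms deliver the same string, in which &Warn; is ordinary text rather than an entity reference.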

DOCUMENT STRUCTURE

Prolog

The prolog is the appetizer - used to signal the beginning of XML data. It describes the data's character encoding, and provides some other configuration hints to the XML parser and application.

XML Declaration

All XML documents should begin with an XML Declaration. This declaration is not required in most XML documents, but it serves to explicitly identify the data as XML, and does permit some optimizations when processing the document. If the XML data uses an encoding other than UTF-8 or UTF-16, then an XML Declaration with the correct encoding must be used.

If this declaration is included, then the string literal "<?xml " must be the very first six characters of the document – no preceding whitespace or embedded comments are allowed.

While this declaration looks exactly like a processing instruction, strictly speaking it is not a PI (it is a unique declaration defined by the XML 1.0 REC). Nevertheless, the XML Declaration uses PI-like delimiters and an attribute-like parameter syntax that is similar to the one used in element tags (either " or ' may be used to delimit the value strings). For example:

<?xml version="1.0" encoding='utf-8' standalone="yes"?>

<?xml version='1.0' encoding='utf-8'?>
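The effect of the encoding parameter, and of the rule about the very first characters, can be sketched with Python's standard xml.etree.ElementTree. The word element and the parses helper are hypothetical, and ISO-8859-1 is chosen only as an example of an encoding other than UTF-8 or UTF-16.

```python
import xml.etree.ElementTree as ET


def parses(data: bytes) -> bool:
    """Return True if the bytes are accepted as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# "caf\u00e9" is "cafe" with an acute accent, stored as one byte in ISO-8859-1.
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><word>caf\u00e9</word>'
).encode("iso-8859-1")
word = ET.fromstring(declared).text

# The same bytes without a declaration are read as UTF-8 and rejected.
undeclared = parses("<word>caf\u00e9</word>".encode("iso-8859-1"))

# Whitespace before the declaration is an error; omitting it entirely is fine.
leading_space = parses(b' <?xml version="1.0"?><word/>')
no_declaration = parses(b"<word/>")
```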

DOCUMENT TYPE DECLARATION

This should not be confused with the DTD (Remember: Document Type Definition)! Rather, the Document Type Declaration can refer to an external DTD and/or contain part of the DTD.

Body

This is, of course, the main course of the XML data, which we've discussed at length in terms of its components: elements, attributes, character data, etc. It is worth reiterating that the body may contain comments, PIs, and/or whitespace characters interleaved with elements and character data. The elements must comprise a hierarchical tree, with a single root node.

EPILOG

The XML epilog is the dessert with potentially unpleasant consequences! It may include comments, PIs, and/or whitespace. Comments and whitespace don't cause any significant problems. However, it is unclear whether PIs in the epilog should be applied to the elements in the preceding XML data, or to a subsequent XML document (if any). This may well be a solution in search of a problem, or it may just be a problem in and of itself. XML does not define any end-of-document indicator, and many applications will use the document element end-tag for this purpose. In this case, the epilog is never read, let alone processed.

Tim Bray (one of the XML 1.0 REC editors) considers the epilog a "real design error". It is probably inadvisable to use it without a very compelling reason - and the prior knowledge that it will likely not be interoperable with other XML applications.

VALID XML

Any XML data object is considered valid XML if it is well formed, meets certain further validity constraints, and matches a grammar describing the document's content. Like SGML, XML can provide such a description of document structure in the form of an XML Schema or a DTD.

The SGML equivalent of a well-formed document is known as tag-valid. The SGML equivalent of a valid document is type-valid.

XML PARSERS

In addition to specifying the syntax of XML, the W3C described some of the behavior of the lower tier of XML's client architecture (the XML processor or parser).

Parser Levels

Two levels of parser ("processor") behavior are defined in the XML 1.0 REC:

Non-validating - ensures that the data is well-formed XML, but need not resolve any external resources

Validating - ensures both well-formedness and validity using a DTD, and must resolve external resources

Parser Implementations

There are two different implementation approaches to processing the XML data:

Event-driven parser - Processes XML data sequentially, handling components one at a time

Tree-based parser - Constructs a tree representation of the entire document and provides access to individual nodes in the tree (can be constructed on top of an event-driven parser)

Much quasi-religious argument has occurred about this dichotomy, but each approach has its merits. Like so many other real-world problems, XML processing may have vastly different requirements, and thus different approaches may be best for different situations.

EVENT-DRIVEN PARSERS

The event-driven model should be quite familiar to programmers of modern GUIs and operating systems. In this case, the XML parser executes a call-back to the application for each component of the XML data: element (with attributes), character data, processing instructions, notation, or comments. It's up to the application to handle the XML data as it is provided via the call-backs - the XML parser does not maintain the element tree structure, or any of the data after it has been parsed. The event-driven method requires very modest system resources, even for extremely large documents; and because of its simple, low-level access to the structure of the XML data, it provides great flexibility in handling the data within the XML application.
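The call-back style can be sketched with SAX, one widely used event-driven interface, through Python's standard xml.sax module. This is only an illustration of the approach. The EventLog class and the n attribute are hypothetical, and the data is a shortened version of the earlier text-file example.

```python
import xml.sax


class EventLog(xml.sax.ContentHandler):
    """Record each call-back in order; the parser itself keeps nothing."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Attributes arrive together with the start of their element.
        self.events.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        # Skip whitespace-only runs to keep the log short.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


handler = EventLog()
xml.sax.parseString(
    b'<textfile><line n="1">A Simple Example</line><EOF/></textfile>',
    handler,
)
```

After the call returns, handler.events holds the whole sequence of events. Anything the application did not record during a call-back is gone, which is why memory use stays modest.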

TREE-BASED PARSERS

One of the most widely used structures in software engineering is the simple hierarchical tree. All well-formed XML data is defined to be such a tree, and thus common and mature algorithms may be used to traverse the nodes of an XML document, search for content, and/or edit the document tree. These tree algorithms have the advantage of years of academic and commercial development.

XML parsers that use this approach generally conform to the W3C's Document Object Model (DOM). The DOM is a platform- and language-neutral interface that allows manipulation of tree-structured documents. On the other hand, the DOM tree must be built in memory before the document can be manipulated - high-performance virtual memory support is imperative for larger documents! Once the tree is built, an application may access the DOM via a related API.
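For comparison, here is a minimal tree-based sketch using Python's standard xml.dom.minidom, a lightweight DOM implementation. The data is again a shortened version of the text-file example, and the variable names are ours.

```python
from xml.dom import minidom

source = (
    "<textfile><line>A Simple Example</line>"
    "<line>by Yours Truly</line></textfile>"
)

# The whole document is parsed into an in-memory tree first.
dom = minidom.parseString(source)

# Search the tree for content.
lines = dom.getElementsByTagName("line")
texts = [line.firstChild.data for line in lines]

# Edit the tree by appending a new element with a text node.
extra = dom.createElement("line")
extra.appendChild(dom.createTextNode("And lastly, a final line of text."))
dom.documentElement.appendChild(extra)

# Serialize the modified element sub-tree back to markup.
serialized = dom.documentElement.toxml()
```

Unlike the event-driven approach, every node remains available for searching and editing after parsing, at the cost of holding the entire tree in memory.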
