XML DATABASES

A SEMINAR REPORT

Submitted By

ATUL KUMAR

in partial fulfilment for the award of the degree

of

BACHELOR OF TECHNOLOGY

in

COMPUTER SCIENCE AND ENGINEERING

SCHOOL OF ENGINEERING

COCHIN UNIVERSITY OF SCIENCE & TECHNOLOGY

KOCHI-682022

SEPTEMBER 2010

Division of Computer Engineering

School of Engineering Cochin University of Science & Technology Kochi-682022

_____________________________________________________

CERTIFICATE

Certified that this is a bonafide record of the seminar work titled

Xml Databases

Done by

ATUL KUMAR

of VII semester Computer Science & Engineering in the year 2010 in

partial fulfillment of the requirements for the award of Degree of

Bachelor of Technology in Computer Science & Engineering of Cochin

University of Science & Technology

Dr. David Peter S.          Ms. Anu M.
Head of the Division        Seminar Guide

ACKNOWLEDGEMENT

I express my sincere thanks to Ms. Anu M., my seminar guide, for her valuable
suggestions and sincere vigilance; to Mr. Sudheep P. Eliydom (staff in charge) for providing the
right guidance and co-operation; and to Dr. David Peter S. (Head of the Division) for allowing us
to use the facilities. I would also like to extend my sincere thanks to all other members of the
faculty of the Computer Science and Engineering Department. Last but not least, I want to thank
my friends for their co-operation and encouragement.

ATUL KUMAR

ABSTRACT

The XML database is a rapidly growing technology that is poised to replace many
existing technologies used for data storage. It uses XML and many of its derived
technologies, such as DTD, XSL, XSLT, XPath and XQuery, as its framework. An XML
document is self-describing and, because it is in text format, is compatible with all
kinds of platforms. This makes it a very powerful technology.

Semi-structured data can be stored in XML databases. There are also protocols such
as SOAP for accessing data and web services over the Internet. Owing to its simplicity
and its compatibility with all kinds of platforms, the XML database is rapidly becoming
the de facto standard for passing data over the Internet.

TABLE OF CONTENTS

1. INTRODUCTION
2. SEMI-STRUCTURED DATA
3. XML
4. XML FOR SEMI-STRUCTURED DATA
5. XML DTD : DOCUMENT TYPE DEFINITION
6. XML SCHEMA
7. XPATH
8. XQUERY
9. XSLT
10. XML PARSER
11. XML DATABASE
12. SOAP
13. CONCLUSION
14. REFERENCE

1. INTRODUCTION

For three decades, application developers have relied on relational databases as the bedrock
of the persistent data storage layer. The technology is mature, but today's requirements are
becoming more complex, and a relational database may not be the right tool for the job in
hand. Yet a designer or developer who knows of no alternative has little else to pick.
Relational databases were developed in the days of procedural programming languages (e.g. C,
COBOL and RPG). Programming techniques have evolved in many ways over the past 30 years,
most notably with the introduction of the object-oriented approach, but the persistent storage
model has stayed the same. This report asks whether developers have unknowingly been
creating more work for themselves for many years, and introduces a different approach to
storing and retrieving data.

Today, data structures are often modelled as object hierarchies. Consider a simple invoice
expressed as an object hierarchy:

Simple Invoice, Theoretical Business Object

Invoice = {
    date : "2008-05-24"
    invoiceNumber : 421
    InvoiceItems : {
        Item : {
            description : "Wool Paddock Shet Ret Double Bound Yellow 4'0"
            quantity : 1
            unitPrice : 105.00
        }
        Item : {
            description : "Wool Race Roller and Breastplate Red Double"
            quantity : 1
            unitPrice : 75.00
        }
        Item : {
            description : "Paddock Jacket Red Size Medium Inc Embroidery"
            quantity : 2
            unitPrice : 67.50
        }
    }
}

The following is an example relational structure containing this data:

Table Invoices

date         invoiceId
2008-05-24   421

Table InvoiceItems

invoiceId   description                                quantity   unitPrice
421         Wool Paddock Shet Ret Double Bound ...     1          105.00
421         Wool Race Roller and Breastplate Red ...   1          75.00
421         Paddock Jacket Red Size Medium Inc ...     2          67.50

This single, simple Invoice object can be represented in a relational database, but even
something this simple needs more than one table and joins based on keys, and the object has
to be spread across those tables. This leaves room for human error: when inserting and
updating data, it is up to the developer to ensure that the keys match. Rebuilding the object
from the persistent layer requires an SQL query that selects data from multiple tables. By
nature the query returns a result set of flat, one-dimensional rows, and it is then up to the
developer to rebuild the hierarchical object from scratch.
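The rebuilding step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows list is a hypothetical in-memory stand-in for the result set of a join over the two tables, and the function name rebuild_invoices is invented for this sketch.

```python
from decimal import Decimal

# Hypothetical stand-in for the flat result set of a join over
# Invoices and InvoiceItems: one row per item, header columns repeated.
rows = [
    ("2008-05-24", 421, "Wool Paddock Shet Ret Double Bound ...", 1, Decimal("105.00")),
    ("2008-05-24", 421, "Wool Race Roller and Breastplate Red ...", 1, Decimal("75.00")),
    ("2008-05-24", 421, "Paddock Jacket Red Size Medium Inc ...", 2, Decimal("67.50")),
]


def rebuild_invoices(rows):
    """Group flat join rows back into nested invoice dictionaries."""
    invoices = {}
    for date, invoice_id, description, quantity, unit_price in rows:
        # The invoice header repeats on every row; keep it once per invoice.
        invoice = invoices.setdefault(
            invoice_id, {"date": date, "invoiceNumber": invoice_id, "items": []}
        )
        invoice["items"].append(
            {"description": description, "quantity": quantity, "unitPrice": unit_price}
        )
    return invoices


invoices = rebuild_invoices(rows)
total = sum(item["quantity"] * item["unitPrice"] for item in invoices[421]["items"])
print(len(invoices[421]["items"]), total)
```

Even for one invoice, the grouping logic and the key handling are code that the developer has to write and maintain by hand.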

To a programmer who has been developing with relational databases for some time, this may
seem like second nature, but to a new developer who has just learned the concepts of
object-oriented programming it may seem a little alien.

Leaving aside the programmer's responsibility for the mapping between object and relational
structures, the data types in SQL databases are quite simplistic, so much of the validation
must be performed in the business logic layer of an application before any data can be
inserted or updated in the database.

The SQL data types that a developer can bind to each column in a "CREATE TABLE" statement
are too simplistic to validate data taken directly from a user's input. The business logic
layer in today's applications therefore often performs additional validation: for example, it
checks that a field is a valid phone number or e-mail address, or that a value placed into an
SQL INSERT or UPDATE statement will not break the syntax or cause a security breach.

Object-relational (O/R) mapping has eased these problems, because it allows a relational
database to act as a "virtual object database", but it has brought problems of its own. O/R
mapping techniques and frameworks can be difficult to learn, and it is by no means simple to
map complex Java classes with multiple descendant classes to a relational structure.
Validating user input is still cumbersome and still has to be written in full in the business
logic layer. O/R mapping also adds a performance overhead, because the mapping process
essentially emulates the natural functionality of an object-oriented database.

Object oriented databases are designed to work well with object oriented programming

languages such as Java, C# and C++. Object Databases use the same model as today's

programming languages as they store and index theoretical objects. Object databases are

generally recommended when there is a business need for high performance processing on

complex data.

Two things have held object databases back over the years: (a) the industry's resistance to
change, and (b) the reluctance of many developers to investigate new or alternative
technologies to the ones that are commonplace in industry.

Change does happen, however. We are living in the information age, and businesses talk to
each other via complex XML data structures, with SOAP and RESTful web services becoming
ever more popular means of information exchange between disparate applications and systems.

The XML messages exchanged are by nature hierarchical and deeply tree-structured.
Sometimes the data is unpredictable, and sometimes the structure is prone to change at any
time. Developers trying to map this data to a relational structure may find their lives
becoming more and more difficult.

XML databases offer the same functionality as object databases: data is structured
hierarchically, except that XML databases store XML documents instead of theoretical
objects. While in principle this is the same concept of data storage, XML databases have the
added benefit of being able to exchange the data in its native format, which suits today's
requirements.

Where Object Databases have Object Query Language (OQL), XML Databases have XQuery

which is a W3C standard. XQuery covers the major functionality from former language

proposals like XML-QL, XQL, OQL and the SQL standard.

Returning to the Invoice object and the persistent layer: a developer working with an XML
database would just need to place an XML representation of the object into a collection.

The following is the same invoice data stored in XML format:

Simple Invoice, XML Representation

<invoice>
  <number>421</number>
  <date>2008-05-24</date>
  <items>
    <item>
      <description>Wool Paddock Shet Ret Double Bound Yellow 4'0"</description>
      <quantity>1</quantity>
      <unitPrice>105.00</unitPrice>
    </item>
    <item>
      <description>Wool Race Roller and Breastplate Red Double</description>
      <quantity>1</quantity>
      <unitPrice>75.00</unitPrice>
    </item>
    <item>
      <description>Paddock Jacket Red Size Medium Inc Embroidery</description>
      <quantity>2</quantity>
      <unitPrice>67.50</unitPrice>
    </item>
  </items>
</invoice>

Pulling up the full invoice from the XML database requires no long-winded table joins. It is
as simple as:

XQuery

collection("invoices")/invoice[number=421]

This is simple compared with the equivalent SQL for a relational database:

Equivalent SQL

select * from invoiceitems inner join invoices on
invoiceitems.invoiceid = invoices.invoiceid where invoices.invoiceid = 421;
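The same kind of lookup can be tried without an XML database by using the limited XPath support in Python's standard xml.etree.ElementTree module. This is a rough sketch, not an XQuery implementation: the invoices wrapper element is a hypothetical stand-in for the collection, invoice 420 is invented sample data, and ElementTree compares element text as a string, so the number is quoted.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the "invoices" collection: a wrapper element
# holding several invoice documents.
collection = ET.fromstring("""
<invoices>
  <invoice><number>420</number><date>2008-05-23</date><items/></invoice>
  <invoice>
    <number>421</number><date>2008-05-24</date>
    <items>
      <item><description>Wool Race Roller and Breastplate Red Double</description>
            <quantity>1</quantity><unitPrice>75.00</unitPrice></item>
      <item><description>Paddock Jacket Red Size Medium Inc Embroidery</description>
            <quantity>2</quantity><unitPrice>67.50</unitPrice></item>
    </items>
  </invoice>
</invoices>
""")

# Select the invoice whose <number> child has the text "421".
matches = collection.findall("invoice[number='421']")
invoice = matches[0]

# The whole hierarchy comes back as one tree; no join is needed.
descriptions = [d.text for d in invoice.findall("items/item/description")]
print(invoice.findtext("date"), descriptions)
```

The result is already hierarchical, so there is no step corresponding to rebuilding an object from flat rows.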

XML databases can accept structured as well as unstructured data. XML documents do not
have to conform to any set schema, so a developer can store whatever they wish in the
database without modifying tables and columns. On the other hand, an XML document may be
required to conform to an XML Schema.

XML Schema allows one to define both the node structure of an XML document (e.g. elements
and attributes) and the data types contained within these nodes. The data types can be
defined in very explicit detail, e.g. a number with additional constraints such as maximum
value, minimum value, total digits and fraction digits. Strings can also be given many
additional constraints, including minimum and maximum lengths and a user-defined regular
expression that the value must match; the latter is perhaps the most effective constraint.

Because XML Schema is so explicit in the constraints that can be placed on XML data, large
amounts of validation that would normally be performed in the business logic layer of an
application can potentially be reduced dramatically or even eliminated.
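To make the constraints concrete, the sketch below imitates a few of them by hand in Python. It is not a schema validator: Python's standard library does not include one, the function names are hypothetical, and Python's regular expression syntax differs in places from the one used by XML Schema. It only shows the kind of check a schema-aware processor would perform on the developer's behalf.

```python
import re
from decimal import Decimal, InvalidOperation


def check_decimal(text, min_inclusive, max_inclusive, fraction_digits):
    """Imitate the minimum, maximum and fraction-digits constraints."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False
    if not value.is_finite():
        return False
    # A negative exponent is the number of digits after the decimal point.
    digits_after_point = max(0, -value.as_tuple().exponent)
    return (min_inclusive <= value <= max_inclusive
            and digits_after_point <= fraction_digits)


def check_string(text, min_length, max_length, pattern):
    """Imitate the minimum length, maximum length and pattern constraints."""
    # A schema pattern has to match the whole value, hence fullmatch.
    return (min_length <= len(text) <= max_length
            and re.fullmatch(pattern, text) is not None)


print(check_decimal("105.00", 0, 1000, 2))    # within range, two decimals
print(check_decimal("105.005", 0, 1000, 2))   # too many decimals
print(check_string("a@b.org", 3, 50, r"[^@\s]+@[^@\s]+\.[a-z]+"))
```

With a schema, rules like these are declared once alongside the data definition instead of being re-implemented in each application's business logic.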

A useful tool for Java/J2EE developers is the Java Architecture for XML Binding (JAXB),
which allows a developer to generate simple Java Bean classes that represent the structure of
an underlying XML document. The classes can be generated from an existing XML Schema:
object/XML mapping, if you like.

JAXB allows a developer to convert XML documents into in-memory Java Bean objects that act
as an interface to the underlying XML, and to serialize these in-memory Java objects back
into XML documents. The in-memory data can be validated against the original XML Schema
from which the classes were generated, which means that far less validation code, or none at
all, needs to be written in the business logic layer of the application.

JAXB also allows the developer to generate an XML Schema based on existing Java code, so

a developer can use an XML Database much like an Object database without ever getting into

the detail of using XML, XQuery or SOAP / RESTful Web Services.

Conclusion

Choosing a relational database for a new project that deals with XML and/or unpredictable
data will not stop the project in its tracks, but a great deal of time will be wasted on
trivial matters that could be solved easily by using an XML database instead.

2. SEMI-STRUCTURED DATA

Data that is inherently self-describing and does not conform to any explicit, fixed schema
is known as semi-structured data. An XML document is an example of such data. The structure
of such data is implicit: for example, XML tags define the structure of the data in an XML
document. The information that would normally be associated with a schema is contained
within the data itself. Semi-structured data is usually formalized as labeled graphs.

Examples of semi-structured data include letters, documents, web information systems,
digital libraries and heterogeneous data integration. A letter has a limited structure: every
letter starts with ‘to’ and ends with ‘from’, but in between the structure changes from
person to person, from place to place and from one situation to another. With the advent of
the web, the flow of semi-structured data has increased many fold.

Irregularity in structure:

Semi-structured data is highly irregular. Some data elements may carry more information
than others, and the same kind of data may be represented differently. For example, in one
place names may be written as < lastname, firstname > while in another they may be written
as < firstname, lastname >. Also, since much of the data is added dynamically, the structure
keeps changing. Hence semi-structured data has an indicative structure rather than a
constraining one.

An Example of semi-structured data :-

NOTICE

To : the students

From : the hostel warden

Heading : Air conditioner

Air conditioner will be installed in all rooms by this week.

In this example, the notice has a certain structure from ‘to’ to ‘heading’, but after that its
structure can change significantly depending on the kind of notice.

3. XML

Extensible Markup Language (XML) is a set of rules for encoding documents in machine-

readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several

other related specifications, all gratis open standards.

XML's design goals emphasize simplicity, generality, and usability over the Internet. It is a

textual data format with strong support via Unicode for the languages of the world. Although

the design of XML focuses on documents, it is widely used for the representation of arbitrary

data structures, for example in web services.

Many application programming interfaces (APIs) have been developed that software

developers use to process XML data, and several schema systems exist to aid in the definition

of XML-based languages.

As of 2009, hundreds of XML-based languages have been developed, including RSS, Atom,

SOAP, and XHTML. XML-based formats have become the default for most office-

productivity tools, including Microsoft Office (Office Open XML), OpenOffice.org and

Apple's iWork.

Key terminology

The material in this section is based on the XML Specification. This is not an exhaustive list

of all the constructs which appear in XML; it provides an introduction to the key constructs

most often encountered in day-to-day use.

(Unicode) Character

By definition, an XML document is a string of characters. Almost every legal Unicode

character may appear in an XML document.

Processor and Application

The processor analyzes the markup and passes structured information to an application.

The specification places requirements on what an XML processor must do and not do, but the

application is outside its scope. The processor (as the specification calls it) is often referred to

colloquially as an XML parser.

Markup and Content

The characters which make up an XML document are divided into markup and content.

Markup and content may be distinguished by the application of simple syntactic rules. All

strings which constitute markup either begin with the character "<" and end with a ">", or

begin with the character "&" and end with a ";". Strings of characters which are not markup

are content.

Tag

A markup construct that begins with "<" and ends with ">". Tags come in three flavors:

start-tags, for example <section>, end-tags, for example </section>, and empty-element tags,

for example <line-break/>.

Element

A logical component of a document which either begins with a start-tag and ends with a

matching end-tag, or consists only of an empty-element tag. The characters between the start-

and end-tags, if any, are the element's content, and may contain markup, including other

elements, which are called child elements. An example of an element is <Greeting>Hello,

world.</Greeting>. Another is <line-break/>.

Attribute

A markup construct consisting of a name/value pair that exists within a start-tag or empty-

element tag. In the example (below) the element img has two attributes, src and alt: <img

src="madonna.jpg" alt='by Raphael'/>. Another example would be <step

number="3">Connect A to B.</step> where the name of the attribute is "number" and the

value is "3".

XML Declaration

XML documents may begin by declaring some information about themselves, as in the

following example.

<?xml version="1.0" encoding="UTF-8" ?>

Example:

Here is a small, complete XML document, which uses all of these constructs and concepts.

<?xml version="1.0" encoding="UTF-8" ?>

<painting>

<img src="madonna.jpg" alt='Foligno Madonna, by Raphael'/>

<caption>This is Raphael's "Foligno" Madonna, painted in

<date>1511</date>–<date>1512</date>.

</caption>

</painting>

There are five elements in this example document: painting, img, caption, and two dates. The

date elements are children of caption, which is a child of the root element painting. img has

two attributes, src and alt.
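The element count and the parent and child relationships can be checked mechanically. The sketch below, which uses Python's standard xml.etree.ElementTree module purely as an illustration, parses the same document and lists its elements in document order.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0" encoding="UTF-8" ?>
<painting>
  <img src="madonna.jpg" alt='Foligno Madonna, by Raphael'/>
  <caption>This is Raphael's "Foligno" Madonna, painted in
    <date>1511</date>–<date>1512</date>.
  </caption>
</painting>
"""

# Parse from bytes so that the encoding declaration is honoured.
root = ET.fromstring(document.encode("utf-8"))

# iter() walks the root element and all of its descendants in document order.
tags = [element.tag for element in root.iter()]
img = root.find("img")
dates = [d.text for d in root.find("caption").findall("date")]

print(tags)
print(sorted(img.attrib))
print(dates)
```

Here the parser plays the role of the processor described earlier, and the script that inspects the resulting tree plays the role of the application.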

4. XML FOR SEMI-STRUCTURED DATA

XML is widely used for storing semi-structured data, as it supports all the features needed
to store such data. For example, the NOTICE example can be stored as an XML document as
follows:

<NOTICE>

<To> the students </To>

<From> the hostel warden </From>

<Heading> Air Conditioner </Heading>

<body> air conditioner will be installed in all rooms by this week </body>

</NOTICE>

As another example, here is how a website such as IMDb could be stored as XML documents:

<imdb>
  <show year="2010">
    <title> inception </title>
    <review>
      <suntimes>
        <reviewer> Robert Langdon </reviewer> gives
        <rating> ten </rating> a must watch sci-fi movie.
      </suntimes>
    </review>
    ……………….
    <review>
      ………………………
    </review>
    <box_office> 756,459,231 </box_office>
  </show>
  <show year="2010">
    <title> Toy story 3 </title>
    ………………
    ………………
  </show>
  ………………
  ………...
</imdb>

Thus XML can be used to store anything from a very small piece of semi-structured data, such
as a notice, to a huge website such as IMDb.
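Because the tags travel with the data, a program can discover the structure of such a document at run time. The following sketch reads the NOTICE document with Python's standard xml.etree.ElementTree module; the fields dictionary is only an illustrative name.

```python
import xml.etree.ElementTree as ET

notice = ET.fromstring("""
<NOTICE>
  <To> the students </To>
  <From> the hostel warden </From>
  <Heading> Air Conditioner </Heading>
  <body> air conditioner will be installed in all rooms by this week </body>
</NOTICE>
""")

# No table definition is consulted: the field names come from the document.
fields = {child.tag: child.text.strip() for child in notice}

print(list(fields))
print(fields["To"], "|", fields["Heading"])
```

A notice with an extra or missing element would be read by the same code, which is what makes the format suitable for data whose structure varies.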

5. XML DTD : DOCUMENT TYPE DEFINITION

Document Type Definition (DTD) is a set of markup declarations that define a document

type for SGML-family markup languages (SGML, XML, HTML). DTDs were a precursor to XML

schema and have a similar function, although different capabilities.

DTDs use a terse formal syntax that declares precisely which elements and references may

appear where in the document of the particular type, and what the elements’ contents and

attributes are. DTDs also declare entities which may be used in the instance document.

XML uses a subset of SGML DTD.

Markup declarations

DTDs describe the structure of a class of documents via element and attribute-list

declarations. Element declarations name the allowable set of elements within the

document, and specify whether and how declared elements and runs of character data may

be contained within each element. Attribute-list declarations name the allowable set of

attributes for each declared element, including the type of each attribute value, if not an

explicit set of valid value(s).

DTD markup declarations declare which element types, attribute lists, entities and notations

are allowed in the structure of the corresponding class of XML documents.

Element type declarations

An element type declaration defines an element and its possible content. A valid XML

document contains only elements that are defined in the DTD.

Various keywords and characters specify an element's content; they can be either:

* EMPTY, specifying that the defined element allows no content, i.e. it cannot have any
  child elements, not even text elements (if there is whitespace, it is ignored);

* ANY, specifying that the defined element allows any content without restriction, i.e. it
  may have any number (including none) and type of child elements (including text elements);

* or an expression, specifying the only elements allowed as direct children in the content
  of the defined element; this content can be either:

    o mixed content, in one of two forms:

        + ( #PCDATA ): historically meaning parsed character data, this means that only
          one text element is allowed in the content (no quantifier is allowed);

        + ( #PCDATA | element name | ... )*: a limited choice (in an exclusive list between
          parentheses, separated by "|" pipe characters and terminated by the required "*"
          quantifier) of two or more child elements (including only text elements or the
          specified named elements) may be used in any order and number of occurrences in
          the content.

    o element content, which means that there must be no text elements among the child
      elements of the content (all whitespace between child elements is then ignored, just
      like comments). Such element content is specified as a content particle in a variant
      of Backus-Naur Form without terminal symbols and with element names as non-terminal
      symbols. Element content consists of:

        + a content particle, which can be either the name of an element declared in the
          DTD, or a sequence list or choice list. It may be followed by an optional
          quantifier.

        + a sequence list, which is an ordered list (specified between parentheses and
          separated by a "," comma character) of one or more content particles: all the
          content particles must appear successively as direct children in the content of
          the defined element, at the specified position and in the specified relative order;

        + a choice list, which is a mutually exclusive list (specified between parentheses
          and separated by a "|" pipe character) of two or more content particles: only one
          of these content particles may appear in the content of the defined element at
          the same position.

        + a quantifier, which is a single character that immediately follows the item to
          which it applies, to restrict the number of successive occurrences of that item
          at the specified position in the content of the element; it may be either:

            # + for specifying that there must be one or more occurrences of the item;
              the effective content of each occurrence may be different;

            # * for specifying that any number (zero or more) of occurrences is allowed;
              the item is optional and the effective content of each occurrence may be
              different;

            # ? for specifying that there must not be more than one occurrence; the item
              is optional;

            # If there is no quantifier, the specified item must occur exactly once at
              the specified position in the content of the element.

For example:

<!ELEMENT html (head, body)>

<!ELEMENT p (#PCDATA | p | ul | dl | table | h1|h2|h3)*>

XML DTD schema example

An example of a very simple external XML DTD to describe the schema of a list of persons

might consist of:

<!ELEMENT people_list (person)*>

<!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT birthdate (#PCDATA)>

<!ELEMENT gender (#PCDATA)>

<!ELEMENT socialsecuritynumber (#PCDATA)>

Taking this line by line:

1. people_list is a valid element name, and an instance of such an element contains any

number of person elements. The * denotes there can be 0 or more person elements within

the people_list element.

2. person is a valid element name, and an instance of such an element contains one

element named name, followed by one named birthdate (optional), then gender (also

optional) and socialsecuritynumber (also optional). The ? indicates that an element is

optional. The reference to the name element name has no ?, so a person element must

contain a name element.

3. name is a valid element name, and an instance of such an element contains "parsed

character data" (#PCDATA).

4. birthdate is a valid element name, and an instance of such an element contains parsed

character data.

5. gender is a valid element name, and an instance of such an element contains parsed

character data.

6. socialsecuritynumber is a valid element name, and an instance of such an element

contains parsed character data.
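The content models in this DTD behave like regular expressions over the sequence of child element names, which can be shown with a small hand-written check. The sketch below is an illustration only, not a DTD validator: the parser in Python's standard library does not validate against a DTD, the two models are translated by hand, text between child elements is not checked, and the names CONTENT_MODELS, PCDATA_ONLY and conforms are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the DTD above, rewritten by hand as regular expressions
# over a sequence of child element names, each followed by a space.
CONTENT_MODELS = {
    "people_list": r"(person )*",
    "person": r"name (birthdate )?(gender )?(socialsecuritynumber )?",
}
PCDATA_ONLY = {"name", "birthdate", "gender", "socialsecuritynumber"}


def conforms(element):
    """Return True if the element and all its descendants match the models."""
    if element.tag in PCDATA_ONLY:
        return len(element) == 0  # character data only, no child elements
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False  # element type not declared
    children = "".join(child.tag + " " for child in element)
    return (re.fullmatch(model, children) is not None
            and all(conforms(child) for child in element))


valid = ET.fromstring(
    "<people_list><person><name>Fred Bloggs</name>"
    "<birthdate>2008-11-27</birthdate><gender>Male</gender></person></people_list>"
)
wrong_order = ET.fromstring(
    "<people_list><person><gender>Male</gender>"
    "<name>Fred Bloggs</name></person></people_list>"
)
print(conforms(valid), conforms(wrong_order))
```

The second document is rejected because the sequence list requires name to come first, while the "?" quantifier lets the first document omit socialsecuritynumber.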

An example of an XML file that uses and conforms to this DTD follows. The DTD is referenced
here as an external subset, via the SYSTEM specifier and a URI. It assumes that the DTD can
be identified with the relative URI reference "example.dtd"; the "people_list" after
"!DOCTYPE" states that the root element of the document is called "people_list":

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<!DOCTYPE people_list SYSTEM "example.dtd">

<people_list>

<person>

<name>Fred Bloggs</name>

<birthdate>2008-11-27</birthdate>

<gender>Male</gender>

</person>

</people_list>

The same DTD can also be embedded directly in the XML document itself as an internal

subset, by surrounding it within [square brackets] in the document type declaration, in

which case the document may no longer depend on other external entities and could be

processed as standalone, like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<!DOCTYPE people_list [

<!ELEMENT people_list (person)*>

<!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT birthdate (#PCDATA)>

<!ELEMENT gender (#PCDATA)>

<!ELEMENT socialsecuritynumber (#PCDATA)>

]>

<people_list>

<person>

<name>Fred Bloggs</name>

<birthdate>2008-11-27</birthdate>

<gender>Male</gender>

</person>

</people_list>

An alternative to DTD is XML Schema.

6. XML SCHEMA

XML Schema, published as a W3C recommendation in May 2001, is one of several XML

schema languages. It was the first separate schema language for XML to achieve

Recommendation status by the W3C.

Technically, a schema is an abstract collection of metadata, consisting of a set of schema

components: chiefly element and attribute declarations and complex and simple type

definitions. These components are usually created by processing a collection of schema

documents, which contain the source language definitions of these components. In popular

usage, however, a schema document is often referred to as a schema.

Schema documents are organized by namespace: all the named schema components belong

to a target namespace, and the target namespace is a property of the schema document as

a whole. A schema document may include other schema documents for the same

namespace, and may import schema documents for a different namespace.

When an instance document is validated against a schema (a process known as assessment),

the schema to be used for validation can either be supplied as a parameter to the validation

engine, or it can be referenced directly from the instance document using two special

attributes, xsi:schemaLocation and xsi:noNamespaceSchemaLocation. (The latter

mechanism requires the client invoking validation to trust the document sufficiently to

know that it is being validated against the correct schema.)

XML Schema Documents usually have the filename extension ".xsd". A unique Internet

Media Type is not yet registered for XSDs, so "application/xml" or "text/xml" should be

used, as per RFC 3023.

Example

This is an example of a rather simple schema document to describe an address.

<?xml version="1.0" encoding="utf-8"?>
<xs:schema elementFormDefault="qualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Address">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Recipient" type="xs:string" />
        <xs:element name="House" type="xs:string" />
        <xs:element name="Street" type="xs:string" />
        <xs:element name="Town" type="xs:string" />
        <xs:element name="County" type="xs:string" minOccurs="0" />
        <xs:element name="PostCode" type="xs:string" />
        <xs:element name="Country">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:enumeration value="FR" />
              <xs:enumeration value="DE" />
              <xs:enumeration value="ES" />
              <xs:enumeration value="UK" />
              <xs:enumeration value="US" />
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

A number of development tools can be used to create a graphical representation of a

schema.

An example of an XML document that conforms to this schema

<?xml version="1.0" encoding="utf-8"?>

<Address xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:noNamespaceSchemaLocation="SimpleAddress.xsd">

<Recipient>Mr. Walter C. Brown</Recipient>

<House>49</House>

<Street>Featherstone Street</Street>

<Town>LONDON</Town>

<PostCode>EC1Y 8SY</PostCode>

<Country>UK</Country>

</Address>

7. XPATH

XPath 2.0 is the current version of the XPath language defined by the World Wide Web

Consortium, W3C. It became a recommendation on 23 January 2007.

XPath is used primarily for selecting parts of an XML document. For this purpose the XML

document is modelled as a tree of nodes. XPath allows nodes to be selected by means of a

hierarchic navigation path through the document tree.

XPath 2.0 is used as a sublanguage of XSLT 2.0, and it is also a subset of XQuery 1.0. All three

languages share the same data model, type system, and function library, and were

developed together and published on the same day.

Data model

Every value in XPath 2.0 is a sequence of items. The items may be nodes or atomic values.

An individual node or atomic value is considered to be a sequence of length one. Sequences

may not be nested.

Nodes are of seven kinds, corresponding to different constructs in the syntax of XML:

elements, attributes, text nodes, comments, processing instructions, namespace nodes, and

document nodes.

Type system

The type system of XPath 2.0 is noteworthy for the fact that it mixes strong typing and weak

typing within a single language.

Operations such as arithmetic and boolean comparison require atomic values as their

operands. If an operand returns a node (for example, @price * 1.2), then the node is

automatically atomized to extract the atomic value. If the input document has been

validated against a schema, then the node will typically have a type annotation, and this

determines the type of the resulting atomic value (in this example, the price attribute might

have the type decimal). If no schema is in use, the node will be untyped, and the type of the

resulting atomic value will be untypedAtomic.
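The untyped case can be imitated in Python, where the standard xml.etree.ElementTree module has no schema awareness. This is a loose analogy rather than XPath semantics: the item element is hypothetical sample data, and Decimal is used here only to keep the printed result exact, whereas XPath 2.0 casts an untyped operand of an arithmetic operator to a double.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

item = ET.fromstring('<item price="10.50"/>')  # hypothetical sample element

# Without a schema the attribute value is just text, comparable to an
# untypedAtomic value: nothing says whether it is a number or a string.
raw = item.get("price")

# The cast that XPath applies implicitly has to be written out here.
result = Decimal(raw) * Decimal("1.2")

print(type(raw).__name__, result)
```

With schema validation the cast target would come from the type annotation instead of being chosen by the operator.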

Path expressions

The location paths of XPath 1.0 are referred to in XPath 2.0 as path expressions. Informally,

a path expression is a sequence of steps separated by the "/" operator, for example a/b/c

(which is short for child::a/child::b/child::c). More formally, however, "/" is simply a binary

operator that applies the expression on its right-hand side to each item in turn selected by

the expression on the left hand side. So in this example, the expression a selects all the

element children of the context node that are named <a>; the expression child::b is then

applied to each of these nodes, selecting all the <b> children of the <a> elements; and the

expression child::c is then applied to each node in this sequence, which selects all the <c>

children of these <b> elements.
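The step-by-step evaluation of a/b/c can be traced with Python's xml.etree.ElementTree, which supports a limited subset of XPath 1.0 rather than XPath 2.0. This is a minimal sketch on a made-up document; the element names mirror the example in the text and the root element plays the role of the context node.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; "root" is the context node for the path below.
doc = ET.fromstring(
    "<root>"
    "<a><b><c>1</c><c>2</c></b><b><d>ignored</d></b></a>"
    "<a><b><c>3</c></b></a>"
    "<x><b><c>not under a</c></b></x>"
    "</root>"
)

# Step by step, as the text describes: a, then child::b, then child::c.
a_nodes = doc.findall("a")
b_nodes = [b for a in a_nodes for b in a.findall("b")]
c_nodes = [c for b in b_nodes for c in b.findall("c")]

# The same selection written as a single path.
same_nodes = doc.findall("a/b/c")

print([c.text for c in c_nodes])
```

Both forms select the same three <c> elements in document order; the <c> under <x> is never reached because the first step only selects <a> children.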

The "/" operator is generalized in XPath 2.0 to allow any kind of expression to be used as an

operand. For example, a function call can be used on the right-hand side. The typing rules

for the operator require that the result of the first operand is a sequence of nodes. The right

hand operand can return either nodes or atomic values (but not a mixture). If the result

consists of nodes, then duplicates are eliminated and the nodes are returned in document

order, an ordering defined in terms of the relative positions of the nodes in the original

XML tree.

Other operators available in XPath 2.0 include the following:

Operators                   Effect

+, -, *, div, mod, idiv     Arithmetic on numbers, dates, and durations
=, !=, <, >, <=, >=         General comparison: compare arbitrary sequences. The result is
                            true if any pair of items, one from each sequence, satisfies
                            the comparison
eq, ne, lt, gt, le, ge      Value comparison: compare single items
is                          Compare node identity: true if both operands are the same node
<<, >>                      Compare node position, based on document order
union, intersect, except    Compare sequences of nodes, treating them as sets, returning
                            the set union, intersection, or difference
and, or                     Boolean conjunction and disjunction. Negation is achieved
                            using the not() function.
to                          Defines an integer range, for example 1 to 10
instance of                 Determines whether a value is an instance of a given type
cast as                     Converts a value to a given type
castable as                 Tests whether a value is convertible to a given type
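The difference between general comparison and value comparison is easy to miss, so a small Python sketch may help. It models XPath sequences as Python lists and is an illustration of the semantics only; the function names general_eq, value_eq and int_range are invented for this example.

```python
def general_eq(left, right):
    """XPath general comparison '=': true if ANY pair of items is equal."""
    return any(a == b for a in left for b in right)


def value_eq(left, right):
    """XPath value comparison 'eq' on sequences given as Python lists."""
    if not left or not right:
        return None  # stands in for the empty sequence
    if len(left) > 1 or len(right) > 1:
        raise TypeError("eq requires at most one item on each side")
    return left[0] == right[0]


def int_range(start, end):
    """XPath 'start to end': an inclusive range of integers."""
    return list(range(start, end + 1))


print(general_eq([1, 2], [2, 3]))  # one matching pair is enough
print(value_eq([2], [2]))
print(int_range(1, 10))
```

Because the general comparison is existential, (1, 2) = (2, 3) is true even though the two sequences differ, whereas eq refuses to compare sequences of more than one item.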



XPath 2.0 also offers a for expression, which is a small subset of the FLWOR expression from

XQuery. The expression for $x in X return Y evaluates the expression Y for each value in the

result of expression X in turn, referring to that value using the variable reference $x.
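The behaviour of the for expression, together with the rule that sequences are never nested, can be sketched in Python. This is an analogy under the assumption that a sequence is modelled as a flat list; for_expression is a hypothetical helper, not part of any XPath library.

```python
from itertools import chain


def for_expression(items, body):
    """Imitates `for $x in X return Y`: evaluate body for each item and
    concatenate the results into one flat sequence."""
    return list(chain.from_iterable(body(x) for x in items))


# Analogue of: for $x in (1, 2, 3) return ($x, $x * 10)
result = for_expression([1, 2, 3], lambda x: [x, x * 10])
print(result)
```

Each evaluation of the body yields a two-item sequence, and the overall result is a single sequence of six items rather than three nested pairs.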

The functions available include the following:

Purpose                    Example functions

General string handling    lower-case, upper-case, substring, substring-before,
                           substring-after, translate, starts-with, ends-with, contains,
                           string-length, concat, normalize-space, normalize-unicode
Regular expressions        matches, replace, tokenize
Arithmetic                 count, sum, avg, min, max, round, floor, ceiling, abs
Dates and times            adjust-dateTime-to-timezone, current-dateTime,
                           day-from-dateTime, month-from-dateTime, days-from-duration,
                           months-from-duration, etc.
Properties of nodes        name, node-name, local-name, namespace-uri, base-uri, nilled
Document handling          doc, doc-available, document-uri, collection, id, idref
URIs                       encode-for-uri, escape-html-uri, iri-to-uri, resolve-uri
QNames                     QName, namespace-uri-from-QName, prefix-from-QName,
                           resolve-QName
Sequences                  insert-before, remove, subsequence, index-of, distinct-values,
                           reverse, unordered, empty, exists
Type checking              one-or-more, exactly-one, zero-or-one



8. XQuery

XQuery is a query and functional programming language that is designed to query

collections of XML data. The mission of the XML Query project is to provide flexible query

facilities to extract data from real and virtual documents on the World Wide Web, therefore

finally providing the needed interaction between the Web world and the database world.

Ultimately, collections of XML files will be accessed like databases.

XQuery provides the means to extract and manipulate data from XML documents or any

data source that can be viewed as XML, such as relational databases or office documents.

XQuery uses XPath expression syntax to address specific parts of an XML document. It

supplements this with a SQL-like "FLWOR expression" for performing joins. A FLWOR

expression is constructed from the five clauses after which it is named: FOR, LET, WHERE,

ORDER BY, RETURN.

XQuery 1.0 does not include features for updating XML documents or

databases; it also lacks full text search capability. These features are both under active

development for a subsequent version of the language.
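To make the five FLWOR clauses concrete, the sketch below mirrors each clause with one line of Python over a small in-memory document. The catalogue, its element names and the price threshold are all invented for illustration, and the XQuery shown in the comment is an illustrative query rather than one taken from a real system.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue; element names and prices are illustrative only.
catalog = ET.fromstring(
    "<catalog>"
    "<book><title>XQuery Basics</title><price>45.00</price></book>"
    "<book><title>Cheap Reads</title><price>9.50</price></book>"
    "<book><title>Advanced XML</title><price>60.00</price></book>"
    "</catalog>"
)

# XQuery analogue:
#   for $b in doc("catalog.xml")//book        (FOR)
#   let $p := xs:decimal($b/price)            (LET)
#   where $p > 30                             (WHERE)
#   order by $b/title                         (ORDER BY)
#   return string($b/title)                   (RETURN)
bound = [(b, float(b.findtext("price"))) for b in catalog.iter("book")]  # for + let
kept = [(b, p) for b, p in bound if p > 30]                              # where
ordered = sorted(kept, key=lambda bp: bp[0].findtext("title"))           # order by
titles = [b.findtext("title") for b, _ in ordered]                       # return
print(titles)
```

The Python version has to spell out each stage separately, which is one way to see why the FLWOR form is described as declarative and SQL-like.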

XQuery is a programming language that can express arbitrary XML to XML data

transformations with the following features:

1. Logical/physical data independence

2. Declarative

3. High level

4. Side-effect free

5. Strongly typed.



Examples

The sample XQuery code below lists the unique speakers in each act of Shakespeare's play

Hamlet, encoded in hamlet.xml:

<html><head/><body>
{
  for $act in doc("hamlet.xml")//ACT
  let $speakers := distinct-values($act//SPEAKER)
  return
    <div>
      <h1>{ string($act/TITLE) }</h1>
      <ul>
      {
        for $speaker in $speakers
        return <li>{ $speaker }</li>
      }
      </ul>
    </div>
}
</body></html>
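For readers without an XQuery processor at hand, the core of the query above (unique speakers per act) can be approximated in Python. This is a sketch under stated assumptions: the sample string is a tiny stand-in for hamlet.xml, not the real file, and dict.fromkeys is used in place of distinct-values, which keeps the first occurrence of each name.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for hamlet.xml; the real file is not reproduced here.
sample = """
<PLAY>
  <ACT><TITLE>ACT I</TITLE>
    <SPEECH><SPEAKER>BERNARDO</SPEAKER></SPEECH>
    <SPEECH><SPEAKER>FRANCISCO</SPEAKER></SPEECH>
    <SPEECH><SPEAKER>BERNARDO</SPEAKER></SPEECH>
  </ACT>
  <ACT><TITLE>ACT II</TITLE>
    <SPEECH><SPEAKER>LORD POLONIUS</SPEAKER></SPEECH>
  </ACT>
</PLAY>
"""


def speakers_by_act(xml_text):
    """Return [(act title, [unique speakers])], one entry per ACT."""
    play = ET.fromstring(xml_text.strip())
    result = []
    for act in play.iter("ACT"):                       # for $act in //ACT
        names = [s.text for s in act.iter("SPEAKER")]  # $act//SPEAKER
        unique = list(dict.fromkeys(names))            # distinct-values(...)
        result.append((act.findtext("TITLE"), unique))
    return result


print(speakers_by_act(sample))
```

The sketch returns plain Python data instead of the HTML that the XQuery constructs, since building the output document is a separate concern.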



9. XSLT

XSLT (Extensible Stylesheet Language Transformations) is a declarative, XML-based language

used for the transformation of XML documents into other XML documents. The original

document is not changed; rather, a new document is created based on the content of an

existing one.[2] The new document may be serialized (output) by the processor in standard

XML syntax or in another format, such as HTML or plain text.[3] XSLT is often used to

convert XML data into HTML or XHTML documents for display as a web page: the

transformation may happen dynamically either on the client or on the server, or it may be

done as part of the publishing process. It is also used to create output for printing or direct

video display, typically by transforming the original XML into XSL Formatting Objects to

create formatted output which can then be converted to a variety of formats, a few of

which are PDF, PostScript, AWT and PNG. XSLT is also used to translate XML messages

between different XML schemas, or to make changes to documents within the scope of a

single schema, for example by removing the parts of a message that are not needed.

XSLT examples

Sample of incoming XML document

<?xml version="1.0" ?>
<persons>
  <person username="JS1">
    <name>John</name>
    <family-name>Smith</family-name>
  </person>
  <person username="MI1">
    <name>Morka</name>
    <family-name>Ismincius</family-name>
  </person>
</persons>

Example 1 (transforming XML)

This XSLT stylesheet provides templates to transform the XML document:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/persons">
    <root>
      <xsl:apply-templates select="person"/>
    </root>
  </xsl:template>

  <xsl:template match="person">
    <name username="{@username}">
      <xsl:value-of select="name" />
    </name>
  </xsl:template>
</xsl:stylesheet>

Its evaluation results in a new XML document, having another structure:


<root>
  <name username="JS1">John</name>
  <name username="MI1">Morka</name>
</root>
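The effect of the two templates in Example 1 can be reproduced by hand to show what the stylesheet is doing. The Python sketch below is not an XSLT processor (the Python standard library does not include one); it is a hand-written equivalent of this one stylesheet, and the function name transform is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

persons_xml = """<persons>
  <person username="JS1"><name>John</name><family-name>Smith</family-name></person>
  <person username="MI1"><name>Morka</name><family-name>Ismincius</family-name></person>
</persons>"""


def transform(xml_text):
    """Hand-written equivalent of the two templates in Example 1."""
    persons = ET.fromstring(xml_text)
    root = ET.Element("root")                      # template match="/persons"
    for person in persons.findall("person"):       # apply-templates select="person"
        name = ET.SubElement(root, "name", username=person.get("username"))
        name.text = person.findtext("name")        # value-of select="name"
    return ET.tostring(root, encoding="unicode")


print(transform(persons_xml))
```

The result has the same structure as the XSLT output, without the indentation that indent="yes" requests. The declarative stylesheet expresses the same mapping as rules, whereas the Python version fixes the traversal order explicitly.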

Example 2 (transforming XML to XHTML)

Processing the following example XSLT file


<xsl:stylesheet
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <xsl:template match="/persons">
    <html>
      <head> <title>Testing XML Example</title> </head>
      <body>
        <h1>Persons</h1>
        <ul>
          <xsl:apply-templates select="person">
            <xsl:sort select="family-name" />
          </xsl:apply-templates>
        </ul>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="person">
    <li>
      <xsl:value-of select="family-name"/><xsl:text>, </xsl:text>
      <xsl:value-of select="name"/>
    </li>
  </xsl:template>
</xsl:stylesheet>

with the XML input file shown above results in the following XHTML


<html xmlns="http://www.w3.org/1999/xhtml">
  <head> <title>Testing XML Example</title> </head>
  <body>
    <h1>Persons</h1>
    <ul>
      <li>Ismincius, Morka</li>
      <li>Smith, John</li>
    </ul>
  </body>
</html>

When rendered in a web browser, this XHTML displays the heading "Persons" followed by a
bulleted list of the two names.

[Figure: Rendered XHTML generated from an XML input file and an XSLT transformation.
Source: http://en.wikipedia.org/wiki/File:Xslt_ex2.png]



10. XML Parser

A parser is a piece of software that takes a physical representation of some data and
converts it into an in-memory form for the program as a whole to use. Parsers are used
everywhere in software. An XML parser is a parser designed to read XML and make its
content available to programs. There are different types, and each has its advantages. Unless
a program simply and blindly copies the whole XML file as a unit, every program must
implement or call on an XML parser.

The two main types of XML parser are known as SAX and DOM.



What is SAX?

SAX stands for Simple API for XML. Its main characteristic is that as it reads each unit of

XML, it creates an event that the calling program can use. This allows the calling program to

ignore the bits it doesn't care about, and just keep or use what it likes. The disadvantage is

that the calling program must keep track of everything it might ever need. SAX is often used

in certain high-performance applications or areas where the size of the XML might exceed

the memory available to the running program.
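The event-driven style described above can be seen in a short example using Python's standard xml.sax module. This is a minimal sketch: the NameCollector class and the inline sample document are assumptions made for illustration, and the handler deliberately keeps only the <name> text while ignoring every other event.

```python
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Keeps only the text of <name> elements and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, tag, attrs):
        if tag == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "name":
            self.names.append("".join(self._buffer))
            self._in_name = False


xml_bytes = (
    b"<persons>"
    b"<person username='JS1'><name>John</name><family-name>Smith</family-name></person>"
    b"<person username='MI1'><name>Morka</name><family-name>Ismincius</family-name></person>"
    b"</persons>"
)

handler = NameCollector()
xml.sax.parseString(xml_bytes, handler)
print(handler.names)
```

Note that the handler has to track its own state (the _in_name flag), which is the bookkeeping burden the text mentions; in exchange, the document is never held in memory as a whole.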

What's a DOM?

DOM stands for Document Object Model. It differs from SAX in that it builds the entire XML

document representation in memory and then hands the calling program the whole chunk

of memory. DOM can be very memory intensive; by the time you figure in the overhead for

managing the relationships of the nodes, you might be talking 4× to 8× the size of the

original document in memory usage.

DOM has been widely criticized for being too complicated; it has tried to maintain the same

programming interface for whatever language it is implemented in, even if it violates some

of the conventions of that language. This has led to some DOM-like implementations that

are more in keeping with the philosophy of the host language.
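For comparison with SAX, the sketch below uses Python's standard xml.dom.minidom module, a lightweight DOM implementation. The sample document and the variable names are illustrative assumptions; the point is that the whole tree exists in memory before the program touches any node.

```python
from xml.dom import minidom

xml_text = (
    "<persons>"
    "<person username='JS1'><name>John</name></person>"
    "<person username='MI1'><name>Morka</name></person>"
    "</persons>"
)

# The whole document is parsed into an in-memory tree before any access.
dom = minidom.parseString(xml_text)

# Because the tree is complete, nodes can be visited in any order.
persons = dom.getElementsByTagName("person")
usernames = [p.getAttribute("username") for p in persons]
last_name = persons[-1].getElementsByTagName("name")[0].firstChild.data

# The tree can also be modified and written back out.
persons[0].setAttribute("username", "JS2")
updated = dom.documentElement.toxml()
print(usernames, last_name)
```

Random access and in-place modification are what the extra memory buys; a SAX handler would have to anticipate both while the events stream past.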



11. XML database

An XML database is a data persistence software system that allows data to be stored in XML

format. This data can then be queried, exported and serialized into the desired format.

Two major classes of XML database exist:

1. XML-enabled: these map all XML to a traditional database (such as a relational

database), accepting XML as input and rendering XML as output. This term implies that the

database does the conversion itself (as opposed to relying on middleware).

2. Native XML (NXD): the internal model of such databases depends on XML and uses XML

documents as the fundamental unit of storage, which are, however, not necessarily stored

in the form of text files.
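The mapping performed by an XML-enabled database can be sketched with Python's built-in sqlite3 module. This is an illustration under clear assumptions: the table layout and column names are invented, and the conversion is written in application code purely to show the idea, whereas an XML-enabled database would perform it internally.

```python
import sqlite3
import xml.etree.ElementTree as ET

persons_xml = (
    "<persons>"
    "<person username='JS1'><name>John</name><family-name>Smith</family-name></person>"
    "<person username='MI1'><name>Morka</name><family-name>Ismincius</family-name></person>"
    "</persons>"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (username TEXT PRIMARY KEY, name TEXT, family_name TEXT)")

# XML in: each <person> element becomes one relational row.
for p in ET.fromstring(persons_xml).findall("person"):
    conn.execute(
        "INSERT INTO person VALUES (?, ?, ?)",
        (p.get("username"), p.findtext("name"), p.findtext("family-name")),
    )

# Ordinary SQL works on the mapped data.
rows = conn.execute(
    "SELECT username, name, family_name FROM person ORDER BY family_name"
).fetchall()

# XML out: rows are rendered back into elements.
out = ET.Element("persons")
for username, name, family_name in rows:
    person = ET.SubElement(out, "person", username=username)
    ET.SubElement(person, "name").text = name
    ET.SubElement(person, "family-name").text = family_name
xml_out = ET.tostring(out, encoding="unicode")
print(xml_out)
```

A native XML database, by contrast, would keep the document itself as the unit of storage and would not need a fixed table layout decided in advance.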

Rationale for XML in databases

O'Connell gives one reason for the use of XML in databases: the increasingly common use of

XML for data transport, which has meant that "data is extracted from databases and put

into XML documents and vice-versa". It may prove more efficient (in terms of conversion

costs) and easier to store the data in XML format.

Native XML databases

The term "native XML database" (NXD) can lead to confusion. Many NXDs do not function as

standalone databases at all, and do not really store the native (text) form.

The formal definition from the XML:DB initiative (which appears to have been inactive since 2003)

states that a native XML database:



* Defines a (logical) model for an XML document — as opposed to the data in that

document — and stores and retrieves documents according to that model. At a minimum,

the model must include elements, attributes, PCDATA, and document order. Examples of

such models include the XPath data model, the XML Infoset, and the models implied by the

DOM and the events in SAX 1.0.

* Has an XML document as its fundamental unit of (logical) storage, just as a relational

database has a row in a table as its fundamental unit of (logical) storage.

* Need not have any particular underlying physical storage model. For example, NXDs can

use relational, hierarchical, or object-oriented database structures, or use a proprietary

storage format (such as indexed, compressed files).

Additionally, many XML databases provide a logical model of grouping documents, called

"collections". Databases can set up and manage many collections at one time. In some

implementations, a hierarchy of collections can exist, much in the same way that an

operating system's directory-structure works.

All XML databases now support at least one form of querying syntax. Minimally, just about

all of them support XPath for performing queries against documents or collections of

documents. XPath provides a simple pathing system that allows users to identify nodes that

match a particular set of criteria.
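Querying a collection with a path can be pictured as running the same expression against every document in a group. The Python sketch below models a collection as a dictionary of parsed documents; the document names, the sku attribute and the query_collection helper are all hypothetical, and ElementTree only supports a limited XPath subset.

```python
import xml.etree.ElementTree as ET

# A "collection" modelled as a mapping from document name to parsed document.
# Collection and document names are hypothetical.
collection = {
    "order-1.xml": ET.fromstring("<order><item sku='A1'/><item sku='B2'/></order>"),
    "order-2.xml": ET.fromstring("<order><item sku='A1'/></order>"),
    "note.xml": ET.fromstring("<note>no items here</note>"),
}


def query_collection(docs, path):
    """Run one path against every document; return (document name, node) pairs."""
    return [(name, node) for name, doc in docs.items() for node in doc.findall(path)]


hits = query_collection(collection, ".//item[@sku='A1']")
print([name for name, _ in hits])
```

A real XML database would typically use indexes rather than scanning every document, but the query model seen by the user is the same.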

In addition to XPath, many XML databases support XSLT as a method of transforming

documents or query-results retrieved from the database. XSLT provides a declarative

language written using an XML grammar. It aims to define a set of XPath filters that can

transform documents (in part or in whole) into other formats including plain text, XML, or

HTML.

Many XML databases also support XQuery to perform querying. XQuery includes XPath as a

node-selection method, but extends XPath to provide transformational capabilities. Users

sometimes refer to its syntax as "FLWOR" (pronounced 'Flower') because the query may



include the following clauses: 'for', 'let', 'where', 'order by' and 'return'. Traditional RDBMS

vendors, which previously offered SQL-only engines, now ship hybrid SQL and

XQuery engines. Hybrid SQL/XQuery engines help to query XML data alongside the

relational data, in the same query expression. This approach helps in combining relational

and XML data.

Some XML databases support an API called the XML:DB API (or XAPI) as a form of
implementation-independent access to the XML data store. In XML databases, XAPI
resembles ODBC and JDBC as used with relational databases. On 24 June 2009, the
Java Community Process released the final version of the XQuery API for Java specification
(XQJ), "a common API that allows an application to submit queries conforming to the W3C
XQuery 1.0 specification and to process the results of such queries".

Databases known to support common programming standards such as the XQuery API:

XML database        Language

BaseX               Java
eXist               Java
MarkLogic Server    C++
MonetDB/XQuery      C++



12. SOAP

SOAP, originally defined as Simple Object Access Protocol, is a protocol specification for

exchanging structured information in the implementation of Web Services in computer

networks. It relies on Extensible Markup Language (XML) for its message format, and usually

relies on other Application Layer protocols, most notably Remote Procedure Call (RPC) and

Hypertext Transfer Protocol (HTTP), for message negotiation and transmission. SOAP can

form the foundation layer of a web services protocol stack, providing a basic messaging

framework upon which web services can be built. This XML based protocol consists of three

parts: an envelope, which defines what is in the message and how to process it, a set of

encoding rules for expressing instances of application-defined datatypes, and a convention

for representing procedure calls and responses.

As a layman's example of how SOAP procedures can be used, a SOAP message could be sent

to a web-service-enabled web site, for example, a real-estate price database, with the

parameters needed for a search. The site would then return an XML-formatted document

with the resulting data, e.g., prices, location, features. Because the data is returned in a

standardized machine-parseable format, it could then be integrated directly into a third-

party web site or application.
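The real-estate example can be made concrete by building the request message. The sketch below constructs a SOAP 1.1 envelope with Python's ElementTree; the envelope namespace is the standard SOAP 1.1 one, while the service namespace, the GetPrices operation and its Town and MaxPrice parameters are hypothetical names invented for this illustration.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "http://example.com/realestate"               # hypothetical service namespace


def build_request(town, max_price):
    """Build a SOAP 1.1 request for a hypothetical GetPrices operation."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}GetPrices")
    ET.SubElement(call, f"{{{APP_NS}}}Town").text = town
    ET.SubElement(call, f"{{{APP_NS}}}MaxPrice").text = str(max_price)
    return ET.tostring(envelope, encoding="unicode")


def read_town(message):
    """Parse a message produced above and return the Town parameter."""
    root = ET.fromstring(message)
    return root.findtext(f"{{{SOAP_NS}}}Body/{{{APP_NS}}}GetPrices/{{{APP_NS}}}Town")


request = build_request("LONDON", 500000)
print(request)
```

In practice the message would be sent over a transport such as HTTP and the response parsed in the same way; toolkits usually generate this plumbing from a service description, so hand-built envelopes like this one are mainly useful for understanding the format.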

Advantages

* SOAP is versatile enough to allow for the use of different transport protocols. The

standard stacks use HTTP as a transport protocol, but other protocols are also usable (e.g.,

JMS, SMTP).

* Because the SOAP model fits well within the HTTP request/response model, it can pass easily

through existing firewalls and proxies, without modifications to the SOAP protocol, and can use

the existing infrastructure.

Disadvantages

* Because of the verbose XML format, SOAP can be considerably slower than competing

middleware technologies such as CORBA. This may not be an issue when only small

messages are sent.[7] To improve performance for the special case of XML with embedded

binary objects, the Message Transmission Optimization Mechanism was introduced.

* When relying on HTTP as a transport protocol and not using WS-Addressing or an ESB,

the roles of the interacting parties are fixed. Only one party (the client) can use the services

of the other. Developers must use polling instead of notification in these common cases.



13. Conclusion

XML databases are rapidly becoming the de facto standard for storing and transferring data
over the internet. Because of their ability to store heterogeneous data, they are also used in a
wide variety of fields such as e-publishing, digital libraries, and the finance industry.

The XML standard is evolving continuously, and new XML-based standards are being created for
almost every field. Its open-source base and simplicity have made it an excellent tool for
passing information. We live in an era of information technology, and XML databases are the
best vehicle we can ride on.



14. References

Books:

Database Systems, by Navathe and Elmasri

XML: The Complete Reference, TMH Publications

Websites:

http://en.wikipedia.org/wiki/XML_database

http://www.cfoster.net/articles/xmldb-business-case/

http://www.25hoursaday.com/StoringAndQueryingXML.html

http://www.stylusstudio.com/db_to_xml_mapper.html

http://www.wisegeek.com/what-is-an-xml-database.htm

http://en.wikipedia.org/wiki/SOAP
