XML DATABASES
A SEMINAR REPORT
Submitted By
ATUL KUMAR
in partial fulfilment for the award of the degree
of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
SCHOOL OF ENGINEERING
COCHIN UNIVERSITY OF SCIENCE & TECHNOLOGY
KOCHI-682022
SEPTEMBER 2010
Division of Computer Engineering
School of Engineering Cochin University of Science & Technology Kochi-682022
_____________________________________________________
CERTIFICATE
Certified that this is a bonafide record of the seminar work titled
Xml Databases
Done by
ATUL KUMAR
of VII semester Computer Science & Engineering in the year 2010 in
partial fulfillment of the requirements for the award of Degree of
Bachelor of Technology in Computer Science & Engineering of Cochin
University of Science & Technology
Dr. David Peter S.                    Ms. Anu M.
Head of the Division                  Seminar Guide
Xml Databases
Division of Computer Engineering i
ACKNOWLEDGEMENT
I express my sincere thanks to Ms. Anu M., my seminar guide, for her valuable
suggestions and sincere vigilance; to Mr. Sudheep P. Eliydom (Staff in charge) for providing the right
guidance and co-operation; and to Dr. David Peter S. (Head of Division) for allowing us to use the
facilities. I would also like to extend my sincere thanks to all other members of the faculty of the
Computer Science and Engineering Department. Last but not least, I want to thank my friends for
their co-operation and encouragement.
ATUL KUMAR
ABSTRACT
The XML database is a rapidly growing technology that is poised
to replace many existing technologies used for data storage. It uses XML and many
of its derived technologies, such as DTD, XSL, XSLT, XPath and XQuery, as its framework.
An XML document is self-describing and, because it is plain text, is compatible with all kinds
of platforms. This makes it a very powerful technology.
Semi-structured data can be stored in XML databases.
There are also protocols such as SOAP for accessing data and web services over the
internet. Because of its simplicity and its compatibility with all kinds of platforms, the XML
database is rapidly becoming the de facto standard for passing data over the
internet.
TABLE OF CONTENTS
1. INTRODUCTION
2. SEMI-STRUCTURED DATA
3. XML
4. XML FOR SEMI-STRUCTURED DATA
5. XML DTD : DOCUMENT TYPE DEFINITION
6. XML SCHEMA
7. XPATH
8. XQUERY
9. XSLT
10. XML PARSER
11. XML DATABASE
12. SOAP
13. CONCLUSION
14. REFERENCE
1. INTRODUCTION
For three decades, application developers have relied on relational databases as the bedrock
for a persistent data storage layer. While the technology is mature, today's requirements are
becoming more complex and relational databases may not be the right tool for the job in hand. But
what else does a designer or developer pick if they know no better? Relational databases
were developed in the days of procedural programming languages (e.g. C, COBOL and
RPG). Programming techniques have evolved in many ways over the past 30 years, most notably
with the introduction of the object-oriented approach, but the persistent storage model has stayed
the same. This report asks whether developers have, unknowingly, been creating
more work for themselves for many years, and attempts to give an
introduction to a different approach to storing and retrieving data.
Today, data structures are commonly modelled as object hierarchies. Consider
a simple invoice expressed as an object hierarchy:
Simple Invoice, Theoretical Business Object
Invoice = {
date : "2008-05-24"
invoiceNumber : 421
InvoiceItems : {
Item : {
description : "Wool Paddock Shet Ret Double Bound Yellow 4'0"
quantity : 1
unitPrice : 105.00
}
Item : {
description : "Wool Race Roller and Breastplate Red Double"
quantity : 1
unitPrice : 75.00
}
Item : {
description : "Paddock Jacket Red Size Medium Inc Embroidery"
quantity : 2
unitPrice : 67.50
}
}
}
The following is an example relational structure, containing this data
Table Invoices
date          invoiceId
2008-05-24    421

Table InvoiceItems
invoiceId    description                                  quantity    unitPrice
421          Wool Paddock Shet Ret Double Bound ...       1           105.00
421          Wool Race Roller and Breastplate Red ...     1           75.00
421          Paddock Jacket Red Size Medium Inc ...       2           67.50
Representing this single, simple Invoice object in a relational database can be done, but
even for something this simple you immediately need more than one table and table joins based on
keys, and the object has to be spread over multiple tables. This leaves room for
human error. When inserting and updating data, it is up to the developer to ensure that keys
match correctly. When rebuilding the object from the persistent layer, you need an
SQL query that selects data from multiple tables. By nature the query returns the data as
a result set of flat, one-dimensional rows, and it is then up to the developer to build
the hierarchical object from scratch.
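The rebuilding step described above can be sketched in a few lines. The following is only an illustration, assuming Python with its built-in sqlite3 module and an in-memory database; the function name load_invoice and the dictionary layout are hypothetical choices, and the item descriptions are kept truncated exactly as in the tables above.

```python
import sqlite3

# In-memory database holding the two tables from the example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (invoiceId INTEGER PRIMARY KEY, date TEXT);
    CREATE TABLE invoiceItems (
        invoiceId INTEGER REFERENCES invoices(invoiceId),
        description TEXT, quantity INTEGER, unitPrice REAL
    );
    INSERT INTO invoices VALUES (421, '2008-05-24');
    INSERT INTO invoiceItems VALUES
        (421, 'Wool Paddock Shet Ret Double Bound ...', 1, 105.00),
        (421, 'Wool Race Roller and Breastplate Red ...', 1, 75.00),
        (421, 'Paddock Jacket Red Size Medium Inc ...', 2, 67.50);
""")


def load_invoice(conn, invoice_id):
    """Rebuild one hierarchical invoice from the flat rows returned by a join."""
    rows = conn.execute(
        """SELECT i.invoiceId, i.date, it.description, it.quantity, it.unitPrice
           FROM invoices AS i
           JOIN invoiceItems AS it ON it.invoiceId = i.invoiceId
           WHERE i.invoiceId = ?
           ORDER BY it.rowid""",
        (invoice_id,),
    ).fetchall()
    if not rows:
        return None
    # Every row repeats the invoice header; the nesting is recreated by hand.
    invoice = {"invoiceNumber": rows[0][0], "date": rows[0][1], "items": []}
    for _, _, description, quantity, unit_price in rows:
        invoice["items"].append(
            {"description": description, "quantity": quantity, "unitPrice": unit_price}
        )
    return invoice


invoice = load_invoice(conn, 421)
```

Even in this small case the header columns come back once per item row, and the code that turns rows into a nested structure has to be written and maintained by the developer.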
To a programmer who has been developing with relational databases for some time this may
seem like second nature but for a new developer that has just learned the concepts of Object
Oriented programming this may seem a little alien.
Leaving aside the programmer's responsibility to ensure the mapping between Object and
relational structures, because the data types in SQL databases are quite simplistic all
validation must be performed within the business logic layer of an application before any data
can be inserted or updated in the database.
The SQL "CREATE TABLE" statement and the SQL data types a developer can bind to each column
are too simplistic to be used as a means of validating data taken directly from a user's input.
Often the business logic layer in today's applications performs additional validation, e.g.
checks that a field is a valid phone number or a valid e-mail address or even that when the
field is inserted into the SQL INSERT or UPDATE statement that it won't actually break the
syntax or cause a security breach.
Object Relational Mapping has definitely eased these problems with relational databases
because it allows a relational database to become a "virtual object database", but O/R
Mapping has brought some problems of its own. O/R Mapping techniques and frameworks
can be difficult to learn, and it is by no means simple to map complex Java classes with multiple
descendant classes to a relational structure. Validating user input is still cumbersome and
still needs to be written in full in the business logic layer. O/R Mapping also adds a
performance overhead, because the mapping process essentially attempts to
emulate the natural functionality of an object-oriented database.
Object oriented databases are designed to work well with object oriented programming
languages such as Java, C# and C++. Object Databases use the same model as today's
programming languages as they store and index theoretical objects. Object databases are
generally recommended when there is a business need for high performance processing on
complex data.
What has held object databases back over the years is (a) the industry's resistance to change,
and (b) the reluctance of many developers in the industry to investigate new or
alternative technologies to the ones that are commonplace in industry.
However, thankfully change does happen. Today we are living in the information age,
businesses are talking to each other via complex XML data structures, (SOAP and RESTful
Web Services becoming the ever more popular means of information exchange between
disparate applications and systems).
The XML messages exchanged are by nature hierarchical and deeply tree-structured.
Sometimes the data is unpredictable, and sometimes the structure is prone to change at
any time. Developers trying to map this data to a relational structure may find their lives
becoming more and more difficult.
XML databases offer the same functionality as object databases: data is structured in a
hierarchical manner, except that XML databases store XML documents instead of theoretical
objects. While in principle this is the same concept of data storage, XML databases have the
added benefit of being able to exchange the data in its native format, which is well suited to today's
requirements.
Where Object Databases have Object Query Language (OQL), XML Databases have XQuery
which is a W3C standard. XQuery covers the major functionality from former language
proposals like XML-QL, XQL, OQL and the SQL standard.
Going back to the Invoice object and the persistent layer, a developer working with an XML
database would just need to place an XML representation of the object into a collection.
The following is an example of the invoice data but stored in XML format
Simple Invoice, XML Representation
<invoice>
<number>421</number>
<date>2008-05-24</date>
<items>
<item>
<description>Wool Paddock Shet Ret Double Bound Yellow 4'0"</description>
<quantity>1</quantity>
<unitPrice>105.00</unitPrice>
</item>
<item>
<description>Wool Race Roller and Breastplate Red Double</description>
<quantity>1</quantity>
<unitPrice>75.00</unitPrice>
</item>
<item>
<description>Paddock Jacket Red Size Medium Inc Embroidery</description>
<quantity>2</quantity>
<unitPrice>67.50</unitPrice>
</item>
</items>
</invoice>
Pulling up the full invoice from the XML database requires no long-winded table joins. It is
as simple as:
XQuery
collection("invoices")/invoice[number=421]
This is pretty simple when compared with the equivalent SQL for relational databases:
Equivalent SQL
select * from invoiceitems inner join invoices on
invoiceitems.invoiceid = invoices.invoiceid where invoices.invoiceid = 421;
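The difference can be tried out without an XML database. The sketch below is an approximation only: it uses Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath rather than XQuery, and it stands in for a collection with a single wrapper element. The wrapper element invoices and the second invoice (number 420) are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# A stand-in for the "invoices" collection: one wrapper element holding two invoices.
COLLECTION = """
<invoices>
  <invoice>
    <number>420</number>
    <date>2008-05-23</date>
    <items/>
  </invoice>
  <invoice>
    <number>421</number>
    <date>2008-05-24</date>
    <items>
      <item>
        <description>Wool Race Roller and Breastplate Red Double</description>
        <quantity>1</quantity>
        <unitPrice>75.00</unitPrice>
      </item>
      <item>
        <description>Paddock Jacket Red Size Medium Inc Embroidery</description>
        <quantity>2</quantity>
        <unitPrice>67.50</unitPrice>
      </item>
    </items>
  </invoice>
</invoices>
"""

root = ET.fromstring(COLLECTION)

# Select the <invoice> children whose <number> child has the text '421'.
matches = root.findall("invoice[number='421']")
invoice = matches[0]

# The whole invoice comes back as one tree; its items need no join.
descriptions = [d.text for d in invoice.iter("description")]
```

The selected invoice element already contains its items as children, so no second query or reassembly step is needed.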
XML Databases can accept structured as well as unstructured data. XML documents do not
have to conform to any set Schema so a developer can fire anything they wish at the database,
no need to modify tables and columns. On the other hand, XML may conform to an XML
Schema.
XML Schema allows one to define an XML document in terms of both its node structure (e.g.
elements and attributes) and the data types contained within these nodes. It allows one
to define these data types in very explicit detail, e.g. a decimal number with additional constraints such as
maximum value, minimum value, total digits and fraction digits. Strings can also be
given many additional constraints, including minimum and maximum lengths as well as
matching a user-defined regular expression, which is perhaps the most effective constraint.
Because XML Schema is so powerful in terms of the explicitness of the constraints that can
be placed on XML data, the potentially large amount of validation that would normally be
performed in the business logic layer of an application can be reduced dramatically or even
eliminated completely.
A great tool for Java/J2EE Developers is Java Architecture for XML Binding or JAXB which
allows a developer to generate simple Java Bean classes which represent the structure of an
underlying XML document, the classes can be generated from an existing XML Schema.
Object/XML Mapping if you like.
JAXB allows a developer to convert XML documents into in-memory Java Bean Objects
which act as an interface to the underlying XML, it also has the ability to serialize these in-
memory Java Objects back into XML documents. Validation of the in-memory data is
performed based on the original XML Schema from which the classes were generated, which
means far less / no validation code would need to be written in the business logic layer of the
application.
JAXB also allows the developer to generate an XML Schema based on existing Java code, so
a developer can use an XML Database much like an Object database without ever getting into
the detail of using XML, XQuery or SOAP / RESTful Web Services.
Conclusion
For a new project that deals with XML and/or unpredictable data, choosing a relational
database will not stop the project in its tracks, but a great deal of time will be wasted on
trivial matters that could be easily solved by making use of an XML database instead.
2. SEMI-STRUCTURED DATA
Data that is inherently self-describing and does not conform to any explicit and fixed
schema is known as semi-structured data. An example of such data is an XML document.
The structure is implicit in such data. For example, XML tags define the structure of the data in
an XML document. The information that would normally be associated with the schema is
contained within the data itself. Semi-structured data is usually formalized as labeled graphs.
Some examples of semi-structured data are letters, documents, web information systems,
digital libraries, and heterogeneous data integration. A letter has a limited structure, as every
letter starts with 'to' and ends with 'from', but in between them the structure of a letter
changes from person to person, from place to place and from one situation to another. With
the advent of the web, the flow of semi-structured data has increased many fold.
Irregularity in structure:
There is high irregularity in semi-structured data. Some data elements may
carry more information than others. The same kind of data may be represented differently. For
example, in one place names may be written as < lastname, firstname > while in
another place they may be written as < firstname, lastname >. Also, since a lot of data is added
dynamically, the structure keeps changing. Hence semi-structured data does not have a
constraining structure; rather, it has an indicative structure.
An Example of semi-structured data :-
NOTICE
To : the students
From : the hostel warden
Heading : Air conditioner
Air conditioner will be installed in all rooms by this week.
Here, the notice has a certain structure from 'to' through 'heading', but after that its
structure can change significantly depending on what kind of notice it is.
3. XML
Extensible Markup Language (XML) is a set of rules for encoding documents in machine-
readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several
other related specifications, all gratis open standards.
XML's design goals emphasize simplicity, generality, and usability over the Internet. It is a
textual data format with strong support via Unicode for the languages of the world. Although
the design of XML focuses on documents, it is widely used for the representation of arbitrary
data structures, for example in web services.
Many application programming interfaces (APIs) have been developed that software
developers use to process XML data, and several schema systems exist to aid in the definition
of XML-based languages.
As of 2009, hundreds of XML-based languages have been developed, including RSS, Atom,
SOAP, and XHTML. XML-based formats have become the default for most office-
productivity tools, including Microsoft Office (Office Open XML), OpenOffice.org and
Apple's iWork.
Key terminology
The material in this section is based on the XML Specification. This is not an exhaustive list
of all the constructs which appear in XML; it provides an introduction to the key constructs
most often encountered in day-to-day use.
(Unicode) Character
By definition, an XML document is a string of characters. Almost every legal Unicode
character may appear in an XML document.
Processor and Application
The processor analyzes the markup and passes structured information to an application.
The specification places requirements on what an XML processor must do and not do, but the
application is outside its scope. The processor (as the specification calls it) is often referred to
colloquially as an XML parser.
Markup and Content
The characters which make up an XML document are divided into markup and content.
Markup and content may be distinguished by the application of simple syntactic rules. All
strings which constitute markup either begin with the character "<" and end with a ">", or
begin with the character "&" and end with a ";". Strings of characters which are not markup
are content.
Tag
A markup construct that begins with "<" and ends with ">". Tags come in three flavors:
start-tags, for example <section>, end-tags, for example </section>, and empty-element tags,
for example <line-break/>.
Element
A logical component of a document which either begins with a start-tag and ends with a
matching end-tag, or consists only of an empty-element tag. The characters between the
start- and end-tags, if any, are the element's content, and may contain markup, including other
elements, which are called child elements. An example of an element is
<Greeting>Hello, world.</Greeting>. Another is <line-break/>.
Attribute
A markup construct consisting of a name/value pair that exists within a start-tag or
empty-element tag. In the following example the element img has two attributes, src and alt:
<img src="madonna.jpg" alt='by Raphael'/>. Another example would be
<step number="3">Connect A to B.</step> where the name of the attribute is "number" and the
value is "3".
XML Declaration
XML documents may begin by declaring some information about themselves, as in the
following example.
<?xml version="1.0" encoding="UTF-8" ?>
Example:
Here is a small, complete XML document, which uses all of these constructs and concepts.
<?xml version="1.0" encoding="UTF-8" ?>
<painting>
<img src="madonna.jpg" alt='Foligno Madonna, by Raphael'/>
<caption>This is Raphael's "Foligno" Madonna, painted in
<date>1511</date>–<date>1512</date>.
</caption>
</painting>
There are five elements in this example document: painting, img, caption, and two dates. The
date elements are children of caption, which is a child of the root element painting. img has
two attributes, src and alt.
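The counts given above can be checked with any XML processor. The following is a small sketch, assuming Python's standard xml.etree.ElementTree module as the processor; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" encoding="UTF-8" ?>
<painting>
  <img src="madonna.jpg" alt='Foligno Madonna, by Raphael'/>
  <caption>This is Raphael's "Foligno" Madonna, painted in
    <date>1511</date>–<date>1512</date>.
  </caption>
</painting>"""

# Parse from bytes so that the encoding named in the XML declaration applies.
root = ET.fromstring(DOC.encode("utf-8"))

# Walk the tree in document order and collect every element name.
names = [element.tag for element in root.iter()]

# Attributes are exposed as a name/value mapping on the element.
img = root.find("img")
attributes = dict(img.attrib)

# The two date elements are children of caption.
dates = [d.text for d in root.find("caption").findall("date")]
```

Here the processor (the parser) delivers elements, attributes and character content to the application (the script), which is the division of labour described under "Processor and Application".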
4. XML FOR SEMI-STRUCTURED DATA
XML is widely used for storing semi-structured data, as it supports all the features needed to store
such data. For example, the NOTICE example can be stored as an XML document as
follows:
<NOTICE>
<To> the students </To>
<From> the hostel warden </From>
<Heading> Air Conditioner </Heading>
<body> air conditioner will be installed in all rooms by this week </body>
</NOTICE>
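Because the tags carry the structure, a program can discover the fields of such a notice without being told a schema in advance. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the dictionary name fields is an illustrative choice.

```python
import xml.etree.ElementTree as ET

NOTICE = """
<NOTICE>
  <To> the students </To>
  <From> the hostel warden </From>
  <Heading> Air Conditioner </Heading>
  <body> air conditioner will be installed in all rooms by this week </body>
</NOTICE>
"""

notice = ET.fromstring(NOTICE)

# Each child element names its own field, so the field list is read from the data itself.
fields = {child.tag: child.text.strip() for child in notice}
```

A different kind of notice could add or omit elements and the same two lines would still list whatever fields are present.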
As another example, here is how a website such as IMDb might be stored as an XML document:
<imdb>
<show year="2010">
<title> inception </title>
<review>
<suntimes>
<reviewer> Robert Langdon </reviewer> gives
<rating> ten </rating> a must watch sci-fi movie.
</suntimes>
</review>
………………. <!-- many other reviews -->
<review>
………………………
</review>
<box_office> 756,459,231 </box_office>
</show>
<show year="2010">
<title> Toy story 3 </title>
………………
………………
</show>
………………
……….. <!-- many more shows -->
</imdb>
Thus XML can be used to store anything from a small piece of semi-structured data such as a
notice to a huge website such as IMDb.
5. XML DTD : DOCUMENT TYPE DEFINITION
Document Type Definition (DTD) is a set of markup declarations that define a document
type for SGML-family markup languages (SGML, XML, HTML). DTDs were a precursor to XML
schema and have a similar function, although different capabilities.
DTDs use a terse formal syntax that declares precisely which elements and references may
appear where in the document of the particular type, and what the elements’ contents and
attributes are. DTDs also declare entities which may be used in the instance document.
XML uses a subset of SGML DTD.
Markup declarations
DTDs describe the structure of a class of documents via element and attribute-list
declarations. Element declarations name the allowable set of elements within the
document, and specify whether and how declared elements and runs of character data may
be contained within each element. Attribute-list declarations name the allowable set of
attributes for each declared element, including the type of each attribute value, if not an
explicit set of valid value(s).
DTD markup declarations declare which element types, attribute lists, entities and notations
are allowed in the structure of the corresponding class of XML documents.
Element type declarations
An element type declaration defines an element and its possible content. A valid XML
document contains only elements that are defined in the DTD.
Various keywords and characters specify an element's content; they can be either:
* EMPTY, for specifying that the defined element allows no content, i.e. it cannot have any
  child elements or text, not even whitespace;
* ANY, for specifying that the defined element allows any content, without restriction, i.e.
  that it may have any number (including none) and type of child elements (including text
  elements);
* or an expression, specifying the only elements allowed as direct children in the content
  of the defined element; this content can be either:
  o mixed content, which means that the content may contain text as well as named
    elements; it is specified as one of:
    + ( #PCDATA ): historically meaning parsed character data, this means that only one
      text element is allowed in the content (no quantifier is allowed);
    + ( #PCDATA | element name | ... )*: a limited choice (in an exclusive list between
      parentheses, separated by "|" pipe characters and terminated by the required "*"
      quantifier) of two or more child elements (including only text elements or the specified
      named elements) may be used in any order and number of occurrences in the content.
  o element content, which means that there must be no text elements in the
    children elements of the content (all whitespace between child elements is then
    ignored, just like comments). Such element content is specified as a content particle in a
    variant of Backus-Naur Form without terminal symbols and with element names as non-terminal
    symbols. Element content consists of:
    + a content particle, which can be either the name of an element declared in the DTD, or a
      sequence list or choice list. It may be followed by an optional quantifier.
    + a sequence list, which means an ordered list (specified between parentheses and
      separated by a "," comma character) of one or more content particles: all the content
      particles must appear successively as direct children in the content of the defined element,
      at the specified position and in the specified relative order;
    + a choice list, which means a mutually exclusive list (specified between parentheses and
      separated by a "|" pipe character) of two or more content particles: only one of these content
      particles may appear in the content of the defined element at the same position.
    + a quantifier, which is a single character that immediately follows the specified item to
      which it applies, to restrict the number of successive occurrences of this item at the
      specified position in the content of the element; it may be either:
      # "+" for specifying that there must be one or more occurrences of the item;
        the effective content of each occurrence may be different;
      # "*" for specifying that any number (zero or more) of occurrences is allowed;
        the item is optional and the effective content of each occurrence may be different;
      # "?" for specifying that there must not be more than one occurrence; the item
        is optional;
      # if there is no quantifier, the specified item must occur exactly one time at the
        specified position in the content of the element.
For example:
<!ELEMENT html (head, body)>
<!ELEMENT p (#PCDATA | p | ul | dl | table | h1|h2|h3)*>
XML DTD schema example
An example of a very simple external XML DTD to describe the schema of a list of persons
might consist of:
<!ELEMENT people_list (person)*>
<!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT birthdate (#PCDATA)>
<!ELEMENT gender (#PCDATA)>
<!ELEMENT socialsecuritynumber (#PCDATA)>
Taking this line by line:
1. people_list is a valid element name, and an instance of such an element contains any
number of person elements. The * denotes there can be 0 or more person elements within
the people_list element.
2. person is a valid element name, and an instance of such an element contains one
element named name, followed by one named birthdate (optional), then gender (also
optional) and socialsecuritynumber (also optional). The ? indicates that an element is
optional. The reference to the name element name has no ?, so a person element must
contain a name element.
3. name is a valid element name, and an instance of such an element contains "parsed
character data" (#PCDATA).
4. birthdate is a valid element name, and an instance of such an element contains parsed
character data.
5. gender is a valid element name, and an instance of such an element contains parsed
character data.
6. socialsecuritynumber is a valid element name, and an instance of such an element
contains parsed character data.
An example of an XML file which makes use of and conforms to this DTD follows. The DTD is
referenced here as an external subset, via the SYSTEM specifier and a URI. It assumes that
we can identify the DTD with the relative URI reference "example.dtd"; the "people_list"
after "!DOCTYPE" tells us that the root element of the document is called
"people_list":
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE people_list SYSTEM "example.dtd">
<people_list>
<person>
<name>Fred Bloggs</name>
<birthdate>2008-11-27</birthdate>
<gender>Male</gender>
</person>
</people_list>
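A content model such as (name, birthdate?, gender?, socialsecuritynumber?) behaves like a regular expression over the names of the child elements. The sketch below illustrates that idea only and is not a DTD validator: Python's standard library does not validate against DTDs, so the person rule is transcribed by hand into a regular expression. The names PERSON_MODEL and person_is_valid are hypothetical, and text between child elements is ignored for simplicity.

```python
import re
import xml.etree.ElementTree as ET

# (name, birthdate?, gender?, socialsecuritynumber?) transcribed as a regular
# expression over the space-terminated names of the child elements.
PERSON_MODEL = re.compile(r"(name )(birthdate )?(gender )?(socialsecuritynumber )?")


def person_is_valid(person):
    """Check the children of a <person> element against the content model."""
    child_names = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(child_names) is not None


good = ET.fromstring(
    "<person><name>Fred Bloggs</name><birthdate>2008-11-27</birthdate>"
    "<gender>Male</gender></person>"
)
no_name = ET.fromstring("<person><gender>Male</gender></person>")
wrong_order = ET.fromstring(
    "<person><name>Fred Bloggs</name><gender>Male</gender>"
    "<birthdate>2008-11-27</birthdate></person>"
)

results = [person_is_valid(p) for p in (good, no_name, wrong_order)]
```

The "," of the sequence list becomes simple concatenation and each "?" keeps its meaning, which is why a missing name or a swapped order is rejected.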
The same DTD can also be embedded directly in the XML document itself as an internal
subset, by surrounding it within [square brackets] in the document type declaration, in
which case the document may no longer depend on other external entities and could be
processed as standalone, like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE people_list [
<!ELEMENT people_list (person)*>
<!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT birthdate (#PCDATA)>
<!ELEMENT gender (#PCDATA)>
<!ELEMENT socialsecuritynumber (#PCDATA)>
]>
<people_list>
<person>
<name>Fred Bloggs</name>
<birthdate>2008-11-27</birthdate>
<gender>Male</gender>
</person>
</people_list>
An alternative to DTD is XML Schema.
6. XML SCHEMA
XML Schema, published as a W3C recommendation in May 2001, is one of several XML
schema languages. It was the first separate schema language for XML to achieve
Recommendation status by the W3C.
Technically, a schema is an abstract collection of metadata, consisting of a set of schema
components: chiefly element and attribute declarations and complex and simple type
definitions. These components are usually created by processing a collection of schema
documents, which contain the source language definitions of these components. In popular
usage, however, a schema document is often referred to as a schema.
Schema documents are organized by namespace: all the named schema components belong
to a target namespace, and the target namespace is a property of the schema document as
a whole. A schema document may include other schema documents for the same
namespace, and may import schema documents for a different namespace.
When an instance document is validated against a schema (a process known as assessment),
the schema to be used for validation can either be supplied as a parameter to the validation
engine, or it can be referenced directly from the instance document using two special
attributes, xsi:schemaLocation and xsi:noNamespaceSchemaLocation. (The latter
mechanism requires the client invoking validation to trust the document sufficiently to
know that it is being validated against the correct schema.)
XML Schema Documents usually have the filename extension ".xsd". A unique Internet
Media Type is not yet registered for XSDs, so "application/xml" or "text/xml" should be
used, as per RFC 3023.
Example
This is an example of a rather simple schema document to describe an address.
<?xml version="1.0" encoding="utf-8"?>
<xs:schema elementFormDefault="qualified"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="Address">
<xs:complexType>
<xs:sequence>
<xs:element name="Recipient" type="xs:string" />
<xs:element name="House" type="xs:string" />
<xs:element name="Street" type="xs:string" />
<xs:element name="Town" type="xs:string" />
<xs:element name="County" type="xs:string" minOccurs="0" />
<xs:element name="PostCode" type="xs:string" />
<xs:element name="Country">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="FR" />
<xs:enumeration value="DE" />
<xs:enumeration value="ES" />
<xs:enumeration value="UK" />
<xs:enumeration value="US" />
</xs:restriction>
</xs:simpleType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
A number of development tools can be used to create a graphical representation of a
schema.
An example of an XML document that conforms to this schema
<?xml version="1.0" encoding="utf-8"?>
<Address xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="SimpleAddress.xsd">
<Recipient>Mr. Walter C. Brown</Recipient>
<House>49</House>
<Street>Featherstone Street</Street>
<Town>LONDON</Town>
<PostCode>EC1Y 8SY</PostCode>
<Country>UK</Country>
</Address>
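To make concrete what "conforms to this schema" means, the rules of the address schema can be written out by hand. This is only a sketch under stated assumptions: Python's standard library has no XML Schema validator, so the sequence order, the optional County element and the Country enumeration are hard-coded here rather than read from the .xsd file, and the function name address_errors is hypothetical. A real system would use a schema-aware validator instead.

```python
import xml.etree.ElementTree as ET

# Rules transcribed by hand from the schema document shown earlier.
SEQUENCE = ["Recipient", "House", "Street", "Town", "County", "PostCode", "Country"]
OPTIONAL = {"County"}                        # minOccurs="0"
COUNTRIES = {"FR", "DE", "ES", "UK", "US"}   # xs:enumeration values


def address_errors(address):
    """Return a list of rule violations for an <Address> element."""
    errors = []
    tags = [child.tag for child in address]
    # Optional elements are expected only if the document actually uses them.
    expected = [t for t in SEQUENCE if t not in OPTIONAL or t in tags]
    if tags != expected:
        errors.append("children must be %s, in that order" % expected)
    country = address.findtext("Country")
    if country not in COUNTRIES:
        errors.append("Country %r is not one of the enumerated values" % country)
    return errors


valid = ET.fromstring(
    "<Address><Recipient>Mr. Walter C. Brown</Recipient><House>49</House>"
    "<Street>Featherstone Street</Street><Town>LONDON</Town>"
    "<PostCode>EC1Y 8SY</PostCode><Country>UK</Country></Address>"
)
bad_country = ET.fromstring(
    "<Address><Recipient>Mr. Walter C. Brown</Recipient><House>49</House>"
    "<Street>Featherstone Street</Street><Town>LONDON</Town>"
    "<PostCode>EC1Y 8SY</PostCode><Country>IN</Country></Address>"
)
```

Calling address_errors(valid) yields an empty list, while address_errors(bad_country) reports the enumeration violation; with a schema-aware validator none of these checks would need to be written in application code.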
7. XPATH
XPath 2.0 is the current version of the XPath language defined by the World Wide Web
Consortium, W3C. It became a recommendation on 23 January 2007.
XPath is used primarily for selecting parts of an XML document. For this purpose the XML
document is modelled as a tree of nodes. XPath allows nodes to be selected by means of a
hierarchic navigation path through the document tree.
XPath 2.0 is used as a sublanguage of XSLT 2.0, and it is also a subset of XQuery 1.0. All three
languages share the same data model, type system, and function library, and were
developed together and published on the same day.
Data model
Every value in XPath 2.0 is a sequence of items. The items may be nodes or atomic values.
An individual node or atomic value is considered to be a sequence of length one. Sequences
may not be nested.
Nodes are of seven kinds, corresponding to different constructs in the syntax of XML:
elements, attributes, text nodes, comments, processing instructions, namespace nodes, and
document nodes.
Type system
The type system of XPath 2.0 is noteworthy for the fact that it mixes strong typing and weak
typing within a single language.
Operations such as arithmetic and boolean comparison require atomic values as their
operands. If an operand returns a node (for example, @price * 1.2), then the node is
automatically atomized to extract the atomic value. If the input document has been
validated against a schema, then the node will typically have a type annotation, and this
determines the type of the resulting atomic value (in this example, the price attribute might
have the type decimal). If no schema is in use, the node will be untyped, and the type of the
resulting atomic value will be untypedAtomic.
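The atomization step described above can be imitated in Python as a rough sketch. The one-element document and its price value are made up for illustration, and Python's ElementTree is not an XPath 2.0 processor: it simply returns attribute values as plain strings, which plays the role of untypedAtomic here, so the conversion to a decimal has to be written by hand.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical one-element document; no schema is attached to it.
item = ET.fromstring('<item price="10.50"/>')

# Without a schema the attribute value is plain text (compare untypedAtomic).
raw_price = item.get("price")

# Atomization by hand: convert the untyped text to a decimal, then do arithmetic.
price = Decimal(raw_price)
result = price * Decimal("1.2")

print(type(raw_price).__name__, result)
```

In XPath 2.0 the expression @price * 1.2 performs the extraction and conversion implicitly; the sketch only makes those two steps visible.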
Path expressions
The location paths of XPath 1.0 are referred to in XPath 2.0 as path expressions. Informally,
a path expression is a sequence of steps separated by the "/" operator, for example a/b/c
(which is short for child::a/child::b/child::c). More formally, however, "/" is simply a binary
operator that applies the expression on its right-hand side to each item in turn selected by
the expression on the left hand side. So in this example, the expression a selects all the
element children of the context node that are named <a>; the expression child::b is then
applied to each of these nodes, selecting all the <b> children of the <a> elements; and the
expression child::c is then applied to each node in this sequence, which selects all the <c>
children of these <b> elements.
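The step-by-step reading of a/b/c given above can be sketched in Python. The sample document and the helper name step are assumptions made for this illustration; ElementTree supports only a small subset of XPath 1.0, but it is enough to compare the stepwise evaluation with a direct path lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two <a> elements, one of which also has an <x> child.
doc = ET.fromstring(
    "<root>"
    "<a><b><c>1</c><c>2</c></b></a>"
    "<a><b><c>3</c></b><x><c>9</c></x></a>"
    "</root>"
)

def step(nodes, name):
    """Apply child::name to each node in turn and collect the results."""
    return [child for node in nodes for child in node if child.tag == name]

# child::a, then child::b on each <a>, then child::c on each <b>.
stepwise = step(step(step([doc], "a"), "b"), "c")

# The same selection written as a single path expression.
direct = doc.findall("a/b/c")

print([c.text for c in stepwise])
print([c.text for c in direct])
```

The <c> element under <x> is not selected by either form, because the second step only follows <b> children.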
The "/" operator is generalized in XPath 2.0 to allow any kind of expression to be used as an
operand. For example, a function call can be used on the right-hand side. The typing rules
for the operator require that the result of the first operand is a sequence of nodes. The right
hand operand can return either nodes or atomic values (but not a mixture). If the result
consists of nodes, then duplicates are eliminated and the nodes are returned in document
order, an ordering defined in terms of the relative positions of the nodes in the original
XML tree.
Other operators available in XPath 2.0 include the following:
Operators                  Effect
+, -, *, div, mod, idiv    Arithmetic on numbers, dates, and durations
=, !=, <, >, <=, >=        General comparison: compare arbitrary sequences. The result is true if any pair of items, one from each sequence, satisfies the comparison
eq, ne, lt, gt, le, ge     Value comparison: compare single items
is                         Compare node identity: true if both operands are the same node
<<, >>                     Compare node position, based on document order
union, intersect, except   Combine sequences of nodes, treating them as sets, returning the set union, intersection, or difference
and, or                    Boolean conjunction and disjunction. Negation is achieved using the not() function
to                         Defines an integer range, for example 1 to 10
instance of                Determines whether a value is an instance of a given type
cast as                    Converts a value to a given type
castable as                Tests whether a value is convertible to a given type
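The difference between general comparison and value comparison in the table above is easy to miss, so here is a small Python sketch of the two rules. The function names are hypothetical, sequences are modelled as tuples, and None stands in for the empty sequence; type promotion and atomization are ignored.

```python
from itertools import product

def general_eq(left, right):
    """XPath '=': true if any pair of items, one from each sequence, is equal."""
    return any(a == b for a, b in product(left, right))

def general_ne(left, right):
    """XPath '!=': true if any pair of items, one from each sequence, differs."""
    return any(a != b for a, b in product(left, right))

def value_eq(left, right):
    """XPath 'eq': each operand must be a single item (or empty)."""
    if not left or not right:
        return None  # stands in for the empty sequence
    if len(left) > 1 or len(right) > 1:
        raise TypeError("eq requires at most one item on each side")
    return left[0] == right[0]

# Both '=' and '!=' can hold for the same pair of sequences.
print(general_eq((1, 2), (2, 3)), general_ne((1, 2), (2, 3)))
print(value_eq((2,), (2,)))
```

One consequence worth noting is that not(A = B) and A != B are not the same test when the operands contain more than one item.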
XPath 2.0 also offers a for expression, which is a small subset of the FLWOR expression from
XQuery. The expression for $x in X return Y evaluates the expression Y for each value in the
result of expression X in turn, referring to that value using the variable reference $x.
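As a loose analogy, the for expression behaves much like a Python list comprehension. The helper xpath_range is a made-up name that mirrors the "to" operator, whose upper bound is inclusive; the XPath expressions being imitated appear as comments.

```python
def xpath_range(start, end):
    """Integer range 'start to end'; both ends are inclusive in XPath."""
    return list(range(start, end + 1))

# for $x in 1 to 5 return $x * $x
squares = [x * x for x in xpath_range(1, 5)]

# Sequences are never nested, so when Y yields several items per iteration
# the results are flattened into one sequence.
# for $x in 1 to 3 return ($x, $x * 10)
flattened = [item for x in xpath_range(1, 3) for item in (x, x * 10)]

print(squares)
print(flattened)
```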
The functions available include the following:
Purpose                   Example functions
General string handling   lower-case, upper-case, substring, substring-before, substring-after, translate, starts-with, ends-with, contains, string-length, concat, normalize-space, normalize-unicode
Regular expressions       matches, replace, tokenize
Arithmetic                count, sum, avg, min, max, round, floor, ceiling, abs
Dates and times           adjust-dateTime-to-timezone, current-dateTime, day-from-dateTime, month-from-dateTime, days-from-duration, months-from-duration, etc.
Properties of nodes       name, node-name, local-name, namespace-uri, base-uri, nilled
Document handling         doc, doc-available, document-uri, collection, id, idref
URIs                      encode-for-uri, escape-html-uri, iri-to-uri, resolve-uri
QNames                    QName, namespace-uri-from-QName, prefix-from-QName, resolve-QName
Sequences                 insert-before, remove, subsequence, index-of, distinct-values, reverse, unordered, empty, exists
Type checking             one-or-more, exactly-one, zero-or-one
8. XQuery
XQuery is a query and functional programming language that is designed to query
collections of XML data. The mission of the XML Query project is to provide flexible query
facilities to extract data from real and virtual documents on the World Wide Web, thereby
providing the needed interaction between the Web world and the database world.
Ultimately, collections of XML files will be accessed like databases.
XQuery provides the means to extract and manipulate data from XML documents or any
data source that can be viewed as XML, such as relational databases or office documents.
XQuery uses XPath expression syntax to address specific parts of an XML document. It
supplements this with a SQL-like "FLWOR expression" for performing joins. A FLWOR
expression is constructed from the five clauses after which it is named: FOR, LET, WHERE,
ORDER BY, and RETURN. XQuery 1.0 does not include features for updating XML documents or
databases; it also lacks full-text search capability. Both features are under active
development for a subsequent version of the language.
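To show how the five FLWOR clauses fit together, the following Python sketch walks through them one by one. The catalogue document, the function name expensive_titles and the price threshold are all invented for this illustration; an XQuery processor would express the same logic in a single FLWOR expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue used only for this illustration.
CATALOGUE = """
<books>
  <book><title>XML Basics</title><price>30</price></book>
  <book><title>Advanced XQuery</title><price>55</price></book>
  <book><title>Databases</title><price>45</price></book>
</books>
"""

def expensive_titles(xml_text, threshold):
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.findall("book"):              # FOR: iterate over the books
        price = float(book.findtext("price"))      # LET: bind a derived value
        if price > threshold:                      # WHERE: filter the tuples
            rows.append((price, book.findtext("title")))
    rows.sort(key=lambda row: row[0])              # ORDER BY: sort by price
    return [title for _, title in rows]            # RETURN: build the result

print(expensive_titles(CATALOGUE, 40))
```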
XQuery is a programming language that can express arbitrary XML to XML data
transformations with the following features:
1. Logical/physical data independence
2. Declarative
3. High level
4. Side-effect free
5. Strongly typed.
Examples
The sample XQuery code below lists the unique speakers in each act of Shakespeare's play
Hamlet, encoded in hamlet.xml:
<html><head/><body>
{
for $act in doc("hamlet.xml")//ACT
let $speakers := distinct-values($act//SPEAKER)
return
<div>
<h1>{ string($act/TITLE) }</h1>
<ul>
{
for $speaker in $speakers
return <li>{ $speaker }</li>
}
</ul>
</div>
}
</body></html>
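For readers without an XQuery processor at hand, the grouping performed by the query above can be approximated in Python. This is only a sketch: SAMPLE is a tiny made-up stand-in for hamlet.xml that reuses the element names from the query, and the result is a list of tuples rather than HTML.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for hamlet.xml; element names follow the query above.
SAMPLE = """
<PLAY>
  <ACT><TITLE>ACT I</TITLE>
    <SPEECH><SPEAKER>BERNARDO</SPEAKER></SPEECH>
    <SPEECH><SPEAKER>FRANCISCO</SPEAKER></SPEECH>
    <SPEECH><SPEAKER>BERNARDO</SPEAKER></SPEECH>
  </ACT>
  <ACT><TITLE>ACT II</TITLE>
    <SPEECH><SPEAKER>LORD POLONIUS</SPEAKER></SPEECH>
  </ACT>
</PLAY>
"""

def speakers_by_act(play):
    listing = []
    for act in play.iter("ACT"):                          # for $act in //ACT
        names = (s.text for s in act.iter("SPEAKER"))     # $act//SPEAKER
        unique = list(dict.fromkeys(names))               # distinct-values(...)
        listing.append((act.findtext("TITLE"), unique))
    return listing

listing = speakers_by_act(ET.fromstring(SAMPLE))
for title, speakers in listing:
    print(title, speakers)
```

The sketch keeps speakers in order of first appearance, whereas the order returned by distinct-values in XQuery is implementation-dependent.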
9. XSLT
XSLT (Extensible Stylesheet Language Transformations) is a declarative, XML-based language
used for the transformation of XML documents into other XML documents. The original
document is not changed; rather, a new document is created based on the content of an
existing one.[2] The new document may be serialized (output) by the processor in standard
XML syntax or in another format, such as HTML or plain text.[3] XSLT is often used to
convert XML data into HTML or XHTML documents for display as a web page: the
transformation may happen dynamically either on the client or on the server, or it may be
done as part of the publishing process. It is also used to create output for printing or direct
video display, typically by transforming the original XML into XSL Formatting Objects to
create formatted output which can then be converted to a variety of formats, a few of
which are PDF, PostScript, AWT and PNG. XSLT is also used to translate XML messages
between different XML schemas, or to make changes to documents within the scope of a
single schema, for example by removing the parts of a message that are not needed.
XSLT examples
Sample of incoming XML document
<?xml version="1.0" ?>
<persons>
<person username="JS1">
<name>John</name>
<family-name>Smith</family-name>
</person>
<person username="MI1">
<name>Morka</name>
<family-name>Ismincius</family-name>
</person>
</persons>
Example 1 (transforming XML)
This XSLT stylesheet provides templates to transform the XML document:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/persons">
<root>
<xsl:apply-templates select="person"/>
</root>
</xsl:template>
<xsl:template match="person">
<name username="{@username}">
<xsl:value-of select="name" />
</name>
</xsl:template>
</xsl:stylesheet>
Its evaluation results in a new XML document, having another structure:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<name username="JS1">John</name>
<name username="MI1">Morka</name>
</root>
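The Python standard library has no XSLT processor, so the sketch below does not run the stylesheet. Instead it mimics by hand what the two templates of Example 1 do, which may help clarify how a source tree is mapped to a new result tree while the input stays unchanged. The function name transform is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# The incoming document from the XSLT examples above.
SOURCE = """<persons>
  <person username="JS1">
    <name>John</name>
    <family-name>Smith</family-name>
  </person>
  <person username="MI1">
    <name>Morka</name>
    <family-name>Ismincius</family-name>
  </person>
</persons>"""

def transform(persons):
    # Template for /persons: emit <root> and process each <person> child.
    root = ET.Element("root")
    for person in persons.findall("person"):
        # Template for person: emit <name username="..."> with the name text.
        name = ET.SubElement(root, "name", username=person.get("username"))
        name.text = person.findtext("name")
    return root

result = ET.tostring(transform(ET.fromstring(SOURCE)), encoding="unicode")
print(result)
```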
Example 2 (transforming XML to XHTML)
Processing the following example XSLT file
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/1999/xhtml">
<xsl:output method="xml" indent="yes" encoding="UTF-8"/>
<xsl:template match="/persons">
<html>
<head> <title>Testing XML Example</title> </head>
<body>
<h1>Persons</h1>
<ul>
<xsl:apply-templates select="person">
<xsl:sort select="family-name" />
</xsl:apply-templates>
</ul>
</body>
</html>
</xsl:template>
<xsl:template match="person">
<li>
<xsl:value-of select="family-name"/><xsl:text>, </xsl:text>
<xsl:value-of select="name"/>
</li>
</xsl:template>
</xsl:stylesheet>
with the XML input file shown above results in the following XHTML
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head> <title>Testing XML Example</title> </head>
<body>
<h1>Persons</h1>
<ul>
<li>Ismincius, Morka</li>
<li>Smith, John</li>
</ul>
</body>
</html>
This XHTML generates the output below when rendered in a web browser.
[Figure: rendered XHTML generated from an XML input file and an XSLT transformation.]
10. XML Parser
A parser is a piece of software that takes a physical representation of some data and
converts it into an in-memory form for the program as a whole to use. Parsers are used
everywhere in software. An XML parser is a parser designed to read XML and make its
content available to programs. There are different types, each with its own advantages. Unless
a program simply and blindly copies the whole XML file as a unit, it must
implement or call on an XML parser.
The two main types of parser are SAX and DOM.
What is SAX?
SAX stands for Simple API for XML. Its main characteristic is that as it reads each unit of
XML, it creates an event that the calling program can use. This allows the calling program to
ignore the bits it doesn't care about, and just keep or use what it likes. The disadvantage is
that the calling program must keep track of everything it might ever need. SAX is often used
in certain high-performance applications or areas where the size of the XML might exceed
the memory available to the running program.
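A minimal sketch of the event-driven style, using Python's xml.sax module, is given below. The handler class NameCollector is a made-up name; it reacts only to <name> elements and ignores everything else, and it has to keep its own state (a flag and a buffer) because the parser does not build a tree.

```python
import xml.sax

SOURCE = """<persons>
  <person username="JS1"><name>John</name><family-name>Smith</family-name></person>
  <person username="MI1"><name>Morka</name><family-name>Ismincius</family-name></person>
</persons>"""

class NameCollector(xml.sax.ContentHandler):
    """Collects the text of every <name> element as parse events arrive."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is buffered until the end tag.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._buffer))
            self._in_name = False

handler = NameCollector()
xml.sax.parseString(SOURCE.encode("utf-8"), handler)
print(handler.names)
```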
What's a DOM?
DOM stands for Document Object Model. It differs from SAX in that it builds the entire XML
document representation in memory and then hands the calling program the whole chunk
of memory. DOM can be very memory intensive; by the time you figure in the overhead for
managing the relationships of the nodes, you might be talking 4× to 8× the size of the
original document in memory usage.
DOM has been widely criticized for being too complicated; it tries to maintain the same
programming interface in whatever language it is implemented, even if that violates some
of the conventions of that language. This has led to some DOM-like implementations that
are more in keeping with the philosophy of the host language.
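For comparison with SAX, the sketch below uses Python's xml.dom.minidom, a lightweight DOM implementation in the standard library. The whole document is parsed into a tree first, after which the program can navigate it freely. Method names such as getElementsByTagName and getAttribute come from the DOM interface itself, which illustrates the language-neutral naming mentioned above.

```python
from xml.dom import minidom

SOURCE = """<persons>
  <person username="JS1"><name>John</name><family-name>Smith</family-name></person>
  <person username="MI1"><name>Morka</name><family-name>Ismincius</family-name></person>
</persons>"""

# The entire document is loaded into memory as a tree of node objects.
dom = minidom.parseString(SOURCE)

# Random access: any part of the tree can be visited in any order.
persons = dom.getElementsByTagName("person")
usernames = [person.getAttribute("username") for person in persons]
first_family = persons[0].getElementsByTagName("family-name")[0].firstChild.data

print(usernames, first_family)
```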
11. XML database
An XML database is a data persistence software system that allows data to be stored in XML
format. This data can then be queried, exported and serialized into the desired format.
Two major classes of XML database exist:
1. XML-enabled: these map all XML to a traditional database (such as a relational
database), accepting XML as input and rendering XML as output. This term implies that the
database does the conversion itself (as opposed to relying on middleware).
2. Native XML (NXD): the internal model of such databases depends on XML and uses XML
documents as the fundamental unit of storage, which are, however, not necessarily stored
in the form of text files.
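The XML-enabled approach can be sketched with Python's sqlite3 module: XML is accepted as input, mapped ("shredded") to relational rows, and rendered as XML again on output. The table layout and the function names shred and publish are assumptions made for this illustration and do not describe any particular product.

```python
import sqlite3
import xml.etree.ElementTree as ET

SOURCE = """<persons>
  <person username="JS1"><name>John</name><family-name>Smith</family-name></person>
  <person username="MI1"><name>Morka</name><family-name>Ismincius</family-name></person>
</persons>"""

def shred(conn, persons_xml):
    """Accept XML as input and map it to rows of a relational table."""
    conn.execute(
        "CREATE TABLE person (username TEXT PRIMARY KEY, name TEXT, family_name TEXT)"
    )
    root = ET.fromstring(persons_xml)
    rows = [
        (p.get("username"), p.findtext("name"), p.findtext("family-name"))
        for p in root.findall("person")
    ]
    conn.executemany("INSERT INTO person VALUES (?, ?, ?)", rows)

def publish(conn):
    """Render the relational rows as XML output."""
    root = ET.Element("persons")
    query = "SELECT username, name, family_name FROM person ORDER BY username"
    for username, name, family_name in conn.execute(query):
        person = ET.SubElement(root, "person", username=username)
        ET.SubElement(person, "name").text = name
        ET.SubElement(person, "family-name").text = family_name
    return ET.tostring(root, encoding="unicode")

conn = sqlite3.connect(":memory:")
shred(conn, SOURCE)
published = publish(conn)
print(published)
```

Note that the round trip preserves the data but not the original document as such (whitespace and element order within the file are rebuilt), which is one practical difference from a native XML database.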
Rationale for XML in databases
O'Connell gives one reason for the use of XML in databases: the increasingly common use of
XML for data transport, which has meant that "data is extracted from databases and put
into XML documents and vice-versa". It may prove more efficient (in terms of conversion
costs) and easier to store the data in XML format.
Native XML databases
The term "native XML database" (NXD) can lead to confusion. Many NXDs do not function as
standalone databases at all, and do not really store the native (text) form.
The formal definition from the XML:DB initiative (which appears to have been inactive since 2003)
states that a native XML database:
* Defines a (logical) model for an XML document, as opposed to the data in that
document, and stores and retrieves documents according to that model. At a minimum,
the model must include elements, attributes, PCDATA, and document order. Examples of
such models include the XPath data model, the XML Infoset, and the models implied by the
DOM and the events in SAX 1.0.
* Has an XML document as its fundamental unit of (logical) storage, just as a relational
database has a row in a table as its fundamental unit of (logical) storage.
* Need not have any particular underlying physical storage model. For example, NXDs can
use relational, hierarchical, or object-oriented database structures, or use a proprietary
storage format (such as indexed, compressed files).
Additionally, many XML databases provide a logical model of grouping documents, called
"collections". Databases can set up and manage many collections at one time. In some
implementations, a hierarchy of collections can exist, much in the same way that an
operating system's directory-structure works.
All XML databases now support at least one form of querying syntax. Minimally, just about
all of them support XPath for performing queries against documents or collections of
documents. XPath provides a simple pathing system that allows users to identify nodes that
match a particular set of criteria.
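The ideas of collections and path-based querying can be illustrated with a toy in-memory store. Everything here is hypothetical: the class ToyXmlStore, the collection names and the document ids are invented, and the queries use the limited XPath subset supported by Python's ElementTree rather than a full XPath engine.

```python
import xml.etree.ElementTree as ET

class ToyXmlStore:
    """Toy document store: collections are named like directory paths."""

    def __init__(self):
        self._collections = {}

    def put(self, collection, doc_id, xml_text):
        # The parsed document is the unit of storage.
        self._collections.setdefault(collection, {})[doc_id] = ET.fromstring(xml_text)

    def query(self, collection, path):
        """Run a path query over a collection and all of its sub-collections."""
        prefix = collection.rstrip("/") + "/"
        hits = []
        for name, docs in self._collections.items():
            if name == collection or name.startswith(prefix):
                for doc_id, root in docs.items():
                    hits.extend((doc_id, node) for node in root.findall(path))
        return hits

store = ToyXmlStore()
store.put("/staff/london", "a.xml",
          "<persons><person username='JS1'><name>John</name></person></persons>")
store.put("/staff/paris", "b.xml",
          "<persons><person username='MI1'><name>Morka</name></person></persons>")
store.put("/archive", "c.xml",
          "<persons><person username='OLD'><name>Old</name></person></persons>")

all_staff = [node.text for _, node in store.query("/staff", ".//person/name")]
one_person = store.query("/staff", ".//person[@username='MI1']/name")
print(all_staff, [(doc_id, node.text) for doc_id, node in one_person])
```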
In addition to XPath, many XML databases support XSLT as a method of transforming
documents or query-results retrieved from the database. XSLT provides a declarative
language written using an XML grammar. It defines templates, selected by XPath patterns, that
transform documents (in part or in whole) into other formats, including plain text, XML, or
HTML.
Many XML databases also support XQuery to perform querying. XQuery includes XPath as a
node-selection method, but extends XPath to provide transformational capabilities. Users
sometimes refer to its syntax as "FLWOR" (pronounced 'Flower') because the query may
include the following clauses: 'for', 'let', 'where', 'order by' and 'return'. Traditional RDBMS
vendors, which previously shipped SQL-only engines, now ship hybrid SQL and
XQuery engines. Hybrid SQL/XQuery engines can query XML data alongside
relational data in the same query expression, which helps in combining relational
and XML data.
Some XML databases support an API called the XML:DB API (or XAPI) as a form of
implementation-independent access to the XML data store. In XML databases, XAPI
resembles ODBC and JDBC as used with relational databases. On 24 June 2009, the
Java Community Process released the final version of the XQuery API for Java specification
(XQJ), "a common API that allows an application to submit queries conforming to the W3C
XQuery 1.0 specification and to process the results of such queries".
Databases known to support common programming standards such as the XQuery API:
XML database        Language
BaseX               Java
eXist               Java
MarkLogic Server    C++
MonetDB/XQuery      C++
12. SOAP
SOAP, originally defined as Simple Object Access Protocol, is a protocol specification for
exchanging structured information in the implementation of Web Services in computer
networks. It relies on Extensible Markup Language (XML) for its message format, and usually
relies on other Application Layer protocols, most notably Remote Procedure Call (RPC) and
Hypertext Transfer Protocol (HTTP), for message negotiation and transmission. SOAP can
form the foundation layer of a web services protocol stack, providing a basic messaging
framework upon which web services can be built. This XML-based protocol consists of three
parts: an envelope, which defines what is in the message and how to process it, a set of
encoding rules for expressing instances of application-defined datatypes, and a convention
for representing procedure calls and responses.
As a layman's example of how SOAP procedures can be used, a SOAP message could be sent
to a web-service-enabled web site, for example, a real-estate price database, with the
parameters needed for a search. The site would then return an XML-formatted document
with the resulting data, e.g., prices, location, features. Because the data is returned in a
standardized machine-parseable format, it could then be integrated directly into a third-
party web site or application.
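To make the real-estate example more concrete, the sketch below builds a SOAP 1.1 request envelope in Python. The envelope namespace is the standard SOAP 1.1 one; the application namespace, the operation name GetPrices and its Town and MaxPrice parameters are purely hypothetical and do not belong to any real service. Sending the message (typically as the body of an HTTP POST) is left out.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "http://example.com/realestate"                # hypothetical service namespace

def build_request(town, max_price):
    """Wrap the search parameters in an Envelope/Body pair."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The operation element and its children are application-defined.
    operation = ET.SubElement(body, f"{{{APP_NS}}}GetPrices")
    ET.SubElement(operation, f"{{{APP_NS}}}Town").text = town
    ET.SubElement(operation, f"{{{APP_NS}}}MaxPrice").text = str(max_price)
    return ET.tostring(envelope, encoding="unicode")

message = build_request("LONDON", 250000)
print(message)
```

The response from such a service would be another envelope whose body carries the resulting data, which the caller can parse with any XML parser.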
Advantages
* SOAP is versatile enough to allow for the use of different transport protocols. The
standard stacks use HTTP as a transport protocol, but other protocols are also usable (e.g.,
JMS, SMTP).
* Because SOAP messages fit the HTTP request/response model, they pass easily
through existing firewalls and proxies without modifications to the SOAP protocol, and can use
the existing infrastructure.
Disadvantages
* Because of the verbose XML format, SOAP can be considerably slower than competing
middleware technologies such as CORBA. This may not be an issue when only small
messages are sent.[7] To improve performance for the special case of XML with embedded
binary objects, the Message Transmission Optimization Mechanism was introduced.
* When relying on HTTP as a transport protocol and not using WS-Addressing or an ESB,
the roles of the interacting parties are fixed. Only one party (the client) can use the services
of the other. Developers must use polling instead of notification in these common cases.
13. Conclusion
XML databases are rapidly becoming a de facto standard for transferring data over the
internet. Because of their ability to store heterogeneous data, they are also used in a wide
variety of fields such as e-publishing, digital libraries, and the finance industry.
The XML standard is evolving continuously, and new XML-based standards are being created for
almost every field. Its open-source base and simplicity have made it one of the best tools for passing
information. We live in an era of information technology, and XML databases are the best
vehicle we can ride on.
14. References
Books:
Elmasri and Navathe, Database Systems
XML: The Complete Reference, TMH Publications
Websites:
http://en.wikipedia.org/wiki/XML_database
http://www.cfoster.net/articles/xmldb-business-case/
http://www.25hoursaday.com/StoringAndQueryingXML.html
http://www.stylusstudio.com/db_to_xml_mapper.html
http://www.wisegeek.com/what-is-an-xml-database.htm
http://en.wikipedia.org/wiki/SOAP