XML: Data Driving Business?

Laks V.S. Lakshmanan,

IIT Bombay and Concordia University

XML : Data Model

• What is an XML Document

– Linearization of a tree structure

– Every node of the tree can have several character strings associated

– Info content of the document is the tree structure together with the character strings
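
The tree view can be sketched with Python's standard `xml.etree.ElementTree` parser. The sample document and the helper name `walk` are assumptions made for illustration; the point is only that parsing recovers a tree whose nodes carry character strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, used only for illustration.
doc = "<book year='1999'><title>Data on the Web</title><author>Abiteboul</author></book>"

def walk(node, depth=0):
    """Yield (depth, tag, text) for every element in document order."""
    yield depth, node.tag, (node.text or "").strip()
    for child in node:
        yield from walk(child, depth + 1)

root = ET.fromstring(doc)
rows = list(walk(root))
for depth, tag, text in rows:
    # Indentation mirrors the nesting that the linear text encodes.
    print("  " * depth + tag, repr(text))
```

The linear text and the list of `(depth, tag, text)` rows carry the same information, which is what "linearization of a tree" means here.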

Is XML just a syntax for data interchange and serialization?

XML: Data Model

Types of nodes

• Element Eg. <p a1="A1" . . . an="An">c1 . . . cm</p>

• Document Eg. <!DOCTYPE name [markup declarations]>

• Processing instruction Eg. <?xml version="1.0"?>

• Comment Eg. <!-- ... -->

• Atomic data Eg. <Data>

What is a DTD?

• A Document Type Definition (DTD) serves as a grammar

• A document type definition specifies:

– the elements that are permissible in a document of this type

– for each element, the possible attributes, their range of values and defaults

– for each element, the structure of its contents, including:

• which element can occur and in what order

• whether text characters can occur

Example of a DTD

Eg:

<!DOCTYPE Bookslist [
  <!ELEMENT Bookslist (book)*>
  <!ELEMENT book (title, author*, publisher)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
]>
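
A DTD content model is a regular expression over child element names, so conformance can be sketched with ordinary regular expressions. The encoding below (child tags joined into a space-terminated string) and the names `CONTENT_MODELS` and `conforms` are assumptions for illustration; this is not a full DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# Content models of the Bookslist DTD, rewritten as regular expressions
# over a space-terminated sequence of child tag names (illustrative encoding).
CONTENT_MODELS = {
    "Bookslist": r"(book )*",
    "book": r"title (author )*publisher ",
}

def conforms(elem):
    """Check elem and its descendants against CONTENT_MODELS.

    Elements without an entry are treated as #PCDATA: no child elements allowed.
    """
    children = "".join(child.tag + " " for child in elem)
    model = CONTENT_MODELS.get(elem.tag, r"")
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in elem)

good = ET.fromstring(
    "<Bookslist><book><title>T</title><author>A</author>"
    "<publisher>P</publisher></book></Bookslist>"
)
# Wrong order and missing publisher: violates (title, author*, publisher).
bad = ET.fromstring("<Bookslist><book><author>A</author><title>T</title></book></Bookslist>")
print(conforms(good), conforms(bad))
```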

XML and DTD

• Well-formed documents

– Tags should be nested properly and attributes should be unique.

• Valid documents

– Well-formed documents that conform to a Document Type Definition (DTD)

• DTDs are used to

– Constrain structure

– Declare entities

– Provide some default values for attributes

DTD Limitations

• Too document oriented

• Too simple and too complicated at the same time

• Too limited to represent complex structures

• IDREFs are not typed

• No notion of inheritance/sub-typing

• Too many ways to represent the same thing

• Names are global, not local

DTD vs. Database Schema

• Order is of significance in DTD and not in DB

• DTD does not provide for data types

• DTD cannot specify keys

XMLSchema

• Why XMLSchema

– Based on XML syntax

– Can be parsed and manipulated like any XML document

– Supports a variety of data types

– Allows extension of vocabularies and inheritance from elements

– Provides namespace integration

– Provides logical grouping of attributes

XMLSchema: An example

<datatype name="PriceType">
  <basetype name="decimal"/>
  <minExclusive>0.00</minExclusive>
  <scale>2</scale>
</datatype>

<element name="price" type="PriceType"></element>

<element name='Person'> ... </element>

<element name='Employee'>

<refines name='Person'/> ...

</element>
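
What the PriceType facets constrain can be sketched as a small check. The function name `is_valid_price` is hypothetical, and `scale` is read here as "at most two fractional digits", which is an assumption about the draft syntax shown above.

```python
from decimal import Decimal, InvalidOperation

def is_valid_price(text):
    """Return True if text is a decimal > 0.00 with at most 2 fractional digits."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False                      # not a decimal at all (basetype check)
    if not value.is_finite():
        return False                      # reject NaN and Infinity
    fractional_digits = -value.as_tuple().exponent
    return value > Decimal("0.00") and fractional_digits <= 2   # minExclusive, scale

for candidate in ["12.50", "0.00", "3.999", "abc"]:
    print(candidate, is_valid_price(candidate))
```

A DTD could only declare `price` as character data; checks of this kind are what typed schemas add.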

XMLSchema vs. DTD

                        DTD             XMLSchema
Syntax                  Specialized     Same as XML
Compactness             Compact         Verbose
Data types              Strings         Variety of types
Data model              Closed          Open
Namespace integration   Primitive       Full fledged
Attribute grouping      Not supported   Supported

XML Data

• Superset of XMLSchema

• Can also express database relationships

• Eg:

<elementType id="booktable">
  <element id="titleID" type="#title"/>
  <element type="#author"/>
  <element type="#pages"/>
  <key id="bookkey"> <keyPart href="#titleID"/> </key>
</elementType>

Semistructured data

• Data that is neither raw nor very strictly typed like in databases

• Examples of semistructured data

– HTML file with one entry per restaurant that provides info on prices, addresses, styles

– BibTeX files

– Genome and scientific databases

– Online documentation

Semistructured data: Main aspects

• Structure

– Irregular

– Implicit

– Partial

• Schema

– Very large

– Rapidly evolving

– Distinction between data and schema is blurred

Semistructured data: Data model

• Object Exchange Model (OEM)

– Lightweight and flexible

– Data representation

• As a graph with objects as vertices and labels on edges

• Each object has a unique object identifier

• Some objects are atomic, e.g., integer, real,…

• Complex objects have value as set of object references
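
An OEM database can be sketched as a dictionary from object identifiers to values. The identifiers, the sample values and the helper `follow` are assumptions for illustration; a list of labelled edges stands in for the set of object references.

```python
# Each object id maps either to an atomic value or, for a complex object,
# to a list of (edge label, target object id) pairs.
oem = {
    "&o1": [("paper", "&o2")],
    "&o2": [("title", "&o3"), ("author", "&o4"), ("year", "&o6")],
    "&o3": "The Calculus",
    "&o4": "A. Author",      # hypothetical value
    "&o6": 1986,
}

def follow(db, start, path):
    """Return the ids reached from start by following the edge labels in path."""
    frontier = [start]
    for label in path:
        frontier = [
            dst
            for oid in frontier
            if isinstance(db[oid], list)      # only complex objects have edges
            for (lbl, dst) in db[oid]
            if lbl == label
        ]
    return frontier

print(follow(oem, "&o1", ["paper", "title"]))
```

Following a label path in this way is the basic operation behind path expressions in OEM query languages.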

OEM: An example

Semistructured data: Query Languages

• Lorel

– Based on OQL

– Eg.,

select author:X
from biblio.book.author X

• Computes the set of book authors

• Forms a new node and connects it, with edges labelled author, to the nodes resulting from evaluation of the path expression

Lorel: Salient features

• Coercion

– Forces comparison operators to handle comparisons between objects of different types, such as a string and an integer

– Eg.

select row:X
from biblio.paper X
where X.year=1998

==> Year could have been stored as a string or as an integer
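
The effect of coercion on the `X.year=1998` comparison can be sketched as follows. The function `coerce_equal` and the sample records are assumptions for illustration and do not reproduce Lorel's actual coercion rules.

```python
def coerce_equal(left, right):
    """Compare two atomic values, trying a numeric interpretation first."""
    try:
        return float(left) == float(right)
    except (TypeError, ValueError):
        # At least one side is not numeric: fall back to string comparison.
        return str(left) == str(right)

# Hypothetical papers whose year is stored with different types.
papers = [
    {"title": "A", "year": 1998},
    {"title": "B", "year": "1998"},
    {"title": "C", "year": "n/a"},
]
rows = [p for p in papers if coerce_equal(p["year"], 1998)]
print([p["title"] for p in rows])
```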

Lorel: Salient Features

• Path expressions

– Data model allows arbitrary nesting

– Queries should hence be able to probe arbitrary depth

– Provided by path expressions

– Eg.

select title:t
from chapter(.section)* s, s.title t
where t like "*XML*"
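
The path expression `chapter(.section)*` matches the chapter itself and sections nested at any depth. A minimal sketch over nested dictionaries follows; the data layout and the helper `sections` are assumptions for illustration.

```python
def sections(node):
    """Yield node itself and every section nested below it, at any depth."""
    yield node
    for child in node.get("section", []):
        yield from sections(child)

# Hypothetical chapter with two levels of nested sections.
chapter = {
    "title": "Intro to XML",
    "section": [
        {"title": "Syntax", "section": [{"title": "XML namespaces"}]},
        {"title": "History"},
    ],
}

# select title:t from chapter(.section)* s, s.title t where t like "*XML*"
titles = [s["title"] for s in sections(chapter) if "XML" in s.get("title", "")]
print(titles)
```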

UnQL

• Based on Edge labeled Graph Model

• Coercion not supported

– More precise knowledge of data needed

• Pattern Usage

– Eg.

select title: X
where {biblio: {paper: {title: X, year: Y}}} in db, Y>1998

UnQL

• Path variables

– Can use a path as data too

– Eg.

select @P
from db1 @P.X
where matches(".*(U|u)biquitin.*", X)

==> Determines where the string "ubiquitin" appears in db1

Semistructured vs. XML

• Both are schema-less, self-describing

• XML is ordered and semistructured data is not

• XML can mix text and elements

• XML has lots of other stuff: entities, processing instructions, comments

Requirements of an XML Query Language

• XML Output

• Server-side processing

• Query operations

– Selection, Extraction, Reduction, Restructuring, Combination

• No schema required

• Exploit available schema

• Preserve order and association

• Programmatic Manipulation

Requirements of an XML Query Language

• XML representation

• Mutual embedding with XML

• XLink and XPointer cognizant

• Support for new data types

• Suitable for metadata

XML Query Languages

• XQL

• XML-QL

• Quilt

XQL

• Simple expressions

– //product[@maker='BSA'] : All products with attribute maker 'BSA'

• Filters

– author/address[@type='email'] : Address nodes with attribute type as email

• Subscripts

– section[1,3 to 5] : Nodes with position 1, 3, 4, 5
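
The attribute filter has a close analogue in the limited XPath subset of Python's `xml.etree.ElementTree`, which can serve as a sketch of how such an expression is evaluated. The catalog document is hypothetical, and ElementTree is not an XQL engine: the range subscript `[1,3 to 5]` has no direct equivalent there.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog with products at different nesting depths.
catalog = ET.fromstring(
    "<catalog>"
    "<product maker='BSA'><name>Bike</name></product>"
    "<product maker='Acme'><name>Anvil</name></product>"
    "<group><product maker='BSA'><name>Helmet</name></product></group>"
    "</catalog>"
)

# Analogue of //product[@maker='BSA']: descendants filtered on an attribute value.
names = [p.findtext("name") for p in catalog.findall(".//product[@maker='BSA']")]
print(names)
```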

XQL

• Supports boolean and set operators

– q1 and q2

– q1 union q2

• Grouping

– //invoice{q1} : Using invoice groups the results of q1

• Sequence

– a before b

• Others : node(), text(), ...

XQL: Limitations

• Flattening

– Not possible, as the results of patterns and filters are not modeled by an intermediate relation

• Restructuring

– As flattening is not permitted, results cannot be restructured

• Tag variables

– Not supported

• Sorting



XML-QL

• Simple examples

WHERE <book>
        <publisher> <name>Addison-Wesley</name> </publisher>
        <title> $t</title>
        <author> $a</author>
      </book> IN "www.a.b.c/bib.xml"
CONSTRUCT <result>
            <author>$a</author>
            <title>$t</title>
          </result>
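
The WHERE/CONSTRUCT pair can be read as "bind variables by matching a pattern, then build one output element per binding". A rough Python rendering of that reading follows; the inline bibliography stands in for the document at "www.a.b.c/bib.xml" and is an assumption, as is the `results` wrapper element.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the queried bibliography document.
bib = ET.fromstring("""
<bib>
  <book><publisher><name>Addison-Wesley</name></publisher>
        <title>T1</title><author>A1</author><author>A2</author></book>
  <book><publisher><name>Other</name></publisher>
        <title>T2</title><author>A3</author></book>
</bib>""")

results = ET.Element("results")
for book in bib.findall("book"):
    # WHERE: the pattern requires a publisher named Addison-Wesley.
    if book.findtext("publisher/name") != "Addison-Wesley":
        continue
    # Each (title, author) combination is one binding of ($t, $a).
    for title in book.findall("title"):
        for author in book.findall("author"):
            # CONSTRUCT: one <result> per binding.
            result = ET.SubElement(results, "result")
            ET.SubElement(result, "author").text = author.text
            ET.SubElement(result, "title").text = title.text

pairs = [(r.findtext("author"), r.findtext("title")) for r in results]
print(pairs)
```

Note that a book with two authors yields two flat results, which is the behaviour the grouping form of the query is meant to avoid.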

XML-QL

• Grouping

WHERE <book> $p </> IN "www.a.b.c/bib.xml",
      <title> $t </>,
      <publisher> <name>Addison-Wesley</> </publisher> IN $p
CONSTRUCT <result>
            <title> $t </>
            WHERE <author> $a </> IN $p
            CONSTRUCT <author> $a </>
          </>

==> Groups by title.

XML-QL

• Tag variables

WHERE <$p>
        <title> $t </title>
        <year>1995 </>
        <$e> Smith </>
      </> IN "www.a.b.c/bib.xml",
      $e IN {author, editor}
CONSTRUCT <$p>
            <title> $t </title>
            <$e> Smith </>
          </>

==> List of books where Smith could be either author or editor

XML-QL

• Regular Path Expressions

WHERE <part*>
        <name>$r</>
        <brand>Ford</>
      </> IN "www.a.b.c/bib.xml"
CONSTRUCT <result>$r</>

==> Gets list of names of parts irrespective of the nesting of parts in the document.

XML-QL

• Skolem functions

WHERE <$>
        <author> <firstname> $fn </> <lastname> $ln </> </>
        <title> $t </>
      </> IN "www.a.b.c/bib.xml"
CONSTRUCT <person ID=PersonID($fn, $ln)>
            <firstname> $fn </>
            <lastname> $ln </>
            <publicationtitle> $t </>
          </>

==> PersonID is a Skolem function. It generates a new ID for each distinct value of ($fn, $ln); for a value already seen, the output is appended to the existing node.
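
The behaviour of a Skolem function can be sketched with a dictionary keyed on the argument tuple: the same arguments always return the same node. The names `people` and `person_id`, the ID format and the sample rows are assumptions for illustration.

```python
people = {}

def person_id(first, last):
    """Skolem-style function: identical arguments always map to the same node."""
    key = (first, last)
    if key not in people:
        # First time this (first, last) pair is seen: mint a new id and node.
        people[key] = {"id": f"person{len(people) + 1}", "first": first,
                       "last": last, "titles": []}
    return people[key]

# Sample bindings of ($fn, $ln, $t), one per (author, title) match.
rows = [
    ("Serge", "Abiteboul", "Data on the Web"),
    ("Serge", "Abiteboul", "Foundations of Databases"),
    ("Dan", "Suciu", "Data on the Web"),
]
for first, last, title in rows:
    # A repeated author appends to the existing node instead of creating one.
    person_id(first, last)["titles"].append(title)

for person in people.values():
    print(person["id"], person["last"], person["titles"])
```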

XML-QL

• Allows integrating data from multiple sources

• Can query order as well

• Provides for embedding query within data

• Allows function definitions

• Is relationally complete

XML-QL

• Is everything fine?

– Pattern specifications are too verbose

– Result of the WHERE clause is a relation composed of scalar values

• So it cannot preserve information about hierarchy and sequence

• Hence it cannot handle hierarchy and sequence related queries



Quilt

• Combines strengths of XML-QL and XQL

• Derives ability to navigate and select nodes based on sequence from XQL

• Binding of variables done like in XML-QL

Quilt

• An example

FOR $b IN //book
WHERE exists($b/title) AND
      NOT exists($b/author)
RETURN $b/title

==> Lists the titles of those books which do not have author info

Quilt

Flow of data in a Quilt expression:

XML Input -> FOR/LET -> tuples of bound variables -> WHERE -> tuples selected -> RETURN -> XML Output
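
The FOR, WHERE and RETURN stages map naturally onto a Python comprehension, which gives a rough feel for the data flow. The sample bibliography is hypothetical, and the comprehension is only an analogue of the "books without author info" query, not Quilt itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document.
bib = ET.fromstring(
    "<bib>"
    "<book><title>Anonymous Tales</title></book>"
    "<book><title>Data on the Web</title><author>Abiteboul</author></book>"
    "<book><author>Nobody</author></book>"
    "</bib>"
)

titles = [
    t.text
    for b in bib.iter("book")            # FOR $b IN //book: bind one tuple per book
    if b.find("title") is not None       # WHERE exists($b/title)
    and b.find("author") is None         #   AND NOT exists($b/author)
    for t in b.findall("title")          # RETURN $b/title
]
print(titles)
```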

Quilt: Filtering Documents

• Need to preserve the relationships among selected elements

• Eg: [Figure: a tree of A, B and C nodes before and after applying filter = A|B]

Quilt

• Can perform Sorting

• Aggregation provided

• Allows recursive functions

Quilt: The real power of it

• Sample document

<section>

<section.title>Procedure</section.title> The patient was taken to the operating room where she was placed in a supine position and <Anesthesia>induced under general anesthesia. </Anesthesia> <Prep> <action>Foley catheter was placed to decompress the bladder</action> and the abdomen was then prepped and draped in sterile fashion. </Prep> <Incision> A curvilinear incision was made <Geography>in the midline immediately infraumbilical</Geography> and the subcutaneous tissue was divided <Instrument>using electrocautery.</Instrument> </Incision> The fascia was identified and <action>#2 0 Maxon stay sutures were placed on each side of the midline.</action> <Incision> The fascia was divided using <Instrument>electrocautery</Instrument> and the peritoneum was entered. </Incision> <Observation>The small bowel was identified</Observation> and <action> the <Instrument>Hasson trocar</Instrument> </action>

:

</section>

Quilt: The real power of it

• In each section with title "Procedure", what Instruments were used in the second Incision?

FOR $s IN //section[section.title="Procedure"]
RETURN ($s//Incision)[2]/Instrument

• In each section with title "Procedure", what are the first two instruments to be used?

FOR $s IN //section[section.title="Procedure"]
RETURN ($s//Instrument)[1-2]

Quilt: The real power of it

• In the first procedure, what happened between the first incision and the second incision?

FOR $proc IN //section[section.title="Procedure"][1],
    $bet IN $proc//((* AFTER ($proc//Incision)[1]) BEFORE ($proc//Incision)[2])
RETURN $bet

XML Storage

• Text files

– Simple

– Would require special purpose query processor

• Relational databases

– Ternary relations [Florescu et al.]

– Inlining methods [Shanmugasundaram et al.]

– STORED [Mary Fernandez]

XML Storage

• Object Oriented databases [Sophie Cluet et al.]

• Native storage

XML Storage

• Using Ternary relations

– Edge labels are maintained in a table with the object ids that the edge connects

– Values of leaf nodes are stored in a second table

[Figure: the XML tree. &o1 has a "paper" edge to &o2; &o2 has "title", "author", "author" and "year" edges to &o3, &o4, &o5 and &o6, whose values are "The Calculus", "…", "…" and "1986".]

Store XML in Ternary Relation

Ref

Source   Label    Dest
&o1      paper    &o2
&o2      title    &o3
&o2      author   &o4
&o2      author   &o5
&o2      year     &o6

Val

Node   Value
&o3    The Calculus
&o4    …
&o5    …
&o6    1986
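
Filling the two tables from a document can be sketched as one tree traversal. The function `shred`, the id scheme and the author values "A1" and "A2" are assumptions for illustration (the slide elides the author strings); this is not the exact mapping used by Florescu et al.

```python
import itertools
import xml.etree.ElementTree as ET

def shred(root):
    """Return (ref, val): edge rows (source, label, dest) and leaf rows (node, value)."""
    ids = (f"&o{n}" for n in itertools.count(1))
    ref, val = [], []

    def visit(elem, parent_id):
        oid = next(ids)
        ref.append((parent_id, elem.tag, oid))     # one Ref row per edge
        if len(elem) == 0:                         # leaf: store its text in Val
            val.append((oid, (elem.text or "").strip()))
        for child in elem:
            visit(child, oid)

    visit(root, next(ids))                         # the first id names the root object
    return ref, val

paper = ET.fromstring(
    "<paper><title>The Calculus</title><author>A1</author>"
    "<author>A2</author><year>1986</year></paper>"
)
ref, val = shred(paper)
for row in ref + val:
    print(row)
```

Every path query then becomes a chain of self-joins on the edge table, which is the main cost of this generic mapping.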

XML Storage

• DTDs converted into DTD graph

• Inlining methods

– Basic inlining

– Shared inlining

– Hybrid inlining

Corresponding DTD graph

Element graph for Editor Element

XML Storage

• Basic inlining

– For each node in the DTD graph a relation is created

– Creates a large number of relations

Relations created using Basic inlining

XML Storage

• Shared inlining

– Create relations for elements with in-degree > 1

– An element node is represented in exactly one relation

– For mutually recursive elements, make one of them a separate relation

Relations created using shared inlining

XML Storage

• Hybrid inlining

– Inlines elements with in-degree > 1 that are not recursive or reached through a "*" node

Relations created using hybrid inlining

XML Storage

• STORED

• Uses a query language to specify mappings.

• Mappings are generated using mining algorithms

• Nonconforming data is stored in overflow graphs.

XML Storage

• STORED (contd.)

• Given a data instance D, a STORED query is generated automatically.

FROM Audit.taxpayer:$X{name:$N, phone:$P1,
                       optional{phone:$P2}}
STORE R1($X,$N,$P1,$P2)

• Given relational mappings, generate explicit overflow mappings so that the query is lossless.

XML Storage

• Object oriented method

• Using DTD a hierarchy of the elements is obtained

• Each element is now modeled as a class

• For handling “*” of DTD a list of objects is maintained

• To handle union types (Eg., phone|email) a new class can be introduced

XML Storage

• eXcelon way

– eXcelon XML Data Engine is a high performance XML data management engine

– Based on ObjectStore DBMS

– When XML data gets parsed in eXcelon, it is represented in XMLStore as discrete XML elements.

– The hierarchical structure of XML is therefore preserved in its persistent representation

XML Algebra

Why yet another algebra?

– Structure of data

• Deeply structured

• Exact structure not specific

– Recursion

• Structurally recursive

Proposed Algebra: Too much stress on type conformance

XML Algebra

• Sample Data

<bib>

<book>

<title>Data on the Web</title>

<year>1999</year>

<author>Abiteboul</author>

<author>Buneman</author>

</book>

<book>

<title> XML Query</title>

<year>2000</year>

<author>Mary</author>

</book>

</bib>

XML Algebra

type Bib = bib [ Book{0,*} ]

type Book = book [
  title [String],
  year [Integer],
  author [String]{1,*}
]

let bib0: Bib = bib [
  book [
    title ["Data on the Web"], year [1999],
    author ["Abiteboul"], author ["Buneman"]
  ],
  book [
    title ["XML Query"], year [2000],
    author ["Mary"]
  ]
]

XML Algebra

• Projection

Eg: project book( children (bib0) )

– Allows a more convenient notation as well (similar to XPath notation)

– Eg. bib0/book/author

==> author ["Abiteboul"],
    author ["Buneman"],
    author ["Mary"]

: author [String]{0,*}

XML Algebra

• Selection

Eg: for b <- bib0/book in
      where value(b/year) <= 2000 then b

==> book [
      title ["Data on the Web"],
      year [1999],
      author ["Abiteboul"],
      author ["Buneman"]
    ],
    book [
      title ["XML Query"],
      year [2000],
      author ["Mary"]
    ]

: Book{0,*}

XML Algebra

• Join

type Reviews =
  reviews [
    book [
      title [String],
      review [String]
    ]{0,*}
  ]

let review0: Reviews =
  reviews [
    book [ title ["XML Query"],
           review ["This is great"]
    ],
    book [ title ["Data on the Web"],
           review ["A fine book"]
    ]
  ]

XML Algebra

• Join

for b <- bib0/book in
  for r <- review0/book in
    where value(b/title) = value(r/title) then
      book [ b/title, b/author, r/review ]

==> book [
      title ["Data on the Web"],
      author ["Abiteboul"], author ["Buneman"],
      review ["A fine book"]
    ],
    book [
      title ["XML Query"],
      author ["Mary"],
      review ["This is great"]
    ]

: book [
    title [String],
    author [String]{1,*},
    review [String]
  ]{0,*}

XML Algebra

• Querying Order

– Index function pairs an integer index with each element in a forest

– Eg: index(book0/author)

==> pair [fst [1], snd [author ["Abiteboul"]]],
    pair [fst [2], snd [author ["Buneman"]]],
    pair [fst [3], snd [author ["Suciu"]]]

: pair [fst [Integer], snd [author [String]]]{1,*}

XML Algebra

• Aggregation

– Has five built-in aggregation functions: avg, count, max, min and sum

– Eg:

for b <- bib0/book in
  where count(b/author) >= 2 then b/title

==> title ["Data on the Web"]

: title [String]{0,*}
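
The iteration, aggregation and index operators have direct counterparts in ordinary list processing. The sketch below mirrors the sample data as Python dictionaries, which is an encoding chosen for illustration rather than part of the algebra.

```python
# bib0 from the sample data, encoded as plain Python values.
bib0 = [
    {"title": "Data on the Web", "year": 1999, "author": ["Abiteboul", "Buneman"]},
    {"title": "XML Query", "year": 2000, "author": ["Mary"]},
]

# Iteration with a count() condition: titles of books with at least two authors.
titles = [b["title"] for b in bib0 if len(b["author"]) >= 2]

# index(b/author): pair each author with its 1-based position in the forest.
indexed = [list(enumerate(b["author"], start=1)) for b in bib0]

print(titles)
print(indexed[0])
```

The typed algebra additionally infers a result type for each expression, which this untyped sketch does not attempt.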

XML Algebra

• Additional Features

– Structural Recursion

• To define documents with recursive structure, recursive types are used

– Sorting

• sort(pairs)

– Grouping

• Group(pairs)

Kweelt

• Is a framework to query XML Data

• An implementation of Quilt

• Architecture :

XML Indexing

[Figure: Semistructured Data. A graph with nodes numbered 1 to 13 and edges labelled t, a, b, c and d]

XML Indexing

• Data guides (used in Lore)

• A data guide is a concise and accurate summary of the data graph

[Figure: Data Guide for the semistructured data graph, with edges labelled t, a, b, c and d]
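
One way to build such a summary is a subset construction over the data graph, so that every label path of the data appears exactly once in the guide. The sketch below follows that idea; the function `build_dataguide` and the small sample graph are assumptions for illustration and do not reproduce the graph in the figure.

```python
def build_dataguide(edges, root):
    """Subset construction: each guide node is the set of data nodes
    reachable by one label path; each label path occurs once in the guide."""
    start = frozenset([root])
    guide, todo = {}, [start]
    while todo:
        state = todo.pop()
        if state in guide:
            continue                                   # already expanded
        targets = {}
        for src in state:
            for label, dst in edges.get(src, []):
                targets.setdefault(label, set()).add(dst)
        guide[state] = {label: frozenset(dsts) for label, dsts in targets.items()}
        todo.extend(guide[state].values())
    return guide

# Hypothetical data graph: node -> list of (edge label, target node).
edges = {1: [("t", 2), ("t", 3)], 2: [("a", 4), ("b", 5)], 3: [("a", 6)]}
guide = build_dataguide(edges, 1)
for state, out in guide.items():
    print(sorted(state), {label: sorted(dst) for label, dst in out.items()})
```

In this example the two `t` children collapse into one guide node, so the path `t.a` is recorded once and points at both matching data nodes.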

XML Indexing

• T-Index

[Figure: T-Index for the semistructured data graph, with edges labelled t, a, b, c and d]

Challenges

• Storage issues

– Relational or native?

• Query optimization

– Query plan?

• Other than queries… say triggers?

• Updates to data

• Mining of XML data
