XML: Data Driving Business?
Laks V.S.Lakshmanan,
IIT Bombay and Concordia University
XML: Data Model
• What is an XML document?
  – Linearization of a tree structure
  – Every node of the tree can have several character strings associated with it
  – The information content of the document is the tree structure together with the character strings
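The tree view can be made concrete with a minimal Python sketch. It assumes the standard-library xml.etree.ElementTree parser and a small sample document that is not from the slides; it recovers the tree and its character strings from the linearized text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not taken from the slides).
doc = "<book><title>Data on the Web</title><year>1999</year></book>"

def outline(node, depth=0):
    """Return (depth, tag, text) triples in document order."""
    rows = [(depth, node.tag, (node.text or "").strip())]
    for child in node:
        rows.extend(outline(child, depth + 1))
    return rows

tree_rows = outline(ET.fromstring(doc))
for depth, tag, text in tree_rows:
    print("  " * depth + tag, text)
```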
Is XML just a syntax for data interchange and serialization?
XML: Data Model
• Types of nodes
  – Element, e.g. <p a1="A1" ... an="An">c1 ... cm</p>
  – Document, e.g. <!DOCTYPE name [markup declarations]>
  – Processing instruction, e.g. <?xml version="1.0"?>
  – Comment, e.g. <!-- This is a comment -->
  – Atomic data, e.g. <Data>
What is a DTD?
• A Document Type Definition (DTD) serves as a grammar
• A document type definition specifies:
  – the elements that are permissible in a document of this type
  – for each element, the possible attributes, their range of values and defaults
  – for each element, the structure of its contents, including:
    • which elements can occur and in what order
    • whether text characters can occur
Example of a DTD
Eg:
<!DOCTYPE Bookslist [
  <!ELEMENT Bookslist (book)*>
  <!ELEMENT book (title, author*, publisher)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT publisher (#PCDATA)>
]>
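A DTD content model behaves like a regular expression over the sequence of child elements. The sketch below is illustrative only: it hand-translates the two structured content models of the Bookslist DTD into Python regular expressions (the CONTENT_MODELS encoding and the conforms helper are assumptions, not a general DTD parser) and checks documents against them.

```python
import re
import xml.etree.ElementTree as ET

# Content models of the Bookslist DTD, hand-translated into regular
# expressions over space-terminated child tag names (illustrative encoding).
CONTENT_MODELS = {
    "Bookslist": r"(book )*",
    "book": r"title (author )*publisher ",
}

def conforms(element):
    """Check an element and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        # Treated as (#PCDATA): no child elements allowed.
        return len(element) == 0
    children = "".join(child.tag + " " for child in element)
    return re.fullmatch(model, children) is not None and all(
        conforms(child) for child in element
    )

good = ET.fromstring(
    "<Bookslist><book><title>T</title><author>A</author>"
    "<publisher>P</publisher></book></Bookslist>"
)
bad = ET.fromstring("<Bookslist><book><author>A</author></book></Bookslist>")
print(conforms(good), conforms(bad))
```

The second document fails because a book without title and publisher does not match (title, author*, publisher).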
XML and DTD
• Well-formed documents
  – Tags should be nested properly and attributes should be unique
• Valid documents
  – Well-formed documents that conform to a Document Type Definition (DTD)
• DTDs are used to
  – Constrain structure
  – Declare entities
  – Provide some default values for attributes
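Well-formedness can be tested without any DTD. As a minimal sketch, assuming Python's standard-library xml.etree.ElementTree (which checks well-formedness but does not validate against a DTD), a parse error signals a document that is not well formed. The helper name is_well_formed is illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b></b></a>"))      # properly nested tags
print(is_well_formed("<a><b></a></b>"))      # improperly nested tags
print(is_well_formed('<a x="1" x="2"/>'))    # attribute repeated on one element
```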
DTD Limitations
• Too document-oriented
• Too simple and too complicated at the same time
• Too limited to represent complex structures
• IDREFs are not typed
• No notion of inheritance/sub-typing
• Too many ways to represent the same thing
• Names are global, not local
DTD vs. Database Schema
• Order is significant in a DTD but not in a DB schema
• DTD does not provide for data types
• DTD cannot specify keys
XMLSchema
• Why XMLSchema?
  – Based on XML syntax
  – Can be parsed and manipulated like any XML document
  – Supports a variety of data types
  – Allows extension of vocabularies and inheritance from elements
  – Provides namespace integration
  – Provides logical grouping of attributes
XMLSchema: An example
<datatype name="PriceType">
  <basetype name="decimal"/>
  <minExclusive>0.00</minExclusive>
  <scale>2</scale>
</datatype>
<element name="price" type="PriceType"></element>
<element name='Person'> ... </element>
<element name='Employee'>
<refines name='Person'/> ...
</element>
XMLSchema vs. DTD

                        DTD             XMLSchema
Syntax                  Specialized     Same as XML
Compactness             Compact         Verbose
Data types              Strings         Variety of types
Data model              Closed          Open
Namespace integration   Primitive       Full-fledged
Attribute grouping      Not supported   Supported
XML Data
• Superset of XMLSchema
• Can express database relationships too
• Eg:
<elementType id="booktable">
  <element id="titleID" type="#title"/>
  <element type="#author"/>
  <element type="#pages"/>
  <key id="bookkey"> <keyPart href="#titleID"/> </key>
</elementType>
Semistructured data
• Data that is neither raw nor as strictly typed as in databases
• Examples of semistructured data
  – HTML file with one entry per restaurant that provides info on prices, addresses, styles
  – BibTeX files
  – Genome and scientific databases
  – Online documentation
Semistructured data: Main aspects
• Structure
  – Irregular
  – Implicit
  – Partial
• Schema
  – Very large
  – Rapidly evolving
  – Distinction between data and schema is blurred
Semistructured data: Data model
• Object Exchange Model (OEM)
  – Lightweight and flexible
  – Data representation
    • As a graph with objects as vertices and labels on edges
    • Each object has a unique object identifier
    • Some objects are atomic, e.g., integer, real, …
    • Complex objects have as value a set of object references
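One way to picture an OEM graph is as a dictionary keyed by object identifier. The sketch below is a simplification under stated assumptions: all identifiers and values are hypothetical, complex objects are stored as lists of (label, oid) edges rather than sets, and eval_path is an illustrative helper that follows a dotted label path such as biblio.book.author.

```python
# OEM objects keyed by object identifier (all ids and values hypothetical).
# Atomic objects hold a value; complex objects hold (label, oid) edges.
oem = {
    "&1": [("biblio", "&2")],
    "&2": [("book", "&3"), ("book", "&4")],
    "&3": [("author", "&5"), ("title", "&6")],
    "&4": [("author", "&7")],
    "&5": "Abiteboul",
    "&6": "Data on the Web",
    "&7": "Buneman",
}

def follow(oids, label):
    """Follow every edge carrying `label` out of the given objects."""
    result = []
    for oid in oids:
        value = oem[oid]
        if isinstance(value, list):  # only complex objects have edges
            result.extend(dest for lab, dest in value if lab == label)
    return result

def eval_path(root, path):
    """Evaluate a dotted label path starting from the root object."""
    oids = [root]
    for label in path.split("."):
        oids = follow(oids, label)
    return oids

authors = [oem[oid] for oid in eval_path("&1", "biblio.book.author")]
print(authors)
```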
OEM: An example
Semistructured data: Query Languages
• Lorel
  – Based on OQL
  – Eg.
    select author: X
    from biblio.book.author X
  – Computes the set of book authors
  – Forms a new node and connects it, by edges labelled author, to the nodes resulting from evaluation of the path expression
Lorel: Salient features
• Coercion
  – Forces comparison operators to handle comparisons between objects of different types, such as a string and an integer
  – Eg.
    select row: X
    from biblio.paper X
    where X.year = 1998
==> The comparison works whether year is stored as a string or as an integer
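The effect of coercion can be imitated in a few lines of Python. This is only a sketch of the idea: the function coerced_equal and the sample records are hypothetical, and Lorel's actual coercion rules are richer than a fallback from numeric to string comparison.

```python
def coerced_equal(a, b):
    """Compare two atomic values, coercing both to numbers when possible."""
    try:
        return float(a) == float(b)
    except (TypeError, ValueError):
        # Fall back to string comparison when a value is not numeric.
        return str(a) == str(b)

# Hypothetical papers whose year is stored inconsistently.
papers = [
    {"title": "P1", "year": 1998},
    {"title": "P2", "year": "1998"},
    {"title": "P3", "year": "n/a"},
]

rows = [p["title"] for p in papers if coerced_equal(p["year"], 1998)]
print(rows)
```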
Lorel: Salient Features
• Path expressions
  – Data model allows arbitrary nesting
  – Queries should hence be able to probe to arbitrary depth
  – Provided by path expressions
  – Eg.
    select title: t
    from chapter(.section)* s, s.title t
    where t like "*XML*"
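The path expression chapter(.section)* reaches the chapter itself and sections nested at any depth. A minimal Python sketch of that closure, assuming XML input parsed with xml.etree.ElementTree and a hypothetical sample chapter:

```python
import xml.etree.ElementTree as ET

# Hypothetical chapter with sections nested to different depths.
doc = ET.fromstring(
    "<chapter><title>Intro</title>"
    "<section><title>XML basics</title>"
    "<section><title>Why XML matters</title></section>"
    "</section>"
    "<section><title>History</title></section></chapter>"
)

def closure(node, label):
    """Yield node, then everything reachable by zero or more `label` steps."""
    yield node
    for child in node.findall(label):
        yield from closure(child, label)

titles = [
    t.text
    for s in closure(doc, "section")   # chapter(.section)* s
    for t in s.findall("title")        # s.title t
    if "XML" in t.text                 # t like "*XML*"
]
print(titles)
```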
UnQL
• Based on the edge-labeled graph model
• Coercion not supported
  – More precise knowledge of data needed
• Pattern usage
  – Eg.
    select title: X
    where {biblio: {paper: {title: X, year: Y}}} in db, Y > 1998
UnQL
• Path variables
  – Can use paths as data too
  – Eg.
    select @P
    from db1 @P.X
    where matches(".*(U|u)biquitin.*", X)
==> Determines where the string "ubiquitin" appears in db1
Semistructured vs. XML
• Both are schema-less, self-describing
• XML is ordered and semistructured data is not
• XML can mix text and elements
• XML has lots of other stuff: entities, processing instructions, comments
Requirements of an XML Query Language
• XML output
• Server-side processing
• Query operations
  – Selection, Extraction, Reduction, Restructuring, Combination
• No schema required
• Exploit available schema
• Preserve order and association
• Programmatic manipulation
Requirements of an XML Query Language
• XML representation
• Mutual embedding with XML
• XLink and XPointer cognizant
• Support for new data types
• Suitable for metadata
XML Query Languages
• XQL
• XML-QL
• Quilt
XQL
• Simple expressions
  – //product[@maker='BSA'] : all products whose maker attribute is 'BSA'
• Filters
  – author/address[@type='email'] : address nodes whose type attribute is 'email'
• Subscripts
  – section[1,3 to 5] : section nodes at positions 1, 3, 4 and 5
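XQL itself is not widely available today, but its simple expressions and attribute filters closely resemble the XPath subset that Python's xml.etree.ElementTree supports. The sketch below is an approximation under that assumption, with a hypothetical catalog; it evaluates the first expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; products may appear at any depth.
catalog = ET.fromstring(
    "<catalog>"
    "<product maker='BSA'><name>Gold Star</name></product>"
    "<product maker='Triumph'><name>Bonneville</name></product>"
    "<group><product maker='BSA'><name>Bantam</name></product></group>"
    "</catalog>"
)

# ".//product[@maker='BSA']" plays the role of //product[@maker='BSA'].
names = [
    p.findtext("name")
    for p in catalog.findall(".//product[@maker='BSA']")
]
print(names)
```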
XQL
• Supports boolean and set operators
  – q1 and q2
  – q1 union q2
• Grouping
  – //invoice{q1} : uses invoice to group the results of q1
• Sequence
  – a before b
• Others: node(), text(), ...
XQL: Limitations
• Flattening
  – Not possible, as the results of patterns and filters are not modeled by an intermediate relation
• Restructuring
  – Not possible, since flattening is not permitted
• Tag variables
  – Not supported
• Sorting
XML Query Languages
• XQL
• XML-QL
• Quilt
XML-QL
• Simple examples
WHERE <book>
        <publisher> <name>Addison-Wesley</name> </publisher>
        <title> $t</title>
        <author> $a</author>
      </book> IN "www.a.b.c/bib.xml"
CONSTRUCT <result>
            <author>$a</author>
            <title>$t</title>
          </result>
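The WHERE clause binds $t and $a for every matching book, and CONSTRUCT emits one result per binding. A rough Python equivalent, assuming xml.etree.ElementTree and a small in-memory bibliography standing in for bib.xml (the data is hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of bib.xml.
bib = ET.fromstring(
    "<bib>"
    "<book><publisher><name>Addison-Wesley</name></publisher>"
    "<title>T1</title><author>A1</author><author>A2</author></book>"
    "<book><publisher><name>Other</name></publisher>"
    "<title>T2</title><author>A3</author></book>"
    "</bib>"
)

results = []
for book in bib.findall("book"):
    # WHERE: the publisher name must match the pattern.
    if book.findtext("publisher/name") != "Addison-Wesley":
        continue
    # One binding of ($t, $a) per title/author pair of the book.
    for t in book.findall("title"):
        for a in book.findall("author"):
            # CONSTRUCT: build one <result> element per binding.
            result = ET.Element("result")
            ET.SubElement(result, "author").text = a.text
            ET.SubElement(result, "title").text = t.text
            results.append(result)

serialized = [ET.tostring(r, encoding="unicode") for r in results]
print(serialized)
```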
XML-QL
• Grouping
WHERE <book> $p </> IN "www.a.b.c/bib.xml",
      <title> $t </>,
      <publisher> <name>Addison-Wesley</> </publisher> IN $p
CONSTRUCT <result>
            <title> $t </>
            WHERE <author> $a </> IN $p
            CONSTRUCT <author> $a </>
          </>
==> Groups authors by title.
XML-QL
• Tag variables
WHERE <$p>
        <title> $t </title>
        <year>1995 </>
        <$e> Smith </>
      </> IN "www.a.b.c/bib.xml",
      $e IN {author, editor}
CONSTRUCT <$p>
            <title> $t </title>
            <$e> Smith </>
          </>
==> Lists the publications from 1995 for which Smith is either author or editor
XML-QL
• Regular path expressions
WHERE <part*>
        <name>$r</>
        <brand>Ford</>
      </> IN "www.a.b.c/bib.xml"
CONSTRUCT <result>$r</>
==> Gets the list of names of parts irrespective of the nesting of parts in the document.
XML-QL
• Skolem functions
WHERE <$>
        <author> <firstname> $fn </> <lastname> $ln </> </>
        <title> $t </>
      </> IN "www.a.b.c/bib.xml"
CONSTRUCT <person ID=PersonID($fn, $ln)>
            <firstname> $fn </>
            <lastname> $ln </>
            <publicationtitle> $t </>
          </>
==> PersonID is a Skolem function
It generates a new id for each distinct value of ($fn, $ln); for a value seen before, the output is appended to the existing node.
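A Skolem function can be thought of as a memoized node constructor: the same arguments always yield the same node. The sketch below is illustrative; the person_id helper, the "p1", "p2" id format and the sample bindings are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical bindings of ($fn, $ln, $t) produced by the WHERE clause.
rows = [
    ("Serge", "Abiteboul", "Data on the Web"),
    ("Dan", "Suciu", "Data on the Web"),
    ("Dan", "Suciu", "XML-QL"),
]

persons = {}  # (firstname, lastname) -> person element

def person_id(fn, ln):
    """Return the person node for (fn, ln), creating it on first use."""
    key = (fn, ln)
    if key not in persons:
        node = ET.Element("person", ID="p%d" % (len(persons) + 1))
        ET.SubElement(node, "firstname").text = fn
        ET.SubElement(node, "lastname").text = ln
        persons[key] = node
    return persons[key]

for fn, ln, title in rows:
    # Repeated (fn, ln) values append to the existing node.
    ET.SubElement(person_id(fn, ln), "publicationtitle").text = title

suciu_titles = [
    e.text for e in persons[("Dan", "Suciu")].findall("publicationtitle")
]
print(len(persons), suciu_titles)
```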
XML-QL
• Allows integrating data from multiple sources
• Can query order as well
• Provides for embedding query within data
• Allows function definitions
• Is relationally complete
XML-QL
• Is everything fine?
  – Pattern specifications are too verbose
  – Result of the WHERE clause is a relation composed of scalar values
    • So it cannot preserve information about hierarchy and sequence
    • Hence it cannot handle queries related to hierarchy and sequence
XML Query Languages
• XQL
• XML-QL
• Quilt
Quilt
• Combines strengths of XML-QL and XQL
• Derives ability to navigate and select nodes based on sequence from XQL
• Binding of variables done like in XML-QL
Quilt
• An example
FOR $b IN //book
WHERE exists($b/title) AND
      NOT exists($b/author)
RETURN $b/title
==> Lists the titles of those books that have no author information
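The FOR/WHERE/RETURN shape maps naturally onto a comprehension. A minimal Python sketch of the same query, assuming xml.etree.ElementTree and a hypothetical three-book document:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: one complete book, one without author, one without title.
doc = ET.fromstring(
    "<bib>"
    "<book><title>T1</title><author>A</author></book>"
    "<book><title>T2</title></book>"
    "<book><author>B</author></book>"
    "</bib>"
)

titles = [
    t.text
    for b in doc.iter("book")                                     # FOR $b IN //book
    if b.find("title") is not None and b.find("author") is None  # WHERE
    for t in b.findall("title")                                   # RETURN $b/title
]
print(titles)
```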
Quilt
[Figure: Flow of data in a Quilt expression. XML input → FOR/LET → tuples of bound variables → WHERE → tuples selected → RETURN → XML output]
Quilt: Filtering Documents
• Need to preserve the relationships among selected elements
• Eg: filter = A|B
[Figure: a document tree with nodes labelled A, B and C, and the tree that remains after applying the filter A|B]
Quilt
• Can perform sorting
• Aggregation provided
• Allows recursive functions
Quilt: The real power of it
• Sample document
<section>
<section.title>Procedure</section.title> The patient was taken to the operating room where she was placed in a supine position and <Anesthesia>induced under general anesthesia. </Anesthesia> <Prep> <action>Foley catheter was placed to decompress the bladder</action> and the abdomen was then prepped and draped in sterile fashion. </Prep> <Incision> A curvilinear incision was made <Geography>in the midline immediately infraumbilical</Geography> and the subcutaneous tissue was divided <Instrument>using electrocautery.</Instrument> </Incision> The fascia was identified and <action>#2 0 Maxon stay sutures were placed on each side of the midline.</action> <Incision> The fascia was divided using <Instrument>electrocautery</Instrument> and the peritoneum was entered. </Incision> <Observation>The small bowel was identified</Observation> and <action> the <Instrument>Hasson trocar</Instrument> </action>
:
</section>
Quilt: The real power of it
• In each section with title "Procedure", what Instruments were used in the second Incision?
FOR $s IN //section[section.title="Procedure"]
RETURN ($s//Incision)[2]/Instrument
• In each section with title "Procedure", what are the first two instruments to be used?
FOR $s IN //section[section.title="Procedure"]
RETURN ($s//Instrument)[1-2]
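Both queries depend on document order, which a relational view of bindings would lose. The sketch below approximates them in Python under stated assumptions: it uses xml.etree.ElementTree, whose iter() visits elements in document order, and a hypothetical, heavily shortened report in the shape of the sample document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened report in the shape of the sample document.
report = ET.fromstring(
    "<report>"
    "<section><section.title>Procedure</section.title>"
    "<Incision>first cut <Instrument>electrocautery</Instrument></Incision>"
    "<action><Instrument>Hasson trocar</Instrument></action>"
    "<Incision>second cut <Instrument>scalpel</Instrument></Incision>"
    "</section>"
    "<section><section.title>Findings</section.title></section>"
    "</report>"
)

# FOR $s IN //section[section.title="Procedure"]
procedures = [
    s for s in report.iter("section")
    if s.findtext("section.title") == "Procedure"
]

# ($s//Incision)[2]/Instrument: Instrument children of the second Incision.
second_incision_instruments = [
    inst.text
    for s in procedures
    for incision in list(s.iter("Incision"))[1:2]
    for inst in incision.findall("Instrument")
]

# The first two Instrument elements of each procedure, in document order.
first_two_instruments = [
    inst.text
    for s in procedures
    for inst in list(s.iter("Instrument"))[:2]
]
print(second_incision_instruments, first_two_instruments)
```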
Quilt: The real power of it
• In the first procedure, what happened between the first incision and the second incision?
FOR $proc IN //section[section.title="Procedure"][1],
    $bet IN $proc//((* AFTER ($proc//Incision)[1]) BEFORE ($proc//Incision)[2])
RETURN $bet
XML Storage
• Text files
  – Simple
  – Would require a special-purpose query processor
• Relational databases
  – Ternary relations [Florescu et al]
  – Inlining methods [Shanmugasundaram et al]
  – STORED [Mary Fernandez]
XML Storage
• Object-oriented databases [Sophie Cluet et al]
• Native storage
XML Storage
• Using ternary relations
  – Edge labels are maintained in a table together with the object ids that the edge connects
  – Values of leaf nodes are stored in a second table
[Figure: data graph. &o1 has a paper edge to &o2; &o2 has title, author, author and year edges to &o3, &o4, &o5 and &o6, whose values are "The Calculus", "…", "…" and "1986"]
Store XML in Ternary Relation

Ref
Source  Label   Dest
&o1     paper   &o2
&o2     title   &o3
&o2     author  &o4
&o2     author  &o5
&o2     year    &o6

Val
Node  Value
&o3   The Calculus
&o4   …
&o5   …
&o6   1986
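Shredding a document into the Ref and Val relations is a single pre-order traversal. The sketch below is one possible way to do it, assuming xml.etree.ElementTree; the shred helper and the author values "A1" and "A2" are hypothetical (the slide elides the author names).

```python
import itertools
import xml.etree.ElementTree as ET

def shred(root):
    """Split a document into Ref(source, label, dest) and Val(node, value)."""
    counter = itertools.count(1)
    ref, val = [], []

    def visit(node, parent_oid):
        oid = "&o%d" % next(counter)
        ref.append((parent_oid, node.tag, oid))   # one row per edge
        if len(node) == 0:
            val.append((oid, node.text or ""))    # leaf: record its value
        for child in node:
            visit(child, oid)

    root_oid = "&o%d" % next(counter)             # object above the root edge
    visit(root, root_oid)
    return ref, val

# Author values are hypothetical placeholders.
paper = ET.fromstring(
    "<paper><title>The Calculus</title><author>A1</author>"
    "<author>A2</author><year>1986</year></paper>"
)
ref, val = shred(paper)
for row in ref + val:
    print(row)
```

Both lists could then be loaded into two ordinary relational tables and queried with joins on the object ids.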
XML Storage
• DTDs are converted into a DTD graph
• Inlining methods
  – Basic inlining
  – Shared inlining
  – Hybrid inlining
[Figure: Corresponding DTD graph]
[Figure: Element graph for Editor element]
XML Storage
• Basic inlining
  – For each node in the DTD graph a relation is created
  – Creates a large number of relations
[Figure: Relations created using basic inlining]
XML Storage
• Shared inlining
  – Create relations for elements with in-degree > 1
  – An element node is represented in exactly one relation
  – For mutually recursive elements, make one of them a separate relation
[Figure: Relations created using shared inlining]
XML Storage
• Hybrid inlining
  – Inlines elements with in-degree > 1 that are not recursive or reached through a "*" node
[Figure: Relations created using hybrid inlining]
XML Storage
• STORED
• Uses a query language to specify mappings.
• Mappings are generated using mining algorithms
• Nonconforming data is stored in overflow graphs.
XML Storage
• STORED (contd.)
• Given a data instance D, a STORED query is generated automatically.
FROM Audit.taxpayer:$X {name:$N, phone:$P1, optional{phone:$P2}}
STORE R1($X,$N,$P1,$P2)
• Given relational mappings, generate explicit overflow mappings so that the query is lossless.
XML Storage
• Object-oriented method
• Using DTD a hierarchy of the elements is obtained
• Each element is now modeled as a class
• For handling “*” of DTD a list of objects is maintained
• To handle union types(Eg., phone|email) new class can be introduced
XML Storage
• eXcelon way
– eXcelon XML Data Engine is a high performance XML data management engine
– Based on ObjectStore DBMS
– When XML data gets parsed in eXcelon, it is represented in XMLStore as discrete XML elements.
– The hierarchical structure of XML is therefore preserved in its persistent representation
XML Algebra
Why yet another algebra?
– Structure of data
  • Deeply structured
  • Exact structure not specified
– Recursion
  • Structurally recursive
Proposed Algebra: Too much stress on type conformance
XML Algebra
• Sample Data
<bib>
<book>
<title>Data on the Web</title>
<year>1999</year>
<author>Abiteboul</author>
<author>Buneman</author>
</book>
<book>
<title> XML Query</title>
<year>2000</year>
<author>Mary</author>
</book>
</bib>
XML Algebra
type Bib = bib [ Book{0,*} ]
type Book = book [
title [String ],
year [Integer],
author[ String]{1,*}
]
let bib0: Bib = bib [
book [
title [“Data on the Web”], year [1999],
author[“Abiteboul”], author[“Buneman”]
],
book[
title[“XML Query”],year[2000],
author[“Mary”]
]
]
XML Algebra
• Projection
Eg: project book( children (bib0) )
– Allows a more convenient notation as well (similar to XPath notation)
– Eg. bib0/book/author
==> author ["Abiteboul"],
    author ["Buneman"],
    author ["Mary"]
: author [ String ] {0,*}
XML Algebra
• Selection
Eg: for b <- bib0/book in
      where value(b/year) <= 1999 then b
==> book [
      title ["Data on the Web"],
      year [1999],
      author ["Abiteboul"],
      author ["Buneman"]
    ]
: Book{0,*}
XML Algebra
• Join:
type Reviews =
reviews [
book [
title [String],
review [ String]
]{0,*}
]
let review0: Reviews =
reviews[
book [ title["XML Query"],
review["A fine book"]
],
book [ title["Data on the Web"],
review["This is great"]
]
]
XML Algebra
• Join
for b <- bib0/book in
  for r <- review0/book in
    where value(b/title) = value(r/title) then
      book [ b/title, b/author, r/review ]
==> book [
      title ["Data on the Web"],
      author ["Abiteboul"], author ["Buneman"],
      review ["This is great"]
    ],
XML Algebra
• Join (contd.)
    book [
      title ["XML Query"],
      author ["Mary"],
      review ["A fine book"]
    ]
: book [
    title [ String ],
    author [ String ]{1,*},
    review [ String ]
  ]{0,*}
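The join is a nested iteration with an equality test on titles. A minimal Python sketch, assuming the sample data is modelled as lists of dictionaries and that titles are spelled identically in both collections so that the equality test can succeed:

```python
# Sample data modelled as plain Python values (an illustrative encoding).
bib0 = [
    {"title": "Data on the Web", "year": 1999,
     "author": ["Abiteboul", "Buneman"]},
    {"title": "XML Query", "year": 2000, "author": ["Mary"]},
]
review0 = [
    {"title": "XML Query", "review": "A fine book"},
    {"title": "Data on the Web", "review": "This is great"},
]

# for b <- bib0/book in for r <- review0/book in where ... then book[...]
joined = [
    {"title": b["title"], "author": b["author"], "review": r["review"]}
    for b in bib0
    for r in review0
    if b["title"] == r["title"]
]
for book in joined:
    print(book)
```

The result follows the order of bib0, the outer loop, which is how the algebra preserves document order.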
XML Algebra
• Querying Order
– The index function pairs an integer index with each element in a forest
– Eg: index(book0/author), where book0 is a single book with three authors
==> pair [ fst [1], snd [ author ["Abiteboul"] ] ],
    pair [ fst [2], snd [ author ["Buneman"] ] ],
    pair [ fst [3], snd [ author ["Suciu"] ] ]
: pair [ fst [ Integer ], snd [ author [ String ] ] ]{1,*}
XML Algebra
• Aggregation
– Has five built-in aggregation functions: avg, count, max, min and sum
– Eg:
for b <- bib0/book in
  where count(b/author) >= 2 then b/title
==> title ["Data on the Web"]
: title{0,*}
XML Algebra
• Additional Features
– Structural recursion
  • To define documents with recursive structure, recursive types are used
– Sorting
  • sort(pairs)
– Grouping
  • group(pairs)
Kweelt
• Is a framework to query XML data
• An implementation of Quilt
• Architecture :
XML Indexing
[Figure: Semistructured Data. Root node 1 has t edges to nodes 2 to 6, which have edges labelled a, b, c and d to leaf nodes 7 to 13]
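XML Indexing
• Data guides (used in Lore)
• A data guide is a concise and accurate summary of the data graph
[Figure: Data Guide. Node 1 has a single t edge to {2, 3, 4, 5, 6}, which has an a edge to {7, 8, 10, 12, 13}, a b edge to {7, 13}, a c edge to {9} and a d edge to {11}]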
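A data guide can be built much like determinizing an automaton: each guide node stands for the set of data nodes reachable by one label path. The sketch below is illustrative; the edge list is an assumed reading of the example graph (in particular, which of the nodes 2 to 6 owns each leaf edge is a guess), and build_dataguide is a hypothetical helper.

```python
from collections import defaultdict

# Assumed edges of the example data graph: (source, label, destination).
edges = [
    (1, "t", 2), (1, "t", 3), (1, "t", 4), (1, "t", 5), (1, "t", 6),
    (2, "a", 7), (2, "b", 7), (3, "a", 8), (3, "c", 9),
    (4, "a", 10), (4, "d", 11), (5, "a", 12), (6, "a", 13), (6, "b", 13),
]

def build_dataguide(edges, root):
    """Map each reachable target set to its outgoing {label: target set}."""
    out = defaultdict(list)
    for src, label, dst in edges:
        out[src].append((label, dst))

    guide = {}
    todo = [frozenset([root])]
    while todo:
        state = todo.pop()
        if state in guide:
            continue
        # Merge the edges of all data nodes in this state by label.
        by_label = defaultdict(set)
        for node in state:
            for label, dst in out[node]:
                by_label[label].add(dst)
        guide[state] = {lab: frozenset(d) for lab, d in by_label.items()}
        todo.extend(guide[state].values())
    return guide

guide = build_dataguide(edges, 1)
level1 = guide[frozenset([1])]["t"]
print(sorted(level1), {lab: sorted(d) for lab, d in guide[level1].items()})
```

Every label path of the data appears exactly once in the guide, which is what makes it usable as a path index.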
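XML Indexing
• T-Index
[Figure: T-Index. Node 1 has a t edge to {2, 3, 4, 5, 6}, which has a and b edges to {7, 13}, an a edge to {8, 10, 12}, a c edge to {9} and a d edge to {11}]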
Challenges
• Storage issues
  – Relational or native?
• Query optimization
  – Query plan?
• Other than queries… say, triggers?
• Updates to data
• Mining of XML data