Dan Suciu Tools for XML Data Exchange

Tools for XML Data Exchange

Dan Suciu, AT&T Labs

Joint work with Mary Fernandez

XML Has Many Facets

• XML for fancier Web pages

– XML generated with structural editors

• XML for messaging

– generated during applications

• XML for Data Exchange

– generated from legacy data

XML in Data Exchange

• communities agree on common DTD

• export their data in XML

• exchange over HTTP

• applications understand only that DTD

An Example of XML Data

<book>
  <publisher> Addison-Wesley </publisher>
  <author> Serge Abiteboul </author>
  <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
  <author> Victor Vianu </author>
  <title> Foundations of Databases </title>
  <year> 1995 </year>
</book>

<book>
  <publisher> Freeman </publisher>
  <author> Jeffrey D. Ullman </author>
  <title> Principles of Database and Knowledge Base Systems </title>
  <year> 1998 </year>
</book>

XML Exchange Vision

[Figure: data sources (relational data, object-relational, legacy data), XML Data exchanged over the Web (HTTP), processing steps (Transform, Integrate, Warehouse), and three applications.]

Tools

• export legacy data to XML

– RXL

• query/transform/integrate XML data

– XML-QL

• compress XML data

– XMill

• store/process incoming XML data

– STORED

XML-QL: A Query Language for XML

• http://www.w3.org/TR/NOTE-xml-ql (8/98)

• new W3C Working Group on query languages (9/99)

• XML-QL characteristics:

– relationally complete (like SQL)

– XML input, XML output

– queries, transforms, integrates XML data

[Deutsch et al., 1999 (WWW8)]

Querying in XML-QL

where <book language="french">
        <publisher> <name> Morgan Kaufmann </name> </publisher>
        <author> $a </author>
      </book> in "www.a.b.c/bib.xml"
construct $a

Pattern: the <book> ... </book> element in the where clause
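The pattern match above can be approximated in Python. This is only a sketch of what the query computes, not an XML-QL implementation: the sample document, the author names, and the function name `french_mk_authors` are hypothetical stand-ins for the data behind "www.a.b.c/bib.xml".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the document at "www.a.b.c/bib.xml".
BIB_XML = """
<bib>
  <book language="french">
    <publisher><name>Morgan Kaufmann</name></publisher>
    <author>A. Auteur</author>
    <author>B. Auteur</author>
  </book>
  <book language="english">
    <publisher><name>Morgan Kaufmann</name></publisher>
    <author>C. Author</author>
  </book>
  <book language="french">
    <publisher><name>Freeman</name></publisher>
    <author>D. Auteur</author>
  </book>
</bib>
"""


def french_mk_authors(root):
    """Return every binding of $a for which the <book> pattern matches."""
    bindings = []
    for book in root.iter("book"):
        # The pattern requires language="french" on the book element.
        if book.get("language") != "french":
            continue
        # The pattern requires a publisher whose name is Morgan Kaufmann.
        names = [(n.text or "").strip() for n in book.findall("publisher/name")]
        if "Morgan Kaufmann" not in names:
            continue
        # Each <author> child yields one binding of $a.
        for author in book.findall("author"):
            bindings.append((author.text or "").strip())
    return bindings


authors = french_mk_authors(ET.fromstring(BIB_XML))
print(authors)
```

Each matching book contributes one binding per author, which is why a single book can produce several answers.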

Transformations in XML-QL

Note: </> abbreviates </book> or </result> or ...

where <book language = $l>
        <author> $a </>
      </> in "www.a.b.c/bib.xml"
construct <result>
            <author> $a </>
            <lang> $l </>
          </>

Result:

<result> <author>. . .</author> <lang>. . .</lang> </result>
<result> <author>. . .</author> <lang>. . .</lang> </result>
<result> <author>. . .</author> <lang>. . .</lang> </result>

Template: the <result> ... </> element in the construct clause
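A rough Python equivalent of this transformation, assuming a small hypothetical input document: one <result> is emitted per (author, language) binding. The <results> wrapper element and the sample names are assumptions added so that the output is a single well-formed tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; the slide's bib.xml is not shown.
BIB_XML = """<bib>
  <book language="french"><author>A. Auteur</author></book>
  <book language="english"><author>B. Author</author><author>C. Author</author></book>
</bib>"""


def transform(root):
    """Emit one <result> per binding of ($a, $l), as the template prescribes."""
    out = ET.Element("results")  # wrapper element, not part of the slide's query
    for book in root.iter("book"):
        lang = book.get("language")
        if lang is None:
            continue  # the pattern needs a language attribute to bind $l
        for author in book.findall("author"):
            result = ET.SubElement(out, "result")
            ET.SubElement(result, "author").text = author.text
            ET.SubElement(result, "lang").text = lang
    return out


results = transform(ET.fromstring(BIB_XML))
print(ET.tostring(results, encoding="unicode"))
```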

Transformations in XML-QL

where <book language = $l>
        <author> $a </>
      </> in "www.a.b.c/bib.xml"
construct <result>
            <author id=F($a)> $a </>
            <lang> $l </>
          </>

Result:

<result> <author>. . .</author> <lang>. . .</lang> <lang>. . .</lang> </result>
<result> <author>. . .</author> <lang>. . .</lang> <lang>. . .</lang> </result>

Skolem Functions in Templates
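The grouping effect of a Skolem function can be sketched in Python: equal arguments always yield the same id, and elements constructed with the same id are merged into one. The sketch below keys the whole <result> on F($a), which is an assumption made to reproduce the grouped output shown on the slide; the `skolem` helper and the sample bindings are hypothetical.

```python
import xml.etree.ElementTree as ET


def skolem(name, *args):
    """Deterministic id: the same arguments always produce the same id."""
    return name + "(" + ",".join(args) + ")"


def group_by_author(bindings):
    """Merge all bindings that share F(author) into a single <result>."""
    results = {}  # Skolem id -> <result> element (dicts keep insertion order)
    for author, lang in bindings:
        key = skolem("F", author)
        if key not in results:
            result = ET.Element("result", id=key)
            ET.SubElement(result, "author").text = author
            results[key] = result
        # A repeated id adds content to the existing element instead of a new one.
        ET.SubElement(results[key], "lang").text = lang
    return list(results.values())


# Hypothetical ($a, $l) bindings produced by the where clause.
bindings = [("Auteur", "french"), ("Author", "english"), ("Auteur", "english")]
grouped = group_by_author(bindings)
for element in grouped:
    print(ET.tostring(element, encoding="unicode"))
```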

Data Integration in XML-QL

{ where <book>
          <isbn> $n </>
          <title> $t </>
        </> in "www.books.com"
  construct <result id=F($n)>
              <title> $t </>
            </> }

{ where <review>
          <isbn> $n </>
          <review> $r </>
        </> in "www.reviews.com"
  construct <result id=F($n)>
              <review> $r </>
            </> }

Result:

<result id=".."> <title>. . .</title> <review>. . .</review> <review>. . .</review> </result>
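The integration step can be sketched in Python as two passes that write into the same set of <result> elements, keyed by the Skolem id F(isbn). The two sample documents, their root element names, and the ISBN values are hypothetical; the slide only names the sources "www.books.com" and "www.reviews.com".

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of the two sources.
BOOKS = """<books>
  <book><isbn>111</isbn><title>Title One</title></book>
  <book><isbn>222</isbn><title>Title Two</title></book>
</books>"""
REVIEWS = """<reviews>
  <review><isbn>111</isbn><review>Good</review></review>
  <review><isbn>111</isbn><review>Dense</review></review>
</reviews>"""


def integrate(books_root, reviews_root):
    """Fuse both sources: results with the same F(isbn) become one element."""
    results = {}

    def result_for(isbn):
        key = "F(%s)" % isbn
        if key not in results:
            results[key] = ET.Element("result", id=key)
        return results[key]

    # First query block: titles from the book source.
    for book in books_root.findall("book"):
        isbn = book.findtext("isbn").strip()
        ET.SubElement(result_for(isbn), "title").text = book.findtext("title")
    # Second query block: review texts from the review source.
    for review in reviews_root.findall("review"):
        isbn = review.findtext("isbn").strip()
        for text in review.findall("review"):
            ET.SubElement(result_for(isbn), "review").text = text.text
    return results


merged = integrate(ET.fromstring(BOOKS), ET.fromstring(REVIEWS))
for element in merged.values():
    print(ET.tostring(element, encoding="unicode"))
```

Neither block needs to know about the other source; the shared id alone decides which elements are combined.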

RXL: Export Legacy Data to XML

• legacy data

– fragmented into many flat relations

– 3rd normal form

– schema is proprietary

• XML data

– nested

– un-normalized

– schema designed by agreement

RXL: An Example

• relational database:

Store(sid, name)
SB(sid, bid)
Book(bid, title)

• virtual XML view:

<store> <name> n1 </name> <book> ... </book> <book> ... </book> ... </store>
<store> <name> n2 </name> <book> ... </book> <book> ... </book> ... </store>

A Simple RXL Query

• specify XML view declaratively

from Store, SB, Book
where Store.sid=SB.sid and SB.bid=Book.bid
construct <store ID=f(Store.sid)>
            <name> Store.name </name>
            <book> Book.title </book>
          </store>
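To make the view definition concrete, the sketch below materializes it with Python's sqlite3 module. It is an illustration of what the RXL query denotes, not how an RXL system evaluates it; the table contents are hypothetical, and the dictionary keyed by sid plays the role of the Skolem term f(Store.sid), so each store appears once with all of its books nested inside.

```python
import sqlite3
import xml.etree.ElementTree as ET


def make_db():
    """Build a tiny in-memory database with hypothetical rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Store(sid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE Book(bid INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE SB(sid INTEGER, bid INTEGER);
        INSERT INTO Store VALUES (1, 'n1'), (2, 'n2');
        INSERT INTO Book VALUES (10, 'The Calculus'), (11, 'Algebra');
        INSERT INTO SB VALUES (1, 10), (1, 11), (2, 11);
    """)
    return db


def materialize_view(db):
    """Evaluate the from/where part in SQL, then nest rows per store."""
    stores = {}  # sid -> <store> element, mimicking ID=f(Store.sid)
    rows = db.execute("""
        SELECT Store.sid, Store.name, Book.title
        FROM Store, SB, Book
        WHERE Store.sid = SB.sid AND SB.bid = Book.bid
        ORDER BY Store.sid, Book.bid
    """)
    for sid, name, title in rows:
        if sid not in stores:
            store = ET.Element("store", ID="f(%d)" % sid)
            ET.SubElement(store, "name").text = name
            stores[sid] = store
        ET.SubElement(stores[sid], "book").text = title
    return list(stores.values())


view = materialize_view(make_db())
for store in view:
    print(ET.tostring(store, encoding="unicode"))
```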

RXL: Querying the XML View

• users ask XML-QL queries:

– find stores that sell "The Calculus"

where <store>
        <name> $n </name>
        <book> The Calculus </book>
      </store>
construct <result> $n </result>

RXL: Query composition

system composes query with view:

from Store, SB, Book
where Store.sid=SB.sid and SB.bid=Book.bid and Book.title="The Calculus"
construct <result> Store.name </result>

[Figure: the RXL query maps the relational tables Store(sid, name), SB(sid, bid), Book(bid, title) to the virtual XML view of <store> elements; the XML-QL query is posed against that view.]
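The point of composition is that the user's query never requires the XML view to be built. The Python sketch below checks this on hypothetical data: filtering the view's (store, book) pairs gives the same stores as the single composed query pushed down to the database. Only the standard sqlite3 module is used; the rows are made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Store(sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Book(bid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE SB(sid INTEGER, bid INTEGER);
    INSERT INTO Store VALUES (1, 'n1'), (2, 'n2'), (3, 'n3');
    INSERT INTO Book VALUES (10, 'The Calculus'), (11, 'Algebra');
    INSERT INTO SB VALUES (1, 10), (1, 11), (2, 11), (3, 10);
""")

# Route 1: compute the whole view, then filter it as the XML-QL query would.
view_rows = db.execute(
    "SELECT Store.name, Book.title FROM Store, SB, Book "
    "WHERE Store.sid = SB.sid AND SB.bid = Book.bid"
).fetchall()
via_view = sorted(name for name, title in view_rows if title == "The Calculus")

# Route 2: the composed query, with the title condition pushed into the join.
composed = sorted(
    name
    for (name,) in db.execute(
        "SELECT Store.name FROM Store, SB, Book "
        "WHERE Store.sid = SB.sid AND SB.bid = Book.bid AND Book.title = ?",
        ("The Calculus",),
    )
)

print(via_view, composed)
```

Route 2 lets the relational engine use its own optimizer and touches only the rows that matter.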

Compressing XML Data

• for exchange and archiving

• can use general tool (gzip)

• but a specialized tool (XMill) compresses twice as well

XMill Example: Weblogs

202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478 |-|-|http://www02.so-net.or.jp/|Mozilla/3.01 [ja] (Win95; I)

<apache:entry>
  <apache:host>202.239.238.16</apache:host>
  <apache:requestLine>GET / HTTP/1.0</apache:requestLine>
  <apache:contentType>text/html</apache:contentType>
  <apache:statusCode>200</apache:statusCode>
  <apache:date>1997/10/01-00:00:02</apache:date>
  <apache:byteCount>4478</apache:byteCount>
  <apache:referer>http://www02.so-net.or.jp/</apache:referer>
  <apache:userAgent>Mozilla/3.01 [ja] (Win95; I)</apache:userAgent>
</apache:entry>
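The conversion from a log line to an XML entry can be sketched as follows. The mapping from field position to element name is inferred from the single example above and is an assumption, as is the rule that "-" fields are dropped; the function name `log_line_to_xml` is hypothetical.

```python
from xml.sax.saxutils import escape

# Position -> element name, inferred from the one example line (an assumption).
# Positions 5, 7 and 8 hold "-" in the example and have no element in the XML.
FIELDS = {
    0: "host",
    1: "requestLine",
    2: "contentType",
    3: "statusCode",
    4: "date",
    6: "byteCount",
    9: "referer",
    10: "userAgent",
}


def log_line_to_xml(line):
    """Turn one pipe-delimited log line into an <apache:entry> element."""
    parts = [part.strip() for part in line.split("|")]
    out = ["<apache:entry>"]
    for index, tag in FIELDS.items():
        value = parts[index]
        if value == "-":
            continue  # treat "-" as a missing value
        out.append("  <apache:%s>%s</apache:%s>" % (tag, escape(value), tag))
    out.append("</apache:entry>")
    return "\n".join(out)


LINE = (
    "202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478 "
    "|-|-|http://www02.so-net.or.jp/|Mozilla/3.01 [ja] (Win95; I)"
)
entry = log_line_to_xml(LINE)
print(entry)
```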

XMill Example: Weblogs

weblog.dat: 15.9MB    weblog.dat.gz: 1.6MB
weblog.xml: 24.2MB    weblog.xml.gz: 2.1MB

xmill -p // weblog.xml weblog1.xmi             weblog1.xmi: 1.75MB
xmill weblog.xml weblog2.xmi                   weblog2.xmi: 1.33MB
xmill -f settings.pz weblog.xml weblog3.xmi    weblog3.xmi: 0.82MB
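The sizes above translate into the following gains over gzip on the XML file. The short Python calculation only restates the slide's numbers; the dictionary and function names are illustrative.

```python
# File sizes in MB, copied from the slide.
SIZES_MB = {
    "weblog.xml": 24.2,
    "weblog.xml.gz": 2.1,
    "weblog1.xmi": 1.75,
    "weblog2.xmi": 1.33,
    "weblog3.xmi": 0.82,
}


def gain_over_gzip(name):
    """How many times smaller the XMill output is than weblog.xml.gz."""
    return SIZES_MB["weblog.xml.gz"] / SIZES_MB[name]


for name in ("weblog1.xmi", "weblog2.xmi", "weblog3.xmi"):
    print("%s: %.2fx smaller than gzip" % (name, gain_over_gzip(name)))
```

The three runs come out at roughly 1.2x, 1.6x, and 2.6x smaller than gzip, so the tuned settings file accounts for most of the advantage.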

XMill: Fine Tuning the Compression

-p//apache:host=>seqcomb(u8 "." u8 "." u8 "." u8)
-p//apache:userAgent=>seq(e "/" e)
-p//apache:byteCount=>u
-p//apache:statusCode=>e
-p//apache:contentType=>e
-p//apache:requestLine=>seq("GET " rep("/" e) " HTTP/1." e)
-p//apache:date=>seq(u "/" u8 "/" u8 "-" u8 ":" di ":" di)
-p//apache:referer=>or(seq("file:" t) seq("http://" or(seq(rep("." e) "/" rep("/" e)) rep("." e))) t)

Storing XML Data

• Scenario:

– receive a large XML data instance

– want to store, manage it

• Could build an XML management system from scratch (eXcelon)

• Preferably: use existing database systems

Storing XML: Ternary Relation

[Florescu, Kossmann 1999]

[Figure: graph of objects &o1 to &o6. A paper edge leads from &o1 to &o2; from &o2, edges title, author, author, year lead to the leaves &o3 "The Calculus", &o4 "…", &o5 "…", &o6 "1986".]

Ref

Source  Label   Dest
&o1     paper   &o2
&o2     title   &o3
&o2     author  &o4
&o2     author  &o5
&o2     year    &o6

Val

Node  Value
&o3   The Calculus
&o4   …
&o5   …
&o6   1986
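A minimal Python sketch of this shredding, assuming a depth-first numbering of objects and leaf-only text values. The function name `shred` and the author placeholders are hypothetical; on the sample paper the sketch reproduces the Ref and Val rows listed above.

```python
import xml.etree.ElementTree as ET


def shred(root):
    """Split an XML tree into Ref(source, label, dest) and Val(node, value)."""
    ref, val = [], []
    counter = [1]

    def new_oid():
        oid = "&o%d" % counter[0]
        counter[0] += 1
        return oid

    def visit(element, source):
        dest = new_oid()
        ref.append((source, element.tag, dest))  # one edge per element
        children = list(element)
        if children:
            for child in children:
                visit(child, dest)
        else:
            val.append((dest, (element.text or "").strip()))  # leaf value

    visit(root, new_oid())  # the first oid is the root object the edge starts from
    return ref, val


# Author values are elided on the slide; "..." is a placeholder.
PAPER = (
    "<paper><title>The Calculus</title><author>...</author>"
    "<author>...</author><year>1986</year></paper>"
)
ref, val = shred(ET.fromstring(PAPER))
for row in ref:
    print(row)
for row in val:
    print(row)
```

Both relations have a fixed schema regardless of the document's structure, which is what allows any XML instance to be loaded into an existing relational system.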

Storing XML: Derive Schema from DTD

• DTD:

<!ELEMENT employee (name, address, project*)>
<!ELEMENT address (street, city, state, zip)>

• ODMG classes:

class Employee public type tuple (name:string, address:Address, project:List(Project))
class Address public type tuple (street:string, …)

• [Christophides et al. 1994, Shanmugasundaram et al. 1999]
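The mapping from DTD to classes can be illustrated with a deliberately small Python sketch. It handles only comma-separated sequences with optional * or + suffixes, and it is not the algorithm of the cited papers. The declaration of `project` is a hypothetical addition, since the slide does not show it; without it the sketch would map project to a list of strings.

```python
import re

DTD = """
<!ELEMENT employee (name, address, project*)>
<!ELEMENT address (street, city, state, zip)>
<!ELEMENT project (title)>
"""
# The project declaration above is assumed; it is not on the slide.

ELEMENT_RE = re.compile(r"<!ELEMENT\s+([\w-]+)\s+\(([^)]*)\)\s*>")


def type_of(child, declared):
    """Field type: declared elements become class references, others strings."""
    base = child.rstrip("*+?")
    scalar = base.capitalize() if base in declared else "string"
    wrapped = "List(%s)" % scalar if child.endswith(("*", "+")) else scalar
    return "%s:%s" % (base, wrapped)


def derive_classes(dtd):
    """Return one ODMG-style class definition string per declared element."""
    decls = {
        name: [c.strip() for c in body.split(",")]
        for name, body in ELEMENT_RE.findall(dtd)
    }
    return {
        name: "class %s public type tuple (%s)"
        % (name.capitalize(), ", ".join(type_of(c, decls) for c in children))
        for name, children in decls.items()
    }


classes = derive_classes(DTD)
for definition in classes.values():
    print(definition)
```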

STORED Approach: Mine Data to Derive Schema

[Deutsch et al. 1999]

[Figure: sample data tree of paper elements with author, title, and year children; some author elements have fn and ln children.]

Paper1

author  title
X       X

Paper2

fn1  ln1  fn2  ln2  title  year
X    X    X    X    X      -
X    X    -    -    X      X
X    X    -    -    X      -
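As a heavily simplified illustration of deriving a schema from the data itself, the Python sketch below keeps the fields that occur in enough records as columns, fills missing values with "-", and sets aside everything else. This is an assumption-laden toy, not the mining algorithm of STORED; the support threshold, the function names, and the overflow list are hypothetical.

```python
from collections import Counter


def mine_schema(records, min_support):
    """Pick as columns the fields present in at least min_support records."""
    counts = Counter(field for record in records for field in record)
    return sorted(field for field, n in counts.items() if n >= min_support)


def store(records, columns):
    """Build table rows; values that fit no column go to an overflow list."""
    rows, overflow = [], []
    for index, record in enumerate(records):
        rows.append(tuple(record.get(column, "-") for column in columns))
        overflow.extend(
            (index, field, value)
            for field, value in record.items()
            if field not in columns
        )
    return rows, overflow


# Three records shaped like the Paper2 rows above ("X" marks a present value).
PAPERS = [
    {"fn1": "X", "ln1": "X", "fn2": "X", "ln2": "X", "title": "X"},
    {"fn1": "X", "ln1": "X", "title": "X", "year": "X"},
    {"fn1": "X", "ln1": "X", "title": "X"},
]

columns = mine_schema(PAPERS, min_support=2)
rows, overflow = store(PAPERS, columns)
print(columns)
print(rows)
print(overflow)
```

Lowering the threshold to 1 keeps all six fields as columns and yields rows with the same "-" pattern as the Paper2 table, at the cost of more null values.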

Summary

• XML - simple (?), lightweight syntax

• Challenge: build bridges to existing database tools

• XML in data exchange: YES

• XML as a new data model: NO

More Info

http://www.research.att.com/~suciu

Data on the Web: From Relational to Semistructured to XML, Morgan Kaufmann, 1999