historical xml databases

34
Historical XML Databases Fusheng Wang and Carlo Zaniolo University of California, Los Angeles

Upload: odelia

Post on 19-Jan-2016

57 views

Category:

Documents


0 download

DESCRIPTION

Historical XML Databases. Fusheng Wang and Carlo Zaniolo University of California, Los Angeles. Overview. State of the art Two scenarios of archiving history Publishing relational database history in XML Temporal queries with XQuery Historical database architecture - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Historical XML Databases

Historical XML Databases

Fusheng Wang and Carlo Zaniolo

University of California, Los Angeles

Page 2: Historical XML Databases

Overview

State of the art Two scenarios of archiving history Publishing relational database history in XML Temporal queries with XQuery Historical database architecture Efficient query support for temporal queries Conclusion

Page 3: Historical XML Databases

State of the Art

Publishing Relational DBs as XML Documents:as actual documents; to be processed using

the very rich XML tool set (XSLT, DOM)as views; to be queried by languages such as

XPath or XQuery. Queries against these views are then mapped into SQL queries on the DB

DB vendors are very active in this area, e.g.:SQLX and SQL functions for XML publishingXTables ( XPeranto ) as a middleware

Page 4: Historical XML Databases

Our Proposal: Publish the History of Relational DBs as XML Documents

Publish the history of relational DBs as XML documents: Natural ways to represent such history in XML Historical queries can be expressed in XQuery

as is—no extensions to the data model or query language required for temporal queries

Approach amenable to efficient implementation: query and storage efficiency of alternative approaches

Gain: Temporal applications are very important and are not supported well by current databases

Page 5: Historical XML Databases

Two Basic Scenarios

XML Data Warehouses archive the history:change can be detected by current database update

logsor compute the delta between the published XML

document snapshots of the new version and old versionTraditional version management (RCS, SCCS). More

recent techniques (UBCC, RBVM) used for XML and complex queries

RDBMSs archive the history:XML history is a view---and historical queries are

mapped back into relational ones (e.g., using XTables)

Page 6: Historical XML Databases

A Short History of Timein Databases

Between 33 and 48 proposals counted:A perennial struggle to get around the limitations of

relational (flat) tables and a rigid query language (SQL)Clifford, Croker, Grandi, and Tuzhilin in their “On

Temporal Grouping” paper show that the temporal-grouped models are more natural and powerful [Temp DB workshop, 1995]

But it is hard to fit temporally grouped models and query languages into SQL—an infinite morass

Page 7: Historical XML Databases

Temporal Grouping in XML

XML makes it possible to express and support temporal grouping

The history of a relational DB can be viewed as an XML document, using such representation

Then, powerful temporal queries can be specified without requiring the introduction of new constructs in the language

There are many ways to publish DBs using XML—and not all will do

Page 8: Historical XML Databases

History of Tables Transaction-Time Relational Tables

Timestamped tuple snapshots name empno salary title deptno DOB start end

Bob 10003 60000 Engineer d01 1945-04-09 1995-01-01 1995-05-31

Bob 10003 70000 Engineer d01 1945-04-09 1995-06-01 1995-09-30

Bob 10003 70000 Sr Engineer d02 1945-04-09 1995-10-01 1996-01-31

Bob 10003 70000 Tech Leader

d02 1945-04-09 1996-02-01 1996-12-31

name empno salary title deptno DOB

Bob1995-01-01:1996-12-31

100031995-01-01:1996-12-31

600001995-01-01:1996-05-31

Engineer1995-01-01:1995-09-30

d011995-01-01:1995-09-30

1945-04-091995-01-01:1996-12-3170000

1995-06-01:1996-12-31

Sr Engineer1995-10-01:1996-01-31

d021995-10-01:1996-12-31

Tech Leader1996-02-01:1996-12-31

Temporally grouped history of employees

Page 9: Historical XML Databases

Publishing DB History in XML Many Alternatives

1. Each table as XML document: columns as attributes Flat structure that corresponds to the tuple snapshots (employees2.xml)

2. Each table as an XML document: columns as elements A natural structure [Clifford et al.] which simplifies many queries (

employees.xml, depts.xml)

3. Multiple tables as a single XML document: flat structure Good for some join queries but not for others(company.xml)

4. Multiple tables as a single XML document: hierarchy similar but more intuitive than previous (depts3.xml)

5. Multiple tables as an XML document: flat structure with IDs Can simplify join queries with IDs and IDREFs (company2.xml)

Page 10: Historical XML Databases

XML Representation of DB HistoryTable Columns as XML Elements

<employees tstart="1995-01-01" tend="1996-12-31"><employee tstart="1995-01-01" tend="1996-12-31">

<empno tstart="1995-01-01" tend="1996-12-31">10003</empno><name tstart="1995-01-01" tend="1996-12-31">Bob</name><salary tstart="1995-01-01" tend="1995-05-31">60000</salary><salary tstart="1995-06-01" tend="1996-12-31">70000</salary><title tstart="1995-01-01" tend="1995-09-30">Engineer</title><title tstart="1995-10-01" tend="1996-01-31">Sr Engineer</title><title tstart="1996-02-01" tend="1996-12-31">Tech Leader</title><dept tstart="1995-01-01" tend="1995-09-30">QA</dept><dept tstart="1995-10-01" tend="1996-12-31">RD</dept><DOB tstart="1995-01-01" tend="1996-12-31">1945-04-09</DOB>

</employee> <!-- More… --></employees>

<employees tstart="1995-01-01" tend="1996-12-31"><employee tstart="1995-01-01" tend="1996-12-31">

<empno tstart="1995-01-01" tend="1996-12-31">10003</empno><name tstart="1995-01-01" tend="1996-12-31">Bob</name><salary tstart="1995-01-01" tend="1995-05-31">60000</salary><salary tstart="1995-06-01" tend="1996-12-31">70000</salary><title tstart="1995-01-01" tend="1995-09-30">Engineer</title><title tstart="1995-10-01" tend="1996-01-31">Sr Engineer</title><title tstart="1996-02-01" tend="1996-12-31">Tech Leader</title><dept tstart="1995-01-01" tend="1995-09-30">QA</dept><dept tstart="1995-10-01" tend="1996-12-31">RD</dept><DOB tstart="1995-01-01" tend="1996-12-31">1945-04-09</DOB>

</employee> <!-- More… --></employees>

Page 11: Historical XML Databases

XML Representation of DB History(cont’d)

Historical data is represented in an XML documentTwo attributes tstart and tend are used to represent the

time intervalThe value now is used to denote the ever-increasing

current timeNode updates:

delete: tend is updated to the current timestamp insert: a new node is appended with tend set as now update: delete followed by an insert

Page 12: Historical XML Databases

Schema of the XML Representation

<!ELEMENT employees (employee)*>

<!ATTLIST employees tstart CDATA #REQUIRED tend CDATA #REQUIRED>

<!ELEMENT employee (empno+, name+, salary+, title+, dept+, DOB+)>

<!ATTLIST employee tstart CDATA #REQUIRED tend CDATA #REQUIRED>

<!ELEMENT empno (#PCDATA)>

<!ATTLIST empno tstart CDATA #REQUIRED tend CDATA #REQUIRED>

<!ELEMENT salary (#PCDATA)>

<!ATTLIST salary tstart CDATA #REQUIRED tend CDATA #REQUIRED>

<!-- … -->

<!ELEMENT employees (employee)*>

<!ATTLIST employees tstart CDATA #REQUIRED tend CDATA #REQUIRED>

<!ELEMENT employee (empno+, name+, salary+, title+, dept+, DOB+)>

<!ATTLIST employee tstart CDATA #REQUIRED tend CDATA #REQUIRED>

<!ELEMENT empno (#PCDATA)>

<!ATTLIST empno tstart CDATA #REQUIRED tend CDATA #REQUIRED>

<!ELEMENT salary (#PCDATA)>

<!ATTLIST salary tstart CDATA #REQUIRED tend CDATA #REQUIRED>

<!-- … -->

The document has a well-defined schema derived from the snapshot document:

Page 13: Historical XML Databases

Temporal Queries with XQuery

Temporal projection: retrieve the salary history of “Bob”:

element salary_history { for $s in document("employees.xml")/ employees/employee/[name=“Bob”]/salary return $s }

element salary_history { for $s in document("employees.xml")/ employees/employee/[name=“Bob”]/salary return $s }

Snapshot queries: retrieve the departments on 1996-01-31:

for $d in document("depts.xml")/depts/dept[tstart(.) <= "1996-01-31" and tend(.) >= "1996-01-31"]let $n := $d/name[tstart(.)<="1996-01-31" and tend(.)>="1996-01-31"]let $m := $d/manager[tstart(.)<="1996-01-31" and tend(.)>=

"1996-01-31"]return( element dept{$n,$m } )

for $d in document("depts.xml")/depts/dept[tstart(.) <= "1996-01-31" and tend(.) >= "1996-01-31"]let $n := $d/name[tstart(.)<="1996-01-31" and tend(.)>="1996-01-31"]let $m := $d/manager[tstart(.)<="1996-01-31" and tend(.)>=

"1996-01-31"]return( element dept{$n,$m } )

Page 14: Historical XML Databases

Temporal Queries with XQuery (cont’d)

Interval Queries. Find employee(s) who worked in the “QA” department throughout the history of that department:

for $d in document("depts.xml")/depts/dept[deptname='QA']/deptnofor $e in document("employees.xml")/employees/employee[deptno=$d]where tstart($e/deptno)=tstart($d) and tend($e/deptno)=tend($d)return $e/name

for $d in document("depts.xml")/depts/dept[deptname='QA']/deptnofor $e in document("employees.xml")/employees/employee[deptno=$d]where tstart($e/deptno)=tstart($d) and tend($e/deptno)=tend($d)return $e/name

Page 15: Historical XML Databases

Complex Temporal Queries with XQuery

A Since B. Find the employee who has been the manager of the dept since he/she joined the dept “d007”:

for $e in document("employees.xml")/employees/employeelet $m:= $e/title[title="Manager" and tend(.)=current-date()]let $d := $e/deptno[deptno ="d007" and tcontains($m, .) ]where not empty($d) and not empty($m)return <employee> {

$e/empno, $e/firstname, $e/lastname}</employee>

for $e in document("employees.xml")/employees/employeelet $m:= $e/title[title="Manager" and tend(.)=current-date()]let $d := $e/deptno[deptno ="d007" and tcontains($m, .) ]where not empty($d) and not empty($m)return <employee> {

$e/empno, $e/firstname, $e/lastname}</employee>

Page 16: Historical XML Databases

Complex Temporal Queries with XQuery (cont’d)

Period Containment. Find employees with same history as employee “10112”, i.e., they worked in the same dept(s) as employee “10112” and exactly for the same periods:

for $e1 in document("employees.xml")/employees/employee [empno = '10112']

for $e2 in document("employees.xml")/employees/employee [empno != '10112']

where every $d1 in $e1/deptno satisfies some $d2 in $e2/deptno satisfies(string($d1) = string( $d2 ) and tequals($d2, $d1))

and every $d2 in $e2/deptno satisfies some $d1 in $e1/deptno

satisfies (string($d2) =

string( $d1 ) and tequals($d1, $d2))return <employee>{$e2/empno}</employee>

for $e1 in document("employees.xml")/employees/employee [empno = '10112']

for $e2 in document("employees.xml")/employees/employee [empno != '10112']

where every $d1 in $e1/deptno satisfies some $d2 in $e2/deptno satisfies(string($d1) = string( $d2 ) and tequals($d2, $d1))

and every $d2 in $e2/deptno satisfies some $d1 in $e1/deptno

satisfies (string($d2) =

string( $d1 ) and tequals($d1, $d2))return <employee>{$e2/empno}</employee>

Page 17: Historical XML Databases

User-Defined Temporal Functions

Shield the user from the low-level details used in representing time, e.g., “now”

Eliminate the need for the user to write complex functions, e.g., coalescing and diff

Predefined functions:History functions: history($e,Ts, Te), snapshot($e, T),

invariance($e, Ts, Te)Restructuring functions: coalese($l) Interval functions: toverlaps, tprecedes, tcontains, tequals,

tmeetsDuration and date/time functions: timespan($e), tstart($e),

tend($e), tinterval($e), telement(Ts, Te), getdbnow(), rtend($e), external($e)

Page 18: Historical XML Databases

Support for ‘now’

‘now’: no change until now. Values of tuples are still current at the time the query is asked

Internally, “end of time” values are used to denote ‘now’, e.g., 9999-12-31

Intervals are only accessed through built-in functions: tstart() returns the start of an interval, tend() returns the end or CURRENT_DATE if it’s different from 9999-12-31

In the output, tend value can be: “9999-12-31”, CURRENT_DATE (through rtend() ), or “now” (through externalnow() )

Page 19: Historical XML Databases

Historical Database Architecture

Historical Database

XML DataCurrent Database

Active Rules/update logs

Historical XML Data

XML Queries

Temporal XML Queries

XML Publishing

XML Views

XML Publishing

XML Views

Page 20: Historical XML Databases

Historical XML Database ArchitectureTwo Approaches

XML-enabled RDBMSHistorical view decomposed into relational databases as

binary tablesHistorical data can then be published as XML document

through SQL/XML publishing functions; or queried through a middleware as XML views

Native XML databasesHistorical data are stored in native XML databaseXML queries can be specified directly upon the databaseNative XML databases: SoftwareAG’s Tamino, eXcelon’s XIS

Page 21: Historical XML Databases

Relational Storage of Temporal Relational Data

AssumptionsEach entity or relation has a unique key ( or composite

keys) to identify it which will not change along the history. e.g., employee: empno

Relational schema:employee(empno, firstname, lastname, sex, DOB,

deptno, salary, title)The historical XML documents are decomposed into

tables

Page 22: Historical XML Databases

Relational Storage of Temporal Relational Data (cont’d)

Key table for keys:employee_id(id, tstart, tend), where id =empno For composite keys, the table will be like:

lineitem_id(id, supplierno, itemno, tstart, tend)

Attribute history tables:employee_lastname(id, lastname, tstart, tend)

employee_salary(id, salary, tstart, tend)employee_title(id, title, tstart, tend)…

Global relation table: keep all the relations history relations(name, tstart, tend)

Page 23: Historical XML Databases

Relational Storage of Temporal Relational Data (cont’d)

Sample contents of employee_salary: ID SALARY TSTART TEND ================================ 100022 58805 02/04/1985 02/04/1986 100022 61118 02/05/1986 02/04/1987 100022 65103 02/05/1987 02/04/1988 100022 64712 02/05/1988 02/03/1989 100022 65245 02/04/1989 02/03/1990 100023 43162 07/13/1988 07/13/1989 ...

Page 24: Historical XML Databases

XML publishing and XML Queries

A middleware (XPERANTO/XTABLES) can be used to publish and query historical tables as XML documentsCreate XML views over relational dataEach database has a default XML viewThe temporal XML document representation can be

reconstructed with user-defined XML views with XQuery, and be queried with XQuery

Query upon XML views with XQueryOnly the desired relational data items are materializedMost computation pushed down to relational engine

Page 25: Historical XML Databases

Automatic Archiving

StatementCREATE HISTORICAL VIEW viewname ASSELECT col1, col2, …FROM tablename [ USING KEY coli, Colj, … ]

Results:Historical tables are created for each attribute of the current

tableTemporal XML views are created with XPERANTOThe historical tables are initialized with the snapshot of the

current tableActive rules are started to trace any changes and archive

into the historical tablesTemporal XQuery can be specified on the XML views

Page 26: Historical XML Databases

Implementation Comparisons

A temporal data simulation program automatically generates the historical data in XML

Total number of employees: 300,024 Database systems and major supported query

languages for comparison:Relational: DB2. SQL Native:

SoftwareAG’s Tamino (text-based storage). XPath eXcelon’s XIS (XML Information Server) (OODBMS-based

storage). XQuery

Page 27: Historical XML Databases

Performance Comparisons

H-Document DB2 Tamino XIS0

100000

200000

300000

400000

500000

600000

700000

x

KB

Data Size Index Size

X: not available

Storage Size:

Page 28: Historical XML Databases

Performance Comparisons (cont’d)

Query Performance of DB2 and Tamino:

Q1 Q2 Q3 Q4 Q5 Q6 Q7

1

10

100

x

Tim

e(s

econds)

(log

10)

Queries

DB2 Tamino

x: not available

Q2: history query

Q4,Q6: snapshot queries

Q3,Q5: interval queries

Q1: scan of databases

Q7: join

Page 29: Historical XML Databases

Performance Comparisons (cont’d)

Query Performance of Tamino and XIS (1/3 data size)

Q1 Q2 Q3 Q4 Q5 Q6 Q7

0

100

200

300

400

500

x

seco

nds

Tamino XIS

X: not available

Page 30: Historical XML Databases

Efficient Query Support for Temporal Queries

H-document is first clustered by document structure, and then by the change historyTamino will preserve the clustering structure thus

retrieving the history of a node can be efficient In RDBMS approach, tuples are stored in the order of

updates, neither temporarily clustered nor clustered by objects

Traditional B+ Tree index will not help on interval-related temporal queries

A segment-based archiving scheme was used in this project

Page 31: Historical XML Databases

Segment-based Archiving Scheme (cont’d)

Sample segments: Segment1 (01/01/1985 - 10/17/1991): ID SALARY TSTART TEND 100002 40000 02/20/1988 02/19/1989 100002 42010 02/20/1989 02/19/1990 100002 42525 02/20/1990 02/19/1991 100002 42727 02/20/1991 12/31/9999 ... Segment2 (10/18/1991 - 07/08/1995): ID SALARY TSTART TEND100002 42727 02/20/1991 02/19/1992 100002 45237 02/20/1992 02/18/1993 100002 46465 02/19/1993 02/18/1994 100002 47418 02/19/1994 02/18/1995 100002 47273 02/19/1995 12/31/9999 ... segment3 (07/09/1995 - 01/08/1999): ...

Page 32: Historical XML Databases

Query Performance

Query Performance with different usefulness

Q1 Q2 Q3 Q4 Q5

0

2000

4000

6000

8000T

ime(

mill

isec

onds)

Queries

Umin

= 0.400 U

min = 0.360

Umin

= 0.257 U

min = 0.200

Umin

= 0.000

Q1,Q3: snapshot queriesQ5: interval queriesQ2, Q4: history queries

Page 33: Historical XML Databases

Conclusion

XML can be used to support a temporally grouped data model, and represent temporal relational data

The framework supports complex temporal queries with XQuery, without extension to XQuery

The XML-viewed history of database tables can be stored using a native XML database or using a RDBMS

RDBMS has significant query performance compared to native XML database, while the latter can be more effective in terms of storage due to compression techniques

A segment-based archiving scheme based on usefulness can significantly boost the performance on most temporal queries

Page 34: Historical XML Databases

History of XML Documents

The temporal representation in XML not only applies to historical relational data, but also historical XML documents

<document tstart="2002-01-01" tend="now"> <chapter tstart="2002-01-01" tend="now"> <no isAttr="yes" tstart="2002-01-01" tend="now">1</no> <title tstart="2002-01-01" tend="2002-01-01">Introduction</title> <title tstart="2002-01-02" tend="now">Introduction and Overview</title> <section tstart="2002-01-01" tend="now"> <title tstart="2002-01-01" tend="now">Background</title> <subsection tstart="2002-01-01" tend="now">

<title tstart="2002-01-01" tend="now">Previous Work</title> <content tstart="2002-01-01" tend="now">...</content>

</subsection> </section> </chapter> <!-- ... --></document>

<document tstart="2002-01-01" tend="now"> <chapter tstart="2002-01-01" tend="now"> <no isAttr="yes" tstart="2002-01-01" tend="now">1</no> <title tstart="2002-01-01" tend="2002-01-01">Introduction</title> <title tstart="2002-01-02" tend="now">Introduction and Overview</title> <section tstart="2002-01-01" tend="now"> <title tstart="2002-01-01" tend="now">Background</title> <subsection tstart="2002-01-01" tend="now">

<title tstart="2002-01-01" tend="now">Previous Work</title> <content tstart="2002-01-01" tend="now">...</content>

</subsection> </section> </chapter> <!-- ... --></document>