Web Data and the Resurrection of Database Theory, Dan Suciu, University of Washington
![Page 1: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/1.jpg)
Web Data and the Resurrection of Database Theory
Dan Suciu, University of Washington
![Page 2: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/2.jpg)
“In theory there is no difference between theory and practice. In practice there is.”
Jan L.A. van de Snepscheut
September 12, 1953 - February 23, 1994
![Page 3: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/3.jpg)
Short History of Database Theory
The legendary beginnings, 1970-1971:
• Relational databases are the brainchild of a theoretician (Codd)
• Heavily debated at the time (against CODASYL)
• It took several years for the concept to be validated in practice
Theory driving the industry
![Page 4: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/4.jpg)
Short History of Database Theory
The golden years (end of 70s, early 80s)
• Relational theory
– Functional dependencies
– Query containment
• Transactions
• Access methods
Theory listening to the industry
![Page 5: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/5.jpg)
Short History of Database Theory
Refined decadence (end of 80s, early 90s)
• Descriptive complexity
• Logic databases
• Complex objects
• Constraint databases
Divorce ?
![Page 6: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/6.jpg)
“Database Metatheory: Asking the Big Queries”
Christos Papadimitriou, in PODS, 1995
• Theory is inevitable: CS is a science of the artificial, and its artifact is being changed by the very act of studying it
• Kuhn’s paradigm principle, for natural sciences
[Diagram: stages labeled Immature science, Normal science, Crisis, Revolution]
![Page 7: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/7.jpg)
Is DB Theory in a Crisis Today ?
• Industry’s focus:
– one particular data model: relational/SQL
– one particular application (client-server)
• Theory’s focus is on Logic:
– New data models, query languages (query containment, complex objects, recursion)
– New applications (incomplete information, query rewriting using views)
![Page 8: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/8.jpg)
One Example of Unused Theory
Containment of conjunctive queries is NP-complete [Chandra and Merlin’77]
Dozens of extensions:
• With union and difference [Sagiv and Yannakakis’81]
• With order predicates [Klug’88, van der Meyden’92]
• With complex objects [Levy and Suciu’97]
• With regular expressions [Florescu, Levy and Suciu’98]
![Page 9: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/9.jpg)
Query Containment
The query:
Q1 = SELECT DISTINCT x.name, x.phone
     FROM Person x, Person y, Person z
     WHERE x.department = y.department AND x.manager = z.manager
Is minimized to:
Q2 = SELECT DISTINCT x.name, x.phone
     FROM Person x
The following can be checked: Q1 ⊆ Q2 and Q1 ⊇ Q2
…hence Q1=Q2
Minimization not used by RDBMSs today
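The containment check on this slide can be sketched with the classical homomorphism test of [Chandra and Merlin’77]: Q1 ⊆ Q2 holds exactly when some mapping of Q2's variables into Q1's variables preserves the head and sends every atom of Q2 to an atom of Q1. The encoding below is an assumption made for illustration: Person is treated as a four-column relation (name, phone, department, manager), every term is a variable, and the function names are hypothetical.

```python
def homomorphism_exists(src, dst):
    """True if src's variables can be mapped to dst's variables so that
    src's head goes to dst's head and every src atom lands on a dst atom."""
    src_head, src_atoms = src
    dst_head, dst_atoms = dst
    if len(src_head) != len(dst_head):
        return False
    mapping = {}
    for s, d in zip(src_head, dst_head):
        if mapping.setdefault(s, d) != d:
            return False

    def extend(i, mapping):
        # Backtracking search: place src atom i on some compatible dst atom.
        if i == len(src_atoms):
            return True
        rel, args = src_atoms[i]
        for drel, dargs in dst_atoms:
            if drel != rel or len(dargs) != len(args):
                continue
            candidate = dict(mapping)
            if all(candidate.setdefault(a, b) == b for a, b in zip(args, dargs)):
                if extend(i + 1, candidate):
                    return True
        return False

    return extend(0, mapping)


def contained_in(q1, q2):
    """q1 ⊆ q2 iff there is a homomorphism from q2 to q1."""
    return homomorphism_exists(q2, q1)


# Q1: x, y, z range over Person; y shares x's department, z shares x's manager.
Q1 = (("n", "p"), [
    ("Person", ("n", "p", "d", "m")),
    ("Person", ("n2", "p2", "d", "m2")),
    ("Person", ("n3", "p3", "d3", "m")),
])
# Q2: the minimized query with a single Person atom.
Q2 = (("n", "p"), [("Person", ("n", "p", "d", "m"))])
# Q3 (hypothetical): x must also be somebody's manager, so it is strictly smaller.
Q3 = (("n", "p"), [
    ("Person", ("n", "p", "d", "m")),
    ("Person", ("n2", "p2", "d2", "n")),
])

equivalent = contained_in(Q1, Q2) and contained_in(Q2, Q1)
```

The search is exponential in the worst case, in line with the NP-completeness result; the sketch ignores SQL NULL semantics and constants.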
![Page 10: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/10.jpg)
Why Today Things Are Changing
Just one reason: The Web
More precisely:
• A new data model
– Semistructured data
– XML syntax
• New applications
– Transformation
– Integration
![Page 11: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/11.jpg)
Web Data Management
• Who creates the new rules
– W3C working groups
– Sometimes the industry
The new artifacts are not concepts, but standards
• The double role of theory
– Long term: conceptualize/rationalize
  • E.g. keys for XML [Buneman, Davidson, Fan, Hara, Tan’01]
– Short term: answer technical questions
![Page 12: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/12.jpg)
Some Questions for Database Theory
• XML publishing
• Typechecking XML transformations
• XML storage
• Data distribution
![Page 13: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/13.jpg)
[Figure: architecture diagram. Data sources: relational data, object-relational, legacy data. Steps: Transform, Integrate. Center: XML Data on the WEB (HTTP). Consumers: application (three boxes), Warehouse. Annotated problems: XML Publishing, XML Storage, XML Typechecking, XML Distribution.]
![Page 14: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/14.jpg)
XML Publishing
Today:
• Legacy data
– fragmented into many flat relations
– 3rd normal form
– proprietary
• XML data
– nested
– un-normalized
– public (450 schemas at www.biztalk.org)
![Page 15: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/15.jpg)
XML Publishing: an Example
Legacy data in E/R:
[E/R diagram: entities Eu-Stores (euSid, name, country), US-Stores (usSid, name, url), Products (pid, name, priceUSD); relationships Eu-Sales (date) and US-Sales (date, tax) connect the stores to Products]
![Page 16: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/16.jpg)
XML Publishing: an Example
• XML view
<allsales>
  <country>
    <name> France </name>
    <store>
      <name> Nicolas </name>
      <product>
        <name> Blanc de Blanc </name>
        <sold> 10/10/2000 </sold>
        <sold> 12/10/2000 </sold>
        …
      </product>
      <product>…</product>…
    </store>….
  </country>
  …
</allsales>
• In summary: group by country, then store, then product
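The nesting in this view can be illustrated with a small sketch that groups flat sales rows by country, store, and product. It is not how SilkRoute works; the row layout `(country, store, product, date)` and the function name `publish` are assumptions for illustration, and the third sample row is invented.

```python
import xml.etree.ElementTree as ET


def publish(rows):
    """Nest flat (country, store, product, date) rows into an <allsales> tree."""
    root = ET.Element("allsales")
    countries, stores, products = {}, {}, {}
    for country, store, product, date in rows:
        if country not in countries:
            c = ET.SubElement(root, "country")
            ET.SubElement(c, "name").text = country
            countries[country] = c
        skey = (country, store)
        if skey not in stores:
            s = ET.SubElement(countries[country], "store")
            ET.SubElement(s, "name").text = store
            stores[skey] = s
        pkey = (country, store, product)
        if pkey not in products:
            p = ET.SubElement(stores[skey], "product")
            ET.SubElement(p, "name").text = product
            products[pkey] = p
        # One <sold> element per sale of this product in this store.
        ET.SubElement(products[pkey], "sold").text = date
    return root


rows = [
    ("France", "Nicolas", "Blanc de Blanc", "10/10/2000"),
    ("France", "Nicolas", "Blanc de Blanc", "12/10/2000"),
    ("France", "Nicolas", "Rouge", "11/10/2000"),  # hypothetical extra row
]
root = publish(rows)
xml_text = ET.tostring(root, encoding="unicode")
```

A flat join result repeats the country and store on every row; the XML view states each of them once and hangs the details underneath.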
![Page 17: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/17.jpg)
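Output “schema”:
[Tree diagram: allsales contains country*; country contains name and store*; store contains name, url? and product*; product contains name and sold*; sold contains date and tax?; the leaves name, url, date and tax are PCDATA]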
![Page 18: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/18.jpg)
XML Publishing
In SilkRoute [Fernandez, Suciu, Tan ’00]:
{ FROM EuStores $S, EuSales $L, Products $P
  WHERE $S.euSid = $L.euSid AND $L.pid = $P.pid
  CONSTRUCT
    <allsales()>
      <country($S.country)>
        <name> $S.country </name>
        <store($S.euSid)>
          <name> $S.name </name>
          <product($P.pid)>
            <name> $P.name </name>
            <price> $P.priceUSD </price>
          </product>
        </store>
      </country>
    </allsales>
}
/* union */
{ FROM USStores $S, USSales $L, Products $P
  WHERE $S.usSid = $L.usSid AND $L.pid = $P.pid
  CONSTRUCT
    <allsales()>
      <country("USA")>
        <name> USA </name>
        <store($S.usSid)>
          <name> $S.name </name>
          <url> $S.url </url>
          <product($P.pid)>
            <name> $P.name </name>
            <price> $P.priceUSD </price>
            <tax> $L.tax </tax>
          </product>
        </store>
      </country>
    </allsales>
}
![Page 19: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/19.jpg)
Internal Representation
View Tree:
[Tree diagram; edges carry * and ? multiplicities, leaves carry the values c, n, n, d, t, u]
allsales()
  country(c)
    name(c)
    store(c,x)
      name(n)
      url(c,x,u)
      product(c,x,y)
        name(n)
        sold(c,x,y,d)
          date(c,x,y,d)
          Tax(c,x,y,d,t)
Each node is defined by non-recursive datalog (SELECT DISTINCT … ):
allsales() :-
country(c) :- EuStores(x,_,c), EuSales(x,y,_), Products(y,_,_)
country("USA") :-
store(c,x) :- EuStores(x,_,c), EuSales(x,y,_), Products(y,_,_)
store(c,x) :- USStores(x,_,_), USSales(x,y,_), Products(y,_,_), c="USA"
url(c,x,u) :- USStores(x,_,u), USSales(x,y,_), Products(y,_,_)
Large query (x100 lines), large XML answer (x100 MB)
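To make the datalog reading concrete, here is a sketch that evaluates the two `store(c,x)` rules over tiny in-memory relations. The tuples are invented sample data, and the column layouts (for example `EuStores(euSid, name, country)`) are assumptions inferred from the rules. Python sets stand in for SELECT DISTINCT.

```python
# Hypothetical sample relations; column order follows the datalog rules.
eu_stores = {(1, "Nicolas", "France"), (2, "Vinum", "Germany")}  # (euSid, name, country)
eu_sales = {(1, 10, "10/10/2000")}                                # (euSid, pid, date)
us_stores = {(7, "WineShop", "http://example.com")}               # (usSid, name, url)
us_sales = {(7, 10, "1/1/2000")}                                  # (usSid, pid, date)
products = {(10, "Blanc de Blanc", 12.0)}                         # (pid, name, priceUSD)


def store():
    """store(c,x): union of the European rule and the US rule."""
    eu = {
        (country, sid)
        for (sid, _name, country) in eu_stores
        for (sale_sid, pid, _date) in eu_sales
        if sale_sid == sid
        for (prod_pid, _pname, _price) in products
        if prod_pid == pid
    }
    us = {
        ("USA", sid)
        for (sid, _name, _url) in us_stores
        for (sale_sid, pid, _date) in us_sales
        if sale_sid == sid
        for (prod_pid, _pname, _price) in products
        if prod_pid == pid
    }
    return eu | us


store_nodes = store()
```

Store 2 has no sales in this sample, so the join drops it: a view-tree node exists only when its rule body is satisfied.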
![Page 20: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/20.jpg)
Users Ask Specific XML Queries
• find names, urls of all stores who sold on 1/1/2000 (in XML-QL / XQuery melange):
WHERE <allsales/country/store>
        <product/sold/date> 1/1/2000 </>
        <name> $X </>
        <url> $Y </>
      </>
RETURN $X , $Y
Small query, small answer
![Page 21: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/21.jpg)
Query Composition
[Diagram: the View Tree (allsales(), country(c), name(c), store(c,x), name(n), url(c,x,u), product(c,x,y), name(n), sold(c,x,y,d), date(c,x,y,d), Tax(c,x,y,d,t)) next to the XML-QL Query Pattern (allsales / country / store, with branches product / sold / date = 1/1/2000, name = $X, url = $Y); the pattern nodes are labeled $n1, $n2, $n3, $n4, $n5, $Z]
“Evaluate” the XML pattern(s) on the view tree, combine all datalog rules
![Page 22: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/22.jpg)
Query Composition
Result (in theory…):
( SELECT S.name, S.url
  FROM USStores S, USSales L, Products P
  WHERE S.usSid=L.usSid AND L.pid=P.pid AND L.date='1/1/2000')
UNION
( SELECT S2.name, S2.url
  FROM EuStores S1, EuSales L1, Products P1,
       USStores S2, USSales L2, Products P2
  WHERE S1.euSid=L1.euSid AND L1.pid=P1.pid AND L1.date='1/1/2000'
    AND S2.usSid=L2.usSid AND L2.pid=P2.pid
    AND S1.country='USA' AND S1.euSid = S2.usSid)
![Page 23: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/23.jpg)
Complexity of XML Publishing
• But in practice: 5-7 times more joins !
– Need query minimization
• Could this be avoided ?
– We thought hard and couldn’t find a better way
– Asked students to re-implement: same problem
– It is NP-hard !
![Page 24: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/24.jpg)
XML Publishing Is NP-Hard
View Tree:
[Tree diagram: customer contains order? and complaint?, each with PCDATA content]
order() :- Q1
complaint() :- Q2
XML query:
WHERE <customer> <order> $x </> <complaint> $y </> </>
RETURN ( )
The composed SQL query is: Q1 JOIN Q2
Minimizing it is NP-hard ! (can be shown…)
![Page 25: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/25.jpg)
Recent Advancements in Query Containment
Definition FO^k = First Order Logic with k variables
Fact If Q2 ∈ FO^k and k “is small”, then Q1 ⊆ Q2 can be checked efficiently
[Kolaitis, Vardi’98], [Vardi’00], [Chekuri, Rajaraman’97]
![Page 26: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/26.jpg)
XML Publishing: Finale
Prediction: techniques based on FO^k and/or query width will be deployed in XML publishing in the future (perhaps under different names)
![Page 27: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/27.jpg)
XML Typechecking
Purpose: ensure that the generated XML conforms to the desired DTD (or XML Schema)
Two kinds:
• Dynamic typechecking
– Easy: lots of XML validating parsers available
• Static typechecking
– Hard: need complex analysis of the XML generation program
![Page 28: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/28.jpg)
XML Typechecking
XML generation programs:
• Publishing: RDBMS → XML (e.g. SilkRoute)
• Transformation: XML → XML (e.g. XSL, XQuery)
• Integration: XML + XML → XML
This talk: XML → XML
![Page 29: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/29.jpg)
The XML Typechecking Problem
Given an XML → XML transformation f:
Type Checking Problem
Given DTDs τ1, τ2, check ∀D ∈ τ1, f(D) ∈ τ2
sometimes τ1 = any: check ∀D, f(D) ∈ τ2
![Page 30: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/30.jpg)
Today’s Systems Try to Do Type Inference
Type Inference Problem
Given DTD τ1, find the DTD f(τ1) = {f(D) | D ∈ τ1}
Today’s systems:
• “Compute” f(τ1)
• Check f(τ1) ⊆ τ2 (which is possible)
sometimes τ1 = any: compute f(any), check f(any) ⊆ τ2
![Page 31: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/31.jpg)
Theory’s Role: Send a Warning
This approach fails in general !
But it may work OK in most “practical” cases...
![Page 32: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/32.jpg)
Why XML Type Inference Fails
XQuery f =
RETURN <a>
         (FROM Employee $x RETURN <b/>),
         (FROM Employee $x RETURN <c/>),
         (FROM Employee $x RETURN <d/>)
       </a>
• “Inferred” (wrong) DTD f(any):
<!ELEMENT a (b*,c*,d*)>
• “Real” output “DTD”
<!ELEMENT a ({b^n,c^n,d^n | n ≥ 0})>
• Fails to typecheck f(any) ⊆ τ2 when τ2 =
<!ELEMENT a ((b,b)*,(c,c)*,(d,d)* | (b,b)*,b,(c,c)*,c,(d,d)*,d)>
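The failure can be reproduced in miniature by encoding the children of `<a>` as a string over the letters b, c, d and treating each content model as a regular expression. This encoding is an assumption made for illustration (real DTD validation works on element trees), and the names below are hypothetical.

```python
import re

# Content models from the slide, written as regular expressions over child names.
INFERRED = re.compile(r"b*c*d*")                             # the "inferred" DTD
TAU2 = re.compile(r"(bb)*(cc)*(dd)*|(bb)*b(cc)*c(dd)*d")     # the target DTD


def output_children(n_employees):
    """Children of <a> produced by f when Employee has n_employees entries."""
    return "b" * n_employees + "c" * n_employees + "d" * n_employees


# Every real output b^n c^n d^n conforms to the target DTD (checked for small n)...
all_outputs_conform = all(
    TAU2.fullmatch(output_children(n)) is not None for n in range(50)
)

# ...but the inferred type also admits sequences f never produces, such as "bcc",
# which the target DTD rejects, so the check "inferred type ⊆ target" fails.
counterexample = "bcc"
inferred_accepts = INFERRED.fullmatch(counterexample) is not None
target_accepts = TAU2.fullmatch(counterexample) is not None
```

The set {b^n c^n d^n} is not regular, so no DTD content model can describe it exactly; any inferred regular type has to over-approximate, and that over-approximation is what breaks the containment check here.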
![Page 33: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/33.jpg)
The Typechecking Problem in Theory and Practice
• In practice, we care about typechecking
• Question for theory: is this possible ?
• Positive result [Milo, Suciu, Vianu, 2000]:
– Decidable for k-pebble tree transducers
– Hence: decidable for:
  • Join-free XQuery
  • Simple XSLT programs
• Negative result [Alon, Milo, Neven, Suciu, Vianu 2001]:
– Undecidable for transformations with value joins
![Page 34: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/34.jpg)
The Typechecking: Finale
Prediction: systems will continue to use type inference, but will never be as robust as type checking in programming languages
Need to understand well their applicability
![Page 35: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/35.jpg)
XML Storage
Problem:
• Given: a (large) XML data instance
• Goal: store/process it in a RDBMS
• Problem: find the relational schema !
• Current approaches:
– Generic schema [Florescu, Kossmann 99]
– Derive schema from DTD [Shanmugasundaram et al 99]
– Derive schema from XML data [Deutsch, Fernandez, Suciu 99]
![Page 36: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/36.jpg)
The Theory of XML Storage
• The simplest case: flat, unique subelements
M =
Oid         E1  E2  E3  E4  …  E5000
&1           1   0   0   1  …  0
&2           0   1   1   0  …  0
&3           0   1   0   1  …  0
&4           0   1   1   1  …  0
&5           1   0   1   0  …  0
&6           1   1   0   0  …  0
…            …   …
&10000000    0   1   0   0  …  0
• How do we cover all 1’s most economically ?
– R1(E2, E3, E4), R2(E1, E5, E9, E12), …
![Page 37: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/37.jpg)
The Theory of XML Storage
• XML storage and matrix rank
M =
Oid         E1  E2  E3  E4  …  E5000
&1           1   0   0   1  …  0
&2           0   1   1   1  …  0
&3           0   1   1   1  …  0
&4           0   1   1   1  …  0
&5           1   1   0   0  …  0
&6           1   1   0   0  …  0
&7           0   0   0   1  …  …
…            …   …
&10000000    1   0   0   1  …  0
• Can store XML data in k relations ⇒ rank(M)=k
• Conversely: if rank(M)=k what about storage ?
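The link between storage and rank can be explored on the first four columns of rows &1 to &7 above. The sketch assumes one particular storage scheme chosen for illustration, not taken from the talk: one relation per distinct set of subelements. Each such relation is an all-ones block of M, which is a rank-one matrix, so k relations bound the rank from above by k. The helper names are hypothetical.

```python
from fractions import Fraction


def rank(matrix):
    """Rank over the rationals via Gauss-Jordan elimination with exact arithmetic."""
    rows = [[Fraction(v) for v in r] for r in matrix]
    ncols = len(rows[0]) if rows else 0
    rk = 0
    for col in range(ncols):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][col] != 0:
                factor = rows[i][col] / rows[rk][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk


def relations_by_pattern(objects):
    """Group oids by the exact set of columns (subelements) they contain."""
    groups = {}
    for oid, row in objects.items():
        cols = tuple(j for j, v in enumerate(row) if v)
        if cols:
            groups.setdefault(cols, []).append(oid)
    return groups


# Columns E1..E4 of rows &1..&7 from the slide.
objects = {
    "&1": [1, 0, 0, 1],
    "&2": [0, 1, 1, 1],
    "&3": [0, 1, 1, 1],
    "&4": [0, 1, 1, 1],
    "&5": [1, 1, 0, 0],
    "&6": [1, 1, 0, 0],
    "&7": [0, 0, 0, 1],
}
relations = relations_by_pattern(objects)
k = len(relations)
r = rank(list(objects.values()))
```

On this fragment the bound is tight. In general the pattern-based scheme can need more relations than the rank, which is one way to read the slide's converse question.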
![Page 38: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/38.jpg)
XML Storage: Finale
Prediction: we will see several clever XML storage techniques discovered in the near future
![Page 39: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/39.jpg)
The Data Distribution
• Many data consumers, many places to cache
• Data can be replicated, transformed
– How to transform it ? The view selection problem
– Where to place it ? The data distribution problem.
NP-complete
Prediction: no predictions here (too early…)
![Page 40: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington](https://reader036.vdocuments.net/reader036/viewer/2022081512/5a4d1b617f8b9ab0599ad66d/html5/thumbnails/40.jpg)
Conclusions: Resurrection of Database Theory
• Is theory irrelevant ?
– [Papadimitriou, 95]: wrong question to ask
  • Respect for practice: only a recent development in human culture
  • Applicability pressure in CS: annoying trend of last 10 years or so
• Database theory: are we in a revolution ?
– The past: researchers created artifacts for the industry
– Today: society (Web, W3C) is creating artifacts for researchers to study, improve
Prediction: there will be no difference between theory and practice… at least, in theory !