Omar Benjelloun - New Bases for New Data
New Bases for New Data
Omar Benjelloun, Stanford University
January 27th, 2006
Relational databases are great
A simple, understandable model for data
High-level, declarative language for queries and updates: SQL
Efficient optimization techniques
Relational databases are the cornerstone of the management of homogeneous, regular, exact, centralized information
Boss
Emp   Manager
Joe   Bill
Bill  Steve
… but data has changed
• Data is distributed, behind applications, dynamically changing
• Data is heterogeneous
• Data may be uncertain
Today
• Data is stored in relational databases (or XML)
• Techniques for data integration, data exchange
• … Lots of code
Traditional Database Management Systems (DBMS’s) are too rigid
New characteristics should be represented in the data
New bases are needed
• Foundations (models and languages)
• Processing and optimization techniques
Applications
Information integration
• Data is distributed on multiple heterogeneous, independent sources
• Conflicting information from the sources: inconsistency, uncertainty
• Varying and evolving reliability of sources
• Where data came from can be critical information
Scientific data management
Receptor (e.g., sensor) data management
Data cleaning (entity resolution)
And many others…
Agenda
Distributed and dynamic data: Active XML
• A “glue” language to connect data and programs
• XML documents with embedded calls to Web services
• Distributed interactions through the exchange of AXML data
• Techniques to query and control the exchange of AXML data
Uncertain data: ULDB’s
• An extension of the relational model with uncertainty and lineage
• Efficient query evaluation
• Computing probabilities
Conclusion
Active XML
Distributed data management
Information is everywhere
• Data warehouses
• Databases
• Web sites
• PC, PDA, cell phones, home appliances, cars…
[Figure: peers holding XML data and services, connected through the Internet and Web services]
The golden triangle of distributed data management
XML: a standard for data representation & exchange
• Extensible Markup Language
• Labeled ordered trees
• Rich types: XML Schema
Query languages
• XPath, XQuery
Web services
• Standards for distributed computing
[Figure: triangle whose corners are XML, XQuery/XPath and SOAP/WSDL]
What is Active XML (AXML)?
AXML is a declarative language for distributed information management and an infrastructure to support this language, in a peer-to-peer framework.
Active XML documents
XML documents with embedded calls to Web services
Intensional
• Some of the data is given explicitly
• Some is given intensionally (i.e. the means to acquire data when needed are given)
Dynamic
• If the external sources change, the same document will provide different information
• Reaction to world changes
Not a new idea in databases, nor on the Web
Mixing calls and data is an old idea
• Procedural attributes in relational systems
• Basis of Object-oriented Databases
In Web programming
• Sun’s JSP, PHP+MySQL
Calls to Web services inside documents
• Macromedia FLEX, Apache Jelly, Microsoft XAML
What is new is the exploitation of the idea…
Web services in brief
A number of standards
• XML
• SOAP: Exchange of messages between applications
• WSDL: Description of service interfaces (e.g. input/output types)
• UDDI: Advertisement and discovery of services
• … other proposed standards (choreography, security, etc.)
For us: means to provide, invoke and describe remote functions with XML input/output.
They make AXML documents universally understandable.
A sample AXML document
<?xml version="1.0" ?>
<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp">
    <city>Paris</city>
  </call>
  <call svc="TimeOut.GetEvents">
    exhibits
  </call>
</newspaper>
[Figure: the same document drawn as a tree, with the GetTemp and GetEvents calls as nodes under newspaper]
AXML documents may contain calls:
• to any existing Web services (e-bay.net, google.com…)
• to any AXML Web services (to be defined)
Materialization
• Replacing the call by its result is not the only option
• Calls are not necessarily RPC-style synchronous invocations
[Figure: the sample newspaper document and its tree; the GetTemp call is sent to Y! as a SOAP call, which returns <temp>16°C</temp>]
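A minimal sketch of the simplest form of materialization, in which a call is replaced by its result. The service registry, the `get_temp` stub and the `materialize` helper are hypothetical stand-ins for a real SOAP invocation, and the returned temperature is hard-coded. Calls whose service is unknown are left in the document, which keeps that part intensional.

```python
import xml.etree.ElementTree as ET

DOC = """<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp"><city>Paris</city></call>
  <call svc="TimeOut.GetEvents">exhibits</call>
</newspaper>"""


def get_temp(call):
    """Hypothetical local stub standing in for the remote GetTemp service."""
    temp = ET.Element("temp")
    temp.text = "16°C"  # hard-coded answer, for illustration only
    return [temp]


# Assumed registry of the services this peer is willing and able to invoke.
SERVICES = {"Yahoo.GetTemp": get_temp}


def materialize(root, services):
    """Replace every <call> whose service is registered by the elements it returns."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != "call" or child.get("svc") not in services:
                continue  # not a call, or a call we choose not to invoke
            position = list(parent).index(child)
            parent.remove(child)
            for offset, result in enumerate(services[child.get("svc")](child)):
                result.tail = child.tail
                parent.insert(position + offset, result)
    return root


root = materialize(ET.fromstring(DOC), SERVICES)
print(ET.tostring(root, encoding="unicode"))
```

After the run, the newspaper holds a temp element where the GetTemp call used to be, while the GetEvents call is still embedded and can be invoked later or by another peer.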
AXML Web services
Parameters: AXML data
Result: AXML data
Distribute computations: by sending as parameters data containing service calls, one can delegate some work to other peers.
Partial computations: by returning data containing service calls, one can give to the receiver the control of these calls.
Great flexibility
Distributed interactions
Exchanging Active XML
To call or not to call?
Materialization can be performed by the sender, before sending a document… or by the receiver, after receiving it.
[Figure: the newspaper tree on the sender side and on the receiver side; the GetTemp call (city “Paris”) is answered by Y! with temp “16°C”, while the GetEvents call (“Exhibits”) stays in the document]
Why control the materialization of calls?
For added functionality, e.g.
• Intensional data makes it possible to get up-to-date information.
For security reasons or capabilities, e.g.
• I don’t trust this Web service/domain,
• I don’t have the right credentials to invoke it,
• It costs money,
• Maybe the receiver doesn’t know Active XML!
For performance reasons, e.g.
• A proxy can invoke all the services on behalf of a PDA.
… and many more reasons you can think of!
How to control it? Using types
We extend XML Schema with intensional types: XMLSchema_int
Static analysis algorithms use signatures of services: WSDL_int
[Figure: a sender and a receiver, each with its own capabilities, ACL, cost...; the sender’s document, which contains calls to f, g, q and r, is exchanged according to a data exchange schema]
The extended schema language
To simplify, we use here a DTD-like syntax
Data:
newspaper = title.date.(GetTemp|temp).(GetEvents|exhibit*)
title = data
date = data
temp = data
city = data
exhibit = title.(GetDate|date)
Functions:
GetTemp(city) -> temp
GetEvents(data) -> (exhibit|performance)*
GetDate(title) -> date
Rewriting: replace call(s) by an arbitrary output of the service.
[Figure: the sample newspaper tree, with the GetTemp and GetEvents calls]
Rewritings
The Goal: Given
• an AXML document d
• a schema s,
can we rewrite d so that it matches s?
Safe rewriting: one that for sure leads to s (we know without making any call)
Possible rewriting: one that may lead to s (depending on the answers of services)
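To make the two notions concrete, here is a brute-force sketch restricted to depth-1 rewritings of the children of newspaper. It is not the automata-based algorithm of the talk: it enumerates which calls to invoke and tries a bounded sample of outputs for each signature. The helper names and the bound on the length of starred outputs are assumptions, so the “safe” verdict is only as good as that sample.

```python
import re
from itertools import product


def sample_star(symbols, max_len=2):
    """Finite sample of (a|b|...)*: all words up to max_len (an assumed bound)."""
    return [list(w) for n in range(max_len + 1) for w in product(symbols, repeat=n)]


# Signatures from the schema: GetTemp -> temp, GetEvents -> (exhibit|performance)*
OUTPUTS = {
    "GetTemp": [["temp"]],
    "GetEvents": sample_star(["exhibit", "performance"]),
}


def matches(word, schema):
    """Check a word of labels against a regex in which every label ends with a space."""
    return re.fullmatch(schema, "".join(label + " " for label in word)) is not None


def classify(word, schema):
    """Return 'safe', 'possible' or 'none' for the depth-1 rewritings of word."""
    calls = [i for i, label in enumerate(word) if label in OUTPUTS]
    verdict = "none"
    # Existential choice: which calls to invoke (True) and which to keep (False).
    for choice in product([False, True], repeat=len(calls)):
        invoked = [i for i, chosen in zip(calls, choice) if chosen]
        outcomes = []
        # Universal part: every sampled answer of the invoked services.
        for answers in product(*(OUTPUTS[word[i]] for i in invoked)):
            replaced = dict(zip(invoked, answers))
            rewritten = []
            for i, label in enumerate(word):
                rewritten.extend(replaced.get(i, [label]))
            outcomes.append(matches(rewritten, schema))
        if all(outcomes):
            return "safe"
        if any(outcomes):
            verdict = "possible"
    return verdict


WORD = ["title", "date", "GetTemp", "GetEvents"]
print(classify(WORD, r"title date temp (GetEvents |(exhibit )*)"))
print(classify(WORD, r"title date temp (exhibit )*"))
```

With the first target, invoking GetTemp and keeping GetEvents always matches, so the rewriting is safe. With the second target, GetEvents has to be invoked and may return a performance, so the rewriting is only possible.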
Difficulties
Infinite search space
• Vertical
• Horizontal
Main problem
• The result of a Web service call is unknown
• We just know a signature (input/output types)
We want a very efficient solution
Foundations of the problem
• String & tree automata,
• with existential and universal transitions.
Results
The general problem is undecidable [MSS03]
Restrictions on the considered rewritings
• Left-to-right: No “going back and forth”
• K-depth: bound on the nesting of function calls (search space still infinite but finitely representable)
Under these restrictions
• We have algorithms to find safe/possible rewritings.
• They are PTIME (for deterministic schemas).
• We can also do it between schemas.
Implementation
• Demo at VLDB 2003 (customizable news syndication)
Safe rewriting algorithm (flavor)
Build an FSA that accepts all k-depth rewritings of the initial word.
Build an FSA that recognizes the complement of the target type.
[Figure: the automaton $A_w^1$ for the word title date GetTemp GetEvents, with states q0 to q7 and extra transitions on temp, exhibit and performance; the complement automaton $\overline{A}$, with states p0 to p6 and * transitions]
Safe rewriting algorithm
Compute the intersection of these languages: $A_{\times} = A_w^k \times \overline{A}$
A smart marking determines whether a safe rewriting exists.
Then run the word on the marked automaton to find an actual rewriting.
Optimizations:
• lazy construction of the automata
• parallel evaluation of calls
[Figure: the product automaton, with states (q0,p0), (q1,p1), (q2,p2), (q3,p3), (q4,p4), … and transitions on title, date, GetTemp, temp, GetEvents, exhibit and performance]
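The “intersect with the complement” step can be sketched on its own, as below. This covers only the language-inclusion part: the marking that separates the sender’s choices (which call to invoke) from the services’ answers is not modeled. The dictionary encoding of automata and all helper names are assumptions made for this sketch, and the automata must be deterministic.

```python
from collections import deque


def make_dfa(start, accept, transitions, alphabet):
    """Complete a partial transition table with a rejecting 'sink' state."""
    states = {start, "sink"} | set(accept)
    for (src, _symbol), dst in transitions.items():
        states |= {src, dst}
    delta = {(s, a): transitions.get((s, a), "sink") for s in states for a in alphabet}
    return {"states": states, "start": start, "accept": set(accept), "delta": delta}


def complement(dfa):
    """Flip the accepting states of a complete DFA."""
    return {**dfa, "accept": set(dfa["states"]) - set(dfa["accept"])}


def intersection_is_empty(a, b, alphabet):
    """Explore the product automaton breadth-first, looking for a jointly accepting state."""
    start = (a["start"], b["start"])
    seen, queue = {start}, deque([start])
    while queue:
        p, q = queue.popleft()
        if p in a["accept"] and q in b["accept"]:
            return False
        for symbol in alphabet:
            nxt = (a["delta"][p, symbol], b["delta"][q, symbol])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


ALPHABET = ["temp", "exhibit", "performance"]
# Words obtainable once every call is invoked: temp.(exhibit|performance)*
rewritings = make_dfa("r0", {"r1"}, {
    ("r0", "temp"): "r1", ("r1", "exhibit"): "r1", ("r1", "performance"): "r1",
}, ALPHABET)
# Target type temp.exhibit*
target = make_dfa("t0", {"t1"}, {
    ("t0", "temp"): "t1", ("t1", "exhibit"): "t1",
}, ALPHABET)

# False: "temp performance" is a rewriting that falls outside the target.
print(intersection_is_empty(rewritings, complement(target), ALPHABET))
```

An empty intersection means every rewriting lands inside the target type. A non-empty one yields a counterexample word, here one that contains a performance.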
Querying Active XML
Querying AXML Data
Given a (tree pattern) query:
/newspaper[temp > 18°C]/exhibits//exhibit[location=“Le Louvre”]
Materialize the document?
Call only the services that may contribute data to the query answer.
The problem: Lazy evaluation of service calls
To call or not to call, this time when evaluating a query
[Figure: a newspaper tree with title “Le Monde”, calls to getDate, GetTemp (city “Paris”), GetEvents (“Exhibits”) and, under exhibits, GetExhibits (City “Paris”); GetTemp returns temp “19°C”]
Lazy evaluation
Difficulties:
• Calls can be found everywhere in the document
• May appear dynamically (as a result of previous calls)
• May become (ir)relevant due to previous invocations
• Need to take signatures of calls into consideration
A possible approach: modify the query processor
• Top-down evaluation
• Trigger the calls found on the way
• Not so great:
  – Computation is blocked
  – Optimization opportunities are lost
NFQ’s
Given a query to evaluate:
Derive a set of “node-focused” queries (NFQ) that find the relevant calls when evaluated on the document.
Need to be reevaluated, as the document evolves!
[Figure: the tree pattern query (newspaper, temp > 18°C, exhibits, exhibit, location “Le Louvre”) and NFQs derived from it, in which * nodes appear, etc.]
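A much simplified sketch of the idea of finding relevant calls, assuming a linear child-axis path query without predicates or descendant steps (so simpler than the tree pattern above). A call is kept when it sits at a position where its result could supply the next step of the path; an optional table of output types prunes further. The document, the GetAds service and the `relevant_calls` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<newspaper>
  <title>Le Monde</title>
  <call svc="GetTemp"><city>Paris</city></call>
  <exhibits>
    <call svc="GetExhibits"><city>Paris</city></call>
  </exhibits>
  <ads><call svc="GetAds"/></ads>
</newspaper>"""


def relevant_calls(root, steps, signatures=None):
    """Return the services of calls located along the path described by steps."""
    found = []

    def visit(node, depth):
        # node matches steps[:depth]; wanted is the label the query needs next.
        wanted = steps[depth] if depth < len(steps) else None
        for child in node:
            if child.tag == "call":
                svc = child.get("svc")
                outputs = signatures.get(svc) if signatures else None
                # Keep the call unless its known output type rules it out.
                if wanted is None or outputs is None or wanted in outputs:
                    found.append(svc)
            elif child.tag == wanted:
                visit(child, depth + 1)

    if steps and root.tag == steps[0]:
        visit(root, 1)
    return found


root = ET.fromstring(DOC)
QUERY = ["newspaper", "exhibits", "exhibit"]  # stands for /newspaper/exhibits/exhibit
SIGNATURES = {"GetTemp": {"temp"}, "GetExhibits": {"exhibit"}}  # assumed output types

print(relevant_calls(root, QUERY))              # position alone
print(relevant_calls(root, QUERY, SIGNATURES))  # position plus type filtering
```

Position alone already discards the call under ads. Adding the signatures also discards GetTemp, because a temp element cannot match the exhibits step of this simplified query. In a real setting the search would have to be repeated after each invocation, since results may bring new calls.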
Optimizations
Service calls sequencing
• Analysis of the relationship between calls (through the NFQ’s)
• Layering, and parallelization inside each layer.
Filtering by type analysis
• Match output types of services to the data expected by queries
“Pushing” queries to capable services
Acceleration:
• Via relaxation:
  – NFQ approximation
  – Superset of the relevant calls
• Via a special access structure, similar to a DataGuide:
  – Restricted to paths that lead to service calls
  – Indexes the calls
Experimental assessment
• 10x speed-up when combining optimizations
There is more…
The AXML peer system
• Manages persistent AXML documents
• Provides AXML services
• Open source
Language extensions to control the activation of calls
Continuous services
Theoretical foundations
…check out http://www.activexml.net
Uncertain data
Basic Premise
Traditional relational DB
• Every data item’s value must be exact
• Every data item is in the database or not
• Where data came from and how it evolves is not important
ULDB’s relax these constraints by making
1. Data
2. Uncertainty
3. Lineage
all first-class interrelated concepts
Previous work
Models for uncertainty
• Labeled nulls, c-tables, probabilistic models,...
Trade-off between
• Expressiveness
• Simplicity of representation, complexity of operations
• We investigated this space in [DBHM06]
Models for lineage
• In relational databases, data warehouses
• Definition of lineage can be tricky for complex queries
First to consider lineage together with uncertainty
Uncertainty
[Figure: relation SAW(Witness, Car) with the rows (Granny, VW) and (Cop, Ford), the alternate (Granny, BMW) and a ‘?’ annotation; callouts label an “x-tuple”, an “alternate” and a “maybe” tuple; a variant with (Cop, VW) is also shown]
Possible worlds: listed next to the relation, e.g. {(Granny, VW), (Cop, Ford)} and {(Granny, BMW), (Cop, Ford)}
Simple formalism
• not complete
• not closed under joins
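A small sketch of how x-tuples with alternates and ‘?’ expand into possible worlds. The encoding is an assumption made for this sketch: each x-tuple is a pair (alternates, maybe), and which x-tuple carries the ‘?’ is a guess, since the figure did not survive extraction.

```python
from itertools import product

# SAW(Witness, Car). Assumed reading of the slide: Granny saw a VW or a BMW,
# and the Cop may have seen a Ford (the '?' on this x-tuple is an assumption).
SAW = [
    ([("Granny", "VW"), ("Granny", "BMW")], False),
    ([("Cop", "Ford")], True),
]


def possible_worlds(relation):
    """Pick one alternate per x-tuple; a 'maybe' x-tuple may also be absent."""
    choices = []
    for alternates, maybe in relation:
        options = list(alternates)
        if maybe:
            options.append(None)  # None stands for "this x-tuple is absent"
        choices.append(options)
    return [
        frozenset(row for row in world if row is not None)
        for world in product(*choices)
    ]


worlds = possible_worlds(SAW)
for world in worlds:
    print(sorted(world))
```

Under this reading there are four worlds: two alternates for Granny, each with or without the Cop’s tuple. Choices made in different x-tuples are independent of each other, which is why this formalism alone cannot express every set of worlds.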
Lineage
SAW
Witness  Car
Granny   VW
Cop      Ford

OWNS
Suspect  Car
Chris    VW
Chris    BMW
Mike     VW
Mike     Ford

ACCUSES (witness, suspect)
Witness  Suspect
Granny   Chris
Granny   Mike
Cop      Mike
ULDB’s
[Figure, shown in two animation steps: the SAW, OWNS and ACCUSES relations of the previous slide, now with the alternate (Granny, BMW) in SAW, a second (Granny, Chris) tuple in ACCUSES, and ‘?’ annotations]
Properties
ULDB’s are simple
• x-tuples: set of alternate tuples, with or without ‘?’
• lineage: associates with each alternate a set of alternates / external symbols
ULDB’s are expressive
• Complete: can represent any finite set of possible worlds (with lineage)
• Simple implementation of monotonic queries, with correct lineages
• Natural probabilistic extension
ULDB’s are efficient
• Query processing can use existing query optimizers
• Tuple certainty/membership can be tested in polynomial time
Query processing
Querying ULDB’s
[Figure: commutative diagram. Top row (ULDB’s): the algorithm maps D to Q(D). Bottom row (relational databases with lineage): the possible worlds D1, D2, …, Dn of D are mapped by the query semantics to Q(D1), Q(D2), …, Q(Dn).]
Q(Di): add query result as new relation and lineage to Di
Algorithm
[Figure: worked example on SAW(Witness, Car), OWNS(Suspect, Car) and ACCUSES(Witness, Suspect), projected on witness, suspect; SAW gains tuples for (Granny, BMW) and (Kid, Ford) marked ‘?’, and ACCUSES gains (Granny, Chris) and (Kid, Mike) marked ‘?’]
Properties
Efficient algorithm
• Query processing phase can use standard query optimizer
• Lineages are easy to propagate
• “Grouping” phase requires a single pass on the result
Initial prototype
• represents a ULDB as a relational DB
• uses simple query rewriting techniques
Algorithm works for any monotonic query (including SPJU queries)
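The two phases can be sketched on the SAW/OWNS example as follows. The identifier scheme for alternates, the grouping key (the pair of source x-tuples) and the omission of ‘?’ handling are all assumptions of this sketch, not details given on the slides.

```python
from collections import defaultdict

# Alternates keyed by (relation, x-tuple number, alternate number): assumed encoding.
SAW = {
    ("SAW", 1, 1): ("Granny", "VW"),
    ("SAW", 1, 2): ("Granny", "BMW"),
    ("SAW", 2, 1): ("Cop", "Ford"),
}
OWNS = {
    ("OWNS", 1, 1): ("Chris", "VW"),
    ("OWNS", 2, 1): ("Chris", "BMW"),
    ("OWNS", 3, 1): ("Mike", "VW"),
    ("OWNS", 4, 1): ("Mike", "Ford"),
}


def accuses(saw, owns):
    """Join SAW and OWNS on Car, project on (witness, suspect), keep lineage."""
    # Phase 1: an ordinary join over all alternates, recording where each row came from.
    rows = []
    for sid, (witness, car) in saw.items():
        for oid, (suspect, owned) in owns.items():
            if car == owned:
                rows.append(((witness, suspect), (sid, oid)))
    # Phase 2: one pass that groups rows into result x-tuples.
    # Grouping by the pair of source x-tuples is an assumption of this sketch.
    xtuples = defaultdict(list)
    for value, (sid, oid) in rows:
        xtuples[sid[:2], oid[:2]].append((value, (sid, oid)))
    return dict(xtuples)


result = accuses(SAW, OWNS)
for key, alternates in result.items():
    print(key, alternates)
```

Phase 1 is a plain relational join, which is why an existing optimizer can run it. The recorded lineage is what later tells that the two (Granny, Chris) results cannot hold together, since they derive from different alternates of the same SAW x-tuple.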
Probabilities
Probabilistic ULDB’s
Semantics: As before, with a probability for each possible world
Without lineages
• Alternates of the same x-tuple correspond to disjoint events
• Alternates of different x-tuples correspond to independent events
Lineages
• Capture correlations
• Help propagate probabilities for query results
[Figure: relation SAW(Witness, Car) with the alternates (Granny, VW), (Granny, BMW), (Cop, Ford) and (Cop, VW), a ‘?’ annotation, and the probabilities 0.2, 0.5, 0.3, 0.7, 0.3 attached to them]
Probabilistic query answering
Compute queries as before
Compute probabilities on demand
• Traverse lineages transitively to the leaves
• Combine probabilities of reached alternates
Optimizations: memoize probabilities, efficiently detect ‘closest independent ancestors’
[Figure: a lineage graph over ‘?’ tuples whose alternates carry the probabilities 0.2, 0.3, 0.4, 0.1, 0.3, 0.5 and 1]
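A sketch of on-demand probability computation under simplifying assumptions: lineage is purely conjunctive (every derived alternate needs all the alternates in its lineage), base alternates of different x-tuples are independent, and alternates of the same x-tuple are disjoint. The identifiers and probability values are hypothetical, and neither memoization nor the ‘closest independent ancestors’ optimization is shown.

```python
from math import prod

# Hypothetical base alternates, keyed by (relation, x-tuple, alternate).
BASE_PROB = {
    ("SAW", 1, 1): 0.7,
    ("SAW", 1, 2): 0.3,
    ("OWNS", 1, 1): 0.5,
}

# Conjunctive lineage: each derived alternate needs ALL of its parents (assumption).
LINEAGE = {
    "accuses1": {("SAW", 1, 1), ("OWNS", 1, 1)},
    "q1": {"accuses1", ("SAW", 1, 1)},            # shares an ancestor with accuses1
    "conflict": {("SAW", 1, 1), ("SAW", 1, 2)},   # two alternates of one x-tuple
}


def leaves(alt, lineage):
    """Follow lineage transitively down to the base alternates."""
    if alt not in lineage:
        return {alt}
    reached = set()
    for parent in lineage[alt]:
        reached |= leaves(parent, lineage)
    return reached


def probability(alt, lineage, base_prob):
    """Multiply the probabilities of the distinct base alternates that alt depends on."""
    base = leaves(alt, lineage)
    xtuples = [b[:2] for b in base]
    if len(set(xtuples)) < len(xtuples):
        return 0.0  # two alternates of the same x-tuple never co-occur
    return prod(base_prob[b] for b in base)


print(probability("accuses1", LINEAGE, BASE_PROB))
print(probability("q1", LINEAGE, BASE_PROB))
print(probability("conflict", LINEAGE, BASE_PROB))
```

Collecting the leaves as a set is what accounts for correlations: q1 reaches the same SAW alternate twice but counts it once, so it gets the same probability as accuses1 rather than a smaller product.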
Future work
Richer queries
• Duplicate elimination, difference, aggregation
• Supported through new kinds of lineages (e.g., disjunctive, negative)
• Querying the uncertainty and the lineage
More operations
• Updates (and their lineage), close to versioning
• “Uncertain operations”, e.g., entity resolution, inconsistency repairs
More optimization techniques
More theory
Conclusion
New “Bases” for new data
The database way
• Simple models
• Declarative languages
• Optimization techniques
… for new features of data
• Distribution and decentralization: Active XML
• Uncertainty and lineage: ULDB’s
There are more challenges
• Real-world side effects, semantic reasoning
and strong requirements
• Security, privacy, personalization
Big challenge: Doing it all in a coherent way
• One “big” model?
• Integration of models?
Thank you