25 nov 2001sdbi 20011 sigalit sima sigalit batiashvili kdoshim representing & querying xml with...
25 Nov 2001, SDBI 2001
Sigalit Sima Sigalit
Batiashvili Kdoshim
Representing & Querying XML With Incomplete Information
Serge Abiteboul
Luc Segoufin
Victor Vianu
2
Goals
• Motivation
- Data on the WEB
- Incompleteness problem
• Representation System
• Refine Algorithm
• Querying the Incomplete Information
- CWA approach
- Answer using sub-queries (CWA+OWA)
3
Introduction & Motivation
4
Data On The WEB
• Partial information is known
- expiration of data
- unavailable sites
- modification of data, etc.
• Irregular structure
- self-describing
Semistructured Data
5
Data On The Web
Problem: Data no longer fits into tables (no rigid structure).
We want: To apply database-like functionality to access data on the WEB.
Focus: The XML-ized portion of the WEB
6
XML
• eXtensible Markup Language
• The lingua franca of the WEB
• Facilitates the use of database techniques to manage WEB data
• Brings order
- nested tags (similar to record structure)
- ordered sub-elements
- structure (DTD, XML Schema)
DTD (Document Type Definition)
Defines constraints on the XML document structure
7
XML Example
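<person>
  <name> John Smith </name>
  <addr> Green Field Park N.Y. </addr>
  <email> john.smith@infineon.com </email>
</person>
[Figure: the same document drawn as a tree: root "person" with children "name", "addr", "email" and leaf values "John Smith", "Green Field Park N.Y.", "john.smith@..."]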
8
View XML As Trees
<person>
  <name> John Smith </name>
  <addr> Green Field Park N.Y. </addr>
  <email> john.smith@infineon.com </email>
</person>
[Figure: tree with root "person", children "name", "addr", "email", and leaves "John Smith", "Green Field Park N.Y.", "john.smith@infineon.com"]
DTD: person → name addr email*
9
Webhouse
Webhouse
- A collection of web site sources
- Context: XML
- Holds a DTD that describes the structure of the sources
Warehouse
- A collection of information from many sources
10
Webhouse Maintenance
The webhouse is continuously enriched by exploring web sites.
Technique: Crawling the web.
11
Webhouse
Information held in the webhouse is never complete.
Why?
• Dynamic nature of WEB data
• Limited storage capacity
• Expiration of data
• Modification of data
• Etc.
12
The Problem
Posing a query against the webhouse may yield an incomplete answer:
- Documents satisfying the query are missing from the webhouse
- Relevant data is missing from a document
13
Solution
Two main approaches
• Closed World Assumption (CWA): if some information does not appear explicitly, it does not hold.
- possible method: Best Effort
- possible method: Fetch Data
• Open World Assumption (OWA): anything not ruled out is possible.
14
Solution Methods
• Best Effort: answer according to the available information.
• Fetch Data: search the sources for additional information in order to provide a complete answer.
15
Fetch Data
We use the Fetch Data approach. We would like to be able to define what additional resources we are looking for.
How?
• Define the missing portion of the data using the available information.
• Thus, determine which additional exploration of WEB sources is needed.
16
Example
Given the DTD
catalog → product+
product → name price cat picture*
cat → subcat
Query 1: Find the name, price & subcategories of electronics products with price < $200
[Figure: Query 1 as a tree pattern: catalog / product with children name, price (<200), cat (=elec); cat has child subcat]
17
Answer to Query 1
[Figure: catalog with three product children]
name    price  cat   subcat
Canon   120    elec  camera
Nikton  199    elec  camera
Sony    175    elec  cdplayer
18
Given the DTD
catalog → product+
product → name price cat picture*
cat → subcat
Query 2: Find the name & pictures of all cameras that have a picture
[Figure: Query 2 as a tree pattern: catalog / product with children name, cat (=elec), picture; cat has child subcat (=camera)]
19
Answer Strategy
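We already have..
- Query 1: electronics products with price < $200
- Query 2: cameras with a picture
[Figure: overlapping regions "Elec, price < $200" (Query 1) and "Camera with picture" (Query 2); the overlap is "Camera with picture, price < $200"]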
20
Answer Strategy (cont)
We need..
- Cameras with a picture and price >= $200
[Figure: the regions "Elec, price < $200" (Query 1) and "Camera with picture" (Query 2); the part of Query 2 not covered by Query 1 is "Camera with picture, price >= $200"]
• No need to query the Web for the whole query
• Define the missing information
• Reduce the search space
21
Representation System
22
Framework
• Define the data model
- for the webhouse repository (XML data)
• Define the constraint model
- simplified DTD
• Define the query language
• Define the representation system for the incomplete information
23
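Data Model
[Figure: data tree with root "catalog" and two "product" children; each product has children name, price, cat, and cat has child subcat. Values: (Canon, 120, elec, camera) and (Nikton, 199, elec, camera)]
Tdata = <t, λ, ν>
t: rooted tree with set of nodes N
λ: N → Σ, labeling function (Σ = node labels)
ν: N → Q, value mapping (Q = data values)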
24
Data Tree Prefix
<t', λ', ν'> is a prefix of Tdata = <t, λ, ν>
[Figure: a data tree with one of its prefixes highlighted]
25
Tree Type
Ttype = <Σ, r, μ>
Σ: element names
r: root label (e.g. catalog)
μ: DTD as regular expression, μ(a) = a1^w1 a2^w2 … an^wn
wi = 1: exactly one child labeled ai
wi = ?: at most one child labeled ai
wi = +: at least one child labeled ai
wi = *: zero or more children labeled ai
[Figure: node a with children a1, a2, …, an, the edges annotated with w1, w2, …, wn]
26
Tree Type Satisfaction
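[Figure: Ttype: catalog → product (+); product → name (1), price (1), cat (1), picture (*); cat → subcat (1)]
[Figure: Tdata: catalog with one product having name nikon, price 199, cat=elec, subcat=camera, picture c.jpeg]
Tdata satisfies Ttype.
rep(Ttype) = {Tdata : Tdata ⊨ Ttype}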
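The satisfaction check can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `Node`, `BOUNDS`, `conforms` and `satisfies` are invented here, and μ is encoded as a plain dictionary from a label to its children's multiplicities.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    value: object = None
    children: list = field(default_factory=list)

# Allowed (min, max) number of children for each multiplicity symbol.
BOUNDS = {"1": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}

def conforms(node, mu):
    """Check the children of every node against mu[label] = {child_label: multiplicity}."""
    allowed = mu.get(node.label, {})
    counts = Counter(child.label for child in node.children)
    if any(label not in allowed for label in counts):
        return False  # a child label the type does not mention
    for label, mult in allowed.items():
        low, high = BOUNDS[mult]
        if counts[label] < low or (high is not None and counts[label] > high):
            return False
    return all(conforms(child, mu) for child in node.children)

def satisfies(tree, root_label, mu):
    return tree.label == root_label and conforms(tree, mu)

# The catalog tree type from the slide.
MU = {
    "catalog": {"product": "+"},
    "product": {"name": "1", "price": "1", "cat": "1", "picture": "*"},
    "cat": {"subcat": "1"},
}

catalog = Node("catalog", children=[Node("product", children=[
    Node("name", "nikon"), Node("price", 199),
    Node("cat", "elec", [Node("subcat", "camera")]),
    Node("picture", "c.jpeg"),
])])
empty_catalog = Node("catalog")  # violates product+

print(satisfies(catalog, "catalog", MU), satisfies(empty_catalog, "catalog", MU))
```

With this encoding, rep(Ttype) is simply the set of trees for which `satisfies` returns true.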
27
Prefix-Selection Query
• We defined the structure of webhouse data using Tree Types.
• It is natural to define a pattern based query (tree format).
• The matching will thus be done by browsing the input tree.
• Such a query is called a PS query.
28
PS-Query Example
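[Figure: Ttype: catalog → product (+); product → name (1), price (1), cat (1), picture (*); cat → subcat (1)]
[Figure: Tquery: catalog / product with children name, price (<200), cat (=elec); cat has child subcat. The tree shape selects data prefixes; the annotations are constraints]
Tquery = q = <t, λ, cond>
t: rooted tree
λ: labeling function
cond: constraints on data values (e.g. =elec, <200)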
29
PS-Query Answer
• The answer is denoted q(t'), where t' is the data tree.
• It consists of the prefix of t' whose nodes match the corresponding query tree nodes.
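A minimal sketch of PS-query evaluation, assuming a query node carries a label, an optional condition on the data value, and at most one child per label. The names `Node`, `QNode`, `answer` and `product` are invented for illustration, and the Olympus price and the chair product are made-up data.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Node:
    label: str
    value: object = None
    children: list = field(default_factory=list)

@dataclass
class QNode:
    label: str
    cond: Optional[Callable[[object], bool]] = None  # None means "no constraint"
    children: list = field(default_factory=list)

def answer(q, d):
    """Return the prefix of data tree d matched by query q, or None if nothing matches."""
    if q.label != d.label or (q.cond is not None and not q.cond(d.value)):
        return None
    kept = []
    for qc in q.children:
        matches = [m for dc in d.children if (m := answer(qc, dc)) is not None]
        if not matches:
            return None  # a required branch of the pattern is missing
        kept.extend(matches)
    return Node(d.label, d.value, kept)

def product(name, price, cat, subcat):
    return Node("product", children=[
        Node("name", name), Node("price", price),
        Node("cat", cat, [Node("subcat", subcat)])])

data = Node("catalog", children=[
    product("Canon", 120, "elec", "camera"),
    product("Nikton", 199, "elec", "camera"),
    product("Sony", 175, "elec", "cdplayer"),
    product("Olympus", 250, "elec", "camera"),       # made-up price, filtered out
    product("Oak chair", 80, "furniture", "chair"),  # made-up non-elec product
])

# Query 1: name, price and subcategory of electronics products with price < 200.
query1 = QNode("catalog", children=[QNode("product", children=[
    QNode("name"),
    QNode("price", lambda v: v < 200),
    QNode("cat", lambda v: v == "elec", [QNode("subcat")]),
])])

result = answer(query1, data)
names = [p.children[0].value for p in result.children]
print(names)
```

The returned tree keeps the root and only those products whose whole pattern matches, which is why it is a prefix of the data tree.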
30
Answer Example
[Figure: query tree: catalog / product with children name, price (<200), cat (=elec); cat has child subcat]
[Figure: answer: catalog with products (Canon, 120, elec, camera) and (Nikton, 199, elec, camera)]
Note: q1(t'), q2(t') share data tree prefixes (the root and maybe more).
31
Incomplete Information
Data available
- Prefixes of the data tree, enriched by previous queries
Missing portion
- Simply define the missing information
- Using the initial Ttype and the queries
Together they form the Incomplete Tree.
32
Conditional Tree Type
1. A Tree Type with a condition function on the tree nodes.
Provides extensions to Tree Types.
2. Allows context-dependent structure definition.
HOW?
[Figure: dealer with children UsedCars and NewCars; UsedCars → ad* where ad has children model, year; NewCars → ad* where ad has child model]
Corresponding DTD
dealer → UsedCars | NewCars
UsedCars → ad*
NewCars → ad*
ad → model year | model
33
Specialization
[Figure: left, the specialized tree: dealer with children UsedCars → adUsed* (model, year) and NewCars → adNew* (model); right, the original tree where both are labeled ad]
Σ = {dealer, UsedCars, NewCars, ad, model, year}
Σ' = {adNew, adUsed}
σ: Σ' → Σ
34
Specialization
[Figure: the specialized tree (adUsed, adNew) next to the original tree (ad)]
σ(adNew) = σ(adUsed) = ad
σ: Σ' → Σ
CTtype=<,cond,>
35
Incomplete Tree
A tree representing the incomplete information.
Query 1: Find the name, price & subcategories of electronics products with price < $200
[Figure: Query 1 as a tree pattern: catalog / product with children name, price (<200), cat (=elec); cat has child subcat]
36
Incomplete Tree (cont)
Query 1: Find the name, price & subcategories of electronics products with price < $200
[Figure: answer: catalog with products (Canon, 120, elec, camera), (Nikton, 199, elec, camera), (Sony, 175, elec, cdplayer)]
What is missing?
37
What Is Missing?
product1 → name price cat (≠ elec) picture*; cat → subcat
All products whose category is not electronics
product2 → name price (≥ 200) cat (= elec) picture*; cat → subcat
All electronics products with price ≥ 200
38
Incomplete Tree T
Available information: a prefix of a full data tree (Tdata)
[Figure: catalog with products (Canon, 120, elec, camera), (Nikton, 199, elec, camera), (Sony, 175, elec, cdplayer)]
Missing information: a conditional tree type (CTtype)
[Figure: catalog may additionally have product1* (cat ≠ elec) and product2* (cat = elec, price ≥ 200) children]
39
Query 2: Find the name & pictures of all cameras with picture
[Figure: the incomplete tree after Query 1 (products Canon, Nikton, Sony plus the missing types product1*, product2*), extended by the answer to Query 2: a new product (Olympus, elec, camera, picture o.jpg) and a picture c.jpg added to a known camera]
What is missing?
40
What Is Missing?
product1 → name price cat (≠ elec) picture*; cat → subcat
All products whose category is not electronics
41
What Is Missing?
product1 → name price cat (≠ elec) picture*; cat → subcat
All products whose category is not electronics
product2b → name price (≥ 200) cat (= elec) picture*; cat → subcat (≠ camera)
All electronics products with price ≥ 200 whose subcategory is not camera
42
What Is Missing?
product1 → name price cat (≠ elec) picture*; cat → subcat
product2b → name price (≥ 200) cat (= elec) picture*; cat → subcat (≠ camera)
product2c → name price (≥ 200) cat (= elec); cat → subcat (= camera)
All cameras with price ≥ 200 & no picture
43
Incomplete Tree Definition
A tree T which consists of the following:
• A data tree Tdata = <t, λ, ν>
– Represents the known data
– Uses labels from Σ
• A conditional tree type, CTtype
– Represents the missing portion of the data
– Uses the specialized alphabet Σ'
• A data labeling mapping λ' from the nodes of Tdata to elements of Σ'.
– E.g. λ'(n ∈ N | λ(n) = product) = {product, product3, …}
44
Rep(T) Definition
• Rep(T) is the set of trees represented by an incomplete tree T.
• Each Tdata ∈ Rep(T) is a possible completion of the prefix of the available data tree given by T.
45
Rep(T) Definition (cont)
Given a Ttype
[Figure: Ttype: student with children name, addr, id; query q: student with name = shlomo; a tree Tdata in Rep(T): student with name shlomo and children addr, id]
46
Acquiring Incomplete Information
• Refine Algorithm
47
Acquiring Incomplete Info.
• How is this done via the WEB?
- simply using answers to queries
• We now show how this can be done against the representation system
Assumption: The input tree is a single document described by a tree type. Several documents can be merged into a single one.
48
Refine Motivation
• Each query posed against the webhouse defines additional constraints
• Answers to these queries help us refine the partial information.
• We describe this partial information using an incomplete tree.
• As we query the webhouse for more information, we want to be able to describe the current incomplete information.
49
Refine Motivation (cont)
Missing
product2 → name price (≥ 200) cat (= elec) picture*; cat → subcat
All electronics products with price ≥ 200
50
Refine Motivation (cont)
product2 → name price (≥ 200) cat (= elec) picture*; cat → subcat
A stronger constraint refines product2 into:
product2b → name price (≥ 200) cat (= elec) picture*; cat → subcat (≠ camera)
product2c → name price (≥ 200) cat (= elec); cat → subcat (= camera), no picture
51
Refine Algorithm
• Refine the incomplete information
Input
T: incomplete tree
q: PS-query
A = q(T): answer to q
Output
T': incomplete tree compatible with the answer A to q
52
Refine Algorithm
[Figure: query q posed against the webhouse returns A = q(T)]
q⁻¹(A): the set of trees compatible with the answer to q
But we only need the trees that also match the incomplete tree so far:
Rep(T') = Rep(T) ∩ q⁻¹(A)
Refine Output
Defines a new incomplete tree T'
In order to do so we need to define
1. CTtype to represent the missing portion
2. Tdata to represent the available data
Step 1
54
Refine Algorithm – step 1
1. Compute the conditional tree type of the negation of q.
I.e. Conditional tree for trees which return an empty answer to q.
55
Refine – step 1
1. Compute the conditional tree type of the negation of q.
For each node a of the query tree tq (with children a1, a2, …, an), three new types are introduced:
- ta: cond'(ta) = true
- t̄a: cond'(t̄a) = ¬condq(a)
- t̂a: cond'(t̂a) = condq(a)
Define σ'
The labels of the new types are defined as specializations of label 'a',
i.e. λ(a) = σ'(ta) = σ'(t̄a) = σ'(t̂a)
56
Refine – step 1 (cont)
We defined the cond' mapping of CT'.
We defined the specialization mapping σ'.
root' has type (t̄r ∨ t̂r), where r is the root of the query tree.
Let's define the rules:
- ta → t*a1 … t*an: accept everything.
- t̄a → t*a1 … t*an: accept everything below a, because the condition of q is not satisfied there.
- t̂a: for some child ai, the ai children have type t̄ai or t̂ai, while the other children are unrestricted: one of the children must not satisfy a condition of q.
57
Refine - step 1 Example
[Figure: query tq: product with children cat (=elec) and picture; cat has child subcat (=camera)]
The negation q⁻¹ of tq is the disjunction of three types:
- product with cat ≠ elec
- product with cat = elec and subcat ≠ camera
- product with cat = elec, subcat = camera and no picture
This provides a simple way to view the disjunction defined by the rules.
Negation computation complexity: O(|q| × maximum number of children of a node), where |q| is the size of the query tree.
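The three-way disjunction in this example follows a simple pattern: disjunct i keeps the first i conditions of the query and negates the next one. The sketch below flattens the query into an ordered list of conditions, which is a simplification of the tree-based construction; `CONDITIONS`, `negation_disjuncts` and `matching_disjunct` are hypothetical names.

```python
# Query 2 flattened into an ordered list of (description, predicate) pairs.
CONDITIONS = [
    ("cat=elec", lambda p: p["cat"] == "elec"),
    ("subcat=camera", lambda p: p["subcat"] == "camera"),
    ("has picture", lambda p: len(p["pictures"]) > 0),
]

def negation_disjuncts(conditions):
    """Disjunct i: the first i conditions hold and condition i+1 fails."""
    disjuncts = []
    for i, (name, _) in enumerate(conditions):
        held = [n for n, _ in conditions[:i]]
        disjuncts.append(held + [f"not({name})"])
    return disjuncts

def matching_disjunct(product, conditions):
    """Index of the single disjunct the product falls into, or None if it matches the query."""
    for i, (_, pred) in enumerate(conditions):
        if not pred(product):
            return i
    return None

for disjunct in negation_disjuncts(CONDITIONS):
    print(" and ".join(disjunct))
```

Because each disjunct fixes the outcome of all earlier conditions, the disjuncts are mutually exclusive, and their number grows linearly with the number of conditions.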
58
Refine – step 1 Example
CT' = CT ∩ q⁻¹
CT (missing types after Query 1):
- product1 → name price cat (≠ elec) picture*
- product2 → name price (≥ 200) cat (= elec) picture*
q⁻¹ (negation of Query 2):
- product with cat ≠ elec
- product with cat = elec and subcat ≠ camera
- product with cat = elec, subcat = camera and no picture
Note
This intersection yields exactly the missing types product1, product2b and product2c.
We show this next.
59
Refine – step 1 Example
CT'
[Figure: intersecting product1 (cat ≠ elec) and product2 (cat = elec, price ≥ 200) with the negation type "cat ≠ elec" yields product1 (cat ≠ elec)]
60
Refine – step 1 Example
CT'
[Figure: intersecting product1 and product2 with the negation type "cat = elec, subcat ≠ camera" yields product2b (cat = elec, subcat ≠ camera, price ≥ 200, picture*)]
61
Refine – step 1 Example
CT'
[Figure: intersecting product1 and product2 with the negation type "cat = elec, subcat = camera, no picture" yields product2c (cat = elec, subcat = camera, price ≥ 200, no picture)]
62
Node ids Assumption
• Persistent node ids
Distinct queries against an XML document return nodes with the same id iff the nodes are identical.
[Figure: node &231 (product: canon, 120, elec, camera) returned by one query and node &231 (product: canon, 120, elec, camera, picture c.jpg) returned by another are the same node]
63
Node ids Assumption (cont)
• Makes it possible to enrich the information about a given node through consecutive queries.
• Otherwise, the representation system would become too large to handle.
- it would need to be extended to keep track of the various possible ways of matching nodes returned by different queries
• A crucial assumption
64
Refine Output
Defines a new incomplete tree T'
In order to do so we need to define
1. CTtype to represent the missing portion
2. Tdata to represent the available data
Step 2
65
Refine – step 2
T'data is the join between Tdata and A. To compute it:
• Nodes in both A and Tdata: compute the intersection. E.g. product
• Nodes in Tdata but not in A: the node type is specialized using the CT' we just computed. E.g. product3
• Nodes in A but not in Tdata: refinement of an existing type. E.g. product2a
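The data part of this join relies on persistent node ids. A minimal sketch, assuming each tree is stored as a dictionary from node id to (label, value, set of child ids); only the id &231 comes from the slides, the other ids and the function name `join` are made up for illustration. The type specialization of the three cases is not modeled here.

```python
def join(known, answer):
    """Merge two prefixes of the same document, keyed by persistent node id."""
    merged = {}
    for tree in (known, answer):
        for node_id, (label, value, children) in tree.items():
            if node_id in merged:
                old_label, old_value, old_children = merged[node_id]
                assert old_label == label, "same id must mean same node"
                merged[node_id] = (label,
                                   old_value if old_value is not None else value,
                                   old_children | children)
            else:
                merged[node_id] = (label, value, set(children))
    return merged

# Prefix known from Query 1: the Canon product without its picture.
known = {
    "&231": ("product", None, {"&232", "&233", "&234"}),
    "&232": ("name", "canon", set()),
    "&233": ("price", 120, set()),
    "&234": ("cat", "elec", {"&235"}),
    "&235": ("subcat", "camera", set()),
}
# Answer to Query 2: the same product, now with a picture but no price.
answer_q2 = {
    "&231": ("product", None, {"&232", "&234", "&236"}),
    "&232": ("name", "canon", set()),
    "&234": ("cat", "elec", {"&235"}),
    "&235": ("subcat", "camera", set()),
    "&236": ("picture", "c.jpg", set()),
}

merged = join(known, answer_q2)
print(sorted(merged["&231"][2]))
```

Without shared ids, the two product nodes could not be recognized as the same node, which is why the slides call the assumption crucial.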
66
Drawback –The Blowup Problem
Given a tree type root → a b
and n queries qi (1 ≤ i ≤ n), each of the form root with a = i and b = i, all with empty answers.
Let's follow the construction of CTqi,
where CTqi belongs to the incomplete tree based on queries q1 … qi.
67
The Blowup Problem
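Query q1: root with a = 1 and b = 1
Incomplete tree Tq1: Tdata is empty
CTq1 is the disjunction of two types:
- root → a1 b, where cond(a1) = (≠ 1)
- root → a b1, where cond(b1) = (≠ 1)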
68
The Blowup Problem
Query q2: root with a = 2 and b = 2
To compute CTq2:
1. Compute q2⁻¹, the negation of q2:
- root → a2 b
- root → a b2
69
The Blowup Problem
Query q2: root with a = 2 and b = 2
To compute CTq2:
2. Compute the intersection q2⁻¹ ∩ CTq1
70
The Blowup Problem
2. Compute the intersection q2⁻¹ ∩ CTq1
q2⁻¹: root → a2 b | root → a b2
CTq1: root → a1 b | root → a b1
CTq2 = CTq1 ∩ q2⁻¹ consists of four types:
- root → a1,2 b (a ≠ 1 and a ≠ 2)
- root → a b1,2 (b ≠ 1 and b ≠ 2)
- root → a2 b1 (a ≠ 2, b ≠ 1)
- root → a1 b2 (a ≠ 1, b ≠ 2)
Continuing the computation yields:
|CTq3| = 4 * 2 = 2^3 = 8
…
|CTqn| = 2^n
Algorithm Refine yields a disjunction of 2^n multiplicity statements.
Exponential blowup of the representation system
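The count can be reproduced with a short enumeration. The sketch assumes each disjunct is recorded as a pair (values excluded for a, values excluded for b); `disjunctive_types` is a hypothetical name.

```python
from itertools import product

def disjunctive_types(n):
    """Expand (a != 1 or b != 1) and ... and (a != n or b != n) into a list of disjuncts."""
    types = []
    for choice in product("ab", repeat=n):
        a_excluded = frozenset(i for i, side in enumerate(choice, start=1) if side == "a")
        b_excluded = frozenset(i for i, side in enumerate(choice, start=1) if side == "b")
        types.append((a_excluded, b_excluded))
    return types

for n in range(1, 6):
    print(n, len(disjunctive_types(n)))
```

Every query doubles the number of disjuncts, because each existing type must be split according to which of the two children rules the new query out.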
71
Avoiding The Blowup
We consider two ways of avoiding the exponential blowup of incomplete trees:
• Provide an extension to the incomplete tree: conjunctive incomplete trees.
• Put some restrictions on the tree type and the queries.
72
Conjunctive Incomplete Tree
Types are defined only as disjunctions, i.e.
root → a1 b ∨ root → a b1
Define a type as a conjunction of disjunctions:
root → (a1 b ∨ a b1)
root → (a1 b ∨ a b1) ∧ … ∧ (an b ∨ a bn)
• ai and bi are specializations of a and b, respectively
• cond(ai) = (≠ i), 1 ≤ i ≤ n
• cond(bi) = (≠ i), 1 ≤ i ≤ n
73
Conjunctive Incomplete Tree
With conjunction: the incomplete information can be represented using only n conjuncts, each of them a disjunction.
Without conjunction: Algorithm Refine yields a disjunction of 2^n multiplicity statements.
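As a sanity check, the two representations can be compared on concrete values of a and b. This is a sketch under the assumption that both children carry integer values; the function names are invented for illustration.

```python
from itertools import product

def in_conjunctive(a, b, n):
    """n clauses: for each i, (a != i) or (b != i)."""
    return all(a != i or b != i for i in range(1, n + 1))

def in_disjunctive(a, b, n):
    """2**n disjuncts: pick, for each i, whether a or b must differ from i."""
    return any(
        all((a != i) if side == "a" else (b != i)
            for i, side in enumerate(choice, start=1))
        for choice in product("ab", repeat=n)
    )

n = 4
agree = all(in_conjunctive(a, b, n) == in_disjunctive(a, b, n)
            for a in range(0, n + 2) for b in range(0, n + 2))
print(agree)
```

Both functions describe the same set of trees; the conjunctive form stores n clauses, while the disjunctive form enumerates 2^n combinations.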
74
Heuristics
To deal with the case when the incomplete tree is already too large to be practical:
• Shrink the incomplete tree by asking critical additional queries that help to eliminate the missing portion.
• Lose some information: this allows trading accuracy against the size of the incomplete tree.
75
Acquiring Partial Information Summary
• Webhouse is acquired using answers to queries
• Each answer refines our partial information
• Partial information is described using incomplete trees
• We compute the new incomplete tree at each stage using the Refine algorithm
76
Querying Incomplete Trees
77
Answering Queries
Remember.. The known data is of the format
- product → name price cat picture*; cat → subcat
- product2a → name price (≥ 200) cat (= elec) picture; cat → subcat (= camera): cameras with price ≥ 200
- product3 → name price cat (= elec); cat → subcat (≠ camera): elec products (not cameras) with price < 200
78
Answering Queries
Given query 3:
Find the name, price & pictures of all cameras with price < $100 that have at least one picture.
[Figure: query tree: product with children name, price (<100), cat (=elec), picture (+); cat has child subcat (=camera)]
We can provide a complete answer to query 3 using the available information.
79
Answering Queries
Given query 4:
Find all cameras
[Figure: query tree: product with children name, price, cat (=elec), picture (*); cat has child subcat (=camera)]
No complete answer is available from the known information. We can do the following:
1. Provide the complete list of cameras with price < 200
2. Provide the complete list of cameras with a picture
3. Tell the user there may be more cameras (that are expensive and have no pictures)
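Whether a query can be answered completely comes down to whether any missing type can still hide a matching product. The sketch below tests this by brute force over a small, made-up grid of candidate products; it is an approximation for illustration, not the algorithm of the paper. The missing types are product1, product2b and product2c from the earlier slides, and the price grid and function names are assumptions.

```python
from itertools import product as cartesian

# Missing types after Query 1 and Query 2.
MISSING = {
    "product1": lambda p: p["cat"] != "elec",
    "product2b": lambda p: (p["cat"] == "elec" and p["subcat"] != "camera"
                            and p["price"] >= 200),
    "product2c": lambda p: (p["cat"] == "elec" and p["subcat"] == "camera"
                            and p["price"] >= 200 and p["pictures"] == 0),
}

def candidates():
    """A small hypothetical domain that covers every side of each condition."""
    for cat, subcat, price, pictures in cartesian(
            ["elec", "other"], ["camera", "other"], [50, 150, 199, 200, 250], [0, 1]):
        yield {"cat": cat, "subcat": subcat, "price": price, "pictures": pictures}

def possibly_missing(query):
    """Names of missing types that could still hide a product matching the query."""
    return sorted({name for p in candidates() if query(p)
                   for name, cond in MISSING.items() if cond(p)})

def query3(p):  # cameras with price < 100 and at least one picture
    return (p["cat"] == "elec" and p["subcat"] == "camera"
            and p["price"] < 100 and p["pictures"] >= 1)

def query4(p):  # all cameras
    return p["cat"] == "elec" and p["subcat"] == "camera"

print(possibly_missing(query3), possibly_missing(query4))
```

An empty result means the known data already holds the complete answer; a non-empty result names the types that would have to be fetched, here the expensive cameras without pictures.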
80
Answering Queries
• Provides an incomplete answer to the query given the knowledge available
• No data source access for further information
Next..
Mediator Approach: Provide a complete answer, but search the web only for the missing information. The incomplete tree is used as a guide for the mediator.
81
Mediator Approach
Additional queries may have to be generated against the input document to obtain the information needed to fully answer the query.
[Figure: generated query: product with children name, price (≥ 200), cat (=elec), picture (0); cat has child subcat (=camera)]
Search the web only for cameras with price ≥ 200 and no picture.
82
Mediator Approach (cont)
Assumption: The generated queries are local.
Local Queries: queries that explore the input document starting from the nodes already available.
[Figure: incomplete tree T with data tree Tdata, a node n of Tdata, and a PS-query q]
83
Local Query
[Figure: incomplete tree T with data tree Tdata, a node n of Tdata, and a PS-query rooted at n]
Local ps-query: p@n
p: ps-query
n: node in Tdata
84
Local Query
L: {p@n | p@n a local query}
[Figure: data tree Tdata with nodes n1 … nk, and local queries p@n1 … p@nk rooted at them]
We want the set of queries L to collect the additional information needed to fully answer a given ps-query.
L completes T if q(T) = q(T').
T' is obtained by extending each node n of Tdata for which p@n ∈ L with p@n(T), for T ∈ rep(T).
85
Local Query
Using local queries helps us avoid redoing work already done by previous queries.
We want the set of queries L to be non-redundant:
1. No node already in T is returned by a query in L.
2. No new node is returned by two distinct queries of L.
3. Queries in L always return a non-empty answer.
86
Mediator Approach Conclusion
The mediator approach combines the CWA and OWA semantics.
CWA – describes the missing information, i.e. some facts are not known.
OWA – some data that is still unknown may exist.
87
Assumptions
88
Order
1. Original XML documents define an order on elements. Moving to the tree representation loses the original ordering.
2. The source DTD may describe the order of children at each node type.
3. Queries may use ordering in their selection patterns.
Assumption
No order is required in our representation system.
89
Branching
Assumption
PS-query tree patterns allow just one child with a given label.
Branching
Allows multiple children with the same label.
[Figure: root with two product children, one leading to camera and one to cdplayer]
90
Branching
[Figure: Tdata: root with children a1, a2, …, an; branching ps-query q: root with n children labeled a, whose b children carry the conditions b=1, b=2, …, b=n]
q(T) requires the description of n! possibilities of assigning the n values of b to a1 … an
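The n! figure is the number of one-to-one assignments of the values 1..n to the nodes a1..an. A tiny enumeration, with `assignments` as a hypothetical helper name:

```python
from itertools import permutations
from math import factorial

def assignments(n):
    """All ways to give the values 1..n of b to the nodes a1..an, one value each."""
    nodes = [f"a{i}" for i in range(1, n + 1)]
    return [dict(zip(nodes, values)) for values in permutations(range(1, n + 1))]

for n in range(1, 6):
    print(n, len(assignments(n)), factorial(n))
```

Since the representation would have to describe each of these assignments, branching queries are excluded by the assumption above.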
91
References
• Representing and Querying XML with Incomplete Information. Serge Abiteboul, Luc Segoufin, Victor Vianu.
• Incomplete Information and XML Presentation. http://www-rocq.inria.fr/~abiteboul
• A Web Odyssey: from Codd to XML. Victor Vianu.
• Incomplete Information in Relational Databases. Tomasz Imielinski and Witold Lipski Jr.