EDBT 2009 - Provenance for Nested Subqueries
DESCRIPTION
Data provenance is essential in applications such as scientific computing, curated databases, and data warehouses. Several systems have been developed that provide provenance functionality for the relational data model. These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested subqueries, aggregation and/or user-defined functions. Without support for these constructs, a provenance management system is of limited use. In this paper we address this limitation by exploring the problem of provenance derivation when complex queries are involved. More precisely, we demonstrate that the widely used definition of Why-provenance fails in the presence of nested subqueries, and show how the definition can be modified to produce meaningful results for nested subqueries. We further present query rewrite rules to transform an SQL query into a query propagating provenance. The solution introduced in this paper allows us to track provenance information for a far wider subset of SQL than any of the existing approaches. We have incorporated these ideas into the Perm provenance management system engine and used it to evaluate the feasibility and performance of our approach.
TRANSCRIPT
Provenance for Nested Subqueries
Boris Glavic
Database Technology Group
Department of Informatics
University of Zurich
Gustavo Alonso
Systems Group
Department of Computer Science
ETH Zurich
Overview
1. Introduction
2. The Provenance of Subqueries
3. Experimental Results
4. Conclusion
1. Introduction
Query
Which input data item(s) influenced which output data item(s)?
Granularity
  Tuple
  Attribute Value
  ...
Contribution semantics
  Influence (Lineage / Why)
  Copy (Where)
  ...
1. Introduction
Most application domains that benefit from provenance use complex queries
  Subqueries
    Correlated
    Nested
Not supported by existing systems
  Semantics not clear
  Complex computation
1. Introduction
Steps to solve this problem
1. Establish sound semantics for provenance of subqueries
2. Algorithms for subquery provenance computation
3. Integrate algorithms into a Provenance Management System (Perm)
1. Introduction
Definition of contribution semantics
  Why/Influence-provenance
    Introduced in [Cui, Widom ICDE '00]
    Provenance represented as a list of subsets of the input relations
    Defined for a single algebra operator and a single result tuple
1. Introduction
Definition 1: For a single algebra operator $op$ with input relations $T_1, \ldots, T_n$, a list $(T_1^*, \ldots, T_n^*)$ of maximal subsets of the input relations is the provenance of a tuple $t$ from the result of $op$ iff:
  $op(T_1^*, \ldots, T_n^*) = t$
  For all $i$ and $t^*$ with $t^* \in T_i^*$: $op(T_1^*, \ldots, T_{i-1}^*, t^*, T_{i+1}^*, \ldots, T_n^*) \neq \emptyset$
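Definition 1 can be tried out by brute force on tiny relations. The following is a minimal sketch, not Perm's implementation: it enumerates all combinations of input subsets, keeps those that produce exactly the result tuple and in which every single tuple contributes, and returns the maximal ones. The helper names (`subsets`, `provenance`, `project_a`) and the sample relation are assumptions made for illustration, and the enumeration is exponential in the input size.

```python
from itertools import combinations, product


def subsets(relation):
    """All subsets of a relation (a set of tuples), as frozensets."""
    items = sorted(relation)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


def provenance(op, inputs, t):
    """Brute-force Definition 1 for operator `op` and result tuple `t`."""
    candidates = []
    for combo in product(*(subsets(rel) for rel in inputs)):
        # Condition 1: the candidate subsets produce exactly t.
        if op(*combo) != {t}:
            continue
        # Condition 2: every single tuple t* of every Ti* contributes.
        if all(op(*combo[:i], frozenset([single]), *combo[i + 1:])
               for i, subset in enumerate(combo) for single in subset):
            candidates.append(combo)

    def contained(x, y):
        return all(a <= b for a, b in zip(x, y))

    # Keep only the maximal candidates.
    return [c for c in candidates
            if not any(c != d and contained(c, d) for d in candidates)]


def project_a(r):
    """Duplicate-eliminating projection on the first attribute."""
    return {(a,) for (a, _) in r}


# Hypothetical input relation with attributes (a, x).
R = {(1, "x"), (1, "y"), (2, "z")}
print(provenance(project_a, [R], (1,)))
```

For the projection, both tuples with a = 1 end up in the provenance of the result tuple (1), while the tuple with a = 2 does not.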
1. Introduction
Perm
  Provenance Extension of the Relational Model
  Provenance Management System (PMS)
  "Pure" relational representation of provenance
  Provenance computation through algebraic query rewrite
  Implemented as an extension of PostgreSQL
1. Introduction
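Provenance representation
[Figure: a query over input relations 1, 2, ..., n. Each row of the provenance representation consists of the original result attributes followed by the attributes of relation 1 through relation n.]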
1. Introduction
Provenance representation
[Figure: a query over relations R (tuples r1, r2) and S (tuple s1) with original result tuple t1.]
Original Attributes  Relation R Attributes  Relation S Attributes
t1                   r1                     s1
t1                   r2                     s1
1. Introduction
Provenance computation through query rewrite:
  Given a query q, generate a query q+ that computes the provenance of q
    Representation as defined before
  Rewrites operate on the algebraic representation of a query
    Rewrite rules for each operator op that transform op into an algebra statement that propagates the provenance
1. Introduction
Rewrite rules example:
    SELECT agg, G
    FROM T
    GROUP BY G
is rewritten into
    SELECT agg, G, prov(T)
    FROM
      (SELECT agg, G FROM T GROUP BY G) AS agg
      LEFT OUTER JOIN
      (SELECT G AS G', prov(T) FROM T+) AS prov
      ON G = G'
1. Introduction
Rewrite rules example:
    SELECT sum(revenue) AS sum, shop
    FROM sales
    GROUP BY shop

sales
shop    month  revenue
Migros  Jan    100
Migros  Feb    10
Migros  Mar    10
Coop    Jan    25
Coop    Feb    25

result
sum  shop
120  Migros
50   Coop
1. Introduction
    SELECT sum, shop, pShop, pMonth, pRevenue
    FROM
      (SELECT sum(revenue) AS sum, shop FROM sales GROUP BY shop) AS agg
      LEFT OUTER JOIN
      (SELECT shop AS shop', pShop, pMonth, pRevenue FROM sales+) AS prov
      ON shop = shop'

sum  shop    pShop   pMonth  pRevenue
120  Migros  Migros  Jan     100
120  Migros  Migros  Feb     10
120  Migros  Migros  Mar     10
50   Coop    Coop    Jan     25
50   Coop    Coop    Feb     25
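The rewritten aggregation can be reproduced on an ordinary SQL engine. The sketch below uses Python's `sqlite3` module rather than Perm, so the rewrite of the base relation is written out by hand: it assumes that, for a base relation, each tuple is its own provenance and the provenance attributes are renamed copies of the original ones. The names `rev_sum`, `shop2`, and `aggregation_with_provenance` are illustrative choices, not part of the slides.

```python
import sqlite3

SALES = [("Migros", "Jan", 100), ("Migros", "Feb", 10), ("Migros", "Mar", 10),
         ("Coop", "Jan", 25), ("Coop", "Feb", 25)]

# Hand-written counterpart of the rewritten query. The inner SELECT on sales
# plays the role of the rewritten base relation (assumption: provenance
# attributes are renamed copies of the base attributes).
REWRITTEN = """
SELECT agg.rev_sum, agg.shop, prov.pShop, prov.pMonth, prov.pRevenue
FROM (SELECT sum(revenue) AS rev_sum, shop FROM sales GROUP BY shop) AS agg
LEFT OUTER JOIN
     (SELECT shop AS shop2, shop AS pShop, month AS pMonth,
             revenue AS pRevenue
      FROM sales) AS prov
ON agg.shop = prov.shop2
"""


def aggregation_with_provenance():
    """Run the rewritten query on an in-memory copy of the sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (shop TEXT, month TEXT, revenue INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", SALES)
    rows = conn.execute(REWRITTEN).fetchall()
    conn.close()
    return rows


for row in aggregation_with_provenance():
    print(row)
```

Each aggregated row is repeated once per contributing sales tuple, which matches the single-relation representation described earlier: original attributes first, provenance attributes after them.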
2. The Provenance of Subqueries
Sublinks: subqueries in e.g. the SELECT clause
  $\sigma_{a \text{ IN } \sigma_{b=3}(S)}(R)$
Correlated: references outside attributes
  $\sigma_{a \text{ IN } \sigma_{b=a}(S)}(R)$
Nested: sublink that contains sublinks
  $\sigma_{a \text{ IN } \sigma_{b = \text{ANY}(T)}(S)}(R)$
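The three kinds of sublinks can be written as SQL and run on toy data. This is a hedged sketch using Python's `sqlite3`: the table contents are made up, the attribute `c` of `T` is an assumed name, and because SQLite has no `= ANY` operator the nested example uses the equivalent `IN` form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (a INTEGER);
CREATE TABLE S (b INTEGER);
CREATE TABLE T (c INTEGER);
INSERT INTO R VALUES (1), (2), (3);
INSERT INTO S VALUES (1), (3), (4);
INSERT INTO T VALUES (3), (4);
""")

QUERIES = {
    # Uncorrelated sublink: the subquery does not depend on the outer tuple.
    "uncorrelated":
        "SELECT a FROM R WHERE a IN (SELECT b FROM S WHERE b = 3)",
    # Correlated sublink: the subquery references the outer attribute R.a.
    "correlated":
        "SELECT a FROM R WHERE a IN (SELECT b FROM S WHERE b = R.a)",
    # Nested sublink: the subquery itself contains a sublink.
    "nested":
        "SELECT a FROM R WHERE a IN "
        "(SELECT b FROM S WHERE b IN (SELECT c FROM T))",
}

results = {name: sorted(row[0] for row in conn.execute(sql))
           for name, sql in QUERIES.items()}
conn.close()
print(results)
```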
2. The Provenance of Subqueries
What is the provenance of a sublink according to Definition 1?
  Sublinks can be used in different contexts
    Selection
    Projection
    ...
  Sublink either
    Produces exactly one value
    Or produces a boolean value
2. The Provenance of Subqueries
Single uncorrelated ANY-sublinks in selection conditions
For other types of sublinks
  Correlated sublinks
  Nested sublinks
2. The Provenance of Subqueries
For other types of sublinks
  Correlated sublinks
  Nested sublinks
READ THE PAPER!
2. The Provenance of Subqueries
Single uncorrelated ANY-sublinks in selection conditions
  The result of the sublink query is fixed
  For a given input tuple t the sublink condition is either true or false
  $\sigma_{a = \text{ANY } \sigma_{b=3}(S)}(R)$
2. The Provenance of Subqueries
Some terminology
  $T_{sub}$: the query of a sublink
  $C_{sub}$: the conditional expression of a sublink
Example: $q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$
  $T_{sub} = \Pi_b(S)$
  $C_{sub} = (a = \text{ANY } \Pi_b(S))$
2. The Provenance of Subqueries
The sublink condition $C_{sub}$ can play different roles in a condition C of a selection (for one input tuple t):
  Reqtrue: the selection condition is true iff $C_{sub}$ is true
  Reqfalse: the selection condition is true iff $C_{sub}$ is false
  Ind: the selection condition is true independent of the result of $C_{sub}$
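One way to read these three roles operationally: fix the input tuple t, then evaluate the selection condition once with the sublink result forced to true and once with it forced to false. The function below is a sketch of that reading; the name `sublink_role`, the callable interface, and the sample conditions are assumptions made for illustration.

```python
def sublink_role(condition):
    """Classify the role of a sublink for one fixed input tuple.

    `condition` maps the boolean result of the sublink condition to the
    value of the whole selection condition C for that tuple.
    """
    if_true = condition(True)
    if_false = condition(False)
    if if_true and if_false:
        return "ind"        # C holds whatever the sublink returns
    if if_true:
        return "reqtrue"    # C holds iff the sublink is true
    if if_false:
        return "reqfalse"   # C holds iff the sublink is false
    return None             # the tuple is not selected either way


a = 5  # attribute value of the fixed input tuple (hypothetical)
print(sublink_role(lambda c_sub: c_sub))              # C = C_sub
print(sublink_role(lambda c_sub: not c_sub))          # C = NOT C_sub
print(sublink_role(lambda c_sub: c_sub or a < 10))    # C = C_sub OR a < 10
```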
2. The Provenance of Subqueries
Some more terminology
  $T_{sub}^{true}(t)$: all tuples from the sublink query that fulfill the "unquantified" sublink condition
  $T_{sub}^{false}(t)$: all tuples from the sublink query that do not fulfill the "unquantified" sublink condition
Example: for $C_{sub} = (a = \text{ANY } \sigma_{b=3}(S))$ the unquantified condition is $C_{sub}^{\circ} = (a = b)$
2. The Provenance of Subqueries
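Back to ANY-sublinks in selections
Proposition:
$$T_{sub}^*(t) = \begin{cases} T_{sub}^{true}(t) & \text{if } C_{sub} \text{ is reqtrue} \\ T_{sub} & \text{if } C_{sub} \text{ is reqfalse or ind} \end{cases}$$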
2. The Provenance of Subqueries
Example: $q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$

R      S           Result
a      b  c        a
1      1  100      1
2      2  10       2
3      4  24

Compute provenance for $t = (1)$
2. The Provenance of Subqueries
$q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$
$T_{sub} = \Pi_b(S)$
$C_{sub}^{\circ} = (a = b)$
$T_{sub}^{true}(t) = \{(1)\}$
$C_{sub}$ is reqtrue
$T_{sub}^* = T_{sub}^{true}$
2. The Provenance of Subqueries
$q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$
$C_{sub}^{\circ} = (a = b)$
Compute provenance for $t = (1)$

R      Tsub
a      b
1      1
2      2
3      4

$T_{sub}^{true}(t) = \{(1)\}$
2. The Provenance of Subqueries
$q = \sigma_{a = \text{ANY } \Pi_b(S)}(R)$
Compute provenance for $t = (1)$

R      S           Tsub      Result
a      b  c        b         a
1      1  100      1         1
2      2  10       2         2
3      4  24       4

R*     Tsub*
a      b
1      1
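The proposition and the worked example can be replayed in a few lines. This is a sketch under stated assumptions: the sublink is `a = ANY Tsub`, so the unquantified condition is `a = b`, and the function names (`t_sub_true`, `t_sub_false`, `sublink_provenance`) are made up for the illustration.

```python
def t_sub_true(a, t_sub):
    """Tuples of the sublink query that fulfill the unquantified condition a = b."""
    return {b for b in t_sub if a == b}


def t_sub_false(a, t_sub):
    """Tuples of the sublink query that do not fulfill a = b."""
    return {b for b in t_sub if a != b}


def sublink_provenance(a, t_sub, role):
    """Tsub*(t) as stated in the proposition for ANY-sublinks in selections."""
    if role == "reqtrue":
        return t_sub_true(a, t_sub)
    if role in ("reqfalse", "ind"):
        return set(t_sub)
    raise ValueError(f"unknown role: {role}")


# Data of the example: R(a), S(b, c), Tsub = projection of S on b.
R = [1, 2, 3]
S = [(1, 100), (2, 10), (4, 24)]
T_SUB = {b for (b, _c) in S}

# q = selection on R with condition a = ANY Tsub.
result = [a for a in R if t_sub_true(a, T_SUB)]
print(result)
print(sublink_provenance(1, T_SUB, "reqtrue"))
```

In this query the selection condition consists of the sublink alone, so the sublink is reqtrue for every selected tuple and only the matching tuple of Tsub enters the provenance of t = (1).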
2. The Provenance of Subqueries
Definition 1 is ambiguous for queries with more than one sublink!
$q = \sigma_{C_1 \lor C_2}(U)$
$C_1 = (a = \text{ANY } R)$
$C_2 = (a > \text{ALL } S)$
$t = (5)$

U      R      S      Result
a      b      c      a
5      1      1      5
       2      5
       100
2. The Provenance of Subqueries
Definition 1 is ambiguous for queries with more than one sublink!
[Animation step: the example of the previous slide, with the annotations "true" and "false" added.]
2. The Provenance of Subqueries
$q = \sigma_{C_1 \lor C_2}(U)$
$C_1 = (a = \text{ANY } R)$
$C_2 = (a > \text{ALL } S)$
$t = (5)$

Solution 1            Solution 2
U*   R*   S*          U*   R*   S*
a    b    c           a    b    c
5    5    1           5    1    1
          5                100
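The ambiguity can be checked mechanically. The sketch below rests on a loudly stated assumption about the data: R is taken to hold the integers 1 through 100 (the slide lists 1, 2, 100, which this reading treats as an abbreviation), S holds 1 and 5, and U holds 5. With that reading, both candidate solutions satisfy the two conditions of Definition 1. The function names are hypothetical, and maximality is not checked by the code.

```python
def q(u, r, s):
    """Selection on U with condition C1 OR C2, C1 = (a = ANY R), C2 = (a > ALL S)."""
    return {a for a in u if a in r or all(a > x for x in s)}


def satisfies_conditions(t, u_star, r_star, s_star):
    """Conditions 1 and 2 of Definition 1 (maximality is not checked)."""
    if q(u_star, r_star, s_star) != {t}:
        return False
    return (all(q({u}, r_star, s_star) for u in u_star)
            and all(q(u_star, {r}, s_star) for r in r_star)
            and all(q(u_star, r_star, {s}) for s in s_star))


U = {5}
R = set(range(1, 101))   # ASSUMPTION: R = {1, ..., 100}
S = {1, 5}

solution_1 = ({5}, {5}, {1, 5})
solution_2 = ({5}, set(range(1, 101)), {1})

print(satisfies_conditions(5, *solution_1))
print(satisfies_conditions(5, *solution_2))
print(satisfies_conditions(5, U, R, S))
```

Neither solution contains the other: Solution 1 has the smaller R* but the larger S*. The full inputs themselves do not qualify, because a tuple of R other than 5 contributes nothing as long as 5 is in S*.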
2. The Provenance of Subqueries
[Animation step: Solution 1 and Solution 2 as on the previous slide, with the annotations "true" and "false" added.]
2. The Provenance of Subqueries
[Animation step: Solution 1 and Solution 2 as on the previous slide, with the annotations "false" and "true" added.]
2. The Provenance of Subqueries
Reasons for this ambiguity:
  The definition requires the provenance to produce the same result
  But not to produce the same results for the sublinks
-> Definition 1 produces false positives
2. The Provenance of Subqueries
Solution: Extend Definition 1
  Add a third condition:
    For each sublink: if it is computed for one result tuple t and one tuple from the provenance of the sublink, it produces the same sublink result as in the original query
2. The Provenance of Subqueries
$q = \sigma_{C_1 \lor C_2}(U)$
$C_1 = (a = \text{ANY } R)$
$C_2 = (a > \text{ALL } S)$
$t = (5)$

Solution 1            Solution 2
U*   R*   S*          U*   R*   S*
a    b    c           a    b    c
5    5    5           5    1    1
                           100
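The third condition of the extended definition can be tested on the same example. As before, this is a sketch under an explicit assumption: R is read as the integers 1 through 100, S as {1, 5}, and the result tuple is t = (5). The check evaluates each sublink on t and on one provenance tuple at a time and compares the outcome with the sublink result in the original query. All function names are made up for the illustration.

```python
def c1(a, r):
    """C1 = (a = ANY R)."""
    return a in r


def c2(a, s):
    """C2 = (a > ALL S)."""
    return all(a > x for x in s)


def satisfies_condition_3(t, r, s, r_star, s_star):
    """Each single provenance tuple reproduces the original sublink result."""
    original_c1 = c1(t, r)
    original_c2 = c2(t, s)
    return (all(c1(t, {x}) == original_c1 for x in r_star)
            and all(c2(t, {x}) == original_c2 for x in s_star))


R = set(range(1, 101))   # ASSUMPTION: R = {1, ..., 100}
S = {1, 5}

print(c1(5, R), c2(5, S))                                    # original sublink results
print(satisfies_condition_3(5, R, S, {5}, {1, 5}))           # Solution 1 under Definition 1
print(satisfies_condition_3(5, R, S, set(range(1, 101)), {1}))  # Solution 2 under Definition 1
print(satisfies_condition_3(5, R, S, {5}, {5}))              # Solution 1 after the extension
```

Under this reading, only R* = {5} together with S* = {5} passes the check; the tuple 1 has to leave S* because 5 > 1 is true while C2 is false in the original query.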
2. The Provenance of Subqueries
How to compute the provenance according to the extended definition?
  Use query rewrite
    Generic strategy (Gen)
    Specialized strategies
      Use un-nesting
      Check: does not change the provenance
2. The Provenance of Subqueries
Gen-strategy
  For queries we cannot un-nest
  1. Join original query with all possible provenance tuples (base relations)
  2. Rewrite the sublink query
  3. Introduce additional correlation to simulate a join between 1) and 2)
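To make the three steps more tangible, here is a deliberately simplified illustration for the single query q = selection on R with condition a = ANY (projection of S on b), run with Python's `sqlite3`. It is not the rewrite that Perm generates: it covers only the reqtrue case, the SQL shape and all alias names (`o`, `p`, `s_plus`, `p_r_a`, `p_s_b`, `p_s_c`) are assumptions, and `IN` stands in for `= ANY`, which SQLite lacks.

```python
import sqlite3

GEN_SKETCH = """
SELECT o.a, o.a AS p_r_a, p.b AS p_s_b, p.c AS p_s_c
FROM R AS o, S AS p                      -- step 1: pair each input tuple with
                                         --   every possible provenance tuple of S
WHERE o.a IN (SELECT b FROM S)           -- original selection condition
  AND EXISTS (SELECT 1 FROM S AS s_plus  -- step 2: sublink query, re-evaluated
              WHERE s_plus.b = o.a       --   with the unquantified condition a = b
                AND s_plus.b = p.b       -- step 3: extra correlation that ties the
                AND s_plus.c = p.c)      --   sublink tuple to the candidate p
ORDER BY o.a
"""


def gen_sketch_rows():
    """Run the simplified rewrite on the data of the ANY-sublink example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE R (a INTEGER);
    CREATE TABLE S (b INTEGER, c INTEGER);
    INSERT INTO R VALUES (1), (2), (3);
    INSERT INTO S VALUES (1, 100), (2, 10), (4, 24);
    """)
    rows = conn.execute(GEN_SKETCH).fetchall()
    conn.close()
    return rows


for row in gen_sketch_rows():
    print(row)
```

Each result tuple of q is extended with its own R tuple and with the S tuple that made the sublink true, which is the shape of the relational provenance representation used throughout the talk.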
3. Experimental Results
TPC-H benchmark (10 MB size)
3. Experimental Results
TPC-H benchmark (1 GB size)
4. Conclusion
Definition 1 fails in the presence of sublinks
  Can be extended to deal with sublinks
Provenance computation for sublinks
  By using query rewrites
  Implemented in Perm
Future work
  Physical provenance-aware operators
Questions?