Provenance Semirings

T.J. Green, G. Karvounarakis, V. Tannen, University of Pennsylvania, PODS 2007


Page 1: Provenance Semirings

Provenance Semirings

T.J. Green, G. Karvounarakis, V. Tannen

University of Pennsylvania

PODS 2007

Page 2: Provenance Semirings

Provenance

●First studied in data warehousing

▪Lineage [Cui,Widom,Wiener 2000]

●Scientific applications (to assess quality of data)

▪Why-Provenance [Buneman,Khanna,Tan 2001]

●Our interest: P2P data sharing in the ORCHESTRA system (project headed by Zack Ives)

▪Trust conditions based on provenance

▪Deletion propagation

Page 3: Provenance Semirings

Annotated relations

●Provenance: an annotation on tuples

●Our observation: propagating provenance/lineage through views is similar to querying

▪Incomplete Databases (conditional tables)

▪Probabilistic Databases (independent tuple tables)

▪Bag Semantics Databases (tuples with multiplicities)

●Hence we look at queries on relations with annotated tuples

Page 4: Provenance Semirings

Incomplete databases: boolean C-tables

R (the last column holds boolean variables):

  a b c   p
  d b e   r
  f g e   s

Semantics: a set of instances, one for each truth assignment to p, r, s:

I(R) = { ∅, {a b c}, {d b e}, {f g e}, {a b c, d b e}, {a b c, f g e}, {d b e, f g e}, {a b c, d b e, f g e} }

Page 5: Provenance Semirings

Imielinski & Lipski (1984): queries on C-tables

R:

  a b c   p
  d b e   r
  f g e   s

q, a union of conjunctive queries (UCQ):

q(x,z) :- R(x, _,z), R(_, _,z)
q(x,z) :- R(x,y, _), R(_ ,y,z)

q(R), with each condition and its simplified form:

  a c   (p ∧ p) ∨ (p ∧ p)             = p
  a e   p ∧ r                         = p ∧ r
  d c   r ∧ p                         = p ∧ r
  d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)   = r
  f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)   = s

For the valuation p=true, r=false, s=true, the resulting instance is:

  a c
  f e
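The last step on this slide, choosing one instance by fixing the boolean variables, can be sketched in Python. The dictionary encoding and the function name `instance` are illustrative assumptions, not part of the slides; the conditions are the simplified ones listed for q(R).

```python
# Simplified conditions of q(R), written as functions of a valuation.
# The encoding (lambdas over a dict) is an illustrative assumption.
q_conditions = {
    ("a", "c"): lambda v: v["p"],
    ("a", "e"): lambda v: v["p"] and v["r"],
    ("d", "c"): lambda v: v["p"] and v["r"],
    ("d", "e"): lambda v: v["r"],
    ("f", "e"): lambda v: v["s"],
}


def instance(conditions, valuation):
    """Return the set of tuples whose condition holds under the valuation."""
    return {t for t, cond in conditions.items() if cond(valuation)}


result = instance(q_conditions, {"p": True, "r": False, "s": True})
print(sorted(result))  # [('a', 'c'), ('f', 'e')]
```

Each of the eight valuations of p, r, s selects one possible instance of q(R) in the same way.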

Page 6: Provenance Semirings

Why-provenance/lineage

Which input tuples contribute to the presence of a tuple in the output?

R (the last column holds tuple ids):

  a b c   p
  d b e   r
  f g e   s

q(R), for the same query:

  a c   {p}
  a e   {p,r}
  d c   {p,r}
  d e   {r,s}
  f e   {r,s}

[Cui,Widom,Wiener 2000] [Buneman,Khanna,Tan 2001]

Page 7: Provenance Semirings

C-tables vs. Lineage

C-table calculations:

  a c   (p ∧ p) ∨ (p ∧ p)
  a e   p ∧ r
  d c   r ∧ p
  d e   (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)
  f e   (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)

Lineage calculations:

  a c   ({p} ∪ {p}) ∪ ({p} ∪ {p})
  a e   {p} ∪ {r}
  d c   {r} ∪ {p}
  d e   ({r} ∪ {r}) ∪ ({r} ∪ {r}) ∪ ({r} ∪ {s})
  f e   ({s} ∪ {s}) ∪ ({s} ∪ {s}) ∪ ({s} ∪ {r})

The structure of the calculations is the same!

Page 8: Provenance Semirings

Another analogy, with bag semantics

R (the last column holds tuple multiplicities):

  a b c   2
  d b e   5
  f g e   1

q(R), for the same query:

        multiplicity calculation   result   C-table calculation
  a c   2 · 2 + 2 · 2              8        (p ∧ p) ∨ (p ∧ p)
  a e   2 · 5                      10       p ∧ r
  d c   5 · 2                      10       r ∧ p
  d e   5 · 5 + 5 · 5 + 5 · 1      55       (r ∧ r) ∨ (r ∧ r) ∨ (r ∧ s)
  f e   1 · 1 + 1 · 1 + 1 · 5      7        (s ∧ s) ∨ (s ∧ s) ∨ (s ∧ r)

The structure of the calculations is the same!

Page 9: Provenance Semirings

Abstracting the structure of these calculations

          C-tables   Bags   Lineage   Abstract
  join    ∧          ·      ∪         ·
  union   ∨          +      ∪         +

Abstract calculation:

  a c   (p · p) + (p · p)
  a e   p · r
  d c   r · p
  d e   (r · r) + (r · r) + (r · s)
  f e   (s · s) + (s · s) + (s · r)

These expressions capture the abstract structure of the calculations, which encodes the logical derivation of the output tuples. We shall use these expressions as provenance.

Page 10: Provenance Semirings

Technical Development

●Abstractly annotated relations (K-relations) and their relational algebra

▪K must be a semiring

▪For provenance, K is the semiring of polynomials

●Datalog on K-relations

▪For provenance, K consists of (possibly infinite) formal power series

Page 11: Provenance Semirings

K-relations

●Annotations are elements from an algebraic structure (K, +, ·, 0, 1)

▪Although the notation resembles arithmetic, these are abstract operations

●If D is the domain of database values, an n-ary K-relation is a function defined on all possible tuples:

R: Dⁿ → K

Page 12: Provenance Semirings

K-relations, annotated tables

●A K-relation R: Dⁿ → K corresponds to a table

●If R(t)=k, then t “is annotated by k”

●For all but finitely many tuples t, R(t) = 0

▪we omit those tuples from the table representation

  tuple1   k1
  tuple2   k2
  tuple3   k3
  . . .

Page 13: Provenance Semirings

Positive K-relational algebra

●We define an RA+ on K-relations:

▪The · corresponds to join

▪The + corresponds to union and projection

▪0 and 1 are used for selection predicates

▪Details in the paper (but recall how we evaluated the UCQ q earlier and we will see another example later)
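The operator definitions themselves are deferred to the paper, so the following Python sketch is a reading of the bullets above, not the authors' code. A K-relation is stored as a dict from tuples to non-zero annotations, and the semiring operations are passed in as plain functions. All function names and the positional tuple encoding are assumptions made for illustration. The demo uses the bag semiring (N, +, ·, 0, 1) and reproduces the multiplicities from the bag-semantics slide.

```python
from operator import add, mul

# A K-relation is a dict {tuple: annotation}; tuples annotated with 0 are omitted.


def union(R1, R2, plus):
    """Annotations of the same tuple are added."""
    out = dict(R1)
    for t, k in R2.items():
        out[t] = plus(out[t], k) if t in out else k
    return out


def project(R, positions, plus):
    """Annotations of all tuples that project to the same result are added."""
    out = {}
    for t, k in R.items():
        s = tuple(t[i] for i in positions)
        out[s] = plus(out[s], k) if s in out else k
    return out


def select(R, pred, times, zero, one):
    """The predicate acts as a factor: 1 keeps a tuple, 0 drops it."""
    out = {}
    for t, k in R.items():
        v = times(k, one if pred(t) else zero)
        if v != zero:
            out[t] = v
    return out


def join(R1, R2, match, times):
    """Annotations of joined tuples are multiplied; the result concatenates both tuples."""
    return {
        t1 + t2: times(k1, k2)
        for t1, k1 in R1.items()
        for t2, k2 in R2.items()
        if match(t1, t2)
    }


# Bag semantics: annotations are multiplicities in (N, +, *, 0, 1).
R = {("a", "b", "c"): 2, ("d", "b", "e"): 5, ("f", "g", "e"): 1}

# q(x,z) :- R(x,_,z), R(_,_,z)   (both atoms share the third column)
rule1 = project(join(R, R, lambda t1, t2: t1[2] == t2[2], mul), [0, 2], add)
# q(x,z) :- R(x,y,_), R(_,y,z)   (both atoms share the second column)
rule2 = project(join(R, R, lambda t1, t2: t1[1] == t2[1], mul), [0, 5], add)

q = union(rule1, rule2, add)
print(q[("d", "e")])  # 55

only_e = select(q, lambda t: t[1] == "e", mul, 0, 1)
print(sorted(only_e))  # [('a', 'e'), ('d', 'e'), ('f', 'e')]
```

Passing other `plus` and `times` functions (for example boolean `or` and `and`) gives the other annotated-table semantics without changing the operators.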

Page 14: Provenance Semirings

RA+ identities imply semiring structure!

●Common RA+ identities

▪Union and join are associative, commutative

▪Join distributes over union

▪etc. (but not idempotence!)

These identities hold for RA+ on K-relations iff (K, +, ·, 0, 1) is a commutative semiring:

▪(K, +, 0) is a commutative monoid

▪(K, ·, 1) is a commutative monoid

▪· distributes over +, etc.
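The semiring laws can be spot-checked on a finite sample of elements. The sketch below is only a sanity check under that assumption, not a proof, and the function names are hypothetical. It also illustrates the remark that idempotence of + is not required: it holds for booleans but fails for natural numbers.

```python
from itertools import product
from operator import add, mul


def is_commutative_semiring(samples, plus, times, zero, one):
    """Check the commutative semiring laws on a finite sample (a spot check only)."""
    for a, b, c in product(samples, repeat=3):
        if plus(plus(a, b), c) != plus(a, plus(b, c)):
            return False  # + is not associative
        if times(times(a, b), c) != times(a, times(b, c)):
            return False  # * is not associative
        if times(a, plus(b, c)) != plus(times(a, b), times(a, c)):
            return False  # * does not distribute over +
    for a, b in product(samples, repeat=2):
        if plus(a, b) != plus(b, a) or times(a, b) != times(b, a):
            return False  # an operation is not commutative
    for a in samples:
        if plus(a, zero) != a or times(a, one) != a or times(a, zero) != zero:
            return False  # identity or annihilation law fails
    return True


def is_idempotent(samples, plus):
    """True if a + a = a for every sampled element."""
    return all(plus(a, a) == a for a in samples)


naturals = list(range(5))
booleans = [False, True]
b_or = lambda a, b: a or b
b_and = lambda a, b: a and b

nat_ok = is_commutative_semiring(naturals, add, mul, 0, 1)
bool_ok = is_commutative_semiring(booleans, b_or, b_and, False, True)
nat_idem = is_idempotent(naturals, add)
bool_idem = is_idempotent(booleans, b_or)

print(nat_ok, bool_ok)      # True True
print(nat_idem, bool_idem)  # False True
```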

Page 15: Provenance Semirings

Calculations on annotated tables are particular cases

  (B, ∨, ∧, false, true)            usual relational algebra
  (N, +, ·, 0, 1)                   bag semantics
  (PosBool(B), ∨, ∧, false, true)   boolean C-tables
  (P(Ω), ∪, ∩, ∅, Ω)                probabilistic event tables
  (P(X), ∪, ∪, ∅, ∅)                lineage/why-provenance

Page 16: Provenance Semirings

Provenance Semirings

●X = {p, r, s, …}: indeterminates (provenance “tokens” for base tuples)

●N[X]: multivariate polynomials with coefficients in N and indeterminates in X

●(N[X], +, ·, 0, 1) is the most “general” commutative semiring: its elements abstract calculations in all semirings

●N[X]-relations are the relations with provenance!

▪The polynomials capture the propagation of provenance through (positive) relational algebra

Page 17: Provenance Semirings

A provenance calculation

R:

  a b c   p
  d b e   r
  f g e   s

q(x,z) :- R(x, _,z), R(_, _,z)
q(x,z) :- R(x,y, _), R(_ ,y,z)

q(R), with provenance and lineage:

        provenance   lineage
  a c   2p²          {p}
  a e   pr           {p,r}
  d c   pr           {p,r}
  d e   2r² + rs     {r,s}
  f e   2s² + rs     {r,s}

●Not just why- but also how-provenance (encodes derivations)!

●More informative than lineage

▪same lineage, different provenance (d e and f e)
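The provenance polynomials on this slide can be recomputed with a small N[X] implementation. The sketch below is illustrative only: the representation of a polynomial as a dict from monomials (sorted tuples of variable names) to natural coefficients, and the names `p_add`, `p_mul`, `var`, and `show`, are assumptions and not from the slides. Exponents are printed with `^`.

```python
from collections import Counter

# A polynomial in N[X] is a dict {monomial: coefficient};
# a monomial is a sorted tuple of variable names, e.g. ("r", "r") for r^2.


def var(x):
    """The polynomial consisting of the single indeterminate x."""
    return {(x,): 1}


def p_add(f, g):
    out = dict(f)
    for m, c in g.items():
        out[m] = out.get(m, 0) + c
    return out


def p_mul(f, g):
    out = {}
    for m1, c1 in f.items():
        for m2, c2 in g.items():
            m = tuple(sorted(m1 + m2))
            out[m] = out.get(m, 0) + c1 * c2
    return out


def show(f):
    """Render a polynomial as text, e.g. '2r^2 + rs'."""
    terms = []
    for m in sorted(f):
        powers = Counter(m)
        body = "".join(x if e == 1 else f"{x}^{e}" for x, e in sorted(powers.items()))
        terms.append(body if f[m] == 1 and body else f"{f[m]}{body}")
    return " + ".join(terms) or "0"


# Base tuples annotated with their provenance tokens.
R = {("a", "b", "c"): var("p"), ("d", "b", "e"): var("r"), ("f", "g", "e"): var("s")}

q = {}
for t1, k1 in R.items():
    for t2, k2 in R.items():
        outputs = []
        if t1[2] == t2[2]:  # q(x,z) :- R(x,_,z), R(_,_,z)
            outputs.append((t1[0], t1[2]))
        if t1[1] == t2[1]:  # q(x,z) :- R(x,y,_), R(_,y,z)
            outputs.append((t1[0], t2[2]))
        for t in outputs:
            q[t] = p_add(q.get(t, {}), p_mul(k1, k2))

for t in sorted(q):
    print(t, show(q[t]))
```

The tuples d e and f e come out with different polynomials even though both mention exactly the tokens r and s, which is the distinction lineage cannot express.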

Page 18: Provenance Semirings

Trust assessment

p: justified by Moe; r: justified by Larry; s: justified by Curly

R:

  a b c   p
  d b e   r
  f g e   s

q(x,z) :- R(x, _,z), R(_, _,z)
q(x,z) :- R(x,y, _), R(_ ,y,z)

q(R):

  a c   2p²         2 alternatives, both need Moe, twice
  a e   pr          Needs both Moe and Larry
  d c   pr
  d e   2r² + rs    One alternative needs Larry and Curly; two others only need Larry, twice
  f e   2s² + rs

Which output tuples can be trusted after Larry is jailed?
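One way to answer the question is to evaluate each provenance polynomial in the boolean semiring, with Larry's token r set to false. The sketch below is an assumption-laden illustration: the polynomial encoding and the function name `evaluate` are hypothetical. The same function also recovers the bag multiplicities when run over (N, +, ·, 0, 1) with p=2, r=5, s=1.

```python
from operator import add, mul

# Provenance polynomials of q(R) from the slide, as lists of
# (coefficient, variables of the monomial); "rr" stands for r^2.
provenance = {
    ("a", "c"): [(2, "pp")],
    ("a", "e"): [(1, "pr")],
    ("d", "c"): [(1, "pr")],
    ("d", "e"): [(2, "rr"), (1, "rs")],
    ("f", "e"): [(2, "ss"), (1, "rs")],
}


def evaluate(poly, valuation, plus, times, zero, one):
    """Evaluate a polynomial in a semiring, given values for its variables."""
    total = zero
    for coeff, monomial in poly:
        term = one
        for x in monomial:
            term = times(term, valuation[x])
        for _ in range(coeff):  # a coefficient n means an n-fold sum
            total = plus(total, term)
    return total


# Trust: Moe (p) and Curly (s) are trusted, Larry (r) is not.
trust = {"p": True, "r": False, "s": True}
trusted = {
    t
    for t, poly in provenance.items()
    if evaluate(poly, trust, lambda a, b: a or b, lambda a, b: a and b, False, True)
}
print(sorted(trusted))  # [('a', 'c'), ('f', 'e')]

# Bag semantics: the same polynomials evaluated over the natural numbers.
counts = {"p": 2, "r": 5, "s": 1}
multiplicity = {t: evaluate(poly, counts, add, mul, 0, 1) for t, poly in provenance.items()}
print(multiplicity[("d", "e")])  # 55
```

Only a c and f e keep a derivation that does not involve Larry, so only they remain trusted.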

Page 19: Provenance Semirings

More Technical Development

●The semiring structure on annotations works out nicely for (positive) relational algebra.

●What more do we need for Datalog queries?

●ω-continuous semirings (so fixed points exist)!

▪N is not ω-continuous, but N∞ ≜ N ∪ {∞} is

●Here we show only what we need for Datalog provenance (formal power series)

Page 20: Provenance Semirings

Beyond RA+: Datalog

r1: q(X, Y) :- R(X, Y)
r2: q(X, Y) :- q(X, Z), q(Z, Y)

R:

  a b
  a c
  c b
  b d
  d d

[Figure: derivation trees for q(a d). R(a b) and R(b d) yield q(a b) and q(b d) by r1, and these yield q(a d) by r2. R(d d) yields q(d d) by r1, and r2 applied to q(a d) and q(d d) yields q(a d) again, and so on (…).]

Page 21: Provenance Semirings

Provenance: Encoding Infinite Derivations

●Polynomials do not suffice, since they are finite!

●Instead, we use infinite formal power series

●Nonetheless, provenance is finitely representable through a system of equations

Page 22: Provenance Semirings

Provenance equations

r1: q(X, Y) :- R(X, Y)
r2: q(X, Y) :- q(X, Z), q(Z, Y)

R:

  a b   m
  a c   n
  c b   p
  b d   r
  d d   s

q(R):

  a b   x
  a c   y
  c b   z
  b d   u
  d d   v
  a d   w

  x = m + yz
  y = n
  z = p
  u = r + uv
  v = s + v²
  w = xu + wv

The provenances x, y, etc. are the power series that solve this system of equations (see next)

Polynomials are the provenance of the immediate consequence operator (in RA+)

Page 23: Provenance Semirings

Solutions: formal power series

  x = m + np
  y = n
  z = p
  v = s + s² + 2s³ + 5s⁴ + 14s⁵ + …
  u = r v*
  w = r(m+np)(v*)²

where v* ≜ 1 + v + v² + v³ + …

The coefficients of v have the form (2k)! / (k!(k+1)!)

In general we need coefficients from N∞ ≜ N ∪ {∞}
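The series for v can be checked by iterating its equation v = s + v² from v = 0 and keeping only finitely many coefficients. This is a minimal sketch under that truncation assumption; the function names `mul_trunc` and `solve_v` are hypothetical. The coefficients are then compared with the closed form (2k)! / (k!(k+1)!), computed as a binomial coefficient divided by k+1.

```python
from math import comb


def mul_trunc(f, g, n):
    """Multiply two power series given as coefficient lists, keeping degrees below n."""
    out = [0] * n
    for i, a in enumerate(f):
        if a == 0:
            continue
        for j, b in enumerate(g):
            if i + j >= n:
                break
            out[i + j] += a * b
    return out


def solve_v(n):
    """Iterate v <- s + v^2 from v = 0; returns the coefficients of s^0 .. s^(n-1)."""
    s = [0] * n
    s[1] = 1
    v = [0] * n
    for _ in range(n):  # each round fixes at least one more coefficient
        v = [a + b for a, b in zip(s, mul_trunc(v, v, n))]
    return v


coeffs = solve_v(6)
print(coeffs)  # [0, 1, 1, 2, 5, 14]

# Closed form: the coefficient of s^(k+1) is (2k)! / (k! (k+1)!).
closed_form = [comb(2 * k, k) // (k + 1) for k in range(5)]
print(closed_form)  # [1, 1, 2, 5, 14]
```

Each iteration corresponds to one application of the immediate consequence operator, so the truncated iterates are polynomials that approximate the power series from below.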

Page 24: Provenance Semirings

Algorithmic results for Datalog provenance

●Given t ∈ q(I), it is decidable whether the provenance of t is a proper (infinite) power series

●From CFG ambiguity, we know that testing whether all coefficients are ≤ 1 is undecidable

●However, given t ∈ q(I) and a monomial μ, the coefficient of μ in the power series that is the provenance of t is computable (including when it is ∞)

Page 25: Provenance Semirings

Related Work

●Foundations: semirings/systems of equations/formal power series first used in CS in theory of formal languages [Chomsky,Schutzenberger 1963]

●Our work is related to and shares similar goals with “Debugging schema mappings with routes” [Chiticariu,Tan VLDB2006], where “routes” are like minimal finite portions of our how-provenance

▪See also tutorial at SIGMOD tomorrow!

Page 26: Provenance Semirings

Further work

●Application: P2P data sharing in the ORCHESTRA system (thanks to our collaborator Zack Ives):

▪Need to express trust conditions based on provenance of tuples

▪Incremental propagation of deletions

▪Semiring provenance itself is incrementally maintainable

▪See demo of ORCHESTRA in SIGMOD on Thursday!

●Future extension: full relational algebra. For difference we need semirings with “proper subtraction”