Circuits for Datalog Provenance
Daniel Deutch (Tel Aviv Univ.), Tova Milo (Tel Aviv Univ.), Sudeepa Roy (Univ. of Washington), Val Tannen (Univ. of Pennsylvania)


Post on 23-Feb-2016


Page 2: A Simple Example of Data Provenance

Database D (each tuple is annotated with a Boolean variable):

AsthmaPatient
  Ann   x1
  Bob   x2

Friend
  Ann  Joe   y1
  Ann  Tom   y2
  Bob  Tom   y3

Smoker
  Joe   z1
  Tom   z2

Boolean query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$

"Boolean Provenance/Lineage" as a Boolean formula:
$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$

- Q is true on D $\Leftrightarrow$ $F_{Q,D}$ is true
- Poly-size, poly-time computable (data complexity)
- But Q is an RA+ query
- This talk: What if Q is a Datalog program?

Page 3: Motivation

Provenance
- Reliability and repeatability
- View management and deletion propagation
- Trust and security management
- Query answering in probabilistic databases, ...

Datalog
- Datalog is popular again! (two keynotes this ICDT/EDBT)
- Data extraction on the Web, declarative networking
- Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna)

Finding a suitable "Provenance for Datalog" is important
- Both from theoretical and practical viewpoints

How do we compute, store, and interpret provenance for Datalog programs efficiently and effectively?

Page 4: Overview of Our Results

Can we get poly-size Boolean formulas for Datalog provenance?
- No, even if we allow unbounded time

Do we have a solution?
- Yes! Use Boolean circuits!

What about general "provenance semirings" beyond Boolean provenance? (ref. [Green et al. '07])
- It depends on the semiring

Page 5: Outline

- Background
- Circuits for Boolean Provenance
- Circuits for General Provenance Semirings


Page 7: Datalog

Datalog program for Transitive Closure and Single-source Reachability:

T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)

EDB (extensional database, base) relation for edges: R

IDB (intensional database, derived) relations:
- Transitive closure (T)
- Single-source reachability from vertex 'a' (S)
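To make the three rules concrete, here is a minimal Python sketch of naive bottom-up evaluation. The edge set `R` and the function names are illustrative assumptions, not taken from the talk.

```python
# Naive bottom-up evaluation of the transitive-closure program.
# The edge set R below is an illustrative assumption, not data from the talk.

def transitive_closure(R):
    """Least fixpoint of  T(x,y) :- R(x,y)  and  T(x,y) :- R(x,z), T(z,y)."""
    T = set(R)                      # first rule: every edge is in T
    while True:
        # second rule: join R(x,z) with T(z,y) on the shared vertex z
        derived = {(x, y) for (x, z) in R for (z2, y) in T if z == z2}
        if derived <= T:            # nothing new was derived: fixpoint reached
            return T
        T |= derived

def reachable_from(T, source):
    """S(x) :- T(source, x)."""
    return {y for (x, y) in T if x == source}

R = {("a", "b"), ("b", "c"), ("c", "d")}
T = transitive_closure(R)
S = reachable_from(T, "a")
print(sorted(T))
print(sorted(S))
```

On this chain of three edges, T contains all six forward pairs and S contains b, c, and d.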

Page 8: Boolean Provenance: PosBool(X)-Database

Tuples are annotated with variables from a set X
- Here X = {x1, x2, y1, y2, ...}

For n tuples annotated from X, there are $2^n$ possible worlds, given by assignments $\mu : X \to \{\text{True}, \text{False}\}$

Useful in query evaluation on incomplete or probabilistic databases

PosBool(X)-database D:

AsthmaPatient
  Ann   x1
  Bob   x2

Friend
  Ann  Joe   y1
  Ann  Tom   y2
  Bob  Tom   y3

Smoker
  Joe   z1
  Tom   z2

Page 9: RA+ over PosBool(X)-Database

Annotation propagates from input to output
- Join = $\wedge$, Projection/Union = $\vee$

Output tuples are annotated by monotone Boolean formulas
- $F_{Q,D}$ is the annotation of the unique output tuple

PosBool(X)-database D:
- AsthmaPatient: Ann (x1), Bob (x2)
- Friend: Ann-Joe (y1), Ann-Tom (y2), Bob-Tom (y3)
- Smoker: Joe (z1), Tom (z2)

RA+ Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$

$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$

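Page 10: Two Important Properties: RA+ over PosBool(X)-Database

For every RA+ query Q, database D, and assignment $\mu$:
1. (Faithful representation) $Q(D_\mu) = \mu[Q(D)]$, where $D_\mu$ is the possible world that keeps exactly the tuples whose variables $\mu$ sets to True
2. (Poly-size overhead) The size of $F_{Q,D}$ is polynomial in |D|, and $F_{Q,D}$ can be computed in poly-time

PosBool(X)-database D:
- AsthmaPatient: Ann (x1), Bob (x2)
- Friend: Ann-Joe (y1), Ann-Tom (y2), Bob-Tom (y3)
- Smoker: Joe (z1), Tom (z2)

RA+ Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$

$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$

[Figure: each variable is labeled True or False under an example assignment; the two results shown both come out as False.]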
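The faithful-representation property can be checked by brute force on the running example. The sketch below enumerates all assignments to the seven variables and compares the formula's value with the query evaluated directly on the corresponding possible world. The function and variable names are illustrative assumptions.

```python
from itertools import product

# Annotated database from the running example: tuple -> variable name.
asthma = {("Ann",): "x1", ("Bob",): "x2"}
friend = {("Ann", "Joe"): "y1", ("Ann", "Tom"): "y2", ("Bob", "Tom"): "y3"}
smoker = {("Joe",): "z1", ("Tom",): "z2"}

def provenance_formula(mu):
    """F_{Q,D} under assignment mu: AND within a join, OR across derivations."""
    return ((mu["x1"] and mu["y1"] and mu["z1"]) or
            (mu["x1"] and mu["y2"] and mu["z2"]) or
            (mu["x2"] and mu["y3"] and mu["z2"]))

def query_on_world(mu):
    """Evaluate Q directly on the world that keeps the tuples whose variable is True."""
    A = {t for t, v in asthma.items() if mu[v]}
    F = {t for t, v in friend.items() if mu[v]}
    S = {t for t, v in smoker.items() if mu[v]}
    return any((x,) in A and (y,) in S for (x, y) in F)

variables = sorted(set(asthma.values()) | set(friend.values()) | set(smoker.values()))
worlds = [dict(zip(variables, bits))
          for bits in product([False, True], repeat=len(variables))]
agree = all(provenance_formula(mu) == query_on_world(mu) for mu in worlds)
print(len(worlds), agree)
```

With seven variables there are 2^7 = 128 worlds, and the formula agrees with direct evaluation on every one of them.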

Page 11: Datalog over PosBool(X)-Database

T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)

R
  a  a   p
  a  b   q

Semantics using derivation trees (Green et al. 2007)

Annotation of T(a, b):
$\bigvee_{\text{trees } \tau} \; \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t) = q \vee (p \wedge q) \vee (p \wedge p \wedge q) \vee \cdots = q$

• Infinitely many trees
• But always has a finite equivalent form
• But not necessarily poly-size

[Figure: derivation trees for T(a, b). The smallest has the single leaf R(a, b); each larger tree adds one more R(a, a) leaf above a smaller derivation of T(a, b).]
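The collapse of the infinite disjunction to q uses only two Boolean identities, idempotence and absorption. Spelled out as a short derivation:

```latex
\begin{align*}
\mathrm{Annot}(T(a,b))
  &= q \vee (p \wedge q) \vee (p \wedge p \wedge q) \vee \cdots \\
  &= q \vee (p \wedge q) && \text{idempotence: } p \wedge p = p \\
  &= q                   && \text{absorption: } q \vee (p \wedge q) = q
\end{align*}
```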

Page 12: Lower Bound: Boolean Formulas for Datalog Provenance on PosBool(X)

Theorem: Given a PosBool(X)-database D and a Datalog program P, the provenance of tuples in P(D) cannot, in general, be faithfully represented by Boolean formulas of size polynomial in |D|.

Proof outline:
• st-connectivity on n nodes requires monotone Boolean formulas of size $n^{\Omega(\log n)}$ (Karchmer-Wigderson, 1988)
• Faithful representation requires: for all True/False assignments $\mu$ to X, $P(D_\mu) = \mu[P(D)]$
• Reduce to the hard instance with the right $\mu$ when P = transitive closure

Solution: Boolean circuits!

Page 13: Outline

- Background
- Circuits for Boolean Provenance or PosBool(X)
- Circuits for General Provenance Semirings

Page 14: Boolean Circuits

A circuit is a DAG
- It can reuse common subexpressions
- A Boolean formula is a tree

Leaf nodes:
- EDB vars in X

Internal nodes:
- $\wedge$: IDB/EDB vars used in one derivation
- $\vee$: alternative derivations

Roots:
- IDB vars

Example program and database:

T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)

R
  a  a   p
  a  b   q

[Figure: circuit for $X_{T(a,b)}$ over the leaves q ($X_{R(a,b)}$) and p ($X_{R(a,a)}$), drawn next to the graph on nodes a and b.]

Page 15: Upper Bound: Boolean Circuits for PosBool(X)

Theorem: Given any PosBool(X)-database D and Datalog program P, the provenance of tuples in P(D) can be faithfully represented using monotone Boolean circuits of size polynomial in |D| (and these circuits can be computed in poly-time).

Page 16: Proof Sketch

Two key ideas from previous work:

1. Datalog provenance can be represented by a system of equations, obtained by instantiating the variables of the Datalog program P with EDB/IDB tuples [Green et al. 2007]
• EDB tuples become constants, IDB tuples become variables
• Iteratively solve this system of equations
• Fixpoint = provenance for all IDB tuples

2. A system of equations with N Boolean variables can be solved in N+1 iterations [Esparza et al. 2011]
• N = #IDB tuples
• Build a circuit with N+1 layers from the system of equations

Page 17: Illustration

T(x, y) :- R(x, y)
T(x, y) :- R(x, z), T(z, y)
S(x) :- T(a, x)

R
  a  a   p
  a  b   q

Step 1: Build the system of equations from all possible instantiations of x, y, z by a, b (p and q are constants, the X's are variables):

$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(a)} = X_{T(a,a)}$
$X_{S(b)} = X_{T(a,b)}$

Step 2: Build a circuit with 4 + 1 layers (N = 4) ...
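The iterative solution of this system can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: PosBool(X) elements are stored in minimal DNF as sets of monomials, and the helper names (`minimize`, `lor`, `land`, `solve`) are hypothetical.

```python
# PosBool(X) elements in minimal DNF: a frozenset of monomials, each monomial a
# frozenset of variable names. FALSE is the empty set of monomials.
# This representation and all helper names are assumptions made for the sketch.

def minimize(monomials):
    """Apply absorption: drop any monomial that strictly contains another one."""
    return frozenset(m for m in monomials if not any(o < m for o in monomials))

def lor(f, g):
    return minimize(f | g)

def land(f, g):
    return minimize(frozenset(m | n for m in f for n in g))

def var(name):
    return frozenset([frozenset([name])])

FALSE = frozenset()
p, q = var("p"), var("q")

# Grounded equations from the illustration; each maps current values to a new value.
equations = {
    "T(a,a)": lambda v: lor(p, land(p, v["T(a,a)"])),
    "T(a,b)": lambda v: lor(q, land(p, v["T(a,b)"])),
    "S(a)":   lambda v: v["T(a,a)"],
    "S(b)":   lambda v: v["T(a,b)"],
}

def solve(equations):
    n = len(equations)
    values = {name: FALSE for name in equations}   # all IDB variables start as False
    for _ in range(n + 1):                         # N + 1 rounds, as stated on the slide
        values = {name: rhs(values) for name, rhs in equations.items()}
    return values

solution = solve(equations)
print(solution["T(a,b)"] == q, solution["S(a)"] == p)
```

The fixpoint assigns p to T(a,a) and S(a), and q to T(a,b) and S(b), matching the derivation-tree semantics.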

Page 18: Illustration

$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(a)} = X_{T(a,a)}$
$X_{S(b)} = X_{T(a,b)}$

[Figure: layered circuit. The bottom layer holds the IDB leaves $X_{T(a,a),0}$, $X_{T(a,b),0}$, $X_{S(a),0}$, $X_{S(b),0}$, all set to false; Level 1 and Level 2 hold the corresponding gates with index 1 and 2, each built from the previous level and the EDB leaves p and q.]

- Assign leaf IDB vars to false
- Multiple roots for multiple IDB vars
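A layered circuit of this shape can be built mechanically from the equations. The following is a minimal sketch, assuming a tuple encoding of gates and a hypothetical `build_layered_circuit` helper; it is not the construction from the paper, only an instance of the "N+1 layers" idea.

```python
# Each gate is a tuple: ("const", bool) | ("var", name) | ("and", i, j) | ("or", i, j),
# where i and j index earlier gates, so list order is a topological order.
# The encoding and function names are assumptions made for this sketch.

def build_layered_circuit(equations, idb_vars):
    gates, index = [], {}

    def add(gate):
        if gate not in index:            # share identical gates (DAG, not tree)
            index[gate] = len(gates)
            gates.append(gate)
        return index[gate]

    def compile_expr(expr, prev):
        kind = expr[0]
        if kind == "edb":
            return add(("var", expr[1]))
        if kind == "idb":
            return prev[expr[1]]         # wire to the previous layer's gate
        left, right = compile_expr(expr[1], prev), compile_expr(expr[2], prev)
        return add((kind, left, right))

    layer = {x: add(("const", False)) for x in idb_vars}   # layer 0: IDB leaves are False
    for _ in range(len(idb_vars) + 1):                     # N + 1 layers
        layer = {x: compile_expr(equations[x], layer) for x in idb_vars}
    return gates, layer                                    # layer: IDB var -> root gate

def evaluate(gates, assignment):
    values = []
    for gate in gates:                   # single pass in topological order
        if gate[0] == "const":
            values.append(gate[1])
        elif gate[0] == "var":
            values.append(assignment[gate[1]])
        elif gate[0] == "and":
            values.append(values[gate[1]] and values[gate[2]])
        else:
            values.append(values[gate[1]] or values[gate[2]])
    return values

equations = {
    "T(a,a)": ("or", ("edb", "p"), ("and", ("edb", "p"), ("idb", "T(a,a)"))),
    "T(a,b)": ("or", ("edb", "q"), ("and", ("edb", "p"), ("idb", "T(a,b)"))),
    "S(a)": ("idb", "T(a,a)"),
    "S(b)": ("idb", "T(a,b)"),
}
gates, roots = build_layered_circuit(equations, list(equations))
values = evaluate(gates, {"p": True, "q": False})
print(values[roots["S(a)"]], values[roots["S(b)"]])
```

Each layer adds only a constant number of gates per equation, so the gate count grows with the number of layers times the number of equations rather than with the number of derivation trees.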

Page 19: Optimizations

$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$
$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$
$X_{S(a)} = X_{T(a,a)}$
$X_{S(b)} = X_{T(a,b)}$

1. Store only two levels of the circuit instead of N+1 levels
- Evaluate iteratively

2. Embed circuit construction in semi-naive evaluation
- Check for new derivations, not only new IDB variables
- Sound and complete

3. Remove self-dependency of IDB vars
- Works for PosBool(X) and also some other semirings...

Page 20: Illustration (From here...)

[Figure: the layered circuit from the earlier illustration, with leaf IDB vars set to false and gates indexed 0, 1 (Level 1), and 2 (Level 2) over the EDB leaves p and q.]

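Page 21: Illustration (...To here)

With all these optimizations:

[Figure: two-level circuit over the EDB leaves p and q. Bottom level: $X_{T(a,a),\text{bottom}}$, $X_{T(a,b),\text{bottom}}$, $X_{S(a),\text{bottom}}$. Top level: $X_{T(a,a),\text{top}}$, $X_{T(a,b),\text{top}}$, $X_{S(a),\text{top}}$.]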

Page 22: Applications of PosBool(X)-Circuits

- Deletion propagation in time linear in the circuit size
- Approximation for probabilistic databases, even when only the circuit (and not the database) is available
- Circuits can be computed "offline": only linear-time evaluation is required when needed (e.g. for deletion propagation), compared to storing and iteratively solving a system of equations, or re-evaluating the Datalog program
- Existing techniques for efficient and parallel circuit evaluation can be used
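Deletion propagation on a stored circuit amounts to one evaluation pass: deleted tuples are set to False, everything else to True. The sketch below does this for the asthma example; the list encoding of the circuit and the `survives` helper are illustrative assumptions.

```python
# Circuit in topological order: each gate is ("var", name), ("and", ids) or ("or", ids).
# This hand-written encoding of F_{Q,D} is an assumption made for the sketch.
circuit = [
    ("var", "x1"), ("var", "x2"),                 # gates 0, 1: AsthmaPatient
    ("var", "y1"), ("var", "y2"), ("var", "y3"),  # gates 2, 3, 4: Friend
    ("var", "z1"), ("var", "z2"),                 # gates 5, 6: Smoker
    ("and", [0, 2, 5]),                           # gate 7: derivation via Ann, Joe
    ("and", [0, 3, 6]),                           # gate 8: derivation via Ann, Tom
    ("and", [1, 4, 6]),                           # gate 9: derivation via Bob, Tom
    ("or", [7, 8, 9]),                            # gate 10: root, F_{Q,D}
]

def survives(circuit, deleted):
    """One pass over the gates: deleted tuples are False, all others True."""
    value = []
    for kind, arg in circuit:
        if kind == "var":
            value.append(arg not in deleted)
        elif kind == "and":
            value.append(all(value[i] for i in arg))
        else:
            value.append(any(value[i] for i in arg))
    return value[-1]

print(survives(circuit, {"z2"}))         # Tom removed from Smoker
print(survives(circuit, {"x1", "y3"}))   # Ann removed, Bob-Tom friendship removed
```

Every wire is read once, so the cost is linear in the circuit size and the database itself is never consulted.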


Page 24: Commutative Semirings

$(K, +_K, \cdot_K, 0_K, 1_K)$
- Domain K
- $+_K$, $\cdot_K$: associative, commutative, with neutral elements $0_K$, $1_K$ respectively
- $\cdot_K$ distributes over $+_K$, i.e. $a \cdot_K (b +_K c) = a \cdot_K b +_K a \cdot_K c$
- $0_K$ cancels (annihilates) any element of K, i.e. $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$

Examples:
- $(\mathbb{B}, \vee, \wedge, \text{False}, \text{True})$: set semantics
- $(\mathbb{N}, +, \cdot, 0, 1)$: bag semantics
- $(\mathbb{N} \cup \{\infty\}, \min, +, \infty, 0)$: tropical semiring, to compute cost (e.g. cost of a shortest path)
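The three example semirings can be written down directly and plugged into the same sum-of-products evaluator. This is a sketch; the `Semiring` class, the `sum_of_products` helper, and the input numbers are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable
import math

@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]
    zero: Any
    one: Any

BOOLEAN  = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
COUNTING = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)

def sum_of_products(K, terms):
    """Evaluate a sum of products of K-elements (the shape of F_{Q,D})."""
    total = K.zero
    for factors in terms:
        product = K.one
        for a in factors:
            product = K.mul(product, a)
        total = K.add(total, product)
    return total

# Three alternative derivations, each joining three tuples (made-up annotations).
exists = sum_of_products(BOOLEAN,  [[True, True, False], [True, True, True], [False, True, True]])
count  = sum_of_products(COUNTING, [[1, 2, 1], [1, 1, 3], [2, 1, 3]])
cost   = sum_of_products(TROPICAL, [[4, 1, 2], [4, 5, 0], [1, 1, 3]])
print(exists, count, cost)
```

The same expression yields existence under set semantics, a multiplicity under bag semantics, and the cheapest derivation under the tropical semiring.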

Page 25: Provenance Semirings

Generalization of PosBool(X)

$(K, +_K, \cdot_K, 0_K, 1_K)$
- Tuples are annotated with variables from X
- K is of the form Prov(X)
- $+_K$ denotes alternative usage
- $\cdot_K$ denotes joint usage

Examples:
- $(\mathrm{PosBool}(X), \vee, \wedge, \text{False}, \text{True})$
- $(\mathrm{Lin}(X), \cup, \cup^{*}, \bot, \emptyset)$: tracks contributing tuples [Cui et al. '00]
- $(\mathrm{Why}(X), \cup, \Cup, \emptyset, \{\emptyset\})$, where $\Cup$ is pairwise union of subsets: tracks contributing tuples in alternative derivations [Buneman et al. '01]

Page 26: Provenance Specialization

Key property needed for applications like deletion propagation, trust management, cost computation, ...

Prov(X) specializes correctly to K if any valuation $v : X \to K$ extends uniquely to a homomorphism $h_v : \mathrm{Prov}(X) \to K$ (which correctly maps $+$, $\cdot$ of Prov(X) to those of K)

Further, some provenance semirings are "more informative" than others
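For Prov(X) = N[X], the homomorphism h_v is simply "substitute and evaluate". The sketch below applies one valuation to a made-up polynomial and specializes it to bag semantics and to the tropical semiring; the polynomial, the valuation, and the `specialize` helper are all illustrative assumptions.

```python
import math

# A polynomial in N[X] as a list of (coefficient, {variable: exponent}) terms.
# Hypothetical example polynomial: 2*p*q + p^2
poly = [(2, {"p": 1, "q": 1}), (1, {"p": 2})]

def specialize(poly, valuation, add, mul, zero, one):
    """Apply h_v: replace each variable by valuation[var] and +, * by K's operations."""
    total = zero
    for coeff, monomial in poly:
        term = one
        for variable, exponent in monomial.items():
            for _ in range(exponent):
                term = mul(term, valuation[variable])
        for _ in range(coeff):          # coefficient n means the term is added n times
            total = add(total, term)
    return total

v = {"p": 3, "q": 2}
bag = specialize(poly, v, lambda a, b: a + b, lambda a, b: a * b, 0, 1)
cost = specialize(poly, v, min, lambda a, b: a + b, math.inf, 0)
print(bag, cost)
```

Under bag semantics the polynomial evaluates to 2·3·2 + 3² = 21; under the tropical semiring it evaluates to min(3+2, 3+2, 3+3) = 5.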

Page 27: Provenance Semiring Hierarchy

[Figure: hierarchy diagram ordered from more informative to less informative, with arrows meaning "specializes correctly". Nodes: N[X], Why(X), Lin(X), PosBool(X), Sorp(X) (defined later), Tropical, N (bag), Security, Boolean (set).]

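Page 28: Datalog Provenance for General Semirings

PosBool(X): $\bigvee_{\text{trees } \tau} \; \bigwedge_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$

General Prov(X): $\sum_{\text{trees } \tau} \; \prod_{\text{leaves } t \text{ of } \tau} \mathrm{Annot}(t)$, with the sum and product taken as $+_K$ and $\cdot_K$

• Infinite sums should be well-defined
• Need to consider "$\omega$-continuous semirings" and "$\omega$-continuous homomorphisms"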

Page 29: Provenance Semiring Hierarchy

[Figure: the hierarchy diagram with nodes N[X], Why(X), Lin(X), PosBool(X), Sorp(X), Tropical, N (bag), Security, Boolean (set). Two callouts mark groups of nodes: "Finite so $\omega$-continuous" and "Need to add $\infty$".]

N[[X]] and N N[[X]] : Most informative provenance semiring [Green et al. ’07]

Page 30: How good is N[[X]] w.r.t. Size of Datalog Provenance?

Poly-size overhead is not valid because of the infinite sum

But can outputs have finite annotations (with X, $\cdot$, $+$) that specialize correctly to semirings with finite domains?

Theorem: It is not possible to annotate the output of Datalog programs following N[[X]]-semantics with finite provenance expressions that specialize "correctly" to the semiring Why(X).
- Finite annotations won't specialize correctly to Why(X)

Theorem: However, we can generate poly-size circuits in poly-time directly for Why(X)
- Need more levels in the circuit built from the system of equations
- Need a different argument for correctness

Page 31: Can we still have a good general semiring w.r.t. size?

1. Specializes correctly to interesting semirings
2. Outputs can be annotated by poly-size circuits

We propose Sorp(X)
- Most general absorptive semiring: $a + a \cdot b = a$
- N[X], but keeping only the monomials that are not "absorbed" by the others, e.g. $pq + p^2q^3 = pq$, while $p^2q + pq^2$ stays $p^2q + pq^2$

The same algorithm, proof, and optimizations for constructing poly-size circuits hold
- Circuits are more general than Boolean circuits
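The absorption rule a + a·b = a can be illustrated on the two example polynomials: a monomial is dropped when another monomial divides it. This is a sketch; the monomial encoding and the helper names are assumptions, not the paper's definitions.

```python
# A monomial is a frozenset of (variable, exponent) pairs; a polynomial is a set of
# monomials. Coefficients are dropped, since absorption gives a + a = a + a*1 = a.
# Encoding and helper names are assumptions made for this sketch.

def mono(**exponents):
    return frozenset(exponents.items())

def divides(m, n):
    """m divides n: every exponent in m is <= the matching exponent in n."""
    n_exp = dict(n)
    return all(e <= n_exp.get(v, 0) for v, e in m)

def absorb(poly):
    """Keep only monomials not absorbed by a different monomial (a + a*b = a)."""
    return {n for n in poly if not any(m != n and divides(m, n) for m in poly)}

first = absorb({mono(p=1, q=1), mono(p=2, q=3)})    # pq + p^2 q^3
second = absorb({mono(p=2, q=1), mono(p=1, q=2)})   # p^2 q + p q^2
print(first == {mono(p=1, q=1)})
print(second == {mono(p=2, q=1), mono(p=1, q=2)})
```

In the first polynomial pq divides p²q³, so only pq remains; in the second neither monomial divides the other, so both are kept.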

Page 32: Provenance Semiring Hierarchy

[Figure: the hierarchy diagram with nodes N[X], Why(X), Lin(X), PosBool(X), Sorp(X), Tropical, N (bag), Security, Boolean (set).]

Page 33: Related Work

Data Provenance
- e.g. [Cui et al. '00, Buneman et al. '08, Cheney et al. '09, Benjelloun et al. '08]

Circuits
- Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak '09] (book)

Provenance for Datalog
- System of equations, derivation trees, infinite sum [Grahne '91, Green et al. '07]
- Poly-size c-tables with Boolean formulas for Datalog with contradictions [Abiteboul et al. 2014]

Page 34: Conclusions

Circuits to represent and store Datalog provenance
- For PosBool(X) and other semirings
- Semantics, algorithms, limitations, applicability
- Preliminary experiments support our results: we compared circuits for deletion propagation with iteratively solving the system of equations and with re-evaluating the Datalog program from scratch

Future work:
- A complete implementation, evaluation, new applications

Page 35: Thank You

Questions?