# Circuits for Datalog Provenance

Circuits for Datalog Provenance Daniel Deutch Tel Aviv Univ. Tova Milo Tel Aviv Univ. Sudeepa Roy Univ. of Washington Val Tannen Univ. of Pennsylvania

Post on 23-Feb-2016


Circuits for Datalog Provenance

Daniel Deutch, Tel Aviv Univ.

Tova Milo, Tel Aviv Univ.

Sudeepa Roy, Univ. of Washington

Val Tannen, Univ. of Pennsylvania

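A Simple Example of Data Provenance

Database D (each tuple is annotated with a Boolean variable):

| Relation | Tuple | Annotation |
|---|---|---|
| AsthmaPatient | Ann | $x_1$ |
| AsthmaPatient | Bob | $x_2$ |
| Friend | (Ann, Joe) | $y_1$ |
| Friend | (Ann, Tom) | $y_2$ |
| Friend | (Bob, Tom) | $y_3$ |
| Smoker | Joe | $z_1$ |
| Smoker | Tom | $z_2$ |

Boolean query Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$

$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$

“Boolean Provenance/Lineage” as a Boolean formula
– Q is true on D $\Leftrightarrow$ $F_{Q,D}$ is true
– Poly-size, poly-time computable (data complexity)
– But Q is an RA+ query

This talk: What if Q is a Datalog program?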

3

Motivation

Provenance
– Reliability and repeatability
– View management and deletion propagation
– Trust and security management
– Query answering in probabilistic databases, …

Datalog
– Datalog is popular again! (two keynotes this ICDT/EDBT)
– Data extraction on the Web, declarative networking
– Academic/commercial systems (Webdamlog, LogicBlox, Dedalus, Dyna)

Finding suitable “Provenance for Datalog” is important
– Both from theoretical and practical viewpoints

How do we compute, store, and interpret provenance for datalog programs efficiently and effectively?

4

Overview of Our Results

Can we get poly-size Boolean formulas for datalog provenance?
– No, even if we allow unbounded time

Do we have a solution?
– Yes! Use Boolean circuits!

What about general “provenance semirings” beyond Boolean provenance [Green et al. ’07]?
– It depends on the semiring

5

Outline

– Background
– Circuits for Boolean Provenance
– Circuits for General Provenance Semirings


7

Datalog

Datalog program for Transitive Closure and Single-source Reachability:

T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

EDB (Extensional Database, i.e. base) relation for edges: R

IDB (Intensional Database, i.e. derived) relations
─ Transitive closure (T)
─ Single-source reachability from vertex ‘a’ (S)
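The program above can be evaluated bottom-up until no new facts appear. The following is a minimal Python sketch of that naive evaluation, not the authors' implementation; the function name `transitive_closure` and the set-of-pairs encoding of R are assumptions made for illustration.

```python
def transitive_closure(edges, source):
    """Naive bottom-up evaluation of the transitive-closure program.

    edges  -- EDB relation R as a set of (x, y) pairs (assumed encoding)
    source -- the constant 'a' in the rule S(x) :- T(a, x)
    """
    T = set(edges)                            # T(x, y) :- R(x, y)
    while True:
        # T(x, y) :- R(x, z), T(z, y)
        new = {(x, y) for (x, z) in edges for (z2, y) in T if z == z2}
        if new <= T:                          # fixpoint: no new IDB facts
            break
        T |= new
    S = {y for (x, y) in T if x == source}    # S(x) :- T(a, x)
    return T, S
```

For example, calling `transitive_closure({("a", "b"), ("b", "c")}, "a")` derives the extra fact T(a, c) and returns S = {b, c}.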

8

Boolean Provenance: PosBool(X)-Database

– Tuples are annotated with variables from a set X; here X = {$x_1, x_2, y_1, y_2, \ldots$}
– For n annotated tuples, there are $2^n$ possible worlds, one per assignment $\mu : X \to \{\text{True}, \text{False}\}$
– Useful in query evaluation on incomplete or probabilistic databases

PosBool(X)-database D:
– AsthmaPatient: Ann ($x_1$), Bob ($x_2$)
– Friend: (Ann, Joe) ($y_1$), (Ann, Tom) ($y_2$), (Bob, Tom) ($y_3$)
– Smoker: Joe ($z_1$), Tom ($z_2$)

9

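RA+ over PosBool(X)-Database

– Annotation propagates from input to output: Join = $\wedge$, Projection/Union = $\vee$
– Output tuples are annotated by monotone Boolean formulas; $F_{Q,D}$ is the annotation of the unique output tuple

PosBool(X)-database D:
– AsthmaPatient: Ann ($x_1$), Bob ($x_2$)
– Friend: (Ann, Joe) ($y_1$), (Ann, Tom) ($y_2$), (Bob, Tom) ($y_3$)
– Smoker: Joe ($z_1$), Tom ($z_2$)

RA+ Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$

$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$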

10

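Two Important Properties: RA+ over PosBool(X)-Database

For every RA+ query Q, database D, and assignment $\mu$:

1. (Faithful representation) $Q(\mu(D)) = \mu[Q(D)]$

2. (Poly-size overhead) The size of $F_{Q,D}$ is polynomial in |D|, and $F_{Q,D}$ can be computed in poly-time.

PosBool(X)-database D:
– AsthmaPatient: Ann ($x_1$), Bob ($x_2$)
– Friend: (Ann, Joe) ($y_1$), (Ann, Tom) ($y_2$), (Bob, Tom) ($y_3$)
– Smoker: Joe ($z_1$), Tom ($z_2$)

RA+ Q: $\exists x\, \exists y\; \mathrm{AsthmaPatient}(x) \wedge \mathrm{Friend}(x, y) \wedge \mathrm{Smoker}(y)$

$F_{Q,D} = (x_1 \wedge y_1 \wedge z_1) \vee (x_1 \wedge y_2 \wedge z_2) \vee (x_2 \wedge y_3 \wedge z_2)$

[Slide animation: a True/False assignment to the seven variables under which both $Q(\mu(D))$ and $\mu(F_{Q,D})$ evaluate to False.]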
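The faithful-representation property can be checked exhaustively on the running example, since there are only $2^7 = 128$ possible worlds. The sketch below is illustrative only; the dictionary encoding of D and the helper names (`provenance`, `query_on_world`, `faithful_everywhere`) are assumptions, not part of the talk.

```python
from itertools import product

# Annotated database D from the running example: variable -> tuple.
ASTHMA = {"x1": "Ann", "x2": "Bob"}
FRIEND = {"y1": ("Ann", "Joe"), "y2": ("Ann", "Tom"), "y3": ("Bob", "Tom")}
SMOKER = {"z1": "Joe", "z2": "Tom"}
VARS = sorted({**ASTHMA, **FRIEND, **SMOKER})

def provenance(mu):
    """Evaluate F_{Q,D} under the assignment mu (dict: variable -> bool)."""
    return ((mu["x1"] and mu["y1"] and mu["z1"])
            or (mu["x1"] and mu["y2"] and mu["z2"])
            or (mu["x2"] and mu["y3"] and mu["z2"]))

def query_on_world(mu):
    """Evaluate Q directly on the possible world mu(D)."""
    asthma = {t for v, t in ASTHMA.items() if mu[v]}
    friend = {t for v, t in FRIEND.items() if mu[v]}
    smoker = {t for v, t in SMOKER.items() if mu[v]}
    return any(x in asthma and y in smoker for (x, y) in friend)

def faithful_everywhere():
    """Check Q(mu(D)) == mu(F_{Q,D}) for every assignment mu."""
    for values in product([False, True], repeat=len(VARS)):
        mu = dict(zip(VARS, values))
        if provenance(mu) != query_on_world(mu):
            return False
    return True
```

Brute force is only feasible here because the example is tiny; the point of the property is that $F_{Q,D}$ answers the query in every world without touching the database again.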

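11

Datalog over PosBool(X) Database

Semantics using Derivation Trees (Green et al. 2007)

T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

R: tuple (a, a) annotated p; tuple (a, b) annotated q. [Graph: a self-loop on a and an edge from a to b.]

Annotation of T(a, b):

$\bigvee_{\text{trees } t} \; \bigwedge_{\text{leaves } \ell \text{ of } t} \mathrm{Annot}(\ell) = q \vee (p \wedge q) \vee (p \wedge p \wedge q) \vee \ldots = q$

Derivation trees of T(a, b):
– T(a, b) derived from R(a, b)
– T(a, b) derived from R(a, a) and T(a, b), where the inner T(a, b) is derived from R(a, b)
– T(a, b) derived from R(a, a) and T(a, b), where the inner T(a, b) is derived from R(a, a) and T(a, b), and so on

• Infinitely many trees
• But the annotation always has a finite equivalent form
• But not necessarily poly-size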
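The collapse of the infinite disjunction to $q$ uses only the idempotence and absorption laws of Boolean algebra. A worked derivation, with $k$ denoting the number of times the recursive rule is applied (a name introduced here for illustration):

```latex
\begin{align*}
\mathrm{Annot}(T(a,b))
  &= \bigvee_{k \ge 0} \bigl(\underbrace{p \wedge \cdots \wedge p}_{k \text{ times}} \wedge\, q\bigr) \\
  &= q \vee \bigvee_{k \ge 1} (p \wedge q)
     && \text{(idempotence: } p \wedge p = p\text{)} \\
  &= q \vee (p \wedge q)
     && \text{(idempotence: } a \vee a = a\text{)} \\
  &= q
     && \text{(absorption: } a \vee (b \wedge a) = a\text{)}
\end{align*}
```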

12

Lower Bound: Boolean formulas for Datalog Provenance on PosBool(X)

Theorem: Given a PosBool(X)-database D and a datalog program P, the provenance of tuples in P(D) cannot, in general, be faithfully represented by Boolean formulas of size polynomial in |D|.

Proof outline:
• st-connectivity on n nodes requires monotone Boolean formulas of size $n^{\Omega(\log n)}$ [Karchmer-Wigderson, 1988]
• Faithful representation requires: for all True/False assignments $\mu$ to X, $P(\mu(D)) = \mu[P(D)]$
• Reduce to the hard instance by choosing the right $\mu$ when P = transitive closure

Solution: Boolean circuits!

13

Outline

– Background
– Circuits for Boolean Provenance or PosBool(X)
– Circuits for General Provenance Semirings

14

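Boolean Circuits

Circuit is a DAG
– can share common subexpressions
– Boolean formula = tree

Leaf nodes:
– EDB vars in X

Internal nodes:
– $\wedge$: IDB/EDB vars used in one derivation
– $\vee$: alternative derivations

Roots:
– IDB vars

Example program:

T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

R: tuple (a, a) annotated p; tuple (a, b) annotated q. [Graph: a self-loop on a and an edge from a to b.]

[Figure: circuit for $X_{T(a,b)}$, built from the leaves $q = X_{R(a,b)}$ and $p = X_{R(a,a)}$ and a lower-level copy of $X_{T(a,b)}$.]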
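The difference between a circuit (DAG) and a formula (tree) can be seen by counting gates with and without sharing. The sketch below uses a toy monotone circuit that is not taken from the talk; the `Gate` class and the `chain` construction are assumptions chosen only to make the size gap visible.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Gate:
    op: str                         # "var", "and", or "or"
    name: str = ""                  # variable name, used by leaves only
    kids: Tuple["Gate", ...] = ()   # input gates

def circuit_size(root):
    """Number of distinct gates: a shared sub-circuit is counted once."""
    seen, stack = set(), [root]
    while stack:
        g = stack.pop()
        if id(g) not in seen:
            seen.add(id(g))
            stack.extend(g.kids)
    return len(seen)

def formula_size(g):
    """Size after unfolding the DAG into a tree, i.e. into a formula."""
    return 1 + sum(formula_size(k) for k in g.kids)

def chain(depth):
    """Toy circuit: each AND gate feeds the same sub-circuit to both inputs."""
    p, q = Gate("var", "p"), Gate("var", "q")
    node = Gate("or", kids=(p, q))
    for _ in range(depth):
        node = Gate("and", kids=(node, node))
    return node
```

With `depth = 10` the circuit has 13 gates, while the unfolded formula has 4095 nodes: sharing keeps the size linear in the depth, unfolding doubles it at every level.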

15

Upper Bound: Boolean Circuits for PosBool(X)

Theorem: Given any PosBool(X)-database D and datalog program P, the provenance of tuples in P(D) can be faithfully represented using monotone Boolean circuits of size polynomial in |D| (and the circuits can be computed in poly-time).

16

Proof Sketch

Two key ideas from previous work:

1. Datalog provenance can be represented by a system of equations, obtained by instantiating the variables of the datalog program P to EDB/IDB tuples [Green et al. 2007]

• EDB tuples become constants, IDB tuples become variables
• Iteratively solve this system of equations
• Fixpoint = provenance for all IDB tuples

2. A system of equations with N Boolean variables can be solved in N+1 iterations [Esparza et al. 2011]

• N = #IDB tuples
• Build a circuit with N+1 layers from the system of equations

17

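Illustration

T(x, y) :- R(x, y)

T(x, y) :- R(x, z), T(z, y)

S(x) :- T(a, x)

R: tuple (a, a) annotated p; tuple (a, b) annotated q. [Graph: a self-loop on a and an edge from a to b.]

Step 1: Build the system of equations by all possible instantiations of x, y, z with a, b:

$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$

$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$

$X_{S(a)} = X_{T(a,a)}$

$X_{S(b)} = X_{T(a,b)}$

Here p and q are constants (EDB annotations), and the $X$'s are variables (IDB tuples).

Step 2: Build a circuit with 4 + 1 layers (N = 4) …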
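The system of equations can be solved symbolically by iterating from False for N + 1 rounds. The following sketch represents PosBool(X) values as minimized DNFs (a set of monomials, each a set of variable names); this encoding and the helper names `minimize`, `OR`, `AND`, `var`, and `solve` are assumptions for illustration, not the construction used in the talk.

```python
def minimize(dnf):
    """Drop monomials absorbed by a smaller one: a or (a and b) = a."""
    return frozenset(m for m in dnf if not any(o < m for o in dnf))

def OR(a, b):
    return minimize(a | b)

def AND(a, b):
    return minimize(frozenset(m | n for m in a for n in b))

def var(name):
    return frozenset([frozenset([name])])

FALSE = frozenset()   # the empty disjunction

# System of equations from the illustration; p, q are EDB constants.
EQUATIONS = {
    "T(a,a)": lambda X: OR(var("p"), AND(var("p"), X["T(a,a)"])),
    "T(a,b)": lambda X: OR(var("q"), AND(var("p"), X["T(a,b)"])),
    "S(a)":   lambda X: X["T(a,a)"],
    "S(b)":   lambda X: X["T(a,b)"],
}

def solve(equations):
    """Iterate all equations simultaneously, starting from False, for N + 1 rounds."""
    X = {v: FALSE for v in equations}
    for _ in range(len(equations) + 1):
        X = {v: rhs(X) for v, rhs in equations.items()}
    return X
```

On this example the iteration stabilizes after two rounds, with $X_{T(a,b)} = X_{S(b)} = q$ and $X_{T(a,a)} = X_{S(a)} = p$; the remaining rounds leave the values unchanged.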

18

Illustration

$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$

$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$

$X_{S(a)} = X_{T(a,a)}$

$X_{S(b)} = X_{T(a,b)}$

[Figure: layered circuit. The bottom layer holds the EDB leaves p and q together with the IDB nodes $X_{T(a,a),0}$, $X_{T(a,b),0}$, $X_{S(a),0}$, $X_{S(b),0}$, each set to false. Level 1 and Level 2 hold the nodes $X_{T(a,a),i}$, $X_{T(a,b),i}$, $X_{S(a),i}$, $X_{S(b),i}$ for i = 1, 2.]

– Assign leaf IDB vars to false
– Multiple roots for multiple IDB vars
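The layered construction can be sketched by copying each equation once per level and wiring IDB inputs to the previous level. In this sketch the level-0 IDB nodes are the constant false and N + 1 levels are built above them; that reading of "N + 1 layers", the list-of-conjunctions encoding, and the names `build_layers` and `evaluate` are assumptions for illustration.

```python
# Each equation is a disjunction of conjunctions over EDB constants and IDB variables.
EQUATIONS = {
    "T(a,a)": [["p"], ["p", "T(a,a)"]],
    "T(a,b)": [["q"], ["p", "T(a,b)"]],
    "S(a)":   [["T(a,a)"]],
    "S(b)":   [["T(a,b)"]],
}

def build_layers(equations):
    """Return gates {(idb, level): [[input, ...], ...]} for levels 1..N+1.

    An input is an EDB name (a string) or a pair (idb, level - 1).
    Level-0 IDB nodes are the constant false and are not stored.
    """
    n = len(equations)
    gates = {}
    for level in range(1, n + 2):
        for idb, dnf in equations.items():
            gates[(idb, level)] = [
                [(x, level - 1) if x in equations else x for x in conj]
                for conj in dnf
            ]
    return gates

def evaluate(gates, edb):
    """Evaluate every gate once, level by level (linear in the circuit size)."""
    value = {}

    def read(x):
        if isinstance(x, tuple):
            return value.get(x, False)   # level-0 IDB nodes are false
        return edb[x]

    for node in sorted(gates, key=lambda g: g[1]):
        value[node] = any(all(read(x) for x in conj) for conj in gates[node])
    return value
```

The top-level nodes, such as `("S(b)", 5)` here, are the roots that carry the provenance of the IDB tuples.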

19

Optimizations

1. Store only two levels of the circuit instead of N+1 levels
– Evaluate iteratively

2. Embed circuit construction in semi-naïve evaluation
– Check for new derivations, not only new IDB variables
– Sound and complete

3. Remove self-dependency of IDB vars
– Works for PosBool(X) and also some other semirings…

Running example:

$X_{T(a,a)} = p \vee (p \wedge X_{T(a,a)})$

$X_{T(a,b)} = q \vee (p \wedge X_{T(a,b)})$

$X_{S(a)} = X_{T(a,a)}$

$X_{S(b)} = X_{T(a,b)}$

20

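Illustration (From here…)

[Figure: the unoptimized layered circuit. The bottom layer holds the EDB leaves p and q together with the IDB nodes $X_{T(a,a),0}$, $X_{T(a,b),0}$, $X_{S(a),0}$, $X_{S(b),0}$, each set to false. Level 1 and Level 2 hold the nodes $X_{T(a,a),i}$, $X_{T(a,b),i}$, $X_{S(a),i}$, $X_{S(b),i}$ for i = 1, 2.]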

21

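Illustration (…To here)

With all these optimizations:

[Figure: the optimized circuit with two levels. Bottom Level: $X_{T(a,a),\text{bottom}}$, $X_{T(a,b),\text{bottom}}$, $X_{S(a),\text{bottom}}$, with EDB leaves p and q. Top Level: $X_{T(a,a),\text{top}}$, $X_{T(a,b),\text{top}}$, $X_{S(a),\text{top}}$.]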

22

Applications of PosBool(X)-Circuits

• Linear-time deletion propagation (in circuit size)
• Approximation for probabilistic databases, even when only the circuit (and not the database) is available
• Circuits can be computed “offline”: only linear-time evaluation is required when needed (e.g. deletion propagation), compared to storing and solving a system of equations iteratively, or re-evaluating the datalog program
• Can use existing techniques for efficient and parallel circuit evaluation
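Deletion propagation on a stored circuit amounts to setting the deleted EDB leaves to False and evaluating each gate once. The sketch below assumes a hypothetical three-edge graph (a to b, b to c, a to c) and a gate list in topological order; the names `CIRCUIT`, `g1`, and `propagate_deletion` are illustrative and not taken from the talk.

```python
# A circuit as a list of gates in topological order (inputs before outputs).
# Each gate is (name, op, inputs); op is "var", "and", or "or".
CIRCUIT = [
    ("r1", "var", []),                # hypothetical edge a -> b
    ("r2", "var", []),                # hypothetical edge b -> c
    ("r3", "var", []),                # hypothetical edge a -> c
    ("g1", "and", ["r1", "r2"]),      # path a -> b -> c
    ("T(a,c)", "or", ["r3", "g1"]),   # alternative derivations of T(a, c)
]

def propagate_deletion(circuit, deleted):
    """Return the gates that become false once the `deleted` EDB tuples are removed."""
    value = {}
    for name, op, inputs in circuit:  # a single pass over the gates
        if op == "var":
            value[name] = name not in deleted
        elif op == "and":
            value[name] = all(value[i] for i in inputs)
        else:
            value[name] = any(value[i] for i in inputs)
    return {name for name, v in value.items() if not v}
```

Deleting only `r3` leaves T(a, c) derivable through the two-edge path, while deleting both `r1` and `r3` removes it. Neither the database nor the datalog program is consulted.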

23

Outline

– Background
– Circuits for Boolean Provenance or PosBool(X)
– Circuits for General Provenance Semirings

24

Commutative Semirings

$(K, +_K, \cdot_K, 0_K, 1_K)$
– domain K
– $+_K$, $\cdot_K$: associative, commutative, with neutral elements $0_K$, $1_K$
– $\cdot_K$ distributes over $+_K$, i.e. $a \cdot_K (b +_K c) = a \cdot_K b +_K a \cdot_K c$
– $0_K$ annihilates every element of K, i.e. $a \cdot_K 0_K = 0_K \cdot_K a = 0_K$

Examples:
– $(\mathbb{B}, \vee, \wedge, \text{False}, \text{True})$: set semantics
– $(\mathbb{N}, +, \cdot, 0, 1)$: bag semantics
– $(\mathbb{N} \cup \{\infty\}, \min, +, \infty, 0)$: tropical semiring, to compute cost (e.g. cost of a shortest path)
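The three example semirings can be written down as plain records of operations, and one sum-of-products expression can then be evaluated in each of them. This is a sketch under assumed names (`Semiring`, `BOOLEAN`, `BAG`, `TROPICAL`, `evaluate`); the expression is the provenance of the asthma/smoker example with $\vee$, $\wedge$ read as $+$, $\cdot$.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable
import math

@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]
    zero: Any
    one: Any

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
BAG = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)

# Sum of products from the running example: x1 y1 z1 + x1 y2 z2 + x2 y3 z2.
MONOMIALS = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x2", "y3", "z2")]

def evaluate(K, valuation):
    """Evaluate the sum of products in semiring K under `valuation`."""
    total = K.zero
    for monomial in MONOMIALS:
        term = reduce(K.mul, (valuation[v] for v in monomial), K.one)
        total = K.add(total, term)
    return total
```

With every variable mapped to 1, the bag semiring counts the three derivations; with per-tuple costs, the tropical semiring returns the cost of the cheapest derivation.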

25

Provenance Semirings

Generalization of PosBool(X)

$(K, +_K, \cdot_K, 0_K, 1_K)$
– Tuples are annotated with variables from X
– K is of the form Prov(X)
– $+_K$ denotes alternative usage
– $\cdot_K$ denotes joint usage

Examples:
– $(\mathrm{PosBool}(X), \vee, \wedge, \text{False}, \text{True})$
– $(\mathrm{Lin}(X), \cup, \cup^{*}, \bot, \emptyset)$: tracks contributing tuples [Cui et al. ’00]
– $(\mathrm{Why}(X), \cup, \Cup, \emptyset, \{\emptyset\})$, where $\Cup$ is the pairwise union of subsets: tracks contributing tuples in alternative derivations [Buneman et al. ’01]

26

Provenance Specialization

Key property needed for applications like deletion propagation, trust management, cost computation, …

Prov(X) specializes correctly to K if any valuation $v : X \to K$ extends uniquely to a homomorphism $h_v : \mathrm{Prov}(X) \to K$ (which correctly maps $+$ and $\cdot$ of Prov(X) to those of K).
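Spelled out, "extends to a homomorphism" means that $h_v$ agrees with $v$ on the variables and commutes with the semiring operations and constants. These are the standard semiring-homomorphism conditions, written here with the talk's symbols:

```latex
\begin{align*}
h_v(x) &= v(x) \quad \text{for every } x \in X, \\
h_v(a + b) &= h_v(a) +_K h_v(b), \\
h_v(a \cdot b) &= h_v(a) \cdot_K h_v(b), \\
h_v(0) &= 0_K, \qquad h_v(1) = 1_K .
\end{align*}
```

For instance, a True/False assignment to the variables of a PosBool(X) annotation extends to the homomorphism that evaluates the Boolean formula, which is exactly what deletion propagation needs.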

27

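Provenance Semiring Hierarchy

[Figure: hierarchy diagram over the provenance semirings N[X], Why(X), Lin(X), PosBool(X), and Sorp(X) (defined later), and the semirings Tropical, N (bag), Security, and Boolean (set). Legend labels: “Specializes correctly”, “Less informative”.]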

28

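Datalog Provenance for General Semirings

PosBool(X): $\bigvee_{\text{trees } t} \; \bigwedge_{\text{leaves } \ell \text{ of } t} \mathrm{Annot}(\ell)$

General Prov(X): $\sum_{\text{trees } t} \; \prod_{\text{leaves } \ell \text{ of } t} \mathrm{Annot}(\ell)$, with the sum taken using $+_K$ and the product using $\cdot_K$

• Infinite sums should be well-defined

• Need to consider “$\omega$-continuous semirings” and “$\omega$-continuous homomorphisms”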

29

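Provenance Semiring Hierarchy

[Figure: the hierarchy diagram again, over N[X], Why(X), Lin(X), PosBool(X), Sorp(X), Tropical, N (bag), Security, and Boolean (set).]

– Finite, so $\omega$-continuous
– $\mathbb{N}^{\infty}[[X]]$ and $\mathbb{N}^{\infty}$ take the place of N[X] and N; $\mathbb{N}^{\infty}[[X]]$ is the most informative provenance semiring [Green et al. ’07]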

30

How good is $\mathbb{N}^{\infty}[[X]]$ w.r.t. Size of Datalog Provenance?

• Poly-size overhead is not valid because of the infinite sum
• But can outputs have finite annotations (with X, $\cdot$, +) that specialize correctly to semirings with finite domains?

Finite annotations won’t specialize correctly to Why(X)

Theorem: It is not possible to annotate the output of datalog programs following $\mathbb{N}^{\infty}[[X]]$-semantics with finite provenance expressions that specialize “correctly” to the semiring Why(X).

Theorem: However, we can generate poly-size circuits in poly-time directly for Why(X)
─ Need more levels in the circuit from the system of equations
─ Need a different argument for correctness

31

Can we still have a good general semiring w.r.t. size?

1. Specializes correctly to interesting semirings
2. Outputs can be annotated by poly-size circuits

We propose Sorp(X)
– Most general absorptive semiring: $a + a \cdot b = a$
– N[X], but keep only the terms that are not “absorbed” by the others, e.g. $pq + p^2q^3 \equiv pq$, whereas $p^2q + pq^2$ stays $p^2q + pq^2$
– The same algorithm, proof, and optimizations to construct poly-size circuits hold
– Circuits are more general than Boolean circuits
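The absorption rule $a + a \cdot b = a$ says that a monomial is dropped whenever another monomial in the sum divides it. A minimal sketch, assuming monomials are encoded as dictionaries from variable to exponent (the names `divides` and `absorb` are illustrative):

```python
def divides(m, n):
    """True if monomial m divides monomial n (each a dict: variable -> exponent)."""
    return all(n.get(v, 0) >= e for v, e in m.items())

def absorb(monomials):
    """Keep only the monomials not absorbed by another one (a + a*b = a)."""
    kept = []
    for m in monomials:
        if any(divides(k, m) for k in kept):
            continue                      # m is absorbed by a monomial already kept
        kept = [k for k in kept if not divides(m, k)] + [m]
    return kept
```

On the slide's examples, `absorb` reduces $pq + p^2q^3$ to $pq$ and leaves $p^2q + pq^2$ unchanged, since neither monomial divides the other.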

32

Provenance Semiring Hierarchy

[Figure: the hierarchy diagram again, over N[X], Why(X), Lin(X), PosBool(X), Sorp(X), Tropical, N (bag), Security, and Boolean (set).]

33

Related Work

Data Provenance
– e.g. [Cui et al. ’00, Buneman et al. ’08, Cheney et al. ’09, Benjelloun et al. ’08]

Circuits
– Circuit complexity (size, depth, parallelism) has been studied for decades, e.g. [Arora-Barak ’09] (book)

Provenance for Datalog
– System of equations, derivation trees, infinite sum [Grahne ’91, Green et al. ’07]
– Poly-size c-tables with Boolean formulas for datalog with contradictions [Abiteboul et al. 2014]

34

Conclusions

Circuits to represent and store Datalog Provenance
– for PosBool(X) and other semirings
– Semantics, Algorithms, Limitations, Applicability
– Preliminary experiments support our results: we compared circuits for deletion propagation with iteratively solving the system of equations and with re-evaluating the datalog program from scratch

Future Work:
– A complete implementation, evaluation, new applications

35

Thank You

Questions?