Declarative Datalog Debugging for Mere Mortals

Sven Köhler (1), Bertram Ludäscher (1,3), Yannis Smaragdakis (2,3)

(1) University of California, Davis
(2) University of Athens, Greece
(3) LogicBlox Inc., Atlanta, USA

UC Davis, Department of Computer Science

Datalog 2.0, Vienna Logic Weeks, 2012

Outline

• Motivation
  – Debugging and Profiling Declarative Rules
• Basic Idea
  – Capture derivations (provenance) in an enriched model M’
  – … then run Datalog queries on M’ (when in Rome, …)
• Simple “Tricks” for Mere Mortals
  – F: record rule firings (TP instances)
  – G: reify firings as nodes in a firing graph
  – S: keep track of firing stages (Statelog)
  – Query the enriched model (provenance graph)!
• Debugging and Profiling Examples
• Musings & Conclusions
  – Graph-based Provenance Analyzer (GPad/DLV, GPad/LB)

Declarative Debugging

• Resurgence/Renaissance of Datalog …
  – … and Declarative Programming
    ▪ e.g., parallel programming beyond MapReduce
  – … an old dream: Executable Specifications!
• But writing large declarative programs is still tricky
  – 9-valued logics anyone? (cf. Kunen’s PhD effect)
  – How can we empower “regular” Datalog programmers (mere mortals)?
  → Simple tools & techniques for debugging (and profiling)
• Ideally:
  – Don’t tie the approach to a particular computation model
  – Instead devise a declarative debugging approach
  → should work for different implementations

Running Example

Pop Quiz: Why/how come tc(a,b) ?

• Why/how is (a,b) in the transitive closure tc of e?
• What about ?- tc(e,X) vs. ?- tc(X,e)

Declarative Debugging: Prolog

[Figure: Prolog SLD tree for the query ?- tc(a,b). The branch through e(a,b) succeeds with the answer tc(a,b); the recursive clause keeps expanding through the subgoals tc(b,b), tc(c,b) and tc(d,b), so the same answer is produced again and again.]

• Hmm … many answers (infinitely many …): (infinitely!) many branches saying “Yes”
• Different ways to say “No”!
  → finitely failed tree
  → infinitely failed tree …

Declarative Debugging: DATALOG

Debug this!

• Evaluating P on I yields the model M_P(I)
• Too much information!
  … but also …
• Not enough information!

Declarative Debugging: DATALOG

• Scope of this paper:
  – positive Datalog programs (recursion OK, negation: not yet)
  – Why/How provenance (but no Why-Not provenance … yet)

Some Debugging and Profiling Use Cases

Solving the Provenance Quiz

• Example: Datalog program P = {
    [r1] tc(X,Y) :- e(X,Y).
    [r2] tc(X,Y) :- e(X,Z), tc(Z,Y). }
• EDB instance I = { e(a,b). e(b,c). e(c,b). e(c,d). }
• Question: How can we justify / explain …
  – … why (how) it is that tc(a,b) is in M_P(I)?


A firing [F] → (H) is called unfounded if all derivations of F require H as an assumption!

Here tc(a,b) has (at least) two different derivations, neither of which is unfounded. However, [r2] → tc(c,b) is unfounded: the firing of r2 depends on tc(b,b), which can only be derived by already assuming the desired conclusion tc(c,b)!

DATALOG Rewritings (GPAD)

Firing graph:
• captures the “full” provenance
• reasonable overhead!?
• has been/is being used (e.g. Orchestra, LogicBlox)
• can be easily constructed!
• Provenance-enabled Debugging and Profiling for the rest of us!

Step 1: Capturing Rule Firings (“F-trick”)

• Capture rule firings and keep “witness info” (existential variables)
  – no premature projections in the rule head please!
• Example. Instead of a given rule …

    tc(X,Y) :- e(X,Z), tc(Z,Y).

  … we rather use these two rules, keeping witnesses Z around:

    fire2(X,Z,Y) :- e(X,Z), tc(Z,Y).
    tc(X,Y) :- fire2(X,Z,Y).
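The effect of the F-trick can also be illustrated outside Datalog. The following Python sketch (an illustration only, not the GPad implementation) evaluates the rewritten rules bottom-up over the EDB instance of the provenance quiz and keeps every fire2(X,Z,Y) tuple. The function name `eval_with_firings` and the `fire1` relation for rule r1 are assumptions introduced for this example.

```python
def eval_with_firings(edges):
    """Naive bottom-up evaluation of tc that records rule firings."""
    e = set(edges)
    tc, fire1, fire2 = set(), set(), set()
    changed = True
    while changed:
        # fire1(X,Y) :- e(X,Y).             (assumed analogue of fire2 for rule r1)
        new_fire1 = set(e)
        # fire2(X,Z,Y) :- e(X,Z), tc(Z,Y).  (the witness Z is kept)
        new_fire2 = {(x, z, y) for (x, z) in e for (z2, y) in tc if z == z2}
        # tc(X,Y) :- fire1(X,Y).   tc(X,Y) :- fire2(X,Z,Y).
        new_tc = new_fire1 | {(x, y) for (x, _, y) in new_fire2}
        changed = not (new_tc <= tc and new_fire1 <= fire1 and new_fire2 <= fire2)
        tc |= new_tc
        fire1 |= new_fire1
        fire2 |= new_fire2
    return tc, fire1, fire2


EDB = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")]
tc, fire1, fire2 = eval_with_firings(EDB)

# All recorded r2 firings whose head is tc(a,b):
why_ab = sorted(f for f in fire2 if (f[0], f[2]) == ("a", "b"))
```

Here `why_ab` contains the single firing (a, b, b): besides the direct edge e(a,b), tc(a,b) is also derived from e(a,b) and tc(b,b), and the witness Z = b would have been lost by projecting in the rule head.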


Example rule firings

Step 2: Graph Transformation (“G-trick”)

• Reify provenance atoms & firings in a labeled graph g/3
• Example for N = 2 subgoals and 1 head atom …

    fire2(X,Z,Y) :- e(X,Z), tc(Z,Y).   % two in-edges
    tc(X,Y) :- fire2(X,Z,Y).           % one out-edge

  … generates N+1 “reification rules” (Skolems are safe):

    g( e(X,Z), in, skfire2(X,Z,Y) ) :- fire2(X,Z,Y).
    g( tc(Z,Y), in, skfire2(X,Z,Y) ) :- fire2(X,Z,Y).
    g( skfire2(X,Z,Y), out, tc(X,Y) ) :- fire2(X,Z,Y).

[Figure: Example instance generated by these rules: e(a,b) and tc(b,d) each have an in-edge to the firing node fire2(a,b,d), which has an out-edge to tc(a,d).]
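The three reification rules can be mimicked in a few lines of Python. This is a minimal sketch, assuming firings are given as (X, Z, Y) tuples and that atoms and Skolem terms are represented as plain tuples; the function name `reify` is hypothetical.

```python
def reify(fire2):
    """Apply the three reification rules to a set of fire2(X,Z,Y) tuples."""
    g = set()
    for (x, z, y) in fire2:
        node = ("skfire2", x, z, y)          # Skolem term naming this firing
        g.add((("e", x, z), "in", node))     # g(e(X,Z),  in, skfire2(X,Z,Y))
        g.add((("tc", z, y), "in", node))    # g(tc(Z,Y), in, skfire2(X,Z,Y))
        g.add((node, "out", ("tc", x, y)))   # g(skfire2(X,Z,Y), out, tc(X,Y))
    return g


g = reify({("a", "b", "d")})
```

For the single firing fire2(a,b,d) this yields two in-edges (from e(a,b) and tc(b,d)) and one out-edge (to tc(a,d)).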

Step 3: Using Statelog (“S-Trick”)

• Use Statelog to keep record of firing rounds:
  – Add state (= stage) argument to provenance rules and graph relations
  – EDB facts are derived in state 0.
  – Subsequently: extract earliest round for firings and IDB facts
• Example:

    r_in:  fire_r(S1, X) :- B1(S, X1), …, Bn(S, Xn), next(S, S1).
    r_out: H(S, Y) :- fire_r(S, X).

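[Figure: Firing graph for the running example, annotated with stages. The r1 firings from e(a,b) and e(c,b) occur in stage 1 and yield tc(a,b)[1] and tc(c,b)[1]; the r2 firing for tc(b,b) occurs in stage 2, giving tc(b,b)[2]; the r2 firings for tc(a,b) and tc(c,b) occur in stage 3.]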
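The stage bookkeeping can be sketched in Python as follows. This is not Statelog itself: the sketch assumes that derived facts persist into later states and records only the earliest stage of each firing and atom; the names `staged_eval`, `atom_stage` and `firing_stage` are made up for the example.

```python
def staged_eval(edges):
    """Return the earliest stage of every tc atom and every rule firing."""
    e = set(edges)                      # EDB facts live in state 0
    atom_stage, firing_stage = {}, {}
    known, stage = set(), 0             # tc atoms available to rule bodies
    while True:
        stage += 1                      # plays the role of next(S, S1)
        firings = {("r1", x, y) for (x, y) in e}
        firings |= {("r2", x, z, y)
                    for (x, z) in e for (z2, y) in known if z == z2}
        new_firings = {f for f in firings if f not in firing_stage}
        if not new_firings:             # fixpoint: nothing fires for the first time
            break
        for f in new_firings:
            firing_stage[f] = stage
            head = (f[1], f[-1])        # tc(X,Y) for both r1 and r2 firings
            atom_stage.setdefault(head, stage)   # keep the earliest stage only
        known = set(atom_stage)
    return atom_stage, firing_stage


EDB = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")]
atom_stage, firing_stage = staged_eval(EDB)
```

On the quiz instance, tc(a,b) and tc(c,b) get stage 1, tc(b,b) gets stage 2, and the r2 firings that rederive tc(a,b) and tc(c,b) from tc(b,b) get stage 3.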

How long (does it take) Provenance!

• These definitions are recursive but well-founded
• The numbers can be easily obtained via Statelog

More Provenance Querying

• Provenance Views:
  – Provenance subgraph relevant for debug atom Q:

    ProvView(Q,X, out, Q) :- g(_,X,out,Q).
    ProvView(Q,X, L, Y) :- ProvView(Q,Y,_,_), g(_,X,L,Y).

• Length of derivations:
  – first round this firing occurred:

    len(F,LenF) :- newFiring(S,F), LenF=S.

• Length of an atom:
  – first round it was derived:

    len(A,LenA) :- newAtom(S,A), LenA=S.
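The two ProvView rules amount to a backward reachability computation over the firing graph. The Python sketch below illustrates this under the assumption that g is a set of (stage, source, label, target) tuples with nodes written as strings; the function name `prov_view` and the tiny example graph are hypothetical.

```python
def prov_view(g, q):
    """Edges (X, L, Y) of the provenance subgraph relevant for atom q."""
    # ProvView(Q,X,out,Q) :- g(_,X,out,Q).
    view = {(x, l, y) for (_, x, l, y) in g if l == "out" and y == q}
    frontier = {x for (x, _, _) in view}
    while frontier:
        # ProvView(Q,X,L,Y) :- ProvView(Q,Y,_,_), g(_,X,L,Y).
        new = {(x, l, y) for (_, x, l, y) in g if y in frontier} - view
        view |= new
        frontier = {x for (x, _, _) in new}
    return view


g = {
    (1, "e(a,b)", "in", "r1(a,b)"), (1, "r1(a,b)", "out", "tc(a,b)"),
    (1, "e(b,c)", "in", "r1(b,c)"), (1, "r1(b,c)", "out", "tc(b,c)"),
    (2, "e(a,b)", "in", "r2(a,b,c)"), (2, "tc(b,c)", "in", "r2(a,b,c)"),
    (2, "r2(a,b,c)", "out", "tc(a,c)"),
}
view = prov_view(g, "tc(a,c)")
```

The view for tc(a,c) contains five edges; the firing r1(a,b) and its edges are left out because tc(a,b) plays no role in deriving tc(a,c) in this graph.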

Declarative Profiling

Prr:
    tc(X,Y) :- e(X,Y).
    tc(X,Y) :- e(X,Z), tc(Z,Y).

Pdr:
    tc(X,Y) :- e(X,Y).
    tc(X,Y) :- tc(X,Z), tc(Z,Y).

• Number of Facts:

    derived(H) :- g(_,_,out,H).
    derivedHeadCount(C) :- C = count{ H : derived(H) }.

• Number of Firings:

    firing(F) :- g(_,F,out,_).
    firingCount(C) :- C = count{ F : firing(F) }.

[Figure: Firing graphs with stage annotations for the chain e(a,b), e(b,c), e(c,d), e(d,e): (a) right-recursive, where tc(a,e) is first derived in stage 4; (b) doubly-recursive, where tc(a,e) is first derived in stage 3.]


• Number of Rederivations:

    reDerivation(S,F) :- g(S,F,out,A), len(A,LenA), LenA < S.
    reDerivCount(S,C) :- C = count{ F : reDerivation(S,F) }.
    reDerivTotal(T) :- T = sum{ C : reDerivCount(S,C) }.

• Schema-Level Profiling:
  – Number of new facts per relation used in each round to derive new facts

    factInRound(S,R,A) :- g(S,A,in,_), relName(A,R).
    factInRound(S1,R,A) :- g(S,_,out,A), next(S,S1), relName(A,R).
    newFact(S,R,A) :- g(S,_,out,A), not factInRound(S,R,A), relName(A,R).
    newFactsCount(S,R,C) :- C = count{ A : newFact(S,R,A) }.
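The fact and firing counts can be reproduced with a small Python sketch that evaluates either variant of transitive closure and collects the distinct rule firings. It is an illustration under simplifying assumptions (no stage argument, so rederivations per round are not counted); the function name `profile` and the integer-labelled chain are made up for the example.

```python
def profile(edges, doubly_recursive):
    """Count derived facts and distinct rule firings for transitive closure."""
    e = set(edges)
    tc, firings = set(), set()
    while True:
        # First subgoal of the recursive rule: tc(X,Z) in Pdr, e(X,Z) in Prr.
        body = tc if doubly_recursive else e
        new = {("r1", x, y) for (x, y) in e}
        new |= {("r2", x, z, y)
                for (x, z) in body for (z2, y) in tc if z == z2}
        if new <= firings:              # fixpoint: no firing is new
            break
        firings |= new
        tc |= {(f[1], f[-1]) for f in new}
    return len(tc), len(firings)


chain4 = [(i, i + 1) for i in range(4)]   # a chain with 4 edges, nodes as integers
rr = profile(chain4, doubly_recursive=False)
dr = profile(chain4, doubly_recursive=True)
```

On the 4-edge chain both variants derive 10 facts, with 10 firings for the right-recursive program and 14 for the doubly-recursive one. On a 9-edge chain the same function gives 45 facts with 45 versus 129 firings.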

Profiling Example: Transitive Closure

                   Right Recursive   Double Recursive
facts                    45                45
rule firings             45               129
rounds                   10                 6
rederivations           285               325

[Related factoid: a chain of length N has exactly one derivation in the right-recursive program but Catalan-number(N) many derivations in the doubly-recursive program!]
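The growth in the doubly-recursive case can be checked with a short recurrence. The sketch below counts derivation trees of tc(x,y) when x and y are n edges apart on a chain: one tree for n = 1 (rule r1), and otherwise one tree per split point k and per pair of subtrees. Under this indexing the count equals the Catalan number C(n-1); how that lines up with “Catalan-number(N)” depends on how N and the Catalan sequence are indexed, which is an assumption here. The names `derivations` and `catalan` are hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def derivations(n):
    """Derivation trees of tc(x,y) in Pdr when y is n chain edges after x."""
    if n == 1:
        return 1                        # only tc(X,Y) :- e(X,Y) applies
    # tc(X,Y) :- tc(X,Z), tc(Z,Y): choose Z at distance k from X
    return sum(derivations(k) * derivations(n - k) for k in range(1, n))


def catalan(m):
    """m-th Catalan number, C(0) = 1."""
    return comb(2 * m, m) // (m + 1)


counts = [derivations(n) for n in range(1, 7)]
```

The right-recursive program has no such choice: the witness Z is always the direct successor of X, so each atom has exactly one derivation tree.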

Real-World Profiling Example

• Provenance-based profiling can explain real-world behavior
• E.g., realistic graph, ~1700 nodes, ~4000 edges:
  – doubly recursive trans. closure: > 64M rule firings
  – right-recursive trans. closure: ~560K rule firings
  – explains execution time difference: > 15 sec vs. 2.5 sec

Burden of Declarative Profiling

• The provenance transform incurs cost
  – e.g., for double-recursive tc: from ~15 sec to ~51 sec execution time
• Very high space cost is to be expected
• Just for transitive closure rule:
  – tc(X,Z) :- e(X,Y), tc(Y,Z)
    Need a 3-column table (X,Y,Z) instead of 2-column (X,Z) when recording provenance
• Approach will not scale to large provenance graphs
  – … unless we invent specialized data structures or custom logic for provenance

Provenance in LogicBlox Datalog

• Provenance transformation is already implemented in LB Datalog
• add to any program:
    lang:provenance[] = true.
    lang:provenance:recordConstants[] = true.
• Produces new provenance relations, per original rule, capturing values in rule firing
• Can write queries using such relations, to implement our GPAD
  – Graph-based Provenance Analyzer and Debugger

Statelog in LogicBlox Datalog

• LB Datalog has no native Statelog support
• We simulate it in various ways, typically by introducing an explicit “time” dimension
• Affects performance significantly
  – extra factor of N to asymptotic complexity
• Recovering performance through “unsafe tricks”
  – Safe use of recursion through negation

Provenance capture: F, G, S
Provenance query: (g.u)+

Musings

“Elegance is not optional.”— Richard O’Keefe

• There is no tension between writing a beautiful program and writing an efficient program. If your code is ugly, the chances are that you either don’t understand your problem or you don’t understand your programming language, and in neither case does your code stand much chance of being efficient. In order to ensure that your program is efficient, you need to know what it is doing, and if your code is ugly, you will find it hard to analyse.

Query “Macros” for Debugging, Profiling

• What is the provenance of atom A?
• Regular Path Query (RPQ): ans(X,Y) :- (X, (g.u)+, Y).
  – g = out^-1 (OPM’s “was-generated-by”)
  – u = in^-1 (OPM’s “used”)
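A regular path query of this shape can be evaluated by alternating the two inverse edge relations. The Python sketch below is one possible reading, assuming the firing graph is a set of (source, label, target) triples with string nodes; the function name `rpq_gu_plus` and the sample edges are hypothetical.

```python
def rpq_gu_plus(edges, start):
    """Atoms reachable from `start` via (g.u)+, with g = out^-1 and u = in^-1."""
    gen_by = {}   # atom   -> firings that generated it  (g = out^-1)
    used = {}     # firing -> atoms that it used         (u = in^-1)
    for (x, label, y) in edges:
        if label == "out":
            gen_by.setdefault(y, set()).add(x)
        elif label == "in":
            used.setdefault(y, set()).add(x)
    reached, todo = set(), [start]
    while todo:
        atom = todo.pop()
        for firing in gen_by.get(atom, ()):      # one g step
            for a in used.get(firing, ()):       # one u step
                if a not in reached:
                    reached.add(a)
                    todo.append(a)
    return reached


edges = {
    ("e(a,b)", "in", "r1(a,b)"), ("r1(a,b)", "out", "tc(a,b)"),
    ("e(b,c)", "in", "r1(b,c)"), ("r1(b,c)", "out", "tc(b,c)"),
    ("e(a,b)", "in", "r2(a,b,c)"), ("tc(b,c)", "in", "r2(a,b,c)"),
    ("r2(a,b,c)", "out", "tc(a,c)"),
}
ans = rpq_gu_plus(edges, "tc(a,c)")
```

For tc(a,c) the answer consists of e(a,b), tc(b,c) and e(b,c), that is, every atom the derivation of tc(a,c) depends on directly or indirectly.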

More musings: Many gurus are better than one!

• Are we too fragmented? Defragment your mind!
  – Discover the homomorphisms, relationships between subcommunities!
• Look around, be promiscuous, interbreed!
  – idea-wise I mean!
• For example, look at …
  – Theorem proving
  – Declarative LP semantics (e.g. well-founded models)
  – Procedural/production rule semantics (e.g. inflationary)
    ▪ Fixpoint logics
  – PL, e.g. Functional programming

Hamming Numbers in a Dataflow Network(= executable Kepler workflow)

Compute Hamming numbers H in order, where H = { 2^i · 3^j · 5^k | i, j, k ≥ 0 }, a.k.a. regular numbers or 5-smooth numbers (numbers whose prime factors are all ≤ 5).
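For reference, the sequence itself can be produced with the classic three-pointer merge. This Python sketch is not the Kepler workflow; it only shows one common way to enumerate the numbers in order without duplicates, and the function name `hamming` is an assumption.

```python
def hamming(count):
    """Return the first `count` Hamming numbers in increasing order."""
    h = [1]
    i2 = i3 = i5 = 0                    # positions of the next factors to multiply
    while len(h) < count:
        nxt = min(2 * h[i2], 3 * h[i3], 5 * h[i5])
        h.append(nxt)
        # Advance every pointer that produced nxt, which suppresses duplicates
        # such as 6 = 2*3 = 3*2.
        if nxt == 2 * h[i2]:
            i2 += 1
        if nxt == 3 * h[i3]:
            i3 += 1
        if nxt == 5 * h[i5]:
            i5 += 1
    return h


first = hamming(15)
```

The duplicate suppression is what matters for provenance: a number such as 6 can be reached both as 2·3 and as 3·2, so a network that does not merge these paths rederives it.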

Hamming “3-loops”

Hamming “1-loop”

Hamming Traces: “Debugged”

[Figure: two provenance graphs whose nodes are the Hamming numbers from 1 to 1000 (1, 2, 3, 5, 4, 6, 10, 9, 15, 25, …); only the node labels survive.]

Provenance of H1 (“Fish”)

Provenance of H3 (“Sail”)

For each H-number, there are many paths → many re-derivations!

For each H-number, there is exactly one path to the root = unique derivation!

Datalog as a Lingua Franca for Provenance Querying and Reasoning, Dey et al, TaPP’12

Conclusions

• Declarative Debugging for the rest of us!
  – Simple program transformations: P → {F, G, S} → P’
  – Apply Datalog queries (RPQ, aggregation, …) on M_P’
  – “Turn Datalog on itself!”
• Prototypical implementations underway:
  – GPad/DLV
    ▪ uses SWI-Prolog as glue
    ▪ … but could benefit e.g. from DLV IDE and tools!
  – GPad/LB
    ▪ uses LogicBlox platform and tools (MoreBlox, …)
• Coming up next:
  – Finish GPad(s), add library of common queries (RPQ, LCA, …)
  – Clarify connection to provenance semirings
  – Extending to Datalog-neg, why-not, …
