Declarative Datalog Debugging for Mere Mortals
Sven Köhler (1), Bertram Ludäscher (1,3), Yannis Smaragdakis (2,3)
(1) University of California, Davis; (2) University of Athens, Greece; (3) LogicBlox Inc., Atlanta, USA
UC Davis, Department of Computer Science
Datalog 2.0, Vienna Logic Weeks, 2012
Outline
¨ Motivation
¤ Debugging and Profiling Declarative Rules
¨ Basic Idea
¤ Capture derivations (provenance) in an enriched model M’
¤ … then run Datalog queries on M’ (when in Rome, … )
¨ Simple “Tricks” for Mere Mortals
¤ F: record rule firings (T_P instances)
¤ G: reify firings as nodes in a firing graph
¤ S: keep track of firing stages (Statelog)
¤ Query the enriched model (provenance graph)!
¨ Debugging and Profiling Examples
¨ Musings & Conclusions
¤ Graph-based Provenance Analyzer (GPad/DLV, GPad/LB)
Declarative Debugging
¨ Resurgence/Renaissance of Datalog …
¤ … and Declarative Programming
▪ e.g., parallel programming beyond MapReduce
¤ … an old dream: Executable Specifications!
¨ But writing large declarative programs is still tricky
¤ 9-valued logics anyone? (cf. Kunen’s PhD effect)
¤ How can we empower “regular” Datalog programmers (mere mortals)?
⇒ Simple tools & techniques for debugging (and profiling)
¨ Ideally:
¤ Don’t tie the approach to a particular computation model
¤ Instead devise a declarative debugging approach
⇒ should work for different implementations
Pop Quiz: Why/how come tc(a,b) ?
¨ Why/how is (a,b) in the transitive closure tc of e?
¨ What about ?- tc(e,X) vs. ?- tc(X,e)
[Figure: SLD resolution tree for the query ?- tc(a,b). The root branches into :- e(a,b) and :- e(a,_G263), tc(_G263,b). The first branch ends in :- true (Answer: tc(a,b)). The second continues through the subgoals :- tc(b,b), :- tc(c,b), and :- tc(d,b), which recur with fresh variables (_G347, _G431, _G515, …) and yield the answer tc(a,b) again and again.]
Different ways to say “No”! ⇒ finitely failed tree vs. infinitely failed tree …
(Infinitely!) Many branches saying “Yes”
Debug this!
¨ Evaluating P on I yields the model M_P(I)
¨ Too much information!
… but also …
¨ Not enough information!
Declarative Debugging: DATALOG
¨ Scope of this paper:
¤ positive Datalog programs (recursion OK, negation: not yet)
¤ Why/How provenance (but no Why-Not provenance … yet)
Solving the Provenance Quiz
¨ Example: Datalog program P = {
[r1] tc(X,Y) :- e(X,Y).
[r2] tc(X,Y) :- e(X,Z), tc(Z,Y). }
¨ EDB instance I = { e(a,b). e(b,c). e(c,b). e(c,d). }
¨ Question: How can we justify / explain …
¤ … why (how) it is that tc(a,b) is in M_P(I)?
[Figure legend: [r1] tc(X,Y) :- e(X,Y); [r2] tc(X,Y) :- e(X,Z), tc(Z,Y)]
A firing [F] → (H) is called unfounded if all derivations of F require H as an assumption!
Here tc(a,b) has (at least) two different derivations, neither of which is unfounded. However, [r2] → tc(c,b) is unfounded: the firing of r2 depends on tc(b,b), which can only be derived by already assuming the desired conclusion tc(c,b)!
DATALOG Rewritings (GPAD)
Firing graph:
• captures the “full” provenance
• reasonable overhead!?
• has been/is being used (e.g. Orchestra, LogicBlox)
• can be easily constructed!
• Provenance-enabled Debugging and Profiling for the rest of us!
Step 1: Capturing Rule Firings (“F-trick”)
¨ Capture rule firings and keep “witness info” (existential variables)
¤ no premature projections in the rule head please!
¨ Example. Instead of a given rule …
tc(X,Y) :- e(X,Z), tc(Z,Y).
… we rather use these two rules, keeping witnesses Z around:
fire2(X,Z,Y) :- e(X,Z), tc(Z,Y).
tc(X,Y) :- fire2(X,Z,Y).
Example rule firings
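The F-trick can be simulated outside a Datalog engine. The following is a minimal Python sketch, not the GPad implementation: it evaluates the rewritten tc program naively on the EDB instance I from the quiz and keeps every firing tuple. The name `fire1` (for the firings of r1) and the helper `eval_with_firings` are assumptions made for this illustration; the slides only name `fire2`.

```python
# EDB instance I from the provenance quiz.
E = {("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")}

def eval_with_firings(e):
    """Naive bottom-up evaluation of the F-rewritten tc program."""
    fire1, fire2, tc = set(), set(), set()
    while True:
        # fire1(X,Y) :- e(X,Y).            (hypothetical name for r1 firings)
        new_fire1 = set(e)
        # fire2(X,Z,Y) :- e(X,Z), tc(Z,Y). (the witness Z is kept)
        new_fire2 = {(x, z, y) for (x, z) in e for (z2, y) in tc if z == z2}
        # tc(X,Y) :- fire1(X,Y).   tc(X,Y) :- fire2(X,Z,Y).
        new_tc = new_fire1 | {(x, y) for (x, _, y) in new_fire2}
        if (new_fire1, new_fire2, new_tc) == (fire1, fire2, tc):
            return fire1, fire2, tc
        fire1, fire2, tc = new_fire1, new_fire2, new_tc

fire1, fire2, tc = eval_with_firings(E)
# Witnesses Z for which r2 derives tc(a,b):
witnesses = sorted(z for (x, z, y) in fire2 if (x, y) == ("a", "b"))
print(("a", "b") in fire1, witnesses)  # True ['b']
```

The output reflects the two derivations of tc(a,b) mentioned in the quiz: one via r1 directly from e(a,b), and one via r2 with witness Z = b.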
Step 2: Graph Transformation (“G-trick”)
¨ Reify provenance atoms & firings in a labeled graph g/3
¨ Example for N = 2 subgoals and 1 head atom …
fire2(X,Z,Y) :- e(X,Z), tc(Z,Y). % two in-edges
tc(X,Y) :- fire2(X,Z,Y). % one out-edge
… generates N+1 “reification rules” (Skolems are safe):
g( e(X,Z), in, skfire2(X,Z,Y) ) :- fire2(X,Z,Y).
g( tc(Z,Y), in, skfire2(X,Z,Y) ) :- fire2(X,Z,Y).
g( skfire2(X,Z,Y), out, tc(X,Y) ) :- fire2(X,Z,Y).
[Figure: Example instance generated by these rules: e(a,b) -in-> fire2(a,b,d), tc(b,d) -in-> fire2(a,b,d), fire2(a,b,d) -out-> tc(a,d)]
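As a rough illustration of the three reification rules, the Python sketch below builds the labeled edge set g/3 for the r2 firings over the quiz instance I. Atoms and Skolem nodes are encoded as tuples; this encoding and the helper names `least_model` and `firing_graph` are assumptions of the sketch, not part of GPad.

```python
# EDB instance I from the provenance quiz.
E = {("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")}

def least_model(e):
    """tc facts of the least model, computed by naive iteration."""
    tc = set()
    while True:
        new_tc = set(e) | {(x, y) for (x, z) in e for (z2, y) in tc if z == z2}
        if new_tc == tc:
            return tc
        tc = new_tc

def firing_graph(e):
    """Edges (source, label, target) produced by the reification rules for r2."""
    tc = least_model(e)
    g = set()
    for (x, z) in e:
        for (z2, y) in tc:
            if z == z2:                       # fire2(X,Z,Y) holds
                node = ("skfire2", x, z, y)   # Skolem node for this firing
                g.add((("e", x, z), "in", node))
                g.add((("tc", z, y), "in", node))
                g.add((node, "out", ("tc", x, y)))
    return g

g = firing_graph(E)
node = ("skfire2", "a", "b", "d")
print((("e", "a", "b"), "in", node) in g)   # True
print((node, "out", ("tc", "a", "d")) in g)  # True
```

Each firing contributes N+1 = 3 edges, so the 9 firings of r2 on this instance yield 27 edges.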
Step 3: Using Statelog (“S-Trick”)
¨ Use Statelog to keep record of firing rounds:
¤ Add state (= stage) argument to provenance rules and graph relations
¤ EDB facts are derived in state 0.
¤ Subsequently: extract earliest round for firings and IDB facts
¨ Example:
r_in : fire_r(S1, X) :- B1(S, X1), … , Bn(S, Xn), next(S, S1).
r_out : H(S, Y) :- fire_r(S, X).
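[Figure: firing graph annotated with stage numbers in brackets. Atom nodes: e(a,b), e(b,c), e(c,b), tc(a,b)[1], tc(c,b)[1], tc(b,b)[2]. Firing nodes: r1 [1] (twice), r2 [2], r2 [3] (twice).]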
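The stage numbers in such a graph can be reproduced with a small simulation. The sketch below is one possible reading of the S-trick in Python, not Statelog itself: it evaluates the tc program round by round on the quiz instance I and remembers the earliest stage of every atom and every firing. The function name `stages` and the tuple encoding of atoms and firings are assumptions of this sketch.

```python
# EDB instance I from the provenance quiz.
E = {("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")}

def stages(e):
    """Earliest stage of each atom and each firing (EDB facts are in state 0)."""
    atom_stage = {("e", x, y): 0 for (x, y) in e}
    firing_stage = {}
    tc = set()      # tc facts holding in the previous state
    stage = 0
    while True:
        stage += 1
        # Rule instances whose bodies hold in the previous state fire now.
        fired = {("r1", x, y): ("tc", x, y) for (x, y) in e}
        fired.update({("r2", x, z, y): ("tc", x, y)
                      for (x, z) in e for (z2, y) in tc if z == z2})
        for firing, head in fired.items():
            firing_stage.setdefault(firing, stage)  # keep the earliest stage only
            atom_stage.setdefault(head, stage)
        new_tc = {(h[1], h[2]) for h in fired.values()}
        if new_tc == tc:                            # fixpoint reached
            return atom_stage, firing_stage
        tc = new_tc

atom_stage, firing_stage = stages(E)
print(atom_stage[("tc", "a", "b")], atom_stage[("tc", "b", "b")])  # 1 2
print(firing_stage[("r2", "a", "b", "b")])                         # 3
```

The r2 firing with witness b re-derives tc(a,b) at stage 3, two stages after tc(a,b) was first obtained via r1.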
How long (does it take) Provenance!
¨ These definitions are recursive but well-founded
¨ The numbers can be easily obtained via Statelog
More Provenance Querying
¨ Provenance Views:
¤ Provenance subgraph relevant for debug atom Q:
ProvView(Q,X,out,Q) :- g(_,X,out,Q).
ProvView(Q,X,L,Y) :- ProvView(Q,Y,_,_), g(_,X,L,Y).
¨ Length of derivations:
¤ first round this firing occurred:
len(F,LenF) :- newFiring(S,F), LenF=S.
¨ Length of an atom:
¤ first round it was derived:
len(A,LenA) :- newAtom(S,A), LenA=S.
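The ProvView rules amount to backward reachability from the debug atom Q. A minimal Python sketch of that idea follows; it uses a stateless, hand-written edge set with string node names, both of which are assumptions of the example rather than GPad's representation.

```python
def prov_view(g, q):
    """Edges of g that lie on some path leading into the debug atom q."""
    view, seen, frontier = set(), {q}, [q]
    while frontier:
        target = frontier.pop()
        for (x, label, y) in g:
            if y == target:
                view.add((x, label, y))
                if x not in seen:
                    seen.add(x)
                    frontier.append(x)
    return view

# Tiny hand-written firing graph: two independent r1 firings.
g = {
    ("e(a,b)", "in", "r1(a,b)"), ("r1(a,b)", "out", "tc(a,b)"),
    ("e(c,d)", "in", "r1(c,d)"), ("r1(c,d)", "out", "tc(c,d)"),
}
view = prov_view(g, "tc(a,b)")
print(sorted(view))
```

Only the two edges on the path into tc(a,b) end up in the view; the firing that derives tc(c,d) is irrelevant to the debug atom and is left out.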
Declarative Profiling
P_rr (right-recursive):
tc(X,Y) :- e(X,Y).
tc(X,Y) :- e(X,Z), tc(Z,Y).
P_dr (doubly-recursive):
tc(X,Y) :- e(X,Y).
tc(X,Y) :- tc(X,Z), tc(Z,Y).
Declarative Profiling
¨ Number of Facts:
derived(H) :- g(_,_,out,H).
derivedHeadCount(C) :- C = count{ H : derived(H) }.
¨ Number of Firings:
firing(F) :- g(_,F,out,_).
firingCount(C) :- C = count{ F : firing(F) }.
[Figure: firing graphs with stage numbers for the chain e(a,b), e(b,c), e(c,d), e(d,e).
(a) right-recursive: tc(a,b)[1], tc(a,c)[2], tc(a,d)[3], tc(a,e)[4]; tc(b,c)[1], tc(b,d)[2], tc(b,e)[3]; tc(c,d)[1], tc(c,e)[2]; tc(d,e)[1].
(b) doubly-recursive: tc(a,b)[1], tc(b,c)[1], tc(c,d)[1], tc(d,e)[1]; tc(a,c)[2], tc(b,d)[2], tc(c,e)[2]; tc(a,d)[3], tc(a,e)[3], tc(b,e)[3].]
Declarative Profiling
¨ Number of Rederivations:
reDerivation(S,F) :- g(S,F,out,A), len(A,LenA), LenA < S.
reDerivCount(S,C) :- C = count{ F : reDerivation(S,F) }.
reDerivTotal(T) :- T = sum{ C : reDerivCount(S,C) }.
¨ Schema-Level Profiling:
¤ Number of new facts per relation used in each round to derive new facts
factInRound(S,R,A) :- g(S,A,in,_), relName(A,R).
factInRound(S1,R,A) :- g(S,_,out,A), next(S,S1), relName(A,R).
newFact(S,R,A) :- g(S,_,out,A), not factInRound(S,R,A), relName(A,R).
newFactsCount(S,R,C) :- C = count{ A : newFact(S,R,A) }.
Profiling Example: Transitive Closure
Right Recursive:
¨ 45 facts
¨ 45 rule firings
¨ 10 rounds
¨ 285 rederivations
Double Recursive:
¨ 45 facts
¨ 129 rule firings
¨ 6 rounds
¨ 325 rederivations
[Related factoid: a chain of length N has exactly one derivation in the right-recursive program but Catalan-number(N) many derivations in the doubly-recursive program!]
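The four profile numbers can be recomputed with a short simulation. The Python sketch below rests on two assumptions that the slides do not state: the input is a chain of 10 nodes (9 edges), picked because it is consistent with the 45 facts reported, and stages follow a Statelog-style reading in which every rule instance whose body held in the previous state fires again in each later stage. A rederivation is counted as in reDerivation(S,F): a firing at stage S whose head atom was first derived before S. The function name `profile` is hypothetical.

```python
def profile(edges, doubly_recursive):
    """Return (facts, distinct firings, rounds, rederivations) for tc."""
    edb = set(edges)
    tc = set()            # tc facts holding in the previous state
    first_stage = {}      # tc fact -> first stage in which it was derived
    firings = set()       # distinct rule firings (rule name plus bindings)
    rederivations = 0     # (stage, firing) pairs whose head was already known
    stage = 0
    while True:
        stage += 1
        # r1: tc(X,Y) :- e(X,Y).
        fired = {("r1", x, y): (x, y) for (x, y) in edb}
        # r2: tc(X,Y) :- L(X,Z), tc(Z,Y), with L = e (right-rec.) or tc (doubly-rec.)
        left = tc if doubly_recursive else edb
        for (x, z) in left:
            for (z2, y) in tc:
                if z == z2:
                    fired[("r2", x, z, y)] = (x, y)
        firings.update(fired)
        # Heads already recorded stem from an earlier stage: LenA < S.
        rederivations += sum(1 for head in fired.values() if head in first_stage)
        new_tc = set(fired.values())
        for head in new_tc:
            first_stage.setdefault(head, stage)
        if new_tc == tc:  # no new facts in this round: fixpoint
            return len(tc), len(firings), stage, rederivations
        tc = new_tc

chain = [(i, i + 1) for i in range(9)]          # assumed input: 10-node chain
print(profile(chain, doubly_recursive=False))   # (45, 45, 10, 285)
print(profile(chain, doubly_recursive=True))    # (45, 129, 6, 325)
```

Under these assumptions the sketch matches the counts listed above: the doubly-recursive program needs fewer rounds but almost three times as many distinct firings.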
Real-World Profiling Example
¨ Provenance-based profiling can explain real-world behavior
¨ E.g., realistic graph, ~1700 nodes, ~4000 edges:
¤ doubly recursive trans. closure: > 64M rule firings
¤ right-recursive trans. closure: ~560K rule firings
¤ explains execution time difference: > 15 sec vs. 2.5 sec
Burden of Declarative Profiling
¨ The provenance transform incurs cost
¤ e.g., for double-recursive tc: from ~15 sec to ~51 sec execution time
¨ Very high space cost is to be expected
¨ Just for transitive closure rule:
¤ tc(X,Z) :- e(X,Y), tc(Y,Z)
Need a 3-column table (X,Y,Z) instead of 2-column (X,Z) when recording provenance
¨ Approach will not scale to large provenance graphs
¤ ... unless we invent specialized data structures or custom logic for provenance
Provenance in LogicBlox Datalog
• Provenance transformation is already implemented in LB Datalog
• add to any program:
lang:provenance[] = true.
lang:provenance:recordConstants[] = true.
• Produces new provenance relations, per original rule, capturing values in rule firing
• Can write queries using such relations, to implement our GPAD: Graph-based Provenance Analyzer and Debugger
Statelog in LogicBlox Datalog
¨ LB Datalog has no native Statelog support
¨ We simulate it in various ways, typically by introducing an explicit “time” dimension
¨ Affects performance significantly
¤ extra factor of N to asymptotic complexity
¨ Recovering performance through “unsafe tricks”
¤ Safe use of recursion through negation
“Elegance is not optional.”— Richard O’Keefe
¨ There is no tension between writing a beautiful program and writing an efficient program. If your code is ugly, the chances are that you either don’t understand your problem or you don’t understand your programming language, and in neither case does your code stand much chance of being efficient. In order to ensure that your program is efficient, you need to know what it is doing, and if your code is ugly, you will find it hard to analyse.
Query “Macros” for Debugging, Profiling
¨ What is the provenance of atom A?
¨ Regular Path Query (RPQ): ans(X,Y) :- (X, (g.u)+, Y).
¤ g = out^-1 (OPM’s “was-generated-by”)
¤ u = in^-1 (OPM’s “used”)
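One way to read the RPQ (g.u)+ is: starting from an atom, step backwards over an out-edge to a firing, then backwards over an in-edge to an atom the firing used, and repeat. The Python sketch below implements that reading over a stateless edge set; the function name `rpq_gu_plus`, the string node names, and the sample graph are assumptions made for illustration.

```python
def rpq_gu_plus(g, start):
    """All atoms Y with (start, (g.u)+, Y), where g = out^-1 and u = in^-1."""
    generated_by = {}   # atom   -> firings that generated it (out^-1)
    used = {}           # firing -> atoms it used             (in^-1)
    for (src, label, dst) in g:
        if label == "out":
            generated_by.setdefault(dst, set()).add(src)
        elif label == "in":
            used.setdefault(dst, set()).add(src)
    reached, frontier = set(), [start]
    while frontier:
        atom = frontier.pop()
        for firing in generated_by.get(atom, ()):
            for source in used.get(firing, ()):
                if source not in reached:
                    reached.add(source)
                    frontier.append(source)
    return reached

# Hand-written firing graph: tc(a,c) via r2 from e(a,b) and tc(b,c); tc(b,c) via r1.
edges = {
    ("e(b,c)", "in", "r1(b,c)"), ("r1(b,c)", "out", "tc(b,c)"),
    ("e(a,b)", "in", "r2(a,b,c)"), ("tc(b,c)", "in", "r2(a,b,c)"),
    ("r2(a,b,c)", "out", "tc(a,c)"),
}
ancestors = rpq_gu_plus(edges, "tc(a,c)")
print(sorted(ancestors))  # ['e(a,b)', 'e(b,c)', 'tc(b,c)']
```

The answer set is exactly the atoms that tc(a,c) transitively depends on, which is what the question "What is the provenance of atom A?" asks for at the atom level.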
More musings: Many gurus are better than one!
¨ Are we too fragmented? Defragment your mind!
¤ Discover the homomorphisms, relationships between subcommunities!
¨ Look around, be promiscuous, interbreed!
¤ idea-wise I mean!
¨ For example, look at …
¤ Theorem proving
¤ Declarative LP semantics (e.g. well-founded models)
¤ Procedural/production rule semantics (e.g. inflationary)
▪ Fixpoint logics
¤ PL, e.g. Functional programming
Hamming Numbers in a Dataflow Network (= executable Kepler workflow)
Compute Hamming numbers H in order, where H = { 2^i · 3^j · 5^k | i, j, k ≥ 0 },
a.k.a. regular numbers or 5-smooth numbers (numbers whose prime factors are <= 5).
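For readers who want to check individual values, here is a small Python sketch that enumerates Hamming numbers in increasing order. It uses a priority queue rather than a dataflow network, so it is a stand-in technique and not the Kepler workflow from the talk; `hamming` and `producers` are names invented for this example.

```python
import heapq

def hamming(n):
    """Return the first n Hamming numbers in increasing order."""
    heap, seen, out = [1], {1}, []
    while len(out) < n:
        h = heapq.heappop(heap)
        out.append(h)
        for p in (2, 3, 5):
            if h * p not in seen:      # skip values that were already produced
                seen.add(h * p)
                heapq.heappush(heap, h * p)
    return out

def producers(h):
    """Number of ways h arises as 2*x, 3*x or 5*x from another Hamming number x."""
    return sum(1 for p in (2, 3, 5) if h % p == 0)

print(hamming(10))    # [1, 2, 3, 4, 5, 6, 8, 9, 10, 12]
print(producers(30))  # 3
```

The `seen` set is what suppresses duplicates here. A network that multiplies every number by 2, 3, and 5 without such a filter produces 30 three times (15·2, 10·3, 6·5), which is the kind of re-derivation a provenance graph makes visible.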
Hamming Traces: “Debugged”
[Figure: two provenance graphs whose nodes are labeled with Hamming numbers between 1 and 1000 (node labels omitted).]
Provenance of H1 ("Fish"): for each H-number, there are many paths ⇒ many re-derivations!
Provenance of H3 ("Sail"): for each H-number, there is exactly one path to the root = unique derivation!
Datalog as a Lingua Franca for Provenance Querying and Reasoning, Dey et al, TaPP’12
Conclusions
¨ Declarative Debugging for the rest of us!
¤ Simple program transformations: P → {F, G, S} → P’
¤ Apply Datalog queries (RPQ, aggregation, …) on M_P’
¤ “Turn Datalog on itself!”
¨ Prototypical implementations underway:
¤ GPad/DLV
▪ uses SWI-Prolog as glue
▪ … but could benefit e.g. from DLV IDE and tools!
¤ GPad/LB
▪ uses LogicBlox platform and tools (MoreBlox, …)
¨ Coming up next:
¤ Finish GPad(s), add library of common queries (RPQ, LCA, …)
¤ Clarify connection to provenance semirings
¤ Extending to Datalog-neg, why-not, …