

Page 1: Technical Report: Exception-enriched Rule Learning …people.mpi-inf.mpg.de/~gadelrab/ExRules_TR.pdfTechnical Report: Exception-enriched Rule Learning from Knowledge Graphs Mohamed

Technical Report: Exception-enriched Rule Learningfrom Knowledge Graphs

Mohamed Gad-Elrab^a, Daria Stepanova^a, Jacopo Urbani^b, Gerhard Weikum^a

^a Max Planck Institute of Informatics, Saarbrücken, Germany

^b VU University Amsterdam, Amsterdam, The Netherlands

{gadelrab,dstepano,weikum}@mpi-inf.mpg.de, [email protected]

Abstract. Advances in information extraction have enabled the automatic construction of large knowledge graphs (KGs) like DBpedia, Freebase, YAGO and Wikidata. These KGs are inevitably bound to be incomplete. To fill in the gaps, data correlations in the KG can be analyzed to infer Horn rules and to predict new facts. However, Horn rules do not take into account possible exceptions, so that predicting facts via such rules introduces errors. To overcome this problem, we present a method for effective revision of learned Horn rules by adding exceptions (i.e., negated atoms) into their bodies. This way errors are largely reduced. We apply our method to discover rules with exceptions from real-world KGs. Our experimental results demonstrate the effectiveness of the developed method and the improvements in accuracy for KG completion by rule-based fact prediction.

1 Introduction

Motivation and Problem. Recent advances in information extraction have led to huge graph-structured knowledge bases (KBs), also known as knowledge graphs (KGs), with prominent projects like NELL [5], DBpedia [2], Freebase [4], YAGO [24] and Wikidata [9]. These KGs contain millions or billions of relational facts in the form of subject-predicate-object (SPO) triples.

As such KGs are automatically constructed, they are incomplete and contain errors. To complete and curate a KG, inductive logic programming and data mining techniques (e.g., [6,31,13]) have been used to identify strong correlations, such as "Married people live in the same place", and cast them in the form of Horn rules, such as:

r1 : livesIn(Y, Z) ← isMarriedTo(X, Y), livesIn(X, Z).
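As a concrete illustration, a rule like r1 can be applied by forward chaining over the triple set. The sketch below uses a toy KG loosely mirroring Figure 1a; the fact layout and names are illustrative, not the paper's exact data:

```python
# Facts as (subject, predicate, object) triples; a toy slice of Figure 1a.
kg = {
    ("clara", "isMarriedTo", "dave"),
    ("clara", "livesIn", "chicago"),
    ("bob", "isMarriedTo", "alice"),
    ("bob", "livesIn", "berlin"),
}

def apply_r1(kg):
    """r1: livesIn(Y, Z) <- isMarriedTo(X, Y), livesIn(X, Z)."""
    derived = set()
    for (x, p1, y) in kg:
        if p1 != "isMarriedTo":
            continue
        for (x2, p2, z) in kg:
            if p2 == "livesIn" and x2 == x:
                fact = (y, "livesIn", z)
                if fact not in kg:
                    derived.add(fact)
    return derived

print(apply_r1(kg))
# predicts, e.g., that dave lives in chicago, like his wife clara
```

This naive nested-loop join is quadratic in the KG size; production rule miners use indexed joins, but the semantics is the same.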

This has twofold benefits. First, since KGs operate under the Open World Assumption (OWA) (i.e., absent facts are treated as unknown rather than false), the rules can be used to derive additional facts. For example, applying the rule r1 mined from the graph in Figure 1a, the missing living place of Dave can be deduced based on the data about his wife Clara. Second, rules can be used to eliminate errors in the KG. For example, assuming that livesIn is a functional relation, Amsterdam as a living place of Alice could be questioned, as it differs from her husband's.

State of the Art and its Limitations. Methods for learning rules from KGs are typically based on inductive logic programming or association rule mining (see [13] and


references given there). However, these methods are limited to Horn rules, where all predicates in the rule body are positive. This is insufficient to capture rules that have exceptions, such as "Married people live in the same place unless one is a researcher":

r2 : livesIn(Y, Z) ← isMarriedTo(X, Y), livesIn(X, Z), not researcher(Y).

This additional knowledge could be an explanation for Alice living in an unexpected place. If r2 often holds, then one can no longer complete the missing living place for Dave by assuming that he lives with his wife Clara. Thus, for KG completion and curation, understanding exceptions is crucial.

Our goal is to learn rules with exceptions, also known as non-monotonic rules. Learning non-monotonic rules under a Closed World Assumption is a well-studied problem that lies at the intersection of inductive and abductive logic programming (e.g., [28,29]). However, these methods cannot be applied to KGs treated under the OWA.

Approach and Contribution. We present a novel method that takes a KG and a set of (previously learned) Horn rules as input and yields a set of exception-enriched rules as output. The output rules are no longer necessarily Horn clauses (e.g., rule r2 above could be in our output). So we essentially tackle a knowledge revision problem.

Our method proceeds in four steps. First, we compute what we call "exception witnesses": predicates that are potentially involved in explaining exceptions (e.g., researcher in our example). Second, we generate non-monotonic rule candidates that we could possibly add to our KG rules. Third, we devise quality measures for non-monotonic rules to quantify their strength w.r.t. the KG. In contrast to prior work, we do not merely give measures for individual rules in isolation, but consider their cross-talk through a new technique that we call "partial materialization". Fourth and last, we rank the non-monotonic rules by their strengths and choose a cut-off point such that the obtained rules describe the KG's content as well as possible with awareness of exceptions.

The salient contributions of our paper are:

– A framework for non-monotonic rule mining as a knowledge revision task, to capture exceptions from Horn rules and overcome the limitations of prior work on KG rule mining.

– An algorithm for computing exception candidates, measuring their quality, and ranking them based on a novel technique with partial materialization of judiciously selected rules.

– Experiments with the YAGO3 and IMDB KGs, where we show the gains of our method for rule quality as well as fact quality when performing KG completion.

The rest of the paper is structured as follows. Sec. 2 introduces necessary notation and definitions. Sec. 3 presents our approach to non-monotonic rule mining. Sec. 4 gives details on computing exceptions and knowledge revision candidates. Sec. 5 describes how we measure the quality of rules. Sec. 6 presents an experimental evaluation of how exception-enriched rules can improve the KG quality. Sec. 7 discusses related work. An extended version of this paper is available as a technical report [12].

2 Preliminaries

Knowledge Graphs. On the Web, knowledge graphs (KGs) are often encoded using the RDF data model [18], which represents the content of the graph with a set of triples


[Figure omitted. (a) Rule mining for KG completion and KG cleaning; (b) US inhabitants KG: for each person p1–p11, ticks mark which of the predicates bornInUS, livesInUS, stateless, immigrant, singer, poet and hasUSPass hold.]

Fig. 1: Examples of Knowledge Graphs

of the form 〈subject predicate object〉. These triples encode positive facts about the world, and they are naturally treated under the OWA.

In this work, we focus on KGs without blank nodes or schema (TBox in the OWL terminology). For simplicity, we represent the triples using unary and binary predicates. The unary predicates are the objects of the RDF isA predicate, while the binary ones correspond to all other RDF predicates. We call this the factual representation AG (the subscript G is omitted when clear from context) of the KG G, defined over the signature ΣAG = 〈C, R, C〉, where C, R and C are sets of unary predicates, binary predicates and constants respectively.

Example 1. The factual representation of the graph G from Figure 1a is as follows:

AG = { (1) ism(brad, ann); (2) ism(john, kate); (3) ism(bob, alice); (4) ism(alex, dave); (5) hb(ann, john); (6) li(brad, berlin); (7) li(ann, berlin); (8) li(john, chicago); (9) li(kate, chicago); (10) li(bob, berlin); (11) li(clara, chicago); (12) li(alice, amsterdam); (13) r(alice); (14) r(dave) }

where ism, li, hb, r stand for isMarriedTo, livesIn, hasBrother, and researcher resp. The signature of AG is ΣAG = 〈C, R, C〉, where C = {r}, R = {ism, li, hb} and C = {alex, john, ann, brad, dave, clara, alice, kate, bob, chicago, berlin, amsterdam}. □

In this work, we focus primarily on mining rules over unary predicates. Binary relations can be translated into multiple unary ones by concatenating the binary predicate and one of its arguments, e.g., the binary predicate livesIn of Fig. 1a can be translated into three unary ones: livesInAmsterdam, livesInBerlin, livesInChicago. We apply this conversion to the input KG so that it consists of a collection of unary facts.
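The conversion can be sketched as follows (a hypothetical helper; capitalizing the object is just one way to build the concatenated predicate name):

```python
def binarize_to_unary(triples):
    """Turn binary facts p(s, o) into unary facts pO(s) by
    concatenating the predicate name with its object argument."""
    unary = set()
    for s, p, o in triples:
        unary.add((p + o.capitalize(), s))
    return unary

facts = {("clara", "livesIn", "chicago"), ("bob", "livesIn", "berlin")}
print(binarize_to_unary(facts))
# contains ('livesInChicago', 'clara') and ('livesInBerlin', 'bob')
```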

Nonmonotonic Logic Programs. We define a logic program in the usual way [23]. In short, a (nonmonotonic) logic program P is a set of rules of the form

H ← B,not E (1)


where H is a standard first-order atom of the form a(X), B is a conjunction of atoms of the form b1(Y1), . . . , bk(Yk), and not E, with slight abuse of notation, denotes the conjunction of atoms not bk+1(Yk+1), . . . , not bn(Yn), where not is called negation as failure (NAF) or default negation. X, Y1, . . . , Yn are tuples of either constants or variables whose length corresponds to the arity of the predicates a, b1, . . . , bn respectively. The signature of P is given as ΣP = 〈P, C〉, where P and C are sets of predicates and constants occurring in P respectively.

The left-hand side H of r, denoted Head(r), is called its head, while the right-hand side, denoted Body(r), is the rule's body. We distinguish Body+(r) and Body−(r) as the sets of positive and negative atoms occurring in Body(r) respectively. The rule r is positive or Horn if Body−(r) = ∅, and it is a fact if Body(r) = ∅.

A logic program P is ground if it consists of only ground rules, i.e., rules without variables. The ground instantiation Gr(P) of a nonground program P is obtained by substituting variables with constants in all possible ways. The Herbrand universe HU(P) (resp. Herbrand base HB(P)) of P is the set of all constants occurring in P, i.e., HU(P) = C (resp. the set of all possible ground atoms that can be formed with predicates in P and constants in C). We refer to any subset of HB(P) as a Herbrand interpretation. By MM(P) we denote the set-inclusion minimal Herbrand interpretation of a ground positive program P.

An interpretation I of P is an answer set (or stable model) of P iff I ∈ MM(P^I), where P^I is the Gelfond-Lifschitz reduct [14] of P, obtained from Gr(P) by removing (i) each rule r such that Body−(r) ∩ I ≠ ∅, and (ii) all the negative atoms from the remaining rules. The set of answer sets of a program P is denoted by AS(P).

Example 2. Consider the program

P = { (1) bornInUS(alex); (2) bornInUS(mat); (3) immigrant(mat); (4) livesInUS(X) ← bornInUS(X), not immigrant(X) }

The ground instantiation Gr(P) of P is obtained from P by substituting X with mat and alex. For the interpretation I = {bornInUS(alex), bornInUS(mat), immigrant(mat), livesInUS(alex)}, the GL-reduct P^I of P contains the rule livesInUS(alex) ← bornInUS(alex) and the facts (1)-(3). As I is a minimal model of P^I, it holds that I is an answer set of P. □
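The reduct-based check of Example 2 can be reproduced in a few lines (a sketch for ground programs, with rules encoded as (head, positive body, negative body) triples; the encoding is ours, not the paper's):

```python
def gl_reduct(rules, interpretation):
    """Gelfond-Lifschitz reduct: drop rules whose negative body
    intersects I; strip negation from the remaining rules."""
    return [(h, pos) for (h, pos, neg) in rules
            if not (set(neg) & interpretation)]

def minimal_model(positive_rules):
    """Least model of a ground positive program via fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for h, pos in positive_rules:
            if set(pos) <= model and h not in model:
                model.add(h)
                changed = True
    return model

# Ground instantiation of the program from Example 2.
rules = [
    ("bornInUS(alex)", [], []),
    ("bornInUS(mat)", [], []),
    ("immigrant(mat)", [], []),
    ("livesInUS(alex)", ["bornInUS(alex)"], ["immigrant(alex)"]),
    ("livesInUS(mat)", ["bornInUS(mat)"], ["immigrant(mat)"]),
]

I = {"bornInUS(alex)", "bornInUS(mat)", "immigrant(mat)", "livesInUS(alex)"}
print(minimal_model(gl_reduct(rules, I)) == I)  # True: I is an answer set
```

Note that the rule for mat is removed by the reduct because immigrant(mat) ∈ I, so livesInUS(mat) is (correctly) not derived.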

The answer set semantics for nonmonotonic logic programs is based on the Closed World Assumption (CWA), under which whatever cannot be derived from a program is assumed to be false. Nonmonotonic logic programs are widely applied for formalizing common-sense reasoning from incomplete information.

Definition 1 (Rule-based KG completion). Let G be a KG and A its factual representation over the signature ΣA = 〈C, R, C〉. Let, moreover, R be a set of rules mined from G, i.e., rules over the signature ΣR = 〈C ∪ R, C〉. Then the completion of G (resp. A) w.r.t. R is a graph GR constructed from any answer set AR ∈ AS(R ∪ A).

Example 3. Consider a factual representation A of a KG G given in tabular form in Table 1b, where a tick appears at the intersection of a row s and a column o if o(s) ∈ A (resp. 〈s isA o〉 ∈ G). Suppose we are given a set of rules R = {r1, r2}, where

r1 : livesInUS(X) ← bornInUS(X), not immigrant(X);


r2 : livesInUS(X) ← hasUSPass(X).

The program A ∪ R has a single answer set AR = A ∪ {livesInUS(pi) | i = 6, 7, 11}, from which the completion GR of G can be reconstructed. □

3 Learning Exception-enriched Rules

Horn rule revision. Before we formally define our problem, we introduce the notion of an incomplete data source, following [8].

Definition 2 (Incomplete data source). An incomplete data source is a pair G = (Ga, Gi) of two KGs, where Ga ⊆ Gi and ΣAGa = ΣAGi. We call Ga the available graph and Gi the ideal graph.

The graph Ga is the graph that we have available as input. The ideal graph Gi is the perfect completion of Ga, which is supposed to contain all correct facts with entities and relations from ΣAGa that hold in the current state of the world.

Given a potentially incomplete graph Ga and a set of Horn rules RH mined from Ga, our goal is to add default negated atoms (exceptions) to the rules in RH and obtain a revised ruleset RNM such that the set difference between GaRNM and Gi is smaller than between GaRH and Gi. If, in addition, the set difference between GaRNM and Gi is the smallest among the ones produced by all other revisions R′NM of RH, then we call RNM an ideal nonmonotonic revision. For single rules such a revision is defined as follows:

Definition 3 (Ideal nonmonotonic revision). Let G = (Ga, Gi) be an incomplete data source. Moreover, let r : a ← b1, . . . , bk be a Horn rule mined from Ga. An ideal nonmonotonic revision of r w.r.t. G is any rule

r′ : a ← b1, . . . , bk, not bk+1, . . . , not bn,   (2)

such that Gi △ Gar′ ⊂ Gi △ Gar¹, i.e., the completion of Ga based on r′ is closer to Gi than the completion of Ga based on r, and Gar′′ △ Gi ⊂ Gar′ △ Gi for no other nonmonotonic revision r′′ ≠ r′ of r. If k = n, then the revision coincides with the original rule.

In our work, we assume that the ideal graph Gi is not available (otherwise nothing would need to be learnt). Therefore, we cannot verify whether a revision of RH is ideal. What we can do, however, is to estimate, using some quality function, whether a given revision produces an approximation of Gi that is better than the approximation produced by the original Horn ruleset. For this purpose, we introduce a generic quality function q which receives as input a revision RNM of the ruleset RH and a graph G, and returns a real value that reflects the quality of the revised set RNM. We can now formally define our problem:

¹ G1 △ G2 = (G1\G2) ∪ (G2\G1)


Problem: quality-based Horn rule revision
Given: KG G, set of nonground Horn rules RH mined from G, quality function q
Find: set of rules RNM obtained by adding default negated atoms to Body(r) for some r ∈ RH, such that q(RNM, G) is maximal.

Note that so far we did not specify the details of the quality function q. In our approach, we estimate the quality of a ruleset by exploiting well-established measures proposed in the field of data mining [3]. Even though none of these measures can offer any sort of guarantee, our hypothesis is that they still indicate to some extent the percentage of correctly predicted facts obtained as a result of completing a KG based on a given ruleset. We discuss in Sec. 5 in more detail how q can be defined.

Approach Overview. Fig. 2 illustrates the main phases of our approach. In Step 1, we launch an off-the-shelf algorithm to mine Horn rules from the input KG (we use FPGrowth [15], but others, e.g., [6] or [13], can be applied). Then, for each rule we compute normal and abnormal instance sets, defined as follows:

Definition 4 (r-(ab)normal instance set). Let A be the factual representation of a KG G and r : a(X) ← b1(X), . . . , bk(X) a Horn rule mined from G. Then

– NS(r, A) = {c | b1(c), . . . , bk(c), a(c) ∈ A} is an r-normal instance set;
– ABS(r, A) = {c | b1(c), . . . , bk(c) ∈ A, a(c) ∉ A} is an r-abnormal instance set.

Example 4. For A from Table 1b and the rule r : livesInUS(X) ← bornInUS(X), the r-normal and r-abnormal instance sets are given as NS(r, A) = {p1, . . . , p5} and ABS(r, A) = {p6, . . . , p11} respectively. □
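Definition 4 translates directly into code. The sketch below uses a hypothetical A fragment shaped like the running example (the real Table 1b is larger), with facts as (predicate, constant) pairs:

```python
def instance_sets(a_kb, body_preds, head_pred):
    """Split instances that satisfy the whole rule body into r-normal
    (head also holds) and r-abnormal (head missing) sets (Definition 4)."""
    constants = {c for (_, c) in a_kb}
    ns, abs_ = set(), set()
    for c in constants:
        if all((b, c) in a_kb for b in body_preds):
            (ns if (head_pred, c) in a_kb else abs_).add(c)
    return ns, abs_

# Hypothetical fragment: p1, p2 born and living in the US; p6 born there only.
A = {("bornInUS", "p1"), ("livesInUS", "p1"),
     ("bornInUS", "p2"), ("livesInUS", "p2"),
     ("bornInUS", "p6"), ("immigrant", "p6")}

ns, abs_ = instance_sets(A, ["bornInUS"], "livesInUS")
print(ns, abs_)  # normal: p1, p2; abnormal: p6
```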

Intuitively, if the given data was complete, then the r-normal and r-abnormal instance sets would exactly correspond to the instances for which the rule r holds (resp. does not hold) in the real world. Since the KG is potentially incomplete, this is no longer the case, and some r-abnormal instances might in fact be classified as such due to data incompleteness. In order to distinguish the "wrongly" and "correctly" classified instances in the r-abnormal set, in Step 2 we construct exception witness sets (EWS), which are defined as follows:

Definition 5. Let A be the factual representation of a KG G and let r be a Horn rule mined from G. An r-exception witness set EWS(r, A) = {e1, . . . , el} is a set of predicates such that

(i) ei(c′) ∈ A for some c′ ∈ ABS(r, A), 1 ≤ i ≤ l, and

(ii) e1(c), . . . , el(c) ∉ A for all c ∈ NS(r, A).

Example 5. For A and r from Ex. 4, EWS(r, A) = {immigrant} is an r-exception witness set. For A′ = A\{p5} it holds that EWS(r, A′) = {immigrant, stateless}. □
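Conditions (i) and (ii) of Definition 5 can be checked mechanically. The sketch below enumerates single-predicate witnesses only (the l = 1 case), over an illustrative data fragment:

```python
def exception_witnesses(a_kb, ns, abs_, body_preds, head_pred):
    """Predicates holding for at least one r-abnormal instance
    and for no r-normal instance (Definition 5, l = 1 case)."""
    preds = {p for (p, _) in a_kb} - set(body_preds) - {head_pred}
    return {p for p in preds
            if any((p, c) in a_kb for c in abs_)
            and not any((p, c) in a_kb for c in ns)}

A = {("bornInUS", "p1"), ("livesInUS", "p1"),
     ("bornInUS", "p6"), ("immigrant", "p6")}
ews = exception_witnesses(A, {"p1"}, {"p6"}, ["bornInUS"], "livesInUS")
print(ews)  # {'immigrant'}
```

The full definition allows sets of predicates that jointly satisfy (i) and (ii); the paper's Algorithm 1 finds witnesses via inductive learning rather than this brute-force scan.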

After EWSs are computed for all rules in RH, we use them to create potential revisions in Step 3. Then, we rank the newly created revisions and select the best ones using different criteria (Step 4). These selected rules constitute the new RNM.


[Figure omitted. Starting from the knowledge graph, Step 1 extracts a Horn ruleset, Step 2 creates exception witness sets, Step 3 constructs candidate rule revisions, and Step 4 selects the best revisions using the quality function q, yielding the ruleset RNM.]

Fig. 2: Exception-enriched Rule Learning: General Overview

Algorithm 1: ComputeEWS: compute EWS(r, A)
Input: KB A, rule r : a(X) ← b1(X), . . . , bk(X)
Output: EWS(r, A)

(a) N ← NS(r, A); A ← ABS(r, A)
(b) E+ ← {not a(c) | c ∈ A}; E− ← {not a(c) | c ∈ N}
(c) Re ← Learn(E+, E−, A)
(d) EWS ← {predicate p in Body+(r′) | r′ ∈ Re, s.t. p is not in Body+(r)}
(e) return EWS

4 Computing Exception Witnesses and Potential Rule Revisions

In this section we describe how we calculate the exception witness sets for Horn rules (Fig. 2, Step 2) and how we create potential rule revisions (Fig. 2, Step 3).

Computing Exception Witness Sets. For constructing exception witness sets we use the algorithm ComputeEWS (Alg. 1), which, given a factual representation A of a KG and a rule r ∈ RH as input, outputs the set EWS(r, A).

The algorithm works as follows. First, in (a), the r-normal NS(r, A) and r-abnormal ABS(r, A) instance sets are found and stored resp. in N and A. Then, in (b), for the fresh predicate not a, the facts not a(c) are added to E+ for all c ∈ ABS(r, A). In the same step, the facts not a(c) for c ∈ N are stored in E−. In (c), a variant of a classical inductive learning procedure Learn(E+, E−, A), e.g., [25], is employed to induce a set of hypotheses Re in the form of Horn rules with single body atoms, s.t. A ∪ Re |= e for as many as possible e ∈ E+, and A ∪ Re ⊭ e′ for e′ ∈ E−. Finally, in (d), the bodies of rules in Re not containing predicates from Body+(r) are put in EWS, which is output in (e).

The correctness of ComputeEWS follows from the correctness of the procedure Learn. Indeed, by (d), for p ∈ EWS a rule r′ with p occurring in Body(r′) exists in Re. Since r′ ∪ A ⊭ not a(c) for not a(c) ∈ E−, we have that p(c) ∉ A for r-normal c due to (a) and (b). Moreover, p(c′) ∈ A for some r-abnormal c′, as otherwise r′ ∉ Re. Hence, (i) and (ii) of Def. 5 hold, i.e., EWS is an exception witness set for r w.r.t. A.


Rule Measure | Formula for r : H ← B
Confidence | conf(r, A) = n(HB) / n(B)
Lift | lift(r, A) = n(HB) / (n(H) ∗ n(B))
Jaccard coef. | jc(r, A) = n(HB) / (n(H) + n(B) − n(HB))

Table 1: Rule evaluation measures for r
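In code, the three measures of Table 1 are simple functions of the instance counts (the counts passed in the example call are hypothetical):

```python
def conf(n_hb, n_b):
    """Confidence: fraction of body instances where the head also holds."""
    return n_hb / n_b

def lift(n_hb, n_h, n_b):
    """Lift: head-body co-occurrence relative to independence."""
    return n_hb / (n_h * n_b)

def jaccard(n_hb, n_h, n_b):
    """Jaccard coefficient: overlap over union of head and body instances."""
    return n_hb / (n_h + n_b - n_hb)

# Hypothetical counts: 8 body instances, 10 head instances, 6 overlapping.
print(conf(6, 8), jaccard(6, 10, 8))  # 0.75 0.5
```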

Constructing Candidate Rule Revisions. After all EWSs are calculated for the Horn rules in RH, we construct a search space of potential revisions by adding to rule bodies exceptions in the form of default negated atoms. More specifically, for every ri : a(X) ← b1(X), . . . , bn(X) in RH we create n = |EWS(ri, A)| revision candidates, i.e., rules r_i^ej such that Head(r_i^ej) = Head(ri), Body+(r_i^ej) = Body(ri), Body−(r_i^ej) = {ej(X)}, where ej ∈ EWS(ri, A). We denote with Ri the set of all r_i^ej.

Example 6. For EWS(r, A′) = {immigrant, stateless} from Ex. 5, in Step 3 the revision candidates rim : livesInUS(X) ← bornInUS(X), not immigrant(X) and rst : livesInUS(X) ← bornInUS(X), not stateless(X) are created. □
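Step 3 thus amounts to creating one candidate per witness. A sketch, with rules encoded as (head, positive body, negative body) triples (our encoding, not the paper's):

```python
def revision_candidates(rule, ews):
    """For a Horn rule, create one nonmonotonic candidate per
    exception witness by default-negating it in the body."""
    head, pos_body, _ = rule
    return [(head, pos_body, [e]) for e in sorted(ews)]

horn = ("livesInUS(X)", ["bornInUS(X)"], [])
for cand in revision_candidates(horn, {"immigrant", "stateless"}):
    print(cand)
# ('livesInUS(X)', ['bornInUS(X)'], ['immigrant'])
# ('livesInUS(X)', ['bornInUS(X)'], ['stateless'])
```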

5 Rules Quality Assessment

Given a potential RNM, the function q should approximate the closeness between the completion GaRNM of the input KG Ga and the ideal KG Gi. In this work, we follow usual practice in data mining and adapt standard association rule measures² to our needs. Let rm be a generic rule measure, e.g., one defined in Tab. 1. Then, naively generalizing rm to rulesets by taking the average of the rm values of all rules in a given set, we obtain

qrm(RNM, A) = ( Σ_{r ∈ RNM} rm(r, A) ) / |RNM|   (3)

In our case, qrm alone is not sufficiently representative to be the target quality function q, for two reasons: (1) it does not penalize rules with noisy exceptions³; (2) it does not measure how many contradicting beliefs our revisions reflect.

Example 7. (1) For r : livesInUS(X) ← hasUSPass(X), not poet(X) and A (from Fig. 1b) we have conf(r, A) = 1, as all 3 non-poets with US passports live in the US, i.e., r gets the highest individual score based on confidence. However, poet is a noisy exception due to p3, who is a poet possessing a US passport and living in the US.
(2) Let RNM = {r1 : lu(X) ← hu(X), st(X); r2 : lu(X) ← bu(X), not im(X); r3 : im(X) ← st(X)}, where lu, hu, bu, st, im stand for livesInUS, hasUSPass, bornInUS, stateless and immigrant. Although im in r2 may perfectly fit as an exception w.r.t. some (unspecified here) original KG, once the KG is completed based on r1 and r3, im might become noisy for r2. Indeed, r1 can easily bring new instances c into lu, while r3 can predict facts im(c). If this is the case, i.e., r2 ∈ RNM becomes noisy after other rules in RNM are applied, then intuitively the rules in RNM do not agree on the beliefs about Gi they express. □

² Tab. 1 reports the definitions of confidence, lift and Jaccard coefficient – three commonly-used rule measures [1]. Here, n(B) (resp. n(H)) denotes the number of instances for which the body (resp. head) of a rule H ← B is satisfied in a KB A, or, in data mining terminology, the number of transactions in A with items from B (resp. H).

³ e is a noisy exception for r if e(c) ∈ A for some r-normal c.

To resolve the above issues we introduce an additional quality function qconflict, next to qrm, whose purpose is to evaluate the ruleset w.r.t. (1) and (2).

To measure qconflict for RNM, we create an extended set of rules Raux, which contains every revised rule r : a(X) ← b(X), not e(X) in RNM and its auxiliary version raux : not a(X) ← b(X), e(X), where not a is a fresh predicate collecting instances that are not in a. Notice that raux is meaningless, and thus void in Raux, for rules r with positive bodies. Formally, we define qconflict as follows:

qconflict(RNM, A) = Σ_{p ∈ pred(Raux)} |{c | p(c), not p(c) ∈ ARaux}| / |{c | not p(c) ∈ ARaux}|   (4)

where pred(Raux) is the set of predicates appearing in Raux. Intuitively, ARaux contains both positive predictions of the form p(c) and negative ones not p(c) produced by the rules in Raux. The function qconflict computes the ratio of "contradicting" pairs {p(c), not p(c)} over the number of facts not p(c)⁴ in ARaux, which reflects how much the rules in RNM disagree with each other on the beliefs about the ideal KG Gi they express. The smaller qconflict, the better the ruleset RNM.

Revision based on partial materialization. Our goal in Step 4 is to find a set of revisions RNM for which qrm(RNM, A) is maximal and qconflict(RNM, A) is minimal.

To determine such a globally best set RNM, many candidate rule combinations have to be checked, which is unfortunately not feasible given the large size of our A and of the EWSs. Therefore, we propose an approach where we incrementally build RNM by considering every ri ∈ RH and choosing the best revision rji ∈ Ri for it. In order to select the best rji, we use a special ranking function, which estimates how well the rule r at hand describes the data and how noisy its exceptions are. In the remainder of this section, we propose four different ranking functions, from the simplest to the most sophisticated one.

Naive-ranker. The first implementation, which we call rank_naive, calculates the average of the rm scores of r and raux and uses it to rank the rules. Formally, the average is computed by the following function:

estrm(r, A) = ( rm(H ← B, not E, A) + rm(not H ← B, E, A) ) / 2   (5)

where the first summand scores r and the second its auxiliary version raux.

⁴ Taking the ratio over the number of facts p(c) is also possible, but then qconflict is smaller and less representative.


where rm is one of the measures in Tab. 1. E.g., plugging in conf instead of rm gives

estconf(r, A) = 1/2 ( (n(BH) − n(BHE)) / (n(B) − n(BE)) + (n(BE) − n(BHE)) / n(BE) )   (6)

Example 8. For r and A from Ex. 7 (1), estconf(r, A) = 0.75, i.e., due to the noisiness of poet the value of estconf decreased. □
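Formula (6) as a function of the four counts; the example call uses counts chosen to be consistent with the values reported in Examples 7 and 8 (the underlying table is not fully recoverable here, so these counts are an assumption):

```python
def est_conf(n_b, n_bh, n_be, n_bhe):
    """Average of the confidences of r: H <- B, not E and its
    auxiliary rule r_aux: not_H <- B, E, per formula (6)."""
    conf_r = (n_bh - n_bhe) / (n_b - n_be)
    conf_raux = (n_be - n_bhe) / n_be
    return (conf_r + conf_raux) / 2

# Assumed counts: 5 US-passport holders (n_b), 4 of them living in the
# US (n_bh), 2 poets among the holders (n_be), 1 poet holder living in
# the US (n_bhe) -- the noisy p3.
print(est_conf(5, 4, 2, 1))  # 0.75
```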

PM-ranker. The main problem of rank_naive is that it does not exploit any knowledge about the properties that a final revision RNM might have. In other words, the ranking of revisions of a rule at hand is completely independent from the ranking of revisions for other rules. To address this issue, we propose a second implementation called revision based on partial materialization (denoted as rank_pm). Here, the idea is to apply estrm for a rule r not on A but on a completion of A based on other rules, which according to our estimates constitute some approximation of RNM.

Example 9. Consider a rule r1 : lu(X) ← bu(X), not im(X), and suppose there is only a single other rule r2 : lu(X) ← hu(X) given, for which EWS(r2, A) = ∅ for A from Fig. 1b. This knowledge can be exploited when ranking r1. We have estconf(r1, A) = 0.8, while estconf(r1, Ar2) = 0.875 due to the materialized fact livesInUS(p11). This increase gives us an indication that r1 agrees with r2 on the predictions they make.

On the contrary, for the rules r3 : lu(X) ← hu(X), not pt(X), where pt stands for poet, and r4 : lu(X) ← bu(X), we have that estconf(r3, A) = 0.75, while estconf(r3, Ar4) = 0.5, which witnesses that the beliefs of r3 and r4 contradict. □

The function rank_pm first constructs the temporary rule set Rt, which contains, for every rule ri ∈ RH, a rule rti with all exceptions from EWS(ri, A) incorporated, i.e., Rt predicts the smallest number of facts, which are also predicted by any possible revision RNM. Then, for each ri ∈ RH, we compute the estrm value for all revision candidates rji based on ARt\rti. Formally,

rank_pm(rji, A) = estrm(rji, ARt\rti)   (7)

Once the scores of all revision candidates rji for ri are computed, we pick the revision with the highest score, add it to the current snapshot of RNM and move on to ri+1.

OPM-ranker. With rank_pm, facts inferred by rules of low quality might have a significant impact on more promising rules. To handle this issue, we propose a variation of rank_pm called revision with ordered partial materialization (abbr. rank_opm), which proceeds as follows. First, we rank the Horn rules based on some measure rm′ (possibly the same as rm) and obtain an ordered list os_RH. Then we go through os_RH, and for every rule r_i we compute a snapshot A_i of A by materializing only those rules r_k^t ∈ R_t for which r_k is ordered higher in the list os_RH than r_i. More formally,

rank_opm_rm(r_i, A) = est_rm(r_i, A_i)    (8)

where A_i = A^{R_t \ {r_k^t | os_RH[k] = r_k, k ≥ i}}.
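The ordered variant can be sketched as follows; this is a hypothetical illustration with our own names, assuming each revised rule is a callable mapping a fact set to the facts it infers and est_rm is supplied by the caller.

```python
def rank_opm(ordered_rules, est_rm, facts):
    """OPM-ranker sketch. `ordered_rules` is the list os_RH of revised
    rules r_k^t sorted by rm' in descending order; the rule at position i
    only sees the facts inferred by rules at positions k < i."""
    scores = []
    for i, rule in enumerate(ordered_rules):
        snapshot = set(facts)            # A_i starts from A ...
        for higher in ordered_rules[:i]:
            snapshot |= higher(facts)    # ... plus higher-ranked inferences
        scores.append(est_rm(rule, snapshot))
    return scores
```

Note that the snapshot grows monotonically down the list, so low-quality rules at the tail cannot influence the scores of the rules ranked above them.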


OWPM-ranker. With rank_opm as we have defined it, the facts inferred by rules count the same as the true facts in A. Since the predicted facts are inferred based on statistically-supported assumptions, it is natural to distinguish them from the facts that are explicitly present in A. To achieve this, we propose one last ranking function that exploits weights assigned to facts. Here, there is a clear distinction between facts from A (which get maximal weight) and the predicted facts (which inherit weights from the rules that inferred them). We call this method revision with ordered weighted partial materialization (abbr. rank_owpm).

The method rank_owpm differs from rank_opm in that weights are used to estimate the revisions' scores. It is convenient (and common practice) to assign weights of a probabilistic nature between 0 and 1 (e.g., confidence can be exploited). There are several ways to produce a weighted partial materialization, for example using probabilistic logic programming systems such as ProbLog [10] or PrASP [27].

However, in such systems facts predicted by some rules in a rule set are normally used as input to other rules, i.e., uncertainty is propagated through rule chains, which might be undesired in our setting. To avoid such propagation, when computing the weighted partial materialization of A we keep the predicted facts (i.e., those derived using rules) separate from the explicit facts (i.e., those in A), and infer new facts using only A.

The method rank_owpm works as follows. Initially, we sort the rules in R_H and create the A_i's with the same procedure as described for rank_opm. The only difference is that here every fact inferred in A^{R_t} receives a specific weight that corresponds to rm(r′, A), where r′ is the positive version of the rule that inferred the fact^5. If the same fact is derived by multiple rules, we keep the highest weight.

The weights play a role when we evaluate a rule w.r.t. the partially materialized KB. To this end, we slightly change the rm function so that it considers weighted facts (we denote such a function by rm_w). E.g., conf_w(r, A) calculates a weighted sum of the instances for which the head (resp. body) of r is satisfied w.r.t. A (instead of the plain count used in conf). Formally, rank_owpm computes a score for a revision r_i^j as follows:

rank_owpm_rmw(r_i^j, A) = est_rmw(r_i^j, A_i^w)    (9)

where A_i^w is the weighted version of A_i from Eq. 8. In the following section, we analyze the performance of these four functions on realistic KGs.
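A minimal sketch of the weighted measure, under our own naming: explicit facts get weight 1.0, predicted facts inherit the weight of the positive version of the rule that produced them, and conf_w replaces the plain counts of standard confidence by weighted sums.

```python
def conf_w(body_instances, head_instances, weight):
    """Weighted confidence: the weighted sum of instances satisfying both
    body and head, over the weighted sum of instances satisfying the body."""
    num = sum(weight[x] for x in body_instances if x in head_instances)
    den = sum(weight[x] for x in body_instances)
    return num / den if den else 0.0
```

With all weights equal to 1.0 this reduces to ordinary confidence; a predicted instance carrying weight 0.8 simply counts as 0.8 of a fact in both sums.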

6 Evaluation

Experimental setup. We considered two knowledge graphs: a slice of almost 10M non-literal facts from YAGO3 [24], a general-purpose KG, and an RDF version of IMDB^6 data with 2M facts, a well-known domain-specific KG of movies and artists. We chose these two KGs in order to evaluate our method's performance on both general-purpose and domain-specific KGs. Our experiments were performed on a machine with 40 cores

5 We cannot consider the entire rule (i.e., with all exceptions attached), since standard measures like confidence will return values very close to 1 for such rules.

6 http://imdb.com


and 400GB of RAM. The datasets used and the experimental code are publicly available at http://people.mpi-inf.mpg.de/˜gadelrab/rules_iswc.

Outline. First, we evaluate different configurations of our method using the quality functions q_rm and q_conflict defined in Sec. 5. Then, we report the results of a manual assessment that we performed to evaluate the quality of the predictions, reporting good and bad examples produced by our method.

6.1 Evaluation of the revision steps

Step 1. Initially, we considered the Horn rules produced by AMIE [13]. However, these mainly focus on binary predicates, and the only unary rules are restricted to the isA predicate, which was too limiting for us. Therefore, we decided to mine the Horn rules using the association rule mining implementation based on standard FPGrowth [15] offered by the SPMF library^7. In order to avoid overfitting rules as well as to reduce the computation, we limited the extraction to rules with at most four body atoms, a single head atom, a minimum support of 0.0001 × #entities, and a minimum confidence of 0.25 for YAGO. Since IMDB is smaller and more connected, we set a higher minimum support of 0.005 × #entities and a confidence of 0.6. On our machine, this process took approx. 10 seconds on YAGO and 2.5 seconds on IMDB, and it generated about 10K and 25K rules, respectively.
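The thresholds above translate into a simple filter. The following sketch (our own names, not the SPMF API) shows how a mined candidate rule would be kept or dropped under such settings:

```python
def keep_rule(support, body_support, num_entities,
              min_support_ratio, min_confidence):
    """Keep a mined rule if its absolute support clears
    min_support_ratio * num_entities and its confidence
    (support / body support) clears min_confidence."""
    min_support = min_support_ratio * num_entities
    confidence = support / body_support if body_support else 0.0
    return support >= min_support and confidence >= min_confidence
```

Under the YAGO settings (ratio 0.0001, confidence 0.25), a KG with one million entities yields an absolute support threshold of 100 instances.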

Steps 2 and 3. We implemented a simple inductive learning procedure as manipulations on the set of facts instantiating the rule and its body to obtain the EWS. This process took about 50 seconds for YAGO and 30 seconds for IMDB to generate EWSs with a minimum support of 0.05. We could find EWSs for about 6K rules mined from YAGO and 22K rules mined from IMDB. On average, the EWSs contained 3 exceptions for YAGO rules and 28 exceptions for IMDB rules.

Step 4. We evaluated the quality of our rule selection procedure along two dimensions, which reflect the two quality functions q proposed in Sec. 5: the average of the rules' confidence (q_conf) and the number of conflicts (q_conflict). The average confidence shows how well the revised rules adhere to the input. The number of conflicts indicates how consistent the revised rule set is w.r.t. the final predictions it makes. Due to space constraints, we report the results using only confidence as the rule evaluation function (defined in Eq. 6) and lift as the rule ordering criterion, as we found this combination to be a good representative.

Fig. 3 reports the average rules' confidence obtained using the four ranking functions to select the best revisions. Horn reports the average confidence of the original Horn rules, while Naive, PM, OPM, and OWPM are our ranking methods described in Sec. 5. For both inputs, we show the results on the top 10, . . . , 100% of rules ranked by lift.

We make two important observations. First, we notice that in general enriching Horn rules with exceptions increases the average confidence (approx. 11% for YAGO, 3.5% for IMDB). This indicates that our method is useful for mining rules that reflect the data more precisely. Second, the comparison between the four ranking methods shows that the highest confidence is achieved by the non-materialized function (Naive), followed by the weighted procedures.

7 http://www.philippe-fournier-viger.com/spmf/


Fig. 3: Average rules' confidence on YAGO and IMDB (higher is better). (a) Confidence for top-k YAGO revised rules; (b) confidence for top-k IMDB revised rules. [Plot data omitted: avg. confidence (y-axis) vs. top-k rules (x-axis, 10–100%) for Horn, Naive, PM, OPM, and OWPM.]

Fig. 4: Ratio of conflicts on YAGO and IMDB (lower is better). (a) Ratio of conflicts for YAGO rules; (b) ratio of conflicts for IMDB rules. [Plot data omitted: conflict ratio (y-axis) vs. top-k rules (x-axis, 200–1000) for Naive, PM, OPM, and OWPM.]

The higher value of Naive was expected; in fact, this procedure is designed to maximize the confidence. However, confidence alone is not a good indicator of the overall rule quality, as we explained in Section 5. Fig. 4 shows the number of conflicts (for YAGO and IMDB) that were obtained by executing the revised rules and their corresponding auxiliary versions (r_aux) using the DLV system [21]. Unfortunately, DLV was unable to scale to the entire rule set; hence, we used up to 1000 rules. In our experiment, a conflict occurs when we derive both p(x) and not p(x). The graphs report the ratio between the number of conflicts and the number of negated derived facts. From them, we observe that both PM and OWPM produce fewer conflicts than the Naive function in most cases. By comparing the OPM and OWPM functions, we find that the weighted version is better, especially on the IMDB dataset, where we reduce the conflicts from 775 to 685 on a base of about 2000 negated facts.
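The conflict statistic can be stated compactly. In this sketch (names are ours), a conflict is an atom derived both positively and under negation, and the reported ratio divides the conflicts by the number of negated derivations:

```python
def conflict_ratio(positive, negated):
    """`positive` and `negated` are the sets of atoms p(x) for which p(x),
    resp. not p(x), was derived; atoms in both sets are conflicts."""
    conflicts = positive & negated
    return len(conflicts) / len(negated) if negated else 0.0
```

For instance, the IMDB figures above correspond to roughly 775/2000 ≈ 0.39 for OPM versus 685/2000 ≈ 0.34 for OWPM.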

6.2 Exception-Enriched Rules vs. Horn rules

We executed the top-1000 revised rules using DLV and counted the number of derivations that our exceptions prevented. With the original Horn rules, the reasoner inferred 924591 new triples. On the other hand, our exception-enriched rule set decreased the number of inferred triples to 888215 (Naive), 892707 (PM), 892399 (OPM), and 891007 (OWPM). For IMDB we observed a smaller reduction. With the Horn rules, the


Y1: isMountain(X) ← isLocatedInAustria(X), isLocatedInItaly(X), not [isRiver(X) | isLocatedInRussia(X)]

Y2: bornInUSA(X) ← actedInMovie(X), createdMovie(X), isPerson(X), not [wonFilmfareAwards(X) | bornInNewYork(X)]

Y3: isPoliticianOfUSA(X) ← bornInUSA(X), isGovernor(X), not [isPolitOfPuertoRico(X) | isPolitOfHawaii(X)]

I1: hasLanguageEnglish(X) ← hasGenreDrama(X), hasGenreTriller(X), hasGenreCrime(X), not [producedInIndia(X) | createdByNovelist(X)]

I2: hasGenreAnimation(X) ← directedByActor(X), hasLanguageEnglish(X), producedUSA(X), hasGenreFamily(X), not [hasGenreDrama(X) | producedIn1984(X)]

Fig. 5: Anecdotal example rules (Y=YAGO, I=IMDB) with good and bad exceptions

reasoner derived 38609 triples, while with the revised rule sets the number of inferred triples decreased to 36069 (Naive), 36355 (PM), 36021 (OPM), and 36028 (OWPM).

Unfortunately, there is no automatic way to assess whether the removed inferences consist of genuine errors. Therefore, we selected the revised rule set produced by the OWPM function and sampled 259 random facts from YAGO (we selected three facts for each binary predicate to avoid skewness). Then, we manually consulted online resources like Wikipedia to determine whether these triples were indeed incorrect. We found that 74.3% of these triples consisted of factual mistakes. This number provides first empirical evidence that our method is indeed capable of detecting good exceptions and hence can improve the general quality of the Horn rules.
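The sampling step can be sketched as follows; this is an illustrative reconstruction, not the exact script used: draw at most three facts per binary predicate so that frequent predicates do not dominate the sample.

```python
import random

def stratified_sample(triples, per_predicate=3, seed=0):
    """`triples` are (subject, predicate, object) facts; return a sample
    with at most `per_predicate` facts for each predicate."""
    rng = random.Random(seed)
    by_pred = {}
    for s, p, o in triples:
        by_pred.setdefault(p, []).append((s, p, o))
    sample = []
    for _, group in sorted(by_pred.items()):
        sample.extend(rng.sample(group, min(per_predicate, len(group))))
    return sample
```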

We conclude by reporting some anecdotal examples of rules on YAGO and IMDB in Fig. 5. Within the brackets we show examples of both good (underlined in the original) and bad exceptions. In some cases, the rules have high-quality exceptions, such as rule Y1. In others, we found that the highest-ranked exceptions mainly refer to classes disjoint with the head. The complete list of mined rules with the scores given to the various exceptions is available in our repository.

7 Related Work

The problem of automatically learning patterns from KGs and exploiting them for predicting new facts has gained a lot of attention in recent years. Approaches for predicting unseen data in KGs can be roughly divided into two groups: statistics-based and logic-based. The first group applies well-known techniques like tensor factorization or neural-embedding-based models (see [26] for an overview). The second group focuses on logical rule learning (e.g., [13,31]). The most relevant works for us are in the latter group. These, however, typically focus on learning Horn rules rather than nonmonotonic (i.e., exception-enriched) rules as we do.

In the association rule mining community, some works have focused on finding (interesting) exception rules (e.g., [30]), which are defined as rules with low support (rare) and high confidence. Our work differs from this line of research because we do not necessarily look for rare interesting rules, but care about the quality of their predictions.

Another relevant stream of research is concerned with learning Description Logic TBoxes or schemata (e.g., [20]). However, these techniques focus on learning concept definitions rather than nonmonotonic rules. The main difference between us and these


approaches is that we aim at finding hypotheses that consistently predict unseen data, while DL learners focus more on learning models that perfectly describe the data.

In the context of inductive and abductive logic programming, learning nonmonotonic rules from complete datasets [11] has been studied in several works [28,29,7,19,17]. These methods rely on the CWA and focus on describing a dataset at hand by exploiting negative examples, which are explicitly given, unlike in our setting. Learning nonmonotonic rules in the presence of incompleteness was studied in hybrid settings in [16] and [22]. There, a background theory or a hypothesis can be represented as a combination of a DL ontology and Horn or nonmonotonic rules. While the focus of these works is on the complex interaction between reasoning components, we are more concerned with techniques for deriving rules with high predictive quality from huge KGs.

8 Conclusions and Future Work

We have presented a method for mining nonmonotonic rules from KGs: first learning a set of Horn rules and then revising them by adding negated atoms into their bodies, with the goal of improving the quality of the rule set for data prediction. To select the best revision from the potential candidates we devised rule-set ranking measures, based on data mining measures and the novel concept of partial materialization. We evaluated our method with various configurations on both general-purpose and domain-specific KGs and observed significant improvements over a baseline of Horn rule mining.

There are various directions for future work. First, we plan to extract evidence for or against exceptions from text and web corpora. Second, our framework can be enhanced by partial completeness assumptions for certain predicates (e.g., all countries are available in the KG) or constants (e.g., knowledge about Barack Obama is complete). Finally, an overriding future direction is to extend our work to more complex nonmonotonic rules, with higher-arity predicates, aggregates, and disjunctions in rule heads.

References

1. Agrawal, R., Carey, M.J., Livny, M.: Concurrency control performance modeling: Alternatives and implications. In: Performance of Concurrency Control Mechanisms in Centralized Database Systems, pp. 58–105 (1996)

2. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: A nucleus for a web of open data. In: Proc. of ISWC. pp. 722–735 (2007)

3. Azevedo, P.J., Jorge, A.M.: Comparing rule measures for predictive association rules. In: Proc. of ECML. pp. 510–517 (2007)

4. Bollacker, K.D., Cook, R.P., Tufts, P.: Freebase: A shared database of structured general human knowledge. In: Proc. of AAAI. pp. 1962–1963 (2007)

5. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E.R., Mitchell, T.M.: Toward an architecture for never-ending language learning. In: Proc. of AAAI (2010)

6. Chen, Y., Goldberg, S., Wang, D.Z., Johri, S.S.: Ontological pathfinding: Mining first-order knowledge from large knowledge bases. In: Proc. of SIGMOD 2016 (to appear)

7. Corapi, D., Russo, A., Lupu, E.: Inductive logic programming as abductive search. In: Proc. of ICLP. pp. 54–63 (2010)


8. Darari, F., Nutt, W., Pirro, G., Razniewski, S.: Completeness statements about RDF data sources and their use for query answering. In: Proc. of ISWC. pp. 66–83 (2013)

9. Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D.: Introducing Wikidata to the linked data web. In: Proc. of ISWC. pp. 50–65 (2014)

10. Fierens, D., Van den Broeck, G., Renkens, J., Shterionov, D.S., Gutmann, B., Thon, I., Janssens, G., De Raedt, L.: Inference and learning in probabilistic logic programs using weighted Boolean formulas. TPLP 15(3), 358–401 (2015)

11. Flach, P.A., Kakas, A.: Abduction and Induction: Essays on their Relation and Integration, vol. 18. Applied Logic Series (2000)

12. Gad-Elrab, M., Stepanova, D., Urbani, J., Weikum, G.: Exception-enriched rule learning from knowledge graphs. Tech. rep. (2016), http://people.mpi-inf.mpg.de/˜gadelrab/ExRules_TR.pdf

13. Galárraga, L., Teflioudi, C., Hose, K., Suchanek, F.M.: Fast rule mining in ontological knowledge bases with AMIE+. VLDB Journal (2015)

14. Gelfond, M., Lifschitz, V.: The stable model semantics for logic programming. In: Proc. of 5th Int. Conf. and Symp. on Logic Programming, ICLP 1988. pp. 1070–1080 (1988)

15. Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Min. Knowl. Discov. 8(1), 53–87 (2004)

16. Józefowska, J., Ławrynowicz, A., Łukaszewski, T.: The role of semantics in mining frequent patterns from knowledge bases in description logics with rules. TPLP 10(3), 251–289 (2010)

17. Katzouris, N., Artikis, A., Paliouras, G.: Incremental learning of event definitions with inductive logic programming. Machine Learning 100(2-3), 555–585 (2015)

18. Lassila, O., Swick, R.R.: Resource Description Framework (RDF) model and syntax specification (February 1999)

19. Law, M., Russo, A., Broda, K.: The ILASP system for learning answer set programs. https://www.doc.ic.ac.uk/˜ml1909/ILASP (2015)

20. Lehmann, J., Auer, S., Bühmann, L., Tramp, S.: Class expression learning for ontology engineering. J. Web Sem. 9(1), 71–81 (2011)

21. Leone, N., Pfeifer, G., Faber, W., Eiter, T., Gottlob, G., Perri, S., Scarcello, F.: The DLV system for knowledge representation and reasoning. ACM Transactions on Computational Logic (TOCL) 7(3), 499–562 (2006)

22. Lisi, F.A.: Inductive logic programming in databases: From Datalog to DL+log. TPLP 10(3), 331–359 (2010)

23. Lloyd, J.W.: Foundations of Logic Programming, 2nd Edition. Springer (1987)

24. Mahdisoltani, F., Biega, J., Suchanek, F.M.: YAGO3: A knowledge base from multilingual Wikipedias. In: Proc. of CIDR (2015)

25. Muggleton, S., Feng, C.: Efficient induction of logic programs. In: ALT. pp. 368–381 (1990)

26. Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104(1), 11–33 (2016)

27. Nickles, M., Mileo, A.: A hybrid approach to inference in probabilistic non-monotonic logic programming. In: Proc. of ICLP. pp. 57–68 (2015)

28. Ray, O.: Nonmonotonic abductive inductive learning. Journal of Applied Logic 7(3), 329–340 (2009)

29. Sakama, C.: Induction from answer sets in nonmonotonic logic programs. ACM Trans. Comput. Log. 6(2), 203–231 (2005)

30. Taniar, D., Rahayu, W., Lee, V., Daly, O.: Exception rules in association rule mining. Applied Mathematics and Computation 205(2), 735–750 (2008)

31. Wang, Z., Li, J.: RDF2Rules: Learning rules from RDF knowledge bases by mining frequent predicate cycles. CoRR abs/1512.07734 (2015)


Fig. 6: Ratio of conflicts to positive predictions on YAGO and IMDB (lower is better). (a) Ratio of conflicts for YAGO rules; (b) ratio of conflicts for IMDB rules. [Plot data omitted: conflict ratio (y-axis) vs. top-k rules (x-axis, 200–1000) for Naive, PM, OPM, and OWPM.]

Fig. 7: Average rules' Jaccard coefficient while using the Jaccard coefficient as the rm function (higher is better). (a) Jaccard coefficient for top-k YAGO revised rules; (b) Jaccard coefficient for top-k IMDB revised rules. [Plot data omitted: avg. Jaccard coefficient (y-axis) vs. top-k rules (x-axis, 10–100%) for Horn, Naive, PM, OPM, and OWPM.]

A Additional Experimental Results

A.1 Prediction Conflict Ratios

Fig. 6 shows the ratio between the conflicting predictions and the positive predictions for the experimental setting described in Sec. 6.

A.2 Jaccard Coefficient as rm function

We report the results of our approach with the Jaccard coefficient used as the rule evaluation function rm and lift as the rule ordering criterion.

Fig. 7 shows the average Jaccard coefficient for the top-k revised rules sorted in descending order of their lift value. Since lift and the Jaccard coefficient are correlated (both depend on the rule coverage), the average Jaccard coefficient decreases with the increasing number k of considered rules.
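For reference, the Jaccard coefficient of a rule measures the overlap between the instances satisfying the body and those satisfying the head; a small sketch with illustrative names:

```python
def jaccard(body_instances, head_instances):
    """Jaccard coefficient of a rule: |body ∩ head| / |body ∪ head|."""
    inter = len(body_instances & head_instances)
    union = len(body_instances | head_instances)
    return inter / union if union else 0.0
```

Unlike confidence, this measure also penalizes rules whose head holds for many instances not covered by the body, which is why it correlates with coverage-sensitive measures such as lift.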

Fig. 8 reports the average confidence of the top-k rules revised using the Jaccard coefficient as the rm function. Compared to the results presented in Fig. 3, the increase in the average confidence of the revised rules is less prominent.


Fig. 8: Average rules' confidence while using the Jaccard coefficient as the rm function (higher is better). (a) Confidence for top-k YAGO revised rules; (b) confidence for top-k IMDB revised rules. [Plot data omitted: avg. confidence (y-axis) vs. top-k rules (x-axis, 10–100%) for Horn, Naive, PM, OPM, and OWPM.]

Fig. 9: Ratio of conflicts to positive predictions while using the Jaccard coefficient as the rm function (lower is better). (a) Ratio of conflicts for YAGO rules; (b) ratio of conflicts for IMDB rules. [Plot data omitted: conflict ratio (y-axis) vs. top-k rules (x-axis, 200–1000) for Naive, PM, OPM, and OWPM.]

Furthermore, using the Jaccard coefficient as the rm function results in more conflicting predictions than using confidence (Fig. 9 and 10 vs. Fig. 6 and 4, respectively). For the top 600 and 800 rules revised either via the PM or Naive methods, DLV terminated without generating any stable model. On the other hand, the revised sets of rules in the case of OPM and OWPM were consistent, and DLV managed to generate a stable model.


Fig. 10: Ratio of conflicts to negative predictions while using the Jaccard coefficient as the rm function (lower is better). (a) Ratio of conflicts for YAGO rules; (b) ratio of conflicts for IMDB rules. [Plot data omitted: conflict ratio (y-axis) vs. top-k rules (x-axis, 200–1000) for Naive, PM, OPM, and OWPM.]