Montague meets Markov: Combining Logical and Distributional Semantics
Raymond J. Mooney, Katrin Erk, Islam Beltagy
University of Texas at Austin
Logical AI Paradigm
• Represents knowledge and data in a binary symbolic logic such as FOPC.
  + Rich representation that handles arbitrary sets of objects, with properties, relations, quantifiers, etc.
  − Unable to handle uncertain knowledge and probabilistic reasoning.
Probabilistic AI Paradigm
• Represents knowledge and data as a fixed set of random variables with a joint probability distribution.
  + Handles uncertain knowledge and probabilistic reasoning.
  − Unable to handle arbitrary sets of objects, with properties, relations, quantifiers, etc.
Statistical Relational Learning (SRL)
• SRL methods attempt to integrate methods from predicate logic (or relational databases) and probabilistic graphical models to handle structured, multi-relational data.
SRL Approaches (A Taste of the “Alphabet Soup”)
• Stochastic Logic Programs (SLPs) (Muggleton, 1996)
• Probabilistic Relational Models (PRMs) (Koller, 1999)
• Bayesian Logic Programs (BLPs) (Kersting & De Raedt, 2001)
• Markov Logic Networks (MLNs) (Richardson & Domingos, 2006)
• Probabilistic Soft Logic (PSL) (Kimmig et al., 2012)
SRL Methods Based on Probabilistic Graphical Models
• BLPs use definite-clause logic (Prolog programs) to define abstract templates for large, complex Bayesian networks (i.e. directed graphical models).
• MLNs use full first order logic to define abstract templates for large, complex Markov networks (i.e. undirected graphical models).
• PSL uses logical rules to define templates for Markov nets with real-valued propositions to support efficient inference.
• McCallum’s FACTORIE uses an object-oriented programming language to define large, complex factor graphs.
• Goodman & Tenenbaum’s CHURCH uses a functional programming language to define large, complex generative models.
Markov Logic Networks [Richardson & Domingos, 2006]
• Set of weighted clauses in first-order predicate logic.
• Larger weight indicates stronger belief that the clause should hold.
• MLNs are templates for constructing Markov networks for a given set of constants.
Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)

Grounding with these constants yields a Markov network over the ground atoms:
Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)
Probability of a Possible World

$$P(X = x) = \frac{1}{Z}\exp\Big(\sum_i w_i\, n_i(x)\Big), \qquad Z = \sum_{x}\exp\Big(\sum_i w_i\, n_i(x)\Big)$$

where $w_i$ is the weight of formula $i$, $n_i(x)$ is the number of true groundings of formula $i$ in the possible world $x$, and $Z$ is the normalization constant.

A possible world becomes exponentially less likely as the total weight of all the grounded clauses it violates increases.
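To make the formula concrete, here is a minimal Python sketch (not from the talk) that computes the unnormalized weight exp(Σᵢ wᵢ nᵢ(x)) of one possible world under the two Friends & Smokers rules; the dictionary encoding of ground atoms is an illustrative assumption.

```python
import math

CONSTANTS = ["A", "B"]

def total_weight(world):
    """Sum of w_i * n_i(x): each rule's weight times its number of true groundings."""
    total = 0.0
    for x in CONSTANTS:
        # Rule 1 (w = 1.5): Smokes(x) => Cancer(x)
        if (not world[("Smokes", x)]) or world[("Cancer", x)]:
            total += 1.5
        # Rule 2 (w = 1.1): Friends(x,y) => (Smokes(x) <=> Smokes(y))
        for y in CONSTANTS:
            if (not world[("Friends", x, y)]) or (world[("Smokes", x)] == world[("Smokes", y)]):
                total += 1.1
    return total

# One possible world: everyone smokes, only Anna has cancer, all friendships hold.
world = {("Smokes", "A"): True, ("Smokes", "B"): True,
         ("Cancer", "A"): True, ("Cancer", "B"): False,
         ("Friends", "A", "A"): True, ("Friends", "A", "B"): True,
         ("Friends", "B", "A"): True, ("Friends", "B", "B"): True}
print(math.exp(total_weight(world)))  # unnormalized; divide by Z summed over all 2^8 worlds
```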
MLN Inference
• Infer the probability of a particular query given a set of evidence facts.
  P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob))
• Use standard algorithms for inference in graphical models, such as Gibbs sampling or belief propagation (see the sketch below).
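Continuing the sketch above, here is a toy Gibbs sampler for the query; it reuses total_weight and is an illustration of the idea, not Alchemy's implementation.

```python
import math, random

random.seed(0)
evidence = {("Friends", "A", "B"): True, ("Smokes", "B"): True}
free_atoms = [("Smokes", "A"), ("Cancer", "A"), ("Cancer", "B"),
              ("Friends", "A", "A"), ("Friends", "B", "A"), ("Friends", "B", "B")]
world = {**evidence, **{a: False for a in free_atoms}}  # evidence stays clamped

hits, n_samples = 0, 5000
for _ in range(n_samples):
    for atom in free_atoms:
        # Resample each non-evidence atom from its conditional distribution.
        world[atom] = True
        w_true = total_weight(world)
        world[atom] = False
        w_false = total_weight(world)
        world[atom] = random.random() < 1.0 / (1.0 + math.exp(w_false - w_true))
    hits += world[("Cancer", "A")]
print("P(Cancer(Anna) | evidence) ≈", hits / n_samples)
```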
MLN Learning
• Learning weights for an existing set of clauses
  – EM
  – Max-margin
  – On-line
• Learning logical clauses (a.k.a. structure learning)
  – Inductive Logic Programming methods
  – Top-down and bottom-up MLN clause learning
  – On-line MLN clause learning
Strengths of MLNs
• Fully subsumes first-order predicate logic.
  – Just give infinite weight to all clauses.
• Fully subsumes probabilistic graphical models.
  – Can represent any joint distribution over an arbitrary set of discrete random variables.
• Can utilize prior knowledge in both symbolic and probabilistic forms.
• Large existing base of open-source software (Alchemy).
Weaknesses of MLNs
• Inherits computational intractability of general methods for both logical and probabilistic inference and learning.
  – Inference in FOPC is semi-decidable.
  – Inference in general graphical models is PSPACE-complete.
• Just producing the “ground” Markov net can produce a combinatorial explosion.
  – Current “lifted” inference methods do not help reasoning with many kinds of nested quantifiers.
PSL: Probabilistic Soft Logic [Kimmig, Bach, Broecheler, Huang & Getoor, NIPS 2012]
● Probabilistic logic framework designed with efficient inference in mind.
● Input: set of weighted First Order Logic rules and a set of evidence, just as in BLP or MLN
● MPE inference is a linear-programming problem that can efficiently draw probabilistic conclusions.
PSL vs. MLN

PSL:
● Atoms have continuous truth values in the interval [0,1].
● Inference finds the truth values of all atoms that best satisfy the rules and evidence.
● MPE inference: Most Probable Explanation.
● Linear optimization problem.

MLN:
● Atoms have boolean truth values {0, 1}.
● Inference finds the probability of atoms given the rules and evidence.
● Calculates the conditional probability of a query atom given evidence.
● Combinatorial counting problem.
PSL Example
● First Order Logic weighted rules
● Evidence:
  I(friend(John,Alex)) = 1
  I(spouse(John,Mary)) = 1
  I(votesFor(Alex,Romney)) = 1
  I(votesFor(Mary,Obama)) = 1
● Inference:
  – I(votesFor(John,Obama)) = 1
  – I(votesFor(John,Romney)) = 0
PSL’s Interpretation of Logical Connectives
● Łukasiewicz relaxation of AND, OR, NOT:
  – I(ℓ1 ∧ ℓ2) = max{0, I(ℓ1) + I(ℓ2) − 1}
  – I(ℓ1 ∨ ℓ2) = min{1, I(ℓ1) + I(ℓ2)}
  – I(¬ℓ1) = 1 − I(ℓ1)
● Distance to satisfaction:
  – An implication ℓ1 → ℓ2 is satisfied iff I(ℓ1) ≤ I(ℓ2)
  – d = max{0, I(ℓ1) − I(ℓ2)}
● Example (reproduced by the sketch below):
  – I(ℓ1) = 0.3, I(ℓ2) = 0.9 ⇒ d = 0
  – I(ℓ1) = 0.9, I(ℓ2) = 0.3 ⇒ d = 0.6
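These connectives are straightforward to render in code; a minimal sketch (illustrative, not the PSL implementation) that reproduces the example values:

```python
def l_and(a, b):           # Łukasiewicz AND
    return max(0.0, a + b - 1.0)

def l_or(a, b):            # Łukasiewicz OR
    return min(1.0, a + b)

def l_not(a):              # Łukasiewicz NOT
    return 1.0 - a

def dist_to_satisfaction(body, head):  # for an implication body -> head
    return max(0.0, body - head)

print(dist_to_satisfaction(0.3, 0.9))  # 0.0: rule satisfied
print(dist_to_satisfaction(0.9, 0.3))  # 0.6: rule violated by 0.6
```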
PSL Probability Distribution

● PDF:

$$p(I) = \frac{1}{Z}\exp\Big(-\sum_{r \in R} w_r\, d_r(I)\Big)$$

where $I$ is a possible continuous truth assignment, the sum ranges over all rules $r \in R$, $w_r$ is the weight of formula $r$, $d_r(I)$ is the distance to satisfaction of rule $r$, and $Z$ is the normalization constant.
PSL Inference

● MPE inference (Most Probable Explanation):
  – Find the interpretation that maximizes the PDF
  – Equivalently, find the interpretation that minimizes the summation Σ_r w_r d_r(I)
  – Distance to satisfaction is a linear function
  – So MPE inference is a linear optimization problem (see the sketch below)
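As a concrete illustration, here is a sketch of MPE inference as a linear program using scipy, applied to the earlier voting example. The two rules and their weights are assumptions chosen to be consistent with that slide's evidence and conclusions, not the actual rules from the talk.

```python
# Assumed rules (hypothetical, for illustration):
#   w=2.0 : spouse(John,Mary) ∧ votesFor(Mary,Obama)  → votesFor(John,Obama)
#   w=1.0 : friend(John,Alex) ∧ votesFor(Alex,Romney) → ¬votesFor(John,Obama)
# With all evidence atoms at 1, each Łukasiewicz body evaluates to 1, so with
# v = I(votesFor(John,Obama)) the distances obey:
#   d1 ≥ 1 − v,   d2 ≥ 1 − (1 − v) = v,   d1, d2 ≥ 0,   0 ≤ v ≤ 1
from scipy.optimize import linprog

# Variables: [v, d1, d2]; minimize 2*d1 + 1*d2.
c = [0.0, 2.0, 1.0]
A_ub = [[-1.0, -1.0,  0.0],   # 1 - v <= d1   ->   -v - d1 <= -1
        [ 1.0,  0.0, -1.0]]   # v <= d2       ->    v - d2 <=  0
b_ub = [-1.0, 0.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1), (0, None), (0, None)])
print(res.x)  # [1, 0, 1]: the higher-weighted spouse rule wins, so John votes Obama
```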
Semantic Representations

• Formal Semantics
  – Uses first-order logic
  – Deep
  – Brittle
• Distributional Semantics
  – Statistical method
  – Robust
  – Shallow
System Architecture [Garrette et al., 2011, 2012; Beltagy et al., 2013]

• Combining both logical and distributional semantics:
  – Represent meaning using a probabilistic logic
    • Markov Logic Network (MLN)
    • Probabilistic Soft Logic (PSL)
  – Generate soft inference rules from distributional semantics
[Architecture diagram: Sent1 and Sent2 are parsed by BOXER into logical forms LF1 and LF2; a Distributional Rule Constructor over a vector space, together with a rule base, supplies soft inference rules; MLN/PSL inference produces the result.]
• BOXER [Bos et al., 2004]: maps sentences to logical form
• Distributional Rule constructor: generates relevant soft inference rules based on distributional similarity
• MLN/PSL: probabilistic inference
• Result: degree of entailment or semantic similarity score (depending on the task)
Markov Logic Networks [Richardson & Domingos, 2006]

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

• Two constants: Anna (A) and Bob (B)
• Ground atoms: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)
• P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob))
Recognizing Textual Entailment (RTE)
• Premise: “A man is cutting pickles”
  ∃x,y,z. man(x) ∧ cut(y) ∧ agent(y, x) ∧ pickles(z) ∧ patient(y, z)
• Hypothesis: “A guy is slicing cucumber”
  ∃x,y,z. guy(x) ∧ slice(y) ∧ agent(y, x) ∧ cucumber(z) ∧ patient(y, z)
• Inference: Pr(Hypothesis | Premise)
  – Degree of entailment
Distributional Lexical Rules
• For all pairs of words (a, b), where a is in S1 and b is in S2, add a soft rule relating the two (see the sketch after this list):
  – ∀x a(x) → b(x) | wt(a, b)
  – wt(a, b) = f(cos(a, b))
• Premise: “A man is cutting pickles”
• Hypothesis: “A guy is slicing cucumber”
  – ∀x man(x) → guy(x) | wt(man, guy)
  – ∀x cut(x) → slice(x) | wt(cut, slice)
  – ∀x pickle(x) → cucumber(x) | wt(pickle, cucumber)
  – ∀x man(x) → cucumber(x) | wt(man, cucumber)
  – ∀x pickle(x) → guy(x) | wt(pickle, guy)
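A minimal sketch of the rule weight wt(a, b) = f(cos(a, b)); the 300-dimensional random vectors stand in for real distributional vectors, and the particular squashing function f is an assumption:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rule_weight(vec_a, vec_b, scale=1.0):
    # f maps similarity to a non-negative rule weight; the clipping and
    # scale here are assumptions, not the paper's exact choice of f.
    return scale * max(0.0, cosine(vec_a, vec_b))

man, guy = np.random.rand(300), np.random.rand(300)  # stand-in vectors
print(rule_weight(man, guy))  # weight for: ∀x man(x) → guy(x)
```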
Distributional Phrase Rules
• Premise: “A boy is playing”
• Hypothesis: “A little kid is playing”
• Need rules for phrases:
  – ∀x boy(x) → little(x) ∧ kid(x) | wt(boy, "little kid")
• Compute vectors for phrases using vector addition [Mitchell & Lapata, 2010] (see the sketch below):
  – "little kid" = little + kid
Paraphrase Rules [by: Cuong Chau]
• Generate inference rules from pre-compiled paraphrase collections like Berant et al. [2012]
• e.g., “X solves Y” ⇒ “X finds a solution to Y” | w
Evaluation (RTE using MLNs)
• Datasets: RTE-1, RTE-2, RTE-3
• Each dataset has 800 training pairs and 800 testing pairs
• Use multiple parses to reduce the impact of misparses
Evaluation (RTE using MLNs) [by: Cuong Chau]

                       RTE-1   RTE-2   RTE-3
Bos & Markert [2005]    0.52     –       –
MLN                     0.57    0.58    0.55
MLN-multi-parse         0.56    0.58    0.57
MLN-paraphrases         0.60    0.60    0.60

Bos & Markert [2005] is a logic-only baseline whose knowledge base is WordNet.
Semantic Textual Similarity (STS)
• Rate the semantic similarity of two sentences on a 0 to 5 scale
• Gold standards are averaged over multiple human judgments
• Evaluate by measuring correlation to human ratings

S1                             S2                               score
A man is slicing a cucumber    A guy is cutting a cucumber        5
A man is slicing a cucumber    A guy is cutting a zucchini        4
A man is slicing a cucumber    A woman is cooking a zucchini      3
A man is slicing a cucumber    A monkey is riding a bicycle       1
Softening Conjunction for STS
• Premise: “A man is driving”
  ∃x,y. man(x) ∧ drive(y) ∧ agent(y, x)
• Hypothesis: “A man is driving a bus”
  ∃x,y,z. man(x) ∧ drive(y) ∧ agent(y, x) ∧ bus(z) ∧ patient(y, z)
• Break the hypothesis into “mini-clauses”, then combine their evidence using an “averaging combiner” [Natarajan et al., 2010], as in the sketch below
• The hypothesis becomes:
  – ∀x,y,z. man(x) ∧ agent(y, x) → result()
  – ∀x,y,z. drive(y) ∧ agent(y, x) → result()
  – ∀x,y,z. drive(y) ∧ patient(y, z) → result()
  – ∀x,y,z. bus(z) ∧ patient(y, z) → result()
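A minimal sketch of the averaging-combiner idea: each mini-clause contributes a degree of evidence for result(), and the degrees are averaged rather than conjoined. The truth values below are illustrative assumptions.

```python
# Illustrative sketch of the "averaging combiner" [Natarajan et al., 2010]:
# average the evidence of the mini-clauses instead of taking their conjunction.
def average_combiner(mini_clause_values):
    return sum(mini_clause_values) / len(mini_clause_values)

# "A man is driving" supports the two agent mini-clauses of the hypothesis
# but neither patient mini-clause (assumed degrees, for illustration).
print(average_combiner([1.0, 1.0, 0.0, 0.0]))  # 0.5: partial entailment, not 0
```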
Evaluation (STS using MLN)
• Microsoft video description corpus (SemEval 2012)
– Short video descriptions
System                                                  Pearson r
Our System with no distributional rules [logic only]      0.52
Our System with lexical rules                             0.60
Our System with lexical and phrase rules                  0.63
PSL: Probabilistic Soft Logic [Kimmig, Bach, Broecheler, Huang & Getoor, NIPS 2012]
● MLN's inference is very slow
● PSL is a probabilistic logic framework designed with efficient inference in mind
● Inference is a linear program
STS using PSL – Conjunction

● The Łukasiewicz relaxation of AND is very restrictive:
  – I(ℓ1 ∧ ℓ2) = max{0, I(ℓ1) + I(ℓ2) − 1}
● Replace AND with a weighted average (see the sketch below):
  – I(ℓ1 ∧ … ∧ ℓn) = w_avg(I(ℓ1), …, I(ℓn))
  – Learning the weights is future work; for now, they are equal
● Inference:
  – The weighted average is a linear function
  – No changes to the optimization problem
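A small sketch contrasting Łukasiewicz AND with the equal-weight weighted average; the truth values are illustrative:

```python
# Łukasiewicz AND vs. weighted average for an n-ary conjunction.
def lukasiewicz_and(vals):
    return max(0.0, sum(vals) - (len(vals) - 1))

def w_avg(vals, weights=None):
    weights = weights or [1.0] * len(vals)  # equal weights, as in the talk
    return sum(w * v for w, v in zip(weights, vals)) / sum(weights)

vals = [0.9, 0.9, 0.9, 0.9]
print(lukasiewicz_and(vals))  # 0.6: harsh as conjunctions grow longer
print(w_avg(vals))            # 0.9: degrades gracefully
```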
Evaluation (STS using PSL)
                           msr-vid   msr-par   SICK
vec-add (dist. only)        0.78      0.24     0.65
vec-mul (dist. only)        0.76      0.12     0.62
MLN (logic + dist.)         0.63      0.16     0.47
PSL-no-DIR (logic only)     0.74      0.46     0.68
PSL (logic + dist.)         0.79      0.53     0.70
PSL+vec-add (ensemble)      0.83      0.49     0.71

msr-vid: Microsoft video description corpus (SemEval 2012); short video description sentences
msr-par: Microsoft paraphrase corpus (SemEval 2012); long news sentences
SICK: SemEval 2014
Evaluation (STS using PSL)
                        msr-vid    msr-par     SICK
PSL time/pair           8 s        30 s        10 s
MLN time/pair           1 m 31 s   11 m 49 s   4 m 24 s
MLN timeouts (10 min)   9%         97%         36%