Montague meets Markov: Combining Logical and Distributional Semantics


Page 1: Montague meets Markov: Combining Logical and Distributional Semantics


Montague meets Markov:Combining Logical and Distributional Semantics

Raymond J. Mooney, Katrin Erk, Islam Beltagy
University of Texas at Austin

Page 2: Montague meets Markov: Combining Logical and Distributional Semantics

Logical AI Paradigm

• Represents knowledge and data in a binary symbolic logic such as FOPC.

+ Rich representation that handles arbitrary sets of objects, with properties, relations, quantifiers, etc.

− Unable to handle uncertain knowledge and probabilistic reasoning.

Page 3: Montague meets Markov: Combining Logical and Distributional Semantics

Probabilistic AI Paradigm

• Represents knowledge and data as a fixed set of random variables with a joint probability distribution.

+ Handles uncertain knowledge and probabilistic reasoning.

− Unable to handle arbitrary sets of objects, with properties, relations, quantifiers, etc.

Page 4: Montague meets Markov: Combining Logical and Distributional Semantics

Statistical Relational Learning (SRL)

• SRL methods attempt to integrate methods from predicate logic (or relational databases) and probabilistic graphical models to handle structured, multi-relational data.

Page 5: Montague meets Markov: Combining Logical and Distributional Semantics

SRL Approaches (A Taste of the “Alphabet Soup”)

• Stochastic Logic Programs (SLPs) (Muggleton, 1996)
• Probabilistic Relational Models (PRMs) (Koller, 1999)
• Bayesian Logic Programs (BLPs) (Kersting & De Raedt, 2001)
• Markov Logic Networks (MLNs) (Richardson & Domingos, 2006)
• Probabilistic Soft Logic (PSL) (Kimmig et al., 2012)

Page 6: Montague meets Markov: Combining Logical and Distributional Semantics

SRL Methods Based on Probabilistic Graphical Models

• BLPs use definite-clause logic (Prolog programs) to define abstract templates for large, complex Bayesian networks (i.e. directed graphical models).

• MLNs use full first order logic to define abstract templates for large, complex Markov networks (i.e. undirected graphical models).

• PSL uses logical rules to define templates for Markov nets with real-valued propositions to support efficient inference.

• McCallum’s FACTORIE uses an object-oriented programming language to define large, complex factor graphs.

• Goodman & Tenenbaum’s CHURCH uses a functional programming language to define large, complex generative models.

Page 7: Montague meets Markov: Combining Logical and Distributional Semantics


Markov Logic Networks [Richardson & Domingos, 2006]

Set of weighted clauses in first-order predicate logic.

Larger weight indicates stronger belief that the clause should hold.

MLNs are templates for constructing Markov networks for a given set of constants.

MLN Example: Friends & Smokers

1.5   ∀x Smokes(x) ⇒ Cancer(x)
1.1   ∀x,y Friends(x,y) ⇒ ( Smokes(x) ⇔ Smokes(y) )

Page 8: Montague meets Markov: Combining Logical and Distributional Semantics

Example: Friends & Smokers

1.5   ∀x Smokes(x) ⇒ Cancer(x)
1.1   ∀x,y Friends(x,y) ⇒ ( Smokes(x) ⇔ Smokes(y) )

Two constants: Anna (A) and Bob (B)

Page 9: Montague meets Markov: Combining Logical and Distributional Semantics

Example: Friends & Smokers

1.5   ∀x Smokes(x) ⇒ Cancer(x)
1.1   ∀x,y Friends(x,y) ⇒ ( Smokes(x) ⇔ Smokes(y) )

Two constants: Anna (A) and Bob (B)

[Figure: ground Markov network over the atoms Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]


Page 12: Montague meets Markov: Combining Logical and Distributional Semantics

Probability of a possible world

P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) )

  w_i : weight of formula i
  n_i(x) : number of true groundings of formula i in possible world x
  Z = Σ_x exp( Σ_i w_i n_i(x) ) : normalization constant, summing over all possible worlds x

A possible world becomes exponentially less likely as the total weight of all the grounded clauses it violates increases.
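To make the formula concrete, here is a minimal Python sketch (added for illustration; it is not the authors' code, and real MLN systems use approximate inference rather than enumeration). It brute-forces the two-constant Friends & Smokers model: it counts true groundings per world, weights them, and answers a conditional query directly from the definition.

```python
# Minimal sketch: brute-force MLN distribution for Friends & Smokers
# with constants A and B and the weighted clauses from the slide.
from itertools import product
from math import exp

CONSTANTS = ["A", "B"]
ATOMS = ([f"Smokes({c})" for c in CONSTANTS] +
         [f"Cancer({c})" for c in CONSTANTS] +
         [f"Friends({x},{y})" for x in CONSTANTS for y in CONSTANTS])

W1, W2 = 1.5, 1.1                     # clause weights from the slide

def implies(p, q):                    # material implication over {0, 1}
    return (not p) or q

def n_counts(world):
    """Number of true groundings of each weighted formula in a world."""
    n1 = sum(implies(world[f"Smokes({x})"], world[f"Cancer({x})"])
             for x in CONSTANTS)
    n2 = sum(implies(world[f"Friends({x},{y})"],
                     world[f"Smokes({x})"] == world[f"Smokes({y})"])
             for x in CONSTANTS for y in CONSTANTS)
    return n1, n2

def unnormalized(world):
    n1, n2 = n_counts(world)
    return exp(W1 * n1 + W2 * n2)

worlds = [dict(zip(ATOMS, bits)) for bits in product([0, 1], repeat=len(ATOMS))]

def prob(query, evidence):
    """P(query = 1 | evidence) by summing over evidence-consistent worlds."""
    ok = [w for w in worlds if all(w[a] == v for a, v in evidence.items())]
    num = sum(unnormalized(w) for w in ok if w[query])
    return num / sum(unnormalized(w) for w in ok)

print(prob("Cancer(A)", {"Friends(A,B)": 1, "Smokes(B)": 1}))
```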

Page 13: Montague meets Markov: Combining Logical and Distributional Semantics

MLN Inference

• Infer the probability of a particular query given a set of evidence facts.
  P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob))
• Use standard algorithms for inference in graphical models, such as Gibbs sampling or belief propagation.

Page 14: Montague meets Markov: Combining Logical and Distributional Semantics

MLN Learning

• Learning weights for an existing set of clauses
  – EM
  – Max-margin
  – On-line
• Learning logical clauses (a.k.a. structure learning)
  – Inductive Logic Programming methods
  – Top-down and bottom-up MLN clause learning
  – On-line MLN clause learning

Page 15: Montague meets Markov: Combining Logical and Distributional Semantics

Strengths of MLNs

• Fully subsumes first-order predicate logic
  – Just give infinite weight to all clauses
• Fully subsumes probabilistic graphical models
  – Can represent any joint distribution over an arbitrary set of discrete random variables
• Can utilize prior knowledge in both symbolic and probabilistic forms
• Large existing base of open-source software (Alchemy)

Page 16: Montague meets Markov: Combining Logical and Distributional Semantics

Weaknesses of MLNs

• Inherits the computational intractability of general methods for both logical and probabilistic inference and learning.
  – Inference in FOPC is semi-decidable
  – Inference in general graphical models is PSPACE-complete
• Just producing the “ground” Markov net can produce a combinatorial explosion.
  – Current “lifted” inference methods do not help reasoning with many kinds of nested quantifiers.

Page 17: Montague meets Markov: Combining Logical and Distributional Semantics

PSL: Probabilistic Soft Logic [Kimmig, Bach, Broecheler, Huang & Getoor, NIPS 2012]

● Probabilistic logic framework designed with efficient inference in mind.

● Input: a set of weighted First Order Logic rules and a set of evidence, just as in BLP or MLN.

● MPE inference is a linear-programming problem that can efficiently draw probabilistic conclusions.


Page 18: Montague meets Markov: Combining Logical and Distributional Semantics

PSL vs. MLN

PSL:
● Atoms have continuous truth values in the interval [0, 1].
● Inference finds the truth values of all atoms that best satisfy the rules and evidence.
● MPE inference: Most Probable Explanation.
● Linear optimization problem.

MLN:
● Atoms have boolean truth values {0, 1}.
● Inference finds the probability of atoms given the rules and evidence.
● Calculates the conditional probability of a query atom given evidence.
● Combinatorial counting problem.

Page 19: Montague meets Markov: Combining Logical and Distributional Semantics

PSL Example

● First Order Logic weighted rules

● Evidence
  I(friend(John, Alex)) = 1
  I(spouse(John, Mary)) = 1
  I(votesFor(Alex, Romney)) = 1
  I(votesFor(Mary, Obama)) = 1
● Inference
  – I(votesFor(John, Obama)) = 1
  – I(votesFor(John, Romney)) = 0

Page 20: Montague meets Markov: Combining Logical and Distributional Semantics

PSL’s Interpretation of Logical Connectives

● Łukasiewicz relaxation of AND, OR, NOT
  – I(ℓ1 ∧ ℓ2) = max {0, I(ℓ1) + I(ℓ2) – 1}
  – I(ℓ1 ∨ ℓ2) = min {1, I(ℓ1) + I(ℓ2)}
  – I(¬ℓ1) = 1 – I(ℓ1)

● Distance to satisfaction
  – Implication ℓ1 → ℓ2 is satisfied iff I(ℓ1) ≤ I(ℓ2)
  – d = max {0, I(ℓ1) – I(ℓ2)}

● Example (see the sketch below)
  – I(ℓ1) = 0.3, I(ℓ2) = 0.9 ⇒ d = 0
  – I(ℓ1) = 0.9, I(ℓ2) = 0.3 ⇒ d = 0.6
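A minimal Python sketch of these relaxations (added for illustration, not from the slides), reproducing the two distance-to-satisfaction examples above:

```python
# Lukasiewicz relaxations of the connectives and the distance to
# satisfaction of a ground implication rule l1 -> l2.
def l_and(a, b):
    return max(0.0, a + b - 1.0)

def l_or(a, b):
    return min(1.0, a + b)

def l_not(a):
    return 1.0 - a

def distance_to_satisfaction(body, head):
    """d = max(0, I(body) - I(head)); 0 when the rule is satisfied."""
    return max(0.0, body - head)

# Examples from the slide
print(distance_to_satisfaction(0.3, 0.9))  # 0.0  (satisfied)
print(distance_to_satisfaction(0.9, 0.3))  # 0.6
```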

Page 21: Montague meets Markov: Combining Logical and Distributional Semantics

PSL Probability Distribution

● PDF over a possible continuous truth assignment I:

  f(I) = (1/Z) exp( − Σ_r w_r d_r(I) )

  where the sum ranges over all ground rules r, w_r is the weight of formula r, d_r(I) is its distance to satisfaction under I, and Z is the normalization constant. (A small numeric sketch follows below.)
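As a small numeric sketch of the reconstructed density above (an added illustration; the normalizer Z is dropped because it is the same for every interpretation), interpretations with smaller weighted distances to satisfaction receive higher scores:

```python
# Unnormalized PSL density for a fixed set of ground rules, each given as
# (weight, distance_to_satisfaction) under some interpretation.
from math import exp

def unnormalized_density(weighted_distances):
    return exp(-sum(w * d for w, d in weighted_distances))

print(unnormalized_density([(2.0, 0.0), (1.0, 0.1)]))   # mild violation: higher score
print(unnormalized_density([(2.0, 0.6), (1.0, 0.4)]))   # larger violation: lower score
```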

Page 22: Montague meets Markov: Combining Logical and Distributional Semantics

PSL Inference

● MPE inference (Most Probable Explanation)
  – Find the interpretation that maximizes the PDF
  – Equivalently, find the interpretation that minimizes the weighted sum of distances to satisfaction
  – Distance to satisfaction is a linear function
  – Linear optimization problem (see the sketch below)
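The sketch below illustrates the reduction to a linear program using SciPy. The two ground rules, their weights, and the predicate risk(A) are invented for this example and are not from the slides; the LP variables are the free truth values plus one distance variable per rule, and the objective is the weighted sum of distances.

```python
# Hedged sketch of PSL MPE inference as a linear program with two
# hypothetical ground rules:
#   2.0 : smokes(A) -> cancer(A)      with evidence I(smokes(A)) = 1.0
#   1.0 : cancer(A) -> risk(A)
# Variables: x = [I(cancer(A)), I(risk(A)), d1, d2]; minimize 2*d1 + 1*d2.
from scipy.optimize import linprog

c = [0.0, 0.0, 2.0, 1.0]                 # objective: weighted distances only

# d1 >= 1.0 - I(cancer(A))        ->  -I(cancer(A)) - d1 <= -1.0
# d2 >= I(cancer(A)) - I(risk(A)) ->   I(cancer(A)) - I(risk(A)) - d2 <= 0.0
A_ub = [[-1.0,  0.0, -1.0,  0.0],
        [ 1.0, -1.0,  0.0, -1.0]]
b_ub = [-1.0, 0.0]

bounds = [(0, 1), (0, 1), (0, None), (0, None)]   # truth values in [0,1], d >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x[:2])   # MPE truth values for cancer(A) and risk(A)
print(res.fun)     # total weighted distance to satisfaction (0 here)
```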

Page 23: Montague meets Markov: Combining Logical and Distributional Semantics

Semantic Representations

• Formal Semantics
  – Uses first-order logic
  – Deep
  – Brittle

• Distributional Semantics
  – Statistical method
  – Robust
  – Shallow

• Combining both logical and distributional semantics
  – Represent meaning using a probabilistic logic
    • Markov Logic Networks (MLN)
    • Probabilistic Soft Logic (PSL)
  – Generate soft inference rules from distributional semantics

Page 24: Montague meets Markov: Combining Logical and Distributional Semantics

System Architecture [Garrette et al. 2011, 2012; Beltagy et al., 2013]

[Figure: pipeline. Sent1 and Sent2 are mapped by BOXER to logical forms LF1 and LF2; a Rule Base and the Distributional Rule Constructor (over a Vector Space) supply inference rules; MLN/PSL Inference combines these and produces the result.]

• BOXER [Bos et al., 2004]: maps sentences to logical form

• Distributional Rule constructor: generates relevant soft inference rules based on distributional similarity

• MLN/PSL: probabilistic inference

• Result: degree of entailment or semantic similarity score (depending on the task)

Page 25: Montague meets Markov: Combining Logical and Distributional Semantics

Markov Logic Networks [Richardson & Domingos, 2006]

• Two constants: Anna (A) and Bob (B)

1.5   ∀x Smokes(x) ⇒ Cancer(x)
1.1   ∀x,y Friends(x,y) ⇒ ( Smokes(x) ⇔ Smokes(y) )

• P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob))

[Figure: the corresponding ground Markov network over Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]

Page 26: Montague meets Markov: Combining Logical and Distributional Semantics

Recognizing Textual Entailment (RTE)

• Premise: “A man is cutting pickles”

  ∃x,y,z. man(x) ∧ cut(y) ∧ agent(y, x) ∧ pickles(z) ∧ patient(y, z)

• Hypothesis: “A guy is slicing cucumber”

  ∃x,y,z. guy(x) ∧ slice(y) ∧ agent(y, x) ∧ cucumber(z) ∧ patient(y, z)

• Inference: Pr(Hypothesis | Premise)
  – Degree of entailment

Page 27: Montague meets Markov: Combining Logical and Distributional Semantics


Distributional Lexical Rules

• For all pairs of words (a, b), where a is in S1 and b is in S2, add a soft rule relating the two (see the sketch below):
  – ∀x a(x) → b(x) | wt(a, b)
  – wt(a, b) = f( cos(a, b) )

• Premise: “A man is cutting pickles”
• Hypothesis: “A guy is slicing cucumber”
  – ∀x man(x) → guy(x) | wt(man, guy)
  – ∀x cut(x) → slice(x) | wt(cut, slice)
  – ∀x pickle(x) → cucumber(x) | wt(pickle, cucumber)
  – ∀x man(x) → cucumber(x) | wt(man, cucumber)
  – ∀x pickle(x) → guy(x) | wt(pickle, guy)
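A hedged sketch of how such lexical rules and weights might be generated (added for illustration; the toy vectors and the mapping f, here just a clamp at zero, are assumptions rather than the authors' actual choices):

```python
# Generating soft lexical rules with weights derived from cosine similarity
# of distributional word vectors.
import numpy as np

vectors = {                      # stand-in distributional vectors
    "man": np.array([0.8, 0.1, 0.3]),
    "guy": np.array([0.7, 0.2, 0.3]),
    "cut": np.array([0.1, 0.9, 0.2]),
    "slice": np.array([0.2, 0.8, 0.3]),
}

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def wt(a, b, f=lambda s: max(s, 0.0)):
    """Rule weight wt(a, b) = f(cos(a, b)); f is a placeholder mapping."""
    return f(cos(vectors[a], vectors[b]))

premise_words = ["man", "cut"]
hypothesis_words = ["guy", "slice"]
for a in premise_words:
    for b in hypothesis_words:
        print(f"forall x. {a}(x) -> {b}(x)  |  weight = {wt(a, b):.2f}")
```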

Page 28: Montague meets Markov: Combining Logical and Distributional Semantics

Distributional Phrase Rules

• Premise: “A boy is playing”
• Hypothesis: “A little kid is playing”
• Need rules for phrases
  – ∀x boy(x) → little(x) ∧ kid(x) | wt(boy, "little kid")
• Compute vectors for phrases using vector addition [Mitchell & Lapata, 2010] (see the sketch below)
  – "little kid" = little + kid
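A corresponding sketch for phrase rules (again with invented toy vectors): the phrase vector is the sum of its word vectors, and the rule weight comes from cosine similarity with the word it rewrites.

```python
# Additive phrase composition [Mitchell & Lapata, 2010] used to weight a
# word-to-phrase rule.
import numpy as np

vectors = {
    "boy": np.array([0.9, 0.2, 0.1]),
    "little": np.array([0.3, 0.7, 0.1]),
    "kid": np.array([0.8, 0.3, 0.2]),
}

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

little_kid = vectors["little"] + vectors["kid"]      # phrase vector by addition
weight = cos(vectors["boy"], little_kid)             # wt(boy, "little kid")
print(f"forall x. boy(x) -> little(x) & kid(x)  |  weight = {weight:.2f}")
```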

Page 29: Montague meets Markov: Combining Logical and Distributional Semantics

Paraphrase Rules [by: Cuong Chau]

• Generate inference rules from pre-compiled paraphrase collections like Berant et al. [2012]

• e.g., “X solves Y” ⇒ “X finds a solution to Y” | w


Page 30: Montague meets Markov: Combining Logical and Distributional Semantics

Evaluation (RTE using MLNs)

• Datasets: RTE-1, RTE-2, RTE-3
  – Each dataset has 800 training pairs and 800 testing pairs
• Use multiple parses to reduce the impact of misparses

Page 31: Montague meets Markov: Combining Logical and Distributional Semantics

Evaluation (RTE using MLNs) [by: Cuong Chau]

System                  RTE-1   RTE-2   RTE-3
Bos & Markert [2005]    0.52    –       –
MLN                     0.57    0.58    0.55
MLN-multi-parse         0.56    0.58    0.57
MLN-paraphrases         0.60    0.60    0.60

Bos & Markert [2005] is a logic-only baseline whose KB is WordNet.

Page 32: Montague meets Markov: Combining Logical and Distributional Semantics

Semantic Textual Similarity (STS)

• Rate the semantic similarity of two sentences on a 0 to 5 scale

• Gold standards are averaged over multiple human judgments

• Evaluate by measuring correlation to human ratings

S1                            S2                            Score
A man is slicing a cucumber   A guy is cutting a cucumber   5
A man is slicing a cucumber   A guy is cutting a zucchini   4
A man is slicing a cucumber   A woman is cooking a zucchini 3
A man is slicing a cucumber   A monkey is riding a bicycle  1

Page 33: Montague meets Markov: Combining Logical and Distributional Semantics

Softening Conjunction for STS

• Premise: “A man is driving”
  ∃x,y. man(x) ∧ drive(y) ∧ agent(y, x)
• Hypothesis: “A man is driving a bus”
  ∃x,y,z. man(x) ∧ drive(y) ∧ agent(y, x) ∧ bus(z) ∧ patient(y, z)
• Break the hypothesis into “mini-clauses”, then combine their evidence using an “averaging combiner” [Natarajan et al., 2010]
• The hypothesis becomes (see the sketch below):
  – ∀x,y,z. man(x) ∧ agent(y, x) → result()
  – ∀x,y,z. drive(y) ∧ agent(y, x) → result()
  – ∀x,y,z. drive(y) ∧ patient(y, z) → result()
  – ∀x,y,z. bus(z) ∧ patient(y, z) → result()
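A hedged sketch of the mini-clause construction (an added illustration, not the authors' code; in the actual system the logical form comes from BOXER): each content predicate is paired with every role atom that shares its variable, and each pair becomes a rule whose head is the result() predicate.

```python
# Splitting a conjunctive hypothesis into "mini-clause" rules, each pairing a
# content predicate with a role atom that links it to the rest of the sentence.
def mini_clauses(content_atoms, role_atoms):
    """content_atoms: e.g. ['man(x)', 'drive(y)', 'bus(z)']
       role_atoms:    e.g. ['agent(y, x)', 'patient(y, z)']"""
    rules = []
    for pred in content_atoms:
        var = pred[pred.index("(") + 1:-1]      # variable of the predicate
        for role in role_atoms:
            if var in role:                      # role atom mentions that variable
                rules.append(f"forall x,y,z. {pred} & {role} -> result()")
    return rules

hypothesis_content = ["man(x)", "drive(y)", "bus(z)"]
hypothesis_roles = ["agent(y, x)", "patient(y, z)"]
for r in mini_clauses(hypothesis_content, hypothesis_roles):
    print(r)          # reproduces the four mini-clause rules on the slide
```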

Page 34: Montague meets Markov: Combining Logical and Distributional Semantics

Evaluation (STS using MLN)


• Microsoft video description corpus (SemEval 2012)

– Short video descriptions

System                                                Pearson r
Our system with no distributional rules [logic only]  0.52
Our system with lexical rules                         0.60
Our system with lexical and phrase rules              0.63

Page 35: Montague meets Markov: Combining Logical and Distributional Semantics

PSL: Probabilistic Soft Logic [Kimmig, Bach, Broecheler, Huang & Getoor, NIPS 2012]

● MLN's inference is very slow

● PSL is a probabilistic logic framework designed with efficient inference in mind

● Inference is a linear program


Page 36: Montague meets Markov: Combining Logical and Distributional Semantics

STS using PSL – Conjunction

● Łukasiewicz relaxation of AND is very restrictive
  – I(ℓ1 ∧ ℓ2) = max {0, I(ℓ1) + I(ℓ2) – 1}
● Replace AND with a weighted average
  – I(ℓ1 ∧ … ∧ ℓn) = w_avg( I(ℓ1), …, I(ℓn) )
  – Learning the weights is future work; for now, they are equal
● Inference
  – The weighted average is a linear function
  – No changes to the optimization problem (see the sketch below)
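A small sketch contrasting the two conjunctions on made-up truth values: with equal weights, the average keeps a graded score where the Łukasiewicz conjunction collapses toward zero.

```python
# Lukasiewicz AND vs. the averaging relaxation used for STS.
def lukasiewicz_and(values):
    total = sum(values)
    return max(0.0, total - (len(values) - 1))

def w_avg_and(values, weights=None):
    weights = weights or [1.0] * len(values)     # equal weights for now
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

truths = [0.7, 0.6, 0.8]                         # e.g. mini-clause truth values
print(lukasiewicz_and(truths))                   # ~0.1: nearly all credit lost
print(w_avg_and(truths))                         # ~0.7: graded similarity kept
```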

Page 37: Montague meets Markov: Combining Logical and Distributional Semantics

Evaluation (STS using PSL)

System                    msr-vid   msr-par   SICK
vec-add (dist. only)      0.78      0.24      0.65
vec-mul (dist. only)      0.76      0.12      0.62
MLN (logic + dist.)       0.63      0.16      0.47
PSL-no-DIR (logic only)   0.74      0.46      0.68
PSL (logic + dist.)       0.79      0.53      0.70
PSL+vec-add (ensemble)    0.83      0.49      0.71

msr-vid: Microsoft video description corpus (SemEval 2012); short video description sentences
msr-par: Microsoft paraphrase corpus (SemEval 2012); long news sentences
SICK: SemEval 2014

Page 38: Montague meets Markov: Combining Logical and Distributional Semantics

Evaluation (STS using PSL)

                        msr-vid    msr-par     SICK
PSL time/pair           8 s        30 s        10 s
MLN time/pair           1 m 31 s   11 m 49 s   4 m 24 s
MLN timeouts (10 min)   9%         97%         36%