Montague meets Markov: Combining Logical and Distributional Semantics
Raymond J. Mooney, Katrin Erk, Islam Beltagy
University of Texas at Austin
Logical AI Paradigm
• Represents knowledge and data in a binary symbolic logic such as FOPC.
  + Rich representation that handles arbitrary sets of objects, with properties, relations, quantifiers, etc.
  − Unable to handle uncertain knowledge and probabilistic reasoning.
Probabilistic AI Paradigm
• Represents knowledge and data as a fixed set of random variables with a joint probability distribution.
  + Handles uncertain knowledge and probabilistic reasoning.
  − Unable to handle arbitrary sets of objects, with properties, relations, quantifiers, etc.
Statistical Relational Learning (SRL)
• SRL methods attempt to integrate methods from predicate logic (or relational databases) and probabilistic graphical models to handle structured, multi-relational data.
SRL Approaches (A Taste of the “Alphabet Soup”)
• Stochastic Logic Programs (SLPs) (Muggleton, 1996)
• Probabilistic Relational Models (PRMs) (Koller, 1999)
• Bayesian Logic Programs (BLPs) (Kersting & De Raedt, 2001)
• Markov Logic Networks (MLNs) (Richardson & Domingos, 2006)
• Probabilistic Soft Logic (PSL) (Kimmig et al., 2012)
SRL Methods Based on Probabilistic Graphical Models
• BLPs use definite-clause logic (Prolog programs) to define abstract templates for large, complex Bayesian networks (i.e. directed graphical models).
• MLNs use full first order logic to define abstract templates for large, complex Markov networks (i.e. undirected graphical models).
• PSL uses logical rules to define templates for Markov nets with real-valued propositions to support efficient inference.
• McCallum’s FACTORIE uses an object-oriented programming language to define large, complex factor graphs.
• Goodman & Tenenbaum’s CHURCH uses a functional programming language to define large, complex generative models.
Markov Logic Networks [Richardson & Domingos, 2006]
• Set of weighted clauses in first-order predicate logic.
• Larger weight indicates stronger belief that the clause should hold.
• MLNs are templates for constructing Markov networks for a given set of constants.
Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)

Grounding with these constants yields a Markov network over the ground atoms:
Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)
Probability of a Possible World

$$P(X = x) = \frac{1}{Z}\exp\Big(\sum_i w_i\, n_i(x)\Big), \qquad Z = \sum_{x}\exp\Big(\sum_i w_i\, n_i(x)\Big)$$

where $w_i$ is the weight of formula $i$, $n_i(x)$ is the number of true groundings of formula $i$ in the possible world $x$, and $Z$ is the normalization constant.

A possible world becomes exponentially less likely as the total weight of all the grounded clauses it violates increases.
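To make the formula concrete, here is a minimal Python sketch (not from the talk) that computes the unnormalized weight exp(Σᵢ wᵢ nᵢ(x)) of one possible world under the two Friends & Smokers rules; the dictionary encoding of ground atoms is an illustrative assumption.

```python
import math

CONSTANTS = ["A", "B"]

def total_weight(world):
    """Sum of w_i * n_i(x): each rule's weight times its number of true groundings."""
    total = 0.0
    for x in CONSTANTS:
        # Rule 1 (w = 1.5): Smokes(x) => Cancer(x)
        if (not world[("Smokes", x)]) or world[("Cancer", x)]:
            total += 1.5
        # Rule 2 (w = 1.1): Friends(x,y) => (Smokes(x) <=> Smokes(y))
        for y in CONSTANTS:
            if (not world[("Friends", x, y)]) or (world[("Smokes", x)] == world[("Smokes", y)]):
                total += 1.1
    return total

# One possible world: everyone smokes, only Anna has cancer, all friendships hold.
world = {("Smokes", "A"): True, ("Smokes", "B"): True,
         ("Cancer", "A"): True, ("Cancer", "B"): False,
         ("Friends", "A", "A"): True, ("Friends", "A", "B"): True,
         ("Friends", "B", "A"): True, ("Friends", "B", "B"): True}
print(math.exp(total_weight(world)))  # unnormalized; divide by Z summed over all 2^8 worlds
```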
MLN Inference
• Infer the probability of a particular query given a set of evidence facts.
  P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob))
• Use standard algorithms for inference in graphical models, such as Gibbs sampling or belief propagation (see the sketch below).
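Continuing the sketch above, here is a toy Gibbs sampler for the query; it reuses total_weight and is an illustration of the idea, not Alchemy's implementation.

```python
import math, random

random.seed(0)
evidence = {("Friends", "A", "B"): True, ("Smokes", "B"): True}
free_atoms = [("Smokes", "A"), ("Cancer", "A"), ("Cancer", "B"),
              ("Friends", "A", "A"), ("Friends", "B", "A"), ("Friends", "B", "B")]
world = {**evidence, **{a: False for a in free_atoms}}  # evidence stays clamped

hits, n_samples = 0, 5000
for _ in range(n_samples):
    for atom in free_atoms:
        # Resample each non-evidence atom from its conditional distribution.
        world[atom] = True
        w_true = total_weight(world)
        world[atom] = False
        w_false = total_weight(world)
        world[atom] = random.random() < 1.0 / (1.0 + math.exp(w_false - w_true))
    hits += world[("Cancer", "A")]
print("P(Cancer(Anna) | evidence) ≈", hits / n_samples)
```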
MLN Learning
• Learning weights for an existing set of clauses
  – EM
  – Max-margin
  – On-line
• Learning logical clauses (a.k.a. structure learning)
  – Inductive Logic Programming methods
  – Top-down and bottom-up MLN clause learning
  – On-line MLN clause learning
Strengths of MLNs
• Fully subsumes first-order predicate logic.
  – Just give infinite weight to all clauses.
• Fully subsumes probabilistic graphical models.
  – Can represent any joint distribution over an arbitrary set of discrete random variables.
• Can utilize prior knowledge in both symbolic and probabilistic forms.
• Large existing base of open-source software (Alchemy).
Weaknesses of MLNs
• Inherits computational intractability of general methods for both logical and probabilistic inference and learning.
  – Inference in FOPC is semi-decidable.
  – Inference in general graphical models is PSPACE-complete.
• Just producing the “ground” Markov net can produce a combinatorial explosion.
  – Current “lifted” inference methods do not help reasoning with many kinds of nested quantifiers.
PSL: Probabilistic Soft Logic [Kimmig, Bach, Broecheler, Huang & Getoor, NIPS 2012]
● Probabilistic logic framework designed with efficient inference in mind.
● Input: set of weighted First Order Logic rules and a set of evidence, just as in BLP or MLN
● MPE inference is a linear-programming problem that can efficiently draw probabilistic conclusions.
PSL vs. MLN

PSL:
● Atoms have continuous truth values in the interval [0,1].
● Inference finds the truth values of all atoms that best satisfy the rules and evidence.
● MPE inference: Most Probable Explanation.
● Linear optimization problem.

MLN:
● Atoms have boolean truth values {0, 1}.
● Inference finds the probability of atoms given the rules and evidence.
● Calculates the conditional probability of a query atom given evidence.
● Combinatorial counting problem.
PSL Example
● First Order Logic weighted rules
● Evidence:
  I(friend(John,Alex)) = 1
  I(spouse(John,Mary)) = 1
  I(votesFor(Alex,Romney)) = 1
  I(votesFor(Mary,Obama)) = 1
● Inference:
  – I(votesFor(John,Obama)) = 1
  – I(votesFor(John,Romney)) = 0
PSL’s Interpretation of Logical Connectives
● Łukasiewicz relaxation of AND, OR, NOT:
  – I(ℓ1 ∧ ℓ2) = max{0, I(ℓ1) + I(ℓ2) − 1}
  – I(ℓ1 ∨ ℓ2) = min{1, I(ℓ1) + I(ℓ2)}
  – I(¬ℓ1) = 1 − I(ℓ1)
● Distance to satisfaction:
  – An implication ℓ1 → ℓ2 is satisfied iff I(ℓ1) ≤ I(ℓ2)
  – d = max{0, I(ℓ1) − I(ℓ2)}
● Example (reproduced by the sketch below):
  – I(ℓ1) = 0.3, I(ℓ2) = 0.9 ⇒ d = 0
  – I(ℓ1) = 0.9, I(ℓ2) = 0.3 ⇒ d = 0.6
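These connectives are straightforward to render in code; a minimal sketch (illustrative, not the PSL implementation) that reproduces the example values:

```python
def l_and(a, b):           # Łukasiewicz AND
    return max(0.0, a + b - 1.0)

def l_or(a, b):            # Łukasiewicz OR
    return min(1.0, a + b)

def l_not(a):              # Łukasiewicz NOT
    return 1.0 - a

def dist_to_satisfaction(body, head):  # for an implication body -> head
    return max(0.0, body - head)

print(dist_to_satisfaction(0.3, 0.9))  # 0.0: rule satisfied
print(dist_to_satisfaction(0.9, 0.3))  # 0.6: rule violated by 0.6
```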
PSL Probability Distribution

● PDF:

$$p(I) = \frac{1}{Z}\exp\Big(-\sum_{r \in R} w_r\, d_r(I)\Big)$$

where $I$ is a possible continuous truth assignment, the sum ranges over all rules $r \in R$, $w_r$ is the weight of formula $r$, $d_r(I)$ is the distance to satisfaction of rule $r$, and $Z$ is the normalization constant.
PSL Inference

● MPE inference (Most Probable Explanation):
  – Find the interpretation that maximizes the PDF
  – Equivalently, find the interpretation that minimizes the summation Σ_r w_r d_r(I)
  – Distance to satisfaction is a linear function
  – So MPE inference is a linear optimization problem (see the sketch below)
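As a concrete illustration, here is a sketch of MPE inference as a linear program using scipy, applied to the earlier voting example. The two rules and their weights are assumptions chosen to be consistent with that slide's evidence and conclusions, not the actual rules from the talk.

```python
# Assumed rules (hypothetical, for illustration):
#   w=2.0 : spouse(John,Mary) ∧ votesFor(Mary,Obama)  → votesFor(John,Obama)
#   w=1.0 : friend(John,Alex) ∧ votesFor(Alex,Romney) → ¬votesFor(John,Obama)
# With all evidence atoms at 1, each Łukasiewicz body evaluates to 1, so with
# v = I(votesFor(John,Obama)) the distances obey:
#   d1 ≥ 1 − v,   d2 ≥ 1 − (1 − v) = v,   d1, d2 ≥ 0,   0 ≤ v ≤ 1
from scipy.optimize import linprog

# Variables: [v, d1, d2]; minimize 2*d1 + 1*d2.
c = [0.0, 2.0, 1.0]
A_ub = [[-1.0, -1.0,  0.0],   # 1 - v <= d1   ->   -v - d1 <= -1
        [ 1.0,  0.0, -1.0]]   # v <= d2       ->    v - d2 <=  0
b_ub = [-1.0, 0.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1), (0, None), (0, None)])
print(res.x)  # [1, 0, 1]: the higher-weighted spouse rule wins, so John votes Obama
```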
Semantic Representations

• Formal Semantics
  – Uses first-order logic
  – Deep
  – Brittle
• Distributional Semantics
  – Statistical method
  – Robust
  – Shallow
System Architecture [Garrette et al., 2011, 2012; Beltagy et al., 2013]

• Combining both logical and distributional semantics:
  – Represent meaning using a probabilistic logic
    • Markov Logic Network (MLN)
    • Probabilistic Soft Logic (PSL)
  – Generate soft inference rules from distributional semantics
[Architecture diagram: Sent1 and Sent2 are parsed by BOXER into logical forms LF1 and LF2; a Distributional Rule Constructor over a vector space, together with a rule base, supplies soft inference rules; MLN/PSL inference produces the result.]
• BOXER [Bos et al., 2004]: maps sentences to logical form
• Distributional Rule constructor: generates relevant soft inference rules based on distributional similarity
• MLN/PSL: probabilistic inference
• Result: degree of entailment or semantic similarity score (depending on the task)
Markov Logic Networks [Richardson & Domingos, 2006]

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

• Two constants: Anna (A) and Bob (B)
• Ground atoms: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)
• P(Cancer(Anna) | Friends(Anna,Bob), Smokes(Bob))
Recognizing Textual Entailment (RTE)
• Premise: “A man is cutting pickles”
  ∃x,y,z. man(x) ∧ cut(y) ∧ agent(y, x) ∧ pickles(z) ∧ patient(y, z)
• Hypothesis: “A guy is slicing cucumber”
  ∃x,y,z. guy(x) ∧ slice(y) ∧ agent(y, x) ∧ cucumber(z) ∧ patient(y, z)
• Inference: Pr(Hypothesis | Premise)
  – Degree of entailment
Distributional Lexical Rules
• For all pairs of words (a, b), where a is in S1 and b is in S2, add a soft rule relating the two (see the sketch after this list):
  – ∀x a(x) → b(x) | wt(a, b)
  – wt(a, b) = f(cos(a, b))
• Premise: “A man is cutting pickles”
• Hypothesis: “A guy is slicing cucumber”
  – ∀x man(x) → guy(x) | wt(man, guy)
  – ∀x cut(x) → slice(x) | wt(cut, slice)
  – ∀x pickle(x) → cucumber(x) | wt(pickle, cucumber)
  – ∀x man(x) → cucumber(x) | wt(man, cucumber)
  – ∀x pickle(x) → guy(x) | wt(pickle, guy)
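A minimal sketch of the rule weight wt(a, b) = f(cos(a, b)); the 300-dimensional random vectors stand in for real distributional vectors, and the particular squashing function f is an assumption:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rule_weight(vec_a, vec_b, scale=1.0):
    # f maps similarity to a non-negative rule weight; the clipping and
    # scale here are assumptions, not the paper's exact choice of f.
    return scale * max(0.0, cosine(vec_a, vec_b))

man, guy = np.random.rand(300), np.random.rand(300)  # stand-in vectors
print(rule_weight(man, guy))  # weight for: ∀x man(x) → guy(x)
```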
Distributional Phrase Rules
• Premise: “A boy is playing”
• Hypothesis: “A little kid is playing”
• Need rules for phrases:
  – ∀x boy(x) → little(x) ∧ kid(x) | wt(boy, "little kid")
• Compute vectors for phrases using vector addition [Mitchell & Lapata, 2010] (see the sketch below):
  – "little kid" = little + kid
Paraphrase Rules [by: Cuong Chau]
• Generate inference rules from pre-compiled paraphrase collections like Berant et al. [2012]
• e.g., “X solves Y” ⇒ “X finds a solution to Y” | w
Evaluation (RTE using MLNs)
• Datasets: RTE-1, RTE-2, RTE-3
• Each dataset has 800 training pairs and 800 testing pairs
• Use multiple parses to reduce the impact of misparses
Evaluation (RTE using MLNs) [by: Cuong Chau]

                       RTE-1   RTE-2   RTE-3
Bos & Markert [2005]    0.52     –       –
MLN                     0.57    0.58    0.55
MLN-multi-parse         0.56    0.58    0.57
MLN-paraphrases         0.60    0.60    0.60

Bos & Markert [2005] is a logic-only baseline whose knowledge base is WordNet.
Semantic Textual Similarity (STS)
• Rate the semantic similarity of two sentences on a 0 to 5 scale
• Gold standards are averaged over multiple human judgments
• Evaluate by measuring correlation to human ratings

S1                             S2                               score
A man is slicing a cucumber    A guy is cutting a cucumber        5
A man is slicing a cucumber    A guy is cutting a zucchini        4
A man is slicing a cucumber    A woman is cooking a zucchini      3
A man is slicing a cucumber    A monkey is riding a bicycle       1
Softening Conjunction for STS
• Premise: “A man is driving”
  ∃x,y. man(x) ∧ drive(y) ∧ agent(y, x)
• Hypothesis: “A man is driving a bus”
  ∃x,y,z. man(x) ∧ drive(y) ∧ agent(y, x) ∧ bus(z) ∧ patient(y, z)
• Break the hypothesis into “mini-clauses”, then combine their evidence using an “averaging combiner” [Natarajan et al., 2010], as in the sketch below
• The hypothesis becomes:
  – ∀x,y,z. man(x) ∧ agent(y, x) → result()
  – ∀x,y,z. drive(y) ∧ agent(y, x) → result()
  – ∀x,y,z. drive(y) ∧ patient(y, z) → result()
  – ∀x,y,z. bus(z) ∧ patient(y, z) → result()
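A minimal sketch of the averaging-combiner idea: each mini-clause contributes a degree of evidence for result(), and the degrees are averaged rather than conjoined. The truth values below are illustrative assumptions.

```python
# Illustrative sketch of the "averaging combiner" [Natarajan et al., 2010]:
# average the evidence of the mini-clauses instead of taking their conjunction.
def average_combiner(mini_clause_values):
    return sum(mini_clause_values) / len(mini_clause_values)

# "A man is driving" supports the two agent mini-clauses of the hypothesis
# but neither patient mini-clause (assumed degrees, for illustration).
print(average_combiner([1.0, 1.0, 0.0, 0.0]))  # 0.5: partial entailment, not 0
```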
Evaluation (STS using MLN)
• Microsoft video description corpus (SemEval 2012)
– Short video descriptions
System                                                  Pearson r
Our System with no distributional rules [logic only]      0.52
Our System with lexical rules                             0.60
Our System with lexical and phrase rules                  0.63
PSL: Probabilistic Soft Logic [Kimmig, Bach, Broecheler, Huang & Getoor, NIPS 2012]
● MLN's inference is very slow
● PSL is a probabilistic logic framework designed with efficient inference in mind
● Inference is a linear program
STS using PSL – Conjunction

● The Łukasiewicz relaxation of AND is very restrictive:
  – I(ℓ1 ∧ ℓ2) = max{0, I(ℓ1) + I(ℓ2) − 1}
● Replace AND with a weighted average (see the sketch below):
  – I(ℓ1 ∧ … ∧ ℓn) = w_avg(I(ℓ1), …, I(ℓn))
  – Learning the weights is future work; for now, they are equal
● Inference:
  – The weighted average is a linear function
  – No changes to the optimization problem
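A small sketch contrasting Łukasiewicz AND with the equal-weight weighted average; the truth values are illustrative:

```python
# Łukasiewicz AND vs. weighted average for an n-ary conjunction.
def lukasiewicz_and(vals):
    return max(0.0, sum(vals) - (len(vals) - 1))

def w_avg(vals, weights=None):
    weights = weights or [1.0] * len(vals)  # equal weights, as in the talk
    return sum(w * v for w, v in zip(weights, vals)) / sum(weights)

vals = [0.9, 0.9, 0.9, 0.9]
print(lukasiewicz_and(vals))  # 0.6: harsh as conjunctions grow longer
print(w_avg(vals))            # 0.9: degrades gracefully
```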
Evaluation (STS using PSL)
                           msr-vid   msr-par   SICK
vec-add (dist. only)        0.78      0.24     0.65
vec-mul (dist. only)        0.76      0.12     0.62
MLN (logic + dist.)         0.63      0.16     0.47
PSL-no-DIR (logic only)     0.74      0.46     0.68
PSL (logic + dist.)         0.79      0.53     0.70
PSL+vec-add (ensemble)      0.83      0.49     0.71

msr-vid: Microsoft video description corpus (SemEval 2012); short video description sentences
msr-par: Microsoft paraphrase corpus (SemEval 2012); long news sentences
SICK: SemEval 2014
Evaluation (STS using PSL)
                        msr-vid    msr-par     SICK
PSL time/pair           8 s        30 s        10 s
MLN time/pair           1 m 31 s   11 m 49 s   4 m 24 s
MLN timeouts (10 min)   9%         97%         36%