Online Max-Margin Weight Learning for Markov Logic Networks

Tuyen N. Huynh and Raymond J. Mooney
Machine Learning Group, Department of Computer Science, The University of Texas at Austin
SDM 2011, April 29, 2011


TRANSCRIPT

Page 1: Online Max-Margin Weight Learning  for Markov Logic Networks

Online Max-Margin Weight Learning for Markov Logic Networks

Tuyen N. Huynh and Raymond J. Mooney
Machine Learning Group, Department of Computer Science
The University of Texas at Austin

SDM 2011, April 29, 2011

Page 2: Online Max-Margin Weight Learning  for Markov Logic Networks

Motivation

Citation segmentation:

D. McDermott and J. Doyle. Non-monotonic Reasoning I. Artificial Intelligence, 13: 41-72, 1980.

Semantic role labeling:

[A0 He] [AM-MOD would] [AM-NEG n't] [V accept] [A1 anything of value] from [A2 those he was writing about]

Page 3: Online Max-Margin Weight Learning  for Markov Logic Networks

Motivation (cont.)

Markov Logic Networks (MLNs) [Richardson & Domingos, 2006] are an elegant and powerful formalism for handling such complex structured data.

Existing weight learning methods for MLNs operate in the batch setting:
- They need to run inference over all the training examples in each iteration.
- They usually take a few hundred iterations to converge.
- They may not fit all the training examples in main memory, so they do not scale to problems with a large number of examples.

Previous work applied an existing online algorithm to learn weights for MLNs but did not compare it to other algorithms.

This work introduces a new online weight learning algorithm and extensively compares it to existing methods.

Page 4: Online Max-Margin Weight Learning  for Markov Logic Networks

Outline

- Motivation
- Background: Markov Logic Networks; primal-dual framework for online learning
- New online learning algorithm for max-margin structured prediction
- Experimental evaluation
- Summary

Page 5: Online Max-Margin Weight Learning  for Markov Logic Networks

Markov Logic Networks [Richardson & Domingos, 2006]

- A set of weighted first-order formulas; a larger weight indicates a stronger belief that the formula should hold.
- The formulas are called the structure of the MLN.
- MLNs are templates for constructing Markov networks for a given set of constants (a grounding sketch follows below).

MLN Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

*Slide from [Domingos, 2007]
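To make the template idea concrete, here is a minimal sketch of how each weighted formula expands into one ground clause per variable binding. This is plain illustrative Python, not Alchemy or any MLN toolkit; all names are invented for the example.

```python
from itertools import product

constants = ["Anna", "Bob"]

# (weight, variables, clause template) pairs mirroring the slide's MLN
formulas = [
    (1.5, ("x",),     "Smokes({x}) => Cancer({x})"),
    (1.1, ("x", "y"), "Friends({x},{y}) => (Smokes({x}) <=> Smokes({y}))"),
]

for weight, variables, template in formulas:
    # One ground clause per assignment of constants to the variables
    for binding in product(constants, repeat=len(variables)):
        clause = template.format(**dict(zip(variables, binding)))
        print(f"{weight}\t{clause}")
```

With two constants this prints 2 + 4 = 6 weighted ground clauses over the 8 ground atoms shown on the following slides.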

Page 6: Online Max-Margin Weight Learning  for Markov Logic Networks

Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)

*Slide from [Domingos, 2007]

Page 7: Online Max-Margin Weight Learning  for Markov Logic Networks

Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)

[Figure: one node per ground atom: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)]

*Slide from [Domingos, 2007]

Page 8: Online Max-Margin Weight Learning  for Markov Logic Networks

Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)

[Figure: the ground network, with edges connecting atoms that appear together in a grounding of a formula]

*Slide from [Domingos, 2007]

Page 9: Online Max-Margin Weight Learning  for Markov Logic Networks

Example: Friends & Smokers

1.5  ∀x Smokes(x) ⇒ Cancer(x)
1.1  ∀x,y Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y))

Two constants: Anna (A) and Bob (B)

[Figure: the complete ground Markov network over the eight ground atoms]

*Slide from [Domingos, 2007]

Page 10: Online Max-Margin Weight Learning  for Markov Logic Networks

Probability of a possible world

$$P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i\, n_i(x) \Big), \qquad Z = \sum_{x'} \exp\Big( \sum_i w_i\, n_i(x') \Big)$$

where $w_i$ is the weight of formula $i$ and $n_i(x)$ is the number of true groundings of formula $i$ in the possible world $x$.

A possible world becomes exponentially less likely as the total weight of all the ground clauses it violates increases.
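As a sanity check of this formula, here is a brute-force sketch for the two-constant example above. Exhaustive enumeration of the 2^8 possible worlds is only feasible for such a tiny domain; the helper names (CONSTS, n1, n2) are illustrative.

```python
import itertools
import math

CONSTS = ["A", "B"]
ATOMS = ([("Smokes", c) for c in CONSTS] + [("Cancer", c) for c in CONSTS] +
         [("Friends", a, b) for a in CONSTS for b in CONSTS])

def n1(world):
    # true groundings of 1.5: Smokes(x) => Cancer(x)
    return sum((not world[("Smokes", x)]) or world[("Cancer", x)]
               for x in CONSTS)

def n2(world):
    # true groundings of 1.1: Friends(x,y) => (Smokes(x) <=> Smokes(y))
    return sum((not world[("Friends", x, y)]) or
               (world[("Smokes", x)] == world[("Smokes", y)])
               for x in CONSTS for y in CONSTS)

def score(world):
    return 1.5 * n1(world) + 1.1 * n2(world)   # sum_i w_i * n_i(x)

# Partition function Z: sum over all 2^8 = 256 possible worlds
worlds = [dict(zip(ATOMS, v))
          for v in itertools.product([False, True], repeat=len(ATOMS))]
Z = sum(math.exp(score(w)) for w in worlds)

def prob(world):
    return math.exp(score(world)) / Z          # P(X = x)
```

Worlds that violate many high-weight groundings get a smaller score and hence an exponentially smaller probability, exactly as the slide states.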

Page 11: Online Max-Margin Weight Learning  for Markov Logic Networks

Max-margin weight learning for MLNs [Huynh & Mooney, 2009]

- Maximize the separation margin: the log of the ratio between the probability of the correct label and that of the closest incorrect one,

$$\hat{y} = \arg\max_{y' \in \mathcal{Y} \setminus \{y\}} P(y' \mid x)$$

$$\gamma(x, y; w) = \log \frac{P(y \mid x)}{P(\hat{y} \mid x)} = w^\top n(x, y) - \max_{y' \in \mathcal{Y} \setminus \{y\}} w^\top n(x, y')$$

- Formulated as a 1-slack structural SVM [Joachims et al., 2009].
- Solved with the cutting-plane method [Tsochantaridis et al., 2004] and an approximate inference algorithm based on linear programming.

Page 12: Online Max-Margin Weight Learning  for Markov Logic Networks

Online learning

For t = 1 to T:
- Receive an example x_t.
- The learner chooses a weight vector w_t and uses it to predict a label.
- Receive the correct label y_t.
- Suffer a loss ℓ_t(w_t).

Goal: minimize the regret, the gap between the cumulative loss of the online learner and the cumulative loss of the best fixed weight vector in hindsight (the best batch learner); a small numeric illustration follows below:

$$\mathrm{Regret}(T) = \sum_{t=1}^{T} \ell_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} \ell_t(w)$$
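A toy illustration of this protocol and the regret bookkeeping. This is a hedged sketch: a hinge loss on a synthetic linear stream stands in for the structured losses used later, and the best fixed weight vector is approximated by random search rather than solved exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def hinge(w, x, y):
    # l_t(w): hinge loss of weights w on example (x, y), with y in {-1, +1}
    return max(0.0, 1.0 - y * float(w @ x))

# Synthetic stream standing in for a sequence of structured examples
examples = [(x, 1.0 if x.sum() > 0 else -1.0)
            for x in rng.normal(size=(200, 5))]

w = np.zeros(5)
online_loss = 0.0
for x, y in examples:              # t = 1, ..., T
    online_loss += hinge(w, x, y)  # the learner suffers l_t(w_t)
    if y * float(w @ x) < 1.0:     # simple subgradient update
        w += 0.1 * y * x

# Best fixed weight vector in hindsight, approximated by random search
candidates = [rng.normal(size=5) for _ in range(1000)] + [w]
best_loss = min(sum(hinge(v, x, y) for x, y in examples) for v in candidates)
print("regret(T) ~", online_loss - best_loss)
```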

Page 13: Online Max-Margin Weight Learning  for Markov Logic Networks

Primal-dual framework for online learning [Shalev-Shwartz et al., 2006]

- A general, recent framework for deriving low-regret online algorithms.
- Rewrite the regret bound as an optimization problem (the primal problem), then consider the dual of that primal problem.
- Derive a condition that guarantees an increase in the dual objective at each step.
- This yields Incremental-Dual-Ascent (IDA) algorithms, for example subgradient methods [Zinkevich, 2003].

Page 14: Online Max-Margin Weight Learning  for Markov Logic Networks

Primal-dual framework for online learning (cont.)

We propose a new class of IDA algorithms called Coordinate-Dual-Ascent (CDA) algorithms:
- The CDA update rule optimizes the dual only w.r.t. the last dual variable (the current example).
- The CDA update rule has a closed-form solution.
- A CDA step has the same cost as a subgradient step but increases the dual objective more at each step, leading to better accuracy.

Page 15: Online Max-Margin Weight Learning  for Markov Logic Networks

CDA algorithm for max-margin structured prediction

Steps for deriving a new CDA algorithm:
1. Define the regularization and loss functions.
2. Find the conjugate functions.
3. Derive a closed-form solution for the CDA update rule.

Page 16: Online Max-Margin Weight Learning  for Markov Logic Networks

Max-margin structured prediction

- The output y belongs to a structure space 𝒴.
- Joint feature function φ(x, y): 𝒳 × 𝒴 → ℝⁿ. For MLNs, φ(x, y) = n(x, y), the vector of true-grounding counts.
- Learn a discriminant function:

$$f(x, y; w) = w^\top \phi(x, y)$$

- Prediction for a new input x:

$$h(x; w) = \arg\max_{y \in \mathcal{Y}} w^\top \phi(x, y)$$

- Max-margin criterion:

$$\gamma(x, y; w) = w^\top \phi(x, y) - \max_{y' \in \mathcal{Y} \setminus \{y\}} w^\top \phi(x, y')$$
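A small sketch of these three ingredients for an enumerable label space. The feature map and label set here are invented for illustration; in the MLN setting φ(x, y) would be the counts n(x, y), and the argmax would require MAP inference rather than enumeration.

```python
import numpy as np

LABELS = ["Author", "Title", "Venue"]   # illustrative label space

def phi(x, y):
    # Joint feature map phi(x, y): copies x into the block for label y,
    # a stand-in for the true-grounding counts n(x, y) of an MLN
    out = np.zeros(len(LABELS) * len(x))
    i = LABELS.index(y)
    out[i * len(x):(i + 1) * len(x)] = x
    return out

def h(w, x):
    # h(x; w) = argmax_{y in Y} w . phi(x, y)
    return max(LABELS, key=lambda y: float(w @ phi(x, y)))

def margin(w, x, y):
    # gamma(x, y; w) = w.phi(x, y) - max_{y' != y} w.phi(x, y')
    runner_up = max(float(w @ phi(x, yp)) for yp in LABELS if yp != y)
    return float(w @ phi(x, y)) - runner_up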

Page 17: Online Max-Margin Weight Learning  for Markov Logic Networks

1. Define the regularization and loss functions

- Regularization function: $f(w) = \frac{1}{2}\|w\|_2^2$
- Loss function: prediction-based loss (PL), the loss incurred by using the predicted label at each step:

$$\ell_t^{PL}(w) = \big[\, \rho(y_t, \hat{y}_t^{P}) - \langle w, \Delta\phi_t^{PL} \rangle \,\big]_+$$

where $\hat{y}_t^{P}$ is the predicted label, $\Delta\phi_t^{PL} = \phi(x_t, y_t) - \phi(x_t, \hat{y}_t^{P})$, and $\rho(y, \hat{y})$ is the label loss function.

Page 18: Online Max-Margin Weight Learning  for Markov Logic Networks

1. Define the regularization and loss functions (cont.)

- Loss function: maximal loss (ML), the maximum loss the online learner could suffer at each step (sketched below):

$$\ell_t^{ML}(w) = \big[\, \rho(y_t, \hat{y}_t^{ML}) - \langle w, \Delta\phi_t^{ML} \rangle \,\big]_+$$

where $\hat{y}_t^{ML} = \arg\max_{y \in \mathcal{Y}} \big( w^\top \phi(x_t, y) + \rho(y_t, y) \big)$ (loss-augmented inference) and $\Delta\phi_t^{ML} = \phi(x_t, y_t) - \phi(x_t, \hat{y}_t^{ML})$.

- The ML loss is an upper bound on the PL loss, so it gives a more aggressive update and better predictive accuracy on clean datasets.
- The ML loss depends on the label loss function, so it can only be used with some label loss functions.
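Under the reconstruction above, computing the ML loss amounts to a loss-augmented argmax. A sketch for an enumerable label space, reusing the illustrative phi from the earlier sketch; all argument names are assumptions, not the authors' code.

```python
def ml_loss(w, x, y_true, labels, phi, rho):
    # Maximal loss: evaluate the margin violation at the label that
    # maximizes score + label loss (loss-augmented inference)
    y_hat = max(labels, key=lambda y: float(w @ phi(x, y)) + rho(y_true, y))
    violation = (rho(y_true, y_hat)
                 - float(w @ (phi(x, y_true) - phi(x, y_hat))))
    return max(0.0, violation)
```

Because the argmax includes the label-loss term, the resulting violation is at least as large as the one measured at the predicted label, which is why ML upper-bounds PL.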

Page 19: Online Max-Margin Weight Learning  for Markov Logic Networks

2. Find the conjugate functions

Conjugate function:

$$f^*(\theta) = \sup_{w \in \mathcal{W}} \big( \langle w, \theta \rangle - f(w) \big)$$

In one dimension, $f^*(\theta)$ is the negative of the y-intercept of the tangent line to the graph of $f$ that has slope $\theta$.

Page 20: Online Max-Margin Weight Learning  for Markov Logic Networks

2. Find the conjugate functions (cont.)

Conjugate function of the regularization function $f(w) = \frac{1}{2}\|w\|_2^2$:

$$f^*(\theta) = \frac{1}{2}\|\theta\|_2^2$$
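This identity follows in one line from the definition on the previous slide, since the concave objective inside the supremum is maximized at $w = \theta$:

```latex
f^*(\theta) = \sup_{w}\left(\langle w,\theta\rangle - \tfrac{1}{2}\|w\|_2^2\right),
\qquad
\nabla_w\!\left(\langle w,\theta\rangle - \tfrac{1}{2}\|w\|_2^2\right)
  = \theta - w = 0 \;\Rightarrow\; w^\star = \theta,
\qquad
f^*(\theta) = \langle\theta,\theta\rangle - \tfrac{1}{2}\|\theta\|_2^2
            = \tfrac{1}{2}\|\theta\|_2^2 .
```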

Page 21: Online Max-Margin Weight Learning  for Markov Logic Networks

2. Find the conjugate functions (cont.)

Conjugate function of the loss functions:
- Both the PL and ML losses have the hinge form $[\, \rho - \langle w, \Delta\phi \rangle \,]_+$, similar to the hinge loss.
- The conjugate function of the hinge loss is known in closed form [Shalev-Shwartz & Singer, 2007].
- The conjugate functions of the PL and ML losses follow from it.

Page 22: Online Max-Margin Weight Learning  for Markov Logic Networks

3. Closed-form solution for the CDA update rule

CDA's update formula (shown here for the PL loss; a code sketch follows below):

$$w_{t+1} = w_t + \frac{\big[\, \rho(y_t, \hat{y}_t^{P}) - \langle w_t, \Delta\phi_t^{PL} \rangle \,\big]_+}{\|\Delta\phi_t^{PL}\|_2^2}\, \Delta\phi_t^{PL}$$

Compare with the update formula of the simple subgradient method [Ratliff et al., 2007], which steps along the same direction $\Delta\phi_t$ with a fixed learning-rate schedule: CDA's learning rate combines the learning rate of the subgradient method with the loss incurred at each step.
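A sketch of the two updates side by side, assuming the PL-loss quantities defined earlier; the subgradient step size eta_t is a schedule supplied by the caller, and the function names are illustrative.

```python
import numpy as np

def cda_pl_step(w, phi_true, phi_pred, rho_value):
    # CDA update: step size = hinge(loss) / ||dphi||^2, per the formula above
    dphi = phi_true - phi_pred           # Delta phi_t^PL
    denom = float(dphi @ dphi)
    if denom == 0.0:
        return w                         # no feature difference: no update
    tau = max(0.0, rho_value - float(w @ dphi)) / denom
    return w + tau * dphi

def subgradient_step(w, phi_true, phi_pred, eta_t):
    # Subgradient update [Ratliff et al., 2007]: same direction, but a
    # fixed schedule eta_t instead of a loss-adaptive step size
    return w + eta_t * (phi_true - phi_pred)
```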

Page 23: Online Max-Margin Weight Learning  for Markov Logic Networks


Experiments

Page 24: Online Max-Margin Weight Learning  for Markov Logic Networks

Experimental Evaluation

- Citation segmentation on the CiteSeer dataset
- Search query disambiguation on a dataset obtained from Microsoft
- Semantic role labeling on a noisy CoNLL 2005 dataset

Page 25: Online Max-Margin Weight Learning  for Markov Logic Networks

Citation segmentation

- CiteSeer dataset [Lawrence et al., 1999; Poon & Domingos, 2007]: 1,563 citations, divided into 4 research topics.
- Task: segment each citation into 3 fields: Author, Title, Venue.
- Used the MLN for the isolated segmentation model in [Poon & Domingos, 2007].

Page 26: Online Max-Margin Weight Learning  for Markov Logic Networks

Experimental setup

- 4-fold cross-validation.
- Systems compared:
  - MM: the max-margin weight learner for MLNs in the batch setting [Huynh & Mooney, 2009]
  - 1-best MIRA [Crammer et al., 2005]
  - Subgradient
  - CDA: CDA-PL and CDA-ML
- Metric: F1, the harmonic mean of precision and recall.

Page 27: Online Max-Margin Weight Learning  for Markov Logic Networks

Average F1 on CiteSeer

[Bar chart comparing MM, 1-best-MIRA, Subgradient, CDA-PL, and CDA-ML; y-axis: F1, ranging from 90.5 to 95]

Page 28: Online Max-Margin Weight Learning  for Markov Logic Networks

Average training time in minutes

[Bar chart comparing MM, 1-best-MIRA, Subgradient, CDA-PL, and CDA-ML; y-axis: minutes, ranging from 0 to 100]

Page 29: Online Max-Margin Weight Learning  for Markov Logic Networks

Search query disambiguation

- Used the dataset created by Mihalkova & Mooney [2009]: thousands of search sessions containing ambiguous queries; 4,618 sessions for training, 11,234 sessions for testing.
- Goal: disambiguate a search query based on previous related search sessions.
- A noisy dataset, since the true labels are based on which results users clicked.
- Used the 3 MLNs proposed in [Mihalkova & Mooney, 2009].

Page 30: Online Max-Margin Weight Learning  for Markov Logic Networks

Experimental setup

- Systems compared:
  - Contrastive Divergence (CD) [Hinton, 2002], as used in [Mihalkova & Mooney, 2009]
  - 1-best MIRA
  - Subgradient
  - CDA: CDA-PL and CDA-ML
- Metric: Mean Average Precision (MAP), which measures how close the relevant results are to the top of the rankings.

Page 31: Online Max-Margin Weight Learning  for Markov Logic Networks

MAP scores on Microsoft query search

[Grouped bar chart: MAP for CD, 1-best-MIRA, Subgradient, CDA-PL, and CDA-ML on MLN1, MLN2, and MLN3; y-axis: MAP, ranging from 0.35 to 0.41]

Page 32: Online Max-Margin Weight Learning  for Markov Logic Networks

Semantic role labeling

- CoNLL 2005 shared task dataset [Carreras & Màrquez, 2005].
- Task: for each target verb in a sentence, find and label all of its semantic arguments.
- 90,750 training examples; 5,267 test examples.
- Noisy-label experiment, motivated by noisy labeled data obtained from crowdsourcing services such as Amazon Mechanical Turk.
- Simple noise model (sketched below): at p percent noise, there is probability p that an argument of a verb is swapped with another argument of that verb.
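A sketch of that noise model as I read it from the slide; the argument representation is assumed, and any list-like container of a verb's arguments would do.

```python
import random

def swap_noise(arguments, p, rng=None):
    # At noise level p, each argument of the verb is swapped with
    # another randomly chosen argument of the same verb w.p. p
    rng = rng or random.Random(0)
    args = list(arguments)
    for i in range(len(args)):
        if len(args) > 1 and rng.random() < p:
            j = rng.randrange(len(args) - 1)
            j = j if j < i else j + 1    # choose an index other than i
            args[i], args[j] = args[j], args[i]
    return args

# e.g. swap_noise(["A0: He", "A1: anything of value", "A2: those ..."], 0.2)
```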

Page 33: Online Max-Margin Weight Learning  for Markov Logic Networks

Experimental setup

- Used the MLN developed in [Riedel, 2007].
- Systems compared: 1-best MIRA, Subgradient, CDA-ML.
- Metric: F1 of the predicted arguments [Carreras & Màrquez, 2005].

Page 34: Online Max-Margin Weight Learning  for Markov Logic Networks

F1 scores on CoNLL 2005

[Line chart: F1 vs. percentage of noise (0, 5, 10, 15, 20, 25, 30, 35, 40, 50) for 1-best-MIRA, Subgradient, and CDA-ML; y-axis: F1, ranging from 0.5 to 0.75]

Page 35: Online Max-Margin Weight Learning  for Markov Logic Networks

Summary

- Derived CDA algorithms for max-margin structured prediction.
- They have the same computational cost as existing online algorithms but increase the dual objective more at each step.
- Experimental results on several real-world problems show that the new algorithms generally achieve better accuracy and more consistent performance.

Page 36: Online Max-Margin Weight Learning  for Markov Logic Networks


Thank you!

Questions?