

Beyond EM: Bayesian Techniques for NLP Researchers

Hal Daumé III
Information Sciences Institute

University of Southern California

[email protected]

...or “And you thought EM was hard”

...or “And you thought EM was fun”


Tutorial Outline

Introduction to the Bayesian Paradigm

Background Material
  Graphical Models
  Expectation Maximization
  Priors, priors, priors (subjective, conjugate, reference, etc.)

Inference Problem and Solutions
  MAP
  Summing
  Monte Carlo
  Markov Chain Monte Carlo
  Laplace Approximation
  Variational Approximation
  Expectation Propagation
  Others...

Advanced Topics (time permitting)
  Bayesian discriminative models
  Non-parametric (infinite) models
  Bayesian Decision Theory

References


The Bayesian Paradigm

Every statistical problem has data and parameters

Find a probability distribution of the parameters given the data using Bayes' Rule:

Use the posterior to:

Predict unseen data (machine learning)

Reach scientific conclusions (statistics)

Make optimal decisions (Bayesian decision theory)

Posterior = Prior × Likelihood / Marginal:

p(θ | D) = p(θ) p(D | θ) / ∫ dθ' p(θ') p(D | θ')
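As a concrete (made-up) instance of the rule, a tiny Python sketch with a coin-bias parameter restricted to three candidate values:

import numpy as np

thetas = np.array([0.3, 0.5, 0.7])        # candidate parameter values
prior = np.array([1/3, 1/3, 1/3])         # prior p(theta)
heads, tails = 9, 3                       # observed data (assumed for illustration)

likelihood = thetas**heads * (1 - thetas)**tails      # p(data | theta)
marginal = np.sum(prior * likelihood)                 # p(data)
posterior = prior * likelihood / marginal             # p(theta | data)
print(posterior)    # most of the mass moves to theta = 0.7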


Why be Bayesian?

Pedagogical arguments:

All estimates, answers, decisions, etc. are consistent

Uniform treatment of all aspects of statistical modeling

Results are often interpretable

Assumptions are explicit and most can be tested

Practical arguments:

We know maximum likelihood estimators (for instance) are flawed

Who doesn't do smoothing?

I don't want to run 1000 experiments, changing a single parameter by 0.01 on each run (and I don't want you to either!)

Lets you do some fun stuff that you can't otherwise do

See also: Ber[1.6, 4.2], MK[2.1, 2.2], Was[11.1]


Why not be Bayesian?

Makes you do a lot of (or at least some) math

Could be a reason to be Bayesian, though

Computational complexity

Though it's not so bad compared to cross validation

Arbitrariness of priors

Often based on misguided impressions of either Bayesian, or frequentist, methods

Generally not a big problem for machine learning


What does it mean to be Bayesian?

It is NOT about using priors

Oooh look, I put a Gaussian prior on my maximum entropy problem...I must be a Bayesian now

It is NOT about applying Bayes' Rule

Well, I've used a noisy channel model, which employs Bayes' Rule...I'm being Bayesian, right?

It is ONLY about modeling uncertainty in all stages of statistical inference

This means making decisions by summing over all possibilities



What is this tutorial about?

Graphical models as a tool for expressing assumptions

Common statistical distributions that are useful for modeling different types of quantities

How to go from a problem specification to a model

How to specify/choose appropriate prior distributions

How to perform inference in that model

Initial focus on unsupervised learning, but will discuss supervised models at the end



What is a Model?

A statistical model is a specification of our assumptions about the data we will be modeling

What aspects are independent of others?

For non-independent aspects, how are they related?

Example 1 (maximum entropy models):

There is a class variable y and an instance variable x, and the conditional probability of y given x is a linear function of a bunch of features of x

Example 2 (machine translation):

There is something called a foreign string and something called an English string, and each foreign word is generated by either exactly one English word or a special 'null' token, according to a translation table


What is a Good Model?

We can consider models by looking at the probability that they generate our data set (the marginal likelihood of the data):

(Figure: P(data | model) plotted over all possible data sets, with curves for Model 1, Model 2, and Model 3, and the current data set marked on the axis.)

See also: MK[28.1]


What are Parameters?

Parameters are the parts of the model about which you are not completely 'certain' (or, willing to claim certainty)

Example 1 (maximum entropy models):

I 'know' that the class is linearly related to the input

I 'know' what the relevant features are

I don't know the feature weights => these are the parameters

Example 2 (machine translation):

I 'know' that foreign words are generated from exactly one English word (or the null word)

I do not know what the probability of any foreign word is, given a particular English word

(I also do not know which foreign words correspond to which English word, but this is a 'hidden variable', not a parameter.)


Graphical Models

Convenient notation for representing probability distributions and conditional independence assumptions

X — an observed random variable

X — an unobserved/hidden random variable

X — an observed/known parameter

X — an unobserved/unknown parameter

(the four node types are distinguished by shading and shape in the diagrams)

N — a plate: a submodel replicated N times

An arrow — an indication of conditional dependence

See also: Murphy


Example 1: Naïve Bayes

(Graphical model: class label Y, data vector X with F features, class 'prior' probability π, feature parameters θ; the whole submodel is replicated N times.)

For each example n: choose a class Y by
  Y_n ~ Mult(π)

For each feature f: choose X by
  X_nf ~ p(X_nf | θ_{f, Y_n})


Example 2: Maximum Entropy

(Graphical model: observed data vector X with F features, class label Y, feature parameters θ; replicated N times.)

For each example n: choose a class Y by

p(Y = y | X, θ) ∝ exp( Σ_f θ_{yf} X_f )


Example 3: Hidden Markov Models

(Graphical model: a chain of state/observation pairs (X, Y); each state depends on the previous state, and each state emits one observation.)


Example for Summarization

Consider a stupid summarization model:

Each word in a document is drawn independently

Each word is drawn either from a general English model or a document-specific model

We don't know which words are drawn from which

(Graphical model: for each of M documents and each of N words, an indicator variable z_mn selects whether word w_mn comes from the general model θ_G or the document-specific model θ_D; π governs z.)

p(w | π, θ_G, θ_D) = Π_m Π_n Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}


Fun with Graphical Models

Easy to propose extensions to the model: add sentences!

(Figure: the summarization model with an added sentence plate S nested between the document plate and the word plate.)


Fun with Graphical Models

Add queries!

(Figure: the model further extended with a query Q and its words w_Q, alongside the sentence plate S.)


Maximum Likelihood Estimators (MLE)

(Graphical model: observed X and Y in a plate of size N, with parameters π and θ.)

Take a parameterized model and some data

Find the parameters that maximize the likelihood of that data (i.e., the 'probability' of the parameters given the data):

(π*, θ*) = argmax_{π, θ} p(X_{1:N}, Y_{1:N} | π, θ) = argmax_{π, θ} log p(X_{1:N}, Y_{1:N} | π, θ)

For the naïve Bayes model with binary features:

l(π, θ; X_{1:N}, Y_{1:N}) = Σ_n Σ_k [ Y_nk log π_k + (1 − Y_nk) log(1 − π_k) ]
                          + Σ_n Σ_f [ X_nf log θ_{f Y_n} + (1 − X_nf) log(1 − θ_{f Y_n}) ]

∂l / ∂π_k = Σ_n [ Y_nk / π_k − (1 − Y_nk) / (1 − π_k) ]

∂l / ∂θ_{fk} = Σ_{n : Y_n = k} [ X_nf / θ_{fk} − (1 − X_nf) / (1 − θ_{fk}) ]



MLE with hidden variables

Consider a stupid summarization model:

Each word in a document is drawn independently

Each word is drawn either from a general English model or a document-specific model

We don't know which words are drawn from which

Compute the log-likelihood:

p(w | π, θ_G, θ_D) = Π_m Π_n Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}

l(π, θ; w) = Σ_m Σ_n log Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}

Uh oh! The log can't go inside the sum!

(Graphical model: as before, with indicator variable z_mn.)


Expectation Maximization

We would like to move the log inside the sum, but can we?

Jensen's Inequality to the rescue:

For any distribution Q (with the same support):

log Σ_z p(x, z | θ) = log Σ_z Q(z) [ p(x, z | θ) / Q(z) ] ≥ Σ_z Q(z) log [ p(x, z | θ) / Q(z) ]

How should we choose Q?

See also: Was[9.13], Murphy


Expectation Maximization

If we set Q(z) = p(z | x, θ), then the lower bound becomes an equality:

Σ_z Q(z) log [ p(x, z | θ) / Q(z) ]
  = Σ_z p(z | x, θ) log [ p(x, z | θ) / p(z | x, θ) ]
  = Σ_z p(z | x, θ) log p(x | θ)
  = log p(x | θ)

So, when computing the expected complete log-likelihood, the expectation should be taken with respect to the true posterior p(z | x, θ)


EM in Practice

Recall, we wanted to estimate parameters for:

l(π, θ; w) = Σ_m Σ_n log Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}

So we replace the hidden variables with their expectations:

l̃(π, θ; w) = Σ_m Σ_n [ E[z_mn] log( π p(w_mn | θ_G) ) + (1 − E[z_mn]) log( (1 − π) p(w_mn | θ_D) ) ]

All we need to do is calculate the expectations:

E[z_mn] = p(z_mn = 1 | w_mn, π, θ) = π p(w_mn | θ_G) / [ π p(w_mn | θ_G) + (1 − π) p(w_mn | θ_D) ]

And now the computation proceeds as in the no-hidden-variable setting


EM Summed Up

Initialize parameters however you desire

Repeat:

E-STEP:Compute expectations of hidden variables underthe current parameter settings

M-STEP: Optimize parameters given those expectations (a code sketch of the full loop follows below)

This procedure is guaranteed to:

Converge to a (local) maximum

Monotonically increase the incomplete log-likelihood
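To make the E/M loop concrete, here is a rough Python sketch (our own illustration, not code from the tutorial) of EM for the toy summarization model; the two documents and three-word vocabulary are the {A, B, C} example worked on the next slide.

import numpy as np

def em(docs, vocab, n_iters=50):
    idx = {w: i for i, w in enumerate(vocab)}
    pi = 0.5                                              # P(token comes from the general model)
    theta_g = np.full(len(vocab), 1.0 / len(vocab))       # general-English unigram model
    theta_d = [np.full(len(vocab), 1.0 / len(vocab)) for _ in docs]   # per-document models
    for _ in range(n_iters):
        # E-step: responsibility that each token was drawn from the general model
        resp = []
        for d, doc in enumerate(docs):
            r = np.array([pi * theta_g[idx[w]] /
                          (pi * theta_g[idx[w]] + (1 - pi) * theta_d[d][idx[w]])
                          for w in doc])
            resp.append(r)
        # M-step: re-estimate pi and the unigram models from expected counts
        pi = np.concatenate(resp).mean()
        g_counts = np.zeros(len(vocab))
        for d, doc in enumerate(docs):
            d_counts = np.zeros(len(vocab))
            for r, w in zip(resp[d], doc):
                g_counts[idx[w]] += r
                d_counts[idx[w]] += 1 - r
            theta_d[d] = d_counts / max(d_counts.sum(), 1e-12)
        theta_g = g_counts / max(g_counts.sum(), 1e-12)
    return pi, theta_g, theta_d

print(em([["A", "B"], ["A", "C"]], ["A", "B", "C"]))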


EM on our simple model

Suppose we have three words: { A, B, C}

Document 1 = [A B], Document 2 = [A C]

Initialized uniformly

E-step: with uniform initialization, every word is equally likely under both models, so

E[z_11] = E[z_12] = E[z_21] = E[z_22] = 1/2

M-step:

θ_G(A) = (1/Z)(E[z_11] + E[z_21]) = 1/2
θ_G(B) = (1/Z) E[z_12] = 1/4
θ_G(C) = (1/Z) E[z_22] = 1/4

θ_D1(A) = (1/Z_1)(1 − E[z_11]) = 1/2
θ_D1(B) = (1/Z_1)(1 − E[z_12]) = 1/2
θ_D1(C) = 0

θ_D2(A) = (1/Z_2)(1 − E[z_21]) = 1/2
θ_D2(B) = 0
θ_D2(C) = (1/Z_2)(1 − E[z_22]) = 1/2

π = [ E[z_11] + E[z_12] + E[z_21] + E[z_22] ] / 4 = 1/2

(Z, Z_1, Z_2 are the corresponding normalizers.)


EM on our simple model

Suppose we have three words: { A, B, C}

Document 1 = [A B], Document 2 = [A C]

Initialized uniformly

(Figure: complete vs. incomplete log-likelihood for this example, annotated with terms such as log π, log θ_G(A), and log θ_D1(A).)


Problems with EM

For our documents, EM will always converge to this solution

BUT:

For more documents and words, there is a trivial local maximum where the general English model does nothing

This corresponds to π going to 0 (or 1)

Why is this bad?

It doesn't conform to our prior beliefs about what parameters are likely!!!

So how can we specify our prior beliefs about π?



What is a Prior?

Recall Bayes' Rule:

A prior is a specification of our beliefs about the values parameters can take, before seeing any data

Okay, so what is a belief?

Do we have the same beliefs?

What if we don't?

What if I don't know what I believe?

Posterior = Prior × Likelihood / Marginal:

p(θ | D) = p(θ) p(D | θ) / ∫ dθ' p(θ') p(D | θ')


What is a Belief?

We want to be able to state numerically our beliefs in things like:

belief(it rained last night)
belief(it will rain tonight)
belief(the next card I draw will be an ace)

And we want to know how to manipulate beliefs

Suppose you are willing to accept a bet with odds proportional to the strength of your belief

Suppose you believe a coin will come up heads 90% of the time: belief(heads) = 0.9

Then you will accept a bet that if a coin comes up heads, you win at least $1, and if it comes up tails, you lose at most $9

See also: MK[3], Ber[3.1, 2.3]


Beliefs (The Dutch Book)

IF:

Suppose you believe a coin will come up heads 90% of the time: belief(heads) = 0.9

Then you will accept a bet that if a coin comes up heads, you win at least $1, and if it comes up tails, you lose at most $9

THEN:

Unless your beliefs satisfy the rules of probability including Bayes' Rule, then I can take arbitrary amounts of money from you

SO:

The only way not to go broke is to ensure that your beliefs agree with probability theory

AND BAYES' RULE!

See also: Ber[4.8]


How do my beliefs compare to Kevin's?

Maybe they do, maybe they don't

Does it matter?

If we have enough data, does it matter?

No! Theorem* :

With enough data, the posterior p(θ | x_1, ..., x_n) becomes essentially independent of the prior: two people who start from different priors (but share the likelihood) end up with posteriors that agree as n → ∞

* under some regularity conditions

See also: Ber[4.7, 4.8], Was[11.5]


Specifying Priors

A prior is a map that:

Assigns to every setting of parameters a non-negative real value

Integrates to 1 over the parameter space

Such a beast can be difficult to describe! Tools:

When the parameters are discrete, we can often set them by hand

Unless we're in high dimensions with (prior) interaction among parameters

Otherwise, we will often choose a parametric prior

and deal with the hyper-parameters by one of many means:

Set them subjectively (subjective/true Bayes)

Integrate them out analytically (often not possible and often suboptimal)

Choose them in such a way to be objective (objective Bayes)

Optimize them from the data by marginal likelihood (empirical Bayes, Type II ML)

Or choose a set of priors and integrate over them (robust Bayes)

...

∫ dθ p(θ) = 1

See also: Ber[3.1-6, 4.7], MK, Was[11.1]


Exponential Family

A set of distributions of the form:

p(x | θ) = exp( θᵀ u(x) − A(θ) )
  θ: natural parameter;  u(x): sufficient statistics;  A(θ): (log) normalization factor

Using exponential-family distributions is very convenient:

They are convex with respect to the parameters

They have natural prior distributions

They have several convenient properties wrt moments, e.g. ∇_θ A(θ) = E[ u(X) ]

See also: Ber[8.7], MK[22], Was[9.13]


Subjective Bayes

Eliciting priors can be very difficult; several options:

The histogram approach

Simple, but how big are intervals, no tails, ...

The relative likelihood approach

Typically easier to elicit, but still no tails

Moment matching of parametric forms

Most used and misused approach; now we have tails, but they are hard to elicit

Alternative is to use quantiles

Cumulative distribution function determination

Choose quantiles of the CDF


Objective Bayes

Desire to give a prior that contains no information

For instance, the improper uniform prior on the real line: p(θ) ∝ 1

These are rarely consistent across reparameterizations

For scale parameters, p(θ) ∝ 1/θ is more appropriate

Jeffreys' prior: p(θ) ∝ sqrt( det I(θ) ), where I(θ) is the expected Fisher information, I(θ) = −E[ ∂² log p(x | θ) / ∂θ ∂θᵀ ]

Not affected by restriction on the parameter space
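As a small worked instance (ours, not from the slides): for a Bernoulli likelihood the Fisher information is I(θ) = 1/(θ(1−θ)), so Jeffreys' prior is proportional to θ^(−1/2)(1−θ)^(−1/2), i.e. a Beta(1/2, 1/2).

import numpy as np

def fisher_information_bernoulli(theta):
    # I(theta) = -E[ d^2/dtheta^2 log p(x | theta) ] for x ~ Bernoulli(theta)
    return 1.0 / (theta * (1.0 - theta))

thetas = np.linspace(0.01, 0.99, 99)
unnorm = np.sqrt(fisher_information_bernoulli(thetas))        # Jeffreys' prior, unnormalized
jeffreys = unnorm / (unnorm.sum() * (thetas[1] - thetas[0]))  # normalize numerically
print(jeffreys[0], jeffreys[49], jeffreys[-1])   # density piles up near 0 and 1, dips at 0.5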


Empirical Bayes

Specify a class of priors (typically a functional form):

Estimate the prior by maximizing the marginal likelihood:

Class of priors: { p(θ | γ) : γ }

γ* = argmax_γ p(D | γ) = argmax_γ ∫ dθ p(θ | γ) p(D | θ)


A Bit of Philosophizing

Bayesians are often criticized for the use of priors:

Priors are subjective, science is objective

Priors can fail to be robust:

Small changes in the prior can result in large changes in the decision

Bayesian statistics requires thought and introspection

Do we really have degrees of belief? Is betting a good model?

Does infinite data really lead to agreement?

The computations are too difficult

Recall the goal of the statistician:

Statistics aims to do for inductive reasoning what Frege did for deductive reasoning

Do these complaints matter from the perspective of machine learning?

“The subjectivist states his judgments, whereas the objectivist sweeps them under the carpet by calling assumptions knowledge, and he basks in the glorious objectivity of science.” (Good, 1973)


The Likelihood Principle

Principle states:

All relevant information about the parameters after the data is observed is contained in the likelihood for the observed data. Furthermore, two likelihood functions contain the same information about the parameters if they are proportional to each other.

Example: We want to determine if a coin is biased or not
Experiment: the coin is flipped and we come up with 9 heads and 3 tails
Two possibilities for how the experiment was performed:
  We decided to do a total of 12 flips
  We decided to keep flipping the coin until 3 tails were observed

The Bayesian doesn't care

The classical statistician would compute:

7.5% 'chance' under option 1

3.25% 'chance' under option 2


Conjugate (convenient) Priors

Given a distribution

And a prior

The prior is conjugate if:

The posterior distribution of the parameter after observing some data has the same functional form as the prior distribution

Beta/Binomial Example:

Bin(x | N, θ) = C(N, x) θ^x (1 − θ)^{N − x}

Beta(θ | a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] θ^{a − 1} (1 − θ)^{b − 1}

p(θ | x, N, a, b) ∝ θ^{x + a − 1} (1 − θ)^{N − x + b − 1} ∝ Beta(θ | a + x, b + N − x)

See also: Ber[4.2.2], MK[23]


Binomial and Beta Distributions

Binomial distribution models flips of coins (domain={ 0,1} ):

Probability that a coin with bias θ, flipped N times, will come up heads x times
  Parameters: θ ∈ [0, 1], N
  Distribution: Bin(x | N, θ) = C(N, x) θ^x (1 − θ)^{N − x}
  Moments: E[x] = N θ,  Var[x] = N θ (1 − θ)

Beta distribution models nothing (we care about) (domain = [0, 1]):
  Parameters: a > 0, b > 0
  Distribution: Beta(θ | a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] θ^{a − 1} (1 − θ)^{b − 1}
  Moments: E[θ] = a / (a + b),  Var[θ] = a b / [ (a + b)² (a + b + 1) ]

Beta is conjugate to binomial:
  Posterior parameters: a' = a + x,  b' = b + N − x
  Marginal distribution: p(x | a, b) = C(N, x) [ Γ(a + b) / (Γ(a) Γ(b)) ] Γ(a + x) Γ(b + N − x) / Γ(a + b + N)
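A small sketch (ours; the prior values are assumed, the data are the 9-heads/3-tails example from the likelihood-principle slide) of the conjugate update in code:

from scipy import stats

a, b = 2.0, 2.0                               # prior hyper-parameters (assumed)
N, x = 12, 9                                  # flips and observed heads

a_post, b_post = a + x, b + (N - x)           # posterior is Beta(a + x, b + N - x)
posterior = stats.beta(a_post, b_post)
print("posterior mean:", posterior.mean())    # (a + x) / (a + b + N)
print("95% interval:", posterior.interval(0.95))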


Beta Distribution Examples

(Figure: Beta(θ | a, b) densities for a ∈ {0.5, 1, 4} and b ∈ {0.5, 1, 4}.)

Beta(θ | a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] θ^{a − 1} (1 − θ)^{b − 1}


Multinomial Distribution

A distribution over counts of K>1 discrete events (words)

Domain: x = (x_1, ..., x_K), x_k ∈ {0, 1, ..., N}, Σ_k x_k = N

Parameters: θ = (θ_1, ..., θ_K), θ_k ≥ 0, Σ_k θ_k = 1

Distribution: Mult(x | θ, N) = [ N! / (x_1! ··· x_K!) ] Π_k θ_k^{x_k}

Moments: E[x_k] = N θ_k,  Var[x_k] = N θ_k (1 − θ_k)

(Figure: the space of count vectors, a simplex with corners (N,0,0), (0,N,0), (0,0,N).)


Dirichlet Distribution

A distribution over a probability simplex

Domain: θ = (θ_1, ..., θ_K), θ_k ≥ 0, Σ_k θ_k = 1 (the probability simplex)

Parameters: α = (α_1, ..., α_K), α_k > 0

Distribution: Dir(θ | α) = [ Γ(Σ_k α_k) / Π_k Γ(α_k) ] Π_k θ_k^{α_k − 1}

Moments: E[θ_k] = α_k / Σ_j α_j,  Var[θ_k] = α_k (Σ_j α_j − α_k) / [ (Σ_j α_j)² (Σ_j α_j + 1) ]

(Figure: Dirichlet densities on the simplex for α = [1, 1, 2] and α = [5, 5, 10].)


Multinomial/Dirichlet Pair

Multinomial distribution: Mult(x | θ, N) = [ N! / Π_k x_k! ] Π_k θ_k^{x_k}

Dirichlet distribution: Dir(θ | α) = [ Γ(Σ_k α_k) / Π_k Γ(α_k) ] Π_k θ_k^{α_k − 1}

Posterior hyper-parameters: α_k' = α_k + x_k

Marginal distribution: p(x | α, N) = [ N! / Π_k x_k! ] [ Γ(Σ_k α_k) / Γ(Σ_k α_k + N) ] Π_k [ Γ(α_k + x_k) / Γ(α_k) ]
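A minimal sketch (ours, with made-up counts) of the Dirichlet/multinomial update: the posterior hyper-parameters are simply the prior pseudo-counts plus the observed counts.

import numpy as np

alpha = np.array([1.0, 1.0, 2.0])      # prior hyper-parameters (assumed)
counts = np.array([5, 0, 2])           # observed counts for three word types (assumed)

alpha_post = alpha + counts            # posterior is Dir(alpha + counts)
print("posterior hyper-parameters:", alpha_post)
print("posterior mean:", alpha_post / alpha_post.sum())   # compare to the MLE counts / counts.sum()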


Gaussian/Gaussian-Gamma

Gaussian distribution (τ is the precision): N(x | μ, τ) = sqrt(τ / 2π) exp( −τ (x − μ)² / 2 )

Gaussian prior on the mean: μ | τ ~ N(μ_0, (κ_0 τ)^{-1})

Gamma prior on the precision: τ ~ Gam(α_0, β_0)

Posterior hyper-parameters (after observing x_1, ..., x_N with mean x̄):
  μ_N = (κ_0 μ_0 + N x̄) / (κ_0 + N),   κ_N = κ_0 + N
  α_N = α_0 + N/2,   β_N = β_0 + ½ Σ_n (x_n − x̄)² + κ_0 N (x̄ − μ_0)² / (2 (κ_0 + N))

Marginal distribution: a non-standard Student's t distribution


Gamma Distribution

Gam(τ | α, β) = [ β^α / Γ(α) ] τ^{α − 1} exp( −β τ )

Moments: E[τ] = α / β,  Var[τ] = α / β²

(Figure: Gamma densities for α ∈ {1, 2, 4} and β ∈ {1, 2, 4}.)


Summary of Distributions

Distribution   Domain     Prior      Parametric Form
Binomial       Binary     Beta       C(N, x) θ^x (1 − θ)^(N − x)
Multinomial    K classes  Dirichlet  [N! / Π_k x_k!] Π_k θ_k^(x_k)
Beta           [0, 1]     —          [Γ(a + b) / (Γ(a) Γ(b))] θ^(a − 1) (1 − θ)^(b − 1)
Gamma          [0, ∞)     —          [β^α / Γ(α)] τ^(α − 1) exp(−β τ)
Dirichlet      Simplex    —          [Γ(Σ_k α_k) / Π_k Γ(α_k)] Π_k θ_k^(α_k − 1)
Gaussian       Reals      Nor/Gam    sqrt(τ / 2π) exp(−τ (x − μ)² / 2)
Cauchy         Reals      —          1 / [π γ (1 + ((x − x_0) / γ)²)]
Student's t    Reals      —          ∝ (1 + x² / ν)^(−(ν + 1) / 2)



Recall our summarization model

(Graphical model: the summarization model from before, with z, w, π, θ_G, θ_D.)

The problem was that we don't believe that it's okay for π to go to 0 or 1

Solution? Put a prior on π!

What's a good prior? π lives in [0, 1], so a natural (conjugate) choice is a Beta: p(π | a, b) ∝ π^{a − 1} (1 − π)^{b − 1}


Bayesianified summarization model

(Graphical model: as before, but π now has a Beta(a, b) prior with hyper-parameters a and b.)

w_mn | z_mn, θ ~ p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}
z_mn | π ~ Bin(π)
π | a, b ~ Beta(a, b)

p(w | a, b, θ) = ∫ dπ Beta(π | a, b) Π_m Π_n Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}


The Integration Problem

In general, we want to compute something like:

E_p[ f ] = ∫ dθ p(θ) f(θ)

Examples:

Summarization model:
  p is the posterior of the hidden variables
  f is the probability of the words

Classification model:
  p is the posterior of the model parameters
  f is the prediction function

Integral normalization:
  p is uniform
  f is a probability measure (i.e., unnormalized probability distribution)

See also: Ber[4.2, 4.3], MK[IV]



Maximum a Posteriori

Not Bayesian, but sometimes effective

Choose a, b by hand, proceed as before, but now we only need to maximize over π and θ (the θ updates are unchanged)

p(π | a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] π^{a − 1} (1 − π)^{b − 1}

l(π, θ; w) = log p(π | a, b) + Σ_m Σ_n log Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}

Amounts to a simple form of smoothing:

Now, we can just maximize π as before, but with fake counts added proportional to the prior:

π_MAP = [ (a − 1) + Σ_mn E[z_mn] ] / [ (a − 1) + (b − 1) + MN ]

(Graphical model: as before, with π ~ Beta(a, b), z_mn | π ~ Bin(π), and w_mn generated from θ_G or θ_D according to z_mn.)
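A tiny sketch (ours, with assumed hyper-parameters and example E-step values) of the fake-count view of the MAP estimate for π:

import numpy as np

a, b = 3.0, 3.0                                    # hand-chosen hyper-parameters (assumed)
expected_z = np.array([0.9, 0.8, 0.7, 0.9])        # E[z_mn] from an E-step (example values)

pi_mle = expected_z.mean()
pi_map = ((a - 1) + expected_z.sum()) / ((a - 1) + (b - 1) + len(expected_z))
print(pi_mle, pi_map)    # the MAP estimate is pulled toward the prior mode (a-1)/(a+b-2) = 0.5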


Maximum a Posteriori Temptation

I don't want to specify my prior! Let me estimate it:

First, I find a,b that maximize the marginal likelihood

Then, I use this a,b as the smoothing parameters for π

This is NOT VALID! Why?

We are 'double counting' the evidence



Summing: replace the integral with a sum over a grid R of parameter values:

∫ dθ p(θ) f(θ) ≈ Σ_{θ ∈ R} p(θ) f(θ) Δθ


Summing in our Model

Simply rewrite the integral as a sum:

Now we can compute expectations of z easily and use these for the M-step of EM

(Graphical model: as before, with π ~ Beta(a, b).)

p(w | a, b, θ) = ∫ dπ [ Γ(a + b) / (Γ(a) Γ(b)) ] π^{a − 1} (1 − π)^{b − 1} Π_m Π_n Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}

≈ Σ_{π ∈ R} [ Γ(a + b) / (Γ(a) Γ(b)) ] π^{a − 1} (1 − π)^{b − 1} Π_m Π_n Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}
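A rough sketch (ours; the word probabilities are the values from the earlier EM example and are held fixed) of replacing the integral over π with a sum over a grid R, and using the result to get an expectation of z:

import numpy as np
from scipy.stats import beta

a, b = 2.0, 2.0
p_g = np.array([0.5, 0.25, 0.25])      # general-English probs for words A, B, C (example)
p_d = np.array([[0.5, 0.5, 0.0],       # document-specific probs, doc 1
                [0.5, 0.0, 0.5]])      # doc 2
docs = [[0, 1], [0, 2]]                # word indices: doc1 = [A B], doc2 = [A C]

grid = np.linspace(0.01, 0.99, 99)     # the grid R of pi values
weights = beta(a, b).pdf(grid)         # prior weight at each grid point
for d, doc in enumerate(docs):
    for w in doc:                      # multiply in the likelihood of each token
        weights = weights * (grid * p_g[w] + (1 - grid) * p_d[d][w])
weights /= weights.sum()               # approximate posterior over pi on the grid

d, w = 0, 0                            # expectation of z for word A in document 1
Ez = np.sum(weights * grid * p_g[w] / (grid * p_g[w] + (1 - grid) * p_d[d][w]))
print("E[z] for word A in document 1:", Ez)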


Idea: let's choose R differently



Monte Carlo: approximate the integral by an average over samples:

∫ dθ p(θ) f(θ) ≈ (1/|R|) Σ_{r ∈ R} f(θ^(r)),  with the θ^(r) drawn from (or weighted by) p

See also: MK[29], Was[24.2], And03


Uniform Sampling

Pros:

Can now work in arbitrarily high dimensions (in theory)

Choice is now the size of R, not the width of the grid windows

Cons:

Number of samples required to get near the mode of a spiky distribution is huge

True distribution is rarely uniform

∫ dθ p(θ) f(θ) ≈ (1/|R|) Σ_{r ∈ R} p(θ^(r)) f(θ^(r))  (up to the volume of the sampled region), with the θ^(r) drawn uniformly

See also: MK[29], Was[24.2], And03


See also: MK[30], And03


Rejection Sampling

Pros:

Again, if q is close to p, we will get good samples (i.e., few samples will be rejected)

Cons:

Hard to construct such a q

With p and q zero-mean Gaussians and σ_q just 1% larger than σ_p, we must set c = (σ_q / σ_p)^D, which for D = 1000 yields an acceptance rate of about 1/20,000
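A toy Python sketch (ours) of rejection sampling: draw from a Beta(3, 2)-shaped target using a uniform proposal and an envelope constant c chosen at the target's mode:

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
target = beta(3, 2)
c = target.pdf(2.0 / 3.0)          # mode of Beta(3,2) is 2/3, so c * q(x) >= p(x) on [0, 1]

samples = []
while len(samples) < 10000:
    x = rng.uniform(0, 1)          # propose from q = Uniform(0, 1)
    u = rng.uniform(0, c)          # accept with probability p(x) / (c * q(x))
    if u < target.pdf(x):
        samples.append(x)
print(np.mean(samples))            # close to E[Beta(3,2)] = 0.6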



Markov Chain Monte Carlo

Monte Carlo methods suffer because the proposal density needs to be similar to the true density everywhere

MCMC methods get around this problem by changing the proposal density after each sample

General framework:

Choose a proposal density q(· | x) parameterized by the current location x

Initialize state x arbitrarily

Repeatedly sample by:

Propose a new state x' from q(x' | x)

Either accept or reject this new state

If accepted, set x = x'

New problem: samples are no longer independent!

See also: MK[30], Was[24.4], And03


Metropolis-Hastings Sampling

Accept new states with probability:

a = min( 1, [ p(x') q(x_0 | x') ] / [ p(x_0) q(x' | x_0) ] )

Only put every Nth sample into R

(Figure: target p(x), current state x_0, proposed state x', and the proposal densities q(· | x_0) and q(· | x').)

See also: MK[30], Was[24.4], And03
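A compact Python sketch (ours; the unnormalized target below is just a stand-in for a posterior over π) of the Metropolis-Hastings loop with a symmetric Gaussian random-walk proposal, for which the q terms cancel:

import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    # unnormalized target: a Beta(3, 2)-shaped density on (0, 1)
    return x**2 * (1 - x) if 0 < x < 1 else 0.0

x = 0.5                       # initial state
samples = []
for t in range(50000):
    x_prop = x + 0.1 * rng.standard_normal()      # symmetric proposal q(x' | x)
    accept = min(1.0, p_tilde(x_prop) / p_tilde(x))
    if rng.uniform() < accept:
        x = x_prop
    if t % 10 == 0:           # thin: only put every 10th sample into R
        samples.append(x)
print(np.mean(samples))       # approaches E[Beta(3,2)] = 0.6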


MH in our Model

Invent a proposal distribution q

Or, condition on all variables:

Now we can compute expectations of z easily and use these for the M-step of EM

Alternatively, we could propose values for LMs in the sampling

(Graphical model: as before, with π ~ Beta(a, b).)

w_mn | z_mn, θ ~ p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn};  z_mn | π ~ Bin(π);  π ~ Beta(a, b)

Propose new values of π (and/or the z's) from q, and accept with the MH probability computed from the joint p(w, z, π | a, b, θ)


Metropolis-Hastings Sampling

Pros:

No longer need to specify a universally good proposal distribution; only a locally good one

Simple proposal distributions can go far

Cons:

Hard to tell how far to space samples:

Suppose we use spherical proposals; then we need on the order of (σ_max / σ_min)² steps per independent sample, where the σ's are the length scales (major/minor axes) of the density p

Auto-correlation can be used to track this

(Figure: target p(x), states x_0 and x', and proposal densities q(· | x_0), q(· | x').)


Gibbs Sampling

Defined only for multidimensional problems

Useful when you can take out one variable and explicitly sample the rest

(Figure: alternately sampling x_1 from p(x_1 | x_2) and x_2 from p(x_2 | x_1) in a two-dimensional example.)

x_i ~ p(x_i | x_1, ..., x_{i−1}, x_{i+1}, ..., x_D)

See also: MK[30], Was[24.5], And03


Gibbs Sampling

Typically our parameters are a vector: θ = (θ_1, ..., θ_D)

If, for each i, we can draw a sample from the conditional

θ_i ~ p(θ_i | θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_D, data)

then we can use Gibbs sampling

In graphical models, this conditional only depends on the Markov blanket of the variable (its parents, its children, and its children's other parents):

p(x | everything else) ∝ p(x | parents(x)) Π_{c ∈ children(x)} p(c | parents(c))

(Figure: a small graphical model with nodes a, b, c, d, e, f illustrating a Markov blanket, next to the two-dimensional Gibbs sampling picture.)


Gibbs in our Model

Compute conditional probabilities

Now we can compute expectations of z easily and use these for the M-step of EM

Alternatively, we could propose values for LMs in the sampling

(Graphical model: as before, with π ~ Beta(a, b).)

p(π | z, a, b) = Beta( π | a + Σ_mn z_mn,  b + Σ_mn (1 − z_mn) )

p(z_mn = 1 | π, w) = π p(w_mn | θ_G) / [ π p(w_mn | θ_G) + (1 − π) p(w_mn | θ_D) ]
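A rough Python sketch (ours; the word models are the example values from the EM slides and are held fixed) that alternates between the two conditionals above:

import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 2.0
p_g = np.array([0.5, 0.25, 0.25])          # general-English word probs (example)
p_d = np.array([[0.5, 0.5, 0.0],
                [0.5, 0.0, 0.5]])          # document-specific word probs (example)
docs = [[0, 1], [0, 2]]                    # doc1 = [A B], doc2 = [A C]

pi, z = 0.5, [[1, 1], [1, 1]]
pi_samples = []
for it in range(5000):
    # sample each z_mn from its conditional given pi
    for m, doc in enumerate(docs):
        for n, w in enumerate(doc):
            num = pi * p_g[w]
            z[m][n] = rng.uniform() < num / (num + (1 - pi) * p_d[m][w])
    # sample pi from its conditional given all the z's
    n_g = sum(sum(row) for row in z)
    n_d = sum(len(doc) for doc in docs) - n_g
    pi = rng.beta(a + n_g, b + n_d)
    if it >= 500:                          # discard burn-in
        pi_samples.append(pi)
print("posterior mean of pi:", np.mean(pi_samples))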


Gibbs Sampling

Pros:

Designed to work in high-dimensional spaces

Terribly simple to implement

Automatable

Cons:

Hard to judge convergence; can require many, many samples to get an independent one (often worse than MH)

Only applicable when the conditional distributions are 'nice'

(Though there are ways around this)

(Figure: the two-dimensional Gibbs sampling illustration from the previous slide.)



Laplace (Saddlepoint) Approximation

Idea: approximate the expectation by a quadratic (Taylor expansion) and use the normalizing constant from the resulting Gaussian distribution

∫ dx p(x) f(x) = ∫ dx g(x) ≈ ∫ dx q(x),  where q is a Gaussian fit at a mode of g

(Figure: p(x) f(x) = g(x) and its Gaussian approximation q(x) around the mode x_0.)

See also: MK[27]


Laplace Approximation

Find a mode x_0 of the high-dimensional distribution g

Approximate ln g(x) by a Taylor expansion around this mode:

ln g(x) ≈ ln g(x_0) − ½ (x − x_0)ᵀ A (x − x_0)

Compute the matrix A of second derivatives (the negative Hessian of ln g at x_0):

A_ij = − ∂² ln g(x) / ∂x_i ∂x_j |_{x = x_0}

The exponential form is a Gaussian distribution; use the Gaussian normalizing constant:

∫ dx g(x) ≈ g(x_0) sqrt( (2π)^D / det A )

(Figure: g(x) and its Gaussian approximation q(x) at the mode x_0.)


Laplace in our Model

Compute second derivatives of the log joint with respect to π:

ln g(π) = (a − 1) ln π + (b − 1) ln(1 − π) + Σ_mn ln[ π p(w_mn | θ_G) + (1 − π) p(w_mn | θ_D) ] + const

∂ ln g / ∂π = (a − 1)/π − (b − 1)/(1 − π) + Σ_mn [ p(w_mn | θ_G) − p(w_mn | θ_D) ] / [ π p(w_mn | θ_G) + (1 − π) p(w_mn | θ_D) ]

∂² ln g / ∂π² = −(a − 1)/π² − (b − 1)/(1 − π)² − Σ_mn [ p(w_mn | θ_G) − p(w_mn | θ_D) ]² / [ π p(w_mn | θ_G) + (1 − π) p(w_mn | θ_D) ]²

(Graphical model: as before, with π ~ Beta(a, b).)


Laplace Approximation

Pros:

Deterministic

Efficient if A is of a suitable form (i.e., diagonal or block-diagonal)

Can apply transformations to make the quadratic approximation more reasonable

Cons:

Poor fit for multimodal distributions

Often, det A cannot be found efficiently

∫ dx g(x) ≈ g(x_0) sqrt( (2π)^D / det A )

(Figure: g(x) and its Gaussian approximation q(x) at the mode x_0.)



Variational Approximation

Basic idea: replace intractable p with tractable q

Old Problem:

We cannot come up with a good, single, q to approximate p

Key Idea:

Consider a family of distributions with 'variational parameters'

Choose a member q from Q that is closest to p

New problems:

How do we choose Q?

How do we measure 'closeness' between q and p?

Q = { q(·; φ) : φ }  (a family of distributions indexed by variational parameters φ)

See also: MK[33], Wain03, Min03


Recall EM and Jensen's Inequality

Jensen gives us:

log p(x | θ) = log Σ_z p(x, z | θ) ≥ Σ_z Q(z) log [ p(x, z | θ) / Q(z) ] ≡ L(Q, θ)

Where we chose Q(z) = p(z | x, θ) to turn the inequality into an equality. But we can also compute, for any choice of q:

log p(x | θ) − L(q, θ) = Σ_z q(z) log [ q(z) / p(z | x, θ) ] = KL( q(z) || p(z | x, θ) )

So maximizing the lower bound L over q is the same as minimizing the KL divergence from q to the true posterior


Variational EM

Parameterize q and directly optimize:

Iterate:

V-Step: Compute variational parameters to minimize KL

E-Step: Compute expectations of hidden variables wrt q

M-Step: Maximize L wrt the true model parameters

Art: inventing q so that this is all tractable

L(q_ν, θ) = E_{q_ν}[ log p(x, z | θ) ] − E_{q_ν}[ log q_ν(z) ] = log p(x | θ) − KL( q_ν(z) ‖ p(z | x, θ) ),   with variational parameters ν
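As a concrete (if standard) instance of these alternating updates, here is a factorized variational approximation for a univariate Gaussian with unknown mean and precision and conjugate priors; the data and prior settings are invented for illustration, and the closed-form updates below follow the usual mean-field derivation rather than the summarization model used later in the tutorial.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=200)        # synthetic data (true mean 2.0, precision ~0.44)
N, xbar = len(x), x.mean()

# Priors (assumed values for the sketch): mu ~ N(mu0, (lam0*tau)^-1), tau ~ Gamma(a0, b0)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = a0 / b0                           # initialize E[tau]
for _ in range(50):
    # update q(mu) = N(mN, 1/lamN)
    mN = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lamN = (lam0 + N) * E_tau
    # update q(tau) = Gamma(aN, bN), using expectations under q(mu)
    aN = a0 + (N + 1) / 2.0
    Ssq = np.sum((x - mN) ** 2) + N / lamN
    bN = b0 + 0.5 * (Ssq + lam0 * ((mN - mu0) ** 2 + 1.0 / lamN))
    E_tau = aN / bN

print("E[mu] =", mN, " E[tau] =", E_tau)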

Slide 86

Variational: Choosing Q

Mixture model:

The posterior p(θ, z | w) couples θ and z; instead, approximate it with a fully factorized distribution

q(θ, z) = q(θ | γ) ∏_n q(z_n | φ_n)

with variational parameters γ (Dirichlet/Beta-like) and φ_n (multinomial).

Key: θ and z are now not tied in the q distribution!

[plate diagrams: the original model over w and z (plates N, M, G, D) alongside the factorized variational distribution with parameters γ and φ]

Slide 87

Variational EM

Step 1: Write out the log likelihood function:

L(γ, φ; θ) = E_q[ log p(θ | a, b) ] + Σ_n E_q[ log p(z_n | θ) ] + Σ_n E_q[ log p(w_n | z_n) ] − E_q[ log q(θ | γ) ] − Σ_n E_q[ log q(z_n | φ_n) ]

Slide 88

Variational EM

Step 2: Simplify and compute expectations

[Step 2 rewrites each term of L in terms of the variational parameters: expectations of log θ under q(θ | γ) become digamma differences, and expectations over z_n become sums weighted by φ_n.]

Slide 89

Computing Expectations

Thanks to sufficient statistics!

For θ ~ Dirichlet(γ) (a Beta in the two-component case), log θ is the sufficient statistic, so

E_q[ log θ_k ] = Ψ(γ_k) − Ψ( Σ_j γ_j )

where Ψ is the digamma function; every expectation needed in L has this kind of closed form.

Slide 90

Optimizing Variational Parameters

Step 3: Given the full expression for L, differentiate wrt the VPs

[collect the terms of L that involve each variational parameter, add a Lagrange multiplier for the normalization constraints, and set the derivative to zero]

Slide 91

Optimizing Variational Parameters

L

� ��� � � � � � � � � �� �

� � � � � � � �� � �� � �� �

�� � � � � � � �� � �

γ

� � � � � �γ

� � � � �� �

� � � � � � � � � � � � � � �� �

�� � � � �� � � � � ��� � �� �

L� �� � � � � ��� � � � � ��� � �

� � ��

� � � � � � � �� � � � � � � � �

� � ��

� �� � � ��

�� �

� �� � � ��

�� � �� � �

Slide 92

Optimize Model Parameters

Step 4: Optimize the model parameters:

[holding γ and φ fixed, maximize L wrt the model parameters; for the multinomial emission parameters this is a weighted maximum-likelihood update that uses the φ_n as soft counts]

Slide 93

Optimize Model Parameters

Finally, a and b:

[the terms of L involving a and b come from E_q[ log p(θ | a, b) ]; they involve digamma functions of a and b and have no closed-form maximizer]

→ solve using optimization techniques

Slide 94

VEM in our Model

Iterate:

Optimize variational parameters:

Optimize model parameters:

[plate diagram of the model (w, z, N, M, G, D) with the variational updates for γ and φ and the model-parameter updates; a and b are fit with generic optimization techniques]

Slide 95

Variational EM Summed Up

Steps:

Write down the conditional likelihood and choose an approximating distribution (e.g., by factoring everything) with variational parameters

Iterate between optimizing the VPsand model parameters

Pros:

Efficient, deterministic, often quite accurate

Cons:

At its heart, still a mode-based technique

Often underestimates the spread of a distribution

Approximation is local

Slide 96

Tutorial Outline

Slide 97

Expectation Propagation

Basic idea: replace intractable p with product of tractable q

Generally we want to compute:

Approximate each factor:

Integral is approximated by:

approximate terms should be chosen to make the EP operations tractable

∫ p(D, θ) dθ = ∫ ∏_i t_i(θ) dθ        (typically t_0(θ) is the prior)

Approximate each factor t_i(θ) by a simpler t̃_i(θ), so the integral is approximated by

∫ ∏_i t̃_i(θ) dθ

See also: Min01

Slide 98

Expectation Propagation Algorithm

Initialize approximate terms

Compute the approximate posterior:

Iterate:

Select a term t̃_i to update

Delete t̃_i from the posterior by dividing and renormalizing:

Match moments: minimize the KL divergence between q^{\i}(θ) t_i(θ) and the new posterior q^new(θ)

Update t̃_i

q(θ) ∝ ∏_i t̃_i(θ)

q^{\i}(θ) ∝ q(θ) / t̃_i(θ)

q^new(θ) = argmin_q KL( q^{\i}(θ) t_i(θ) ‖ q(θ) )

t̃_i(θ) ∝ q^new(θ) / q^{\i}(θ)
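A small grid-based sketch of this loop in Python, for a 1-D toy posterior with a Gaussian prior and heavy-tailed (Student-t) likelihood factors; the data, prior, and degrees of freedom are made up, each site is stored in natural parameters, and the moment matching is done by numerical integration on a grid.

import numpy as np

grid = np.linspace(-10, 10, 4001)
dx = grid[1] - grid[0]
y = np.array([1.0, 1.5, 2.5])                  # hypothetical observations
nu = 3.0                                       # Student-t degrees of freedom

def t_factor(i, x):                            # exact (intractable) factor t_i(x)
    return (1.0 + (x - y[i]) ** 2 / nu) ** (-(nu + 1) / 2)

prior_prec, prior_beta = 1.0 / 100.0, 0.0      # N(0, 100) prior, kept exact
r = np.zeros(len(y))                           # site precisions
beta = np.zeros(len(y))                        # site precision-times-mean

for _ in range(20):
    for i in range(len(y)):
        q_prec = prior_prec + r.sum()
        q_beta = prior_beta + beta.sum()
        cav_prec, cav_beta = q_prec - r[i], q_beta - beta[i]     # delete site i
        if cav_prec <= 0:                      # skip an update that would be improper
            continue
        cav_mean, cav_var = cav_beta / cav_prec, 1.0 / cav_prec
        tilted = np.exp(-0.5 * (grid - cav_mean) ** 2 / cav_var) * t_factor(i, grid)
        Z = tilted.sum() * dx                  # moment-match the tilted distribution
        mean = (grid * tilted).sum() * dx / Z
        var = ((grid - mean) ** 2 * tilted).sum() * dx / Z
        r[i] = 1.0 / var - cav_prec            # new site = new posterior / cavity
        beta[i] = mean / var - cav_beta

q_prec = prior_prec + r.sum()
print("EP posterior mean %.3f, variance %.3f" % ((prior_beta + beta.sum()) / q_prec, 1.0 / q_prec))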

Slide 99

EP in our Model

Our integral looks like:

We approximate non-prior terms:

The approximate posterior is:

[plate diagram (w, z, N, M, G, D); each non-prior factor is approximated by a term that resembles a Beta/Dirichlet with parameters, so the approximate posterior is again Beta/Dirichlet-like]

Slide 100

EP in our Model

Delete:

Match:

Choose such that:

Update:

[plate diagram (w, z, N, M, G, D) with the EP update equations for this model, annotated with: a learning rate, the difference between actual and expected sufficient statistics, the approximate normalization, and the old and new normalization constants]

Slide 101

Summing up EP

Approximate the terms by a product; iteratively minimize each term's KL divergence to the true posterior

Pros:

Efficient, accurate

Global approximation to the integral

Cons:

Typically requires lots of human effort

Only gives an approximation to the integral (not a bound)

Sometimes difficult to productize distributions

Doesn't necessarily converge

Slide 102

Tutorial Outline

Slide 103

Summary of Methods

MAP — Pros: just as easy as EM. Cons: not Bayesian, overfits; only simple models.

Summing — Pros: easy to implement; arbitrarily accurate. Cons: bounded regions; impossible for d > 2.

Monte Carlo — Pros: simple; few tunable params. Cons: proposal dist. is hard; lots of samples required.

MCMC — Pros: should work in d >> 2; no need for global PDs. Cons: hard to discover convergence; lots of useless computation.

Laplace — Pros: not much harder than MAP; efficient for good A. Cons: poor fit for multimodal; inefficient for bad A.

Variational — Pros: efficient, deterministic; often very accurate. Cons: still (massive) mode-based; local approximations.

EP — Pros: efficient, deterministic; very accurate. Cons: lots of human effort; weak convergence guarantees.

Slide 104

Message Passing Algorithms

Two major choices:

What approximating distribution should we use?

What cost should we minimize?

Power EP — exponential family, D_α(p ‖ q)
Struct MF — exponential family, KL(q ‖ p)
Frac BP — factorized, D_α(p ‖ q)
EP — exponential family, KL(p ‖ q)
Mean Field — factorized, KL(q ‖ p)
Tree Rep — factorized, D_α(p ‖ q)
BP — factorized, KL(p ‖ q)

(The grid arranges these along an α axis running from KL(q ‖ p) through KL(p ‖ q) and on to α > 1; the α-divergence is

D_α(p ‖ q) = [ ∫ α p(x) + (1 − α) q(x) − p(x)^α q(x)^{1−α} dx ] / [ α (1 − α) ].)

See also: Min05

Slide 105

Other Integration Strategies

Message Passing:

Generalized BP

Iterated conditional modes

Max-product belief revision

TRW-max-product

Laplace propagation

Penniless propagation

Bound propagation

Variational message passing

Non-message-passing:

Contrastive free energy

Bethe free energy

Spline integration

Slide 106

Empirical Evaluation of Methods

Query-focused summarization model:

[plate diagram of the query-focused summarization model: words w, latent classes z (plate N), documents (plates M, G, D), and query words S_Q, together with its generative equations]

Slide 107

Evaluation Data

All TREC data

Queries 51-350 and 401-450 (35k words)

All relevant documents (43k docs, 2.1m sents, 65.8m words)

Asked 7 annotators to select up to 4 sentences for an extract

Each annotated 25 queries (166 total)

Systems produce ranked lists of sentences

Compared on mean average precision, mean reciprocal rank and precision at 2

Computation Time:

MAP (2 hours)

Summing (2 days)

Monte Carlo (2 days)

MCMC (3 days)

Laplace (5 hours)

Variational (4 hours)

EP (2.5 hours)

Slide 108

Evaluation Results

[bar chart comparing Random, Position, IR, MAP, Summing, Monte Carlo, MCMC, Laplace, Variational, and EP on the evaluation measures, with computation times: MAP 2 hours, Summing 2 days, Monte Carlo 2 days, MCMC 3 days, Laplace 3.5 hours, Variational 4 hours, EP 2.5 hours]

Slide 109

Tutorial Outline

Slide 110

Bayesian Discriminative Models

Take a neural network and put a prior on the weights:

Computation requires a bit of calculus on Gaussians

Then use Laplace, variational or MCMC to perform integration over posterior

p(w) = N( w ; 0, σ² I ),    p(y | x, w) given by the network,

p(y | x, D) = ∫ p(y | x, w) p(w | D) dw

See also: MK[38]

Slide 111

Tutorial Outline

Slide 112

Gaussian Processes

Idea: instead of placing a prior over weights, place it over the function directly

Gaussian Process:

A collection of r.v.s such that any finite sample is jointly Gaussian

Uniquely specified by mean distribution and covariance function

f ~ GP( m(·), k(·,·) ):  for any inputs x_1, …, x_n,

( f(x_1), …, f(x_n) ) ~ N( (m(x_i))_i , (k(x_i, x_j))_{ij} )

See also: MK[38]

Slide 113

Computing with GPs

Compute the new covariance matrix:

For training inputs X with targets y and a test input x*, the joint covariance of (f(X), f(x*)) is [[ K(X, X) + σ² I , k(X, x*) ], [ k(x*, X) , k(x*, x*) ]]; conditioning the joint Gaussian gives

mean:  k(x*, X) (K + σ² I)⁻¹ y        (expected value)

variance:  k(x*, x*) − k(x*, X) (K + σ² I)⁻¹ k(X, x*)        (error bars)
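A minimal numpy sketch of these two formulas on toy 1-D data (the RBF covariance and its length-scale, signal variance, and noise level are illustrative choices, not values from the slides):

import numpy as np

def rbf(a, b, ell=1.0, sf2=1.0):
    """Squared-exponential covariance k(a, b) on 1-D inputs."""
    d = a[:, None] - b[None, :]
    return sf2 * np.exp(-0.5 * d**2 / ell**2)

# Hypothetical 1-D training data
X = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y = np.sin(X)
noise = 0.05

K = rbf(X, X) + noise * np.eye(len(X))
Xs = np.linspace(-5, 5, 9)
Ks = rbf(X, Xs)                                        # k(X, x*)
Kss = rbf(Xs, Xs)

mean = Ks.T @ np.linalg.solve(K, y)                    # expected values
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
err = 2 * np.sqrt(np.diag(cov))                        # ~95% error bars

for xs, m, e in zip(Xs, mean, err):
    print(f"f({xs:+.2f}) = {m:+.3f} +/- {e:.3f}")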

Slide 114

Efficiently Computing with GPs

Inverting the covariance matrix is cubic in the number of training points

Idea:

Iteratively add training points to maximize information gain (or some other criteria)

Only compute covariance over those

Leads to 'Informative Vector Machine (IVM)'


See also: Law03

Slide 115

Example Covariance Functions

Linear:  k(x, x') = xᵀx'

Polynomial:  k(x, x') = (1 + xᵀx')^d

RBF:  k(x, x') = σ² exp( −‖x − x'‖² / (2ℓ²) )

Combo:  sums and products of covariance functions (e.g., linear + RBF + noise) are again valid covariance functions

Bayesian advantage: we can tune all of these parameters!

Slide 116

Tutorial Outline

Slide 117

Dirichlet Processes

Suppose we don't want to limit the number of components of a mixture model

Example:

Each word in a document is drawn from a topic

We don't know how many topics there are

Allow the number of topics to grow with the data

A Dirichlet Process is a collection of r.v.s such that any finite sample is jointly Dirichlet

Parameterized by a precision and a mean distribution

G ~ DP( α, G0 )

A draw from a DP is, itself, a distribution

See also: Neal98

Slide 118

Dirichlet Distribution vs. Process

Suppose we place a DP prior:

Now we observe samples drawn from G:

The posterior (after integrating out G) is again a DP:

This is exactly like a Dirichlet Distribution:

G ~ DP( α, G0 ),    θ_1, …, θ_n | G ~ G

G | θ_1, …, θ_n ~ DP( α + n, (α G0 + Σ_i δ_{θ_i}) / (α + n) )

θ ~ Dir( α u ),    x_1, …, x_n | θ ~ θ,    θ | x_1, …, x_n ~ Dir( α u + counts(x) )

(the DP mean G0 is a distribution; the Dirichlet mean u is a vector)

Slide 119

Polya Urns

Start with an urn with a single black ball

Repeatedly draw balls:

If the drawn ball is black, replace it and put in a ball of a new color

If the ball is not black, replace it and put in a ball of the same color

Distribution is given by a DP:

θ_{n+1} | θ_1, …, θ_n ~ ( α G0 + Σ_{i=1}^{n} δ_{θ_i} ) / ( α + n )
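This urn scheme is easy to simulate; the sketch below draws from a DP with precision α = 1 and mean G0 = Uniform(0, 1) (both illustrative choices) and compares the number of distinct values against the usual α log(1 + n/α) growth.

import numpy as np

rng = np.random.default_rng(0)
alpha, n = 1.0, 1000
draws = []
for i in range(n):
    # with prob alpha/(alpha+i): "black ball" -> a new draw from G0 (a new color);
    # otherwise: copy one of the previous draws, chosen uniformly at random.
    if rng.random() < alpha / (alpha + i):
        draws.append(rng.random())
    else:
        draws.append(draws[rng.integers(i)])

print("distinct values after", n, "draws:", len(set(draws)))
print("expected number      ~", alpha * np.log(1 + n / alpha))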

Slide 120

Gibbs Sampling for the DP

Suppose we have an infinite mixture model over words:

Key: Mult is conjugate to the DP mean (Dir)

Gibbs sampling

Assign a latent class to each data point

Repeat:

Resample each class assignment z_n by:

Resample each component parameter θ_k by:

p( z_n = k | z_{−n}, w ) ∝ n_{−n,k} · p( w_n | {w_m : z_m = k, m ≠ n} )        for an existing class k

p( z_n = new | z_{−n}, w ) ∝ α · p( w_n )

θ_k | z, w ~ Dir( base-measure parameters + word counts in class k )
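A compact collapsed Gibbs sketch for such an infinite mixture of words, using the Polya-urn/Chinese-restaurant representation so the component parameters are integrated out rather than resampled explicitly; the corpus, vocabulary size, and the hyperparameters α and β are all made up for the example.

import numpy as np

rng = np.random.default_rng(0)
V, alpha, beta = 5, 1.0, 0.5
words = rng.integers(V, size=200)                 # hypothetical word tokens

z = np.zeros(len(words), dtype=int)               # all tokens start in one class
n_k = [len(words)]                                # tokens per class
n_kw = [np.bincount(words, minlength=V).astype(float)]   # word counts per class

for _ in range(50):
    for i, w in enumerate(words):
        k = z[i]                                  # remove token i from its class
        n_k[k] -= 1; n_kw[k][w] -= 1
        if n_k[k] == 0:                           # drop empty classes, relabel
            n_k.pop(k); n_kw.pop(k); z[z > k] -= 1
        # predictive probability of w under each existing class, plus a new one
        probs = [n_k[j] * (n_kw[j][w] + beta) / (n_k[j] + V * beta)
                 for j in range(len(n_k))]
        probs.append(alpha / V)
        probs = np.array(probs); probs /= probs.sum()
        k_new = rng.choice(len(probs), p=probs)
        if k_new == len(n_k):                     # start a new class
            n_k.append(0); n_kw.append(np.zeros(V))
        z[i] = k_new
        n_k[k_new] += 1; n_kw[k_new][w] += 1

print("number of classes:", len(n_k))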

Slide 121

Tutorial Outline

Slide 122

Bayesian Decision Theory

Bayesian statistics tells us how to compute distributions

What if we want to actually do something with a distribution?

Define a loss function:

Choose action to minimize the Bayesian expected loss:

L( a, θ )

ρ( a | x ) = ∫ L( a, θ ) p( θ | x ) dθ        (what we expect to lose, over all parameters)

Frequentist approach: define a decision rule:

Define the risk of the decision rule as:

Now, search for admissible decision rules

δ : x ↦ δ(x) = a

R( δ, θ ) = ∫ L( δ(x), θ ) p( x | θ ) dx        (what we expect to lose, over all data sets, with the parameters known)

See also: Ber[4.4]

Slide 123

Tutorial Outline

Slide 124

Bayes in Action (NLP/IR/Text)

D. Blei, A. Ng, M. Jordan, Latent Dirichlet allocation, JMLR 2003.

T. Griffiths, M. Steyvers, D. Blei, J. Tenenbaum, Integrating topics and syntax. NIPS 2004.

A. McCallum, A. Corrada-Emmanuel, X. Wang, Topic and Role Discovery in Social Networks. IJCAI 2005.

Y. Zhang, J. Callan, T. Minka, Novelty and Redundancy Detection in Adaptive Filtering. SIGIR 2002.

T. Minka, Bayesian conditional random fields. AISTATS 2005.

K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, M. Jordan. Matching words and pictures. JMLR 2003.

Slide 125

For Further Information (Books)

James O. Berger, Statistical Decision Theory and Bayesian Analysis. Springer, 1985.

David MacKay, Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.

Christopher Bishop, “The new book.” 2006.

Slide 126

For Further Information (Tutorials)

C. Andrieu, N. de Freitas, A. Doucet, M. Jordan, An Introduction to MCMC for Machine Learning. ML 2003.

M. Wainwright, M. Jordan, Graphical models, exponential families and variational inference. UCB Stat TR#649, 2003.

K. Murphy, A Brief Introduction to Graphical Models and Bayesian Networks. www.cs.ubc.ca/~murphyk/Bayes/bayes.html

T. Minka, Using lower bounds to approximate integrals. www.research.microsoft.com/~minka/papers/rem.html, 2003.

Slide 127

Other References

N. Lawrence, Fast sparse Gaussian process methods: the informative vector machine. NIPS 2003.

T. Minka, Expectation Propagation for Approximate Bayesian Inference. UAI 2001.

T. Minka, Divergence Measures and Message Passing. AI-Stats 2005.

R. Neal, Markov chain sampling methods for Dirichlet process mixture models, TR. 9815, Dept. of Statistics, University of Toronto.

Slide 128

Thank you!

Questions? Comments?