

Beyond EM: Bayesian Techniques for NLP Researchers

Hal Daumé III
Information Sciences Institute

University of Southern California

[email protected]

...or “And you thought EM was hard”

...or “And you thought EM was fun”


Tutorial Outline

Introduction to the Bayesian Paradigm

Background Material
  Graphical Models
  Expectation Maximization
  Priors, priors, priors (subjective, conjugate, reference, etc.)

Inference Problem and Solutions
  MAP
  Summing
  Monte Carlo
  Markov Chain Monte Carlo
  Laplace Approximation
  Variational Approximation
  Expectation Propagation
  Others...

Advanced Topics (time permitting)
  Bayesian discriminative models
  Non-parametric (infinite) models
  Bayesian Decision Theory

References


The Bayesian Paradigm

Every statistical problem has data and parameters

Find a probability distribution of the parameters given the data using Bayes' Rule:

Use the posterior to:

Predict unseen data (machine learning)

Reach scientific conclusions (statistics)

Make optimal decisions (Bayesian decision theory)

Posterior = Prior × Likelihood / Marginal:

p(θ | D) = p(θ) p(D | θ) / ∫ dθ' p(θ') p(D | θ')
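As a concrete (made-up) instance of the rule, a tiny Python sketch with a coin-bias parameter restricted to three candidate values:

import numpy as np

thetas = np.array([0.3, 0.5, 0.7])        # candidate parameter values
prior = np.array([1/3, 1/3, 1/3])         # prior p(theta)
heads, tails = 9, 3                       # observed data (assumed for illustration)

likelihood = thetas**heads * (1 - thetas)**tails      # p(data | theta)
marginal = np.sum(prior * likelihood)                 # p(data)
posterior = prior * likelihood / marginal             # p(theta | data)
print(posterior)    # most of the mass moves to theta = 0.7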


Why be Bayesian?

Pedagogical arguments:

All estimates, answers, decisions, etc. are consistent

Uniform treatment of all aspects of statistical modeling

Results are often interpretable

Assumptions are explicit and most can be tested

Practical arguments:

We know maximum likelihood estimators (for instance) are flawed

Who doesn't do smoothing?

I don't want to run 1000 experiments, changing a single parameter by 0.01 on each run (and I don't want you to either!)

Lets you do some fun stuff that you can't otherwise do

See also: Ber[1.6, 4.2], MK[2.1, 2.2], Was[11.1]


Why not be Bayesian?

Makes you do a lot of (or at least some) math

Could be a reason to be Bayesian, though

Computational complexity

Though it's not so bad compared to cross validation

Arbitrariness of priors

Often based on misguided impressions of either Bayesian, or frequentist, methods

Generally not a big problem for machine learning


What does it mean to be Bayesian?

It is NOT about using priors

Oooh look, I put a Gaussian prior on my maximum entropy problem...I must be a Bayesian now

It is NOT about applying Bayes' Rule

Well, I've used a noisy channel model, which employs Bayes' Rule...I'm being Bayesian, right?

It is ONLY about modeling uncertainty in all stages of statistical inference

This means making decisions by summing over all possibilities



What is this tutorial about?

Graphical models as a tool for expressing assumptions

Common statistical distributions that are useful for modeling different types of quantities

How to go from a problem specification to a model

How to specify/choose appropriate prior distributions

How to perform inference in that model

Initial focus on unsupervised learning, but will discuss supervised models at the end



What is a Model?

A statistical model is a specification of our assumptions about the data we will be modeling

What aspects are independent of others?

For non-independent aspects, how are they related?

Example 1 (maximum entropy models):

There is a class variable y and an instance variable x, and the conditional probability of y given x is a linear function of a bunch of features of x

Example 2 (machine translation):

There is something called a foreign string and something called an English string, and each foreign word is generated by either exactly one English word or a special 'null' token, according to a translation table


What is a Good Model?

We can consider models by looking at the probability that they generate our data set (the marginal likelihood of the data):

(Figure: P(data | model) plotted over all possible data sets, with curves for Model 1, Model 2, and Model 3, and the current data set marked on the axis.)

See also: MK[28.1]


What are Parameters?

Parameters are the parts of the model about which you are not completely 'certain' (or, willing to claim certainty)

Example 1 (maximum entropy models):

I 'know' that the class is linearly related to the input

I 'know' what the relevant features are

I don't know the feature weights => these are the parameters

Example 2 (machine translation):

I 'know' that foreign words are generated from exactly one English word (or the null word)

I do not know what the probability of any foreign word is, given a particular English word

(I also do not know which foreign words correspond to which English word, but this is a 'hidden variable', not a parameter.)


Graphical Models

Convenient notation for representing probability distributions and conditional independence assumptions

X — an observed random variable

X — an unobserved/hidden random variable

X — an observed/known parameter

X — an unobserved/unknown parameter

(the four node types are distinguished by shading and shape in the diagrams)

N — a plate: a submodel replicated N times

An arrow — an indication of conditional dependence

See also: Murphy


Example 1: Naïve Bayes

(Graphical model: class label Y, data vector X with F features, class 'prior' probability π, feature parameters θ; the whole submodel is replicated N times.)

For each example n: choose a class Y by
  Y_n ~ Mult(π)

For each feature f: choose X by
  X_nf ~ p(X_nf | θ_{f, Y_n})


Example 2: Maximum Entropy

(Graphical model: observed data vector X with F features, class label Y, feature parameters θ; replicated N times.)

For each example n: choose a class Y by

p(Y = y | X, θ) ∝ exp( Σ_f θ_{yf} X_f )


Example 3: Hidden Markov Models

(Graphical model: a chain of state/observation pairs (X, Y); each state depends on the previous state, and each state emits one observation.)


Example for Summarization

Consider a stupid summarization model:

Each word in a document is drawn independently

Each word is drawn either from a general English model or a document-specific model

We don't know which words are drawn from which

(Graphical model: for each of M documents and each of N words, an indicator variable z_mn selects whether word w_mn comes from the general model θ_G or the document-specific model θ_D; π governs z.)

p(w | π, θ_G, θ_D) = Π_m Π_n Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}


Fun with Graphical Models

Easy to propose extensions to the model: add sentences!

(Figure: the summarization model with an added sentence plate S nested between the document plate and the word plate.)


Fun with Graphical Models

Add queries!

(Figure: the model further extended with a query Q and its words w_Q, alongside the sentence plate S.)


Maximum Likelihood Estimators (MLE)

(Graphical model: observed X and Y in a plate of size N, with parameters π and θ.)

Take a parameterized model and some data

Find the parameters that maximize the likelihood of that data (i.e., the 'probability' of the parameters given the data):

(π*, θ*) = argmax_{π, θ} p(X_{1:N}, Y_{1:N} | π, θ) = argmax_{π, θ} log p(X_{1:N}, Y_{1:N} | π, θ)

For the naïve Bayes model with binary features:

l(π, θ; X_{1:N}, Y_{1:N}) = Σ_n Σ_k [ Y_nk log π_k + (1 − Y_nk) log(1 − π_k) ]
                          + Σ_n Σ_f [ X_nf log θ_{f Y_n} + (1 − X_nf) log(1 − θ_{f Y_n}) ]

∂l / ∂π_k = Σ_n [ Y_nk / π_k − (1 − Y_nk) / (1 − π_k) ]

∂l / ∂θ_{fk} = Σ_{n : Y_n = k} [ X_nf / θ_{fk} − (1 − X_nf) / (1 − θ_{fk}) ]



MLE with hidden variables

Consider a stupid summarization model:

Each word in a document is drawn independently

Each word is drawn either from a general English model or a document-specific model

We don't know which words are drawn from which

Compute the log-likelihood:

p(w | π, θ_G, θ_D) = Π_m Π_n Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}

l(π, θ; w) = Σ_m Σ_n log Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}

Uh oh! The log can't go inside the sum!

(Graphical model: as before, with indicator variable z_mn.)


Expectation Maximization

We would like to move the log inside the sum, but can we?

Jensen's Inequality to the rescue:

For any distribution Q (with the same support):

log Σ_z p(x, z | θ) = log Σ_z Q(z) [ p(x, z | θ) / Q(z) ] ≥ Σ_z Q(z) log [ p(x, z | θ) / Q(z) ]

How should we choose Q?

See also: Was[9.13], Murphy


Expectation Maximization

If we set Q(z) = p(z | x, θ), then the lower bound becomes an equality:

Σ_z Q(z) log [ p(x, z | θ) / Q(z) ]
  = Σ_z p(z | x, θ) log [ p(x, z | θ) / p(z | x, θ) ]
  = Σ_z p(z | x, θ) log p(x | θ)
  = log p(x | θ)

So, when computing the expected complete log-likelihood, the expectation should be taken with respect to the true posterior p(z | x, θ)


EM in Practice

Recall, we wanted to estimate parameters for:

l(π, θ; w) = Σ_m Σ_n log Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}

So we replace the hidden variables with their expectations:

l̃(π, θ; w) = Σ_m Σ_n [ E[z_mn] log( π p(w_mn | θ_G) ) + (1 − E[z_mn]) log( (1 − π) p(w_mn | θ_D) ) ]

All we need to do is calculate the expectations:

E[z_mn] = p(z_mn = 1 | w_mn, π, θ) = π p(w_mn | θ_G) / [ π p(w_mn | θ_G) + (1 − π) p(w_mn | θ_D) ]

And now the computation proceeds as in the no-hidden-variable setting


EM Summed Up

Initialize parameters however you desire

Repeat:

E-STEP:Compute expectations of hidden variables underthe current parameter settings

M-STEP: Optimize parameters given those expectations (a code sketch of the full loop follows below)

This procedure is guaranteed to:

Converge to a (local) maximum

Monotonically increase the incomplete log-likelihood
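To make the E/M loop concrete, here is a rough Python sketch (our own illustration, not code from the tutorial) of EM for the toy summarization model; the two documents and three-word vocabulary are the {A, B, C} example worked on the next slide.

import numpy as np

def em(docs, vocab, n_iters=50):
    idx = {w: i for i, w in enumerate(vocab)}
    pi = 0.5                                              # P(token comes from the general model)
    theta_g = np.full(len(vocab), 1.0 / len(vocab))       # general-English unigram model
    theta_d = [np.full(len(vocab), 1.0 / len(vocab)) for _ in docs]   # per-document models
    for _ in range(n_iters):
        # E-step: responsibility that each token was drawn from the general model
        resp = []
        for d, doc in enumerate(docs):
            r = np.array([pi * theta_g[idx[w]] /
                          (pi * theta_g[idx[w]] + (1 - pi) * theta_d[d][idx[w]])
                          for w in doc])
            resp.append(r)
        # M-step: re-estimate pi and the unigram models from expected counts
        pi = np.concatenate(resp).mean()
        g_counts = np.zeros(len(vocab))
        for d, doc in enumerate(docs):
            d_counts = np.zeros(len(vocab))
            for r, w in zip(resp[d], doc):
                g_counts[idx[w]] += r
                d_counts[idx[w]] += 1 - r
            theta_d[d] = d_counts / max(d_counts.sum(), 1e-12)
        theta_g = g_counts / max(g_counts.sum(), 1e-12)
    return pi, theta_g, theta_d

print(em([["A", "B"], ["A", "C"]], ["A", "B", "C"]))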


EM on our simple model

Suppose we have three words: { A, B, C}

Document 1 = [A B], Document 2 = [A C]

Initialized uniformly

E-step: with uniform initialization, every word is equally likely under both models, so

E[z_11] = E[z_12] = E[z_21] = E[z_22] = 1/2

M-step:

θ_G(A) = (1/Z)(E[z_11] + E[z_21]) = 1/2
θ_G(B) = (1/Z) E[z_12] = 1/4
θ_G(C) = (1/Z) E[z_22] = 1/4

θ_D1(A) = (1/Z_1)(1 − E[z_11]) = 1/2
θ_D1(B) = (1/Z_1)(1 − E[z_12]) = 1/2
θ_D1(C) = 0

θ_D2(A) = (1/Z_2)(1 − E[z_21]) = 1/2
θ_D2(B) = 0
θ_D2(C) = (1/Z_2)(1 − E[z_22]) = 1/2

π = [ E[z_11] + E[z_12] + E[z_21] + E[z_22] ] / 4 = 1/2

(Z, Z_1, Z_2 are the corresponding normalizers.)


EM on our simple model

Suppose we have three words: { A, B, C}

Document 1 = [A B], Document 2 = [A C]

Initialized uniformly

(Figure: complete vs. incomplete log-likelihood for this example, annotated with terms such as log π, log θ_G(A), and log θ_D1(A).)


Problems with EM

For our documents, EM will always converge to this solution

BUT:

For more documents and words, there is a trivial local maximum where the general English model does nothing

This corresponds to π going to 0 (or 1)

Why is this bad?

It doesn't conform to our prior beliefs about what parameters are likely!!!

So how can we specify our prior beliefs about π?



What is a Prior?

Recall Bayes' Rule:

A prior is a specification of our beliefs about the values parameters can take, before seeing any data

Okay, so what is a belief?

Do we have the same beliefs?

What if we don't?

What if I don't know what I believe?

Posterior = Prior × Likelihood / Marginal:

p(θ | D) = p(θ) p(D | θ) / ∫ dθ' p(θ') p(D | θ')


What is a Belief?

We want to be able to state numerically our beliefs in things like:

belief(it rained last night)
belief(it will rain tonight)
belief(the next card I draw will be an ace)

And we want to know how to manipulate beliefs

Suppose you are willing to accept a bet with odds proportional to the strength of your belief

Suppose you believe a coin will come up heads 90% of the time: belief(heads) = 0.9

Then you will accept a bet that if a coin comes up heads, you win at least $1, and if it comes up tails, you lose at most $9

See also: MK[3], Ber[3.1, 2.3]


Beliefs (The Dutch Book)

IF:

Suppose you believe a coin will come up heads 90% of the time: belief(heads) = 0.9

Then you will accept a bet that if a coin comes up heads, you win at least $1, and if it comes up tails, you lose at most $9

THEN:

Unless your beliefs satisfy the rules of probability including Bayes' Rule, then I can take arbitrary amounts of money from you

SO:

The only way not to go broke is to ensure that your beliefs agree with probability theory

AND BAYES' RULE!

See also: Ber[4.8]


How do my beliefs compare to Kevin's?

Maybe they do, maybe they don't

Does it matter?

If we have enough data, does it matter?

No! Theorem* :

With enough data, the posterior p(θ | x_1, ..., x_n) becomes essentially independent of the prior: two people who start from different priors (but share the likelihood) end up with posteriors that agree as n → ∞

* under some regularity conditions

See also: Ber[4.7, 4.8], Was[11.5]


Specifying Priors

A prior is a map that:

Assigns to every setting of parameters a non-negative real value

Integrates to 1 over the parameter space

Such a beast can be difficult to describe! Tools:

When the parameters are discrete, we can often set them by hand

Unless we're in high dimensions with (prior) interaction among parameters

Otherwise, we will often choose a parametric prior

and deal with the hyper-parameters by one of many means:

Set them subjectively (subjective/true Bayes)

Integrate them out analytically (often not possible and often suboptimal)

Choose them in such a way to be objective (objective Bayes)

Optimize them from the data by marginal likelihood (empirical Bayes, Type II ML)

Or choose a set of priors and integrate over them (robust Bayes)

...

∫ dθ p(θ) = 1

See also: Ber[3.1-6, 4.7], MK, Was[11.1]


Exponential Family

A set of distributions of the form:

p(x | θ) = exp( θᵀ u(x) − A(θ) )
  θ: natural parameter;  u(x): sufficient statistics;  A(θ): (log) normalization factor

Using exponential-family distributions is very convenient:

They are convex with respect to the parameters

They have natural prior distributions

They have several convenient properties wrt moments, e.g. ∇_θ A(θ) = E[ u(X) ]

See also: Ber[8.7], MK[22], Was[9.13]


Subjective Bayes

Eliciting priors can be very difficult; several options:

The histogram approach

Simple, but how big are intervals, no tails, ...

The relative likelihood approach

Typically easier to elicit, but still no tails

Moment matching of parametric forms

Most used and misused approach; now we have tails, but they are hard to elicit

Alternative is to use quantiles

Cumulative distribution function determination

Choose quantiles of the CDF


Objective Bayes

Desire to give a prior that contains no information

For instance, the improper uniform prior on the real line: p(θ) ∝ 1

These are rarely consistent across reparameterizations

For scale parameters, p(θ) ∝ 1/θ is more appropriate

Jeffreys' prior: p(θ) ∝ sqrt( det I(θ) ), where I(θ) is the expected Fisher information, I(θ) = −E[ ∂² log p(x | θ) / ∂θ ∂θᵀ ]

Not affected by restriction on the parameter space
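As a small worked instance (ours, not from the slides): for a Bernoulli likelihood the Fisher information is I(θ) = 1/(θ(1−θ)), so Jeffreys' prior is proportional to θ^(−1/2)(1−θ)^(−1/2), i.e. a Beta(1/2, 1/2).

import numpy as np

def fisher_information_bernoulli(theta):
    # I(theta) = -E[ d^2/dtheta^2 log p(x | theta) ] for x ~ Bernoulli(theta)
    return 1.0 / (theta * (1.0 - theta))

thetas = np.linspace(0.01, 0.99, 99)
unnorm = np.sqrt(fisher_information_bernoulli(thetas))        # Jeffreys' prior, unnormalized
jeffreys = unnorm / (unnorm.sum() * (thetas[1] - thetas[0]))  # normalize numerically
print(jeffreys[0], jeffreys[49], jeffreys[-1])   # density piles up near 0 and 1, dips at 0.5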


Empirical Bayes

Specify a class of priors (typically a functional form):

Estimate the prior by maximizing the marginal likelihood:

Class of priors: { p(θ | γ) : γ }

γ* = argmax_γ p(D | γ) = argmax_γ ∫ dθ p(θ | γ) p(D | θ)


A Bit of Philosophizing

Bayesians are often criticized for the use of priors:

Priors are subjective, science is objective

Priors can fail to be robust:

Small changes in the prior can result in large changes in the decision

Bayesian statistics requires thought and introspection

Do we really have degrees of belief? Is betting a good model?

Does infinite data really lead to agreement?

The computations are too difficult

Recall the goal of the statistician:

Statistics aims to do for inductive reasoning what Frege did for deductive reasoning

Do these complaints matter from the perspective of machine learning?

“The subjectivist states his judgments, whereas the objectivist sweeps them under the carpet by calling assumptions knowledge, and he basks in the glorious objectivity of science.” (Good, 1973)


The Likelihood Principle

Principle states:

All relevant information about the parameters after the data is observed is contained in the likelihood for the observed data. Furthermore, two likelihood functions contain the same information about the parameters if they are proportional to each other.

Example: We want to determine if a coin is biased or not
Experiment: the coin is flipped and we come up with 9 heads and 3 tails
Two possibilities for how the experiment was performed:
  We decided to do a total of 12 flips
  We decided to keep flipping the coin until 3 tails were observed

The Bayesian doesn't care

The classical statistician would compute:

7.5% 'chance' under option 1

3.25% 'chance' under option 2


Conjugate (convenient) Priors

Given a distribution

And a prior

The prior is conjugate if:

The posterior distribution of the parameter after observing some data has the same functional form as the prior distribution

Beta/Binomial Example:

Bin(x | N, θ) = C(N, x) θ^x (1 − θ)^{N − x}

Beta(θ | a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] θ^{a − 1} (1 − θ)^{b − 1}

p(θ | x, N, a, b) ∝ θ^{x + a − 1} (1 − θ)^{N − x + b − 1} ∝ Beta(θ | a + x, b + N − x)

See also: Ber[4.2.2], MK[23]


Binomial and Beta Distributions

Binomial distribution models flips of coins (domain={ 0,1} ):

Probability that a coin with bias θ, flipped N times, will come up heads x times
  Parameters: θ ∈ [0, 1], N
  Distribution: Bin(x | N, θ) = C(N, x) θ^x (1 − θ)^{N − x}
  Moments: E[x] = N θ,  Var[x] = N θ (1 − θ)

Beta distribution models nothing (we care about) (domain = [0, 1]):
  Parameters: a > 0, b > 0
  Distribution: Beta(θ | a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] θ^{a − 1} (1 − θ)^{b − 1}
  Moments: E[θ] = a / (a + b),  Var[θ] = a b / [ (a + b)² (a + b + 1) ]

Beta is conjugate to binomial:
  Posterior parameters: a' = a + x,  b' = b + N − x
  Marginal distribution: p(x | a, b) = C(N, x) [ Γ(a + b) / (Γ(a) Γ(b)) ] Γ(a + x) Γ(b + N − x) / Γ(a + b + N)
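A small sketch (ours; the prior values are assumed, the data are the 9-heads/3-tails example from the likelihood-principle slide) of the conjugate update in code:

from scipy import stats

a, b = 2.0, 2.0                               # prior hyper-parameters (assumed)
N, x = 12, 9                                  # flips and observed heads

a_post, b_post = a + x, b + (N - x)           # posterior is Beta(a + x, b + N - x)
posterior = stats.beta(a_post, b_post)
print("posterior mean:", posterior.mean())    # (a + x) / (a + b + N)
print("95% interval:", posterior.interval(0.95))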


Beta Distribution Examples

(Figure: Beta(θ | a, b) densities for a ∈ {0.5, 1, 4} and b ∈ {0.5, 1, 4}.)

Beta(θ | a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] θ^{a − 1} (1 − θ)^{b − 1}


Multinomial Distribution

A distribution over counts of K>1 discrete events (words)

Domain: x = (x_1, ..., x_K), x_k ∈ {0, 1, ..., N}, Σ_k x_k = N

Parameters: θ = (θ_1, ..., θ_K), θ_k ≥ 0, Σ_k θ_k = 1

Distribution: Mult(x | θ, N) = [ N! / (x_1! ··· x_K!) ] Π_k θ_k^{x_k}

Moments: E[x_k] = N θ_k,  Var[x_k] = N θ_k (1 − θ_k)

(Figure: the space of count vectors, a simplex with corners (N,0,0), (0,N,0), (0,0,N).)


Dirichlet Distribution

A distribution over a probability simplex

Domain: θ = (θ_1, ..., θ_K), θ_k ≥ 0, Σ_k θ_k = 1 (the probability simplex)

Parameters: α = (α_1, ..., α_K), α_k > 0

Distribution: Dir(θ | α) = [ Γ(Σ_k α_k) / Π_k Γ(α_k) ] Π_k θ_k^{α_k − 1}

Moments: E[θ_k] = α_k / Σ_j α_j,  Var[θ_k] = α_k (Σ_j α_j − α_k) / [ (Σ_j α_j)² (Σ_j α_j + 1) ]

(Figure: Dirichlet densities on the simplex for α = [1, 1, 2] and α = [5, 5, 10].)


Multinomial/Dirichlet Pair

Multinomial distribution: Mult(x | θ, N) = [ N! / Π_k x_k! ] Π_k θ_k^{x_k}

Dirichlet distribution: Dir(θ | α) = [ Γ(Σ_k α_k) / Π_k Γ(α_k) ] Π_k θ_k^{α_k − 1}

Posterior hyper-parameters: α_k' = α_k + x_k

Marginal distribution: p(x | α, N) = [ N! / Π_k x_k! ] [ Γ(Σ_k α_k) / Γ(Σ_k α_k + N) ] Π_k [ Γ(α_k + x_k) / Γ(α_k) ]
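A minimal sketch (ours, with made-up counts) of the Dirichlet/multinomial update: the posterior hyper-parameters are simply the prior pseudo-counts plus the observed counts.

import numpy as np

alpha = np.array([1.0, 1.0, 2.0])      # prior hyper-parameters (assumed)
counts = np.array([5, 0, 2])           # observed counts for three word types (assumed)

alpha_post = alpha + counts            # posterior is Dir(alpha + counts)
print("posterior hyper-parameters:", alpha_post)
print("posterior mean:", alpha_post / alpha_post.sum())   # compare to the MLE counts / counts.sum()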


Gaussian/Gaussian-Gamma

Gaussian distribution (τ is the precision): N(x | μ, τ) = sqrt(τ / 2π) exp( −τ (x − μ)² / 2 )

Gaussian prior on the mean: μ | τ ~ N(μ_0, (κ_0 τ)^{-1})

Gamma prior on the precision: τ ~ Gam(α_0, β_0)

Posterior hyper-parameters (after observing x_1, ..., x_N with mean x̄):
  μ_N = (κ_0 μ_0 + N x̄) / (κ_0 + N),   κ_N = κ_0 + N
  α_N = α_0 + N/2,   β_N = β_0 + ½ Σ_n (x_n − x̄)² + κ_0 N (x̄ − μ_0)² / (2 (κ_0 + N))

Marginal distribution: a non-standard Student's t distribution


Gamma Distribution

Gam(τ | α, β) = [ β^α / Γ(α) ] τ^{α − 1} exp( −β τ )

Moments: E[τ] = α / β,  Var[τ] = α / β²

(Figure: Gamma densities for α ∈ {1, 2, 4} and β ∈ {1, 2, 4}.)


Summary of Distributions

Distribution   Domain     Prior      Parametric Form
Binomial       Binary     Beta       C(N, x) θ^x (1 − θ)^(N − x)
Multinomial    K classes  Dirichlet  [N! / Π_k x_k!] Π_k θ_k^(x_k)
Beta           [0, 1]     —          [Γ(a + b) / (Γ(a) Γ(b))] θ^(a − 1) (1 − θ)^(b − 1)
Gamma          [0, ∞)     —          [β^α / Γ(α)] τ^(α − 1) exp(−β τ)
Dirichlet      Simplex    —          [Γ(Σ_k α_k) / Π_k Γ(α_k)] Π_k θ_k^(α_k − 1)
Gaussian       Reals      Nor/Gam    sqrt(τ / 2π) exp(−τ (x − μ)² / 2)
Cauchy         Reals      —          1 / [π γ (1 + ((x − x_0) / γ)²)]
Student's t    Reals      —          ∝ (1 + x² / ν)^(−(ν + 1) / 2)



Recall our summarization model

(Graphical model: the summarization model from before, with z, w, π, θ_G, θ_D.)

The problem was that we don't believe that it's okay for π to go to 0 or 1

Solution? Put a prior on π!

What's a good prior? π lives in [0, 1], so a natural (conjugate) choice is a Beta: p(π | a, b) ∝ π^{a − 1} (1 − π)^{b − 1}


Bayesianified summarization model

(Graphical model: as before, but π now has a Beta(a, b) prior with hyper-parameters a and b.)

w_mn | z_mn, θ ~ p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}
z_mn | π ~ Bin(π)
π | a, b ~ Beta(a, b)

p(w | a, b, θ) = ∫ dπ Beta(π | a, b) Π_m Π_n Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}


The Integration Problem

In general, we want to compute something like:

E_p[ f ] = ∫ dθ p(θ) f(θ)

Examples:

Summarization model:
  p is the posterior of the hidden variables
  f is the probability of the words

Classification model:
  p is the posterior of the model parameters
  f is the prediction function

Integral normalization:
  p is uniform
  f is a probability measure (i.e., unnormalized probability distribution)

See also: Ber[4.2, 4.3], MK[IV]



Maximum a Posteriori

Not Bayesian, but sometimes effective

Choose a, b by hand, proceed as before, but now we only need to maximize over π and θ (the θ updates are unchanged)

p(π | a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] π^{a − 1} (1 − π)^{b − 1}

l(π, θ; w) = log p(π | a, b) + Σ_m Σ_n log Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}

Amounts to a simple form of smoothing:

Now, we can just maximize π as before, but with fake counts added proportional to the prior:

π_MAP = [ (a − 1) + Σ_mn E[z_mn] ] / [ (a − 1) + (b − 1) + MN ]

(Graphical model: as before, with π ~ Beta(a, b), z_mn | π ~ Bin(π), and w_mn generated from θ_G or θ_D according to z_mn.)
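A tiny sketch (ours, with assumed hyper-parameters and example E-step values) of the fake-count view of the MAP estimate for π:

import numpy as np

a, b = 3.0, 3.0                                    # hand-chosen hyper-parameters (assumed)
expected_z = np.array([0.9, 0.8, 0.7, 0.9])        # E[z_mn] from an E-step (example values)

pi_mle = expected_z.mean()
pi_map = ((a - 1) + expected_z.sum()) / ((a - 1) + (b - 1) + len(expected_z))
print(pi_mle, pi_map)    # the MAP estimate is pulled toward the prior mode (a-1)/(a+b-2) = 0.5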


Maximum a Posteriori Temptation

I don't want to specify my prior! Let me estimate it:

First, I find a,b that maximize the marginal likelihood

Then, I use this a,b as the smoothing parameters for π

This is NOT VALID! Why?

We are 'double counting' the evidence



Summing: replace the integral with a sum over a grid R of parameter values:

∫ dθ p(θ) f(θ) ≈ Σ_{θ ∈ R} p(θ) f(θ) Δθ


Summing in our Model

Simply rewrite the integral as a sum:

Now we can compute expectations of z easily and use these for the M-step of EM

(Graphical model: as before, with π ~ Beta(a, b).)

p(w | a, b, θ) = ∫ dπ [ Γ(a + b) / (Γ(a) Γ(b)) ] π^{a − 1} (1 − π)^{b − 1} Π_m Π_n Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}

≈ Σ_{π ∈ R} [ Γ(a + b) / (Γ(a) Γ(b)) ] π^{a − 1} (1 − π)^{b − 1} Π_m Π_n Σ_{z_mn} p(z_mn | π) p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn}
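A rough sketch (ours; the word probabilities are the values from the earlier EM example and are held fixed) of replacing the integral over π with a sum over a grid R, and using the result to get an expectation of z:

import numpy as np
from scipy.stats import beta

a, b = 2.0, 2.0
p_g = np.array([0.5, 0.25, 0.25])      # general-English probs for words A, B, C (example)
p_d = np.array([[0.5, 0.5, 0.0],       # document-specific probs, doc 1
                [0.5, 0.0, 0.5]])      # doc 2
docs = [[0, 1], [0, 2]]                # word indices: doc1 = [A B], doc2 = [A C]

grid = np.linspace(0.01, 0.99, 99)     # the grid R of pi values
weights = beta(a, b).pdf(grid)         # prior weight at each grid point
for d, doc in enumerate(docs):
    for w in doc:                      # multiply in the likelihood of each token
        weights = weights * (grid * p_g[w] + (1 - grid) * p_d[d][w])
weights /= weights.sum()               # approximate posterior over pi on the grid

d, w = 0, 0                            # expectation of z for word A in document 1
Ez = np.sum(weights * grid * p_g[w] / (grid * p_g[w] + (1 - grid) * p_d[d][w]))
print("E[z] for word A in document 1:", Ez)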


Idea: let's choose R differently



Monte Carlo: approximate the integral by an average over samples:

∫ dθ p(θ) f(θ) ≈ (1/|R|) Σ_{r ∈ R} f(θ^(r)),  with the θ^(r) drawn from (or weighted by) p

See also: MK[29], Was[24.2], And03


Uniform Sampling

Pros:

Can now work in arbitrarily high dimensions (in theory)

Choice is now the size of R, not the width of the grid windows

Cons:

Number of samples required to get near the mode of a spiky distribution is huge

True distribution is rarely uniform

∫ dθ p(θ) f(θ) ≈ (1/|R|) Σ_{r ∈ R} p(θ^(r)) f(θ^(r))  (up to the volume of the sampled region), with the θ^(r) drawn uniformly

See also: MK[29], Was[24.2], And03


See also: MK[30], And03


Rejection Sampling

Pros:

Again, if q is close to p, we will get good samples (i.e., few samples will be rejected)

Cons:

Hard to construct such a q

With p and q zero-mean Gaussians and σ_q just 1% larger than σ_p, we must set c = (σ_q / σ_p)^D, which for D = 1000 yields an acceptance rate of about 1/20,000
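A toy Python sketch (ours) of rejection sampling: draw from a Beta(3, 2)-shaped target using a uniform proposal and an envelope constant c chosen at the target's mode:

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
target = beta(3, 2)
c = target.pdf(2.0 / 3.0)          # mode of Beta(3,2) is 2/3, so c * q(x) >= p(x) on [0, 1]

samples = []
while len(samples) < 10000:
    x = rng.uniform(0, 1)          # propose from q = Uniform(0, 1)
    u = rng.uniform(0, c)          # accept with probability p(x) / (c * q(x))
    if u < target.pdf(x):
        samples.append(x)
print(np.mean(samples))            # close to E[Beta(3,2)] = 0.6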



Markov Chain Monte Carlo

Monte Carlo methods suffer because the proposal density needs to be similar to the true density everywhere

MCMC methods get around this problem by changing the proposal density after each sample

General framework:

Choose a proposal density q(· | x) parameterized by the current location x

Initialize state x arbitrarily

Repeatedly sample by:

Propose a new state x' from q(x' | x)

Either accept or reject this new state

If accepted, set x = x'

New problem: samples are no longer independent!

See also: MK[30], Was[24.4], And03


Metropolis-Hastings Sampling

Accept new states with probability:

a = min( 1, [ p(x') q(x_0 | x') ] / [ p(x_0) q(x' | x_0) ] )

Only put every Nth sample into R

(Figure: target p(x), current state x_0, proposed state x', and the proposal densities q(· | x_0) and q(· | x').)

See also: MK[30], Was[24.4], And03
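A compact Python sketch (ours; the unnormalized target below is just a stand-in for a posterior over π) of the Metropolis-Hastings loop with a symmetric Gaussian random-walk proposal, for which the q terms cancel:

import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    # unnormalized target: a Beta(3, 2)-shaped density on (0, 1)
    return x**2 * (1 - x) if 0 < x < 1 else 0.0

x = 0.5                       # initial state
samples = []
for t in range(50000):
    x_prop = x + 0.1 * rng.standard_normal()      # symmetric proposal q(x' | x)
    accept = min(1.0, p_tilde(x_prop) / p_tilde(x))
    if rng.uniform() < accept:
        x = x_prop
    if t % 10 == 0:           # thin: only put every 10th sample into R
        samples.append(x)
print(np.mean(samples))       # approaches E[Beta(3,2)] = 0.6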


MH in our Model

Invent a proposal distribution q

Or, condition on all variables:

Now we can compute expectations of z easily and use these for the M-step of EM

Alternatively, we could propose values for LMs in the sampling

(Graphical model: as before, with π ~ Beta(a, b).)

w_mn | z_mn, θ ~ p(w_mn | θ_G)^{z_mn} p(w_mn | θ_D)^{1 − z_mn};  z_mn | π ~ Bin(π);  π ~ Beta(a, b)

Propose new values of π (and/or the z's) from q, and accept with the MH probability computed from the joint p(w, z, π | a, b, θ)


Metropolis-Hastings Sampling

Pros:

No longer need to specify a universally good proposal distribution; only a locally good one

Simple proposal distributions can go far

Cons:

Hard to tell how far to space samples:

Suppose we use spherical proposals; then we need on the order of (σ_max / σ_min)² steps per independent sample, where the σ's are the length scales (major/minor axes) of the density p

Auto-correlation can be used to track this

(Figure: target p(x), states x_0 and x', and proposal densities q(· | x_0), q(· | x').)


Gibbs Sampling

Defined only for multidimensional problems

Useful when you can take out one variable and explicitly sample the rest

(Figure: alternately sampling x_1 from p(x_1 | x_2) and x_2 from p(x_2 | x_1) in a two-dimensional example.)

x_i ~ p(x_i | x_1, ..., x_{i−1}, x_{i+1}, ..., x_D)

See also: MK[30], Was[24.5], And03


Gibbs Sampling

Typically our parameters are a vector: θ = (θ_1, ..., θ_D)

If, for each i, we can draw a sample from the conditional

θ_i ~ p(θ_i | θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_D, data)

then we can use Gibbs sampling

In graphical models, this conditional only depends on the Markov blanket of the variable (its parents, its children, and its children's other parents):

p(x | everything else) ∝ p(x | parents(x)) Π_{c ∈ children(x)} p(c | parents(c))

(Figure: a small graphical model with nodes a, b, c, d, e, f illustrating a Markov blanket, next to the two-dimensional Gibbs sampling picture.)


Gibbs in our Model

Compute conditional probabilities

Now we can compute expectations of z easily and use these for the M-step of EM

Alternatively, we could propose values for LMs in the sampling

(Graphical model: as before, with π ~ Beta(a, b).)

p(π | z, a, b) = Beta( π | a + Σ_mn z_mn,  b + Σ_mn (1 − z_mn) )

p(z_mn = 1 | π, w) = π p(w_mn | θ_G) / [ π p(w_mn | θ_G) + (1 − π) p(w_mn | θ_D) ]
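A rough Python sketch (ours; the word models are the example values from the EM slides and are held fixed) that alternates between the two conditionals above:

import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 2.0
p_g = np.array([0.5, 0.25, 0.25])          # general-English word probs (example)
p_d = np.array([[0.5, 0.5, 0.0],
                [0.5, 0.0, 0.5]])          # document-specific word probs (example)
docs = [[0, 1], [0, 2]]                    # doc1 = [A B], doc2 = [A C]

pi, z = 0.5, [[1, 1], [1, 1]]
pi_samples = []
for it in range(5000):
    # sample each z_mn from its conditional given pi
    for m, doc in enumerate(docs):
        for n, w in enumerate(doc):
            num = pi * p_g[w]
            z[m][n] = rng.uniform() < num / (num + (1 - pi) * p_d[m][w])
    # sample pi from its conditional given all the z's
    n_g = sum(sum(row) for row in z)
    n_d = sum(len(doc) for doc in docs) - n_g
    pi = rng.beta(a + n_g, b + n_d)
    if it >= 500:                          # discard burn-in
        pi_samples.append(pi)
print("posterior mean of pi:", np.mean(pi_samples))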


Gibbs Sampling

Pros:

Designed to work in high-dimensional spaces

Terribly simple to implement

Automatable

Cons:

Hard to judge convergence; can require many, many samples to get an independent one (often worse than MH)

Only applicable when the conditional distributions are 'nice'

(Though there are ways around this)

(Figure: the two-dimensional Gibbs sampling illustration from the previous slide.)



Laplace (Saddlepoint) Approximation

Idea: approximate the expectation by a quadratic (Taylor expansion) and use the normalizing constant from the resulting Gaussian distribution

∫ dx p(x) f(x) = ∫ dx g(x) ≈ ∫ dx q(x),  where q is a Gaussian fit at a mode of g

(Figure: p(x) f(x) = g(x) and its Gaussian approximation q(x) around the mode x_0.)

See also: MK[27]


Laplace Approximation

Find a mode x_0 of the high-dimensional distribution g

Approximate ln g(x) by a Taylor expansion around this mode:

ln g(x) ≈ ln g(x_0) − ½ (x − x_0)ᵀ A (x − x_0)

Compute the matrix A of second derivatives (the negative Hessian of ln g at x_0):

A_ij = − ∂² ln g(x) / ∂x_i ∂x_j |_{x = x_0}

The exponential form is a Gaussian distribution; use the Gaussian normalizing constant:

∫ dx g(x) ≈ g(x_0) sqrt( (2π)^D / det A )

(Figure: g(x) and its Gaussian approximation q(x) at the mode x_0.)


Laplace in our Model

Compute second derivatives of the log joint with respect to π:

ln g(π) = (a − 1) ln π + (b − 1) ln(1 − π) + Σ_mn ln[ π p(w_mn | θ_G) + (1 − π) p(w_mn | θ_D) ] + const

∂ ln g / ∂π = (a − 1)/π − (b − 1)/(1 − π) + Σ_mn [ p(w_mn | θ_G) − p(w_mn | θ_D) ] / [ π p(w_mn | θ_G) + (1 − π) p(w_mn | θ_D) ]

∂² ln g / ∂π² = −(a − 1)/π² − (b − 1)/(1 − π)² − Σ_mn [ p(w_mn | θ_G) − p(w_mn | θ_D) ]² / [ π p(w_mn | θ_G) + (1 − π) p(w_mn | θ_D) ]²

(Graphical model: as before, with π ~ Beta(a, b).)


Laplace Approximation

Pros:

Deterministic

Efficient if A is of a suitable form (i.e., diagonal or block-diagonal)

Can apply transformations to make the quadratic approximation more reasonable

Cons:

Poor fit for multimodal distributions

Often, det A cannot be found efficiently

∫ dx g(x) ≈ g(x_0) sqrt( (2π)^D / det A )

(Figure: g(x) and its Gaussian approximation q(x) at the mode x_0.)



Variational Approximation

Basic idea: replace intractable p with tractable q

Old Problem:

We cannot come up with a good, single, q to approximate p

Key Idea:

Consider a family of distributions with 'variational parameters'

Choose a member q from Q that is closest to p

New problems:

How do we choose Q?

How do we measure 'closeness' between q and p?

Q = { q(·; φ) : φ }  (a family of distributions indexed by variational parameters φ)

See also: MK[33], Wain03, Min03


Recall EM and Jensen's Inequality

Jensen gives us:

log p(x | θ) = log Σ_z p(x, z | θ) ≥ Σ_z Q(z) log [ p(x, z | θ) / Q(z) ] ≡ L(Q, θ)

Where we chose Q(z) = p(z | x, θ) to turn the inequality into an equality. But we can also compute, for any choice of q:

log p(x | θ) − L(q, θ) = Σ_z q(z) log [ q(z) / p(z | x, θ) ] = KL( q(z) || p(z | x, θ) )

So maximizing the lower bound L over q is the same as minimizing the KL divergence from q to the true posterior


Variational EM

Parameterize q and directly optimize:

Iterate:

V-Step: Compute variational parameters to minimize KL

E-Step: Compute expectations of hidden variables wrt q

M-Step: Maximize L wrt the true model parameters

Art: inventing q so that this is all tractable

L(q_ν, θ) = E_{q_ν}[ log p(x, z | θ) ] − E_{q_ν}[ log q_ν(z) ] = log p(x | θ) − KL( q_ν(z) ‖ p(z | x, θ) ),   with variational parameters ν
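As a concrete (if standard) instance of these alternating updates, here is a factorized variational approximation for a univariate Gaussian with unknown mean and precision and conjugate priors; the data and prior settings are invented for illustration, and the closed-form updates below follow the usual mean-field derivation rather than the summarization model used later in the tutorial.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=200)        # synthetic data (true mean 2.0, precision ~0.44)
N, xbar = len(x), x.mean()

# Priors (assumed values for the sketch): mu ~ N(mu0, (lam0*tau)^-1), tau ~ Gamma(a0, b0)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = a0 / b0                           # initialize E[tau]
for _ in range(50):
    # update q(mu) = N(mN, 1/lamN)
    mN = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lamN = (lam0 + N) * E_tau
    # update q(tau) = Gamma(aN, bN), using expectations under q(mu)
    aN = a0 + (N + 1) / 2.0
    Ssq = np.sum((x - mN) ** 2) + N / lamN
    bN = b0 + 0.5 * (Ssq + lam0 * ((mN - mu0) ** 2 + 1.0 / lamN))
    E_tau = aN / bN

print("E[mu] =", mN, " E[tau] =", E_tau)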

Slide 86

Variational: Choosing Q

Mixture model:

The posterior p(θ, z | w) couples θ and z; instead, approximate it with a fully factorized distribution

q(θ, z) = q(θ | γ) ∏_n q(z_n | φ_n)

with variational parameters γ (Dirichlet/Beta-like) and φ_n (multinomial).

Key: θ and z are now not tied in the q distribution!

[plate diagrams: the original model over w and z (plates N, M, G, D) alongside the factorized variational distribution with parameters γ and φ]

Slide 87

Variational EM

Step 1: Write out the log likelihood function:

L(γ, φ; θ) = E_q[ log p(θ | a, b) ] + Σ_n E_q[ log p(z_n | θ) ] + Σ_n E_q[ log p(w_n | z_n) ] − E_q[ log q(θ | γ) ] − Σ_n E_q[ log q(z_n | φ_n) ]

Slide 88

Variational EM

Step 2: Simplify and compute expectations

[Step 2 rewrites each term of L in terms of the variational parameters: expectations of log θ under q(θ | γ) become digamma differences, and expectations over z_n become sums weighted by φ_n.]

Slide 89

Computing Expectations

Thanks to sufficient statistics!

For θ ~ Dirichlet(γ) (a Beta in the two-component case), log θ is the sufficient statistic, so

E_q[ log θ_k ] = Ψ(γ_k) − Ψ( Σ_j γ_j )

where Ψ is the digamma function; every expectation needed in L has this kind of closed form.

Slide 90

Optimizing Variational Parameters

Step 3: Given the full expression for L, differentiate wrt the VPs

[collect the terms of L that involve each variational parameter, add a Lagrange multiplier for the normalization constraints, and set the derivative to zero]

Slide 91

Optimizing Variational Parameters

L

� ��� � � � � � � � � �� �

� � � � � � � �� � �� � �� �

�� � � � � � � �� � �

γ

� � � � � �γ

� � � � �� �

� � � � � � � � � � � � � � �� �

�� � � � �� � � � � ��� � �� �

L� �� � � � � ��� � � � � ��� � �

� � ��

� � � � � � � �� � � � � � � � �

� � ��

� �� � � ��

�� �

� �� � � ��

�� � �� � �

Slide 92

Optimize Model Parameters

Step 4: Optimize the model parameters:

[holding γ and φ fixed, maximize L wrt the model parameters; for the multinomial emission parameters this is a weighted maximum-likelihood update that uses the φ_n as soft counts]

Slide 93

Optimize Model Parameters

Finally, a and b:

[the terms of L involving a and b come from E_q[ log p(θ | a, b) ]; they involve digamma functions of a and b and have no closed-form maximizer]

→ solve using optimization techniques

Slide 94

VEM in our Model

Iterate:

Optimize variational parameters:

Optimize model parameters:

[plate diagram of the model (w, z, N, M, G, D) with the variational updates for γ and φ and the model-parameter updates; a and b are fit with generic optimization techniques]

Slide 95

Variational EM Summed Up

Steps:

Write down the conditional likelihood and choose an approximating distribution (e.g., by factoring everything) with variational parameters

Iterate between optimizing the VPsand model parameters

Pros:

Efficient, deterministic, often quite accurate

Cons:

At its heart, still a mode-based technique

Often underestimates the spread of a distribution

Approximation is local

Slide 96

Tutorial Outline

Slide 97

Expectation Propagation

Basic idea: replace intractable p with product of tractable q

Generally we want to compute:

Approximate each factor:

Integral is approximated by:

approximate terms should be chosen to make the EP operations tractable

∫ p(D, θ) dθ = ∫ ∏_i t_i(θ) dθ        (typically t_0(θ) is the prior)

Approximate each factor t_i(θ) by a simpler t̃_i(θ), so the integral is approximated by

∫ ∏_i t̃_i(θ) dθ

See also: Min01

Slide 98

Expectation Propagation Algorithm

Initialize approximate terms

Compute the approximate posterior:

Iterate:

Select a term t̃_i to update

Delete t̃_i from the posterior by dividing and renormalizing:

Match moments: minimize the KL divergence between q^{\i}(θ) t_i(θ) and the new posterior q^new(θ)

Update t̃_i

q(θ) ∝ ∏_i t̃_i(θ)

q^{\i}(θ) ∝ q(θ) / t̃_i(θ)

q^new(θ) = argmin_q KL( q^{\i}(θ) t_i(θ) ‖ q(θ) )

t̃_i(θ) ∝ q^new(θ) / q^{\i}(θ)
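A small grid-based sketch of this loop in Python, for a 1-D toy posterior with a Gaussian prior and heavy-tailed (Student-t) likelihood factors; the data, prior, and degrees of freedom are made up, each site is stored in natural parameters, and the moment matching is done by numerical integration on a grid.

import numpy as np

grid = np.linspace(-10, 10, 4001)
dx = grid[1] - grid[0]
y = np.array([1.0, 1.5, 2.5])                  # hypothetical observations
nu = 3.0                                       # Student-t degrees of freedom

def t_factor(i, x):                            # exact (intractable) factor t_i(x)
    return (1.0 + (x - y[i]) ** 2 / nu) ** (-(nu + 1) / 2)

prior_prec, prior_beta = 1.0 / 100.0, 0.0      # N(0, 100) prior, kept exact
r = np.zeros(len(y))                           # site precisions
beta = np.zeros(len(y))                        # site precision-times-mean

for _ in range(20):
    for i in range(len(y)):
        q_prec = prior_prec + r.sum()
        q_beta = prior_beta + beta.sum()
        cav_prec, cav_beta = q_prec - r[i], q_beta - beta[i]     # delete site i
        if cav_prec <= 0:                      # skip an update that would be improper
            continue
        cav_mean, cav_var = cav_beta / cav_prec, 1.0 / cav_prec
        tilted = np.exp(-0.5 * (grid - cav_mean) ** 2 / cav_var) * t_factor(i, grid)
        Z = tilted.sum() * dx                  # moment-match the tilted distribution
        mean = (grid * tilted).sum() * dx / Z
        var = ((grid - mean) ** 2 * tilted).sum() * dx / Z
        r[i] = 1.0 / var - cav_prec            # new site = new posterior / cavity
        beta[i] = mean / var - cav_beta

q_prec = prior_prec + r.sum()
print("EP posterior mean %.3f, variance %.3f" % ((prior_beta + beta.sum()) / q_prec, 1.0 / q_prec))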

Slide 99

EP in our Model

Our integral looks like:

We approximate non-prior terms:

The approximate posterior is:

[plate diagram (w, z, N, M, G, D); each non-prior factor is approximated by a term that resembles a Beta/Dirichlet with parameters, so the approximate posterior is again Beta/Dirichlet-like]

Slide 100

EP in our Model

Delete:

Match:

Choose such that:

Update:

[plate diagram (w, z, N, M, G, D) with the EP update equations for this model, annotated with: a learning rate, the difference between actual and expected sufficient statistics, the approximate normalization, and the old and new normalization constants]

Slide 101

Summing up EP

Approximate the terms by a product; iteratively minimize each term's KL divergence to the true posterior

Pros:

Efficient, accurate

Global approximation to the integral

Cons:

Typically requires lots of human effort

Only gives an approximation to the integral (not a bound)

Sometimes difficult to productize distributions

Doesn't necessarily converge

Slide 102

Tutorial Outline

Slide 103

Summary of Methods

MAP — Pros: just as easy as EM. Cons: not Bayesian, overfits; only simple models.

Summing — Pros: easy to implement; arbitrarily accurate. Cons: bounded regions; impossible for d > 2.

Monte Carlo — Pros: simple; few tunable params. Cons: proposal dist. is hard; lots of samples required.

MCMC — Pros: should work in d >> 2; no need for global PDs. Cons: hard to discover convergence; lots of useless computation.

Laplace — Pros: not much harder than MAP; efficient for good A. Cons: poor fit for multimodal; inefficient for bad A.

Variational — Pros: efficient, deterministic; often very accurate. Cons: still (massive) mode-based; local approximations.

EP — Pros: efficient, deterministic; very accurate. Cons: lots of human effort; weak convergence guarantees.

Slide 104

Message Passing Algorithms

Two major choices:

What approximating distribution should we use?

What cost should we minimize?

Power EP — exponential family, D_α(p ‖ q)
Struct MF — exponential family, KL(q ‖ p)
Frac BP — factorized, D_α(p ‖ q)
EP — exponential family, KL(p ‖ q)
Mean Field — factorized, KL(q ‖ p)
Tree Rep — factorized, D_α(p ‖ q)
BP — factorized, KL(p ‖ q)

(The grid arranges these along an α axis running from KL(q ‖ p) through KL(p ‖ q) and on to α > 1; the α-divergence is

D_α(p ‖ q) = [ ∫ α p(x) + (1 − α) q(x) − p(x)^α q(x)^{1−α} dx ] / [ α (1 − α) ].)

See also: Min05

Slide 105

Other Integration Strategies

Message Passing:

Generalized BP

Iterated conditional modes

Max-product belief revision

TRW-max-product

Laplace propagation

Penniless propagation

Bound propagation

Variational message passing

Non-message-passing:

Contrastive free energy

Bethe free energy

Spline integration

Slide 106

Empirical Evaluation of Methods

Query-focused summarization model:

[plate diagram of the query-focused summarization model: words w, latent classes z (plate N), documents (plates M, G, D), and query words S_Q, together with its generative equations]

Slide 107

Evaluation Data

All TREC data

Queries 51-350 and 401-450 (35k words)

All relevant documents (43k docs, 2.1m sents, 65.8m words)

Asked 7 annotators to select up to 4 sentences for an extract

Each annotated 25 queries (166 total)

Systems produce ranked lists of sentences

Compared on mean average precision, mean reciprocal rank and precision at 2

Computation Time:

MAP (2 hours)

Summing (2 days)

Monte Carlo (2 days)

MCMC (3 days)

Laplace (5 hours)

Variational (4 hours)

EP (2.5 hours)

Slide 108

Evaluation Results

[bar chart comparing Random, Position, IR, MAP, Summing, Monte Carlo, MCMC, Laplace, Variational, and EP on the evaluation measures, with computation times: MAP 2 hours, Summing 2 days, Monte Carlo 2 days, MCMC 3 days, Laplace 3.5 hours, Variational 4 hours, EP 2.5 hours]

Slide 109

Tutorial Outline

Slide 110

Bayesian Discriminative Models

Take a neural network and put a prior on the weights:

Computation requires a bit of calculus on Gaussians

Then use Laplace, variational or MCMC to perform integration over posterior

p(w) = N( w ; 0, σ² I ),    p(y | x, w) given by the network,

p(y | x, D) = ∫ p(y | x, w) p(w | D) dw

See also: MK[38]

Slide 111

Tutorial Outline

Slide 112

Gaussian Processes

Idea: instead of placing a prior over weights, place it over the function directly

Gaussian Process:

A collection of r.v.s such that any finite sample is jointly Gaussian

Uniquely specified by mean distribution and covariance function

f ~ GP( m(·), k(·,·) ):  for any inputs x_1, …, x_n,

( f(x_1), …, f(x_n) ) ~ N( (m(x_i))_i , (k(x_i, x_j))_{ij} )

See also: MK[38]

Slide 113

Computing with GPs

Compute the new covariance matrix:

For training inputs X with targets y and a test input x*, the joint covariance of (f(X), f(x*)) is [[ K(X, X) + σ² I , k(X, x*) ], [ k(x*, X) , k(x*, x*) ]]; conditioning the joint Gaussian gives

mean:  k(x*, X) (K + σ² I)⁻¹ y        (expected value)

variance:  k(x*, x*) − k(x*, X) (K + σ² I)⁻¹ k(X, x*)        (error bars)
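A minimal numpy sketch of these two formulas on toy 1-D data (the RBF covariance and its length-scale, signal variance, and noise level are illustrative choices, not values from the slides):

import numpy as np

def rbf(a, b, ell=1.0, sf2=1.0):
    """Squared-exponential covariance k(a, b) on 1-D inputs."""
    d = a[:, None] - b[None, :]
    return sf2 * np.exp(-0.5 * d**2 / ell**2)

# Hypothetical 1-D training data
X = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y = np.sin(X)
noise = 0.05

K = rbf(X, X) + noise * np.eye(len(X))
Xs = np.linspace(-5, 5, 9)
Ks = rbf(X, Xs)                                        # k(X, x*)
Kss = rbf(Xs, Xs)

mean = Ks.T @ np.linalg.solve(K, y)                    # expected values
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
err = 2 * np.sqrt(np.diag(cov))                        # ~95% error bars

for xs, m, e in zip(Xs, mean, err):
    print(f"f({xs:+.2f}) = {m:+.3f} +/- {e:.3f}")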

Slide 114

Efficiently Computing with GPs

Inverting the covariance matrix is cubic in the number of training points

Idea:

Iteratively add training points to maximize information gain (or some other criteria)

Only compute covariance over those

Leads to 'Informative Vector Machine (IVM)'


See also: Law03

Slide 115

Example Covariance Functions

Linear:  k(x, x') = xᵀx'

Polynomial:  k(x, x') = (1 + xᵀx')^d

RBF:  k(x, x') = σ² exp( −‖x − x'‖² / (2ℓ²) )

Combo:  sums and products of covariance functions (e.g., linear + RBF + noise) are again valid covariance functions

Bayesian advantage: we can tune all of these parameters!

Slide 116

Tutorial Outline

Slide 117

Dirichlet Processes

Suppose we don't want to limit the number of components of a mixture model

Example:

Each word in a document is drawn from a topic

We don't know how many topics there are

Allow the number of topics to grow with the data

A Dirichlet Process is a collection of r.v.s such that any finite sample is jointly Dirichlet

Parameterized by a precision and a mean distribution

G ~ DP( α, G0 )

A draw from a DP is, itself, a distribution

See also: Neal98

Slide 118

Dirichlet Distribution vs. Process

Suppose we place a DP prior:

Now we observe samples drawn from G:

The posterior (after integrating out G) is again a DP:

This is exactly like a Dirichlet Distribution:

G ~ DP( α, G0 ),    θ_1, …, θ_n | G ~ G

G | θ_1, …, θ_n ~ DP( α + n, (α G0 + Σ_i δ_{θ_i}) / (α + n) )

θ ~ Dir( α u ),    x_1, …, x_n | θ ~ θ,    θ | x_1, …, x_n ~ Dir( α u + counts(x) )

(the DP mean G0 is a distribution; the Dirichlet mean u is a vector)

Slide 119

Polya Urns

Start with an urn with a single black ball

Repeatedly draw balls:

If the drawn ball is black, replace it and put in a ball of a new color

If the ball is not black, replace it and put in a ball of the same color

Distribution is given by a DP:

θ_{n+1} | θ_1, …, θ_n ~ ( α G0 + Σ_{i=1}^{n} δ_{θ_i} ) / ( α + n )
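This urn scheme is easy to simulate; the sketch below draws from a DP with precision α = 1 and mean G0 = Uniform(0, 1) (both illustrative choices) and compares the number of distinct values against the usual α log(1 + n/α) growth.

import numpy as np

rng = np.random.default_rng(0)
alpha, n = 1.0, 1000
draws = []
for i in range(n):
    # with prob alpha/(alpha+i): "black ball" -> a new draw from G0 (a new color);
    # otherwise: copy one of the previous draws, chosen uniformly at random.
    if rng.random() < alpha / (alpha + i):
        draws.append(rng.random())
    else:
        draws.append(draws[rng.integers(i)])

print("distinct values after", n, "draws:", len(set(draws)))
print("expected number      ~", alpha * np.log(1 + n / alpha))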

Slide 120

Gibbs Sampling for the DP

Suppose we have an infinite mixture model over words:

Key: Mult is conjugate to the DP mean (Dir)

Gibbs sampling

Assign a latent class to each data point

Repeat:

Resample each class assignment z_n by:

Resample each component parameter θ_k by:

p( z_n = k | z_{−n}, w ) ∝ n_{−n,k} · p( w_n | {w_m : z_m = k, m ≠ n} )        for an existing class k

p( z_n = new | z_{−n}, w ) ∝ α · p( w_n )

θ_k | z, w ~ Dir( base-measure parameters + word counts in class k )
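A compact collapsed Gibbs sketch for such an infinite mixture of words, using the Polya-urn/Chinese-restaurant representation so the component parameters are integrated out rather than resampled explicitly; the corpus, vocabulary size, and the hyperparameters α and β are all made up for the example.

import numpy as np

rng = np.random.default_rng(0)
V, alpha, beta = 5, 1.0, 0.5
words = rng.integers(V, size=200)                 # hypothetical word tokens

z = np.zeros(len(words), dtype=int)               # all tokens start in one class
n_k = [len(words)]                                # tokens per class
n_kw = [np.bincount(words, minlength=V).astype(float)]   # word counts per class

for _ in range(50):
    for i, w in enumerate(words):
        k = z[i]                                  # remove token i from its class
        n_k[k] -= 1; n_kw[k][w] -= 1
        if n_k[k] == 0:                           # drop empty classes, relabel
            n_k.pop(k); n_kw.pop(k); z[z > k] -= 1
        # predictive probability of w under each existing class, plus a new one
        probs = [n_k[j] * (n_kw[j][w] + beta) / (n_k[j] + V * beta)
                 for j in range(len(n_k))]
        probs.append(alpha / V)
        probs = np.array(probs); probs /= probs.sum()
        k_new = rng.choice(len(probs), p=probs)
        if k_new == len(n_k):                     # start a new class
            n_k.append(0); n_kw.append(np.zeros(V))
        z[i] = k_new
        n_k[k_new] += 1; n_kw[k_new][w] += 1

print("number of classes:", len(n_k))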

Slide 121

Tutorial Outline

Slide 122

Bayesian Decision Theory

Bayesian statistics tells us how to compute distributions

What if we want to actually do something with a distribution?

Define a loss function:

Choose action to minimize the Bayesian expected loss:

L( a, θ )

ρ( a | x ) = ∫ L( a, θ ) p( θ | x ) dθ        (what we expect to lose, over all parameters)

Frequentist approach: define a decision rule:

Define the risk of the decision rule as:

Now, search for admissible decision rules

δ : x ↦ δ(x) = a

R( δ, θ ) = ∫ L( δ(x), θ ) p( x | θ ) dx        (what we expect to lose, over all data sets, with the parameters known)

See also: Ber[4.4]

Slide 123

Tutorial Outline

Slide 124

Bayes in Action (NLP/IR/Text)

D. Blei, A. Ng, M. Jordan, Latent Dirichlet allocation, JMLR 2003.

T. Griffiths, M. Steyvers, D. Blei, J. Tenenbaum, Integrating topics and syntax. NIPS 2004.

A. McCallum, A. Corrada-Emmanuel, X. Wang, Topic and Role Discovery in Social Networks. IJCAI 2005.

Y. Zhang, J. Callan, T. Minka, Novelty and Redundancy Detection in Adaptive Filtering. SIGIR 2002.

T. Minka, Bayesian conditional random fields. AISTATS 2005.

K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, M. Jordan. Matching words and pictures. JMLR 2003.

Slide 125

For Further Information (Books)

James O. Berger, Statistical Decision Theory and Bayesian Analysis. Springer, 1985.

David MacKay, Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference. Springer, 2003.

Christopher Bishop, “The new book.” 2006.

Slide 126

For Further Information (Tutorials)

C. Andrieu, N. de Freitas, A. Doucet, M. Jordan, An Introduction to MCMC for Machine Learning. ML 2003.

M. Wainwright, M. Jordan, Graphical models, exponential families and variational inference. UCB Stat TR#649, 2003.

K. Murphy, A Brief Introduction to Graphical Models and Bayesian Networks. www.cs.ubc.ca/~murphyk/Bayes/bayes.html

T. Minka, Using lower bounds to approximate integrals. www.research.microsoft.com/~minka/papers/rem.html, 2003.

Slide 127

Other References

N. Lawrence, Fast sparse Gaussian process methods: the informative vector machine. NIPS 2003.

T. Minka, Expectation Propagation for Approximate Bayesian Inference. UAI 2001.

T. Minka, Divergence Measures and Message Passing. AI-Stats 2005.

R. Neal, Markov chain sampling methods for Dirichlet process mixture models, TR. 9815, Dept. of Statistics, University of Toronto.

Slide 128

Thank you!

Questions? Comments?