Hal Daumé III ([email protected]), Beyond EM: Bayesian Techniques for NLP
TRANSCRIPT
Slide 1: Beyond EM: Bayesian Techniques for NLP Researchers
Hal Daumé III, Information Sciences Institute, University of Southern California
...or "And you thought EM was hard"
...or "And you thought EM was fun"
Slide 2: Tutorial Outline
- Introduction to the Bayesian Paradigm
- Background Material
  - Graphical Models
  - Expectation Maximization
- Priors, priors, priors (subjective, conjugate, reference, etc.)
- Inference Problem and Solutions
  - MAP
  - Summing
  - Monte Carlo
  - Markov Chain Monte Carlo
  - Laplace Approximation
  - Variational Approximation
  - Expectation Propagation
  - Others...
- Advanced Topics (time permitting)
  - Bayesian discriminative models
  - Non-parametric (infinite) models
  - Bayesian Decision Theory
- References
Slide 3: The Bayesian Paradigm
- Every statistical problem has data and parameters
- Find a probability distribution of the parameters given the data using Bayes' Rule:

  p(θ | x) = p(θ) p(x | θ) / ∫ dθ' p(θ') p(x | θ')

  (posterior = prior × likelihood / marginal)
- Use the posterior to:
  - Predict unseen data (machine learning)
  - Reach scientific conclusions (statistics)
  - Make optimal decisions (Bayesian decision theory)
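To make the pieces of Bayes' Rule concrete, here is a minimal numeric sketch (mine, not from the tutorial): the posterior over a coin's bias θ on a grid, with a uniform prior and 7 heads out of 10 flips; the marginal is approximated by a Riemann sum.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)   # grid over the parameter space
prior = np.ones_like(theta)              # uniform prior p(theta)
likelihood = theta**7 * (1 - theta)**3   # p(x | theta): 7 heads, 3 tails
width = theta[1] - theta[0]

unnorm = prior * likelihood
posterior = unnorm / (unnorm.sum() * width)   # divide by the (approximate) marginal

print("posterior mean:", (theta * posterior).sum() * width)  # ~ 8/12 = 0.667 (Beta(8,4))
```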
Slide 4: Why be Bayesian?
- Pedagogical arguments:
  - All estimates, answers, decisions, etc. are consistent
  - Uniform treatment of all aspects of statistical modeling
  - Results are often interpretable
  - Assumptions are explicit and most can be tested
- Practical arguments:
  - We know maximum likelihood estimators (for instance) are flawed
  - Who doesn't do smoothing?
  - I don't want to run 1000 experiments, changing a single parameter by 0.01 on each run (and I don't want you to either!)
  - Lets you do some fun stuff that you can't otherwise do
See also: Ber[1.6, 4.2], MK[2.1, 2.2], Was[11.1]
Slide 5: Why not be Bayesian?
- Makes you do a lot of (or at least some) math
  - Could be a reason to be Bayesian, though
- Computational complexity
  - Though it's not so bad compared to cross validation
- Arbitrariness of priors
  - Often based on misguided impressions of either Bayesian or frequentist methods
  - Generally not a big problem for machine learning
Slide 6: What does it mean to be Bayesian?
- It is NOT about using priors
  - "Oooh look, I put a Gaussian prior on my maximum entropy problem... I must be a Bayesian now"
- It is NOT about applying Bayes' Rule
  - "Well, I've used a noisy channel model, which employs Bayes' Rule... I'm being Bayesian, right?"
- It is ONLY about modeling uncertainty in all stages of statistical inference
  - This means making decisions by summing over all possibilities
Slide 7: What is this tutorial about?
- Graphical models as a tool for expressing assumptions
- Common statistical distributions that are useful for modeling different types of quantities
- How to go from a problem specification to a model
- How to specify/choose appropriate prior distributions
- How to perform inference in that model
- Initial focus on unsupervised learning, but will discuss supervised models at the end
Slide 8: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 9: What is a Model?
- A statistical model is a specification of our assumptions about the data we will be modeling
  - What aspects are independent of others?
  - For non-independent aspects, how are they related?
- Example 1 (maximum entropy models):
  - There is a class variable y and an instance variable x, and the conditional probability of y given x is a linear function of a bunch of features of x
- Example 2 (machine translation):
  - There is something called a foreign string and something called an English string, and each foreign word is generated by either exactly one English word or a special 'null token', according to a translation table
Slide 10: What is a Good Model?
- We can compare models by looking at the probability that they generate our data set (the marginal likelihood of the data): P(data | model)
[figure: P(data | model) plotted over all possible data sets for Models 1-3, with the current data set marked]
See also: MK[28.1]
Slide 11: What are Parameters?
- Parameters are the parts of the model about which you are not completely 'certain' (or, willing to claim certainty)
- Example 1 (maximum entropy models):
  - I 'know' that the class is linearly related to the input
  - I 'know' what the relevant features are
  - I don't know the feature weights => these are the parameters
- Example 2 (machine translation):
  - I 'know' that foreign words are generated from exactly one English word (or the null word)
  - I do not know what the probability of any foreign word is, given a particular English word
  - (I also do not know which foreign words correspond to which English word, but this is a 'hidden variable', not a parameter.)
Slide 12: Graphical Models
- Convenient notation for representing probability distributions and conditional independence assumptions
Node legend (glyphs shown in the original figure):
- an observed random variable
- an unobserved/hidden random variable
- an observed/known parameter
- an unobserved/unknown parameter
- a plate: a submodel replicated N times
- an arrow: an indication of conditional dependence
See also: Murphy
Slide 13: Example 1: Naïve Bayes
[graphical model: class label Y with class 'prior' probability π; data vector X with feature parameters θ; the X node sits in a plate of F features inside a plate of N examples]
- For each example n:
  - Choose a class Y by: Y ~ Mult(π)
  - For each feature f:
    - Choose X_f by: X_f ~ Bin(θ_{f,Y})
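A hedged sketch (mine, not from the slides) of this generative story: draw a class Y ~ Mult(π), then each binary feature X_f ~ Bin(θ_{f,Y}); the sizes and parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K, F, N = 3, 5, 4                       # classes, features, examples (arbitrary)
pi = np.array([0.5, 0.3, 0.2])          # class 'prior' probabilities
theta = rng.uniform(size=(F, K))        # theta[f, k] = p(X_f = 1 | Y = k)

for n in range(N):
    y = rng.choice(K, p=pi)             # choose a class
    x = (rng.uniform(size=F) < theta[:, y]).astype(int)  # choose each feature
    print(f"example {n}: y={y}, x={x}")
```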
Slide 14: Example 2: Maximum Entropy
[graphical model: data vector X, class label Y, feature parameters θ; replicated N times with F features; here X is given and only Y is generated]
- For each example n:
  - Choose a class Y by:

    p(Y = y | X, θ) ∝ exp( Σ_f θ_{fy} X_f )
Slide 15: Example 3: Hidden Markov Models
[graphical model: a chain of X/Y pairs; each Y depends on the previous Y and emits an observation X, with transition and emission parameters shared across time steps]
Slide 16: Example for Summarization
- Consider a stupid summarization model:
  - Each word in a document is drawn independently
  - Each word is drawn either from a general English model, or a document-specific model
  - We don't know which words are drawn from which
[graphical model: word w with indicator variable z, in a plate of N words inside a plate of M documents; parameters π, β_G, β_D]

  p(w | π, β_G, β_D) = ∏_m ∏_n Σ_{z_mn} p(z_mn | π) p(w_mn | β_G)^{z_mn} p(w_mn | β_{D_m})^{1 − z_mn}
Slide 17: Fun with Graphical Models
- Easy to propose extensions to the model: add sentences!
[graphical model: the summarization model from Slide 16, extended with a sentence-level plate S between the document and word plates]
Slide 18: Fun with Graphical Models
- Add queries!
[graphical model: the sentence-extended model, further extended with a query plate Q containing query words w and query-specific parameters]
Slide 19: Maximum Likelihood Estimators (MLE)
[graphical model: the Naïve Bayes model from Slide 13]
- Take a parameterized model and some data
- Find the parameters that maximize the likelihood of that data (i.e., the 'probability' of the parameters given the data):

  (π̂, θ̂) = argmax_{π,θ} p(X_{1:N}, Y_{1:N} | π, θ)

- For Naïve Bayes with binary features, the log-likelihood and its gradients are:

  l(π, θ; X_{1:N}, Y_{1:N}) = Σ_n Σ_k [ Y_nk log π_k + (1 − Y_nk) log(1 − π_k) ]
                            + Σ_n Σ_f [ X_nf log θ_{f,Y_n} + (1 − X_nf) log(1 − θ_{f,Y_n}) ]

  ∂l/∂π_k = Σ_n [ Y_nk / π_k − (1 − Y_nk) / (1 − π_k) ]

  ∂l/∂θ_{fk} = Σ_{n: Y_n = k} [ X_nf / θ_{fk} − (1 − X_nf) / (1 − θ_{fk}) ]
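A hedged sketch (mine, not from the slides): solving the zero-gradient conditions above, with the simplex constraint on π, gives the familiar count-ratio MLEs; the toy data below is arbitrary.

```python
import numpy as np

X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]])  # N=4 examples, F=3 binary features
Y = np.array([0, 0, 1, 1])                                  # class labels, K=2
K = 2

pi_hat = np.bincount(Y, minlength=K) / len(Y)               # pi_k = fraction of examples in class k
theta_hat = np.vstack([X[Y == k].mean(axis=0) for k in range(K)]).T  # theta[f, k] = mean of X_f in class k
print("pi:", pi_hat)
print("theta:\n", theta_hat)
```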
Slide 20: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 21: MLE with hidden variables
- Consider a stupid summarization model:
  - Each word in a document is drawn independently
  - Each word is drawn either from a general English model, or a document-specific model
  - We don't know which words are drawn from which
[graphical model: as on Slide 16, with indicator variable z]

  p(w | π, β_G, β_D) = ∏_m ∏_n Σ_{z_mn} p(z_mn | π) p(w_mn | β_G)^{z_mn} p(w_mn | β_{D_m})^{1 − z_mn}

- Compute log likelihood:

  l(π, β; w) = Σ_m Σ_n log Σ_{z_mn} p(z_mn | π) p(w_mn | β_G)^{z_mn} p(w_mn | β_{D_m})^{1 − z_mn}

- Uh oh! Logs can't go inside sums!
Slide 22: Expectation Maximization
- We would like to move the log inside the sum, but can we?
- Jensen's Inequality to the rescue:
  - For any distribution Q (with the same support):

    log Σ_z p(x, z) = log Σ_z Q(z) [ p(x, z) / Q(z) ] ≥ Σ_z Q(z) log [ p(x, z) / Q(z) ]

- How should we choose Q?
See also: Was[9.13], Murphy
Slide 23: Expectation Maximization
- If we set Q(z) = p(z | x), then the lower bound becomes an equality:

  Σ_z Q(z) log [ p(x, z) / Q(z) ] = Σ_z p(z | x) log [ p(x, z) / p(z | x) ]
                                  = Σ_z p(z | x) log p(x)
                                  = log p(x)

- So, when computing expectations, they should be taken with respect to the true posterior p(z | x)
Slide 24: EM in Practice
- Recall, we wanted to estimate parameters for:

  p(w | π, β) = ∏_m ∏_n Σ_{z_mn} p(z_mn | π) p(w_mn | β_G)^{z_mn} p(w_mn | β_{D_m})^{1 − z_mn}

- So we replace the hidden variables with their expectations
- All we need to do is calculate the expectations:

  E[z_mn] = p(z_mn = 1 | w, π, β) = π p(w_mn | β_G) / [ π p(w_mn | β_G) + (1 − π) p(w_mn | β_{D_m}) ]

- And now the computation proceeds as in the no-hidden-variable setting
Slide 25: EM Summed Up
- Initialize parameters however you desire
- Repeat:
  - E-STEP: Compute expectations of hidden variables under the current parameter settings
  - M-STEP: Optimize parameters given those expectations
- This procedure is guaranteed to:
  - Converge to a (local) maximum
  - Monotonically increase the incomplete log-likelihood
Slide 26: EM on our simple model
- Suppose we have three words: {A, B, C}
- Document 1 = [A B], Document 2 = [A C]
- Initialized uniformly, so every E[z_mn] = 1/2
- E-step:

  E[z_mn] = π β_G(w_mn) / [ π β_G(w_mn) + (1 − π) β_{D_m}(w_mn) ] = 1/2

- M-step (each Z is the appropriate normalizer):

  β_G(A) = (E[z_11] + E[z_21]) / Z = 1/2;  β_G(B) = E[z_12] / Z = 1/4;  β_G(C) = E[z_22] / Z = 1/4

  β_{D_1}(A) = (1 − E[z_11]) / Z_1 = 1/2;  β_{D_1}(B) = (1 − E[z_12]) / Z_1 = 1/2;  β_{D_1}(C) = 0
  β_{D_2}(A) = (1 − E[z_21]) / Z_2 = 1/2;  β_{D_2}(B) = 0;  β_{D_2}(C) = (1 − E[z_22]) / Z_2 = 1/2

  π = (E[z_11] + E[z_12] + E[z_21] + E[z_22]) / 4 = 1/2
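A hedged sketch (mine, not from the tutorial) of the full EM loop for this toy model, with words {A,B,C} mapped to {0,1,2}; after the first iteration it reproduces the numbers on this slide.

```python
import numpy as np

docs = [np.array([0, 1]), np.array([0, 2])]     # Doc1 = [A B], Doc2 = [A C]
V, M = 3, len(docs)
pi, beta_G = 0.5, np.full(V, 1/3)               # uniform initialization
beta_D = np.full((M, V), 1/3)

for it in range(50):
    # E-step: responsibility of the general-English model for each token
    Ez = [pi * beta_G[doc] / (pi * beta_G[doc] + (1 - pi) * beta_D[m, doc])
          for m, doc in enumerate(docs)]
    # M-step: re-estimate everything from expected counts
    cG = np.zeros(V)
    for m, doc in enumerate(docs):
        np.add.at(cG, doc, Ez[m])               # expected general-model counts
        cD = np.zeros(V)
        np.add.at(cD, doc, 1 - Ez[m])           # expected document-model counts
        beta_D[m] = cD / cD.sum()
    beta_G = cG / cG.sum()
    pi = np.concatenate(Ez).mean()

print("pi =", pi, "beta_G =", beta_G)
```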
Slide 27: EM on our simple model
- Suppose we have three words: {A, B, C}
- Document 1 = [A B], Document 2 = [A C]
- Initialized uniformly
[figure: surfaces of the complete log-likelihood and the incomplete log-likelihood as functions of log π, log β_G(A), and log β_{D_1}(A)]
Slide 28: Problems with EM
- For our documents, EM will always converge to this solution
- BUT:
  - For more documents and words, there is a trivial local maximum where the general English model does nothing
  - This corresponds to π → 0
  - Why is this bad? It doesn't conform to our prior beliefs about what parameters are likely!!!
- So how can we specify our prior beliefs about π?
Slide 29: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 30: What is a Prior?
- Recall Bayes' Rule:

  p(θ | x) = p(θ) p(x | θ) / ∫ dθ' p(θ') p(x | θ')

  (posterior = prior × likelihood / marginal)
- A prior is a specification of our beliefs about the values parameters can take, before seeing any data
- Okay, so what is a belief?
  - Do we have the same beliefs?
  - What if we don't?
  - What if I don't know what I believe?
Slide 31: What is a Belief?
- We want to be able to state numerically our beliefs in things like:
  belief(it rained last night), belief(it will rain tonight), belief(the next card I draw will be an ace)
- And we want to know how to manipulate beliefs
- Suppose you are willing to accept a bet with odds proportional to the strength of your belief
  - Suppose you believe a coin will come up heads 90% of the time: belief(heads) = 0.9
  - Then you will accept a bet that if the coin comes up heads, you win at least $1, and if it comes up tails, you lose at most $9
See also: MK[3], Ber[3.1, 2.3]
Slide 32: Beliefs (The Dutch Book)
- IF:
  - Suppose you believe a coin will come up heads 90% of the time: belief(heads) = 0.9
  - Then you will accept a bet that if the coin comes up heads, you win at least $1, and if it comes up tails, you lose at most $9
- THEN:
  - Unless your beliefs satisfy the rules of probability, including Bayes' Rule, I can take arbitrary amounts of money from you
- SO:
  - The only way not to go broke is to ensure that your beliefs agree with probability theory
  - AND BAYES' RULE!
See also: Ber[4.8]
Slide 33: How do my beliefs compare to Kevin's?
- Maybe they do, maybe they don't
- Does it matter?
- If we have enough data, does it matter?
- No! Theorem*: as the amount of data grows, posteriors obtained from different priors converge:

  p₁(θ | x_{1:N}) → p₂(θ | x_{1:N}) as N → ∞

  *under some regularity conditions
See also: Ber[4.7, 4.8], Was[11.5]
Slide 34: Specifying Priors
- A prior is a map that:
  - Assigns to every setting of parameters a non-negative real value
  - Integrates to 1 over the parameter space: ∫ dθ p(θ) = 1
- Such a beast can be difficult to describe! Tools:
  - When the parameters are discrete, we can often set them by hand
    - Unless we're in high dimensions with (prior) interaction among parameters
  - Otherwise, we will often choose a parametric prior p(θ | γ) and deal with the hyper-parameters γ by one of many means:
    - Set them subjectively (subjective/true Bayes)
    - Integrate them out analytically (often not possible and often suboptimal)
    - Choose them in such a way as to be objective (objective Bayes)
    - Optimize them from the data by marginal likelihood (empirical Bayes, Type II ML)
    - Or choose a set of priors and integrate over them (robust Bayes)
    - ...
See also: Ber[3.1-6, 4.7], MK[], Was[11.1]
Slide 35: Exponential Family
- A set of distributions of the form:

  p(x | θ) = (1 / Z(θ)) h(x) exp( θᵀ φ(x) )

  where θ is the natural parameter, φ(x) are the sufficient statistics, and Z(θ) is the normalization factor
- Using exponential-family distributions is very convenient:
  - They are convex with respect to the parameters
  - They have natural prior distributions
  - They have several convenient properties wrt moments, e.g.:

    ∇_θ log Z(θ) = E_{p(x | θ)}[ φ(x) ]

See also: Ber[8.7], MK[22], Was[9.13]
Slide 36: Subjective Bayes
- Eliciting priors can be very difficult; several options:
  - The histogram approach
    - Simple, but how big are the intervals? no tails, ...
  - The relative likelihood approach
    - Typically easier to elicit, but still no tails
  - Moment matching of parametric forms
    - Most used and misused approach; now we have tails, but they are hard to elicit
    - An alternative is to use quantiles
  - Cumulative distribution function determination
    - Choose quantiles of the CDF
Slide 37: Objective Bayes
- Desire to give a prior that contains no information
  - For instance, the improper uniform prior on the real line: p(θ) ∝ 1
  - These are rarely consistent across reparameterizations
  - For scale parameters, p(θ) ∝ 1/θ is more appropriate
- Jeffreys' prior:

  p(θ) ∝ √( det I(θ) ),  where I(θ) = −E[ ∂² log p(x | θ) / ∂θ² ] is the expected Fisher information

  - Not affected by restriction on the parameter space
Slide 38: Empirical Bayes
- Specify a class of priors (typically a functional form): p(θ | γ)
- Estimate the prior by maximizing the marginal likelihood:

  γ̂ = argmax_γ ∫ dθ p(θ | γ) p(x | θ)
Slide 39: A Bit of Philosophizing
- Bayesians are often criticized for the use of priors:
  - Priors are subjective, science is objective
  - Priors can fail to be robust: small changes in the prior can result in large changes in the decision
  - Bayesian statistics requires thought and introspection
  - Do we really have degrees of belief? Is betting a good model?
  - Does infinite data really lead to agreement?
  - The computations are too difficult
- Recall the goal of the statistician:
  - Statistics aims to do for inductive reasoning what Frege did for deductive reasoning
- Do these complaints matter from the perspective of machine learning?
"The subjectivist states his judgments, whereas the objectivist sweeps them under the carpet by calling assumptions knowledge, and he basks in the glorious objectivity of science." (Good, 1973)
Slide 40: The Likelihood Principle
- The principle states:
  - All relevant information about the parameters after the data is observed is contained in the likelihood for the observed data. Furthermore, two likelihood functions contain the same information about the parameters if they are proportional to each other.
- Example: we want to determine if a coin is biased or not
  - Experiment: the coin is flipped and we come up with 9 heads and 3 tails
  - Two possibilities for how the experiment was performed:
    - We decided to do a total of 12 flips
    - We decided to keep flipping the coin until 3 tails were observed
- The Bayesian doesn't care
- The classical statistician would compute:
  - 7.5% 'chance' under option 1
  - 3.25% 'chance' under option 2
Slide 41: Conjugate (convenient) Priors
- Given a distribution p(x | θ)
- And a prior p(θ)
- The prior is conjugate if: the posterior distribution of the parameter after observing some data has the same functional form as the prior distribution
- Beta/Binomial example:

  Bin(x | N, θ) = C(N, x) θˣ (1 − θ)^{N − x}
  Beta(θ | a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] θ^{a − 1} (1 − θ)^{b − 1}

  p(θ | x) ∝ Bin(x | N, θ) Beta(θ | a, b)
           ∝ θ^{a + x − 1} (1 − θ)^{b + N − x − 1}
           = Beta(θ | a + x, b + N − x)

See also: Ber[4.2.2], MK[23]
Slide 42: Binomial and Beta Distributions
- Binomial distribution models flips of coins (domain = {0,1} per flip):
  - Probability that a coin with bias θ, flipped N times, will come up heads x times
  - Parameters: θ ∈ [0, 1], number of flips N
  - Distribution: Bin(x | N, θ) = C(N, x) θˣ (1 − θ)^{N − x}
  - Moments: E[x] = Nθ, Var[x] = Nθ(1 − θ)
- Beta distribution models nothing (we care about) (domain = [0, 1]):
  - Parameters: a, b > 0
  - Distribution: Beta(θ | a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] θ^{a − 1} (1 − θ)^{b − 1}
  - Moments: E[θ] = a / (a + b), Var[θ] = ab / [ (a + b)² (a + b + 1) ]
- Beta is conjugate to binomial:
  - Posterior parameters: a + x, b + N − x
  - Marginal distribution:

    p(x | a, b) = C(N, x) [ Γ(a + b) / (Γ(a) Γ(b)) ] [ Γ(a + x) Γ(b + N − x) / Γ(a + b + N) ]
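A hedged sketch (mine, not from the slides) of both facts on this slide: the conjugate update is literally "add the counts", and the marginal likelihood is a ratio of Gamma functions, computed in log space for stability.

```python
from math import lgamma, exp

a, b = 2.0, 2.0            # Beta prior hyper-parameters (arbitrary)
N, x = 10, 7               # observed: 7 heads in 10 flips

a_post, b_post = a + x, b + N - x          # conjugate update: just add counts
print("posterior mean:", a_post / (a_post + b_post))

def log_B(p, q):                           # log Beta function: log Gamma(p)Gamma(q)/Gamma(p+q)
    return lgamma(p) + lgamma(q) - lgamma(p + q)

# marginal: p(x | a, b) = C(N, x) * B(a + x, b + N - x) / B(a, b)
log_marg = (lgamma(N + 1) - lgamma(x + 1) - lgamma(N - x + 1)) \
           + log_B(a_post, b_post) - log_B(a, b)
print("marginal likelihood:", exp(log_marg))
```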
Slide 43: Beta Distribution Examples

  Beta(θ | a, b) = [ Γ(a + b) / (Γ(a) Γ(b)) ] θ^{a − 1} (1 − θ)^{b − 1}

[figure: a 3×3 grid of Beta densities for a ∈ {0.5, 1, 4} and b ∈ {0.5, 1, 4}]
Slide 44: Multinomial Distribution
- A distribution over counts of K > 1 discrete events (words)
- Domain: x ∈ {0, …, N}^K with Σ_k x_k = N
- Parameters: θ in the simplex Δ_K (θ_k ≥ 0, Σ_k θ_k = 1)
- Distribution:

  Mult(x | θ) = [ Γ(Σ_k x_k + 1) / ∏_k Γ(x_k + 1) ] ∏_k θ_k^{x_k}

- Moments: E[x_k] = N θ_k, Var[x_k] = N θ_k (1 − θ_k)
[figure: the count simplex with corners (N,0,0), (0,N,0), (0,0,N)]
Slide 45: Dirichlet Distribution
- A distribution over a probability simplex
- Domain: θ ∈ Δ_K
- Parameters: α_k > 0
- Distribution:

  Dir(θ | α) = [ Γ(Σ_k α_k) / ∏_k Γ(α_k) ] ∏_k θ_k^{α_k − 1}

- Moments: E[θ_k] = α_k / Σ_j α_j
[figure: Dirichlet densities over the simplex for α = [1, 1, 2] and α = [5, 5, 10]]
Slide 46: Multinomial/Dirichlet Pair
- Multinomial distribution: Mult(x | θ) = [ Γ(Σ_k x_k + 1) / ∏_k Γ(x_k + 1) ] ∏_k θ_k^{x_k}
- Dirichlet distribution: Dir(θ | α) = [ Γ(Σ_k α_k) / ∏_k Γ(α_k) ] ∏_k θ_k^{α_k − 1}
- Posterior hyper-parameters: α_k + x_k
- Marginal distribution:

  p(x | α) = [ Γ(Σ_k x_k + 1) / ∏_k Γ(x_k + 1) ] [ Γ(Σ_k α_k) / Γ(Σ_k α_k + N) ] ∏_k [ Γ(α_k + x_k) / Γ(α_k) ]
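A hedged sketch (mine, not from the slides): the Dirichlet/multinomial pair follows the same "add the counts" pattern, and its marginal is again a ratio of Gamma functions.

```python
from math import lgamma, exp

alpha = [1.0, 1.0, 2.0]          # Dirichlet hyper-parameters (arbitrary)
counts = [5, 3, 2]               # observed word counts, N = 10
N = sum(counts)

alpha_post = [a + x for a, x in zip(alpha, counts)]   # posterior hyper-parameters
total = sum(alpha_post)
print("posterior mean:", [a / total for a in alpha_post])

# marginal p(x | alpha), omitting the multinomial coefficient
log_marg = lgamma(sum(alpha)) - lgamma(sum(alpha) + N) \
           + sum(lgamma(a + x) - lgamma(a) for a, x in zip(alpha, counts))
print("marginal (up to the multinomial coefficient):", exp(log_marg))
```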
Slide 47: Gaussian/Gaussian-Gamma
- Gaussian distribution: Nor(x | μ, σ²) = (2πσ²)^{−1/2} exp( −(x − μ)² / (2σ²) )
- Gaussian prior on the mean (with precision τ = 1/σ²): μ | τ ~ Nor(μ₀, (κ₀ τ)^{−1})
- Gamma prior on the precision: Gam(τ | a, b) = [ bᵃ / Γ(a) ] τ^{a − 1} e^{−bτ}
- Posterior hyper-parameters (after observing x₁, …, x_n with mean x̄):

  κ_n = κ₀ + n;  μ_n = (κ₀ μ₀ + n x̄) / (κ₀ + n)
  a_n = a + n/2;  b_n = b + ½ Σ_i (x_i − x̄)² + κ₀ n (x̄ − μ₀)² / (2 (κ₀ + n))

- Marginal distribution: a non-standard Student's t distribution
Slide 48: Gamma Distribution

  Gam(x | a, b) = [ bᵃ / Γ(a) ] x^{a − 1} e^{−bx}

- Moments: E[x] = a/b, Var[x] = a/b²
[figure: Gamma densities for a ∈ {1, 2, 4} and b ∈ {1, 2, 4}]
Slide 49: Summary of Distributions

  Distribution | Domain    | Prior     | Parametric Form
  Binomial     | Binary    | Beta      | C(N, x) θˣ (1 − θ)^{N − x}
  Multinomial  | K classes | Dirichlet | [ N! / ∏_k x_k! ] ∏_k θ_k^{x_k}
  Beta         | [0, 1]    |           | [ Γ(a + b) / (Γ(a) Γ(b)) ] θ^{a−1} (1 − θ)^{b−1}
  Gamma        | [0, ∞)    |           | [ bᵃ / Γ(a) ] x^{a−1} e^{−bx}
  Dirichlet    | Simplex   |           | [ Γ(Σ_k α_k) / ∏_k Γ(α_k) ] ∏_k θ_k^{α_k−1}
  Gaussian     | Reals     | Nor/Gam   | (2πσ²)^{−1/2} exp( −(x − μ)² / (2σ²) )
  Cauchy       | Reals     |           | [ π γ (1 + ((x − x₀)/γ)²) ]^{−1}
  Student's t  | Reals     |           | ∝ (1 + x²/ν)^{−(ν + 1)/2}
Slide 50: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 51: Recall our summarization model
[graphical model: as on Slide 16]
- The problem was that we don't believe that it's okay for π to go to 0 or 1
- Solution? Put a prior on π!
- What's a good prior?

  z_mn | π ~ Bin(π);  π | a, b ~ Beta(a, b) ∝ π^{a − 1} (1 − π)^{b − 1}
Slide 52: Bayesianified summarization model
[graphical model: as before, with hyper-parameters a, b now pointing into π]

  w_mn | z_mn, β ~ Mult(β_G) if z_mn = 1, else Mult(β_{D_m})
  z_mn | π ~ Bin(π)
  π | a, b ~ Beta(a, b)

- The joint distribution is:

  p(π, z, w | a, b, β) = Beta(π | a, b) ∏_m ∏_n [ π p(w_mn | β_G) ]^{z_mn} [ (1 − π) p(w_mn | β_{D_m}) ]^{1 − z_mn}

- Integrating out π leaves ratios of Gamma functions over the counts Σ_mn z_mn and Σ_mn (1 − z_mn)
Slide 53: The Integration Problem
- In general, we want to compute something like:

  ∫ dθ p(θ) f(θ)

- Examples:
  - Summarization model:
    - p is the posterior of the hidden variables
    - f is the probability of the words
  - Classification model:
    - p is the posterior of the model parameters
    - f is the prediction function
  - Integral normalization:
    - p is uniform
    - f is a probability measure (i.e., unnormalized probability distribution)
See also: Ber[4.2, 4.3], MK[IV]
Slide 54: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 55: Maximum a Posteriori
- Not Bayesian, but sometimes effective
- Choose a, b by hand, proceed as before, but now we only need to maximize over π and β (the rest is unchanged)
[graphical model spec: w | z, β ~ Mult(β_G or β_D); z | π ~ Bin(π); π | a, b ~ Beta(a, b)]
- Amounts to a simple form of smoothing: maximizing p(π | a, b) ∏_mn p(z_mn | π) gives

  π̂ = ( a − 1 + Σ_mn E[z_mn] ) / ( a + b − 2 + Σ_mn 1 )

- Now, we can just maximize π as before, but with fake counts added proportional to the prior
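A hedged sketch (mine, not from the slides) of the "fake counts" view: under a Beta(a, b) prior, the MAP estimate of π is the posterior mode, which adds a − 1 and b − 1 pseudo-counts and keeps π away from the degenerate 0/1 solutions.

```python
def map_pi(n_general, n_specific, a=2.0, b=2.0):
    """MAP estimate of pi under a Beta(a, b) prior (mode of the posterior)."""
    return (n_general + a - 1) / (n_general + n_specific + a + b - 2)

print(map_pi(0, 100))    # even with no 'general' words, pi is pulled away from 0
print(map_pi(50, 50))    # balanced counts give pi = 0.5
```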
Slide 56: Maximum a Posteriori Temptation
- I don't want to specify my prior! Let me estimate it:
  - First, I find the a, b that maximize the marginal likelihood
  - Then, I use this a, b as the smoothing parameters for π
- This is NOT VALID! Why?
  - We are 'double counting' the evidence
Slide 57: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 58: Summing
- Approximate the integral by a discrete sum over a finite set of evaluation points R:

  ∫ dθ p(θ) f(θ) ≈ Σ_{θ ∈ R} p(θ) f(θ)
Slide 59: Summing in our Model
[graphical model spec: w | z, β ~ Mult(β_G or β_D); z | π ~ Bin(π); π | a, b ~ Beta(a, b)]
- Simply rewrite the integral as a sum: because the Beta prior is conjugate, π can be integrated out exactly, leaving a sum over the discrete z:

  p(w, z | a, b, β) = [ Γ(a + b) / (Γ(a) Γ(b)) ] [ Γ(a + Σ_mn z_mn) Γ(b + Σ_mn (1 − z_mn)) / Γ(a + b + Σ_mn 1) ]
                      × ∏_mn p(w_mn | β_G)^{z_mn} p(w_mn | β_{D_m})^{1 − z_mn}

- Now we can compute expectations of z easily and use these for the M-step of EM
Slide 60:
- Idea: let's choose R differently
Slide 61: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 62: Monte Carlo
- Draw the evaluation points from p itself: with R a set of samples θ ~ p(θ),

  ∫ dθ p(θ) f(θ) ≈ (1/|R|) Σ_{θ ∈ R} f(θ)

See also: MK[29], Was[24.2], And03
Slide 63: Uniform Sampling
- Pros:
  - Can now work in arbitrarily high dimensions (in theory)
  - Choice is now the size of R, not the width of windows
- Cons:
  - The number of samples required to get near the mode of a spiky distribution is huge: roughly the ratio of the volume of the whole space to the volume where p puts its mass
  - True distribution is rarely uniform
See also: MK[29], Was[24.2], And03
Slides 64-66: [content lost in extraction; these slides fall between Uniform Sampling and Rejection Sampling]
See also: MK[30], And03
Slide 67: Rejection Sampling
- Pros:
  - Again, if q is close to p, we will get good samples (i.e., few samples will be rejected)
- Cons:
  - Hard to construct such a q
  - With p and q zero-mean Gaussians and q just 1% wider than p (σ_q = 1.01 σ_p), for D = 1000 dimensions the acceptance rate is about 1/20,000
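A hedged sketch (mine, not from the slides) of rejection sampling in one dimension: draw from a proposal q, accept with probability p(x)/(c q(x)), where c q upper-bounds the (unnormalized) target p everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):                       # unnormalized target: Beta(8,4)-shaped density
    return x**7 * (1 - x)**3 if 0 < x < 1 else 0.0

c = 0.0023                      # any c >= max_x p(x) works (the max here is ~0.00222 at x = 0.7)
samples = []
while len(samples) < 1000:
    x = rng.uniform()           # proposal q = Uniform(0, 1)
    if rng.uniform() < p(x) / c:
        samples.append(x)       # accept
print("mean of accepted samples:", np.mean(samples))   # ~ 8/12
```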
Slide 68: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 69: Markov Chain Monte Carlo
- Monte Carlo methods suffer because the proposal density needs to be similar to the true density everywhere
- MCMC methods get around this problem by changing the proposal density after each sample
- General framework:
  - Choose a proposal density q(· | x) parameterized by location x
  - Initialize state x arbitrarily
  - Repeatedly sample by:
    - Propose a new state x' from q(x' | x)
    - Either accept or reject this new state
    - If accepted, set x = x'
- New problem: samples are no longer independent!
See also: MK[30], Was[24.4], And03
Slide 70: Metropolis-Hastings Sampling
- Accept new states with probability:

  a = min( 1, [ p(x') q(x | x') ] / [ p(x) q(x' | x) ] )

- Only put every Nth sample into R
[figure: target density p(x) with current state x₀ and proposal x', and the local proposal densities q(· | x₀), q(· | x')]
See also: MK[30], Was[24.4], And03
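A hedged sketch (mine, not from the slides) of Metropolis-Hastings with a symmetric Gaussian random-walk proposal, so the q terms cancel in the acceptance ratio; it also thins the chain, keeping only every Nth sample as the slide suggests.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):                           # unnormalized log target
    return -0.5 * (x - 3.0) ** 2        # N(3, 1), up to a constant

x, samples, thin = 0.0, [], 10          # keep only every 10th sample
for t in range(20000):
    x_new = x + rng.normal(scale=1.0)   # propose x' ~ q(x' | x)
    if np.log(rng.uniform()) < log_p(x_new) - log_p(x):   # accept w.p. min(1, p(x')/p(x))
        x = x_new
    if t % thin == 0:
        samples.append(x)
print("sample mean:", np.mean(samples))  # ~ 3
```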
Slide 71: MH in our Model
[graphical model spec: w | z, β ~ Mult(β_G or β_D); z | π ~ Bin(π); π | a, b ~ Beta(a, b)]
- Invent a proposal distribution q
- Or, condition on all variables:

  p(z_mn = 1 | π, w) ∝ π p(w_mn | β_G)
  p(z_mn = 0 | π, w) ∝ (1 − π) p(w_mn | β_{D_m})
  p(π | z, a, b) = Beta(π | a + Σ_mn z_mn, b + Σ_mn (1 − z_mn))

- Now we can compute expectations of z easily and use these for the M-step of EM
- Alternatively, we could propose values for the LMs (the β's) in the sampling
Slide 72: Metropolis-Hastings Sampling
- Pros:
  - No longer need to specify a universally good proposal distribution; only locally good
  - Simple proposal distributions can go far
- Cons:
  - Hard to tell how far to space samples:
    - Suppose we use spherical proposals of width ε; then we need at least T ≈ (σ_max / ε)² steps between roughly independent samples, where σ_max is the largest length-scale of the density p
  - Use the auto-correlation of the chain to track this
[figure: as on Slide 70]
Slide 73: Gibbs Sampling
- Defined only for multidimensional problems
- Useful when you can take out one variable and explicitly sample it conditioned on the rest
[figure: alternately sampling x₁ from p(x₁ | x₂) and x₂ from p(x₂ | x₁)]
See also: MK[30], Was[24.5], And03
Slide 74: Gibbs Sampling
- Typically our parameters are multidimensional: x = (x₁, …, x_D)
- If, for each i, we can draw a sample from:

  p(x_i | x₁, …, x_{i−1}, x_{i+1}, …, x_D)

  then we can use Gibbs sampling
- In graphical models, this conditional only depends on the Markov blanket of x_i (its parents, children, and children's other parents), e.g.:

  p(c | a, b, d, e, f) = p(c | Markov blanket of c)

[figure: a small graphical model with nodes a, b, c, d, e, f]
Slide 75: Gibbs in our Model
[graphical model spec: w | z, β ~ Mult(β_G or β_D); z | π ~ Bin(π); π | a, b ~ Beta(a, b)]
- Compute conditional probabilities:

  p(z_mn = 1 | w, π, β) = π p(w_mn | β_G) / [ π p(w_mn | β_G) + (1 − π) p(w_mn | β_{D_m}) ]
  π | z ~ Beta( a + Σ_mn z_mn, b + Σ_mn (1 − z_mn) )

- Now we can compute expectations of z easily and use these for the M-step of EM
- Alternatively, we could propose values for the LMs (the β's) in the sampling
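A hedged sketch (mine, not from the tutorial) of Gibbs sampling for the toy model on Slide 26, alternating z | π, w and π | z; the unigram models are held fixed here just to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [np.array([0, 1]), np.array([0, 2])]          # Doc1 = [A B], Doc2 = [A C]
beta_G = np.array([0.5, 0.25, 0.25])                 # fixed unigram models for the sketch
beta_D = np.array([[0.5, 0.5, 0.0], [0.5, 0.0, 0.5]])
a, b = 2.0, 2.0                                      # Beta prior on pi
pi, pis = 0.5, []

for t in range(5000):
    n1 = n0 = 0
    for m, doc in enumerate(docs):                   # resample each z_mn given pi
        p1 = pi * beta_G[doc]
        p0 = (1 - pi) * beta_D[m][doc]
        z = rng.uniform(size=len(doc)) < p1 / (p1 + p0)
        n1 += z.sum(); n0 += (~z).sum()
    pi = rng.beta(a + n1, b + n0)                    # resample pi given z
    pis.append(pi)
print("posterior mean of pi:", np.mean(pis[500:]))   # discard burn-in
```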
Slide 76: Gibbs Sampling
- Pros:
  - Designed to work in high dimensional spaces
  - Terribly simple to implement
  - Automatable
- Cons:
  - Hard to judge convergence; can require many, many samples to get an independent one (often worse than MH)
  - Only applicable when conditional distributions are 'nice'
    - (Though there are ways around this)
[figure: as on Slide 73]
Slide 77: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 78: Laplace (Saddlepoint) Approximation
- Idea: approximate the integrand by a quadratic (Taylor expansion) and use the normalizing constant from the resulting Gaussian distribution:

  ∫ dx p(x) f(x) = ∫ dx g(x),  with ln g(x) approximated by a quadratic around a mode x₀

[figure: g(x) = p(x) f(x) and its Gaussian approximation q(x) centered at the mode x₀]
See also: MK[27]
Slide 79: Laplace Approximation
- Find a mode x₀ of the high-dimensional distribution g
- Approximate ln g(x) by a Taylor expansion around this mode:

  ln g(x) ≈ ln g(x₀) − ½ (x − x₀)ᵀ A (x − x₀)

- Compute the matrix A of second derivatives: A = −∇∇ ln g(x) |_{x = x₀}
- The exponential form is a Gaussian distribution; use the Gaussian normalizing constant:

  ∫ dx g(x) ≈ g(x₀) (2π)^{D/2} (det A)^{−1/2}

[figure: g(x) and its Gaussian approximation q(x)]
Slide 80: Laplace in our Model
[graphical model spec: as before]
- Compute second derivatives of the log of the integrand g(π) (the Beta prior times the likelihood terms) with respect to π; with n₁ = Σ_mn z_mn and n₀ = Σ_mn (1 − z_mn):

  ∂ ln g / ∂π = (a − 1 + n₁)/π − (b − 1 + n₀)/(1 − π)
  ∂² ln g / ∂π² = −(a − 1 + n₁)/π² − (b − 1 + n₀)/(1 − π)²

- Set A = −∂² ln g/∂π² at the mode and apply the Gaussian normalizing constant
Slide 81: Laplace Approximation
- Pros:
  - Deterministic
  - Efficient if A is of a suitable form (i.e., diagonal or block-diagonal)
  - Can apply transformations to make the quadratic approximation more reasonable
- Cons:
  - Poor fit for multimodal distributions
  - Often, det A cannot be found efficiently

  ∫ dx g(x) ≈ g(x₀) (2π)^{D/2} (det A)^{−1/2}

[figure: as on Slide 79]
Slide 82: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 83: Variational Approximation
- Basic idea: replace the intractable p with a tractable q
- Old problem:
  - We cannot come up with a good, single q to approximate p
- Key idea:
  - Consider a family Q of distributions q(·; φ) indexed by 'variational parameters' φ
  - Choose the member q from Q that is closest to p
- New problems:
  - How do we choose Q?
  - How do we measure 'closeness' between q and p?
See also: MK[33], Wain03, Min03
Slide 84: Recall EM and Jensen's Inequality
- Jensen gives us:

  log p(x) = log Σ_z q(z) [ p(x, z) / q(z) ] ≥ Σ_z q(z) log [ p(x, z) / q(z) ] =: L(q)

- Where we chose q(z) = p(z | x) to turn the inequality into an equality. But for any choice of q we can also compute the gap:

  log p(x) − L(q) = Σ_z q(z) log [ q(z) / p(z | x) ] = KL( q(z) ‖ p(z | x) )

- So maximizing the lower bound L is the same as minimizing the KL divergence from q to the true posterior
Slide 85: Variational EM
- Parameterize q and directly optimize the bound:

  L(q, θ) = E_q[ log p(x, z | θ) ] − E_q[ log q(z) ]

- Iterate:
  - V-Step: Compute variational parameters to minimize KL(q ‖ p)
  - E-Step: Compute expectations of hidden variables wrt q
  - M-Step: Maximize L wrt the true parameters θ
- Art: inventing q so that this is all tractable
Hal Daumé III ([email protected])10:16
Variational: Choosing Q
�
Mixture model:
� � ��� � � � �� � � � �
γ
� � � �
γ
� � γ
� � � � ��� � �� � � � �� �
��� �� ���� � � � � � � � �� �
� � "! �� # �� � $
% & ')(+*'-, . /10 2 ( 3 ')(+*', 4
5 & ' % . / 687 3 ' % 4z
M,N
9;:<>=@?<BA C
� �ED � � � ��F ��-G � �D � �
γ� �F � �-G �
γ� � F � γ
� �-G �H�I J KD H J K
LNMPO Q R �D H J%)ST O
Key: and z arenow not tied in the
q distribution!
U
w
zN
VW
V
X X
M
G D
Y
Slide 87: Variational EM
- Step 1: Write out the log likelihood function L = E_q[ log p(π, z, w | a, b, β) ] − E_q[ log q(π, z) ], expanding every term:

  L = E_q[ log Beta(π | a, b) ]
    + Σ_mn E_q[z_mn] ( E_q[log π] + log β_G(w_mn) )
    + Σ_mn (1 − E_q[z_mn]) ( E_q[log(1 − π)] + log β_{D_m}(w_mn) )
    − E_q[ log Beta(π | γ₁, γ₂) ] − Σ_mn E_q[ log Bin(z_mn | φ_mn) ]
Slide 88: Variational EM
- Step 2: Simplify and compute expectations; under the factored q, z_mn and π are independent, so E_q[z_mn] = φ_mn, and the only non-trivial expectations left are E_q[log π] and E_q[log(1 − π)]
Slide 89: Computing Expectations
- Thanks to sufficient statistics! For q(π) = Beta(π | γ₁, γ₂):

  E_q[ log π ] = Ψ(γ₁) − Ψ(γ₁ + γ₂)
  E_q[ log(1 − π) ] = Ψ(γ₂) − Ψ(γ₁ + γ₂)

  where Ψ is the digamma function
Slide 90: Optimizing Variational Parameters
- Step 3: Given the full expression for L, differentiate wrt the variational parameters; setting ∂L/∂φ_mn = 0 gives

  φ_mn ∝ β_G(w_mn) exp( E_q[log π] );  1 − φ_mn ∝ β_{D_m}(w_mn) exp( E_q[log(1 − π)] )
Slide 91: Optimizing Variational Parameters
- Setting ∂L/∂γ = 0 recovers a Beta update with expected counts:

  γ₁ = a + Σ_mn φ_mn;  γ₂ = b + Σ_mn (1 − φ_mn)
Slide 92: Optimize Model Parameters
- Step 4: Optimize the model parameters; setting ∂L/∂β = 0 (with the simplex constraint) gives expected-count re-estimates, e.g.

  β_G(w) ∝ Σ_mn φ_mn 1[ w_mn = w ]
Slide 93: Optimize Model Parameters
- Finally, a and b: the terms of L involving them,

  E_q[ log Beta(π | a, b) ] = log Γ(a + b) − log Γ(a) − log Γ(b) + (a − 1) E_q[log π] + (b − 1) E_q[log(1 − π)],

  have no closed-form maximizer, so solve using optimization techniques
Slide 94: VEM in our Model
[graphical model spec: as before]
- Iterate:
  - Optimize variational parameters:

    φ_mn ∝ β_G(w_mn) exp( Ψ(γ₁) − Ψ(γ₁ + γ₂) );  1 − φ_mn ∝ β_{D_m}(w_mn) exp( Ψ(γ₂) − Ψ(γ₁ + γ₂) )
    γ₁ = a + Σ_mn φ_mn;  γ₂ = b + Σ_mn (1 − φ_mn)

  - Optimize model parameters: β by expected counts; a, b by generic optimization techniques
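A hedged sketch (mine, not from the tutorial) of these mean-field updates for the toy model, with the β's held fixed; it assumes the standard Beta/Bernoulli factored family q(π) = Beta(γ₁, γ₂), q(z_mn) = Bernoulli(φ_mn).

```python
import numpy as np
from scipy.special import digamma

docs = [np.array([0, 1]), np.array([0, 2])]            # Doc1 = [A B], Doc2 = [A C]
beta_G = np.array([0.5, 0.25, 0.25])                   # fixed unigram models for the sketch
beta_D = np.array([[0.5, 0.5, 0.0], [0.5, 0.0, 0.5]])
a, b = 2.0, 2.0
g1, g2 = a, b                                          # variational Beta parameters

for it in range(100):
    # V-step: phi_mn proportional to beta * exp(E_q[log pi]) (and its complement)
    phis = []
    for m, doc in enumerate(docs):
        p1 = beta_G[doc] * np.exp(digamma(g1) - digamma(g1 + g2))
        p0 = beta_D[m][doc] * np.exp(digamma(g2) - digamma(g1 + g2))
        phis.append(p1 / (p1 + p0))
    phi = np.concatenate(phis)
    g1, g2 = a + phi.sum(), b + (1 - phi).sum()        # update q(pi)'s parameters

print("E_q[pi] =", g1 / (g1 + g2))
```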
Slide 95: Variational EM Summed Up
- Steps:
  - Write down the conditional likelihood and choose an approximating distribution (e.g., by factoring everything) with variational parameters
  - Iterate between optimizing the VPs and the model parameters
- Pros:
  - Efficient, deterministic, often quite accurate
- Cons:
  - At its heart, still a mode-based technique
  - Often underestimates the spread of a distribution
  - Approximation is local
Slide 96: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 97: Expectation Propagation
- Basic idea: replace the intractable p with a product of tractable terms
- Generally we want to compute an integral of a product of factors:

  ∫ dθ ∏_i t_i(θ),  where typically t₀ is the prior

- Approximate each factor: t_i(θ) ≈ t̃_i(θ)
- The integral is approximated by ∫ dθ ∏_i t̃_i(θ); the approximate terms should be chosen to make the EP operations tractable
See also: Min01
Slide 98: Expectation Propagation Algorithm
- Initialize the approximate terms t̃_i
- Compute the approximate posterior:

  q(θ) = ∏_i t̃_i(θ) / Z

- Iterate:
  - Select a t̃_i to update
  - Delete t̃_i from the posterior by dividing and renormalizing: q^{\i}(θ) ∝ q(θ) / t̃_i(θ)
  - Match: combine q^{\i} with the exact term t_i and minimize the KL divergence to get a new posterior with the right marginals
  - Update: t̃_i ∝ q^{new}(θ) / q^{\i}(θ)
Slide 99: EP in our Model
[graphical model spec: as before]
- Our integral looks like:

  ∫ dπ Beta(π | a, b) ∏_m ∏_n [ π p(w_mn | β_G) + (1 − π) p(w_mn | β_{D_m}) ]

- We approximate the non-prior terms by Beta-like factors: t̃_mn(π) ∝ π^{ã_mn} (1 − π)^{b̃_mn}
- The approximate posterior then resembles a Beta/Dirichlet with parameters a + Σ_mn ã_mn and b + Σ_mn b̃_mn
Slide 100: EP in our Model
[graphical model spec: as before]
- Delete: divide the current Beta-form posterior by t̃_mn and renormalize
- Match: combine with the exact term π β_G(w_mn) + (1 − π) β_{D_m}(w_mn) and match moments
- Choose the new parameters so the moments agree; the update involves the approximate, old, and new normalizations (ratios of Gamma functions), the difference between actual and expected counts, and a learning rate
- Update t̃_mn accordingly
Slide 101: Summing up EP
- Approximate the terms by a product; iteratively minimize each term's KL divergence to the true posterior
- Pros:
  - Efficient, accurate
  - Global approximation to the integral
- Cons:
  - Typically requires lots of human effort
  - Only gives an approximation to the integral (not a bound)
  - Sometimes difficult to write distributions as products
  - Doesn't necessarily converge
Slide 102: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 103: Summary of Methods

  Method      | Pros                                           | Cons
  MAP         | Just as easy as EM                             | Not Bayesian, overfits; only simple models
  Summing     | Easy to implement; arbitrarily accurate        | Bounded regions; impossible for d > 2
  Monte Carlo | Simple; few tunable params                     | Proposal dist. is hard; lots of samples required
  MCMC        | Should work in d >> 2; no need for global PDs  | Hard to discover convergence; lots of useless computation
  Laplace     | Not much harder than MAP; efficient for good A | Poor fit for multimodal; inefficient for bad A
  Variational | Efficient, deterministic; often very accurate  | Still (massively) mode-based; local approximations
  EP          | Efficient, deterministic; very accurate        | Lots of human effort; weak convergence guarantees
Slide 104: Message Passing Algorithms
- Two major choices:
  - What approximating distribution should we use? (factorized vs. exponential family)
  - What cost should we minimize? (KL(q ‖ p), KL(p ‖ q), or an α-divergence D_α(p ‖ q))
    - Mean Field: factorized, KL(q ‖ p)
    - Struct MF: exp family, KL(q ‖ p)
    - BP: factorized, KL(p ‖ q)
    - EP: exp family, KL(p ‖ q)
    - Frac BP: factorized, D_α(p ‖ q)
    - Tree Rep: factorized, D_α(p ‖ q)
    - Power EP: exp family, D_α(p ‖ q)

  D_α(p ‖ q) = [ ∫ dx ( α p(x) + (1 − α) q(x) − p(x)^α q(x)^{1 − α} ) ] / ( α (1 − α) )

  (KL(q ‖ p) and KL(p ‖ q) are the limiting cases of D_α)
See also: Min05
Slide 105: Other Integration Strategies
- Message passing:
  - Generalized BP
  - Iterated conditional modes
  - Max-product belief revision
  - TRW-max-product
  - Laplace propagation
  - Penniless propagation
  - Bound propagation
  - Variational message passing
- Non-message-passing:
  - Contrastive free energy
  - Bethe free energy
  - Spline integration
Slide 106: Empirical Evaluation of Methods
- Query-focused summarization model:
[graphical model: the summarization model extended with a query plate Q of query words w; each word's indicator z now chooses among the general English model, the document model, and a query model]
Slide 107: Evaluation Data
- All TREC data
  - Queries 51-350 and 401-450 (35k words)
  - All relevant documents (43k docs, 2.1m sents, 65.8m words)
- Asked 7 annotators to select up to 4 sentences for an extract
  - Each annotated 25 queries (166 total)
- Systems produce ranked lists of sentences
- Compared on mean average precision, mean reciprocal rank and precision at 2
- Computation time:
  - MAP (2 hours), Summing (2 days), Monte Carlo (2 days), MCMC (3 days), Laplace (5 hours), Variational (4 hours), EP (2.5 hours)
Slide 108: Evaluation Results
[figure: bar chart of scores for Random, Position, IR, MAP (2 hours), Summing (2 days), Monte Carlo (2 days), MCMC (3 days), Laplace (3.5 hours), Variational (4 hours), EP (2.5 hours)]
Slide 109: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 110: Bayesian Discriminative Models
- Take a neural network and put a prior on the weights: the likelihood p(y | x, w) is defined by the network output f(x; w), with a Gaussian prior on w
- Computation requires a bit of calculus on Gaussians
- Then use Laplace, variational or MCMC to perform the integration over the posterior:

  p(y* | x*, data) = ∫ dw p(w | data) p(y* | x*, w)

See also: MK[38]
Slide 111: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 112: Gaussian Processes
- Idea: instead of placing a prior over weights, place it over the function directly
- Gaussian process:
  - A collection of r.v.s such that any finite sample is jointly Gaussian
  - Uniquely specified by a mean distribution and a covariance function:

    ( f(x₁), …, f(x_n) ) ~ Nor( m, K ),  with K_ij = k(x_i, x_j)

See also: MK[38]
Slide 113: Computing with GPs
- Compute the new covariance matrix over training and test points; conditioning the joint Gaussian on the observed targets y gives

  expected value:  E[y*] = k*ᵀ C⁻¹ y
  error bars:      Var[y*] = k(x*, x*) − k*ᵀ C⁻¹ k*

  where C is the covariance of the training targets and k* is the vector of covariances between x* and the training inputs
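A hedged sketch (mine, not from the slides) of exactly these two formulas, using an RBF covariance; the data and length-scale are arbitrary.

```python
import numpy as np

def k(x1, x2, gamma=10.0):                      # RBF covariance function
    return np.exp(-gamma * (x1[:, None] - x2[None, :]) ** 2)

X = np.array([0.0, 0.3, 0.6, 1.0])              # training inputs
y = np.sin(2 * np.pi * X)                       # training targets
noise = 1e-2

C = k(X, X) + noise * np.eye(len(X))            # covariance of observed targets
Xs = np.array([0.15, 0.45, 0.8])                # test inputs
Ks = k(X, Xs)                                   # cross-covariances k*

mean = Ks.T @ np.linalg.solve(C, y)             # expected values
var = 1.0 - np.einsum('ij,ij->j', Ks, np.linalg.solve(C, Ks))  # error bars (k** = 1 for RBF)
print("mean:", mean)
print("+/- :", np.sqrt(np.maximum(var, 0)))
```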
Slide 114: Efficiently Computing with GPs
- Inverting the N×N covariance matrix C is cubic: O(N³)
- Idea:
  - Iteratively add training points to maximize information gain (or some other criterion)
  - Only compute the covariance over those
- Leads to the 'Informative Vector Machine (IVM)'
See also: Law03
Slide 115: Example Covariance Functions
- Linear: k(x, x') = xᵀ x'
- Polynomial: k(x, x') = (1 + xᵀ x')^d
- RBF: k(x, x') = exp( −γ ‖x − x'‖² )
- Combo: weighted sums and products of the above
- Bayesian advantage: we can tune all of these parameters!
Slide 116: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 117: Dirichlet Processes
- Suppose we don't want to limit the number of components of a mixture model
- Example:
  - Each word in a document is drawn from a topic
  - We don't know how many topics there are
  - Allow the number of topics to grow with the data
- A Dirichlet Process is a collection of r.v.s such that any finite sample is jointly Dirichlet
- Parameterized by a precision α and a mean distribution G₀: G ~ DP(α, G₀)
- A draw from a DP is, itself, a distribution
See also: Neal98
Slide 118: Dirichlet Distribution vs. Process
- Suppose we place a DP prior: G ~ DP(α, G₀)
- Now we observe samples θ₁, …, θ_n drawn from G
- The posterior (after integrating out G) is again a DP:

  G | θ₁, …, θ_n ~ DP( α + n, [ α G₀ + Σ_i δ_{θ_i} ] / (α + n) )

- This is exactly like a Dirichlet distribution: the mean distribution plays the role of the parameter vector, and the observations add counts
Slide 119: Polya Urns
- Start with an urn with a single black ball
- Repeatedly draw balls:
  - If the drawn ball is black, replace it and put in a ball of a new color
  - If the ball is not black, replace it and put in a ball of the same color
- The resulting distribution is given by a DP: the predictive distribution of the next draw is

  p(θ_{n+1} | θ₁, …, θ_n) = [ α G₀ + Σ_i δ_{θ_i} ] / (α + n)
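A hedged sketch (mine, not from the slides) of the urn scheme as code: with probability α/(α + n) draw a new value from the mean distribution G₀, otherwise copy a uniformly chosen past draw.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_draws = 1.0, 200
draws = []
for n in range(n_draws):
    if rng.uniform() < alpha / (alpha + n):
        draws.append(rng.normal())                 # new color: fresh draw from G0 = N(0, 1)
    else:
        draws.append(draws[rng.integers(n)])       # old color: copy a past draw
print("number of distinct values:", len(set(draws)))   # grows slowly (~ alpha log n)
```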
Slide 120: Gibbs Sampling for the DP
- Suppose we have an infinite mixture model over words, with per-component multinomials drawn from the DP mean
- Key: Mult is conjugate to the DP mean (Dir(α))
- Gibbs sampling:
  - Assign a latent class c_n to each data point
  - Repeat:
    - Resample each class assignment c_n by:

      p(c_n = k | c_{−n}, w) ∝ N^{−n}_k p(w_n | component k)   for existing components k
      p(c_n = new | c_{−n}, w) ∝ α p(w_n | base distribution)

    - Resample each component's parameters from its conditional given the words assigned to it
Slide 121: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 122: Bayesian Decision Theory
- Bayesian statistics tells us how to compute distributions
- What if we want to actually do something with a distribution?
- Define a loss function L(θ, a): what do we expect to lose by taking action a when the truth is θ
- Choose the action to minimize the Bayesian expected loss (over all parameters, with the data known):

  ρ(a) = ∫ dθ p(θ | x) L(θ, a)

- Frequentist approach: define a decision rule δ(x)
- Define the risk of the decision rule (over all data sets, with known parameters) as:

  R(θ, δ) = ∫ dx p(x | θ) L(θ, δ(x))

- Now, search for admissible decision rules
See also: Ber[4.4]
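A hedged sketch (mine, not from the slides) of minimizing Bayesian expected loss over a discrete action set: a posterior over a coin's bias, and an asymmetric loss for declaring the coin "biased" or "fair"; the loss values are arbitrary.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)
post = theta**7 * (1 - theta)**3                 # unnormalized Beta(8,4) posterior
post /= post.sum()

def loss(th, action):                            # L(theta, a)
    if action == "biased":
        return np.where(np.abs(th - 0.5) < 0.1, 5.0, 0.0)   # cost of a false alarm
    return np.where(np.abs(th - 0.5) < 0.1, 0.0, 1.0)       # cost of a miss

for a in ("biased", "fair"):
    print(a, "expected loss:", (post * loss(theta, a)).sum())  # pick the smaller
```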
Slide 123: Tutorial Outline (repeated navigation slide; see Slide 2)
Slide 124: Bayes in Action (NLP/IR/Text)
- D. Blei, A. Ng, M. Jordan. Latent Dirichlet allocation. JMLR 2003.
- T. Griffiths, M. Steyvers, D. Blei, J. Tenenbaum. Integrating topics and syntax. NIPS 2004.
- A. McCallum, A. Corrada-Emmanuel, X. Wang. Topic and Role Discovery in Social Networks. IJCAI 2005.
- Y. Zhang, J. Callan, T. Minka. Novelty and Redundancy Detection in Adaptive Filtering. SIGIR 2002.
- T. Minka. Bayesian conditional random fields. AISTATS 2005.
- K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, M. Jordan. Matching words and pictures. JMLR 2003.
Slide 125: For Further Information (Books)
- James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, 1985. [Ber]
- David MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003. [MK]
- Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2003. [Was]
- Christopher Bishop. "The new book" (Pattern Recognition and Machine Learning). 2006.
Slide 126: For Further Information (Tutorials)
- C. Andrieu, N. de Freitas, A. Doucet, M. Jordan. An Introduction to MCMC for Machine Learning. Machine Learning, 2003. [And03]
- M. Wainwright, M. Jordan. Graphical models, exponential families and variational inference. UC Berkeley Statistics TR #649, 2003. [Wain03]
- K. Murphy. A Brief Introduction to Graphical Models and Bayesian Networks. www.cs.ubc.ca/~murphyk/Bayes/bayes.html [Murphy]
- T. Minka. Using lower bounds to approximate integrals. www.research.microsoft.com/~minka/papers/rem.html, 2003. [Min03]
Slide 127: Other References
- N. Lawrence. Fast sparse Gaussian process methods: the informative vector machine. NIPS 2003. [Law03]
- T. Minka. Expectation Propagation for Approximate Bayesian Inference. UAI 2001. [Min01]
- T. Minka. Divergence Measures and Message Passing. AISTATS 2005. [Min05]
- R. Neal. Markov chain sampling methods for Dirichlet process mixture models. TR 9815, Dept. of Statistics, University of Toronto. [Neal98]
Slide 128: Thank you!
Questions? Comments?