TRANSCRIPT
Review: Probability Theory
BM1 Advanced Natural Language Processing
Alexander Koller
20 October 2015
Outline
• Random variables and probabilities.
• Joint and conditional probabilities, statistical independence, Bayes’ rule.
• Expected values.
• Statistical models, parameter estimation, probabilities vs. frequencies.
Elementary prob theory
• A random variable X, Y, … describes the possible outcomes of a random event and the probability of each outcome. ‣ Coin flip: outcomes A = {H, T}
Fair coin: P(X = H) = P(X = T) = 0.5
Weighted coin: might have P(X = H) > P(X = T)
‣ Dice rolling: outcomes A = {1, 2, 3, 4, 5, 6} Fair die: P(X = 1) = … = P(X = 6) = 1/6
• Probability distribution of rv X is the function a ↦ P(X = a).
a P(X=a)
H 0.5
T 0.5
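The table above can be sketched as a plain mapping from outcomes to probabilities; a minimal Python check of the probability axioms (the names `fair_coin` and `is_valid_pd` are illustrative, not from the slides):

```python
# A discrete probability distribution as a mapping outcome -> probability.
fair_coin = {"H": 0.5, "T": 0.5}

def is_valid_pd(pd):
    """Check the axioms: every probability in [0, 1], total mass 1."""
    return (all(0.0 <= p <= 1.0 for p in pd.values())
            and abs(sum(pd.values()) - 1.0) < 1e-9)
```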
Events
• The set A of atomic events is called the sample space.
• We mostly work with finite A, i.e. discrete random variables. There are also continuous rvs.
• We can assign probability to complex events: ‣ P(X = 1 or X = 2): prob that X takes value 1 or 2.
‣ P(X ≥ 4): prob that X takes value 4, 5, or 6.
‣ P(X = 1 and Y = 2): prob that rv X takes value 1 and rv Y takes value 2.
Prob axioms
• The following axioms hold of probabilities: ‣ 0 ≤ P(X = a) ≤ 1 for all events X = a
‣ P(X ∈ A) = 1; P(X ∈ ∅) = 0
‣ P(X ∈ B) = P(X = a1) + … + P(X = an) for B = {a1, …, an} ⊆ A
• Example: if pd of X is uniform with N outcomes, i.e. P(X = ai) = 1/N for all i, then P(X ∈ B) = |B| / N.
Consequences
• The following rules for calculating with probs follow directly from the axioms. ‣ Union:
P(X ∈ B ∪ C) = P(X ∈ B) + P(X ∈ C) - P(X ∈ B ∩ C)
‣ In particular, if B and C are disjoint (and only then), P(X ∈ B ∪ C) = P(X ∈ B) + P(X ∈ C)
‣ Complement: P(X ∉ B) = P(X ∈ A - B) = 1 - P(X ∈ B).
• For simplicity, we will now restrict the presentation to events X = a. Basically everything generalizes to events X ∈ B.
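The union and complement rules can be verified mechanically on a fair die; a small sketch using exact fractions (all names are illustrative):

```python
from fractions import Fraction

# Uniform distribution over a fair die: P(X in B) = |B| / N.
A = {1, 2, 3, 4, 5, 6}

def P(B):
    return Fraction(len(B & A), len(A))

B = {1, 2, 3}          # the event X <= 3
C = {2, 3, 4}          # the event 2 <= X <= 4

# Union rule: P(B u C) = P(B) + P(C) - P(B n C)
union = P(B) + P(C) - P(B & C)
# Complement rule: P(X not in B) = 1 - P(B)
complement = 1 - P(B)
```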
Joint probs
• We are very often interested in the prob of two events X = a and Y = b occurring together, i.e. the joint probability P(X = a, Y = b). ‣ e.g. X = first roll of die, Y = second roll of die
• If we know joint pd, we can recover individual pds by marginalization. Very important!
P(X = a) = Σ_b P(X = a, Y = b)
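Marginalization is a one-line sum in code; a sketch assuming a joint distribution over two independent fair coin flips (the independence is only for illustration):

```python
from fractions import Fraction
from collections import defaultdict

# Joint pd of two independent fair coin flips X and Y.
joint = {(x, y): Fraction(1, 4) for x in "HT" for y in "HT"}

# Marginalization: P(X = a) = sum_b P(X = a, Y = b)
marginal_X = defaultdict(Fraction)
for (a, b), p in joint.items():
    marginal_X[a] += p
```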
Conditional probs
• Joint probs are trickier than they seem because the outcome of X may influence the outcome of Y. ‣ X: draw first card from a deck of 52 cards
‣ Y: after this, draw second card from deck of cards
‣ P(Y is an ace | X is not an ace) = 4/51
P(Y is an ace | X is an ace) = 3/51
• We write P(Y = a | X = b) for the conditional prob that Y has outcome a if we know that X has outcome b.
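The card numbers above can be checked by brute-force enumeration over ordered pairs of distinct cards; a sketch (helper names are illustrative):

```python
from fractions import Fraction
from itertools import permutations

# A 52-card deck: 4 aces and 48 non-aces.
deck = ["A"] * 4 + ["x"] * 48
pairs = list(permutations(range(52), 2))   # ordered (first, second) draws

def cond_prob(second_pred, first_pred):
    """P(second draw satisfies pred | first draw satisfies pred), by counting."""
    cond = [(i, j) for i, j in pairs if first_pred(deck[i])]
    hits = [(i, j) for i, j in cond if second_pred(deck[j])]
    return Fraction(len(hits), len(cond))

p_ace_given_no_ace = cond_prob(lambda c: c == "A", lambda c: c != "A")
p_ace_given_ace = cond_prob(lambda c: c == "A", lambda c: c == "A")
```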
Conditional probs
• Connection between conditional and joint probs: P(X = a, Y = b) = P(Y = b | X = a) ∙ P(X = a) also: P(X = a, Y = b) = P(X = a | Y = b) ∙ P(Y = b)
• Thus “define” conditional prob as:
P(Y = b | X = a) = P(X = a, Y = b) / P(X = a)
                 = P(X = a, Y = b) / Σ_{b' ∈ B} P(X = a, Y = b')
(The denominator P(X = a) is obtained by marginalization.)
Bayes’ Rule
• Important consequence of the joint/conditional connection:
P(X = a | Y = b) = P(Y = b | X = a) · P(X = a) / P(Y = b)
• Allows us to “reverse” a conditional. ‣ Say you meet someone with long hair (Y = L); how probable is it that the person is a woman (X = W)?
‣ Assume P(Y = L | X = W) = 0.75, P(Y = L | X = M) = 0.15; also P(X = W) = P(X = M) = 0.5.
‣ Use Bayes’ rule to calculate P(X = W | Y = L) ≈ 0.83.
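The long-hair example works out as follows; a sketch with the slide’s numbers, computing P(Y = L) by marginalization:

```python
# Values taken from the slide.
p_L_given_W, p_L_given_M = 0.75, 0.15
p_W, p_M = 0.5, 0.5

# P(Y = L) by marginalizing over X.
p_L = p_L_given_W * p_W + p_L_given_M * p_M

# Bayes' rule: P(X = W | Y = L) = P(Y = L | X = W) * P(X = W) / P(Y = L)
p_W_given_L = p_L_given_W * p_W / p_L
```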
Statistical independence
• Very important special case: ‣ P(X = a | Y = b) = P(X = a)
‣ equivalently, P(X = a, Y = b) = P(X = a) ∙ P(Y = b)
• This means that outcome of Y has no influence on outcome of X. Events are statistically independent.
• Typical examples: coins, dice.
• Many events in natural language not independent, but we pretend they are to simplify models.
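For two fair dice, the product test for independence passes for every outcome pair; a minimal check (all names are illustrative):

```python
from fractions import Fraction
from itertools import product

# Two fair dice: the joint pd is uniform over 36 outcome pairs.
outcomes = range(1, 7)
joint = {(a, b): Fraction(1, 36) for a, b in product(outcomes, outcomes)}
p_X = {a: Fraction(1, 6) for a in outcomes}
p_Y = dict(p_X)

# Independence: P(X = a, Y = b) = P(X = a) * P(Y = b) for all a, b.
independent = all(joint[a, b] == p_X[a] * p_Y[b] for a, b in joint)
```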
Example use of independence
• The following rule holds for complex joint events: P(X1 = a1, X2 = a2, …, Xn = an) = P(a1) ∙ P(a2 | a1) ∙ … ∙ P(an | a1, …, an-1)
• In practice, it is typically hard to estimate things like P(an | a1, …, an-1) well because not many training examples satisfy complex condition.
• Thus pretend all are independent. Then we have P(a1, …, an) ≈ P(a1) ∙ … ∙ P(an).
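On a tiny joint distribution one can see that the chain rule is exact while the independence approximation generally is not; a sketch (the numbers are invented for illustration):

```python
from fractions import Fraction

# A made-up joint distribution over two dependent variables X1, X2.
joint = {("a", "b"): Fraction(4, 10), ("a", "c"): Fraction(1, 10),
         ("d", "b"): Fraction(1, 10), ("d", "c"): Fraction(4, 10)}

# Marginals of X1 and X2.
p1 = {x: sum(p for (a, _), p in joint.items() if a == x) for x in "ad"}
p2 = {y: sum(p for (_, b), p in joint.items() if b == y) for y in "bc"}

def cond(y, x):                      # P(X2 = y | X1 = x)
    return joint[x, y] / p1[x]

# Chain rule is exact: P(x, y) = P(x) * P(y | x)
chain = p1["a"] * cond("b", "a")
# Independence approximation: P(x, y) ~ P(x) * P(y)
approx = p1["a"] * p2["b"]
```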
Expected values
• Frequentist interpretation of prob: if P(X = a) = p, and we repeat the experiment N times, then we see outcome “a” roughly p ∙ N times.
• Now imagine each outcome “a” comes with reward R(a). After N rounds of playing the game, what reward can we (roughly) expect?
• Measured by expected value:
E_P[R] = Σ_{a ∈ A} P(X = a) · R(a)
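A minimal sketch of the expected-value formula, taking a fair die with the face value as reward R(a) (an illustrative choice of R):

```python
from fractions import Fraction

# Fair die, reward R(a) = a (the face value).
P = {a: Fraction(1, 6) for a in range(1, 7)}
R = lambda a: a

# E_P[R] = sum_a P(X = a) * R(a)
expected = sum(p * R(a) for a, p in P.items())
```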
Expected values
• Originally invented for theory of games of chance. ‣ Casino should design game of Roulette such that expected win for player is less than zero. (And it is.)
• In NLP, we will often not be sure what the correct analysis, e.g. parse tree, is; but we will have a pd over parse trees. Then use expected values to calculate relevant statistics.
Statistical models
• We want to use prob theory to estimate a model of a generating process from observations about its outcomes.
• Example: we flip a coin 100 times and observe H 61 times. Should we believe that it is a fair coin? ‣ observation: 61x H, 39x T
‣ model: assume rv X follows a Bernoulli distribution, i.e. X has two outcomes, and there is a value p such that P(X = H) = p and P(X = T) = 1 - p.
‣ want to estimate the parameter p of this model
Frequencies
• We distinguish two types of frequencies when we make observations: ‣ absolute frequency C(a) of outcome a is the total number of times we have seen outcome a in our experiments
‣ relative frequency f(a) is the relative proportion of experiments that came out “a”, i.e. f(a) = C(a) / N, where N is the total number of observations.
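Both kinds of frequency are one-liners with `collections.Counter`; a sketch using the 61-H / 39-T coin sample from the next slide:

```python
from collections import Counter

# Observations: 61 heads and 39 tails out of N = 100 flips.
observations = ["H"] * 61 + ["T"] * 39
C = Counter(observations)        # absolute frequencies C(a)
N = len(observations)
f = {a: C[a] / N for a in C}     # relative frequencies f(a) = C(a) / N
```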
Maximum Likelihood Estimation
• We want to estimate the parameters of our model from frequency observations. There are many ways to do this. For now, we focus on maximum likelihood estimation, MLE.
• Likelihood L(O ; p) is the probability of our model generating the observations O, given parameter values p.
• Goal: Find value for parameters that maximizes the likelihood.
Likelihood functions
[Figure: the likelihood L(O ; p) plotted as a function of the parameter p, one curve per observed count C(H). (Wikipedia page on MLE; licensed from Casp11 under CC BY-SA 3.0)]
ML Estimation
• For Bernoulli and multinomial models, it is extremely easy to estimate the parameters that maximize the likelihood: ‣ P(X = a) = f(a)
‣ in the coin example above, just take p = f(H)
• Why is this?
MLE for Bernoulli model
• Let’s say we had training data C of size N, and we had N_H observations of H and N_T observations of T.
likelihood:
L(C) = ∏_{i=1}^{N} P(w_i | p) = p^{N_H} · (1 − p)^{N_T}
log-likelihood:
ℓ(C) = log L(C) = Σ_{i=1}^{N} log P(w_i | p) = N_H · log p + N_T · log(1 − p)
Logarithm is monotonic
Observation: If x1 > x2, then log(x1) > log(x2).
Therefore, argmax_p L(C) = argmax_p ℓ(C).
Maximizing the log-likelihood
• Find maximum of function by setting derivative to zero:
ℓ(C) = N_H · log p + N_T · log(1 − p)
dℓ(C)/dp = N_H / p − N_T / (1 − p) = 0
• Unique solution is p = N_H / N = f(H).
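As a numerical sanity check, a simple grid search over p recovers the closed-form solution p = N_H / N; a sketch using the 61/39 counts from the coin example:

```python
import math

# Counts from the coin example: 61 heads, 39 tails, N = 100.
N_H, N_T = 61, 39

def log_likelihood(p):
    """Bernoulli log-likelihood: N_H * log p + N_T * log(1 - p)."""
    return N_H * math.log(p) + N_T * math.log(1 - p)

# Grid search over p in (0, 1); the maximizer should be N_H / N = 0.61.
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=log_likelihood)
```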
More complex models
• Many, many models we use in NLP are multinomial probability distributions. More than two outcomes possible; think dice rolling.
• MLE result generalizes to multinomial models: P(X = a) = f(a).
• Maximizing the log-likelihood uses a technique called Lagrange multipliers to ensure the parameters sum to 1.
• If you want to see the details, see Murphy paper on the website.
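The multinomial case looks the same in code: relative frequencies are the maximum likelihood estimates; a sketch with invented die rolls:

```python
from collections import Counter

# Made-up observations of die rolls (for illustration only).
rolls = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]
C = Counter(rolls)
N = len(rolls)

# Multinomial MLE: P(X = a) = f(a) = C(a) / N.
p_hat = {a: C[a] / N for a in C}
```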
Conclusion
• Probability theory is an essential tool in modern NLP.
• Important concepts today: ‣ random variable, probability distribution
‣ joint and conditional probs; Bayes’ rule; independence
‣ expected values
‣ statistical models; parameters; likelihood; MLE
• We will use all of these concepts again and again in this course. If you have questions, ask me early.