
Page 1:

Review: Probability Theory

BM1 Advanced Natural Language Processing

Alexander Koller

20 October 2015

Page 2:

Outline

• Random variables and probabilities.

• Joint and conditional probabilities, statistical independence, Bayes’ rule.

• Expected values.

• Statistical models, parameter estimation, probabilities vs. frequencies.

Page 3:

Elementary prob theory

• A random variable X, Y, … describes the possible outcomes of a random event and the probability of each outcome.

‣ Coin flip: outcomes A = {H, T}
Fair coin: P(X = H) = P(X = T) = 0.5
Weighted coin: might have P(X = H) > P(X = T)

‣ Dice rolling: outcomes A = {1, 2, 3, 4, 5, 6}
Fair die: P(X = 1) = … = P(X = 6) = 1/6

• Probability distribution of rv X is the function a ↦ P(X = a).

a P(X=a)

H 0.5

T 0.5

Page 4:

Events

• The set A of atomic events is called the sample space.

• We mostly work with finite A, i.e. discrete random variables. There are also continuous rvs.

• We can assign probability to complex events:

‣ P(X = 1 or X = 2): prob that X takes value 1 or 2.

‣ P(X ≥ 4): prob that X takes value 4, 5, or 6.

‣ P(X = 1 and Y = 2): prob that rv X takes value 1 and rv Y takes value 2.

Page 5:

Prob axioms

• The following axioms hold of probabilities:

‣ 0 ≤ P(X = a) ≤ 1 for all events X = a

‣ P(X ∈ A) = 1; P(X ∈ ∅) = 0

‣ P(X ∈ B) = P(X = a1) + … + P(X = an) for B = {a1, …, an} ⊆ A

• Example: if pd of X is uniform with N outcomes, i.e. P(X = ai) = 1/N for all i, then P(X ∈ B) = |B| / N.

Page 6:

Consequences

• The following rules for calculating with probs follow directly from the axioms.

‣ Union:

P(X ∈ B ∪ C) = P(X ∈ B) + P(X ∈ C) - P(X ∈ B ∩ C)

‣ In particular, if B and C are disjoint (and only then), P(X ∈ B ∪ C) = P(X ∈ B) + P(X ∈ C)

‣ Complement: P(X ∉ B) = P(X ∈ A - B) = 1 - P(X ∈ B). (Both rules are checked numerically in the sketch below.)

• For simplicity, we will now restrict the presentation to events X = a. Basically everything generalizes to events X ∈ B.
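To make the uniform-distribution example and the union/complement rules concrete, here is a minimal Python sketch (an illustration added for this write-up, using a fair die; helper names like prob are just local conveniences):

    # Uniform distribution over the outcomes of a fair die: P(X = a) = 1/N
    A = {1, 2, 3, 4, 5, 6}
    P = {a: 1 / len(A) for a in A}

    def prob(S):
        """P(X in S): sum the probabilities of the atomic events in S."""
        return sum(P[a] for a in S)

    B, C = {1, 2, 3}, {3, 4}
    print(prob(B))                                         # 0.5 = |B| / N
    # Union rule: P(X in B u C) = P(X in B) + P(X in C) - P(X in B n C)
    print(prob(B | C), prob(B) + prob(C) - prob(B & C))    # both 0.666...
    # Complement rule: P(X not in B) = 1 - P(X in B)
    print(prob(A - B), 1 - prob(B))                        # both 0.5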

Page 7:

Joint probs

• We are very often interested in the prob of two events X = a and Y = b occurring together, i.e. the joint probability P(X = a, Y = b).

‣ e.g. X = first roll of die, Y = second roll of die

• If we know joint pd, we can recover individual pds by marginalization. Very important!

P(X = a) = Σ_b P(X = a, Y = b)
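A minimal sketch of marginalization (an added illustration, using two independent rolls of a fair die as the joint distribution):

    # Joint distribution of two fair die rolls: joint[(a, b)] = P(X = a, Y = b)
    outcomes = range(1, 7)
    joint = {(a, b): 1 / 36 for a in outcomes for b in outcomes}

    def marginal_X(a):
        """P(X = a) = sum over b of P(X = a, Y = b)."""
        return sum(joint[(a, b)] for b in outcomes)

    print(marginal_X(3))   # 0.1666..., i.e. 1/6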

Page 8:

Conditional probs

• Joint probs are trickier than they seem because the outcome of X may influence the outcome of Y.

‣ X: draw first card from a deck of 52 cards

‣ Y: after this, draw second card from deck of cards

‣ P(Y is an ace | X is not an ace) = 4/51
P(Y is an ace | X is an ace) = 3/51

• We write P(Y = a | X = b) for the conditional prob that Y has outcome a if we know that X has outcome b.

Page 9:

Conditional probs

• Connection between conditional and joint probs:
P(X = a, Y = b) = P(Y = b | X = a) ∙ P(X = a)
also: P(X = a, Y = b) = P(X = a | Y = b) ∙ P(Y = b)

• Thus “define” conditional prob as:

P(Y = b | X = a) = P(X = a, Y = b) / P(X = a) = P(X = a, Y = b) / Σ_{b′∈B} P(X = a, Y = b′)

(The denominator P(X = a) is obtained by marginalization.)
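A small sketch of this definition with a made-up joint distribution (the weather/umbrella numbers below are purely hypothetical):

    # Hypothetical joint distribution over X (weather) and Y (umbrella)
    joint = {("rain", "umbrella"): 0.3, ("rain", "none"): 0.1,
             ("sun",  "umbrella"): 0.1, ("sun",  "none"): 0.5}

    def conditional(y, x):
        """P(Y = y | X = x) = P(X = x, Y = y) / sum over b of P(X = x, Y = b)."""
        marginal_x = sum(p for (a, b), p in joint.items() if a == x)
        return joint[(x, y)] / marginal_x

    print(conditional("umbrella", "rain"))   # 0.3 / 0.4 = 0.75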


Page 11:

Bayes’ Rule

• Important consequence of the joint/conditional connection:

P(X = a | Y = b) = P(Y = b | X = a) · P(X = a) / P(Y = b)

• Allows us to “reverse” a conditional.

‣ Say you meet someone with long hair (Y = L); how probable is it that the person is a woman (X = W)?

‣ Assume P(Y = L | X = W) = 0.75, P(Y = L | X = M) = 0.15; also P(X = W) = P(X = M) = 0.5.

‣ Use Bayes’ rule to calculate P(X = W | Y = L) ≈ 0.83 (worked through in the sketch below).
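The long-hair example, worked through step by step (a sketch using only the numbers from the slide; the denominator P(Y = L) comes from marginalizing over X):

    # Bayes' rule for the long-hair example
    p_L_given_W, p_L_given_M = 0.75, 0.15   # P(Y = L | X = W), P(Y = L | X = M)
    p_W, p_M = 0.5, 0.5                     # P(X = W), P(X = M)

    p_L = p_L_given_W * p_W + p_L_given_M * p_M   # P(Y = L) = 0.45, by marginalization
    p_W_given_L = p_L_given_W * p_W / p_L         # Bayes' rule
    print(round(p_W_given_L, 2))                  # 0.83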

Page 12:

Statistical independence

• Very important special case:

‣ P(X = a | Y = b) = P(X = a)

‣ equivalently, P(X = a, Y = b) = P(X = a) ∙ P(Y = b)

• This means that outcome of Y has no influence on outcome of X. Events are statistically independent.

• Typical examples: coins, dice.

• Many events in natural language are not independent, but we pretend they are to simplify our models.

Page 13:

Example use of independence

• The following rule holds for complex joint events:
P(X1 = a1, X2 = a2, …, Xn = an) = P(a1) ∙ P(a2 | a1) ∙ … ∙ P(an | a1, …, an-1)

• In practice, it is typically hard to estimate things like P(an | a1, …, an-1) well, because not many training examples satisfy such a complex condition.

• Thus we pretend all events are independent. Then we have P(a1, …, an) ≈ P(a1) ∙ … ∙ P(an).
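As an illustration of the simplification (a sketch with hypothetical unigram probabilities, not a model from the slides): under the independence assumption, the joint probability of a word sequence is just the product of the individual word probabilities.

    import math

    # Hypothetical unigram probabilities P(a)
    unigram = {"the": 0.05, "dog": 0.001, "barks": 0.0005}

    def joint_under_independence(words):
        """P(a1, ..., an) ~ P(a1) * ... * P(an) under the independence assumption."""
        return math.prod(unigram[w] for w in words)

    print(joint_under_independence(["the", "dog", "barks"]))   # 2.5e-08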

Page 14:

Expected values

• Frequentist interpretation of prob: if P(X = a) = p, and we repeat the experiment N times, then we see outcome “a” roughly p ∙ N times.

• Now imagine each outcome “a” comes with reward R(a). After N rounds of playing the game, what reward can we (roughly) expect?

• Measured by expected value:

E_P[R] = Σ_{a∈A} P(X = a) · R(a)
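A minimal sketch of this formula (the reward function R is made up for illustration: you win a euros when the die shows a):

    # Expected value E_P[R] = sum over a of P(X = a) * R(a), for a fair die
    P = {a: 1 / 6 for a in range(1, 7)}
    R = {a: a for a in range(1, 7)}     # hypothetical reward: outcome a pays a euros

    expected_reward = sum(P[a] * R[a] for a in P)
    print(expected_reward)              # 3.5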

Page 15:

Expected values

• Originally invented for the theory of games of chance.

‣ A casino should design the game of roulette such that the expected win for the player is less than zero. (And it is.)

• In NLP, we will often not be sure what the correct analysis, e.g. parse tree, is; but we will have a pd over parse trees. Then use expected values to calculate relevant statistics.

Page 16:

Statistical models

• We want to use prob theory to estimate a model of a generating process from observations about its outcomes.

• Example: we flip a coin 100 times and observe H 61 times. Should we believe that it is a fair coin?

‣ observation: 61x H, 39x T

‣ model: assume rv X follows a Bernoulli distribution, i.e. X has two outcomes, and there is a value p such that P(X = H) = p and P(X = T) = 1 - p.

‣ want to estimate the parameter p of this model

Page 17:

Frequencies

• We distinguish two types of frequencies when we make observations:

‣ absolute frequency C(a) of outcome a is the total number of times we have seen outcome a in our experiments

‣ relative frequency f(a) is the relative proportion of experiments that came out “a”, i.e. f(a) = C(a) / N, where N is the total number of observations.
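Both kinds of frequency in a minimal sketch (the observation sequence is made up):

    from collections import Counter

    observations = ["H", "T", "H", "H", "T", "H"]   # hypothetical coin-flip data
    C = Counter(observations)                       # absolute frequencies C(a)
    N = len(observations)
    f = {a: C[a] / N for a in C}                    # relative frequencies f(a) = C(a) / N

    print(C["H"], f["H"])                           # 4 and 0.666...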

Page 18:

Maximum Likelihood Estimation

• We want to estimate the parameters of our model from frequency observations. There are many ways to do this. For now, we focus on maximum likelihood estimation, MLE.

• Likelihood L(O ; p) is the probability of our model generating the observations O, given parameter values p.

• Goal: Find value for parameters that maximizes the likelihood.

Page 19:

Likelihood functions

[Figure: the likelihood L(O ; p) plotted as a function of the parameter p, for different counts C(H). From the Wikipedia page on MLE; licensed from Casp11 under CC BY-SA 3.0.]

Page 20:

ML Estimation

• For Bernoulli and multinomial models, it is extremely easy to estimate the parameters that maximize the likelihood:

‣ P(X = a) = f(a)

‣ in the coin example above, just take p = f(H)

• Why is this?
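For the coin example from Page 16 (61 heads in 100 flips), the MLE is just the relative frequency; a small sketch that also compares the likelihood of this estimate against the fair-coin value p = 0.5:

    # MLE for the Bernoulli coin model: p = f(H)
    N_H, N_T = 61, 39
    N = N_H + N_T
    p_mle = N_H / N                                 # 0.61

    def likelihood(p):
        """L(O ; p) = p^N_H * (1 - p)^N_T for the observed counts."""
        return p ** N_H * (1 - p) ** N_T

    print(p_mle)                                    # 0.61
    print(likelihood(p_mle) > likelihood(0.5))      # True: the MLE fits the data better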

Page 21:

MLE for Bernoulli model

• Let’s say we had training data C of size N, and we had N_H observations of H and N_T observations of T.

likelihood:   L(C) = ∏_{i=1}^{N} P(w_i | p) = p^{N_H} (1 − p)^{N_T}

log-likelihood:   ℓ(C) = log L(C) = Σ_{i=1}^{N} log P(w_i | p) = N_H log p + N_T log(1 − p)

Page 22:

Logarithm is monotonic

Observation: If x1 > x2, then log(x1) > log(x2).

Therefore, argmax_p L(C) = argmax_p ℓ(C).

Page 23:

Maximizing the log-likelihood

• Find the maximum of the function by setting its derivative to zero:

ℓ(C) = N_H log p + N_T log(1 − p)

dℓ(C)/dp = N_H / p − N_T / (1 − p) = 0

• Solving gives N_H (1 − p) = N_T p, i.e. p (N_H + N_T) = N_H.

• The unique solution is p = N_H / N = f(H).

Page 24:

More complex models

• Many, many models we use in NLP are multinomial probability distributions. More than two outcomes are possible; think of rolling dice.

• MLE result generalizes to multinomial models: P(X = a) = f(a).

• Maximizing the log-likelihood uses a technique called Lagrange multipliers to ensure that the parameters sum to 1 (a compact sketch follows below).

• If you want to see the details, see the Murphy paper on the website.
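For reference, a compact sketch of that derivation (standard textbook material, stated here for convenience): maximize the multinomial log-likelihood subject to the constraint that the parameters sum to 1.

maximize ℓ = Σ_a C(a) · log p_a   subject to   Σ_a p_a = 1

Lagrangian: Λ = Σ_a C(a) · log p_a + λ · (1 − Σ_a p_a)

∂Λ/∂p_a = C(a)/p_a − λ = 0   ⇒   p_a = C(a)/λ

The constraint Σ_a p_a = 1 then gives λ = Σ_a C(a) = N, hence p_a = C(a)/N = f(a).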

Page 25:

Conclusion

• Probability theory is an essential tool in modern NLP.

• Important concepts today:

‣ random variable, probability distribution

‣ joint and conditional probs; Bayes’ rule; independence

‣ expected values

‣ statistical models; parameters; likelihood; MLE

• We will use all of these concepts again and again in this course. If you have questions, ask me early.