COMP 791A: Statistical Language Processing
COMP 791A: Statistical Language Processing
Mathematical Essentials, Chap. 2
Motivations
Statistical NLP aims to do statistical inference for the field of natural language.
Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inference about this distribution.
Ex. of statistical inference: language modeling, i.e. how to predict the next word given the previous words. To do this, we need a model of the language; probability theory helps us find such a model.
Notions of Probability Theory
Probability theory deals with predicting how likely it is that something will happen.
Experiment (or trial): the process by which an observation is made. Ex: tossing a coin twice.
Sample Spaces and Events
Sample space Ω: the set of all possible basic outcomes of an experiment
  Coin toss: Ω = {head, tail}
  Tossing a coin twice: Ω = {HH, HT, TH, TT}
  Uttering a word: |Ω| = vocabulary size
Every observation (element in Ω) is a basic outcome or sample point.
An event A is a set of basic outcomes, with A ⊆ Ω
  Ω is then the certain event; Ø is the impossible (or null) event
Example, rolling a die:
  Sample space Ω = {1, 2, 3, 4, 5, 6}
  Event A that an even number occurs: A = {2, 4, 6}
Events and Probability
The probability of an event A is denoted P(A), also called the prior probability, i.e. the probability before we consider any additional knowledge.
Example: experiment of tossing a coin 3 times
  Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  Event with two or more tails: A = {HTT, THT, TTH, TTT}, P(A) = |A|/|Ω| = 4/8 = ½ (assuming uniform distribution)
  Event with all heads: A = {HHH}, P(A) = |A|/|Ω| = ⅛
Probability Properties
A probability function P (or probability distribution):
  distributes a probability mass of 1 over the sample space Ω
  P: 2^Ω → [0,1]
  P(Ω) = 1
  For disjoint events Ai (i.e. Ai ∩ Aj = Ø for all i ≠ j): P(∪i Ai) = Σi P(Ai)
Immediate consequences:
  P(Ø) = 0
  P(Ā) = 1 − P(A)
  A ⊆ B ⇒ P(A) ≤ P(B)
  Σ_{a∈Ω} P(a) = 1
Joint probability
Joint probability of A and B: P(A,B) = P(A ∩ B)
(Venn diagram: A and B as overlapping regions within Ω; the overlap is A ∩ B)
Conditional probability
Prior (or unconditional) probability: the probability of an event before any evidence is obtained
  P(A) = 0.1, e.g. P(rain today) = 0.1
  i.e. your belief about A given that you have no evidence
Posterior (or conditional) probability: the probability of an event given that all we know is B (some evidence)
  P(A|B) = 0.8, e.g. P(rain today | cloudy) = 0.8
  i.e. your belief about A given that all you know is B
Conditional probability (cont'd)
P(A|B) = P(A,B) / P(B)
(Venn diagram: P(A|B) is the proportion of B's area occupied by A ∩ B)
Chain rule
With 3 events, the probability that A, B and C all occur is:
  the probability that A occurs
  times the probability that B occurs, assuming that A occurred
  times the probability that C occurs, assuming that A and B have occurred
With multiple events, we can generalize to the Chain rule (important to NLP):
P(A1, A2, A3, A4, ..., An) = ∏i P(Ai | A1, ..., Ai−1)
  = P(A1) × P(A2|A1) × P(A3|A1,A2) × ... × P(An|A1,A2,A3,…,An−1)
The two ways of factoring a joint probability:
  P(A|B) = P(A,B) / P(B), so P(A,B) = P(A|B) × P(B)
  P(B|A) = P(A,B) / P(A), so P(A,B) = P(B|A) × P(A)
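To make the decomposition concrete, here is a minimal Python sketch that multiplies out the chain rule for a short word sequence; the conditional probabilities in it are invented illustration values, not estimates from any corpus.

```python
# Minimal sketch: chain-rule decomposition of a word-sequence probability.
# P(w1,...,wn) = P(w1) * P(w2|w1) * ... * P(wn|w1,...,wn-1)
# The probabilities below are made up for illustration only.

def sequence_probability(words, cond_prob):
    prob = 1.0
    for i, word in enumerate(words):
        history = tuple(words[:i])           # all previous words
        prob *= cond_prob[(history, word)]   # P(word | history)
    return prob

# hypothetical conditional probabilities for "mary is reading"
cond_prob = {
    ((), "mary"): 0.001,                     # P(mary)
    (("mary",), "is"): 0.1,                  # P(is | mary)
    (("mary", "is"), "reading"): 0.01,       # P(reading | mary, is)
}

print(sequence_probability(["mary", "is", "reading"], cond_prob))  # 1e-06
```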
Bayes’ theorem
Given that: P(A,B) = P(A|B) × P(B)
and:        P(A,B) = P(B|A) × P(A)
then:       P(A|B) × P(B) = P(B|A) × P(A)
or:         P(A|B) = P(B|A) × P(A) / P(B)
So?
We typically want to know P(Cause | Effect)
  ex: P(Disease | Symptoms)
  ex: P(linguistic phenomenon | linguistic observations)
But this information is hard to gather.
However, P(Effect | Cause) is easier to gather (from training data).
So: P(Cause | Effect) = P(Effect | Cause) × P(Cause) / P(Effect)
Example
A rare syntactic construction occurs in 1/100,000 sentences.
A system identifies sentences with such a construction, but it is not perfect:
  if a sentence has the construction, the system identifies it 95% of the time
  if a sentence does not have the construction, the system says it does 0.5% of the time
Question: if the system says that sentence S has the construction, what is the probability that it is right?
Example (cont'd)
What is P(sentence has the construction | the system says yes)?
Let:
  cons = sentence has the construction
  yes = system says yes
  not_cons = sentence does not have the construction
We have:
  P(cons) = 1/100,000 = 0.00001
  P(yes | cons) = 95% = 0.95
  P(yes | not_cons) = 0.5% = 0.005
P(yes) = ?  By total probability, P(B) = P(B|A) P(A) + P(B|Ā) P(Ā), so:
  P(yes) = P(yes | cons) × P(cons) + P(yes | not_cons) × P(not_cons)
         = 0.95 × 0.00001 + 0.005 × 0.99999
And by Bayes' theorem: P(cons | yes) = P(yes | cons) × P(cons) / P(yes)
Example (cont'd)
So:
P(cons | yes) = P(yes | cons) × P(cons) / P(yes)
              = (0.95 × 0.00001) / (0.95 × 0.00001 + 0.005 × 0.99999)
              ≈ 0.002 (0.2%)
So when the system says yes, it is actually right in only about 1 sentence out of 500!
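A quick numeric check of this calculation, using only the values given on the slides above (a small Python sketch):

```python
# Bayes' theorem check for the rare-construction example.
p_cons = 0.00001          # P(cons): 1 sentence in 100,000
p_yes_given_cons = 0.95   # P(yes | cons)
p_yes_given_not = 0.005   # P(yes | not_cons)

# Total probability: P(yes) = P(yes|cons) P(cons) + P(yes|not_cons) P(not_cons)
p_yes = p_yes_given_cons * p_cons + p_yes_given_not * (1 - p_cons)

# Bayes: P(cons | yes) = P(yes | cons) P(cons) / P(yes)
p_cons_given_yes = p_yes_given_cons * p_cons / p_yes
print(round(p_cons_given_yes, 4))   # 0.0019, i.e. about 0.2% (1 in ~500)
```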
Statistical Independence vs. Statistical Dependence
How likely are we to get heads in a coin toss, given that it is raining today?
  A: getting heads in a coin toss
  B: raining today
  Some variables are independent…
How likely is the word “ambulance” to appear, given that we’ve seen “car accident”?
  Words in text are not independent
Independent events Two events A and B are independent:
if the occurrence of one of them does not influence the occurrence of the other
i.e. A is independent of B if P(A) = P(A|B)
If A and B are independent, then: P(A,B) = P(A|B) x P(B) (by chain rule)
= P(A) x P(B) (by independence)
In NLP, we often assume independence of variables
Bayes' Theorem revisited (a golden rule in statistical NLP)
If we are interested in which event B is most likely to occur given an observation A, we can choose the B with the largest P(B|A).
P(A) is a normalization constant (to ensure the result is in 0…1); it is the same for all possible Bs (and is hard to gather anyway), so we can drop it.
So Bayesian reasoning:
  argmax_B P(B|A) = argmax_B P(A|B) × P(B) / P(A) = argmax_B P(A|B) × P(B)
In NLP:
  argmax_{language_event} P(observation | language_event) × P(language_event)
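A minimal Python sketch of this argmax decision rule; the candidate language events, likelihoods, and priors are made up purely for illustration:

```python
# Pick the language event B maximizing P(A|B) * P(B); P(A) is dropped
# because it is the same for every candidate B.
# All numbers below are hypothetical illustration values.

likelihood = {"noun": 0.02, "verb": 0.15}   # P(observation | language_event)
prior = {"noun": 0.60, "verb": 0.40}        # P(language_event)

best = max(likelihood, key=lambda b: likelihood[b] * prior[b])
print(best)   # 'verb', since 0.15 * 0.40 = 0.06 > 0.02 * 0.60 = 0.012
```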
Application of Bayesian Reasoning
Diagnostic systems: P(Disease | Symptoms)
Categorization: P(Category of object | Features of object)
  Text classification: P(sports-news | words in text)
  Character recognition: P(character | bitmap)
  Speech recognition: P(words | signals)
  Image processing: P(face-person | image)
  …
Random Variables
A random variable X is a function X: Ω → Rⁿ (typically n = 1)
Example, tossing 2 dice:
  Ω = {(1,1), (1,2), (1,3), …, (6,6)}
  X: Ω → R_X assigns to each point in Ω the sum of the 2 dice
  X(1,1) = 2, X(1,2) = 3, …, X(6,6) = 12
  R_X = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
A random variable X is discrete if X: Ω → S, where S is a countable subset of R.
  In particular, if X: Ω → {0,1}, then X is called a Bernoulli trial.
A random variable X is continuous if X: Ω → S, where S is a continuum of numbers.
Probability distribution of an RV
Let X be a finite random variable with R_X = {x1, x2, x3, …, xn}.
A probability mass function f gives the probability of X at the different points of R_X:
  f(xk) = P(X = xk) = p(xk)
  p(xk) ≥ 0
  Σk p(xk) = 1

  X    | x1    x2    x3    …    xn
  p(X) | p(x1) p(x2) p(x3) …    p(xn)
Example: Tossing 2 dice
X = sum of the faces
  X: Ω → S, with Ω = {(1,1), (1,2), (1,3), …, (6,6)} and S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

  X    | 2    3    4    5    6    7    8    9    10   11   12
  p(X) | 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36

X = maximum of the faces
  X: Ω → S, with Ω = {(1,1), (1,2), (1,3), …, (6,6)} and S = {1, 2, 3, 4, 5, 6}

  X    | 1    2    3    4    5    6
  p(X) | 1/36 3/36 5/36 7/36 9/36 11/36
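Both tables can be recomputed by brute-force enumeration of the 36 equiprobable outcomes; a short Python sketch:

```python
# Enumerate the sample space of two dice and tabulate the pmf of the
# sum and of the maximum of the faces.
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))          # |Ω| = 36

pmf_sum = Counter(a + b for a, b in outcomes)
pmf_max = Counter(max(a, b) for a, b in outcomes)

print({x: Fraction(c, 36) for x, c in sorted(pmf_sum.items())})
print({x: Fraction(c, 36) for x, c in sorted(pmf_max.items())})
```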
Expectation
The expectation (μ) is the mean (or average, or expected value) of a random variable X:
  E(X) = Σi p(xi) xi
Intuitively, it is the weighted average of the outcomes, where each outcome is weighted by its probability
  ex: the average sum of the dice
If X and Y are 2 random variables on the same sample space, then E(X+Y) = E(X) + E(Y).
Example
What is the expectation of the sum of the faces on two dice? (the average sum of the dice)
If the sums were equiprobable, it would simply be (2+3+4+5+…+12)/11; but the sums are not equiprobable:

  SUM (xi)  | 2    3    4    5    6    7    8    9    10   11   12
  p(SUM=xi) | 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36

E(X) = Σi xi p(xi) = 2 × 1/36 + 3 × 2/36 + 4 × 3/36 + … + 11 × 2/36 + 12 × 1/36 = 252/36 = 7
Or more simply: E(SUM) = E(Die1 + Die2) = E(Die1) + E(Die2)
  Each face on 1 die is equiprobable, so E(Die) = (1+2+3+4+5+6)/6 = 3.5
  E(SUM) = 3.5 + 3.5 = 7
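A short Python check of the weighted average, directly from the pmf table above:

```python
# E(SUM) = Σ x p(x) over the pmf of the sum of two dice.
from fractions import Fraction

pmf = {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 5, 9: 4, 10: 3, 11: 2, 12: 1}
expectation = sum(x * Fraction(count, 36) for x, count in pmf.items())
print(expectation)   # 7
```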
Variance and standard deviation
The variance of a random variable X is a measure of whether the values of the RV tend to be consistent over trials or to vary a lot:
  var(X) = Σi p(xi) (xi − E(X))²
The standard deviation of X is the square root of the variance:
  σ_X = √var(X)
Both measure the weighted “spread” of the values xi around the mean E(X).
Example
What is the variance of the sum of the faces on two dice?
  SUM (xi)  | 2    3    4    5    6    7    8    9    10   11   12
  p(SUM=xi) | 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36

var(SUM) = Σi p(xi) (xi − E(SUM))²
         = 1/36 (2−7)² + 2/36 (3−7)² + 3/36 (4−7)² + 4/36 (5−7)² + 5/36 (6−7)² + … + 2/36 (11−7)² + 1/36 (12−7)²
         ≈ 5.83
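And the corresponding check of the variance and standard deviation in Python:

```python
# var(SUM) = Σ p(x) (x - E(SUM))^2, with E(SUM) = 7.
from fractions import Fraction
from math import sqrt

pmf = {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 5, 9: 4, 10: 3, 11: 2, 12: 1}
variance = sum(Fraction(count, 36) * (x - 7) ** 2 for x, count in pmf.items())
print(variance, float(variance), sqrt(variance))   # 35/6 ≈ 5.83, σ ≈ 2.42
```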
Back to NLP
What is the probability that someone says the sentence: “Mary is reading a book.”?
In general, for language events, the probability function P is unknown:

  Language event | x1 x2 x3 …
  p              | ?  ?  ?  …

We need to estimate P (or a model M of the language) by looking at a sample of data (a training set).
2 approaches:
  Frequentist statistics
  Bayesian statistics (not covered here)
Frequentist Statistics
To estimate P, we use the relative frequency of the outcome in a sample of data, i.e. the proportion of times a certain outcome o occurs:
  f_o = C(o) / N
where C(o) is the number of times o occurs in N trials.
For N → ∞, the relative frequency stabilizes to some number: the estimate of the probability function.
Two approaches to estimating the probability function:
  Parametric (assuming a known distribution)
  Non-parametric (distribution free)… not covered here
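A minimal Python sketch of such a relative-frequency estimate; the tiny "training sample" is invented for illustration:

```python
# Relative-frequency estimate: f_o = C(o) / N.
from collections import Counter

sample = ["the", "cat", "the", "dog", "the", "cat", "a", "the"]  # made-up data
N = len(sample)
estimates = {o: count / N for o, count in Counter(sample).items()}
print(estimates)   # {'the': 0.5, 'cat': 0.25, 'dog': 0.125, 'a': 0.125}
```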
Parametric Methods
Assume that some phenomenon in language is modeled by a well-known family of distributions (ex: binomial, normal).
The advantages:
  we have an explicit probabilistic model of the process by which the data was generated
  determining a particular probability distribution within the family requires only the specification of a few parameters (so, less training data)
But: our assumption about the probability distribution may be wrong…
Non-Parametric Methods
No assumption is made about the underlying distribution of the data
For ex, we can simply estimate P empirically by counting a large number of random events
But: because we use less prior information (no assumption on the distribution), more training data is needed
Standard Distributions
Many applications give rise to the same basic form of a probability distribution, but with different parameters.
Discrete distributions:
  the binomial distribution (2 outcomes)
  the multinomial distribution (more than 2 outcomes)
  …
Continuous distributions:
  the normal distribution (Gaussian)
  …
Binomial Distribution (discrete)
Describes the number of successes in a series of Bernoulli trials:
  each trial has only two outcomes (success or failure)
  the probability of success is the same for each trial
  the trials are independent
  there is a fixed number of trials
The distribution has 2 parameters:
  number of trials n
  probability of success p in 1 trial
Ex: flipping a coin 10 times and counting the number of heads that occur
  can only get a head or a tail (2 outcomes)
  for each flip there is the same chance of getting a head (same prob.)
  the coin flips do not affect each other (independence)
  there are 10 coin flips (n = 10)
Examples
b(n,p) = b(10, 0.7): number of trials = 10, Prob(head) = 0.7
b(n,p) = b(10, 0.1): number of trials = 10, Prob(head) = 0.1
Binomial probability function
Let:
  n = number of trials
  p = probability of success in any trial
  r = number of successes out of the n trials
P(r) = probability of having r successes in n trials
     = C(n, r) × p^r × (1−p)^(n−r)
where C(n, r) = n! / (r! (n−r)!)   (0 ≤ r ≤ n)
  C(n, r) is the number of ways of having r successes in n trials
  p^r is the probability of having r successes
  (1−p)^(n−r) is the probability of having n−r failures
Example
What is the probability of rolling higher than 4 exactly twice in 3 rolls of a die?
  success = rolling higher than 4 (a 5 or a 6), so p = 2/6; probability of failure = 4/6
  n trials = 3, r successes = 2
Enumerating the outcomes with exactly 2 successes:

  1st roll | 2nd roll | 3rd roll | Probability
  > 4      | > 4      | ≤ 4      | 2/6 × 2/6 × 4/6 = 2/27
  > 4      | ≤ 4      | > 4      | 2/6 × 4/6 × 2/6 = 2/27
  ≤ 4      | > 4      | > 4      | 4/6 × 2/6 × 2/6 = 2/27

p(r=2) = 2/6 × 2/6 × 4/6 + 2/6 × 4/6 × 2/6 + 4/6 × 2/6 × 2/6 = 3 × (2/6)² × (4/6) = 2/9
Or, directly with the binomial probability function:
p(r=2) = C(3, 2) × (2/6)² × (4/6)¹ = 3 × 4/36 × 4/6 = 2/9
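The same answer can be checked with a small Python function implementing the binomial probability formula from the previous slide:

```python
# P(r successes in n independent trials, success probability p per trial).
from math import comb

def binomial_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

print(binomial_pmf(2, 3, 2/6))   # 0.2222..., i.e. 2/9
```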
Properties of the binomial distribution B(n,p)
Mean: E(X) = μ = np
  Ex: flipping a coin 10 times, E(heads) = 10 × ½ = 5
Variance: σ² = np(1−p)
  Ex: flipping a coin 10 times, σ² = 10 × ½ × ½ = 2.5
Binomial distribution in NLP
Works well for tossing a coin. But in NLP we do not always have complete independence from one trial to the next:
  consecutive sentences are not independent
  consecutive POS tags are not independent
So the binomial distribution is an approximation in NLP (but a fair one).
When we count how many times something is present or absent, and we ignore the possibility of dependencies between one trial and the next, then we implicitly use the binomial distribution.
Ex: count how many sentences contain the word “the”; assume each sentence is independent.
Ex: count how many times a verb is used as transitive; assume each occurrence of the verb is independent of the others…
Normal Distribution (continuous)
Also known as the Gaussian distribution (or bell curve); used to model a random variable X on an infinite sample space (ex: height, length…).
X is a continuous random variable if there is a function f(x) defined on the real line R = (−∞, +∞) such that:
  f is non-negative: f(x) ≥ 0
  the area under the curve of f is one: ∫ f(x) dx = 1
  the probability that X lies in the interval [a,b] is equal to the area under f between x = a and x = b: P(a ≤ X ≤ b) = ∫_a^b f(x) dx
Normal Distribution (cont'd)
The normal distribution n(μ,σ) has 2 parameters: the mean μ and the standard deviation σ
  p(x) = (1 / (√(2π) σ)) × e^(−(x−μ)² / (2σ²))
(Plots: n(0,1) with μ = 0, σ = 1, and n(1.5,2) with μ = 1.5, σ = 2)
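The density can be written out directly in Python; as a quick sanity check, its value at the mean is 1/(√(2π) σ):

```python
# Normal (Gaussian) density with mean mu and standard deviation sigma.
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

print(normal_pdf(0.0))            # ≈ 0.3989 for n(0, 1)
print(normal_pdf(1.5, 1.5, 2.0))  # ≈ 0.1995 for n(1.5, 2)
```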
The standard normal distribution
If μ = 0 and σ = 1, the distribution is called the standard normal distribution, Z
  P(0 ≤ X ≤ 1) = ∫_0^1 f(x) dx ≈ 34.1%
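The 34.1% figure can be verified numerically with the standard normal CDF, expressed via the error function (a small Python sketch):

```python
# Φ(x) = (1 + erf(x / √2)) / 2, so P(0 ≤ X ≤ 1) = Φ(1) − Φ(0).
from math import erf, sqrt

def std_normal_cdf(x):
    return (1 + erf(x / sqrt(2))) / 2

print(std_normal_cdf(1) - std_normal_cdf(0))   # ≈ 0.3413, i.e. about 34.1%
```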
Frequentist vs Bayesian Statistics
Assume we toss a coin 10 times and get 8 heads:
  Frequentists will conclude (from the observations) that heads come up with probability 8/10 -- the Maximum Likelihood Estimate (MLE)
  but if we look at the coin, we would be reluctant to accept 8/10… because we have prior beliefs
Bayesian statisticians:
  will use an a priori probability distribution (their belief)
  will update the beliefs when new evidence comes in (a sequence of observations), by calculating the Maximum A Posteriori (MAP) distribution
  the MAP probability becomes the new prior probability, and the process repeats on each new observation
Essential Information Theory
Developed by Shannon in the 1940s to maximize the amount of information that can be transmitted over an imperfect communication channel (the noisy channel).
Notion of entropy (informational content): how informative is a piece of information?
  ex: how informative is the answer to a question?
  If you already have a good guess about the answer, the actual answer is less informative… low entropy.
Entropy - intuition
Ex: betting 1$ on the flip of a coin
If the coin is fair:
  expected gain is ½ (+1) + ½ (−1) = 0$
  so you'd be willing to pay up to 1$ for advance information (1$ − 0$ average win)
If the coin is rigged, with P(head) = 0.99 and P(tail) = 0.01:
  assuming you bet on heads (!), expected gain is 0.99 (+1) + 0.01 (−1) = 0.98$
  so you'd be willing to pay up to 2¢ for advance information (1$ − 0.98$ average win)
The entropy of the fair coin (worth 1$ here) is higher than the entropy of the rigged coin (worth 0.02$).
Entropy
Let X be a discrete RV. Entropy (or self-information):
  H(X) = −Σ_{i=1..n} p(xi) log₂ p(xi)
  measures the amount of information in a RV
  the average uncertainty of a RV
  the average length of the message needed to transmit an outcome xi of that variable
  the size of the search space consisting of the possible values of a RV and its associated probabilities
  measured in bits
Properties:
  H(X) ≥ 0
  if H(X) = 0, then it provides no new information
Example: The coin flip
Fair coin:
  H(X) = −Σi p(xi) log₂ p(xi) = −(½ log₂ ½ + ½ log₂ ½) = 1 bit
Rigged coin:
  H(X) = −Σi p(xi) log₂ p(xi) = −(99/100 log₂ 99/100 + 1/100 log₂ 1/100) ≈ 0.08 bits
(Plot: entropy as a function of P(head); it is maximal, 1 bit, at P(head) = ½)
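Both entropies can be recomputed with a few lines of Python:

```python
# H(X) = -Σ p(x) log2 p(x), skipping zero-probability outcomes.
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit  (fair coin)
print(entropy([0.99, 0.01]))   # ≈ 0.081 bits (rigged coin)
```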
Example: Simplified Polynesian
In Simplified Polynesian, we have 6 letters with frequencies:

  p   t   k   a   i   u
  1/8 1/4 1/8 1/4 1/8 1/8

The per-letter entropy is:
  H(P) = −Σ_{i∈{p,t,k,a,i,u}} p(i) log₂ p(i) = −(1/8 log₂ 1/8 + 1/4 log₂ 1/4 + 1/8 log₂ 1/8 + 1/4 log₂ 1/4 + 1/8 log₂ 1/8 + 1/8 log₂ 1/8) = 2.5 bits
So we can design a code that on average takes 2.5 bits to transmit a letter:

  p   t  k   a  i   u
  100 00 101 01 110 111

Entropy can be viewed as the average number of yes/no questions you need to ask to identify the outcome (ex: is it a ‘t’? Is it a ‘p’?).
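A short Python check of both the per-letter entropy and the average length of the code above (both come out to 2.5 bits):

```python
# Per-letter entropy of Simplified Polynesian and the expected code length.
from math import log2

probs = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
code = {"p": "100", "t": "00", "k": "101", "a": "01", "i": "110", "u": "111"}

H = -sum(p * log2(p) for p in probs.values())
avg_len = sum(probs[letter] * len(code[letter]) for letter in probs)
print(H, avg_len)   # 2.5 2.5
```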
Entropy in NLP
Entropy is a measure of uncertainty: the more we know about something, the lower its entropy.
So if a language model captures more of the structure of the language, then its entropy should be lower.
In NLP, language models are compared by using their entropy.
  ex: given 2 grammars and a corpus, we use entropy to determine which grammar better matches the corpus.