language modeling
![Page 1: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/1.jpg)
Language Modeling
Anytime a linguist leaves the group the recognition rate goes up. (Fred Jelinek)
![Page 2: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/2.jpg)
Word Prediction in Application Domains
Guessing the next word/letter
Once upon a time there was ……. C’era una volta ….
Domains: speech modeling, augmentative communication systems (disabled persons), T9
![Page 3: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/3.jpg)
Word Prediction for Spelling
Andranno a trovarlo alla sua cassa domani. Se andrei al mare sarei abbronzato. Vado a spiaggia.
Hopefully, all with continue smoothly in my absence. Can they lave him my message? I need to notified the bank of this problem.
![Page 4: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/4.jpg)
Probs
Prior probability that the training data D will be observed P(D)
Prior probability of h, P(h), may include any prior knowledge that h is the correct hypothesis
P(D|h), probability of observing data D given a world where hypothesis h holds.
P(h|D), probability that h holds given the data D, i.e. posterior probability of h, because it reflects our confidence that h holds after we have seen the data D.
![Page 5: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/5.jpg)
The Bayes Rule (Theorem)
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
![Page 6: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/6.jpg)
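Maximum A Posteriori Hypothesis and Maximum Likelihood
$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$
$P(D \mid h)$ is the likelihood of the data D given h. Dropping the prior $P(h)$ gives the maximum likelihood hypothesis:
$$h_{ML} = \arg\max_{h \in H} P(D \mid h)$$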
![Page 7: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/7.jpg)
Bayes Optimal Classifier
Motivation: 3 hypotheses with posterior probabilities of 0.4, 0.3 and 0.3. Thus, the first one is the MAP hypothesis. BUT:
(A problem) Suppose a new instance is classified positive by the first hypothesis and negative by the other two. Then the probability that the new instance is positive is 0.4, as opposed to 0.6 for the negative classification. Yet the MAP hypothesis is the one that says positive!
Solution: The most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.
![Page 8: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/8.jpg)
Bayes Optimal Classifier
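Classification: class $v_j \in V$
$$P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$
Bayes Optimal Classifier
$$\arg\max_{v_j \in V} P(v_j \mid D) = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$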
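The weighted vote can be sketched in a few lines of Python. This is only an illustration using the 0.4 / 0.3 / 0.3 example from the motivation slide; the function name `bayes_optimal` and the dictionary layout are assumptions made for this sketch.

```python
def bayes_optimal(posteriors, predictions, classes):
    """Combine all hypotheses, weighted by their posteriors.

    posteriors[i]     = P(h_i | D)
    predictions[i][v] = P(v | h_i)  (missing classes count as 0)
    """
    scores = {
        v: sum(p_h * pred.get(v, 0.0) for p_h, pred in zip(posteriors, predictions))
        for v in classes
    }
    return max(scores, key=scores.get), scores


# Three hypotheses; only the first (the MAP hypothesis) says "positive".
posteriors = [0.4, 0.3, 0.3]
predictions = [{"+": 1.0}, {"-": 1.0}, {"-": 1.0}]
label, scores = bayes_optimal(posteriors, predictions, ["+", "-"])
print(label, scores)
```

Here the combined vote picks the negative class even though the single most probable hypothesis predicts positive.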
![Page 9: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/9.jpg)
Naïve Bayes Classifier
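Bayes Optimal Classifier
$$\arg\max_{v_j \in V} P(v_j \mid D) = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \dots, a_n) = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \dots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \dots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \dots, a_n \mid v_j)\,P(v_j)$$
Naïve version
$$\arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$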
![Page 10: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/10.jpg)
m-estimate of probability
$$\frac{n_c + m\,p}{n + m}$$
With $p = \frac{1}{|Vocabulary|}$ and $m = |Vocabulary|$:
$$\frac{n_c + 1}{n + |Vocabulary|}$$
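A small Python sketch of the m-estimate and its add-one special case may help. The function names are hypothetical; `n_c` is the count of the event, `n` the number of trials, `p` the prior estimate and `m` the equivalent sample size.

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)


def add_one_estimate(n_c, n, vocab_size):
    """Special case p = 1/|Vocabulary|, m = |Vocabulary|: (n_c + 1) / (n + |Vocabulary|)."""
    return m_estimate(n_c, n, p=1.0 / vocab_size, m=vocab_size)


# A word never seen with this class still gets a non-zero probability.
unseen = add_one_estimate(n_c=0, n=10, vocab_size=5)
seen = m_estimate(n_c=3, n=10, p=0.5, m=2)
print(unseen, seen)
```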
![Page 11: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/11.jpg)
Tagging
P (tag = Noun | word = saw) = ?
$$P(t \mid w)\,P(w) = P(w \mid t)\,P(t)$$
$$P(t \mid w) = \frac{P(w \mid t)\,P(t)}{P(w)}$$
$$\arg\max_t P(t \mid w) = \arg\max_t P(w \mid t)\,P(t)$$
![Page 12: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/12.jpg)
$$\arg\max_t P(t \mid w) = \arg\max_t P(w \mid t)\,P(t)$$
The prior $P(t)$ is the language model. Use a corpus to find both probabilities.
![Page 13: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/13.jpg)
N-gram Model
The N-th word is predicted by the previous N-1 words.
What is a word? Token, word-form, lemma, m-tag, …
$$P(w_1, w_2, \dots, w_{n-1}, w_n) = P(w_1^n)$$
![Page 14: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/14.jpg)
N-gram approximation models
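$$\begin{aligned}
P(w_1^n) &= P(w_n \mid w_1^{n-1})\,P(w_1^{n-1}) \\
&= P(w_n \mid w_1^{n-1})\,P(w_{n-1} \mid w_1^{n-2})\,P(w_1^{n-2}) \\
&= \dots \\
&= P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \dots P(w_n \mid w_1^{n-1})
\end{aligned}$$
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$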
![Page 15: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/15.jpg)
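bi-gram and tri-gram models
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$
N=2 (bi):
$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$$
N=3 (tri):
$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-2}\,w_{k-1})$$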
![Page 16: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/16.jpg)
Counting n-grams
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\,w_n)}{\sum_{w} C(w_{n-1}\,w)} = \frac{C(w_{n-1}\,w_n)}{C(w_{n-1})}$$
$$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\,w_n)}{C(w_{n-N+1}^{n-1})}$$
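As a rough illustration of estimating bigram probabilities from counts, here is a minimal Python sketch. The two-sentence corpus, the whitespace tokenization and the function name `bigram_mle` are assumptions for the example, not part of the slides.

```python
from collections import Counter


def bigram_mle(sentences):
    """Return prob(word, prev) = C(prev word) / C(prev), estimated from the sentences."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<Start>"] + sentence.split() + ["<End>"]
        # C(prev) is counted over positions that have a successor,
        # so it equals the sum over w of C(prev w).
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))

    def prob(word, prev):
        if history_counts[prev] == 0:
            return 0.0  # unseen history: the unsmoothed estimate is undefined, report 0
        return bigram_counts[(prev, word)] / history_counts[prev]

    return prob


prob = bigram_mle(["today is a beautiful day .", "today is a good day ."])
print(prob("is", "today"), prob("beautiful", "a"), prob("rainy", "a"))
```

Any bigram that does not occur in the corpus gets probability zero under this estimate.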
![Page 17: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/17.jpg)
The Language Model Allows Us to Calculate Sentence Probs
P( Today is a beautiful day . ) = P( Today | <Start>) * P( is | Today) * P( a | is) * P( beautiful | a) * P( day | beautiful) * P( . | day) * P( <End> | .)
Work in log space!
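The reason for working in log space is that a product of many small probabilities underflows, while a sum of logs does not. A minimal sketch follows, assuming a hypothetical lookup table `toy_probs` of bigram probabilities whose values are invented for illustration.

```python
import math


def sentence_logprob(tokens, bigram_probs):
    """Sum of log P(w_k | w_{k-1}) over the sentence, with <Start>/<End> padding."""
    padded = ["<Start>"] + tokens + ["<End>"]
    total = 0.0
    for prev, word in zip(padded[:-1], padded[1:]):
        p = bigram_probs.get((prev, word), 0.0)
        if p == 0.0:
            return float("-inf")  # an unseen bigram makes the whole sentence impossible
        total += math.log(p)
    return total


tokens = ["Today", "is", "a", "beautiful", "day", "."]
padded = ["<Start>"] + tokens + ["<End>"]
# Invented numbers: every bigram in the sentence gets probability 0.5.
toy_probs = {pair: 0.5 for pair in zip(padded[:-1], padded[1:])}

logp = sentence_logprob(tokens, toy_probs)
print(logp, math.exp(logp))
```

Exponentiating the result recovers the ordinary probability when it is needed.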
![Page 18: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/18.jpg)
Unseen n-grams and Smoothing
Discounting (several types)
Backoff
Deleted Interpolation
![Page 19: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/19.jpg)
Deleted Interpolation
$$\hat{P}(w_n \mid w_{n-2}\,w_{n-1}) = \lambda_1\,P(w_n \mid w_{n-2}\,w_{n-1}) + \lambda_2\,P(w_n \mid w_{n-1}) + \lambda_3\,P(w_n) + \lambda_4\,\frac{1}{|V|}$$
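The interpolation itself is a weighted sum of the trigram, bigram, unigram and uniform estimates. The sketch below assumes the component probabilities have already been estimated; the function name and all numbers are invented for illustration.

```python
def interpolated_prob(p_tri, p_bi, p_uni, vocab_size, lambdas):
    """Weighted sum of trigram, bigram, unigram and uniform (1/|V|) estimates."""
    l1, l2, l3, l4 = lambdas
    if abs(l1 + l2 + l3 + l4 - 1.0) > 1e-9:
        raise ValueError("lambdas must sum to one")
    return l1 * p_tri + l2 * p_bi + l3 * p_uni + l4 / vocab_size


# The trigram was never seen (p_tri = 0), yet the smoothed estimate is non-zero.
p_hat = interpolated_prob(
    p_tri=0.0, p_bi=0.2, p_uni=0.01, vocab_size=1000, lambdas=(0.5, 0.3, 0.15, 0.05)
)
print(p_hat)
```

The uniform term guarantees a non-zero probability even for a word that is missing from the unigram counts.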
![Page 20: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/20.jpg)
Searching For the Best Tagging
W_1 W_2 W_3 W_4 W_5 W_6 W_7 W_8
t_1_1 t_1_2 t_1_3 t_1_4 t_1_5 t_1_6 t_1_7 t_1_8
t_2_1 t_2_2 t_2_3 t_2_5 t_2_8
t_3_1 t_3_3
t_4_1
Use Viterbi search to find the best path through the lattice.
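A compact Viterbi sketch over such a lattice is given below. It assumes a bigram tag model with transition probabilities P(tag | previous tag) and emission probabilities P(word | tag), stored as log values; the tag set, the three-word sentence and all numbers are invented for illustration.

```python
import math

NEG_INF = float("-inf")


def viterbi(words, tags, log_trans, log_emit, start="<Start>"):
    """Return the highest-scoring tag sequence for words (missing entries score -inf)."""
    # best[i][t] = (log score of the best path ending in tag t at position i, back-pointer)
    best = [{}]
    for t in tags:
        score = log_trans.get((start, t), NEG_INF) + log_emit.get((t, words[0]), NEG_INF)
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + log_trans.get((p, t), NEG_INF)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = (score + log_emit.get((t, words[i]), NEG_INF), prev)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


def logs(table):
    return {key: math.log(value) for key, value in table.items()}


log_trans = logs({("<Start>", "N"): 0.8, ("<Start>", "V"): 0.2, ("N", "N"): 0.3,
                  ("N", "V"): 0.7, ("V", "N"): 0.8, ("V", "V"): 0.2})
log_emit = logs({("N", "dogs"): 0.4, ("N", "saw"): 0.1, ("V", "saw"): 0.5, ("N", "cats"): 0.4})

best_tags = viterbi(["dogs", "saw", "cats"], ["N", "V"], log_trans, log_emit)
print(best_tags)
```

The table stores only the best path into each lattice node, so the cost grows linearly with sentence length rather than with the number of paths.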
![Page 21: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/21.jpg)
Cross Entropy
Entropy from the point of view of the user who has misinterpreted the source distribution to be q rather than p [Cross entropy is an upper bound of entropy]
$$H(p, q) = -\sum_i p_i \log q_i$$
$$H(p) \le -\sum_i p_i \log q_i$$
![Page 22: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/22.jpg)
Cross Entropy as a Quality Measure
Two models, therefore 2 upper bounds of entropy. The more accurate model is the one with the lower cross entropy.
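To make the comparison concrete, the sketch below computes the cross entropy of two candidate models against a known source distribution. The distributions `p`, `q_a` and `q_b` are invented; in practice p is unknown and the sum is estimated on held-out text.

```python
import math


def cross_entropy(p, q):
    """-sum_i p_i * log2(q_i), in bits. Assumes q_i > 0 wherever p_i > 0."""
    return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def entropy(p):
    """Entropy is the cross entropy of a distribution with itself."""
    return cross_entropy(p, p)


p = [0.5, 0.25, 0.25]      # true source distribution (invented)
q_a = [0.4, 0.3, 0.3]      # model A
q_b = [1 / 3, 1 / 3, 1 / 3]  # model B (uniform)

h_p = entropy(p)
h_a = cross_entropy(p, q_a)
h_b = cross_entropy(p, q_b)
print(h_p, h_a, h_b)
```

Both cross entropies lie above the entropy of p, and model A, which is closer to p, has the lower value.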
![Page 23: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/23.jpg)
Imagine that y was generated with either model A or model B. Then:
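$$P(y) = P(A)\,P_A(y) + P(B)\,P_B(y)$$
Let x be a new random variable, $x \in \{A, B\}$, and let $P(x = A) = \lambda_A$, $P(x = B) = \lambda_B$. Then:
$$P_\lambda(x = A, y) = \lambda_A P_A(y) \qquad P_\lambda(x = B, y) = \lambda_B P_B(y)$$
$$P_\lambda(x = A \mid y) = \frac{\lambda_A P_A(y)}{\lambda_A P_A(y) + \lambda_B P_B(y)} \qquad P_\lambda(x = B \mid y) = \frac{\lambda_B P_B(y)}{\lambda_A P_A(y) + \lambda_B P_B(y)}$$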
![Page 24: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/24.jpg)
Cont.
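Proof of convergence of the EM algorithm
Define:
$$A(\lambda, \lambda') = \sum_y \tilde{P}(y) \sum_{x \in \{A,B\}} P_\lambda(x \mid y) \log P_{\lambda'}(x, y)$$
$$\begin{aligned}
A(\lambda, \lambda') - A(\lambda, \lambda) &= \sum_y \tilde{P}(y) \sum_{x \in \{A,B\}} P_\lambda(x \mid y) \log \frac{P_{\lambda'}(x, y)}{P_\lambda(x, y)} \\
&\le \sum_y \tilde{P}(y) \log \sum_{x \in \{A,B\}} P_\lambda(x \mid y)\,\frac{P_{\lambda'}(x, y)}{P_\lambda(x, y)} \\
&= \sum_y \tilde{P}(y) \log \frac{P_{\lambda'}(y)}{P_\lambda(y)} \\
&= F(\lambda') - F(\lambda)
\end{aligned}$$
where
$$F(\lambda) = \sum_y \tilde{P}(y) \log P_\lambda(y)$$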
![Page 25: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/25.jpg)
Expectation-Maximization (EM) Algorithm
Consider a problem in which the data D is a set of instances generated by a probability distribution that is a mixture of k distinct Normal distributions (assuming the same variances).
The hypothesis is therefore defined by the vector of the means of the distributions.
![Page 26: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/26.jpg)
Expectation-Maximization (EM) Algorithm
Step 1: Calculate the expected value of the hidden variables (which distribution generated each instance), assuming that the current hypothesis holds.
Step 2: Calculate a new maximum likelihood hypothesis, assuming that the hidden variables take their expected values. Then replace the current hypothesis with the new one.
Step 3: Go to Step 1.
![Page 27: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/27.jpg)
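If we find $\lambda'$ such that $A(\lambda, \lambda') > A(\lambda, \lambda)$, then $F(\lambda') > F(\lambda)$. So we need to maximize A with respect to $\lambda'$ under the constraint that all lambdas sum up to one. Use Lagrange multipliers:
$$G(\lambda, \lambda') = A(\lambda, \lambda') - \mu \Big( \sum_{i \in \{A,B\}} \lambda'_i - 1 \Big), \qquad \frac{\partial G}{\partial \lambda'_i} = 0$$
$$\lambda'_A = \sum_y \tilde{P}(y)\,\frac{\lambda_A P_A(y)}{\lambda_A P_A(y) + \lambda_B P_B(y)} \qquad \lambda'_B = \sum_y \tilde{P}(y)\,\frac{\lambda_B P_B(y)}{\lambda_A P_A(y) + \lambda_B P_B(y)}$$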
![Page 28: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/28.jpg)
The EM Algorithm
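Define:
$$C_A = \sum_{y \in D} \tilde{P}(y)\,\frac{\lambda_A P_A(y)}{\lambda_A P_A(y) + \lambda_B P_B(y)} \qquad C_B = \sum_{y \in D} \tilde{P}(y)\,\frac{\lambda_B P_B(y)}{\lambda_A P_A(y) + \lambda_B P_B(y)} \qquad \tilde{P}(y) = \frac{C(y)}{|D|}$$
Assign:
$$\lambda'_A = \frac{C_A}{C_A + C_B} \qquad \lambda'_B = \frac{C_B}{C_A + C_B}$$
and iterate!
Can be generalized analogously to more lambdas.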
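The update for two mixture weights can be sketched as below. This is an illustration under stated assumptions, not the slides' own code: `p_a` and `p_b` are hypothetical callables returning each model's probability of an item, and the toy models and data are invented.

```python
def em_lambdas(data, p_a, p_b, lam_a=0.5, iterations=50):
    """Estimate the weights (lambda_A, lambda_B) of a two-model mixture by EM."""
    n = len(data)
    for _ in range(iterations):
        c_a = c_b = 0.0
        for y in data:
            # Each occurrence of y contributes 1/|D|, i.e. P~(y) = C(y)/|D|.
            num_a = lam_a * p_a(y)
            num_b = (1.0 - lam_a) * p_b(y)
            denom = num_a + num_b
            if denom == 0.0:
                continue  # neither model can generate y; skip it
            c_a += num_a / denom / n
            c_b += num_b / denom / n
        if c_a + c_b == 0.0:
            break  # nothing to re-estimate from
        lam_a = c_a / (c_a + c_b)
    return lam_a, 1.0 - lam_a


# Invented toy models over two symbols: A prefers "x", B prefers "z".
def p_a(y):
    return {"x": 0.9, "z": 0.1}[y]


def p_b(y):
    return {"x": 0.1, "z": 0.9}[y]


mostly_a = em_lambdas(["x"] * 10, p_a, p_b)
balanced = em_lambdas(["x", "z"], p_a, p_b)
print(mostly_a, balanced)
```

In a language-modeling setting the two models would typically be n-gram estimates of different orders, and the data would be held-out text.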
![Page 29: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/29.jpg)
Measuring success rates
Recall = (#correct answers)/(#total possible answers)
Precision = (#correct answers)/(#answers)
Fallout = (#incorrect answers)/(#of spurious facts in the text)
F-measure = [(b^2+1)*P*R]/(b^2*P+R)
If b > 1, R is favored; if b < 1, P is favored.
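A short numeric check of the F-measure formula, with invented values P = 0.5 and R = 1.0, shows how b shifts the balance between precision and recall. The function name is an assumption for this sketch.

```python
def f_measure(precision, recall, b=1.0):
    """F = (b^2 + 1) * P * R / (b^2 * P + R); defined as 0 when P and R are both 0."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (b * b + 1) * precision * recall / (b * b * precision + recall)


precision, recall = 0.5, 1.0
f_balanced = f_measure(precision, recall, b=1.0)
f_high_b = f_measure(precision, recall, b=2.0)
f_low_b = f_measure(precision, recall, b=0.5)
print(f_balanced, f_high_b, f_low_b)
```

With these numbers, b = 2 pulls the score toward the recall value and b = 0.5 pulls it toward the precision value.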
![Page 30: Language Modeling](https://reader036.vdocuments.net/reader036/viewer/2022062304/568144cc550346895db19677/html5/thumbnails/30.jpg)
Chunking as Tagging
Even certain parsing problems can be solved via tagging
E.g.: ((A B) C ((D F) G)) BIA tags: A/B B/A C/I D/B F/A G/A