Lecture 2: Language modeling
LTAT.01.001 – Natural Language Processing
Kairit Sirts (kairit.sirts@ut.ee)
20.02.2019
The task of language modeling
The cat sat on the mat
The mat sat on the cat
The cat mat the on sat
Language modeling
Task:
• Estimate the quality/fluency/grammaticality of a natural language sentence or segment

Why?
• Generate new sentences
• Choose between several variants, picking the best-sounding one
Language modeling
Word: $w$
Sentence: $S = w_1 w_2 \dots w_n$
Language modeling
Can we use grammaticality-checking rules to determine the fluency of a sentence $S$?
• Theoretically, yes
• In practice:
  • Grammar-checking software is unreliable
  • Grammar-checking software is only available for a few languages
  • Its output is often non-continuous, which means that
    • it cannot be used in optimization
    • it cannot be used to easily choose the better output from many viable hypotheses
Language modeling
Instead we will try to calculate/model:

$P(S) = P(w_1 w_2 \dots w_n)$

P(The cat sat on the mat) > P(The mat sat on the cat)
P(The mat sat on the cat) > P(The cat mat the on sat)
How to compute the sentence probability?
P(The cat sat on the mat) = #(The cat sat on the mat) / #(all sentences) = ?
P(The mat sat on the cat) = #(The mat sat on the cat) / #(all sentences) = ?
P(The cat mat the on sat) = #(The cat mat the on sat) / #(all sentences) = ?

# – the number (count) of such sentences. That's clearly not doable in general!
How to compute the sentence probability?
Factorize the joint probability:
• In general:
$P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid A, B)$
• Similarly:
$P(w_1, w_2, \dots, w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, w_2, \dots, w_{n-1})$
• It still does not solve the problem!
Sentence probability
• Cannot estimate directly:
$P(w_1 w_2 \dots w_n) = \frac{\#(w_1 w_2 \dots w_n)}{\#(\text{all sentences})}$
• Cannot use the factorization:
$P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})$
Sentence probability
But word probabilities are doable:
• Take a huge text (millions/billions of words)
• Compute the probability for each word type (unique word):

$P(w) = \frac{\#(w)}{\#(\text{all words in the text})}$
Maximum likelihood (ML) estimate
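A minimal sketch of how such an estimate could be computed (not from the lecture; the toy corpus and variable names are only illustrative):

```python
from collections import Counter

# Toy stand-in for a "huge text"; in practice this would be millions or billions of words.
corpus = "the cat sat on the mat the mat sat on the cat".split()

counts = Counter(corpus)              # count of each word type
total = sum(counts.values())          # total number of tokens

# Maximum likelihood estimate: P(w) = #(w) / #(all words in the text)
unigram_prob = {w: c / total for w, c in counts.items()}

print(unigram_prob["the"])            # 4/12 ≈ 0.333
```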
Sentence probability
• What if we treat each word as independent of all other words? Then:

$P(S) \approx P(w_1) \times P(w_2) \times \dots \times P(w_n)$

P(The cat sat on the mat) = P(The mat sat on the cat)
P(The cat mat the on sat) = P(The mat sat on the cat)
Sentence probability
• Maybe add some context?

$P(S) \approx P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_2) \times \dots \times P(w_n \mid w_{n-1})$

P(The cat sat on the mat) = P(the) P(cat|the) P(sat|cat) P(on|sat) P(the|on) P(mat|the)
P(The mat sat on the cat) = P(the) P(mat|the) P(sat|mat) P(on|sat) P(the|on) P(cat|the)
P(The cat mat the on sat) = P(the) P(cat|the) P(mat|cat) P(the|mat) P(on|the) P(sat|on)
Sentence probability – Markov property
Independence assumption or Markov assumption (in the context of language modeling):
• The next word depends only on the current/last word.
• This is precisely the model from the previous slide; it is called a bigram language model because it looks at word bigrams.
N-gram language model
In general, we talk about n-gram language models, where the next word depends on a fixed history of n-1 words.
• Unigram model – all words are independent; the classical BOW approach
• Bigram model
• Trigram model – the next word depends on the last two words
• 4-gram model
• 5-gram model
Computing n-gram probabilities
• Unigrams $w_i$: $P(w_i) = \frac{\#(w_i)}{\#(\text{all words})}$
• Bigrams $w_{i-1} w_i$: $P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\#(w_{i-1})}$
• Trigrams $w_{i-2} w_{i-1} w_i$: $P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\#(w_{i-2}, w_{i-1}, w_i)}{\#(w_{i-2}, w_{i-1})}$
Sentence probability according to n-gram model
• If
$P(w_i \mid w_1, w_2, \dots, w_{i-1}) \approx P(w_i \mid w_{i-m}, \dots, w_{i-1})$
where $m = \text{order} - 1$:
  • Unigrams: $m = 0$
  • Bigrams: $m = 1$
  • Trigrams: $m = 2$, etc.
• Then
$P(S) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \dots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-m}, \dots, w_{i-1})$
Bigram language model: example
An example corpus:
1. the cat saw the mouse
2. the cat heard a mouse
3. the mouse heard
4. a mouse saw
5. a cat saw
| Bigram | Count | Unigram | Count | Bigram prob |
|---|---|---|---|---|
| START the | 3 | START | 5 | 0.6 |
| the cat | 2 | the | 4 | 0.5 |
| cat saw | 2 | cat | 3 | 0.67 |
| saw the | 1 | saw | 3 | 0.33 |
| the mouse | 2 | the | 4 | 0.5 |
| mouse END | 2 | mouse | 4 | 0.5 |
| cat heard | 1 | cat | 3 | 0.33 |
| heard a | 1 | heard | 2 | 0.5 |
| a mouse | 2 | a | 3 | 0.67 |
| START a | 2 | START | 5 | 0.4 |
| mouse saw | 1 | mouse | 4 | 0.25 |
| saw END | 2 | saw | 3 | 0.67 |
| a cat | 1 | a | 3 | 0.33 |
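The table can be reproduced with a few lines of counting code; the following is a rough sketch (the START/END markers and variable names are illustrative):

```python
from collections import Counter

# The five example sentences from the slide.
corpus = [
    "the cat saw the mouse",
    "the cat heard a mouse",
    "the mouse heard",
    "a mouse saw",
    "a cat saw",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["START"] + sentence.split() + ["END"]
    for first, second in zip(tokens, tokens[1:]):
        unigram_counts[first] += 1           # count of the history word
        bigram_counts[(first, second)] += 1

# P(second | first) = #(first, second) / #(first)
bigram_prob = {bg: cnt / unigram_counts[bg[0]] for bg, cnt in bigram_counts.items()}

print(round(bigram_prob[("START", "the")], 2))  # 0.6
print(round(bigram_prob[("cat", "saw")], 2))    # 0.67
```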
Bigram language model: example
P(The cat heard) = ?
| Bigram | Bigram prob |
|---|---|
| START the | 0.6 |
| the cat | 0.5 |
| cat saw | 0.67 |
| saw the | 0.33 |
| the mouse | 0.5 |
| mouse END | 0.5 |
| cat heard | 0.33 |
| heard a | 0.5 |
| a mouse | 0.67 |
| START a | 0.4 |
| mouse saw | 0.25 |
| saw END | 0.67 |
| a cat | 0.33 |
| heard END | 0.5 |
P(The cat heard) = P(START the) × P(the cat) × P(cat heard) × P(heard END) = 0.6 × 0.5 × 0.33 × 0.5 = 0.0495
P(The mouse saw the cat) = ?
P(the mouse saw the cat) = P(START the) × P(the mouse) × P(mouse saw) × P(saw the) × P(the cat) × P(cat END) = 0.6 × 0.5 × 0.25 × 0.33 × 0.5 × 0 = 0
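A small scoring function over these bigram probabilities (continuing the counting sketch above, where bigram_prob was built) makes the zero-probability problem explicit:

```python
def bigram_sentence_prob(sentence, bigram_prob):
    """Multiply bigram probabilities along the sentence; unseen bigrams get probability 0."""
    tokens = ["START"] + sentence.lower().split() + ["END"]
    prob = 1.0
    for first, second in zip(tokens, tokens[1:]):
        prob *= bigram_prob.get((first, second), 0.0)
    return prob

print(bigram_sentence_prob("The cat heard", bigram_prob))          # 0.05 (the slide's 0.0495 uses the rounded 0.33)
print(bigram_sentence_prob("The mouse saw the cat", bigram_prob))  # 0.0 – "cat END" was never observed
```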
Morphology
Source: www.rabiaergin.com
Sparsity issues
Natural languages are sparse!
Consider a vocabulary of size 60,000:
• How many possible unigrams, bigrams, trigrams are there? (see the quick calculation below)
• How large a text corpus do we need to obtain reliable statistics for all n-grams?
• Does more data solve the problem completely?
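A quick back-of-the-envelope calculation (not from the slides):

```python
V = 60_000
print(f"{V:,} possible unigrams")      # 60,000
print(f"{V**2:,} possible bigrams")    # 3,600,000,000 (3.6 billion)
print(f"{V**3:,} possible trigrams")   # 216,000,000,000,000 (2.16 * 10^14)
```

Even a corpus of billions of words cannot contain most of the possible trigrams even once, so zero counts remain no matter how much data is added.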
Zipf’s law
• Given some corpus of natural language text, the frequency of any word is inversely proportional to its rank in the frequency table
• The most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on
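One way to eyeball Zipf's law on a corpus is to check that frequency × rank stays roughly constant; a small sketch (the file path is only a placeholder):

```python
from collections import Counter

# "corpus.txt" is a placeholder for any sizable plain-text file.
with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

freqs = [count for _, count in Counter(words).most_common()]
# Under Zipf's law, frequency * rank is roughly constant across ranks.
for rank in (1, 2, 3, 10, 100):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1], rank * freqs[rank - 1])
```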
(Figure: Zipf's law illustrated with word-frequency data. Source: Masrai and Milton, 2006, "How Different is Arabic from Other Languages? The Relationship between Word Frequency and Lexical Coverage")
Smoothing
The general idea: find a way to fill the gaps in the counts
• Take care not to change the original distribution too much
• Fill in the gaps only as much as needed: as the corpus grows larger, there are fewer gaps to fill

Smoothing methods:
• Add λ method
• Interpolation
• (Modified) Kneser-Ney
• There are others
Add λ method
Pretend that every n-gram occurred λ more times than it actually did.
• Usual bigram probability:
$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\#(w_{i-1})}$
• Add $0 < \lambda \le 1$ to all bigram counts:
$P_\lambda(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i) + \lambda}{\#(w_{i-1}) + \lambda|V|}$
• Special case $\lambda = 1$: add-one smoothing
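A sketch of the smoothed estimate (reusing the bigram_counts and unigram_counts from the earlier counting example; the vocabulary size here is only a crude stand-in):

```python
def add_lambda_prob(bigram, bigram_counts, unigram_counts, vocab_size, lam=1.0):
    """P_lambda(w_i | w_{i-1}) = (#(w_{i-1}, w_i) + lambda) / (#(w_{i-1}) + lambda * |V|)."""
    first, _second = bigram
    return (bigram_counts.get(bigram, 0) + lam) / (unigram_counts.get(first, 0) + lam * vocab_size)

V = len(unigram_counts)  # rough vocabulary size of the toy corpus
print(add_lambda_prob(("cat", "END"), bigram_counts, unigram_counts, V))  # no longer 0
```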
Add λ method
• Advantages
  • Very simple
  • Easy to apply
• Disadvantages
  • Performs poorly (according to Chen & Goodman)
  • All unseen events receive the same probability
  • All events are upgraded by λ
Interpolation (Jelinek-Mercer smoothing)
If the bigram $w_{i-1} w_i$ is unseen:
• Originally its probability would be 0:
$P(w_i \mid w_{i-1}) = 0$
• Instead of 0 we could use the probability of the shorter n-gram (unigram): $P(w_i)$
• We must make sure that the total probability mass remains the same
• Thus interpolate between the unigram and bigram distributions:
$P_{JM}(w_i \mid w_{i-1}) = \lambda\, P(w_i \mid w_{i-1}) + (1 - \lambda)\, P(w_i)$
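A minimal sketch of the interpolated bigram estimate (bigram_prob and unigram_prob are ML dictionaries like the ones built in the earlier sketches; λ would normally be tuned on held-out data):

```python
def interpolated_prob(bigram, bigram_prob, unigram_prob, lam=0.7):
    """P_JM(w_i | w_{i-1}) = lam * P_ML(w_i | w_{i-1}) + (1 - lam) * P_ML(w_i)."""
    _first, second = bigram
    return lam * bigram_prob.get(bigram, 0.0) + (1 - lam) * unigram_prob.get(second, 0.0)
```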
Interpolation (Jelinek-Mercer smoothing)
• Recursive formulation: the nth-order smoothed model is defined recursively as a linear interpolation between the nth-order maximum likelihood (ML) model and the (n−1)th-order smoothed model:
$P_{JM}(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \lambda_{w_{i-n+1} \dots w_{i-1}}\, P_{ML}(w_i \mid w_{i-n+1}, \dots, w_{i-1}) + (1 - \lambda_{w_{i-n+1} \dots w_{i-1}})\, P_{JM}(w_i \mid w_{i-n+2}, \dots, w_{i-1})$
• Can ground the recursion with:
  • 1st-order unigram model
  • 0th-order uniform model: $P(w) = \frac{1}{|V|}$
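The recursion can be sketched as follows; this is a simplification in which a single λ is shared across all orders (in the formulation above λ depends on the history), and ngram_probs is an assumed dictionary mapping (history, word) to ML probabilities, with the empty history () for unigrams:

```python
def interp_prob(word, history, ngram_probs, lam=0.7, vocab_size=10_000):
    """Interpolate the ML n-gram estimate with the lower-order smoothed model,
    grounding the recursion in the 0th-order uniform distribution 1/|V|."""
    if not history:
        unigram = ngram_probs.get(((), word), 0.0)
        return lam * unigram + (1 - lam) / vocab_size
    ml = ngram_probs.get((tuple(history), word), 0.0)
    shorter = tuple(history)[1:]          # back off by dropping the oldest word
    return lam * ml + (1 - lam) * interp_prob(word, shorter, ngram_probs, lam, vocab_size)
```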
Software for language modeling
• KenLM: https://github.com/kpu/kenlm
• SRILM: http://www.speech.sri.com/projects/srilm/
• IRSTLM: http://hlt-mt.fbk.eu/technologies/irstlm
• Others: http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel
Language model evaluation
• Intrinsic evaluation
  • Perplexity
  • Quick and simple
  • Improvements in perplexity might not translate into improvements in downstream tasks
• Extrinsic evaluation
  • In a downstream task (machine translation, speech recognition, etc.)
  • More difficult and time-consuming
  • More accurate evaluation (although beware of confounding with other factors)
Perplexity
• Perplexity is a measure of how well a probability model predicts a sample
• A language model is a probability model over language
• To evaluate a language model, compute the perplexity over a held-out set (test set):

$PP = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-n+1}, \dots, w_{i-1})}$
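A sketch of how the perplexity of a test set could be computed (prob_fn stands for any smoothed n-gram model; names and the padding convention are illustrative):

```python
import math

def perplexity(test_tokens, prob_fn, n=2):
    """PP = 2 ** ( -1/N * sum_i log2 P(w_i | w_{i-n+1} .. w_{i-1}) ).

    prob_fn(word, history) should return the (smoothed, non-zero) model
    probability of `word` given the (n-1)-word `history`.
    """
    padded = ["START"] * (n - 1) + list(test_tokens)   # one padding convention; others exist
    log_sum = 0.0
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])
        log_sum += math.log2(prob_fn(padded[i], history))
    return 2 ** (-log_sum / len(test_tokens))
```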
Perplexity
• The lower the perplexity, the better the language model, i.e. the less "surprised" the model is when seeing the evaluation data
• The exponent is really the cross-entropy, which measures the number of bits needed to represent a word:

$H(\tilde{p}, \hat{p}) = -\sum_{i} \tilde{p}(w_i) \log_2 \hat{p}(w_i)$

• $\tilde{p}(w_i) = \frac{\#(w_i)}{N}$ – the empirical unigram probability
• $\hat{p}(w_i) = P_{LM}(w_i \mid w_{i-n+1}, \dots, w_{i-1})$ – the model probability
Perplexity
• Let's assume that the cross-entropy on a test set is 7.95
• This means that each word in the test set could be encoded with 7.95 bits
• The model perplexity would be $2^{7.95} \approx 247$ per word
• This means that the model is as confused on the test data as if it had to choose uniformly at random from 247 possibilities for each word
Perplexity
• Perplexity is corpus-specific: only perplexities calculated on the same test set are comparable
• For a meaningful comparison, the vocabulary sizes of the two language models must be the same, e.g.
  • You can compare a bigram language model to a trigram language model that both use vocabulary size 10000
  • You cannot compare a trigram language model using vocabulary size 10000 to a trigram language model using vocabulary size 20000
Neural language models
• Window-based feed-forward neural language model
• Recurrent neural language model
Feed-forward neural language model (Bengio et al., 2003)

$x = [v(w_{i-n+1}); \dots; v(w_{i-2}); v(w_{i-1})]$
$h = g(xW_1 + b_1)$
$P(w_i \mid w_{i-n+1}, \dots, w_{i-2}, w_{i-1}) = \mathrm{softmax}(hW_2 + b_2)$

where $v(w)$ is the embedding vector of word $w$ and $;$ denotes concatenation.
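The equations map almost one-to-one onto a few lines of PyTorch; the following is only an illustrative sketch (PyTorch is not part of the lecture, and the layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Window-based LM: embed the n-1 context words, concatenate them,
    apply one hidden layer, and predict the next word with a softmax."""

    def __init__(self, vocab_size, emb_dim=100, context=2, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(context * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):                    # context_ids: (batch, context)
        x = self.embed(context_ids).flatten(1)         # x = [v(w_{i-n+1}); ...; v(w_{i-1})]
        h = torch.tanh(self.hidden(x))                 # h = g(xW_1 + b_1)
        return torch.log_softmax(self.out(h), dim=-1)  # log P(w_i | context)
```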
Recurrent neural language model (Mikolov et al., 2010)
Source: http://colah.github.io
$h_i = g(x_i W_x + h_{i-1} W_h + b_h)$
$P(w_i \mid w_1, \dots, w_{i-2}, w_{i-1}) = \mathrm{softmax}(h_i W_2 + b_2)$
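An analogous PyTorch sketch (again illustrative rather than the lecture's code; a vanilla RNN cell is used here, though LSTM/GRU cells are common in practice):

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Recurrent LM: the hidden state summarizes the whole history,
    so the prediction is not limited to a fixed context window."""

    def __init__(self, vocab_size, emb_dim=100, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids):                       # token_ids: (batch, seq_len)
        h, _ = self.rnn(self.embed(token_ids))          # h_i = g(x_i W_x + h_{i-1} W_h + b_h)
        return torch.log_softmax(self.out(h), dim=-1)   # log P(next word | w_1 .. w_i) at each step
```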
Training the language model with cross-entropy loss

$L_{\text{cross-entropy}}(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log \hat{y}_j = -\log \hat{y}_t$

• $|V|$ – the vocabulary size
• $t$ – the index of the correct word
Why is the softmax over a large vocabulary computationally costly?
• What is a softmax?

$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$

• Now take the derivative of this with respect to $z_j$
• The sum over the whole vocabulary will remain in the derivative (check it yourself)
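A small numerical illustration of this point (NumPy used only for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5, -1.0])   # pretend these are the |V| output scores
s = softmax(z)

# Analytic Jacobian: d softmax_i / d z_j = s_i * (delta_ij - s_j).
# Every entry involves the normalizing sum, so differentiating even one
# output still touches all |V| scores.
jacobian = np.diag(s) - np.outer(s, s)
print(jacobian)
```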
How to handle large softmax?
• Hierarchical softmax
  • Decompose the softmax layer into a binary tree
  • Reduces the complexity of the output distribution from O(|V|) to O(log|V|)
• Self-normalization
• Approximate softmax

Source: https://becominghuman.ai
What to do with infrequent words
• Typically, the vocabulary size is fixed, ranging anywhere between 10K and 200K words
• Still, there will always be words that are not part of the vocabulary (remember Turkish?)
• The most common approach is to simply replace all out-of-vocabulary (OOV) words with a special UNK token (a small sketch follows this list)
• Another option is to reduce the sparsity by constructing the vocabulary from subword units:
  • Morphemes, characters, syllables, …
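A minimal sketch of the UNK replacement (the cut-off and names are illustrative):

```python
from collections import Counter

def build_vocab(tokens, max_size=10_000):
    """Keep the most frequent word types; everything else will map to UNK."""
    return {w for w, _ in Counter(tokens).most_common(max_size - 1)}

def replace_oov(tokens, vocab):
    return [w if w in vocab else "UNK" for w in tokens]
```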
What to do with infrequent words
• What if there are no UNKs in the training set?
  • Use a random UNK vector during testing
  • Randomly replace some infrequent words with UNK during training
• Construct word embeddings from characters (we'll talk about this in more detail later)
  • Works for input (context) words
  • Cannot be used for output words
Character-level language model
• For instance, for generating text with mark-up
• A. Karpathy, 2015. The Unreasonable Effectiveness of Recurrent Neural Networks
• Generated text based on an LM trained on Wikipedia (the sample itself was shown as an image)
Using language models
• For scoring sentences
  • Speech recognition
  • Using an LM for text classification
  • Statistical machine translation
• For generating text
  • Neural machine translation
  • Dialogue generation
  • Abstractive summarization