# Language Models For Speech Recognition
## Speech Recognition

$A$: sequence of acoustic vectors

Find the word sequence $\hat{W}$ such that

$$\hat{W} = \arg\max_W P(W \mid A)$$

By Bayes' rule,

$$P(W \mid A) = \frac{P(A \mid W)\,P(W)}{P(A)}$$

and since $P(A)$ does not depend on $W$,

$$\hat{W} = \arg\max_W P(A \mid W)\,P(W)$$

The task of a language model is to make available to the recognizer adequate estimates of the probabilities $P(W)$.
## Language Models

$W = w_1, w_2, \ldots, w_n$, with $w_i \in V$:

$$P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$$

For example, for $W$ = "speech recognition is difficult":

$$P(W) = P(\text{speech}) \cdot P(\text{recognition} \mid \text{speech}) \cdot P(\text{is} \mid \text{speech recognition}) \cdot P(\text{difficult} \mid \text{speech recognition is})$$
## N-gram Models

Make the Markov assumption that only the prior local context, the last $N-1$ words, affects the next word.

For $W = w_1, w_2, \ldots, w_n$ with $w_i \in V$:

- $N=3$, trigrams: $P(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})$
- $N=2$, bigrams: $P(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$
- $N=1$, unigrams: $P(W) = \prod_{i=1}^{n} P(w_i)$
## Parameter Estimation

Maximum likelihood estimator:

- $N=3$, trigrams: $P(w_3 \mid w_1, w_2) = f(w_3 \mid w_1, w_2) = \dfrac{c(w_1, w_2, w_3)}{c(w_1, w_2)}$
- $N=2$, bigrams: $P(w_2 \mid w_1) = f(w_2 \mid w_1) = \dfrac{c(w_1, w_2)}{c(w_1)}$
- $N=1$, unigrams: $P(w_1) = f(w_1) = \dfrac{c(w_1)}{N}$

This will assign zero probability to unseen events.
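To make the estimator concrete, here is a minimal Python sketch of the bigram case (the function name and corpus format are our own, not from the slides):

```python
from collections import Counter

def mle_bigram_model(corpus):
    """Maximum likelihood bigram estimates f(w2 | w1) = c(w1, w2) / c(w1).

    `corpus` is a list of tokenized sentences; c(w1) counts w1 as a
    left context, so the estimates sum to one over w2.
    """
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        for w1, w2 in zip(sentence, sentence[1:]):
            contexts[w1] += 1
            bigrams[(w1, w2)] += 1
    return lambda w1, w2: bigrams[(w1, w2)] / contexts[w1] if contexts[w1] else 0.0

# p(recognition | speech) on a one-sentence corpus:
p = mle_bigram_model([["speech", "recognition", "is", "difficult"]])
print(p("speech", "recognition"))  # 1.0
print(p("speech", "difficult"))    # 0.0 -- unseen events get zero probability
```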
## Number of Parameters

For a vocabulary of size $V$, a 1-gram model has $V - 1$ independent parameters; a 2-gram model has $V^2 - 1$; in general, an n-gram model has $V^n - 1$ independent parameters.

Typical values for a moderate-size vocabulary of 20,000 words:

| Model | Parameters |
| --- | --- |
| 1-gram | 20,000 |
| 2-gram | 20,000² = 400 million |
| 3-gram | 20,000³ = 8 trillion |
## Number of Parameters

$|V| = 60{,}000$, $N = 35$M words (Eleftherotypia daily newspaper):

| Count | 1-grams | 2-grams | 3-grams |
| --- | --- | --- | --- |
| 1 | 160,273 | 3,877,976 | 13,128,073 |
| 2 | 51,725 | 784,012 | 1,802,348 |
| 3 | 27,171 | 314,114 | 562,264 |
| > 0 | 390,796 | 5,834,632 | 16,515,051 |
| ≥ 0 | 390,796 | 36×10⁸ | 216×10¹² |

In a typical training text, roughly 80% of the distinct trigrams occur only once. By the Good-Turing estimate $n_1/N$, ML estimates will be zero for 37.5% of the 3-grams and for 11% of the 2-grams encountered in new text.
## Problems

Data sparseness: we do not have enough data to train the model parameters.

Solutions:

- Smoothing techniques: accurately estimate probabilities in the presence of sparse data (Good-Turing, Jelinek-Mercer (linear interpolation), Katz (backing-off))
- Build compact models: they have fewer parameters to train and thus require less data
  - equivalence classification of words, e.g. grammatical classes (noun, verb, adjective, preposition) or semantic labels (city, name, date)
## Smoothing

- Make distributions more uniform
- Redistribute probability mass from high-probability events to low-probability events
## Additive Smoothing

For each n-gram that occurs $r$ times, pretend that it occurs $r+1$ times, e.g. for bigrams:

$$P(w_2 \mid w_1) = \frac{c(w_1, w_2) + 1}{c(w_1) + V}$$
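In code, the same estimate is a one-liner (a sketch with our own naming; `bigrams` and `contexts` are plain count dicts):

```python
def add_one_bigram(bigrams, contexts, V):
    """Add-one estimate P(w2 | w1) = (c(w1, w2) + 1) / (c(w1) + V):
    every unseen bigram now receives a small nonzero probability."""
    return lambda w1, w2: (bigrams.get((w1, w2), 0) + 1) / (contexts.get(w1, 0) + V)
```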
## Good-Turing Smoothing

For any n-gram that occurs $r$ times, pretend that it occurs $r^*$ times:

$$r^* = (r + 1)\,\frac{n_{r+1}}{n_r}$$

where $n_r$ is the number of n-grams which occur $r$ times.

To convert this count to a probability we just normalize:

$$P_{GT} = \frac{r^*}{N}$$

Total probability of unseen n-grams: $\dfrac{n_1}{N}$
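A small sketch of the count adjustment and the unseen mass (the `max_r` cutoff is our assumption; in practice counts above some threshold are usually left unsmoothed):

```python
from collections import Counter

def good_turing_adjusted(ngram_counts, max_r=7):
    """Adjusted counts r* = (r + 1) * n_{r+1} / n_r, where n_r is the
    number of distinct n-grams occurring exactly r times."""
    n = Counter(ngram_counts.values())  # n[r] = count of counts
    return {r: (r + 1) * n[r + 1] / n[r] for r in range(1, max_r + 1) if n[r]}

def unseen_mass(ngram_counts, N):
    """Total probability reserved for unseen n-grams: n_1 / N."""
    return sum(1 for c in ngram_counts.values() if c == 1) / N
```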
## Example

| r (= ML count) | $n_r$ | r* (= GT count) |
| --- | --- | --- |
| 0 | 3,594,165,368 | 0.001078 |
| 1 | 3,877,976 | 0.404 |
| 2 | 784,012 | 1.202 |
| 3 | 314,114 | 2.238 |
| 4 | 175,720 | 3.187 |
| 5 | 112,006 | 4.199 |
| 6 | 78,391 | 5.238 |
| 7 | 58,661 | 6.270 |
## Good-Turing

Intuitively:

$$c(\text{DISCARD THE}) = 0, \qquad c(\text{DISCARD THOU}) = 0$$

so Good-Turing assigns

$$p(\text{THE} \mid \text{DISCARD}) = p(\text{THOU} \mid \text{DISCARD})$$

whereas intuitively we would expect

$$p(\text{THE} \mid \text{DISCARD}) > p(\text{THOU} \mid \text{DISCARD})$$

## Jelinek-Mercer Smoothing (linear interpolation)

Interpolate a higher-order model with a lower-order model:

$$p_{\text{interp}}(w_i \mid w_{i-1}) = \lambda\, p_{ML}(w_i \mid w_{i-1}) + (1 - \lambda)\, p_{ML}(w_i)$$

Given fixed $p_{ML}$, it is possible to search efficiently for the $\lambda$ that maximizes the probability of some held-out data using the Baum-Welch algorithm.
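The $\lambda$ search can be written as a small EM loop; the following sketch assumes callable ML models and a list of held-out bigram tokens (all naming is ours):

```python
def interpolated(p_bi_ml, p_uni_ml, lam):
    """Jelinek-Mercer model: lam * p_ML(w2 | w1) + (1 - lam) * p_ML(w2)."""
    return lambda w1, w2: lam * p_bi_ml(w1, w2) + (1 - lam) * p_uni_ml(w2)

def estimate_lambda(p_bi_ml, p_uni_ml, heldout_bigrams, iters=20):
    """Baum-Welch (EM) re-estimation of the mixture weight: the E-step
    computes the posterior that the bigram branch generated each held-out
    token, the M-step averages those posteriors."""
    lam = 0.5
    for _ in range(iters):
        posteriors = []
        for w1, w2 in heldout_bigrams:
            hi = lam * p_bi_ml(w1, w2)
            lo = (1 - lam) * p_uni_ml(w2)
            if hi + lo > 0:
                posteriors.append(hi / (hi + lo))
        lam = sum(posteriors) / len(posteriors)
    return lam
```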
## Katz Smoothing (backing-off)

For those events which have been observed in the training data we assume some reliable estimate of the probability. For the remaining unseen events we back off to some less specific distribution:

$$c_{BO}(w_{i-1}, w_i) = \begin{cases} r & \text{if } r \geq k \\ d_r\, r & \text{if } 1 \leq r < k \\ \alpha(w_{i-1})\, p_{ML}(w_i) & \text{if } r = 0 \end{cases}$$

where $r = c(w_{i-1}, w_i)$, and

$$p_{BO}(w_i \mid w_{i-1}) = \frac{c_{BO}(w_{i-1}, w_i)}{\sum_{w_i} c_{BO}(w_{i-1}, w_i)}$$

$\alpha(w_{i-1})$ is chosen so that the total probability sums to 1.
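A didactic sketch of this backing-off structure (not Katz's exact discounts; the discount table `d`, the cutoff `k`, and the linear scan over the bigram table are our simplifications):

```python
def katz_bigram(c_bi, c_uni, p_uni, d, k):
    """Seen bigrams keep their count (r >= k) or a discounted d[r] * r
    (1 <= r < k); the freed mass alpha(w1) goes to unseen successors in
    proportion to the unigram model."""
    def discounted(r):
        return r if r >= k else d[r] * r
    def p(w1, w2):
        r = c_bi.get((w1, w2), 0)
        if r > 0:
            return discounted(r) / c_uni[w1]
        seen = {v for (u, v) in c_bi if u == w1}
        alpha = 1 - sum(discounted(c_bi[w1, v]) for v in seen) / c_uni[w1]
        return alpha * p_uni(w2) / (1 - sum(p_uni(v) for v in seen))
    return p
```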
## Witten-Bell Smoothing

Model the probability of new events by estimating the probability of seeing such a new event as we proceed through the training corpus (i.e. from the number of word types observed after the current history):

$$p_{WB}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}}\, p_{ML}(w_i \mid w_{i-n+1}^{i-1}) + \left(1 - \lambda_{w_{i-n+1}^{i-1}}\right) p_{WB}(w_i \mid w_{i-n+2}^{i-1})$$

$$1 - \lambda_{w_{i-n+1}^{i-1}} = \frac{T}{T + N}$$

where $T$ is the number of distinct word types and $N$ the number of tokens following the history $w_{i-n+1}^{i-1}$ in the training data.
## Absolute Discounting

Subtract a constant $D$ from each nonzero count:

$$p_{\text{abs}}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left\{c(w_{i-n+1}^{i}) - D,\ 0\right\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + \left(1 - \lambda_{w_{i-n+1}^{i-1}}\right) p_{\text{abs}}(w_i \mid w_{i-n+2}^{i-1})$$

$$D = \frac{n_1}{n_1 + 2 n_2}$$
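Plugging in the bigram counts-of-counts from the example table a few slides back gives a concrete value for $D$ (a quick check, not from the slides):

```python
# n1, n2: numbers of bigrams seen exactly once / exactly twice (example table)
n1, n2 = 3_877_976, 784_012
D = n1 / (n1 + 2 * n2)
print(round(D, 3))  # 0.712
```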
## Kneser-Ney

Make the lower-order distribution proportional not to the number of occurrences of a word, but to the number of different words that it follows:

$$p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left\{c(w_{i-n+1}^{i}) - D,\ 0\right\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + \gamma(w_{i-n+1}^{i-1})\, p_{KN}(w_i \mid w_{i-n+2}^{i-1})$$
## Modified Kneser-Ney

$$p_{MKN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i}) - D\!\left(c(w_{i-n+1}^{i})\right)}{\sum_{w_i} c(w_{i-n+1}^{i})} + \gamma(w_{i-n+1}^{i-1})\, p_{MKN}(w_i \mid w_{i-n+2}^{i-1})$$

$$D(c) = \begin{cases} 0 & \text{if } c = 0 \\ D_1 & \text{if } c = 1 \\ D_2 & \text{if } c = 2 \\ D_{3+} & \text{if } c \geq 3 \end{cases}$$

$$Y = \frac{n_1}{n_1 + 2 n_2}, \qquad D_1 = 1 - 2Y\frac{n_2}{n_1}, \qquad D_2 = 2 - 3Y\frac{n_3}{n_2}, \qquad D_{3+} = 3 - 4Y\frac{n_4}{n_3}$$
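The discounts can be computed directly from the counts-of-counts; using the bigram figures from the example table (our own check):

```python
def mkn_discounts(n1, n2, n3, n4):
    """Modified Kneser-Ney discounts D1, D2, D3+ from counts-of-counts."""
    Y = n1 / (n1 + 2 * n2)
    return 1 - 2 * Y * n2 / n1, 2 - 3 * Y * n3 / n2, 3 - 4 * Y * n4 / n3

print(mkn_discounts(3_877_976, 784_012, 314_114, 175_720))
# -> approximately (0.712, 1.144, 1.407)
```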
## Measuring Model Quality

Consider the language as an information source L, which emits a sequence of symbols $w_i$ from a finite alphabet (the vocabulary).

The quality of a language model M can be judged by its cross entropy with regard to the distribution $P_T(x)$ of some hitherto unseen text T:

$$H(P_T; P_M) = -\sum_x P_T(x) \log P_M(x) \approx -\frac{1}{n} \sum_{x \in T} \log P_M(x)$$

where $n$ is the length of T. Intuitively speaking, cross entropy is the entropy of T as "perceived" by the model M.
## Perplexity

Perplexity:

$$PP_M(T) = 2^{H(P_T;\, P_M)}$$

In a language with perplexity X, every word can be followed by X different words with equal probabilities.
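A minimal sketch of measuring perplexity on held-out text (the `model(history, word)` interface is our assumption):

```python
import math

def perplexity(model, text):
    """PP = 2**H, with H the average negative log2 probability that the
    model assigns to each word of the unseen text."""
    H = -sum(math.log2(model(text[:i], w)) for i, w in enumerate(text)) / len(text)
    return 2 ** H

# Sanity check: a uniform model over a 64K vocabulary has perplexity 64,000.
print(perplexity(lambda hist, w: 1 / 64_000, ["any", "four", "words", "here"]))
```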
## Elements of Information Theory

Entropy:

$$H(X) = -\sum_{x \in X} p(x)\,\log p(x)$$

Mutual information:

$$I(X; Y) = \sum_{x \in X}\sum_{y \in Y} p(x, y)\,\log \frac{p(x, y)}{p(x)\,p(y)}$$

Pointwise:

$$I(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$

Kullback-Leibler (KL) divergence:

$$D(p \parallel q) = \sum_{x \in X} p(x)\,\log \frac{p(x)}{q(x)}$$
## The Greek Language

- Highly inflectional language
- A Greek vocabulary of 220K words is needed in order to achieve 99.6% lexical coverage

| | English | French | Greek | German |
| --- | --- | --- | --- | --- |
| Source | Wall Street Journal | Le Monde | Eleftherotypia | Frankfurter Rundschau |
| Corpus size | 37.2 M | 37.7 M | 35 M | 31.5 M |
| Distinct words | 165 K | 280 K | 410 K | 500 K |
| Vocabulary size | 60 K | 60 K | 60 K | 60 K |
| Lexical coverage | 99.6 % | 98.3 % | 96.5 % | 95.1 % |
## Perplexity

| | English | French | Greek | German |
| --- | --- | --- | --- | --- |
| Vocabulary size | 20 K | 20 K | 64 K | 64 K |
| 2-gram PP | 198 | 178 | 232 | 430 |
| 3-gram PP | 135 | 119 | 163 | 336 |
## Experimental Results

| Smoothing | PP (1M) | WER (1M) | PP (5M) | WER (5M) | PP (35M) | WER (35M) |
| --- | --- | --- | --- | --- | --- | --- |
| Good-Turing | 341 | 27.71 | 248 | 23.48 | 163 | 19.59 |
| Witten-Bell | 354 | 27.42 | 251 | 24.17 | 163 | 19.84 |
| Absolute discounting | 344 | 28.47 | 256 | 24.25 | 169 | 20.78 |
| Modified Kneser-Ney | 328 | 26.78 | 237 | 21.91 | 156 | 18.57 |

| | 1M | 5M | 35M |
| --- | --- | --- | --- |
| OOV | 4.75% | 3.46% | 3.17% |
## Hit Rate

| | hit rate % (1M) | hit rate % (5M) | hit rate % (35M) |
| --- | --- | --- | --- |
| 1-gram | 27.3 | 16.4 | 7.4 |
| 2-gram | 52.5 | 49.9 | 40 |
| 3-gram | 20.2 | 33.7 | 52.6 |
## Class-based Models

Some words are similar to other words in their meaning and syntactic function.

Group words into classes:
- fewer parameters
- better estimates
## Class-based n-gram Models

Suppose that we partition the vocabulary into G classes. This model produces text by first generating a string of classes $g_1, g_2, \ldots, g_n$ and then converting them into the words $w_i$, $i = 1, 2, \ldots, n$ with probability $p(w_i \mid g_i)$:

$$p(w_i \mid w_{i-2}, w_{i-1}) = p(g_i \mid g_{i-2}, g_{i-1})\, p(w_i \mid g_i)$$

An n-gram model has $V^n - 1$ independent parameters ($216 \times 10^{12}$). A class-based model has $G^n - 1 + V - G$ parameters ($\approx 10^9$):

- $G^n - 1$ of an n-gram model over a "vocabulary" of size G
- $V - G$ of the form $p(w_i \mid g_i)$
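The parameter counts quoted above are easy to verify; taking $G = 1000$ as an example consistent with the slide's $\approx 10^9$ figure:

```python
V, G, n = 60_000, 1_000, 3
print(V ** n - 1)            # 215_999_999_999_999, i.e. ~216e12 word parameters
print(G ** n - 1 + (V - G))  # 1_000_058_999, i.e. ~1e9 class-model parameters
```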
## Relation to n-grams

$$p(w_i \mid w_{i-2}, w_{i-1}) = p(g_i \mid g_{i-2}, g_{i-1})\, p(w_i \mid g_i)$$
## Defining Classes

Manually:
- use part-of-speech labels assigned by linguistic experts or a tagger
- use stem information

Automatically:
- cluster words as part of an optimization method, e.g. maximize the log-likelihood of test text
## Agglomerative Clustering

- Bottom-up clustering
- Start with a separate cluster for each word
- Merge the pair for which the loss in average mutual information is least

$$H(L, G) = -\frac{1}{N} \log p(w_1, \ldots, w_N) = H(w) - I(g_1; g_2)$$
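A brute-force sketch of this procedure (recomputing the average MI for every candidate merge; the practical algorithm of Brown et al. updates it incrementally, so treat this as illustration only):

```python
import math
from collections import Counter
from itertools import combinations

def avg_mutual_information(bigram_counts, cls):
    """I(g1; g2) between adjacent classes under the clustering `cls`."""
    N = sum(bigram_counts.values())
    joint, left, right = Counter(), Counter(), Counter()
    for (w1, w2), c in bigram_counts.items():
        g1, g2 = cls[w1], cls[w2]
        joint[g1, g2] += c
        left[g1] += c
        right[g2] += c
    return sum(c / N * math.log2(c * N / (left[g1] * right[g2]))
               for (g1, g2), c in joint.items())

def cluster(bigram_counts, vocab, n_classes):
    """Bottom-up: one cluster per word, then greedily merge the pair
    whose merge loses the least average mutual information."""
    cls = {w: i for i, w in enumerate(vocab)}
    while len(set(cls.values())) > n_classes:
        best_pair, best_mi = None, -math.inf
        for a, b in combinations(sorted(set(cls.values())), 2):
            trial = {w: a if g == b else g for w, g in cls.items()}
            mi = avg_mutual_information(bigram_counts, trial)
            if mi > best_mi:
                best_pair, best_mi = (a, b), mi
        a, b = best_pair
        cls = {w: a if g == b else g for w, g in cls.items()}
    return cls
```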
## Example

Syntactic classes:
- verbs, past tense: άναψαν, επέλεξαν, κατέλαβαν, πλήρωσαν, πυροβόλησαν
- nouns, neuter: άλογο, δόντι, δέντρο, έντομο, παιδί, ρολόι, σώμα
- adjectives, masculine: δημοκρατικός, δημόσιος, ειδικός, εμπορικός, επίσημος

Semantic classes:
- last names: βαρδινογιάννης, γεννηματάς, λοβέρδος, ράλλης
- countries: βραζιλία, βρετανία, γαλλία, γερμανία, δανία
- numerals: δέκατο, δεύτερο, έβδομο, εικοστό, έκτο, ένατο, όγδοο

Some not so well defined classes:
- ανακριβής, αναμεταδίδει, διαφημίσουν, κομήτες, προμήθευε
- εξίσωση, έτρωγαν, και, μαλαισία, νηπιαγωγών, φεβρουάριος
## Stem-based Classes

- άγνωστ: άγνωστος, άγνωστου, άγνωστο, άγνωστον, άγνωστοι, άγνωστους, άγνωστη, άγνωστης, άγνωστες, άγνωστα
- βλέπ: βλέπω, βλέπεις, βλέπει, βλέπουμε, βλέπετε, βλέπουν
- εκτελ: εκτελεί, εκτελούν, εκτελούσε, εκτελούσαν, εκτελείται, εκτελούνται
- εξοχικ: εξοχικό, εξοχικά, εξοχική, εξοχικής, εξοχικές
- ιστορικ: ιστορικός, ιστορικού, ιστορικό, ιστορικοί, ιστορικών, ιστορικούς, ιστορική, ιστορικής, ιστορικές, ιστορικά
- καθηγητ: καθηγητής, καθηγητή, καθηγητές, καθηγητών
- μαχητικ: μαχητικός, μαχητικού, μαχητικό, μαχητικών, μαχητική, μαχητικής, μαχητικά
## Experimental Results

| G | PP (1M) | PP (5M) | PP (35M) |
| --- | --- | --- | --- |
| 1 | 1309 | 1461 | 1503 |
| 133 (POS) | 1047 | 1143 | 1167 |
| 500 | - | - | 314 |
| 1000 | - | - | 266 |
| 2000 | - | - | 224 |
| 30000 (stem) | 383 | 299 | 215 |
| 60000 | 328 | 237 | 156 |
## Example

Interpolate class-based and word-based models. With stem-based classes, the class model decomposes as, e.g.:

$$p(\text{ατυχηματων} \mid \text{τροχαιων}) = p(\text{ατυχημ} \mid \text{τροχαι})\, p(\text{ατυχηματων} \mid \text{ατυχημ})$$

and the interpolated model is

$$p_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1}) = \lambda\, p(w_i \mid w_{i-2}, w_{i-1}) + (1 - \lambda)\, p(g_i \mid g_{i-2}, g_{i-1})\, p(w_i \mid g_i)$$
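As a sketch, the interpolation is a small wrapper once word and class models exist (interfaces are our own; `cls` maps a word to its class, e.g. its stem):

```python
def interp_word_class(p_word, p_class, p_w_given_g, cls, lam):
    """lam * word trigram + (1 - lam) * class trigram, as in the formula above."""
    def p(w2, w1, w):
        g2, g1, g = cls(w2), cls(w1), cls(w)
        return (lam * p_word(w2, w1, w)
                + (1 - lam) * p_class(g2, g1, g) * p_w_given_g(w, g))
    return p
```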
## Experimental Results

| G | PP (1M) | WER (1M) | PP (5M) | WER (5M) | PP (35M) | WER (35M) |
| --- | --- | --- | --- | --- | --- | --- |
| 133 (POS) | 325 | 27.11 | 236 | 22.00 | 156 | 18.52 |
| 500 | - | - | - | - | 151 | 18.63 |
| 1000 | - | - | - | - | 150 | 18.61 |
| 2000 | - | - | - | - | 149 | 18.65 |
| 30000 (stem) | 319 | 26.99 | 232 | 22.04 | 154 | 18.44 |
| 60000 | 328 | 26.78 | 237 | 21.91 | 156 | 18.57 |
## Hit Rate

Interpolated model:

| | hit rate % (1M) | hit rate % (5M) | hit rate % (35M) |
| --- | --- | --- | --- |
| 1-gram | 21.3 | 12.1 | 5.1 |
| 2-gram | 56 | 50.4 | 37.6 |
| 3-gram | 22.7 | 37.6 | 57.4 |

Word-based model (repeated from above for comparison):

| | hit rate % (1M) | hit rate % (5M) | hit rate % (35M) |
| --- | --- | --- | --- |
| 1-gram | 27.3 | 16.4 | 7.4 |
| 2-gram | 52.5 | 49.9 | 40 |
| 3-gram | 20.2 | 33.7 | 52.6 |
## Experimental Results

| Model | PP (1M) | WER (1M) | PP (5M) | WER (5M) | PP (35M) | WER (35M) |
| --- | --- | --- | --- | --- | --- | --- |
| ME 3-gram | 331 | 26.83 | 239 | 21.94 | 158 | 18.60 |
| ME 3-gram+stem | 320 | 26.54 | 227 | 21.66 | 143 | 18.29 |

| Model | PP (1M) | WER (1M) | PP (5M) | WER (5M) | PP (35M) | WER (35M) |
| --- | --- | --- | --- | --- | --- | --- |
| BO 3-gram | 328 | 26.78 | 237 | 21.91 | 156 | 18.57 |
| Interp. 3-gram+stem | 319 | 26.99 | 232 | 22.04 | 154 | 18.44 |

(ME: maximum entropy; BO: backing-off.)
## Where do we go from here?

Use syntactic information:

"The dog on the hill barked"

Constraints:

$$f_{h(t),w}(x, y) = \begin{cases} 1 & \text{if } h(t) \text{ is the preceding head word in } x \text{ and } y = w \\ 0 & \text{otherwise} \end{cases}$$

$$f_{h(s),h(t),w}(x, y) = \begin{cases} 1 & \text{if } h(s),\ h(t) \text{ are the preceding head words in } x \text{ and } y = w \\ 0 & \text{otherwise} \end{cases}$$

$$p(w \mid u, v, s, t) = \frac{e^{\lambda_{u,v,w}}\, e^{\lambda_{v,w}}\, e^{\lambda_{w}}\, e^{\lambda_{h(t),w}}\, e^{\lambda_{h(s),h(t),w}}}{Z_\lambda(u, v, s, t)}$$
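A sketch of evaluating this exponential model, with one weight table per feature type and a `head` accessor for the syntactic heads; this data layout is our assumption, not the slides':

```python
import math

def me_prob(w, u, v, s, t, lams, head, vocab):
    """p(w | u, v, s, t): exponentiated feature weights (trigram, bigram,
    unigram, one and two preceding head words) normalized by Z over the
    vocabulary. Missing features contribute weight 0."""
    tri, bi, uni, h1, h2 = lams
    def score(x):
        return math.exp(tri.get((u, v, x), 0.0) + bi.get((v, x), 0.0)
                        + uni.get(x, 0.0) + h1.get((head(t), x), 0.0)
                        + h2.get((head(s), head(t), x), 0.0))
    return score(w) / sum(score(x) for x in vocab)
```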