Large Vocabulary Continuous Speech Recognition: Subword Speech Units
Large Vocabulary Continuous Speech Recognition

Subword Speech Units

Given acoustic observations $Y$, the recognizer chooses the word sequence $\hat{W}$ with the maximum a posteriori probability:

$$\hat{P}(\hat{W} \mid Y) = \max_{W} P(W \mid Y).$$

By Bayes' rule,

$$P(W \mid Y) = \frac{P(Y \mid W)\,P(W)}{P(Y)},$$

and since $P(Y)$ does not depend on $W$,

$$\hat{W} = \arg\max_{W} P(Y \mid W)\,P(W).$$
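As a concrete illustration (not part of the original lecture), here is a minimal Python sketch of this decision rule, with toy probability tables standing in for a real acoustic model $P(Y|W)$ and language model $P(W)$:

```python
# Pick the word sequence W maximizing log P(Y|W) + log P(W);
# P(Y) is constant with respect to W, so it can be dropped.
import math

acoustic_prob = {("show", "all", "ships"): 0.02,   # toy values for P(Y|W)
                 ("show", "all", "chips"): 0.03}
lm_prob = {("show", "all", "ships"): 0.10,         # toy values for P(W)
           ("show", "all", "chips"): 0.01}

def decode(candidates):
    # argmax over W of log P(Y|W) + log P(W)
    return max(candidates,
               key=lambda W: math.log(acoustic_prob[W]) + math.log(lm_prob[W]))

print(decode(list(acoustic_prob)))  # -> ('show', 'all', 'ships')
```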
HMM-Based Subword Speech Units
A sentence is a sequence of words,

$$S_W : W_1\, W_2\, W_3 \cdots W_I,$$

and each word $W_i$ is represented as a concatenation of subword units, where $L(W_i)$ is the number of units in word $W_i$:

$$S_U : U_1(W_1)\, U_2(W_1) \cdots U_{L(W_1)}(W_1)\;\; U_1(W_2)\, U_2(W_2) \cdots U_{L(W_2)}(W_2)\; \cdots\; U_1(W_I) \cdots U_{L(W_I)}(W_I).$$
Training of Subword Units

Training Procedure
Errors and performance evaluation in PLU recognition

Substitution errors (s), deletion errors (d), and insertion errors (i).

Performance evaluation: if the total number of PLUs is N, we define:
Correctness rate: (N − s − d) / N
Accuracy rate: (N − s − d − i) / N
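A small sketch of these two measures (the counts s, d, i would come from aligning the recognized PLU string against the reference, e.g. by dynamic programming; the numbers below are made up):

```python
# Correctness ignores insertions; accuracy penalizes them.
def plu_scores(N, s, d, i):
    correctness = (N - s - d) / N        # insertions not penalized
    accuracy = (N - s - d - i) / N       # insertions penalized
    return correctness, accuracy

print(plu_scores(N=1000, s=80, d=30, i=40))  # -> (0.89, 0.85)
```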
Language Models for LVCSR

Word Pair Model: specify which word pairs are valid.

For a word sequence $W = w_1 w_2 \cdots w_Q$, the language model probability factors by the chain rule:

$$P(W) = P(w_1 w_2 \cdots w_Q) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_Q \mid w_1 w_2 \cdots w_{Q-1}).$$

The word pair model constrains the conditional probabilities to

$$P(w_j \mid w_k) = \begin{cases} 1 & \text{if } (w_k, w_j) \text{ is a valid word pair} \\ 0 & \text{otherwise.} \end{cases}$$
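A minimal sketch of this constraint, assuming a hypothetical set of valid pairs:

```python
# P(w_j | w_k) is 1 when (w_k, w_j) is a valid pair and 0 otherwise,
# so a sentence is allowed only if every adjacent pair is valid.
valid_pairs = {("show", "all"), ("all", "ships")}   # hypothetical pair list

def word_pair_prob(w_k, w_j):
    return 1.0 if (w_k, w_j) in valid_pairs else 0.0

def sentence_allowed(words):
    return all(word_pair_prob(a, b) > 0 for a, b in zip(words, words[1:]))

print(sentence_allowed(["show", "all", "ships"]))  # True
print(sentence_allowed(["ships", "show", "all"]))  # False
```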
Statistical Language Modeling
N-gram probabilities are estimated from relative frequencies. If $F(\cdot)$ denotes the count of a word sequence in the training corpus:

$$\hat{P}(w_2 \mid w_1) = \frac{F(w_1, w_2)}{F(w_1)}, \qquad \hat{P}(w_3 \mid w_1, w_2) = \frac{F(w_1, w_2, w_3)}{F(w_1, w_2)}.$$

In general, for an N-gram model:

$$\hat{P}(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) = \frac{F(w_{i-N+1}, \ldots, w_i)}{F(w_{i-N+1}, \ldots, w_{i-1})},$$

so that

$$P(W) = P(w_1 w_2 \cdots w_Q) = \prod_{i} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}).$$
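A toy sketch of these relative-frequency estimates, using a made-up corpus:

```python
# Bigram estimates P^(w2|w1) = F(w1,w2) / F(w1) from raw counts.
from collections import Counter

corpus = "show all ships show all alerts show all ships".split()

F1 = Counter(zip(corpus))               # unigram counts F(w), as 1-tuples
F2 = Counter(zip(corpus, corpus[1:]))   # bigram counts F(w1, w2)

def bigram_prob(w1, w2):
    return F2[(w1, w2)] / F1[(w1,)]

print(bigram_prob("show", "all"))    # 3/3 = 1.0
print(bigram_prob("all", "ships"))   # 2/3 ≈ 0.667
```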
Perplexity of the Language Model

Entropy of the source:

$$H = -\lim_{Q \to \infty} \frac{1}{Q} \sum_{w_1, w_2, \ldots, w_Q} P(w_1, w_2, \ldots, w_Q)\, \log P(w_1, w_2, \ldots, w_Q).$$

First-order entropy of the source:

$$H = -\sum_{w \in V} P(w)\, \log P(w),$$

which equals the source entropy when successive words are independent, i.e. when $P(w_1 w_2 \cdots w_Q) = P(w_1)\, P(w_2) \cdots P(w_Q)$.

If the source is ergodic, meaning its statistical properties can be completely characterized in a sufficiently long sequence that the source puts out, then

$$H = -\lim_{Q \to \infty} \frac{1}{Q} \log P(w_1, w_2, \ldots, w_Q).$$

We often compute H based on a finite but sufficiently large Q:

$$\hat{H} = -\frac{1}{Q} \log P(w_1, w_2, \ldots, w_Q).$$

H is the average degree of difficulty that the recognizer encounters when it has to determine a word from the same source.

If an N-gram language model $P_N(W)$ is used, an estimate of H is:

$$\hat{H} = -\frac{1}{Q} \log P_N(w_1, w_2, \ldots, w_Q).$$

In general:

$$\hat{H} = -\frac{1}{Q} \sum_{i=1}^{Q} \log P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}).$$

Perplexity is defined as:

$$B = 2^{\hat{H}} = \hat{P}(w_1, w_2, \ldots, w_Q)^{-1/Q}.$$
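A small sketch computing perplexity from bigram probabilities; the probability table and the `<s>` start symbol are illustrative assumptions:

```python
# B = 2**H^, with H^ = -(1/Q) * sum_i log2 P(w_i | w_{i-1}).
import math

P = {("<s>", "show"): 0.5, ("show", "all"): 1.0, ("all", "ships"): 0.5}

def perplexity(words):
    logprob = sum(math.log2(P[pair]) for pair in zip(["<s>"] + words, words))
    H = -logprob / len(words)
    return 2 ** H

print(perplexity(["show", "all", "ships"]))  # 2**(2/3) ≈ 1.59
```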
Overall recognition system based on subword units

Naval Resource (Battleship) Management Task: 991-word vocabulary. NG (no grammar): perplexity = 991.
)(},{)(})({})({)(},{)(:
322.|BE|sentence,aendorbegincannot
448|EB|sentence,aendcanbutsentenceabegincannot
64|EB|sentence,aendcannotbutsentenceabegincon
117|BE|sentence,aendorbegineithercon
that
that
that
that
words
word
words
words
of
of
of
of
set
set
set
set
}{
}{
}{
}{
silenceBEEBsilenceWWsilenceBEEBsilenceS
BE
EB
EB
BE
Word pair grammarWord pair grammar
We can partition the vocabulary into four nonoverlapping sets of words:
The overall FSN allows recognition of sentences of the form:
WP (word pair) grammar:Perplexity=60
FSN based on Partitioning Scheme:995 real arcs and18 null arcs
WB (word bigram)Grammar:Perplexity =20
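A sketch of the begin/end constraint that this partition imposes; the set memberships here are hypothetical placeholders, not the actual task vocabulary:

```python
# A sentence is accepted only if its first word can begin a sentence
# and its last word can end one, per the four-set partition.
BE  = {"show"}    # can begin or end
B_e = {"list"}    # can begin but not end
b_E = {"ships"}   # can end but not begin
b_e = {"all"}     # can neither begin nor end

def accepted(words):
    return words[0] in BE | B_e and words[-1] in BE | b_E

print(accepted(["show", "all", "ships"]))  # True
print(accepted(["all", "ships", "list"]))  # False
```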
Control of word insertion/word deletion rate

In the discussed structure, there is no control on the sentence length. We introduce a word insertion penalty into the Viterbi decoding: a fixed negative quantity is added to the likelihood score at the end of each word arc.
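A minimal sketch of the penalty mechanism, with a hypothetical penalty value; in practice the value is tuned to balance insertion against deletion errors:

```python
# A fixed negative constant is added to the accumulated Viterbi log score
# each time a word arc is exited, so longer hypotheses pay a per-word cost.
WORD_INSERTION_PENALTY = -10.0   # hypothetical, tuned on held-out data

def score_sentence(word_log_likelihoods):
    # Accumulated score: acoustic score of each word arc plus one penalty
    # per word end.
    return sum(ll + WORD_INSERTION_PENALTY for ll in word_log_likelihoods)

# A 3-word hypothesis must now beat a 4-word one by more than one penalty:
print(score_sentence([-50.0, -40.0, -60.0]))         # -180.0
print(score_sentence([-50.0, -40.0, -30.0, -28.0]))  # -188.0
```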
Context-dependent subword units

Creation of context-dependent diphones and triphones. Moving from context-independent to context-dependent inventories ($ denotes a word-boundary context):

(1) Context-independent (CI) units: $p$
(2) Left-context (LC) diphone: $p_L\text{–}p$
(3) Right-context (RC) diphone: $p\text{–}p_R$
(4) Left-right-context (LRC) triphone: $p_L\text{–}p\text{–}p_R$ (context-dependent triphones)

Other inventories include multiple-phone units and word-dependent units.

Example: the word "above" (PLUs ax, b, ah, v) expands into context-dependent units as:

above  ax   $–ax–b
above  b    ax–b–ah
above  ah   b–ah–v
above  v    ah–v–$
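A small sketch expanding a word's PLU string into LRC triphones with word-boundary contexts, reproducing the "above" example:

```python
# "$" marks the word-boundary context at either end of the word.
def to_triphones(plus):
    padded = ["$"] + plus + ["$"]
    return [f"{padded[i-1]}-{padded[i]}-{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["ax", "b", "ah", "v"]))
# ['$-ax-b', 'ax-b-ah', 'b-ah-v', 'ah-v-$']
```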
If c(·) is the occurrence count for a given unit, we can use a unit reduction rule such as:

If $c(p_L\text{–}p\text{–}p_R) \geq T$, then use the triphone $p_L\text{–}p\text{–}p_R$; otherwise:
1. use the left-context diphone $p_L\text{–}p$ if $c(p_L\text{–}p) \geq T$,
2. use the right-context diphone $p\text{–}p_R$ if $c(p\text{–}p_R) \geq T$,
3. use the context-independent unit $p$ otherwise.

CD units using only intraword units for "show all ships":

sh($,ow) ow(sh,$) aw($,$) sh($,ih) ih(sh,p) p(ih,s) s(p,$)

CD units using both intraword and interword units:

sh($,ow) ow(sh,aw) aw(ow,sh) sh(aw,ih) ih(sh,p) p(ih,s) s(p,$)
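A sketch of the reduction rule, assuming a dictionary of training counts and an illustrative threshold T:

```python
# Back off from triphone to diphone to CI unit when counts fall below T.
def choose_unit(pL, p, pR, c, T=50):
    """c is a dict of occurrence counts; T is the count threshold."""
    if c.get(f"{pL}-{p}-{pR}", 0) >= T:
        return f"{pL}-{p}-{pR}"          # enough data: keep the triphone
    if c.get(f"{pL}-{p}", 0) >= T:
        return f"{pL}-{p}"               # left-context diphone
    if c.get(f"{p}-{pR}", 0) >= T:
        return f"{p}-{pR}"               # right-context diphone
    return p                             # fall back to the CI unit

counts = {"ax-b-ah": 12, "ax-b": 80}
print(choose_unit("ax", "b", "ah", counts))  # 'ax-b'
```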
Smoothing and interpolation of CD PLU models

Sparse context-dependent estimates are smoothed by interpolating a unit's model with the models of its less specific variants, with weights $\lambda_i$ that sum to 1:

$$\hat{p}(p_L\text{–}p\text{–}p_R) = \lambda_1\, p(p_L\text{–}p\text{–}p_R) + \lambda_2\, p(p_L\text{–}p) + \lambda_3\, p(p\text{–}p_R) + \lambda_4\, p(p), \qquad \sum_i \lambda_i = 1.$$
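A sketch of the interpolation step with hypothetical lambda weights (in practice the weights are estimated, e.g. by deleted interpolation):

```python
# Mix a sparse triphone estimate with its diphone and CI backoffs.
def smoothed(p_tri, p_ldi, p_rdi, p_ci, lambdas=(0.6, 0.15, 0.15, 0.1)):
    assert abs(sum(lambdas) - 1.0) < 1e-9   # weights must sum to 1
    l1, l2, l3, l4 = lambdas
    return l1 * p_tri + l2 * p_ldi + l3 * p_rdi + l4 * p_ci

print(smoothed(p_tri=0.02, p_ldi=0.05, p_rdi=0.04, p_ci=0.10))  # 0.0355
```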
Implementation issues using CD units

Word junction effects

To handle known phonological changes, a set of phonological rules is superimposed on both the training and recognition networks; typical rules rewrite phone sequences at word junctions, as illustrated in the sketch below.
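The transcript does not preserve the rule list itself; the sketch below shows the mechanism with two illustrative (hypothetical) junction rules:

```python
# Hypothetical example rules, not the lecture's actual list:
RULES = [
    (("t", "t"), ("t",)),    # geminate stop reduction across a junction
    (("d", "y"), ("jh",)),   # palatalization, e.g. "did you" -> "d ih jh uw"
]

def apply_junction_rules(left_word_phones, right_word_phones):
    junction = (left_word_phones[-1], right_word_phones[0])
    for pattern, replacement in RULES:
        if junction == pattern:
            return left_word_phones[:-1] + list(replacement) + right_word_phones[1:]
    return left_word_phones + right_word_phones

print(apply_junction_rules(["d", "ih", "d"], ["y", "uw"]))
# ['d', 'ih', 'jh', 'uw']
```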
Recognition results using CD units

Position dependent units
Unit splitting and clustering

A unit $p$ can be split into context-dependent variants (e.g., $p \to p_q$ for a context $q$), guided by a likelihood-based distance of the form

$$D(p) = \min\, \lvert L_Y - L_{Y'} \rvert,$$

where $L_Y$ and $L_{Y'}$ are log-likelihood scores of the training tokens under the original and split units.
A key source of difficulty in continuous speech recognition is the so-called function words, which include words like a, and, for, in, is. The function words have the following properties:
Creation of vocabulary-independent units
Semantic Postprocessor for Recognition