Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures
Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures
Authors: Eduardo Lleida, Richard C. Rose
Reporter: 陳燦輝
References
[1] Eduardo Lleida and Richard C. Rose, "Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures," IEEE Trans. SAP, 2000.
[2] J. K. Chan and F. K. Soong, "An N-best candidates-based discriminative training for speech recognition applications," Computer Speech and Language, 1995.
[3] W. Chou, B. H. Juang, and C. H. Lee, "Segmental GPD training of HMM based speech recognizer," ICASSP, 1992.
[4] B. H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Trans. Signal Processing, 1992.
Outline
Introduction to Utterance Verification (UV)
  Utterance Verification Paradigms
  Utterance Verification Procedures
  Confidence Measures
Likelihood Ratio-based Training
Experimental results
Summary and Conclusions
Introduction to Utterance Verification
Utterance Verification Paradigms
It is implemented as a likelihood ratio (LR) based hypothesis testing procedure for verifying individual words in a decoded string:

$$LR(Y, \lambda_c, \lambda_a) = \frac{P(Y|\lambda_c)}{P(Y|\lambda_a)} \quad (1)$$

where $Y$ is a sequence of feature vectors $\{y_1, y_2, \ldots, y_T\}$ representing a speech utterance containing both within-vocabulary and OOV words.

$H_0$ : null hypothesis, $Y$ was generated by the target (correct) model $\lambda_c$
$H_1$ : alternative hypothesis, $Y$ was generated by the alternative model $\lambda_a$

Equation (1) describes a test which accepts or rejects the hypothesis that the observation sequence $Y$ corresponds to a legitimate vocabulary word by comparing the LR to a threshold.
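To make the test concrete, here is a minimal sketch (not from the paper) of the decision rule in (1), assuming the two log likelihoods have already been computed by some HMM scorer; the names and numbers are hypothetical.

```python
def verify_word(loglik_target: float, loglik_alt: float, log_tau: float) -> bool:
    """Accept the null hypothesis (legitimate vocabulary word) iff the
    log likelihood ratio meets the threshold, as in Eq. (1).

    loglik_target: log P(Y | lambda_c), score under the target model
    loglik_alt:    log P(Y | lambda_a), score under the alternative model
    log_tau:       decision threshold, expressed in the log domain
    """
    log_lr = loglik_target - loglik_alt
    return log_lr >= log_tau

# Hypothetical scores: log LR = -420.5 - (-437.2) = 16.7, so the word
# is accepted for any log threshold up to 16.7.
print(verify_word(-420.5, -437.2, log_tau=0.0))  # True
```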
Introduction to Utterance Verification (cont)
Utterance Verification Paradigms Some problems of UV
The observation vectors Y might be associated with a hypothesized word that is embedded in a string of words.
The lack of a language model.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Two-Pass Procedure :
Fig. 1. Two-pass utterance verification where a word string and associated segmentation boundaries that are hypothesized by a maximum likelihood CSR decoder are verified in a second stage using a likelihood ratio test.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures One-Pass Procedure :
Fig. 2. One-pass utterance verification where the optimum decoded string is that which directly maximizes a likelihood ratio criterion.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Likelihood Ratio Decoder
If $\lambda_c$ and $\lambda_a$ are both HMMs, then the LR in (1) can be rewritten as

$$LR(Y, \lambda_c, \lambda_a) = \frac{P(Y, Q_c|\lambda_c)}{P(Y, Q_a|\lambda_a)} \quad (2)$$

A discrete state sequence $Q = \{Q_c, Q_a\}$ through the model space defined by $\lambda_c$ and $\lambda_a$ can be written as $Q = \{(q_1^c, q_1^a), \ldots, (q_T^c, q_T^a)\}$, where $T$ is the number of observations in the utterance.

The likelihood ratio decoding problem thus corresponds to obtaining the state sequence

$$\tilde{Q} = \arg\max_Q \frac{P(Y, Q_c|\lambda_c)}{P(Y, Q_a|\lambda_a)} \quad (3)$$
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Likelihood Ratio Decoder
Following the same inductive logic that is used in the Viterbi algorithm, we can define the quantity $\phi_t(n, m)$ as the highest probability obtained for a single state sequence:

$$\phi_t(n, m) = \max_{1 \le i \le N_c,\ 1 \le j \le N_a} \phi_{t-1}(i, j)\, \frac{a_{in}^c\, b_n^c(y_t)}{a_{jm}^a\, b_m^a(y_t)}, \quad 1 \le n \le N_c;\ 1 \le m \le N_a \quad (4)$$

where $N_c$ and $N_a$ are the number of states in the target and alternate hypothesis models, respectively.
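A sketch of the recursion in (4) over pairs of target and alternative states, assuming both models start in their first state and that log transition and emission scores are supplied as arrays; this illustrates the joint search, not the authors' implementation.

```python
import numpy as np

def lr_viterbi_joint(log_a_c, log_b_c, log_a_a, log_b_a):
    """Best log likelihood *ratio* over joint state paths, Eq. (4).

    log_a_c: (Nc, Nc) log transitions of the target HMM
    log_b_c: (T, Nc)  log emissions of the target HMM per frame
    log_a_a: (Na, Na) log transitions of the alternative HMM
    log_b_a: (T, Na)  log emissions of the alternative HMM per frame
    """
    T, Nc = log_b_c.shape
    Na = log_b_a.shape[1]
    phi = np.full((Nc, Na), -np.inf)
    phi[0, 0] = log_b_c[0, 0] - log_b_a[0, 0]  # assumed common start state 0
    for t in range(1, T):
        new_phi = np.empty((Nc, Na))
        for n in range(Nc):
            for m in range(Na):
                # max over all predecessor pairs (i, j)
                cand = phi + log_a_c[:, n][:, None] - log_a_a[:, m][None, :]
                new_phi[n, m] = cand.max() + log_b_c[t, n] - log_b_a[t, m]
        phi = new_phi
    return float(phi.max())

# Toy example with random, row-normalized scores (hypothetical numbers).
rng = np.random.default_rng(0)
def log_norm(x):
    return np.log(x / x.sum(axis=-1, keepdims=True))
T, Nc, Na = 6, 3, 2
print(lr_viterbi_joint(log_norm(rng.random((Nc, Nc))),
                       log_norm(rng.random((T, Nc))),
                       log_norm(rng.random((Na, Na))),
                       log_norm(rng.random((T, Na)))))
```

Even in this toy form the search runs over all $N_c \times N_a$ joint states per frame, which is exactly the computational burden the next slides address.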
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Likelihood Ratio Decoder
There are two issues that must be addressed if the LR decoding is to be applicable to actual speech recognition tasks.
1. computational complexity;
2. the definition of the alternate hypothesis model.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Computational complexity
Fig. 3. A possible three-dimensional HMM search space.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Computational complexity Unit level constraint : the target model and alternate model must occupy their unit initial states and unit final states at the same time instants:

$$Q_{c,a} = \{(q_1^c, q_1^a), \ldots, (q_T^c, q_T^a)\} = \{Q(u_1), Q(u_2), \ldots, Q(u_M)\}$$

where $Q(u_j)$ corresponds to the state sequence for unit $u_j$.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Computational complexity State level constraint :
Suppose that $\lambda_c$ and $\lambda_a$ are HMMs with identical topologies, and the state constraint is defined so that $q_t^c = q_t^a$ for each $t = 1, \ldots, T$. In this way, the single optimum state sequence is decoded by applying the state constraint to (3):

$$\tilde{Q} = \arg\max_Q \frac{P(Y, Q|\lambda_c)}{P(Y, Q|\lambda_a)} \quad (5)$$

As a result, the process of identifying the optimum state sequence can take the form of a modified Viterbi algorithm, where the recursion at the frame level is defined as

$$\phi_t(j) = \max_{1 \le i \le N} \phi_{t-1}(i)\, \frac{a_{ij}^c\, b_j^c(y_t)}{a_{ij}^a\, b_j^a(y_t)}, \quad t = 2, \ldots, T;\ 1 \le j \le N \quad (6)$$
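Under the shared-topology constraint, (6) collapses to a single ordinary Viterbi pass over ratio-adjusted scores. A minimal sketch with the same array conventions as the previous sketch, assuming a uniform initial state distribution:

```python
import numpy as np

def lr_viterbi_constrained(log_a_c, log_a_a, log_b_c, log_b_a):
    """One-pass LR decoding under the constraint q_t^c = q_t^a, Eq. (6).

    Both HMMs share one topology with N states; log_a_* are (N, N) log
    transitions and log_b_* are (T, N) log emissions. Returns the best
    final log likelihood ratio and the backtracked state path.
    """
    T, N = log_b_c.shape
    phi = log_b_c[0] - log_b_a[0]               # uniform start (an assumption)
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = phi[:, None] + log_a_c - log_a_a  # candidates over (i, j)
        back[t] = cand.argmax(axis=0)
        phi = cand.max(axis=0) + log_b_c[t] - log_b_a[t]
    j = int(phi.argmax())
    path = [j]
    for t in range(T - 1, 0, -1):
        j = int(back[t, j])
        path.append(j)
    return float(phi.max()), path[::-1]
```

Compared with the joint recursion above, the per-frame work drops from $O(N_c^2 N_a^2)$ to $O(N^2)$, which is what makes the one-pass decoder practical.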
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Definition of alternative Models
The alternative hypothesis model has two roles in UV
1. to reduce the effect of sources of variability.
2. to represent more specifically the incorrectly decoded hypotheses that are frequently confused with a given lexical item.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Definition of alternative Models
The alternate model must somehow "cover" the entire space of out-of-vocabulary lexical units.
If OOV utterances that are easily confused with vocabulary words are to be detected, the alternate model must provide a more detailed representation of the utterances that are likely to be decoded as false alarms for individual vocabulary words.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Definition of alternative Models
One way to satisfy the conditions outlined above for the alternate model is to define the alternate hypothesis probability as a linear combination of two different models:

$$P(Y|\lambda_a) = w_{bg} P(Y|\lambda_a^{bg}) + w_{im} P(Y|\lambda_a^{im}) \quad (7)$$

where $w_{bg}$ and $w_{im} = 1 - w_{bg}$ are linear weights. The purpose of $\lambda_a^{bg}$, referred to here as the background alternative model, is to provide a broad representation of the feature space. The purpose of $\lambda_a^{im}$, referred to here as an imposter alternative model, is to provide a detailed representation of errors associated with a particular word.
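A small helper illustrating (7) in the log domain; the default weight of 0.5 is a placeholder, not a value from the paper.

```python
import numpy as np

def alt_loglik(log_p_bg: float, log_p_im: float, w_bg: float = 0.5) -> float:
    """log P(Y|lambda_a) for the two-model combination of Eq. (7),
    computed with logaddexp for numerical stability."""
    w_im = 1.0 - w_bg
    return float(np.logaddexp(np.log(w_bg) + log_p_bg,
                              np.log(w_im) + log_p_im))
```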
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Confidence measures
It was suggested that modeling errors may result in extreme values in local likelihood ratios, which may cause undue influence at the word or phrase level. In order to minimize these effects, we investigated several word level likelihood ratio based confidence measures that can be computed using a non-uniform weighting of sub-word level confidence measures.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Confidence measures
The unweighted unit level likelihood score for a sub-word unit $u$ decoded over a segment of observation vectors $Y_u$ can be obtained as

$$LR(u) = \frac{1}{N_u} \log LR(Y_u, \lambda_c, \lambda_a) = \frac{1}{N_u} \log \frac{P(Y_u|\lambda_u^c)}{P(Y_u|\lambda_u^a)} \quad (8)$$

where $N_u$ is the number of frames in the decoded segment.

To limit the dynamic range of the sub-word confidence measure, a sigmoid function is used:

$$U(u) = \frac{1}{1 + \exp(-\alpha (LR(u) - \tau))} \quad (9)$$

where $\alpha$ defines the slope of the function and $\tau$ is a shift.
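A direct transcription of (8) and (9); the slope alpha and shift tau are tuning parameters whose values the slides do not give, so the defaults below are placeholders.

```python
import math

def unit_score(loglik_target: float, loglik_alt: float, n_frames: int) -> float:
    """Frame-normalized unit-level log likelihood ratio LR(u), Eq. (8)."""
    return (loglik_target - loglik_alt) / n_frames

def sigmoid_confidence(lr_u: float, alpha: float = 1.0, tau: float = 0.0) -> float:
    """Range-limited sub-word confidence U(u), Eq. (9)."""
    return 1.0 / (1.0 + math.exp(-alpha * (lr_u - tau)))
```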
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Confidence measures
Word level confidence measures corresponding to the arithmetic and geometric means of both the unweighted unit level likelihood ratio scores, $LR(\cdot)$, and the sigmoid weighted unit level likelihood ratio scores, $U(\cdot)$, are compared.

The following measures are defined for a word $w$ composed of sub-word units $u_n$, $n = 1, \ldots, N$:

$$W_1(w) = \frac{1}{N} \sum_{n=1}^N LR(u_n) \quad (10)$$

$$W_2(w) = \exp\left(\frac{1}{N} \sum_{n=1}^N \log LR(u_n)\right) \quad (11)$$

$$W_3(w) = \frac{1}{N} \sum_{n=1}^N U(u_n) \quad (12)$$

$$W_4(w) = \exp\left(\frac{1}{N} \sum_{n=1}^N \log U(u_n)\right) \quad (13)$$

where $W_1$ and $W_2$ are the arithmetic and geometric means of the unweighted sub-word level confidence scores, and $W_3$ and $W_4$ are the arithmetic and geometric means of the sigmoid weighted sub-word level confidence scores.
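The four word-level measures (10)-(13) as one function. The geometric means require positive scores: the sigmoid confidences always satisfy this, while for the raw LR(u_n) scores it is an assumption of this sketch.

```python
import math

def word_confidences(unit_lrs, alpha=1.0, tau=0.0):
    """W1..W4 of Eqs. (10)-(13) for a word built from sub-word scores.

    unit_lrs: LR(u_n) values from Eq. (8), assumed positive so that the
    geometric mean W2 is defined.
    """
    N = len(unit_lrs)
    U = [1.0 / (1.0 + math.exp(-alpha * (lr - tau))) for lr in unit_lrs]
    W1 = sum(unit_lrs) / N                                     # Eq. (10)
    W2 = math.exp(sum(math.log(lr) for lr in unit_lrs) / N)    # Eq. (11)
    W3 = sum(U) / N                                            # Eq. (12)
    W4 = math.exp(sum(math.log(u) for u in U) / N)             # Eq. (13)
    return W1, W2, W3, W4
```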
Likelihood Ratio-based Training
The goal of the training procedure is to increase the average value of $LR(Y, \lambda_c, \lambda_a)$ for correct hypotheses and decrease the average value of $LR(Y, \lambda_c, \lambda_a)$ for false alarms.
LR based training is a discriminative training algorithm based on a cost function which approximates a log likelihood ratio.
Likelihood Ratio-based Training (cont)
A distance measure underlies the cost function.

A frame based distance, defined for each state transition in the sequence obtained by using the single pass decoding strategy under the assumption used in (6), is given by

$$r_t(y_t) = \log\!\big(a_{ij}^c\, b_j^c(y_t)\big) - \log\!\big(a_{ij}^a\, b_j^a(y_t)\big), \quad i = q_{t-1},\ j = q_t \quad (14)$$

The segment based distance is obtained by averaging the frame based distances:

$$R_u(Y_u) = \frac{1}{t_f^u - t_i^u + 1} \sum_{t = t_i^u}^{t_f^u} r_t(y_t) \quad (15)$$

where $t_i^u$ and $t_f^u$ are the initial and final frames of the speech segment decoded as unit $u$ over the segment $Y_u = \{y_{t_i^u}, \ldots, y_{t_f^u}\}$.
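A sketch of (14)-(15), assuming the per-frame log scores along the decoded state path have already been collected into two parallel lists.

```python
def segment_distance(frame_log_c, frame_log_a):
    """Average frame-based distance over one decoded segment.

    frame_log_c[t] = log(a_ij^c * b_j^c(y_t)) along the decoded path and
    frame_log_a[t] = log(a_ij^a * b_j^a(y_t)), for frames t_i..t_f of the
    segment decoded as unit u.
    """
    r = [c - a for c, a in zip(frame_log_c, frame_log_a)]  # Eq. (14)
    return sum(r) / len(r)                                 # Eq. (15)
```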
Likelihood Ratio-based Training (cont)
Define the cost function $F(Y_u, \lambda_u)$ for unit $u$ using a sigmoid function:

$$F(Y_u, \lambda_u) = \frac{1}{1 + \exp\!\big(\gamma\, \kappa(u)\, R_u(Y_u)\big)} \quad (16)$$

where $\gamma$ controls the slope and the indicator function $\kappa(u)$ is defined as

$$\kappa(u) = \begin{cases} +1, & \text{correct} \\ -1, & \text{imposter} \end{cases}$$

A gradient update is performed on the expected cost $E\{F(Y_u, \lambda_u)\}$:

$$\lambda_u^{n+1} = \lambda_u^n - \epsilon_n\, \nabla E\{F(Y_u, \lambda_u)\}\big|_{\lambda_u = \lambda_u^n} \quad (17)$$

where $\lambda_u^n = \{\lambda_u^c, \lambda_u^a\}$ is the model computed at the $n$th update of the gradient descent procedure, and $\epsilon_n$ is the learning rate constant.

$E\{F(Y_u, \lambda_u)\}$ is approximated as the average cost computed over all occurrences of the unit $u$ in the training data:

$$E\{F(Y_u, \lambda_u)\} \approx \frac{1}{N_u} \sum_{i=1}^{N_u} F(Y_u^i, \lambda_u) \quad (18)$$

where $N_u$ is the number of occurrences of the unit $u$ in the training set.
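A sketch of the cost (16) and its empirical average (18), using the sign convention reconstructed above (a correct unit with a large positive R_u gets a cost near zero); the slope gamma is a placeholder.

```python
import math

def unit_cost(R_u: float, kappa_u: int, gamma: float = 1.0) -> float:
    """Sigmoid cost of Eq. (16); kappa_u is +1 for a correct unit and
    -1 for an imposter."""
    return 1.0 / (1.0 + math.exp(gamma * kappa_u * R_u))

def expected_cost(segments) -> float:
    """Eq. (18): average cost over all training occurrences of one unit.
    segments: iterable of (R_u, kappa_u) pairs."""
    segments = list(segments)
    return sum(unit_cost(R, k) for R, k in segments) / len(segments)

# Eq. (17) is then one step of gradient descent on this average, with the
# parameter gradients supplied by Eqs. (19)-(27).
```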
Likelihood Ratio-based Training (cont)
Imposters with scores greater than zero and targets with scores lower than zero tend to increase the average cost function. Therefore, if we minimize this function we can reduce the misclassification between targets and imposters.
Likelihood Ratio-based Training (cont)
Given a segment of observations $Y_u = \{y_{t_i^u}, \ldots, y_{t_f^u}\}$, the gradient of the cost function with respect to an HMM parameter $\theta_i^k$ follows from the chain rule through (16) and (15): the sigmoid contributes a factor $F(Y_u, \lambda_u)(1 - F(Y_u, \lambda_u))$, and the segment distance contributes the per-frame terms $\partial \log b_{q_t}^k(y_t)/\partial \theta_i^k$, giving

$$\frac{\partial F(Y_u, \lambda_u)}{\partial \theta_i^k} = \frac{\gamma}{T_u}\, \kappa(u, k)\, F(Y_u, \lambda_u)\big(1 - F(Y_u, \lambda_u)\big) \sum_{t=1}^{T_u} \frac{\partial \log b_{q_t}^k(y_t)}{\partial \theta_i^k} \quad (19)$$

where the index $k$ refers to the target ($c$) or alternative ($a$) HMM models, $T_u$ is the number of frames in the segment, and the indicator $\kappa(u, k)$ is defined on the next slide.
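Combining (16) and (19) for a single parameter gives the following sketch; the kappa(u, k) signs follow the reconstruction used here and on the next slide, so treat them as an assumption rather than the paper's exact convention.

```python
import math

def cost_gradient(R_u: float, correct: bool, k: str, dlogb_frames,
                  gamma: float = 1.0) -> float:
    """Gradient of the cost F w.r.t. one parameter of model k ('c' or 'a'),
    Eq. (19), given per-frame derivatives d log b_{q_t}^k(y_t) / d theta."""
    kappa_u = 1 if correct else -1
    F = 1.0 / (1.0 + math.exp(gamma * kappa_u * R_u))  # Eq. (16)
    kappa_uk = -kappa_u if k == "c" else kappa_u       # see next slide
    T_u = len(dlogb_frames)
    return (gamma / T_u) * kappa_uk * F * (1.0 - F) * sum(dlogb_frames)
```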
Likelihood Ratio-based Training (cont)
The indicator function $\kappa(u, k)$ dictates the direction of the gradient, depending on whether the decoded observations were correctly or incorrectly decoded and whether the re-estimated parameter is associated with the target or the alternate hypothesis model (signs as reconstructed here, chosen so that gradient descent raises the target model's likelihood on correct segments and lowers it on imposters):

$$\kappa(u, k) = \begin{cases} -1, & Y \text{ correct},\ k = c \\ +1, & Y \text{ correct},\ k = a \\ +1, & Y \text{ imposter},\ k = c \\ -1, & Y \text{ imposter},\ k = a \end{cases}$$

The observation densities $b_{q_t}^k(\cdot)$ in (19) represent the complete set of parameters to be re-estimated in the gradient update procedure, where

$$b_{q_t}^k(y_t) = \sum_{m=1}^M c_{q_t m}^k\, N\!\big(y_t; \mu_{q_t m}^k, \Sigma_{q_t m}^k\big) \quad (20)$$

is a mixture of $M$ Gaussian densities, the $m$th Gaussian mixture characterized by a mean vector $\mu_{q_t m}^k$, a diagonal covariance matrix $\Sigma_{q_t m}^k$, and a mixture weight $c_{q_t m}^k$.
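The same indicator as a function; the signs are the ones derived above, an assumption chosen so that plain gradient descent reinforces the target model on correctly decoded segments and pushes it away from imposters (and the reverse for the alternative model).

```python
def kappa(correct: bool, k: str) -> int:
    """kappa(u, k) for parameter set k: 'c' = target, 'a' = alternative."""
    if k == "c":
        return -1 if correct else 1
    return 1 if correct else -1
```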
Likelihood Ratio-based Training (cont)
The HMM parameters have to satisfy certain constraints, such as the positive definiteness of the variance $\sigma^2 > 0$ and the stochastic constraint $\sum_{i=1}^M c_{q_t i}^k = 1$. The gradient is therefore taken with respect to the transformed parameters $\tilde{c}$ and $\tilde{\sigma}$ (the mean $\mu$ is unconstrained), where the transformations are

$$c_{q_t m}^k = \frac{\exp(\tilde{c}_{q_t m}^k)}{\sum_{i=1}^M \exp(\tilde{c}_{q_t i}^k)} \quad (21)$$

$$\tilde{\sigma}_{q_t m i}^k = \log \sigma_{q_t m i}^k \quad (22)$$

Hence

$$\frac{\partial \log b_{q_t}^k(y_t)}{\partial \tilde{c}_{q_t m}^k} = \gamma_{q_t m}^k(t) - c_{q_t m}^k \quad (23)$$

where

$$\gamma_{q_t m}^k(t) = \frac{c_{q_t m}^k\, N\!\big(y_t; \mu_{q_t m}^k, \sigma_{q_t m}^k\big)}{b_{q_t}^k(y_t)}$$
Likelihood Ratio-based Training (cont)
For diagonal covariance Gaussians,

$$N\!\big(y_t|\mu_{q_t m}^k, \sigma_{q_t m}^k\big) = \prod_{i=1}^D \frac{1}{\sqrt{2\pi}\, \sigma_{q_t m i}^k} \exp\!\left(-\frac{(y_{t i} - \mu_{q_t m i}^k)^2}{2 (\sigma_{q_t m i}^k)^2}\right)$$

so the gradient with respect to the mean is

$$\frac{\partial \log b_{q_t}^k(y_t)}{\partial \mu_{q_t m i}^k} = \gamma_{q_t m}^k(t)\, \frac{y_{t i} - \mu_{q_t m i}^k}{(\sigma_{q_t m i}^k)^2} \quad (24)$$
Likelihood Ratio-based Training (cont)
Similarly, the gradient with respect to the transformed log standard deviation $\tilde{\sigma}_{q_t m i}^k = \log \sigma_{q_t m i}^k$ is

$$\frac{\partial \log b_{q_t}^k(y_t)}{\partial \tilde{\sigma}_{q_t m i}^k} = \gamma_{q_t m}^k(t) \left[\frac{(y_{t i} - \mu_{q_t m i}^k)^2}{(\sigma_{q_t m i}^k)^2} - 1\right] \quad (25)$$
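The three gradients (23)-(25) all share the mixture posterior gamma, so they can be computed together. A numpy sketch for one state's diagonal-covariance mixture; the shapes and names are illustrative, not the authors' code.

```python
import numpy as np

def mixture_gradients(y, mu, var, c):
    """Gradients of log b(y) for one state's Gaussian mixture.

    y: (D,) observation; mu: (M, D) means; var: (M, D) diagonal variances;
    c: (M,) mixture weights. Returns gradients w.r.t. the transformed
    weights c_tilde, the means mu, and sigma_tilde = log(sigma).
    """
    diff = y[None, :] - mu                                     # (M, D)
    log_comp = (np.log(c)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * np.sum(diff ** 2 / var, axis=1))
    log_b = np.logaddexp.reduce(log_comp)
    gamma = np.exp(log_comp - log_b)      # posterior of each component
    d_c_tilde = gamma - c                                      # Eq. (23)
    d_mu = gamma[:, None] * diff / var                         # Eq. (24)
    d_sigma_tilde = gamma[:, None] * (diff ** 2 / var - 1.0)   # Eq. (25)
    return d_c_tilde, d_mu, d_sigma_tilde
```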
Likelihood Ratio-based Training (cont)
When we consider $b_{q_t}^a(y_t) = w_{bg}\, b_{q_t}^{bg}(y_t) + w_{im}\, b_{q_t}^{im}(y_t)$: without loss of generality, both of the weights $w_{bg}$ and $w_{im}$ could be state dependent parameters. The weights must be transformed to satisfy the stochastic constraint $w_{bg} + w_{im} = 1$. The transformations are

$$w_{bg} = \frac{\exp(\tilde{w}_{bg})}{\exp(\tilde{w}_{bg}) + \exp(\tilde{w}_{im})}, \qquad w_{im} = \frac{\exp(\tilde{w}_{im})}{\exp(\tilde{w}_{bg}) + \exp(\tilde{w}_{im})}$$

The gradient of $\log b_{q_t}^a(y_t)$ can then be obtained:

$$\frac{\partial \log b_{q_t}^a(y_t)}{\partial \tilde{w}_{bg}} = w_{bg} \left(\frac{b_{q_t}^{bg}(y_t)}{b_{q_t}^a(y_t)} - 1\right) \quad (26)$$

$$\frac{\partial \log b_{q_t}^a(y_t)}{\partial \tilde{w}_{im}} = w_{im} \left(\frac{b_{q_t}^{im}(y_t)}{b_{q_t}^a(y_t)} - 1\right) \quad (27)$$
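A direct transcription of (26)-(27), assuming the two component likelihoods are available as plain (not log) probabilities.

```python
def weight_gradients(b_bg: float, b_im: float, w_bg: float):
    """Gradients of log b_a(y_t) w.r.t. the transformed softmax weights,
    Eqs. (26)-(27), for b_a = w_bg * b_bg + w_im * b_im, w_im = 1 - w_bg."""
    w_im = 1.0 - w_bg
    b_a = w_bg * b_bg + w_im * b_im
    d_wbg = w_bg * (b_bg / b_a - 1.0)   # Eq. (26)
    d_wim = w_im * (b_im / b_a - 1.0)   # Eq. (27)
    return d_wbg, d_wim
```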
Likelihood Ratio-based Training (cont)
The complete likelihood ratio based training procedure:
Train initial ML HMMs, $\lambda_c$ and $\lambda_a$, for each unit. Then, for each iteration over the training database:
Obtain the hypothesized sub-word unit string and segmentation using the LR decoder.
Align the decoded sub-word units as correct or false alarm, to obtain the indicator function $\kappa(u, k)$.
Update the gradient of the expected cost, $E\{F(Y_u, \lambda_u)\}$.
Update the model parameters as in (17).
Experimental results
Speech corpora : movie locator task
In a trial of the system over the public switched telephone network, the service was configured to accept approximately 105 theater names, 135 city names, and between 75 and 100 current movie titles.
A corpus of 4777 spontaneous spoken utterances from the trial was used in our evaluation.
Experimental results (cont)
A total of 3025 sentences were used for training acoustic models and 1752 utterances were used for testing.
The sub-word models used in the recognizer consisted of 43 context independent units.
Recognition was performed using a finite state grammar built from the specification of the service, with a lexicon of 570 different words.
Experimental results (cont)
The total number of words in the test set was 4864, where 134 of them were OOV.
Recognition performance of 94% word accuracy was obtained on the "in-grammar" utterances.
The feature set used for recognition included 12 mel-cepstrum, 12 delta mel-cepstrum, 12 delta-delta mel-cepstrum, energy, delta energy, and delta-delta energy coefficients; cepstral mean normalization was applied.
Experimental results (cont)
A single "background" HMM alternate model, $\lambda_a^{bg}$, containing three states with 32 mixtures per state was used.
A separate "imposter" alternative HMM model, $\lambda_a^{im}$, was trained for each sub-word unit. These models contained three states with eight mixtures per state.
Experimental results (cont)
Performance is described both in terms of receiver operating characteristic (ROC) curves and curves displaying the type I + type II error plotted against the decision threshold setting.
Type I error (false rejection): words that are correctly decoded by the recognizer but rejected by the utterance verification process.
Type II error (false alarm): incorrectly decoded word insertions and substitutions which are generated by the recognizer and also accepted by the utterance verification system.
Detection rate: 1 − (number of false rejections / number of all correctly decoded words).
False accept rate: number of false alarms / number of all incorrectly decoded words.
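The same bookkeeping as code; treating "type I + type II" as the sum of the two error rates is an assumption about how the plots are normalized.

```python
def uv_error_rates(n_correct, n_false_reject, n_incorrect, n_false_alarm):
    """Operating-point statistics for the ROC and error-sum plots.

    n_correct:      words correctly decoded by the recognizer
    n_false_reject: correct words rejected by UV (type I errors)
    n_incorrect:    incorrectly decoded words (insertions + substitutions)
    n_false_alarm:  incorrect words accepted by UV (type II errors)
    """
    detection_rate = 1.0 - n_false_reject / n_correct
    false_accept_rate = n_false_alarm / n_incorrect
    type1_plus_type2 = (n_false_reject / n_correct
                        + n_false_alarm / n_incorrect)
    return detection_rate, false_accept_rate, type1_plus_type2
```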
Experimental results (cont)
Experiment 1 : Comparison of UV Measures
Fig. 4. ROC curves comparing performance of confidence measures using $W_1(w)$ (dashed line) and $W_2(w)$ (solid line) (left figure), and using $W_3(w)$ (dashed line) and $W_4(w)$ (solid line) (right figure).
Experimental results (cont)
Experiment 1 : Comparison of UV Measures
Fig. 5. Type I + type II error comparing performance of confidence measures using $W_3(w)$ (dashed line) and $W_4(w)$ (solid line).
It appears from the error plot in Fig. 5 that $W_4$ is less sensitive to the setting of the confidence threshold.
In the remaining simulations, $W_4$ will be used.
Experimental results (cont)
Experiment 2 :Investigation of LR Training and UV strategies
TABLE I. Utterance verification performance: type I + type II minimum error rate for the one-pass (OP) and the two-pass (TP) utterance verification procedures. bN denotes the number of mixtures for the background model and iN the number of mixtures for the imposter model.
Fig. 6. Likelihood ratio training: ROC curves for initial models (dash-dot line), one iteration (dashed line), and two iterations (solid line). The *-points are the minimum type I + type II error.
Experimental results (cont)
Experiment 2 :Investigation of LR Training and UV strategies
Fig. 7. One-pass versus two-pass UV comparison with the b32.i8 configuration and two iterations of the likelihood ratio training.
Experimental results (cont)
Experiment 3 : whether the LR training procedure actually improved speech recognition performance.
TABLE II. Speech recognition performance given in terms of word accuracy without using utterance verification, and utterance verification performance given as the sum of type I and type II errors.
Experimental results (cont)
Experiment 4 : performance measured over in-grammar and out-of-grammar utterances, respectively.
Fig. 8. In-grammar and out-of-grammar sentences. Initial models: dash-dot line; one iteration: dashed line; two iterations: solid line.
Summary and Conclusions
The one-pass decoding procedure improved UV performance over the two-pass approach.
Likelihood ratio training and decoding have also been successfully applied to other tasks, including speaker dependent voice label recognition.
Further research should involve the investigation of decoding and training paradigms for UV that incorporate additional, non-acoustic sources of knowledge.