Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures
Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures
Authors: Eduardo Lleida, Richard C. Rose
Reporter: 陳燦輝
References
[1] Eduardo Lleida and Richard C. Rose, "Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures," IEEE Trans. SAP, 2000.
[2] J. K. Chan and F. K. Soong, "An N-best candidates-based discriminative training for speech recognition applications," Computer Speech and Language, 1995.
[3] W. Chou, B. H. Juang, and C. H. Lee, "Segmental GPD training of HMM based speech recognizer," ICASSP, 1992.
[4] B. H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Trans. Signal Processing, 1992.
Outline
Introduction to Utterance Verification (UV)
  Utterance Verification Paradigms
  Utterance Verification Procedures
  Confidence Measures
Likelihood Ratio-based Training
Experimental results
Summary and Conclusions
Introduction to Utterance Verification
Utterance Verification Paradigms
It is implemented as a likelihood ratio (LR) based hypothesis testing procedure for verifying individual words in a decoded string:

$$LR(Y, \lambda_c, \lambda_a) = \frac{P(Y|\lambda_c)}{P(Y|\lambda_a)} \quad (1)$$

where $Y$ is a sequence of feature vectors $\{y_1, y_2, \ldots, y_T\}$ representing a speech utterance containing both within-vocabulary and OOV words.

$H_0$ : null hypothesis, $Y$ was generated by the target (correct) model $\lambda_c$
$H_1$ : alternative hypothesis, $Y$ was generated by the alternative model $\lambda_a$

Equation (1) describes a test which accepts or rejects the hypothesis that the observation sequence $Y$ corresponds to a legitimate vocabulary word by comparing the LR to a threshold.
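To make the test concrete, here is a minimal sketch (not from the paper) of the decision rule in (1), assuming the two log likelihoods have already been computed by some HMM scorer; the names and numbers are hypothetical.

```python
def verify_word(loglik_target: float, loglik_alt: float, log_tau: float) -> bool:
    """Accept the null hypothesis (legitimate vocabulary word) iff the
    log likelihood ratio meets the threshold, as in Eq. (1).

    loglik_target: log P(Y | lambda_c), score under the target model
    loglik_alt:    log P(Y | lambda_a), score under the alternative model
    log_tau:       decision threshold, expressed in the log domain
    """
    log_lr = loglik_target - loglik_alt
    return log_lr >= log_tau

# Hypothetical scores: log LR = -420.5 - (-437.2) = 16.7, so the word
# is accepted for any log threshold up to 16.7.
print(verify_word(-420.5, -437.2, log_tau=0.0))  # True
```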
Introduction to Utterance Verification (cont)
Utterance Verification Paradigms Some problems of UV
The observation vectors Y might be associated with a hypothesized word that is embedded in a string of words.
The lack of a language model.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Two-Pass Procedure :
Fig. 1. Two-pass utterance verification where a word string and associated segmentation boundaries that are hypothesized by a maximum likelihood CSR decoder are verified in a second stage using a likelihood ratio test.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures One-Pass Procedure :
Fig. 2. One-pass utterance verification where the optimum decoded string is that which directly maximizes a likelihood ratio criterion.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Likelihood Ratio Decoder
If $\lambda_c$ and $\lambda_a$ are both HMMs, then the LR in (1) can be rewritten as

$$LR(Y, \lambda_c, \lambda_a) = \frac{P(Y, Q_c|\lambda_c)}{P(Y, Q_a|\lambda_a)} \quad (2)$$

A discrete state sequence $Q = \{Q_c, Q_a\}$ through the model space defined by $\lambda_c$ and $\lambda_a$ can be written as $Q = \{(q_1^c, q_1^a), \ldots, (q_T^c, q_T^a)\}$, where $T$ is the number of observations in the utterance.

The likelihood ratio decoding problem thus corresponds to obtaining the state sequence

$$\tilde{Q} = \arg\max_Q \frac{P(Y, Q_c|\lambda_c)}{P(Y, Q_a|\lambda_a)} \quad (3)$$
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Likelihood Ratio Decoder
Following the same inductive logic that is used in the Viterbi algorithm, we can define the quantity $\phi_t(n, m)$ as the highest probability obtained for a single state sequence:

$$\phi_t(n, m) = \max_{1 \le i \le N_c,\ 1 \le j \le N_a} \phi_{t-1}(i, j)\, \frac{a_{in}^c\, b_n^c(y_t)}{a_{jm}^a\, b_m^a(y_t)}, \quad 1 \le n \le N_c;\ 1 \le m \le N_a \quad (4)$$

where $N_c$ and $N_a$ are the number of states in the target and alternate hypothesis models, respectively.
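A sketch of the recursion in (4) over pairs of target and alternative states, assuming both models start in their first state and that log transition and emission scores are supplied as arrays; this illustrates the joint search, not the authors' implementation.

```python
import numpy as np

def lr_viterbi_joint(log_a_c, log_b_c, log_a_a, log_b_a):
    """Best log likelihood *ratio* over joint state paths, Eq. (4).

    log_a_c: (Nc, Nc) log transitions of the target HMM
    log_b_c: (T, Nc)  log emissions of the target HMM per frame
    log_a_a: (Na, Na) log transitions of the alternative HMM
    log_b_a: (T, Na)  log emissions of the alternative HMM per frame
    """
    T, Nc = log_b_c.shape
    Na = log_b_a.shape[1]
    phi = np.full((Nc, Na), -np.inf)
    phi[0, 0] = log_b_c[0, 0] - log_b_a[0, 0]  # assumed common start state 0
    for t in range(1, T):
        new_phi = np.empty((Nc, Na))
        for n in range(Nc):
            for m in range(Na):
                # max over all predecessor pairs (i, j)
                cand = phi + log_a_c[:, n][:, None] - log_a_a[:, m][None, :]
                new_phi[n, m] = cand.max() + log_b_c[t, n] - log_b_a[t, m]
        phi = new_phi
    return float(phi.max())

# Toy example with random, row-normalized scores (hypothetical numbers).
rng = np.random.default_rng(0)
def log_norm(x):
    return np.log(x / x.sum(axis=-1, keepdims=True))
T, Nc, Na = 6, 3, 2
print(lr_viterbi_joint(log_norm(rng.random((Nc, Nc))),
                       log_norm(rng.random((T, Nc))),
                       log_norm(rng.random((Na, Na))),
                       log_norm(rng.random((T, Na)))))
```

Even in this toy form the search runs over all $N_c \times N_a$ joint states per frame, which is exactly the computational burden the next slides address.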
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Likelihood Ratio Decoder
There are two issues that must be addressed if the LR decoding is to be applicable to actual speech recognition tasks.
1. computational complexity;
2. the definition of the alternate hypothesis model.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Computational complexity
Fig. 3. A possible three-dimensional HMM search space.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Computational complexity Unit level constraint : the target model and alternate model must occupy their unit initial states and unit final states at the same time instants:

$$Q_{c,a} = \{(q_1^c, q_1^a), \ldots, (q_T^c, q_T^a)\} = \{Q(u_1), Q(u_2), \ldots, Q(u_M)\}$$

where $Q(u_j)$ corresponds to the state sequence for unit $u_j$.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Computational complexity State level constraint :
Suppose that $\lambda_c$ and $\lambda_a$ are HMMs with identical topologies, and the state constraint is defined so that $q_t^c = q_t^a$ for each $t = 1, \ldots, T$. In this way, the single optimum state sequence is decoded by applying the state constraint to (3):

$$\tilde{Q} = \arg\max_Q \frac{P(Y, Q|\lambda_c)}{P(Y, Q|\lambda_a)} \quad (5)$$

As a result, the process of identifying the optimum state sequence can take the form of a modified Viterbi algorithm, where the recursion at the frame level is defined as

$$\phi_t(j) = \max_{1 \le i \le N} \phi_{t-1}(i)\, \frac{a_{ij}^c\, b_j^c(y_t)}{a_{ij}^a\, b_j^a(y_t)}, \quad t = 2, \ldots, T;\ 1 \le j \le N \quad (6)$$
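Under the shared-topology constraint, (6) collapses to a single ordinary Viterbi pass over ratio-adjusted scores. A minimal sketch with the same array conventions as the previous sketch, assuming a uniform initial state distribution:

```python
import numpy as np

def lr_viterbi_constrained(log_a_c, log_a_a, log_b_c, log_b_a):
    """One-pass LR decoding under the constraint q_t^c = q_t^a, Eq. (6).

    Both HMMs share one topology with N states; log_a_* are (N, N) log
    transitions and log_b_* are (T, N) log emissions. Returns the best
    final log likelihood ratio and the backtracked state path.
    """
    T, N = log_b_c.shape
    phi = log_b_c[0] - log_b_a[0]               # uniform start (an assumption)
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = phi[:, None] + log_a_c - log_a_a  # candidates over (i, j)
        back[t] = cand.argmax(axis=0)
        phi = cand.max(axis=0) + log_b_c[t] - log_b_a[t]
    j = int(phi.argmax())
    path = [j]
    for t in range(T - 1, 0, -1):
        j = int(back[t, j])
        path.append(j)
    return float(phi.max()), path[::-1]
```

Compared with the joint recursion above, the per-frame work drops from $O(N_c^2 N_a^2)$ to $O(N^2)$, which is what makes the one-pass decoder practical.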
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Definition of alternative Models
The alternative hypothesis model has two roles in UV
1. to reduce the effect of sources of variability.
2. to represent more specifically the incorrectly decoded hypotheses that are frequently confused with a given lexical item.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Definition of alternative Models
The alternate model must somehow "cover" the entire space of out-of-vocabulary lexical units.
If OOV utterances that are easily confused with vocabulary words are to be detected, the alternate model must provide a more detailed representation of the utterances that are likely to be decoded as false alarms for individual vocabulary words.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Definition of alternative Models
One way to satisfy the conditions outlined above for the alternate model is to define the alternate hypothesis probability as a linear combination of two different models:

$$P(Y|\lambda_a) = w_{bg} P(Y|\lambda_a^{bg}) + w_{im} P(Y|\lambda_a^{im}) \quad (7)$$

where $w_{bg}$ and $w_{im} = 1 - w_{bg}$ are linear weights. The purpose of $\lambda_a^{bg}$, referred to here as the background alternative model, is to provide a broad representation of the feature space. The purpose of $\lambda_a^{im}$, referred to here as an imposter alternative model, is to provide a detailed representation of errors associated with a particular word.
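A small helper illustrating (7) in the log domain; the default weight of 0.5 is a placeholder, not a value from the paper.

```python
import numpy as np

def alt_loglik(log_p_bg: float, log_p_im: float, w_bg: float = 0.5) -> float:
    """log P(Y|lambda_a) for the two-model combination of Eq. (7),
    computed with logaddexp for numerical stability."""
    w_im = 1.0 - w_bg
    return float(np.logaddexp(np.log(w_bg) + log_p_bg,
                              np.log(w_im) + log_p_im))
```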
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Confidence measures
It was suggested that modeling errors may result in extreme values in local likelihood ratios, which may cause undue influence at the word or phrase level. In order to minimize these effects, we investigated several word level likelihood ratio based confidence measures that can be computed using a non-uniform weighting of sub-word level confidence measures.
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Confidence measures
The unweighted unit level likelihood score for a sub-word unit $u$ decoded over a segment of observation vectors $Y_u$ can be obtained as

$$LR(u) = \frac{1}{N_u} \log LR(Y_u, \lambda_c, \lambda_a) = \frac{1}{N_u} \log \frac{P(Y_u|\lambda_u^c)}{P(Y_u|\lambda_u^a)} \quad (8)$$

where $N_u$ is the number of frames in the decoded segment.

To limit the dynamic range of the sub-word confidence measure, a sigmoid function is used:

$$U(u) = \frac{1}{1 + \exp(-\alpha (LR(u) - \tau))} \quad (9)$$

where $\alpha$ defines the slope of the function and $\tau$ is a shift.
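A direct transcription of (8) and (9); the slope alpha and shift tau are tuning parameters whose values the slides do not give, so the defaults below are placeholders.

```python
import math

def unit_score(loglik_target: float, loglik_alt: float, n_frames: int) -> float:
    """Frame-normalized unit-level log likelihood ratio LR(u), Eq. (8)."""
    return (loglik_target - loglik_alt) / n_frames

def sigmoid_confidence(lr_u: float, alpha: float = 1.0, tau: float = 0.0) -> float:
    """Range-limited sub-word confidence U(u), Eq. (9)."""
    return 1.0 / (1.0 + math.exp(-alpha * (lr_u - tau)))
```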
Introduction to Utterance Verification (cont)
Utterance Verification Procedures Confidence measures
Word level confidence measures corresponding to the arithmetic and geometric means of both the unweighted unit level likelihood ratio scores, $LR(\cdot)$, and the sigmoid weighted unit level likelihood ratio scores, $U(\cdot)$, are compared.

The following measures are defined for a word $w$ composed of sub-word units $u_n$, $n = 1, \ldots, N$:

$$W_1(w) = \frac{1}{N} \sum_{n=1}^N LR(u_n) \quad (10)$$

$$W_2(w) = \exp\left(\frac{1}{N} \sum_{n=1}^N \log LR(u_n)\right) \quad (11)$$

$$W_3(w) = \frac{1}{N} \sum_{n=1}^N U(u_n) \quad (12)$$

$$W_4(w) = \exp\left(\frac{1}{N} \sum_{n=1}^N \log U(u_n)\right) \quad (13)$$

where $W_1$ and $W_2$ are the arithmetic and geometric means of the unweighted sub-word level confidence scores, and $W_3$ and $W_4$ are the arithmetic and geometric means of the sigmoid weighted sub-word level confidence scores.
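The four word-level measures (10)-(13) as one function. The geometric means require positive scores: the sigmoid confidences always satisfy this, while for the raw LR(u_n) scores it is an assumption of this sketch.

```python
import math

def word_confidences(unit_lrs, alpha=1.0, tau=0.0):
    """W1..W4 of Eqs. (10)-(13) for a word built from sub-word scores.

    unit_lrs: LR(u_n) values from Eq. (8), assumed positive so that the
    geometric mean W2 is defined.
    """
    N = len(unit_lrs)
    U = [1.0 / (1.0 + math.exp(-alpha * (lr - tau))) for lr in unit_lrs]
    W1 = sum(unit_lrs) / N                                     # Eq. (10)
    W2 = math.exp(sum(math.log(lr) for lr in unit_lrs) / N)    # Eq. (11)
    W3 = sum(U) / N                                            # Eq. (12)
    W4 = math.exp(sum(math.log(u) for u in U) / N)             # Eq. (13)
    return W1, W2, W3, W4
```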
Likelihood Ratio-based Training
The goal of the training procedure is to increase the average value of $LR(Y, \lambda_c, \lambda_a)$ for correct hypotheses and decrease the average value of $LR(Y, \lambda_c, \lambda_a)$ for false alarms.
LR based training is a discriminative training algorithm based on a cost function which approximates a log likelihood ratio.
Likelihood Ratio-based Training (cont)
A distance measure underlies the cost function.

A frame based distance, defined for each state transition in the sequence obtained by using the single pass decoding strategy under the assumption used in (6), is given by

$$r_t(y_t) = \log\!\big(a_{ij}^c\, b_j^c(y_t)\big) - \log\!\big(a_{ij}^a\, b_j^a(y_t)\big), \quad i = q_{t-1},\ j = q_t \quad (14)$$

The segment based distance is obtained by averaging the frame based distances:

$$R_u(Y_u) = \frac{1}{t_f^u - t_i^u + 1} \sum_{t = t_i^u}^{t_f^u} r_t(y_t) \quad (15)$$

where $t_i^u$ and $t_f^u$ are the initial and final frames of the speech segment decoded as unit $u$ over the segment $Y_u = \{y_{t_i^u}, \ldots, y_{t_f^u}\}$.
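A sketch of (14)-(15), assuming the per-frame log scores along the decoded state path have already been collected into two parallel lists.

```python
def segment_distance(frame_log_c, frame_log_a):
    """Average frame-based distance over one decoded segment.

    frame_log_c[t] = log(a_ij^c * b_j^c(y_t)) along the decoded path and
    frame_log_a[t] = log(a_ij^a * b_j^a(y_t)), for frames t_i..t_f of the
    segment decoded as unit u.
    """
    r = [c - a for c, a in zip(frame_log_c, frame_log_a)]  # Eq. (14)
    return sum(r) / len(r)                                 # Eq. (15)
```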
Likelihood Ratio-based Training (cont)
Define the cost function $F(Y_u, \lambda_u)$ for unit $u$ using a sigmoid function:

$$F(Y_u, \lambda_u) = \frac{1}{1 + \exp\!\big(\gamma\, \kappa(u)\, R_u(Y_u)\big)} \quad (16)$$

where $\gamma$ controls the slope and the indicator function $\kappa(u)$ is defined as

$$\kappa(u) = \begin{cases} +1, & \text{correct} \\ -1, & \text{imposter} \end{cases}$$

A gradient update is performed on the expected cost $E\{F(Y_u, \lambda_u)\}$:

$$\lambda_u^{n+1} = \lambda_u^n - \epsilon_n\, \nabla E\{F(Y_u, \lambda_u)\}\big|_{\lambda_u = \lambda_u^n} \quad (17)$$

where $\lambda_u^n = \{\lambda_u^c, \lambda_u^a\}$ is the model computed at the $n$th update of the gradient descent procedure, and $\epsilon_n$ is the learning rate constant.

$E\{F(Y_u, \lambda_u)\}$ is approximated as the average cost computed over all occurrences of the unit $u$ in the training data:

$$E\{F(Y_u, \lambda_u)\} \approx \frac{1}{N_u} \sum_{i=1}^{N_u} F(Y_u^i, \lambda_u) \quad (18)$$

where $N_u$ is the number of occurrences of the unit $u$ in the training set.
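A sketch of the cost (16) and its empirical average (18), using the sign convention reconstructed above (a correct unit with a large positive R_u gets a cost near zero); the slope gamma is a placeholder.

```python
import math

def unit_cost(R_u: float, kappa_u: int, gamma: float = 1.0) -> float:
    """Sigmoid cost of Eq. (16); kappa_u is +1 for a correct unit and
    -1 for an imposter."""
    return 1.0 / (1.0 + math.exp(gamma * kappa_u * R_u))

def expected_cost(segments) -> float:
    """Eq. (18): average cost over all training occurrences of one unit.
    segments: iterable of (R_u, kappa_u) pairs."""
    segments = list(segments)
    return sum(unit_cost(R, k) for R, k in segments) / len(segments)

# Eq. (17) is then one step of gradient descent on this average, with the
# parameter gradients supplied by Eqs. (19)-(27).
```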
Likelihood Ratio-based Training (cont)
Imposters with scores greater than zero and targets with scores lower than zero tend to increase the average cost function. Therefore, if we minimize this function we can reduce the misclassification between targets and imposters.
Likelihood Ratio-based Training (cont)
Given a segment of observations $Y_u = \{y_{t_i^u}, \ldots, y_{t_f^u}\}$, the gradient of the cost function with respect to an HMM parameter $\theta_i^k$ follows from the chain rule through (16) and (15): the sigmoid contributes a factor $F(Y_u, \lambda_u)(1 - F(Y_u, \lambda_u))$, and the segment distance contributes the per-frame terms $\partial \log b_{q_t}^k(y_t)/\partial \theta_i^k$, giving

$$\frac{\partial F(Y_u, \lambda_u)}{\partial \theta_i^k} = \frac{\gamma}{T_u}\, \kappa(u, k)\, F(Y_u, \lambda_u)\big(1 - F(Y_u, \lambda_u)\big) \sum_{t=1}^{T_u} \frac{\partial \log b_{q_t}^k(y_t)}{\partial \theta_i^k} \quad (19)$$

where the index $k$ refers to the target ($c$) or alternative ($a$) HMM models, $T_u$ is the number of frames in the segment, and the indicator $\kappa(u, k)$ is defined on the next slide.
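Combining (16) and (19) for a single parameter gives the following sketch; the kappa(u, k) signs follow the reconstruction used here and on the next slide, so treat them as an assumption rather than the paper's exact convention.

```python
import math

def cost_gradient(R_u: float, correct: bool, k: str, dlogb_frames,
                  gamma: float = 1.0) -> float:
    """Gradient of the cost F w.r.t. one parameter of model k ('c' or 'a'),
    Eq. (19), given per-frame derivatives d log b_{q_t}^k(y_t) / d theta."""
    kappa_u = 1 if correct else -1
    F = 1.0 / (1.0 + math.exp(gamma * kappa_u * R_u))  # Eq. (16)
    kappa_uk = -kappa_u if k == "c" else kappa_u       # see next slide
    T_u = len(dlogb_frames)
    return (gamma / T_u) * kappa_uk * F * (1.0 - F) * sum(dlogb_frames)
```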
Likelihood Ratio-based Training (cont)
The indicator function $\kappa(u, k)$ dictates the direction of the gradient, depending on whether the decoded observations were correctly or incorrectly decoded and whether the re-estimated parameter is associated with the target or the alternate hypothesis model (signs as reconstructed here, chosen so that gradient descent raises the target model's likelihood on correct segments and lowers it on imposters):

$$\kappa(u, k) = \begin{cases} -1, & Y \text{ correct},\ k = c \\ +1, & Y \text{ correct},\ k = a \\ +1, & Y \text{ imposter},\ k = c \\ -1, & Y \text{ imposter},\ k = a \end{cases}$$

The observation densities $b_{q_t}^k(\cdot)$ in (19) represent the complete set of parameters to be re-estimated in the gradient update procedure, where

$$b_{q_t}^k(y_t) = \sum_{m=1}^M c_{q_t m}^k\, N\!\big(y_t; \mu_{q_t m}^k, \Sigma_{q_t m}^k\big) \quad (20)$$

is a mixture of $M$ Gaussian densities, the $m$th Gaussian mixture characterized by a mean vector $\mu_{q_t m}^k$, a diagonal covariance matrix $\Sigma_{q_t m}^k$, and a mixture weight $c_{q_t m}^k$.
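The same indicator as a function; the signs are the ones derived above, an assumption chosen so that plain gradient descent reinforces the target model on correctly decoded segments and pushes it away from imposters (and the reverse for the alternative model).

```python
def kappa(correct: bool, k: str) -> int:
    """kappa(u, k) for parameter set k: 'c' = target, 'a' = alternative."""
    if k == "c":
        return -1 if correct else 1
    return 1 if correct else -1
```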
Likelihood Ratio-based Training (cont)
The HMM parameters have to satisfy certain constraints, such as the positive definiteness of the variance $\sigma^2 > 0$ and the stochastic constraint $\sum_{i=1}^M c_{q_t i}^k = 1$. The gradient is therefore taken with respect to the transformed parameters $\tilde{c}$ and $\tilde{\sigma}$ (the mean $\mu$ is unconstrained), where the transformations are

$$c_{q_t m}^k = \frac{\exp(\tilde{c}_{q_t m}^k)}{\sum_{i=1}^M \exp(\tilde{c}_{q_t i}^k)} \quad (21)$$

$$\tilde{\sigma}_{q_t m i}^k = \log \sigma_{q_t m i}^k \quad (22)$$

Hence

$$\frac{\partial \log b_{q_t}^k(y_t)}{\partial \tilde{c}_{q_t m}^k} = \gamma_{q_t m}^k(t) - c_{q_t m}^k \quad (23)$$

where

$$\gamma_{q_t m}^k(t) = \frac{c_{q_t m}^k\, N\!\big(y_t; \mu_{q_t m}^k, \sigma_{q_t m}^k\big)}{b_{q_t}^k(y_t)}$$
Likelihood Ratio-based Training (cont)
For diagonal covariance Gaussians,

$$N\!\big(y_t|\mu_{q_t m}^k, \sigma_{q_t m}^k\big) = \prod_{i=1}^D \frac{1}{\sqrt{2\pi}\, \sigma_{q_t m i}^k} \exp\!\left(-\frac{(y_{t i} - \mu_{q_t m i}^k)^2}{2 (\sigma_{q_t m i}^k)^2}\right)$$

so the gradient with respect to the mean is

$$\frac{\partial \log b_{q_t}^k(y_t)}{\partial \mu_{q_t m i}^k} = \gamma_{q_t m}^k(t)\, \frac{y_{t i} - \mu_{q_t m i}^k}{(\sigma_{q_t m i}^k)^2} \quad (24)$$
Likelihood Ratio-based Training (cont)
Similarly, the gradient with respect to the transformed log standard deviation $\tilde{\sigma}_{q_t m i}^k = \log \sigma_{q_t m i}^k$ is

$$\frac{\partial \log b_{q_t}^k(y_t)}{\partial \tilde{\sigma}_{q_t m i}^k} = \gamma_{q_t m}^k(t) \left[\frac{(y_{t i} - \mu_{q_t m i}^k)^2}{(\sigma_{q_t m i}^k)^2} - 1\right] \quad (25)$$
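The three gradients (23)-(25) all share the mixture posterior gamma, so they can be computed together. A numpy sketch for one state's diagonal-covariance mixture; the shapes and names are illustrative, not the authors' code.

```python
import numpy as np

def mixture_gradients(y, mu, var, c):
    """Gradients of log b(y) for one state's Gaussian mixture.

    y: (D,) observation; mu: (M, D) means; var: (M, D) diagonal variances;
    c: (M,) mixture weights. Returns gradients w.r.t. the transformed
    weights c_tilde, the means mu, and sigma_tilde = log(sigma).
    """
    diff = y[None, :] - mu                                     # (M, D)
    log_comp = (np.log(c)
                - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                - 0.5 * np.sum(diff ** 2 / var, axis=1))
    log_b = np.logaddexp.reduce(log_comp)
    gamma = np.exp(log_comp - log_b)      # posterior of each component
    d_c_tilde = gamma - c                                      # Eq. (23)
    d_mu = gamma[:, None] * diff / var                         # Eq. (24)
    d_sigma_tilde = gamma[:, None] * (diff ** 2 / var - 1.0)   # Eq. (25)
    return d_c_tilde, d_mu, d_sigma_tilde
```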
Likelihood Ratio-based Training (cont)
When we consider $b_{q_t}^a(y_t) = w_{bg}\, b_{q_t}^{bg}(y_t) + w_{im}\, b_{q_t}^{im}(y_t)$: without loss of generality, both of the weights $w_{bg}$ and $w_{im}$ could be state dependent parameters. The weights must be transformed to satisfy the stochastic constraint $w_{bg} + w_{im} = 1$. The transformations are

$$w_{bg} = \frac{\exp(\tilde{w}_{bg})}{\exp(\tilde{w}_{bg}) + \exp(\tilde{w}_{im})}, \qquad w_{im} = \frac{\exp(\tilde{w}_{im})}{\exp(\tilde{w}_{bg}) + \exp(\tilde{w}_{im})}$$

The gradient of $\log b_{q_t}^a(y_t)$ can then be obtained:

$$\frac{\partial \log b_{q_t}^a(y_t)}{\partial \tilde{w}_{bg}} = w_{bg} \left(\frac{b_{q_t}^{bg}(y_t)}{b_{q_t}^a(y_t)} - 1\right) \quad (26)$$

$$\frac{\partial \log b_{q_t}^a(y_t)}{\partial \tilde{w}_{im}} = w_{im} \left(\frac{b_{q_t}^{im}(y_t)}{b_{q_t}^a(y_t)} - 1\right) \quad (27)$$
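A direct transcription of (26)-(27), assuming the two component likelihoods are available as plain (not log) probabilities.

```python
def weight_gradients(b_bg: float, b_im: float, w_bg: float):
    """Gradients of log b_a(y_t) w.r.t. the transformed softmax weights,
    Eqs. (26)-(27), for b_a = w_bg * b_bg + w_im * b_im, w_im = 1 - w_bg."""
    w_im = 1.0 - w_bg
    b_a = w_bg * b_bg + w_im * b_im
    d_wbg = w_bg * (b_bg / b_a - 1.0)   # Eq. (26)
    d_wim = w_im * (b_im / b_a - 1.0)   # Eq. (27)
    return d_wbg, d_wim
```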
Likelihood Ratio-based Training (cont)
The complete likelihood ratio based training procedure:
Train initial ML HMMs, $\lambda_c$ and $\lambda_a$, for each unit. Then, for each iteration over the training database:
Obtain the hypothesized sub-word unit string and segmentation using the LR decoder.
Align the decoded sub-word units as correct or false alarm, to obtain the indicator function $\kappa(u, k)$.
Update the gradient of the expected cost, $E\{F(Y_u, \lambda_u)\}$.
Update the model parameters as in (17).
Experimental results
Speech corpora : movie locator task
In a trial of the system over the public switched telephone network, the service was configured to accept approximately 105 theater names, 135 city names, and between 75 and 100 current movie titles.
A corpus of 4777 spontaneous spoken utterances from the trial was used in our evaluation.
Experimental results (cont)
A total of 3025 sentences were used for training acoustic models and 1752 utterances were used for testing.
The sub-word models used in the recognizer consisted of 43 context independent units.
Recognition was performed using a finite state grammar built from the specification of the service, with a lexicon of 570 different words.
Experimental results (cont)
The total number of words in the test set was 4864, where 134 of them were OOV.
Recognition performance of 94% word accuracy was obtained on the "in-grammar" utterances.
The feature set used for recognition included 12 mel-cepstrum, 12 delta mel-cepstrum, 12 delta-delta mel-cepstrum, energy, delta energy, and delta-delta energy coefficients; cepstral mean normalization was applied.
Experimental results (cont)
A single "background" HMM alternate model, $\lambda_a^{bg}$, containing three states with 32 mixtures per state was used.
A separate "imposter" alternative HMM model, $\lambda_a^{im}$, was trained for each sub-word unit. These models contained three states with eight mixtures per state.
Experimental results (cont)
Performance is described both in terms of receiver operating characteristic (ROC) curves and curves displaying the type I + type II error plotted against the decision threshold setting.
Type I error (false rejection): words that are correctly decoded by the recognizer but rejected by the utterance verification process.
Type II error (false alarm): incorrectly decoded word insertions and substitutions which are generated by the recognizer and also accepted by the utterance verification system.
Detection rate: 1 − (number of false rejections / number of all correctly decoded words).
False accept rate: number of false alarms / number of all incorrectly decoded words.
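The same bookkeeping as code; treating "type I + type II" as the sum of the two error rates is an assumption about how the plots are normalized.

```python
def uv_error_rates(n_correct, n_false_reject, n_incorrect, n_false_alarm):
    """Operating-point statistics for the ROC and error-sum plots.

    n_correct:      words correctly decoded by the recognizer
    n_false_reject: correct words rejected by UV (type I errors)
    n_incorrect:    incorrectly decoded words (insertions + substitutions)
    n_false_alarm:  incorrect words accepted by UV (type II errors)
    """
    detection_rate = 1.0 - n_false_reject / n_correct
    false_accept_rate = n_false_alarm / n_incorrect
    type1_plus_type2 = (n_false_reject / n_correct
                        + n_false_alarm / n_incorrect)
    return detection_rate, false_accept_rate, type1_plus_type2
```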
Experimental results (cont)
Experiment 1 : Comparison of UV Measures
Fig. 4. ROC curves comparing performance of confidence measures using $W_1(w)$ (dashed line) and $W_2(w)$ (solid line) (left figure), and using $W_3(w)$ (dashed line) and $W_4(w)$ (solid line) (right figure).
Experimental results (cont)
Experiment 1 : Comparison of UV Measures
Fig. 5. Type I + type II error comparing performance of confidence measures using $W_3(w)$ (dashed line) and $W_4(w)$ (solid line).
It appears from the error plot in Fig. 5 that $W_4$ is less sensitive to the setting of the confidence threshold.
In the remaining simulations, $W_4$ will be used.
Experimental results (cont)
Experiment 2 :Investigation of LR Training and UV strategies
TABLE I. Utterance verification performance: type I + type II minimum error rate for the one-pass (OP) and the two-pass (TP) utterance verification procedures. bN denotes the number of mixtures for the background model and iN the number of mixtures for the imposter model.
Fig. 6. Likelihood ratio training: ROC curves for initial models (dash-dot line), one iteration (dashed line), and two iterations (solid line). The *-points are the minimum type I + type II error.
Experimental results (cont)
Experiment 2 :Investigation of LR Training and UV strategies
Fig. 7. One-pass versus two-pass UV comparison with the b32.i8 configuration and two iterations of the likelihood ratio training.
Experimental results (cont)
Experiment 3 : whether the LR training procedure actually improved speech recognition performance.
TABLE II. Speech recognition performance given in terms of word accuracy without using utterance verification, and utterance verification performance given as the sum of type I and type II errors.
Experimental results (cont)
Experiment 4 : performance measured over in-grammar and out-of-grammar utterances, respectively.
Fig. 8. In-grammar and out-of-grammar sentences. Initial models: dash-dot line; one iteration: dashed line; two iterations: solid line.
Summary and Conclusions
The one-pass decoding procedure improved UV performance over the two-pass approach.
Likelihood ratio training and decoding have also been successfully applied to other tasks, including speaker dependent voice label recognition.
Further research should involve the investigation of decoding and training paradigms for UV that incorporate additional, non-acoustic sources of knowledge.