Robust Speech Feature

Decorrelated and Liftered Filter-Bank Energies (DLFBE)

Proposed by K.K. Paliwal, in EuroSpeech '99

DLFBE --- Preliminary

* MFCC is very successful in speech recognition

* MFCC is computed from the speech signal in the following three steps:

1. Compute the FFT power spectrum of the speech signal

2. Apply a Mel-spaced filter-bank to the power spectrum to get N energies (N = 20~60)

3. Compute the discrete cosine transform (DCT) of the log filter-bank energies to get M uncorrelated MFCCs (M = 10)
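The three steps above can be sketched for a single frame as follows. This is a minimal illustration, not the exact front end used in the paper: the Mel formula, the triangular filter shape, the Hamming window, and the defaults (`n_filters=24`, `n_fft=512`) are all assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    # A common Mel-scale formula (assumed here; the slides do not specify one)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced uniformly on the Mel scale.
    Returns an array of shape (n_filters, n_fft // 2 + 1)."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    fbank = np.zeros((n_filters, freqs.size))
    for k in range(n_filters):
        lo, mid, hi = hz_edges[k], hz_edges[k + 1], hz_edges[k + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank

def mfcc_frame(frame, sample_rate, n_filters=24, n_ceps=10, n_fft=512):
    """Return (MFCC vector of length n_ceps, log FBE vector of length n_filters)."""
    # Step 1: FFT power spectrum of the windowed frame
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    # Step 2: N Mel filter-bank energies, then log (small floor avoids log(0))
    fbe = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_fbe = np.log(fbe + 1e-10)
    # Step 3: DCT-II of the log FBEs, keeping coefficients 1..M
    n = np.arange(n_filters)
    i = np.arange(1, n_ceps + 1)[:, None]
    dct_basis = np.cos(np.pi * i * (n + 0.5) / n_filters)
    return dct_basis @ log_fbe, log_fbe
```

For example, `mfcc_frame(frame, 16000)` on a 25 ms frame of 400 samples returns M = 10 cepstral coefficients together with the N = 24 log filter-bank energies they were computed from.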

DLFBE --- Motivation

* MFCC has two drawbacks:

1. It does not have any physical interpretation

2. Liftering of cepstral coefficients has no effect in modern speech recognition (discussed later)

* The two problems (i.e., number and correlation) of the FBEs used in the 50's, 60's and 70's can be solved today

Liftering --- What and How

* Liftering is the reweighting of cepstral coefficients, used in the DTW framework of speech recognition

Euclidean distance:

$$d(x; x') = (x - x')^t (x - x') = \sum_{i=1}^{D} (x_i - x'_i)^2$$

where $d(x, x')$ is the dissimilarity between the test vector $x$ and the mean vector $x'$

Liftering --- What and How (cont’d)

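$$y_i = w_i x_i$$

where $x_i$ is the i-th cepstral coefficient, $y_i$ is the corresponding liftered coefficient, and $w_i$ is the lifter. So

$$y = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_D \end{bmatrix} x$$

More general form:

$$y = \begin{bmatrix} a & b & c & d & \cdots \\ e & f & g & h & \cdots \\ \vdots & & & & \end{bmatrix} x$$

With the diagonal lifter, the distance becomes

$$d(y, y') = (y - y')^t (y - y') = \sum_{i=1}^{D} (y_i - y'_i)^2 = \sum_{i=1}^{D} [w_i (x_i - x'_i)]^2$$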

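The types of lifters are listed below:

1. Linear lifter: $w_i = i$

2. Statistical lifter: $w_i = 1/\sigma_i$

3. Sinusoidal lifter: $w_i = 1 + \frac{D}{2}\sin\left(\frac{\pi i}{D}\right)$

4. Exponential lifter: $w_i = i^{s}\exp\left(-\frac{i^{2}}{2\tau^{2}}\right)$, with $s = 1.5$, $\tau = 5$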
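As a rough illustration, the four lifter types and the liftered Euclidean distance can be sketched as below. The function names are hypothetical, and the formulas are assumptions where the slide is hard to read: the sinusoidal lifter is taken as 1 + (D/2) sin(pi i / D), the exponential lifter as i^s exp(-i^2 / (2 tau^2)) with s = 1.5 and tau = 5, and `sigma` in the statistical lifter is assumed to be the per-coefficient standard deviation.

```python
import numpy as np

def lifter_weights(kind, D, sigma=None, s=1.5, tau=5.0):
    """Return lifter weights w_1..w_D for the named lifter type."""
    i = np.arange(1, D + 1, dtype=float)
    if kind == "linear":
        return i
    if kind == "statistical":
        # sigma: assumed to hold the standard deviation of each cepstral coefficient
        return 1.0 / np.asarray(sigma, dtype=float)
    if kind == "sinusoidal":
        return 1.0 + (D / 2.0) * np.sin(np.pi * i / D)
    if kind == "exponential":
        return i ** s * np.exp(-(i ** 2) / (2.0 * tau ** 2))
    raise ValueError(f"unknown lifter type: {kind}")

def liftered_distance(x, x_ref, w):
    """Weighted squared Euclidean distance: sum_i [w_i (x_i - x'_i)]^2."""
    diff = np.asarray(w, dtype=float) * (np.asarray(x, dtype=float) - np.asarray(x_ref, dtype=float))
    return float(diff @ diff)
```

With the linear lifter and D = 3, `liftered_distance([1, 1, 1], [0, 0, 0], lifter_weights("linear", 3))` weights the three squared differences by 1, 4 and 9.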

Liftering --- Discussion and Why

* Multiplicative weighting in the cepstral domain is equivalent to convolution in the spectral domain

$$W_k * C_k \;\xrightarrow{\ \text{IFFT}\ }\; w_n \cdot c_n$$

Lifter type     Spectral domain   Cepstral domain
Type 1 and 2    HP filter         Emphasize the higher cepstral coefficients
Type 3 and 4    BP filter         Lessen the higher and lower cepstral coefficients
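The multiplication/convolution equivalence can be checked numerically with the DFT. The sketch below uses random sequences as stand-ins for the lifter and the cepstrum (an assumption made purely for the check) and verifies that the DFT of the product equals the circular convolution of the two DFTs, scaled by 1/N.

```python
import numpy as np

def circular_convolve(a, b):
    """Circular convolution: out[k] = sum_m a[m] * b[(k - m) mod N]."""
    N = len(a)
    return np.array([sum(a[m] * b[(k - m) % N] for m in range(N)) for k in range(N)])

rng = np.random.default_rng(0)
N = 16
c = rng.standard_normal(N)   # stand-in for cepstral coefficients c_n
w = rng.standard_normal(N)   # stand-in for lifter weights w_n

# Product in the cepstral (index n) domain, transformed to the spectral (index k) domain
lhs = np.fft.fft(w * c)
# Circular convolution of the two transforms, with the DFT's 1/N scaling
rhs = circular_convolve(np.fft.fft(w), np.fft.fft(c)) / N

print(np.allclose(lhs, rhs))
```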

Liftering --- Experiment on DTW

Liftering on CDHMM (??) --- Why

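The Mahalanobis distance measure arises from the output observation probability:

$$d(x; x', \Sigma_{x'}) = (x - x')^t \, \Sigma_{x'}^{-1} \, (x - x')$$

Liftering matrix for MFCC:

$$y = Wx, \qquad y' = Wx', \qquad \Sigma_{y'} = W \Sigma_{x'} W^t$$

where

$$W = \begin{bmatrix} w_1 & 0 & 0 & \cdots & 0 \\ 0 & w_2 & 0 & \cdots & 0 \\ 0 & 0 & w_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & w_D \end{bmatrix}_{D \times D}$$

Then

$$\begin{aligned}
d(y; y', \Sigma_{y'}) &= (y - y')^t \, \Sigma_{y'}^{-1} \, (y - y') \\
&= (Wx - Wx')^t \, (W \Sigma_{x'} W^t)^{-1} \, (Wx - Wx') \\
&= (x - x')^t \, W^t (W^t)^{-1} \, \Sigma_{x'}^{-1} \, W^{-1} W \, (x - x') \\
&= (x - x')^t \, \Sigma_{x'}^{-1} \, (x - x') \\
&= d(x; x', \Sigma_{x'})
\end{aligned}$$

Thus, cepstral liftering has no effect on the recognition process when used with continuous-observation Gaussian-density HMMs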
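The invariance can be confirmed numerically. The sketch below draws a random test vector, mean and covariance (all synthetic, chosen only for the check), applies a linear lifter as a diagonal matrix W, and compares the Mahalanobis distance before and after liftering. The Euclidean distance is printed alongside to show that it does change.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^t cov^{-1} (x - mean)."""
    diff = x - mean
    return float(diff @ np.linalg.solve(cov, diff))

rng = np.random.default_rng(0)
D = 6
x = rng.standard_normal(D)
mean = rng.standard_normal(D)
A = rng.standard_normal((D, D))
cov = A @ A.T + D * np.eye(D)            # symmetric positive definite covariance

W = np.diag(np.arange(1, D + 1, dtype=float))   # linear lifter w_i = i

d_x = mahalanobis_sq(x, mean, cov)
d_y = mahalanobis_sq(W @ x, W @ mean, W @ cov @ W.T)

euclid_x = float((x - mean) @ (x - mean))
euclid_y = float((W @ (x - mean)) @ (W @ (x - mean)))

print(np.isclose(d_x, d_y))   # Mahalanobis distance is unchanged by liftering
print(euclid_x, euclid_y)     # Euclidean distance is not
```

The same check holds for any invertible W, not only diagonal lifters, since the derivation uses only the existence of the inverse.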

Decorrelation of FBE --- Why/How

* FBEs are correlated => we can't use CDHMM

* We can use LP techniques to solve this defect

$$A(z) = 1 - P(z) = 1 - \sum_{i=1}^{p} a_i z^{-i}$$

$\{a_i\}$ can be obtained by the covariance method of LP analysis

$\{e_n\},\ n = 0, 1, \ldots, N-1$ --> $A(z)$ --> M decorrelated FBEs, with p + M = N
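One way to read the decorrelation step is as linear prediction along the filter-bank index n. The sketch below is an illustration under that assumption, not the paper's exact procedure: the function names are hypothetical, the predictor is fitted by least squares over all frames and all positions n = p..N-1 (a covariance-method style fit), and no mean removal is applied.

```python
import numpy as np

def lp_covariance(frames, p):
    """Fit a_1..a_p so that e_n is predicted by sum_i a_i e_{n-i}.
    frames: array of shape (T, N), one (log) FBE vector per row."""
    frames = np.atleast_2d(np.asarray(frames, dtype=float))
    N = frames.shape[1]
    # Column i-1 of X holds e_{n-i} for every (frame, n) pair with n = p..N-1
    X = np.stack([frames[:, p - i:N - i].ravel() for i in range(1, p + 1)], axis=1)
    y = frames[:, p:].ravel()
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def decorrelate(e, a):
    """Apply A(z) = 1 - sum_i a_i z^{-i} to one FBE vector.
    Returns M = N - p prediction residuals."""
    e = np.asarray(e, dtype=float)
    p, N = len(a), len(e)
    pred = sum(a[i - 1] * e[p - i:N - i] for i in range(1, p + 1))
    return e[p:] - pred
```

With N = 24 energies and a first-order predictor (p = 1), `decorrelate` returns M = 23 values, matching p + M = N.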

Liftering of FBE --- How

$$H(z) = \sum_{i=0}^{L} h_i z^{-i}$$

FIR filter

$\{e_n\},\ n = 0, 1, \ldots, N-1$ --> $H(z)$ --> M liftered FBEs, with N = M + L
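A minimal sketch of FIR liftering along the filter-bank index, assuming only the fully overlapping outputs are kept so that N = M + L holds. The function name `lifter_fbe` is hypothetical, and the example filter 1 - z^{-2} is chosen for illustration.

```python
import numpy as np

def lifter_fbe(e, h):
    """Filter the FBE vector e (length N) with FIR taps h_0..h_L.
    Keeps y_n = sum_i h_i e_{n-i} for n = L..N-1, i.e. M = N - L outputs."""
    return np.convolve(np.asarray(e, dtype=float), np.asarray(h, dtype=float), mode="valid")

e = np.array([1.0, 2.0, 4.0, 7.0, 11.0])   # N = 5 toy energies
h = np.array([1.0, 0.0, -1.0])             # H(z) = 1 - z^{-2}, so L = 2
y = lifter_fbe(e, h)                       # y_n = e_n - e_{n-2}
print(y)                                   # [3. 5. 7.], M = 3
```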

DLFBE --- Experiment

* Speaker-independent (SI) isolated-word recognition using the ISOLET spoken-letter database

* 90 training utterances from 90 speakers (45 females, 45 males)

* 30 testing utterances from 30 speakers (15 females, 15 males)

DLFBE --- Experiment (cont’d)

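Configurations compared:

$P(z)$: no | $a_1 z^{-1}$ | $a_1 z^{-1} + a_2 z^{-2}$ | no | no | no | no | $a_1 z^{-1} + a_2 z^{-2}$

$H(z)$: no | no | no | $1 - 0.5 z^{-1}$ | $1 - 0.75 z^{-1}$ | $1 - z^{-1}$ | $1 - z^{-2}$ | $1 - z^{-1}$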

DLFBE --- Experiment (cont’d)

Robust Speech Feature

Noise-Invariant Representation for Speech Signal

Group Delay Function (GDF) Method

Proposed by Bayya & Yegnanarayana, in EuroSpeech '99

GDF --- Motivation

* Background noise is a prominent source of mismatch and is roughly eliminated by the following methods

1. Compensation

   Pre-processing: SS (spectral subtraction), HP, BP

   FN (feature normalization)

   Model adaptation: parameter transform

   These cause overestimation and underestimation side effects

GDF --- Motivation (cont’d)

2. New feature

   LPC, MEL, PLP (projection concept)

   Not completely noise resistant

* All of the above use power/amplitude as the speech feature

Why don't we use phase information as features?

Phase information may also be helpful in speech recognition.

GDF --- What/How

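* GDF is defined from the normalized autocorrelation of a short segment of a signal

$$\log\Big(1 + \sum_{n=1}^{\infty} r(n)\, e^{-j\omega n}\Big) = \log R(\cdot) \qquad (\#.1)$$

where $r(n)$ is the normalized autocorrelation of a short segment of a signal

$$\log R(\cdot) = \log\big(|R(\cdot)|\, e^{\,j\arg(R(\cdot))}\big) = \log|R(\cdot)| + j\arg(R(\cdot)) \qquad (\#.2)$$

Compare (#.1) & (#.2)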

GDF --- What/How (cont’d)

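$$\log\Big(1 + \sum_{n=1}^{\infty} r(n)\, e^{-j\omega n}\Big) \approx \sum_{n=1}^{\infty} r(n)\, e^{-j\omega n} = \sum_{n=1}^{\infty} r(n)\cos(n\omega) - j\sum_{n=1}^{\infty} r(n)\sin(n\omega)$$

$$\arg(R(\cdot)) = -\sum_{n=1}^{\infty} r(n)\sin(n\omega), \qquad r(n) \approx 0,\ n > 0$$

GDF:

$$-\arg(R(\cdot))' = \sum_{n=1}^{\infty} r(n)\,[\,n\cos(n\omega)\,]$$

Truncated version of GDF:

$$\mathrm{GDF} = \sum_{n=1}^{p} \{\, r(n)\,[\,n\cos(n\omega)\,] \,\}\, w(n)$$

$p = 10 \sim 30$

Easy to implement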


where $w(n) = 0.5 + 0.5\cos(2\pi n / P)$, $1 \le n \le p$, is a Hanning window
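The truncated GDF is a short weighted cosine sum, which the sketch below evaluates on a grid of frequencies. Several details are assumptions made for illustration: the function names are hypothetical, the plain (not the paper's "modified") normalized autocorrelation is used, and the default window is a half Hanning taper 0.5 + 0.5 cos(pi n / p), which corresponds to taking P = 2p in the window formula. A different window can be passed in explicitly.

```python
import numpy as np

def normalized_autocorr(segment, p):
    """Return r(0..p) of a short segment, normalized so that r(0) = 1.
    The segment must be longer than p samples and have nonzero energy."""
    segment = np.asarray(segment, dtype=float)
    full = np.correlate(segment, segment, mode="full")
    zero_lag = len(segment) - 1
    r = full[zero_lag:zero_lag + p + 1]
    return r / r[0]

def truncated_gdf(r, omegas, window=None):
    """GDF(omega) = sum_{n=1}^{p} r(n) * n * cos(n * omega) * w(n).
    r: array r(0..p); omegas: frequencies in radians per sample."""
    r = np.asarray(r, dtype=float)
    p = len(r) - 1
    n = np.arange(1, p + 1)
    if window is None:
        # Assumed half Hanning taper over lags 1..p (P = 2p)
        window = 0.5 + 0.5 * np.cos(np.pi * n / p)
    weights = r[1:] * n * np.asarray(window, dtype=float)
    return np.cos(np.outer(omegas, n)) @ weights
```

A typical call would be `truncated_gdf(normalized_autocorr(segment, 20), np.linspace(0, np.pi, 128))`, giving one GDF value per frequency point for p = 20.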

GDF --- Why & Experiment

* Frame length = 5 ms, frame shift = 1 ms; the modified autocorrelation sequence is averaged over 20 frames, and the GDF is then computed as defined above

GDF --- Why & Experiment (cont’d)

GDF --- Experiment

* Isolated-digit recognition

        Clean    Noisy
SI      97%      95%      YES
SD      96.5%    94.5%    NO

Due to large dynamic range?
