Robust Speech Feature
Decorrelated and Liftered Filter-Bank Energies (DLFBE)
Proposed by K.K. Paliwal, in EuroSpeech 99
DLFBE --- Preliminary
* MFCC is very successful in speech recognition
* MFCC is computed from the speech signal in the following three steps:
  1. Compute the FFT power spectrum of the speech signal
  2. Apply a Mel-spaced filter-bank to the power spectrum to get N energies (N = 20~60)
  3. Compute the discrete cosine transform (DCT) of the log filter-bank energies to get uncorrelated MFCCs (M = 10)
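The three steps above can be sketched in a few lines of NumPy. This is only an illustration, not the implementation behind the slides: the Hamming window, the Mel formula 2595 log10(1 + f/700), the triangular filter shape, and the names `mel_filterbank` and `mfcc_frame` are assumptions made for the example.

```python
import numpy as np

def hz_to_mel(f):
    # Common Mel-scale formula (an assumption; the slides do not give one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fbank = np.zeros((n_filters, freqs.size))
    for k in range(n_filters):
        lo, mid, hi = hz_edges[k], hz_edges[k + 1], hz_edges[k + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank

def mfcc_frame(frame, sample_rate, n_filters=24, n_ceps=10):
    """MFCCs of one frame: power spectrum -> Mel energies -> DCT of logs."""
    n_fft = frame.size
    # Step 1: FFT power spectrum (Hamming window is an assumed choice).
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    # Step 2: N Mel filter-bank energies.
    fbe = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_fbe = np.log(fbe + 1e-12)  # small floor avoids log(0)
    # Step 3: DCT-II of the log energies, keeping coefficients 1..M.
    n = np.arange(n_filters)
    k = np.arange(1, n_ceps + 1)[:, None]
    dct = np.cos(np.pi * k * (n + 0.5) / n_filters)
    return dct @ log_fbe
```

For a 256-sample frame at an assumed 8 kHz sampling rate, `mfcc_frame(frame, 8000)` returns M = 10 coefficients computed from N = 24 filter-bank energies.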
DLFBE --- Motivation
*MFCC has two drawbacks:
  1. It does not have any physical interpretation
  2. Liftering of cepstral coefficients has no effect in modern speech recognition (discussed later)
*The two problems of the FBEs used in the 50's, 60's and 70's (i.e., their number and their correlation) can be solved today
Liftering --- What and How
*Liftering is the reweighting of cepstral coefficients, used in the DTW framework of speech recognition
Euclidean distance:
$$d(x, x') = (x - x')^t (x - x') = \sum_{i=1}^{D} (x_i - x'_i)^2$$
where $d(x, x')$ is the dissimilarity between the test vector $x$ and the mean vector $x'$
Liftering --- What and How (cont’d)
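$$y_i = w_i x_i$$
where $x_i$ is the i-th cepstral coefficient, $y_i$ is the corresponding liftered coefficient and $w_i$ is the lifter weight
So
$$y = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_D \end{bmatrix} x$$
More general form
$$y = \begin{bmatrix} a & b & c & d & \cdots \\ e & f & g & h & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix} x$$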
Liftering --- What and How (cont’d)
$$d(y, y') = (y - y')^t (y - y') = \sum_{i=1}^{D} (y_i - y'_i)^2 = \sum_{i=1}^{D} [w_i (x_i - x'_i)]^2$$
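As a small numerical check, the Euclidean distance between liftered vectors can be compared with the weighted Euclidean distance between the raw cepstral vectors. The function names and the random test data below are hypothetical, chosen only for illustration.

```python
import numpy as np

def liftered_distance(x, x_ref, w):
    """Lifter both vectors (y_i = w_i * x_i), then take the squared Euclidean distance."""
    y, y_ref = w * x, w * x_ref
    return float(np.sum((y - y_ref) ** 2))

def weighted_distance(x, x_ref, w):
    """Squared Euclidean distance with each difference weighted by w_i."""
    return float(np.sum((w * (x - x_ref)) ** 2))

rng = np.random.default_rng(0)
D = 10
x, x_ref = rng.standard_normal(D), rng.standard_normal(D)
w = np.arange(1, D + 1, dtype=float)  # example weights only

d_lift = liftered_distance(x, x_ref, w)
d_weight = weighted_distance(x, x_ref, w)
```

The two values agree up to rounding, which is why liftering changes the outcome of a DTW recognizer built on Euclidean distances.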
Liftering --- What and How (cont’d)
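The types of lifters are listed below:
1. Linear lifter: $w_i = i$
2. Statistical lifter: $w_i = 1/\sigma_i$, where $\sigma_i$ is the standard deviation of the i-th cepstral coefficient
3. Sinusoidal lifter: $w_i = 1 + \frac{D}{2}\sin\left(\frac{\pi i}{D}\right)$
4. Exponential lifter: $w_i = i^s \exp\left(-\frac{i^2}{2\tau^2}\right)$, with $s = 1.5$, $\tau = 5$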
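The four lifter types can be written as short NumPy functions. This sketch assumes the commonly used forms w_i = i (linear), w_i = 1/sigma_i (statistical), w_i = 1 + (D/2) sin(pi i / D) (sinusoidal) and w_i = i^s exp(-i^2 / (2 tau^2)) with s = 1.5, tau = 5 (exponential). The function names are hypothetical, and the statistical lifter takes an assumed training matrix of cepstral vectors, one per row.

```python
import numpy as np

def linear_lifter(D):
    """w_i = i for i = 1..D."""
    return np.arange(1, D + 1, dtype=float)

def statistical_lifter(cepstra):
    """w_i = 1 / sigma_i, with sigma_i estimated per column of `cepstra`."""
    return 1.0 / np.std(cepstra, axis=0)

def sinusoidal_lifter(D):
    """w_i = 1 + (D / 2) * sin(pi * i / D)."""
    i = np.arange(1, D + 1)
    return 1.0 + (D / 2.0) * np.sin(np.pi * i / D)

def exponential_lifter(D, s=1.5, tau=5.0):
    """w_i = i**s * exp(-i**2 / (2 * tau**2))."""
    i = np.arange(1, D + 1, dtype=float)
    return i ** s * np.exp(-(i ** 2) / (2.0 * tau ** 2))
```

The linear and statistical weights grow with the index (in practice sigma_i shrinks for higher coefficients), while the sinusoidal and exponential weights peak at mid-range indices.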
Liftering --- Discussion and Why
* The multiplicative weighting in the cepstral domain is equivalent to convolution in the spectral domain:
  $C_k * W_k \xrightarrow{\text{IFFT}} c_n \cdot w_n$

Lifter          Spectral domain    Cepstral domain
Type 1 and 2    HP filter          Emphasize the higher cepstral coefficients
Type 3 and 4    BP filter          Lessen the higher and lower cepstral coefficients
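The equivalence rests on a general DFT property: the transform of an element-wise product equals the circular convolution of the transforms, scaled by 1/N. The sketch below checks that property numerically on random stand-in sequences. The helper `circular_convolve` and the test data are assumptions for illustration, not part of the original slides.

```python
import numpy as np

def circular_convolve(a, b):
    """Circular convolution: out[k] = sum_m a[m] * b[(k - m) mod N]."""
    n = a.size
    return np.array([np.sum(a * np.roll(b[::-1], k + 1)) for k in range(n)])

rng = np.random.default_rng(0)
N = 16
c = rng.standard_normal(N)  # stand-in for a cepstral sequence
w = rng.standard_normal(N)  # stand-in for lifter weights

# Transform of the product versus convolution of the transforms.
lhs = np.fft.fft(c * w)
rhs = circular_convolve(np.fft.fft(c), np.fft.fft(w)) / N
```

Comparing `lhs` and `rhs` with `np.allclose` confirms that multiplying in one domain acts as a filter (a convolution) in the other.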
Liftering --- Experiment on DTW
Liftering on CDHMM (??) --- Why
Mahalanobis distance measure, due to the Gaussian observation probability:
$$d(x; x', \Sigma_{x'}) = (x - x')^t \Sigma_{x'}^{-1} (x - x')$$
Liftering on CDHMM (??) --- Why
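$W$: liftering matrix for MFCC
$$y = Wx, \quad y' = Wx', \quad \Sigma_{y'} = W \Sigma_{x'} W^t$$
where
$$W = \begin{bmatrix} w_1 & 0 & 0 & \cdots & 0 \\ 0 & w_2 & 0 & \cdots & 0 \\ 0 & 0 & w_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & w_D \end{bmatrix}_{D \times D}$$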
Liftering on CDHMM (??) --- Why
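$$\begin{aligned}
d(y; y', \Sigma_{y'}) &= (y - y')^t \Sigma_{y'}^{-1} (y - y') \\
&= (Wx - Wx')^t (W \Sigma_{x'} W^t)^{-1} (Wx - Wx') \\
&= (x - x')^t W^t (W^t)^{-1} \Sigma_{x'}^{-1} W^{-1} W (x - x') \\
&= (x - x')^t \Sigma_{x'}^{-1} (x - x') \\
&= d(x; x', \Sigma_{x'})
\end{aligned}$$
Thus, cepstral liftering has no effect on the recognition process when used with continuous-observation Gaussian-density HMMs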
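The cancellation can be checked numerically: transforming the vectors by an invertible liftering matrix W and the covariance by W Sigma W^t leaves the Mahalanobis distance unchanged. The helper name, the random test data and the sinusoidal weights used for W are assumptions made for this sketch.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^t cov^{-1} (x - mean)."""
    diff = x - mean
    return float(diff @ np.linalg.solve(cov, diff))

rng = np.random.default_rng(0)
D = 10
x = rng.standard_normal(D)
mean = rng.standard_normal(D)
A = rng.standard_normal((D, D))
cov = A @ A.T + D * np.eye(D)  # symmetric positive definite by construction

# Example lifter: sinusoidal weights (all strictly positive, so W is invertible).
w = 1.0 + (D / 2.0) * np.sin(np.pi * np.arange(1, D + 1) / D)
W = np.diag(w)

d_x = mahalanobis_sq(x, mean, cov)
d_y = mahalanobis_sq(W @ x, W @ mean, W @ cov @ W.T)
```

`d_x` and `d_y` agree up to rounding. The same holds for any invertible W, not only diagonal ones, because the model's covariance absorbs the transform.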
Decorrelation of FBE --- Why/How
*FBEs are correlated => we can't use CDHMM
*We can use LP techniques to remove this defect:
$$P(z) = \sum_{i=1}^{p} a_i z^{-i}, \qquad A(z) = 1 - P(z)$$
{e_n}, n = 0, 1, ..., N-1  ->  A(z)  ->  M decorrelated FBEs (p + M = N)
*{a_i} can be obtained by the covariance method of LP analysis
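One way to picture this step is to treat the N filter-bank energies of a frame as a short sequence over the filter index n, fit an order-p linear predictor by least squares (the covariance method, which uses no windowing), and keep the prediction residual. This is a sketch under those assumptions: the function names are hypothetical, and fitting the predictor on a single sequence is a simplification made only for the example.

```python
import numpy as np

def _lagged(e, p):
    """Matrix whose column i-1 holds e[n - i] for n = p..N-1."""
    N = e.size
    return np.column_stack([e[p - i:N - i] for i in range(1, p + 1)])

def lp_covariance(e, p):
    """Least-squares predictor coefficients a_1..a_p (covariance method)."""
    a, *_ = np.linalg.lstsq(_lagged(e, p), e[p:], rcond=None)
    return a

def decorrelate(e, a):
    """Apply A(z) = 1 - sum_i a_i z^-i; returns N - p residual values."""
    p = a.size
    return e[p:] - _lagged(e, p) @ a

# A perfectly predictable sequence: e_n = 0.9 * e_{n-1}.
e = 0.9 ** np.arange(20)
a = lp_covariance(e, 1)
residual = decorrelate(e, a)
```

Here the fitted coefficient is 0.9 and the residual is numerically zero. On real energies the residual keeps only the part of each energy that its neighbours do not predict.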
Liftering of FBE --- How
$$H(z) = \sum_{i=0}^{L} h_i z^{-i}$$
FIR filter
{e_n}, n = 0, 1, ..., N-1  ->  H(z)  ->  M liftered FBEs
N = M + L
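Applying an FIR filter along the filter-bank index and keeping only the fully overlapped outputs gives exactly M = N - L values. A minimal sketch follows; the function name and the example coefficient vectors are hypothetical illustrations, not the filters evaluated in the slides.

```python
import numpy as np

def fir_lifter(e, h):
    """y_n = sum_{i=0}^{L} h_i * e_{n-i} for n = L..N-1 (valid part only)."""
    return np.convolve(e, h, mode="valid")

e = np.arange(6, dtype=float)                       # N = 6 stand-in energies
first_diff = fir_lifter(e, np.array([1.0, -1.0]))   # L = 1, so M = 5
second_diff = fir_lifter(e, np.array([1.0, 0.0, -1.0]))  # L = 2, so M = 4
```

For this ramp input the first filter returns five ones and the second returns four twos, matching N = M + L in both cases.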
DLFBE --- Experiment
*Speaker-independent (SI) isolated-word recognition using the ISOLET spoken letter database
*90 training utterances from 90 speakers (45 females, 45 males)
*30 testing utterances from 30 speakers (15 females, 15 males)
DLFBE --- Experiment (cont’d)
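[Table: recognition results for combinations of decorrelation filter P(z) and liftering filter H(z). P(z) settings: none, first order (coefficient a_1), second order (coefficients a_1, a_2). H(z) settings: none, first-order FIR filters in z^{-1} with coefficient 0.5, 0.75 or 1, and an FIR filter in z^{-2}.]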
DLFBE --- Experiment (cont’d)
Robust Speech Feature
Noise-Invariant Representation for Speech Signal
Group Delay Function (GDF) Method
Proposed by Bayya & Yegnanarayana, in EuroSpeech '99
GDF --- Motivation
*Background noise is a prominent source of mismatch; it is roughly eliminated by the following methods
1. Compensation (causes overestimation and underestimation side effects)
   - Pre-processing: SS (spectral subtraction), HP, BP
   - FN (feature normalization)
   - Model adaptation: parameter transform
GDF --- Motivation (cont’d)
2. New features (not completely noise resistant)
   - LPC, MEL, PLP (projection concept)
*All of the above use power/amplitude as the speech feature
*Why don't we use phase information as a feature? Phase information may be helpful in speech recognition.
GDF --- What/How
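*GDF is derived from the normalized autocorrelation of a short segment of a signal
$$\log R(\cdot) = \log\Big(1 + \sum_{n=1}^{\infty} r(n) e^{-j\omega n}\Big) \quad (\#.1)$$
where $r(n)$ is the normalized autocorrelation of a short segment of a signal
$$R(\cdot) = |R(\cdot)|\, e^{j \arg(R(\cdot))}$$
$$\log R(\cdot) = \log|R(\cdot)| + j \arg(R(\cdot)) \quad (\#.2)$$
Compare (#.1) & (#.2)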
GDF --- What/How (cont’d)
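To first order, $\log(1 + x) \approx x$, so
$$\log\Big(1 + \sum_{n=1}^{\infty} r(n) e^{-j\omega n}\Big) \approx \sum_{n=1}^{\infty} r(n) e^{-j\omega n} = \sum_{n=1}^{\infty} r(n)\cos(\omega n) - j \sum_{n=1}^{\infty} r(n)\sin(\omega n)$$
$$\arg(R(\cdot)) = -\sum_{n=1}^{\infty} r(n)\sin(\omega n)$$
$$r(n) = 0, \quad n < 0$$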
GDF --- What/How (cont’d)
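$$\mathrm{GDF} = -\frac{\partial}{\partial \omega}\arg(R(\cdot)) = \sum_{n=1}^{\infty} r(n)\,[n\cos(\omega n)]$$
Truncated version of GDF (p = 10~30, easy to implement):
$$\mathrm{GDF} = \sum_{n=1}^{p} \{ r(n)\,[n\cos(\omega n)] \}\, w(n)$$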
GDF --- What/How (cont’d)
where
$$w(n) = 0.5 + 0.5\cos\left(\frac{2\pi n}{P}\right), \quad 1 \le n \le p$$
Hanning window
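A minimal sketch of the truncated GDF, assuming it has the form of a sum over n = 1..p of n r(n) w(n) cos(n omega) with a Hann-shaped lag taper. The function names, the biased autocorrelation estimate and the default window period P = 2p (so the taper falls to zero at lag p) are assumptions made for this example, not details confirmed by the slides.

```python
import numpy as np

def normalized_autocorrelation(frame, p):
    """r(0..p) of one frame, normalized so that r(0) = 1."""
    r = np.array([np.dot(frame[:frame.size - n], frame[n:])
                  for n in range(p + 1)])
    return r / r[0]

def truncated_gdf(r, omegas, window=None):
    """GDF(omega) = sum_{n=1}^{p} n * r(n) * w(n) * cos(n * omega)."""
    p = r.size - 1
    n = np.arange(1, p + 1)
    if window is None:
        P = 2 * p  # assumed window period
        window = 0.5 + 0.5 * np.cos(2.0 * np.pi * n / P)
    weights = n * r[1:] * window
    # One row per frequency, one column per lag.
    return np.cos(np.outer(omegas, n)) @ weights

frame = np.sin(0.3 * np.arange(200))   # stand-in signal segment
r = normalized_autocorrelation(frame, p=12)
omegas = np.linspace(0.0, np.pi, 64)
gdf = truncated_gdf(r, omegas)
```

Passing an explicit `window` array of length p replaces the assumed taper, which makes it easy to compare different lag windows.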
GDF --- Why & Experiment
*Frame length = 5 ms, frame shift = 1 ms; the modified autocorrelation sequence is averaged over 20 frames, then the GDF is computed as defined above
GDF --- Why & Experiment (cont’d)
GDF --- Experiment
*Isolated-digit recognition
        Clean     Noisy
SI      97%       95%       YES
SD      96.5%     94.5%     NO
Due to large dynamic range?