
Survey of Robust Speech Techniques in ICASSP 2009

Shih-Hsiang Lin (林士翔)


Introduction

• Stereo-based stochastic mapping (SSM) is a front-end, data-driven technique for noise robustness
– It assumes a joint GMM in the stereo feature space
– The mapping between clean and noisy features is estimated from the GMM to compensate the noisy features
• SSM can be estimated under various criteria
– Maximum A Posteriori (MAP): iteratively optimized
– Minimum Mean Square Error (MMSE): closed-form solution
• Moreover, the SSM-compensated features are further modeled by multi-style MPE training


Noise Robustness in feature space (1/2)

• Compared to model-space robust speech techniques, feature-space noise-robust techniques have the advantages of
– low computational complexity
– being easy to decouple from the acoustic model
• Front-end computation of an IBM LVCSR system with MFCC features
– The computation evolves through various feature spaces
• linear spectral space, Mel spectral space, cepstral space, discriminatively trained feature space


Noise Robustness in feature space (2/2)

• Depending on the nature of the algorithm, feature-space noise-robust techniques apply compensation in different spaces
– spectral subtraction -> linear spectral
– phase-sensitive feature enhancement -> log Mel spectral
– data-driven approaches -> can be flexibly applied to different feature spaces (e.g., MFCC, LDA or fMPE)


SSM and Discriminative Training (1/6)

• SSM is based on stereo features that are concatenations of clean-speech and noisy-speech feature vectors

• Define z = (x, y) as the joint stereo feature vector. A GMM is assumed and trained by the EM algorithm

p(z) = \sum_{k=1}^{K} c_k \, \mathcal{N}(z;\, \mu_{z,k}, \Sigma_{zz,k})

where

\mu_{z,k} = \begin{bmatrix} \mu_{x,k} \\ \mu_{y,k} \end{bmatrix}, \qquad \Sigma_{zz,k} = \begin{bmatrix} \Sigma_{xx,k} & \Sigma_{xy,k} \\ \Sigma_{yx,k} & \Sigma_{yy,k} \end{bmatrix}

– x and y are obtained by fMPE training on the LDA features
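As a concrete sketch of the stereo construction and the block partitioning described above (a hypothetical numpy illustration, not the paper's implementation):

```python
import numpy as np

def make_stereo_features(x_clean, y_noisy):
    """Stack time-aligned clean and noisy feature frames into joint
    stereo vectors z = (x, y) for training the joint GMM."""
    assert x_clean.shape == y_noisy.shape
    return np.concatenate([x_clean, y_noisy], axis=1)

def split_component(mu_z, sigma_z, dim):
    """Partition a joint-GMM component mean/covariance into the
    blocks (mu_x, mu_y, S_xx, S_xy, S_yx, S_yy) used by SSM."""
    mu_x, mu_y = mu_z[:dim], mu_z[dim:]
    s_xx, s_xy = sigma_z[:dim, :dim], sigma_z[:dim, dim:]
    s_yx, s_yy = sigma_z[dim:, :dim], sigma_z[dim:, dim:]
    return mu_x, mu_y, s_xx, s_xy, s_yx, s_yy
```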


SSM and Discriminative Training (2/6): MMSE-based SSM

• Given the observed noisy speech feature y, the MMSE estimate of the clean speech x is given by

\hat{x} = E[x \mid y] = \sum_k p(k \mid y)\,(A_k y + b_k)

where

A_k = \Sigma_{xy,k}\, \Sigma_{yy,k}^{-1}, \qquad b_k = \mu_{x,k} - \Sigma_{xy,k}\, \Sigma_{yy,k}^{-1} \mu_{y,k}

and p(k \mid y) is the posterior probability computed against p(y), the marginal noisy-speech distribution of the joint stereo distribution
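A minimal numpy sketch of the closed-form MMSE compensation above, assuming the joint-GMM block parameters are already trained (the function names are illustrative, not from the paper):

```python
import numpy as np

def gauss_pdf(v, mu, cov):
    """Full-covariance Gaussian density N(v; mu, cov)."""
    d = v - mu
    k = len(v)
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / \
        np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))

def mmse_ssm(y, c, mu_x, mu_y, s_xy, s_yy):
    """x_hat = sum_k p(k|y) (A_k y + b_k), with
    A_k = S_xy S_yy^{-1} and b_k = mu_x - A_k mu_y."""
    K = len(c)
    # posterior p(k|y) against the marginal noisy-speech GMM p(y)
    lik = np.array([c[k] * gauss_pdf(y, mu_y[k], s_yy[k]) for k in range(K)])
    post = lik / lik.sum()
    x_hat = np.zeros_like(y)
    for k in range(K):
        A = s_xy[k] @ np.linalg.inv(s_yy[k])
        b = mu_x[k] - A @ mu_y[k]
        x_hat += post[k] * (A @ y + b)
    return x_hat
```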


SSM and Discriminative Training (3/6): MAP-based SSM

• Given the observed noisy speech feature y, the MAP estimate of the clean speech x is given by

\hat{x} = \arg\max_x p(x \mid y)

• This equation can be solved using the EM algorithm, which results in an iterative estimation process

\hat{x}^{(l+1)} = \Big( \sum_k p(k \mid \hat{x}^{(l)}, y)\, C_k \Big)^{-1} \sum_k p(k \mid \hat{x}^{(l)}, y)\, C_k\, d_k

where

C_k = \Sigma_{x|y,k}^{-1} = \big( \Sigma_{xx,k} - \Sigma_{xy,k}\, \Sigma_{yy,k}^{-1}\, \Sigma_{yx,k} \big)^{-1}

d_k = \mu_{x|y,k} = \mu_{x,k} + \Sigma_{xy,k}\, \Sigma_{yy,k}^{-1} \big( y - \mu_{y,k} \big)


SSM and Discriminative Training (4/6): Mathematical Connections

• The MMSE estimate of SSM is a special tying case of one iteration of the corresponding MAP estimate
– It assumes all Gaussians in the GMM share the same conditional covariance matrix \Sigma_{x|y,k}
– This is a reasonable result of the "averaging" effect of the expectation E[x \mid y] in the MMSE estimate of SSM
• Due to the iterative nature of the MAP estimate of SSM, an initial guess \hat{x}^{(0)} has to be made
– A natural choice would be the noisy speech feature itself, \hat{x}^{(0)} = y
– or setting the MMSE estimate as the starting point, \hat{x}^{(0)} = \hat{x}_{MMSE}


SSM and Discriminative Training (5/6): Mathematical Connections

• SPLICE is a special case of the MMSE estimate of SSM under the assumption that A_k is an identity matrix, which is equivalent to x and y having a perfect correlation

\hat{x} = \sum_k p(k \mid y)\,(y + r_k) = y + \sum_k p(k \mid y)\, r_k

– SPLICE estimates the bias terms r_k under the ML criterion
• Li Deng also gives a connection between SPLICE and fMPE
– fMPE has, under the minimum phone error criterion,

\hat{x} = y + M h(y) = y + \sum_k p(k \mid y)\, m_k

• Both SPLICE and fMPE share a similar piece-wise linear structure with posterior-probability weighting
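A minimal sketch of the SPLICE-style posterior-weighted bias compensation above, with the posteriors assumed precomputed (illustrative names, not from the paper):

```python
import numpy as np

def splice_enhance(y, post, r):
    """SPLICE: x_hat = y + sum_k p(k|y) r_k, an additive posterior-weighted
    bias on the noisy feature (the A_k = I case of the MMSE SSM estimate).
    post: (K,) posteriors p(k|y); r: (K, D) per-component bias vectors."""
    return y + post @ r
```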


SSM and Discriminative Training (6/6): Mathematical Connections

• Therefore, the overall MAP-based SSM estimation in the fMPE space, with the MMSE-based SSM estimate as the starting point, can be expressed as

\hat{x} = f_{MAP}\big( f_{MMSE}\big( f_{fMPE}( y_{LDA} ) \big) \big)

– This amounts to applying a sequence of posterior-probability-weighted piece-wise linear mappings on noisy LDA features
• After the stochastic mapping, the compensated features can be directly decoded by the clean acoustic models
– For better performance, an environment-adaptive multi-style discriminative re-training can be further applied (e.g., MPE)


Experimental Results (1/3)

• LVCSR tasks (a vocabulary of 32k English words)
– Back-end
• 150 hrs / 55k Gaussians / 4.5k states (clean acoustic model)
• 300 hrs / 90k Gaussians / 5k states (multi-style acoustic model)
• noisy data are generated by adding a mix of humvee, tank and babble noise to the clean data at around 15 dB
– Front-end
• 24-dim MFCCs (CMS) -> super-vector (9 frames: 216 dims) -> LDA 40 dims
– GMMs are trained on the noisy training data and the mapping is SNR-specific
• In testing, a GMM-based environment classifier is used to estimate the SNR of each sentence
– The proposed technique is evaluated on two test sets
• Set A: 2070 utterances (around 1.7 hrs) recorded in a clean condition
• Set B: 1421 utterances (around 1.2 hrs) recorded in a real-world noisy condition (with humvee noise running in the background at 5-8 dB)


Experimental Results (2/3)

• All the MAP estimations are run for 3 iterations
• SSM gives the same results for Set A after environment detection
• As the acoustic model is discriminatively trained on clean speech, the baseline result on the Set B noisy data is very poor
– But SSM is able to significantly improve the results
• Compared to SSM MAP, SSM MMSE+MAP (MAP initialized with the MMSE estimate) reduces the WER relatively by 50%


Experimental Results (3/3)

• The baseline with multi-style training in Table 2 improves in the noisy condition (Set B) but degrades in the clean condition (Set A)
• When using compensated features for multi-style training, the performance improves for both Set A and Set B
• It significantly reduces WER in the noisy condition (Set B) while maintaining decent performance in the clean condition (Set A)


Summary and Discussion

• SSM is a data-driven, feature-space noise-robust technique that exploits stereo data. Hence, it has its advantages and disadvantages
– Since it is data-driven and does not rely on a model for feature computation, it is quite flexible to apply to various speech features
• e.g., MFCC, PLP, linear or Mel-spectral space, cepstral space, LDA and fMPE spaces, etc.
– However, stereo data is usually expensive to collect
• A suboptimal alternative, as done in this paper, would be to artificially generate data for the noisy channel
– SSM as a data-driven approach relies on the noise in the training data and may not handle unseen noise very well


Introduction (1/2)

• Recently, several techniques have been proposed which aim to exploit speech-signal properties
– e.g., the spectral peaks being more robust to broad-band noise than the spectral valleys, or harmonicity information
• performing locking of the spectral peak-to-valley ratio
– alleviates the mismatch between clean and noisy features caused by the spectral valleys being buried in noise
• appending the information on spectral peaks to the acoustic model
• modifying the likelihood calculation with the aim of emphasizing the parts of the spectrum corresponding to peaks
• In this paper, they investigate the incorporation of mask modeling into an HMM-based ASR system


Introduction (2/2)

• As the mask expresses which spectro-temporal regions are uncorrupted by noise
– it can also be seen as a generalized and soft incorporation of the spectral-peak information
– the mask model is associated with each HMM state and mixture
• It expresses what mask information the state/mixture would expect to find in the signal
• The mask modeling is performed by employing the Bernoulli distribution
• The incorporation of mask modeling is evaluated in a standard model and in two models that compensate for the effect of noise: a missing-feature model and a multi-condition training model


Incorporating Mask Modeling into HMM-based ASR System (1/6)

• The HMM-based ASR system with the incorporation of mask modeling is formulated as follows

\hat{W} = \arg\max_W P(Y \mid M, S, W)\, P(M \mid S, W)\, P(S \mid W)\, P(W)

where P(Y \mid M,S,W) is the emission probability, P(M \mid S,W) the mask-model probability, P(S \mid W) the HMM state-transition probability, and P(W) the language-model probability

– The term P(Y \mid M,S,W) corresponds to the employment of missing-feature techniques
– The term P(M \mid S,W) expresses how likely the given mask M is to be generated by the HMM state sequence S
• It serves as a penalization factor for states whose mask model is not in agreement with the mask extracted from the given signal


Incorporating Mask Modeling into HMM-based ASR System (2/6)

• How can we estimate the mask model?
– Having an example of the noise
• The mask model can be estimated based on masks obtained from training data corrupted by the given noise
– Having no information about the noise
• It can be estimated by using a mask reflecting some a-priori knowledge about speech
– e.g., the fact that high-energy regions of speech spectra are less likely to be corrupted by noise
• The estimation of the mask model is performed by a separate training procedure carried out after the HMMs have been trained


Incorporating Mask Modeling into HMM-based ASR System (3/6)

• Estimating the mask model for HMM states
– Let m = (m_1, \ldots, m_B) denote the mask vector at a given frame, where m_b is the binary mask information of channel b
– The mask-model probability for each HMM state s and mixture l is modeled by a multivariate Bernoulli distribution

P(m \mid l, s) = \prod_{b=1}^{B} \mu_{b,l,s}^{\,m_b}\, (1 - \mu_{b,l,s})^{1 - m_b}

where \mu_{b,l,s} is the parameter of the distribution
• The parameters \mu_{b,l,s} can be estimated by a Baum-Welch or Viterbi-style training procedure
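The multivariate Bernoulli mask model above can be evaluated in the log domain as follows (a minimal sketch; the clipping epsilon is an added numerical safeguard, not from the paper):

```python
import numpy as np

def bernoulli_log_prob(m, mu, eps=1e-10):
    """log P(m | l, s) for a multivariate Bernoulli mask model.
    m: binary mask vector (B,); mu: per-channel parameters mu_{b,l,s}."""
    mu = np.clip(mu, eps, 1 - eps)  # guard against log(0)
    return float(np.sum(m * np.log(mu) + (1 - m) * np.log(1 - mu)))
```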


Incorporating Mask Modeling into HMM-based ASR System (4/6)

• The Viterbi algorithm is used to obtain the state-time alignment of the sequence of feature vectors on the HMMs
• The posterior probability that mixture component l (at state s) generated the feature vector y_t is then calculated as

P(l \mid y_t, s) = \frac{P(y_t \mid s, l)\, P(l \mid s)}{\sum_{l'} P(y_t \mid s, l')\, P(l' \mid s)}

• The parameters of the mask models are then estimated as

\mu_{b,l,s} = \frac{\sum_{t\,:\,s_t = s} P(l \mid y_t, s)\, m_{t,b}}{\sum_{t\,:\,s_t = s} P(l \mid y_t, s)}
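The parameter update above, restricted to the frames aligned to one state, can be sketched as (illustrative numpy, not the paper's code):

```python
import numpy as np

def estimate_mask_params(masks, gammas):
    """mu_b = sum_t gamma_t m_{t,b} / sum_t gamma_t, where the frames t
    are those Viterbi-aligned to state s and gamma_t = P(l | y_t, s)."""
    masks = np.asarray(masks, dtype=float)    # (T, B) binary masks
    gammas = np.asarray(gammas, dtype=float)  # (T,) mixture posteriors
    return gammas @ masks / gammas.sum()
```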


Incorporating Mask Modeling into HMM-based ASR System (5/6)

• Regions with a high value of the mask-model parameter \mu_{b,l,s} indicate that the masks associated with the given state were often one in those regions, i.e., they were little affected by noise


Incorporating Mask Modeling into HMM-based ASR System (6/6)

• The value of the mask probability, when incorporated into the overall probability calculation, may need to be scaled (akin to language-model scaling)
– by employing a sigmoid function

\hat{P}(m_b \mid l, s) = \frac{1}{1 + e^{-\alpha\,(P(m_b \mid l, s) - 0.5)}}

– The bigger the value of \alpha, the greater the effect of the mask probability on the overall probability
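A minimal sketch of the sigmoid rescaling above (the function name is illustrative):

```python
import math

def scale_mask_prob(p, alpha):
    """Sigmoid rescaling of a mask probability around 0.5; a larger alpha
    increases the influence of the mask term on the overall score."""
    return 1.0 / (1.0 + math.exp(-alpha * (p - 0.5)))
```

The mapping is symmetric around p = 0.5, which it leaves unchanged, while compressing or stretching values above and below it depending on alpha.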


Experimental Results (1/5)

• The experiments were carried out on the Aurora-2 database
– Frequency-filtered logarithmic filter-bank energies were used as the speech feature representation
• due to their suitability for missing-feature-based recognition
– The noisy speech data from Set A were used for the recognition experiments

Experimental Results (2/5)-(5/5)

[Results tables and figures from these slides are not reproduced.]

Introduction


• The idea of the feature-mapping method is to obtain "enhanced" or "clean" features from the "noisy" features
– In theory, the mapping need not be performed between equivalent domains
• In this paper
– they first investigate feature mapping between different domains under the MMSE criterion and regression optimizations
– second, they investigate data-driven filtering for speech separation by using the neural-network-based mapping method


Mapping Approach (1/3)

• Assume that the directions of both the target and the interfering sound sources are available through the use of a microphone array
• The mapping approach takes those two features, s_t and s_i, and maps them to the "clean" recordings
– To allow a non-linear mapping, they use a generic multilayer perceptron (MLP) with one hidden layer to estimate the feature vector of the clean speech

\hat{c}(n) = f\big( s_t(n), s_i(n) \big) = \sum_{p=1}^{P} w_p\, \mathrm{sig}\big( \mathbf{w}_{p,t}^{\top} s_t(n) + \mathbf{w}_{p,i}^{\top} s_i(n) + b_p \big) + b


Mapping Approach (2/3)

• The parameters (w_p, \mathbf{w}_{p,t}, \mathbf{w}_{p,i}, b_p, b) are obtained by minimizing the mean squared error

E = \sum_{n=1}^{N} \big\| c(n) - \hat{c}(n) \big\|^2

– The optimal parameters can be found through the error back-propagation algorithm
– Note that training requires parallel recordings of clean and noisy data, while only the noisy features are needed to estimate the clean data during testing
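The MLP mapping and its MSE objective can be sketched as follows (forward pass only; the weight shapes and names are assumptions, and training by back-propagation is omitted):

```python
import numpy as np

def mlp_forward(s_t, s_i, w1_t, w1_i, b1, w2, b2):
    """One-hidden-layer MLP mapping the two beamformed feature vectors
    s_t, s_i to an estimate of the clean feature vector."""
    h = 1.0 / (1.0 + np.exp(-(w1_t @ s_t + w1_i @ s_i + b1)))  # sigmoid units
    return w2 @ h + b2

def mse(c, c_hat):
    """Squared error summed over feature dimensions, averaged over frames."""
    return float(np.mean(np.sum((np.asarray(c) - np.asarray(c_hat)) ** 2, axis=-1)))
```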


Mapping Approach (3/3)

• With the assumption that the distribution of the target data is Gaussian, minimizing the mean squared error follows from the principle of maximum likelihood
• From the perspective of Blind Source Separation (BSS) and Independent Component Analysis (ICA)
– the principle of maximum likelihood is highly related to the minimization of mutual information between the clean sources
– their methods, however, lead to a linear transformation, and the probability densities of the sources must be estimated correctly


Experimental Data and Setup (1/2)

• The Multichannel Overlapping Numbers Corpus (MONC) was used for the speech recognition experiments
– There are four recording scenarios
• S1 (no overlapping speech), S12 (with 1 competing speaker, L2), S13 (with 1 competing speaker, L3), S123 (with 2 competing speakers, L2 and L3)
– Training data: 6049 utterances; development: 2026 utterances; testing: 2061 utterances
– The MLP is trained on 2,000 utterances drawn from the development set (500 utterances of each recording scenario)


Experimental Data and Setup (2/2)

– In this paper, two delay-and-sum (DS) beamformer-enhanced speech signals are used
– The ASR front-end generates 12 MFCCs and log-energy with corresponding delta and acceleration coefficients


Feature Mapping Between Different Domains (1/3)

• Three domains are selected as the input
– spectral amplitude, log Mel-filterbank energies (log MFBE), and Mel-frequency cepstral coefficients (MFCC)
• As mentioned earlier, target data with a Gaussian distribution are optimal from the point of view of the MMSE
– The PDFs of the amplitudes of the clean speech are far from Gaussian
– The PDFs of the log MFBEs are bi-modal (the lower mode may be due to low-SNR segments)
– The PDFs of the MFCCs are approximately Gaussian


Feature Mapping Between Different Domains (2/3)

Feature Mapping Between Different Domains (3/3)

• In fact, the mapping to MFCCs is more straightforward in the context of the ASR system, in which MFCCs are used as the features
• Furthermore, MMSE in the MFCCs also results in MMSE in the delta coefficients (likewise for the acceleration coefficients)

• Since the delta coefficients are a linear function of the static coefficients, e.g. for a simple difference window

d(n) = \tfrac{1}{2}\big( c(n+2) - c(n-2) \big)

the delta estimation error satisfies

\big\| d(n) - \hat{d}(n) \big\|^2 = \tfrac{1}{4} \big\| \big( c(n{+}2) - \hat{c}(n{+}2) \big) - \big( c(n{-}2) - \hat{c}(n{-}2) \big) \big\|^2 \le \tfrac{1}{2} \Big( \big\| c(n{+}2) - \hat{c}(n{+}2) \big\|^2 + \big\| c(n{-}2) - \hat{c}(n{-}2) \big\|^2 \Big)

so minimizing the MSE of the static MFCCs also minimizes an upper bound on the MSE of the deltas (and likewise for the acceleration coefficients)
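Assuming a simple ±2-frame difference window (the exact delta window used in practice may differ), the linearity of the delta computation can be illustrated as:

```python
import numpy as np

def deltas(c, k=2):
    """Difference-based delta coefficients d(n) = (c(n+k) - c(n-k)) / 2,
    computed framewise with edge padding; this is a linear function of
    the static coefficients, so static MSE bounds the delta MSE."""
    c = np.asarray(c, dtype=float)
    pad = np.pad(c, ((k, k), (0, 0)), mode='edge')
    return (pad[2 * k:] - pad[:-2 * k]) / 2.0
```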


Experimental Results (1/2)

• The mapping of the log MFBEs from the two DS-enhanced speech signals to MFCCs yields the best ASR performance
– The smaller dynamic range of the log MFBE vectors is advantageous for regression optimization
• The gains from model adaptation are marginal
– The mapping methods evaluated are already very effective at suppressing the influence of interfering speakers on the extracted features

without model adaptation

with model adaptation

Experimental Results (2/2)

[Results table from this slide is not reproduced.]