
1

Speech enhancement in nonstationary noise environments using noise properties

Kotta Manohar, Preeti Rao

Department of Electrical Engineering, Indian Institute of Technology, Powai, Bombay 400 076, India

Presenter: Shih-Hsiang( 士翔 )

SPEECH COMMUNICATION 48 (2006)

2

Reference

K. Manohar and P. Rao, "Speech enhancement in nonstationary noise environments using noise properties," Speech Communication 48 (2006)

V. Stahl, A. Fischer, and R. Bippus, "Quantile based noise estimation for spectral subtraction and Wiener filtering," in Proc. ICASSP, 2000, vol. 3, pp. 1875–1878

M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proc. ICASSP, 1980, pp. 208–211

3

Introduction

Single-channel speech enhancement algorithms are generally based on short-time spectral attenuation (STSA)

A spectral gain is applied to each frequency bin in a short-time frame of the noisy speech signal; the gain is adjusted individually as a function of the relative local SNR at each frequency

Examples: spectral subtraction (SS), MMSE short-time spectral amplitude estimator

Low SNR regions are attenuated relative to high SNR regions

A good estimate of the instantaneous noise spectrum is crucial in the estimation of the local SNR

A common method of noise estimation involves the use of a voice activity detector (VAD) to detect the pauses in speech

The noise estimate is then obtained by recursively smoothed adaptation of the noise spectrum during the detected pauses

4

Introduction (cont.)

In stationary background noise, such an estimator is generally reliable

However, nonstationary noises cannot be tracked adequately by a recursive noise estimation method that adapts only during detected speech pauses

E.g. factory noise, battlefield noise

Even if the VAD is reliable, changes in the noise spectrum occurring during active speech cannot influence the noise estimate in a timely manner

STSA-based algorithms are effective only in suppressing the stationary noise component, generally leaving noise bursts unattenuated in the enhanced speech

5

Introduction (cont.)

This paper proposes a method that exploits known differences in the spectro-temporal properties of noise and speech to selectively attenuate noisy time-frequency regions remaining in STSA-enhanced signals

6

Suppressing nonstationary noise

The proposed solutions generally fall into two categories

Improvements to the noise estimator

Modification of the suppression rule

A number of methods for noise spectrum estimation without explicit speech pause detection have been proposed

They are based on tracking some statistic (e.g. minimum, median) of past power spectral values for each frequency bin over several frames (e.g. QBNE)

However, the buffer length necessary to bridge peaks of speech activity makes it difficult to follow any rapid variations in the noise spectrum

7

Suppressing nonstationary noise (cont.)

A brief introduction to QBNE (Quantile Based Noise spectrum Estimation)

Even in speech sections of the input signal, not all frequency bands are permanently occupied by speech; for much of the time the energy in a given frequency band is at the noise level

The noise estimate N(ω) is obtained by taking the q-th quantile over time in every frequency band

For every ω, the frames of the entire utterance X(ω,t), t = 0,…,T are sorted such that X(ω,t_0) ≤ X(ω,t_1) ≤ … ≤ X(ω,t_T). The q-quantile noise estimate is defined as

$$N(\omega) = X(\omega, t_{\lfloor qT \rfloor})$$
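The quantile rule above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `quantile_noise_estimate` and the array layout (frequency bins along rows, frames along columns) are hypothetical, and the whole utterance is sorted at once rather than a sliding buffer.

```python
import numpy as np

def quantile_noise_estimate(power_spec, q=0.5):
    """Hypothetical helper: q-quantile noise estimate per frequency bin.

    power_spec: array of shape (num_bins, num_frames) holding X(w, t).
    Returns an array of shape (num_bins,) holding N(w).
    """
    power_spec = np.asarray(power_spec, dtype=float)
    num_frames = power_spec.shape[1]
    # Sort each frequency bin over time: X(w, t_0) <= ... <= X(w, t_T)
    sorted_spec = np.sort(power_spec, axis=1)
    # Frames are indexed t = 0..T, so T = num_frames - 1
    idx = int(np.floor(q * (num_frames - 1)))
    return sorted_spec[:, idx]
```

With q = 0.5 this reduces to a per-bin median over time, which is why a short, energetic burst in an otherwise quiet bin does not raise the estimate.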

8

Suppressing nonstationary noise (cont.)

QBNE method: a buffer of 0.64 s duration and a quantile value of 0.5

Factory noise is nonstationary in nature, with a stationary noise background and occasional random bursts that cause sudden peaks in the instantaneous noise power spectrum

The VAD estimator tracks the noise burst level only when speech is absent

The QBNE estimator responds to the noise burst only approximately and with a delay

These direct noise estimation methods fail in conditions such as factory noise

9

Suppressing nonstationary noise (cont.)

A different approach is to carry out the adaptation of noise during both speech absence and presence via a speech absence probability based on an estimate of SNR (Malah et al., 1999; Cohen, 2003)

Any sudden increase in the background noise level is not easily distinguished from speech and results in a high estimated SNR, making the method relatively less effective in highly nonstationary noise

No direct method can track highly nonstationary noises accurately, even if the noise estimate is updated in every frame

10

Suppressing nonstationary noise (cont.)

Cooke et al. (2001) propose missing data methods for robust ASR

A two-stage approach is used

Spectral subtraction is employed to suppress the stationary noise component

The recognition processor is conditioned on the estimated reliability of spectro-temporal regions of the signal, as determined by various speech spectrum cues

It is difficult to detect unreliable regions when the nonstationary noise component is intermittent and impulsive

A similar concept applicable to speech enhancement is the use of statistical models of clean speech, or trained codebooks, in which a priori information in the form of spectral envelope shapes is stored for both speech and noise

A joint or iterative optimization over the assumed speech and noise models is carried out for each frame of noisy speech to determine the noise estimate

The performance would be expected to depend critically on a good match between training and actual usage conditions

11

Suppressing nonstationary noise (cont.)

This paper targets a robust algorithm for the suppression of random noise bursts with minimal speech distortion

It uses available knowledge to distinguish between speech and noise in order to identify, and further attenuate, unreliable spectro-temporal regions in signals enhanced by traditional STSA

Achieving improved speech quality with this approach requires solutions to two problems:

Determining reliable cues for identifying noisy spectro-temporal regions

Finding a suitable suppression rule applicable to the detected noisy regions so as to achieve significant reduction of noise with minimal speech distortion

12

Proposed post-processing algorithm

The proposed post-processing algorithm involves identifying regions in the spectrogram of the STSA-enhanced speech that are dominated by residual noise

These regions are selectively attenuated further with the goal of improving the overall quality of the enhanced speech

The post-processing scheme thus comprises the following steps:

1. Divide the spectrum of each frame of the STSA-enhanced speech into several, possibly overlapping, frequency bands, in view of the fact that the noise spectrum may be localized in frequency

2. Carry out speech/noise classification to detect frequency bands that are dominated by residual noise

3. Using a suitable suppression rule, attenuate the spectral values in the identified noisy bands

13

Proposed post-processing algorithm (cont.)

The suppression rule should ideally depend on the bin SNR in such a manner as to apply more attenuation in low SNR regions

This would help to minimize speech distortion while achieving an overall improvement in the SNR

If the identification of noisy frequency bands in Step 2 is reasonably reliable, a local SNR increase in an identified nonspeech bin would signal the onset of a noise burst

An appropriate definition for the estimated SNR is given by the "average a priori SNR", computed as a weighted sum of a current SNR term and a previous SNR term

$$\xi(k) = (1-\alpha)\,\frac{|S_{est}(k)|^2}{|\hat{D}(k)|^2} + \alpha\,\frac{|S_{prev}(k)|^2}{|\hat{D}_{prev}(k)|^2}$$

where

$$|S_{est}(k)|^2 = \mathrm{Max}\left(|Y(k)|^2 - |\hat{D}(k)|^2,\ 0\right)$$

The first term is the current SNR and the second term is the previous SNR; α is the weighting factor between them

|D̂(k)|² is the average noise power spectrum estimate as obtained from the noise estimator of the STSA
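As a rough sketch of the average a priori SNR described on this slide, the current-frame and previous-frame SNR terms can be combined per bin as below. The function name, the argument names and the default weight `alpha=0.98` are assumptions for illustration, not values taken from the slides; a small constant is added to the denominators purely to avoid division by zero.

```python
import numpy as np

def average_a_priori_snr(noisy_power, noise_power,
                         prev_speech_power, prev_noise_power, alpha=0.98):
    """Hypothetical helper: weighted sum of current and previous SNR per bin.

    noisy_power:       |Y(k)|^2 of the current frame
    noise_power:       noise power estimate for the current frame
    prev_speech_power: speech power estimate from the previous frame
    prev_noise_power:  noise power estimate from the previous frame
    alpha:             weighting factor (assumed value, not from the slides)
    """
    eps = 1e-12  # guards against division by zero
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    # |S_est(k)|^2 = Max(|Y(k)|^2 - noise estimate, 0)
    est_speech_power = np.maximum(noisy_power - noise_power, 0.0)
    current_snr = est_speech_power / (noise_power + eps)
    previous_snr = (np.asarray(prev_speech_power, dtype=float)
                    / (np.asarray(prev_noise_power, dtype=float) + eps))
    return (1.0 - alpha) * current_snr + alpha * previous_snr
```

The result is a linear power ratio; it would need converting with 10·log10 before being compared against thresholds expressed in dB.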

14

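Proposed post-processing algorithm (cont.)

The attenuation factor λ(k) is varied linearly with the estimated a priori SNR ξ(k) in dB, but restricted to the range 0.05 to 0.9

$$\lambda(k) = \begin{cases} 0.05, & \xi(k) \le \mathrm{SNR\_low} \\ f_0 + s\,\xi(k), & \mathrm{SNR\_low} < \xi(k) < \mathrm{SNR\_high} \\ 0.9, & \xi(k) \ge \mathrm{SNR\_high} \end{cases}$$

f_0 is the value at 0 dB SNR, and s is the slope of the line

[Figure: attenuation factor versus SNR (dB), rising linearly from 0.05 at SNR_low to 0.9 at SNR_high]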
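A minimal sketch of this clipped linear mapping follows. It assumes the line joins the two end points shown in the figure (0.05 at SNR_low and 0.9 at SNR_high), from which the slope s and the 0 dB intercept f0 follow; the function name and the example thresholds in the usage note are hypothetical.

```python
import numpy as np

def attenuation_factor(xi_db, snr_low, snr_high, lam_min=0.05, lam_max=0.9):
    """Hypothetical helper: attenuation factor from the a priori SNR in dB."""
    # Slope and intercept of the line through (snr_low, lam_min)
    # and (snr_high, lam_max); this continuity is an assumption
    s = (lam_max - lam_min) / (snr_high - snr_low)
    f0 = lam_min - s * snr_low  # value of the line at 0 dB
    lam = f0 + s * np.asarray(xi_db, dtype=float)
    # Restrict the factor to [lam_min, lam_max]
    return np.clip(lam, lam_min, lam_max)
```

For example, with the made-up thresholds `snr_low=-5` and `snr_high=15`, an SNR of 5 dB sits halfway along the line and maps to 0.475.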

15

Proposed post-processing algorithm (cont.)

The suppression rate can be controlled by varying the parameters 'SNR_low' and 'SNR_high'

After obtaining the attenuation factors, recalculate the speech estimate of the i-th 'noisy band' as follows, limiting the value to a spectral floor

$$|S_{i,final}(k)|^2 = \begin{cases} \lambda(k)\,|S_{i,STSA}(k)|^2, & \text{if } \lambda(k)\,|S_{i,STSA}(k)|^2 > \beta\,|\hat{D}(k)|^2 \\ \beta\,|\hat{D}(k)|^2, & \text{otherwise} \end{cases}$$

where β is the spectral floor gain parameter
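The floor-limited attenuation of a noisy band might be sketched as follows. The function name, the symbol `beta` for the spectral floor gain parameter and its default value are assumptions for illustration; the inputs are power spectra of the bins in one identified noisy band.

```python
import numpy as np

def apply_band_attenuation(stsa_power, lam, noise_power, beta=0.01):
    """Hypothetical helper: attenuate a noisy band, limited to a spectral floor.

    stsa_power:  STSA-enhanced power spectrum of the bins in the band
    lam:         attenuation factor per bin
    noise_power: noise power estimate per bin
    beta:        spectral floor gain (assumed value, not from the slides)
    """
    attenuated = np.asarray(lam, dtype=float) * np.asarray(stsa_power, dtype=float)
    floor = beta * np.asarray(noise_power, dtype=float)
    # Keep the attenuated value unless it falls below the spectral floor
    return np.where(attenuated > floor, attenuated, floor)
```

Limiting to a floor rather than to zero is the same idea used in Berouti-style spectral subtraction: it leaves a low, even residual instead of isolated spectral holes.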

16

Spectral flatness based classifiers

Based on the assumption that the STSA-enhanced speech contains primarily harmonic speech and frequency-localized noise bursts

Let X[k] denote the magnitude spectrum values computed via a DFT. The i-th frequency band comprises L frequency bins with bin index k in the range [b_i, e_i]

For instance, with a 256-point DFT at a sampling frequency of 8 kHz, the 0–1 kHz band is bounded by the bin indices b_i = 0 and e_i = 31

The measures investigated are:

SFM (spectral flatness measure): defined as the ratio of the geometric mean to the arithmetic mean of the magnitude spectrum values

$$SFM_i = \frac{\left(\prod_{k=b_i}^{e_i} X[k]\right)^{1/L}}{\frac{1}{L}\sum_{k=b_i}^{e_i} X[k]}$$

It takes low values for harmonic regions representing speech, and high values for noise-dominated regions, which have a relatively flat spectrum
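The band bounds and the SFM can be illustrated with a short NumPy sketch. The helper names are hypothetical; the geometric mean is computed in the log domain, and a tiny constant is added so that zero-valued bins do not produce log(0).

```python
import numpy as np

def band_bin_range(f_low_hz, f_high_hz, n_fft=256, fs=8000):
    """Hypothetical helper: bin indices (b, e) covering [f_low_hz, f_high_hz)."""
    bin_width = fs / n_fft  # 31.25 Hz for a 256-point DFT at 8 kHz
    b = int(np.ceil(f_low_hz / bin_width))
    e = int(np.ceil(f_high_hz / bin_width)) - 1
    return b, e

def spectral_flatness(mag, b, e):
    """Ratio of geometric mean to arithmetic mean of X[k], k = b..e."""
    band = np.asarray(mag[b:e + 1], dtype=float) + 1e-12
    geometric_mean = np.exp(np.mean(np.log(band)))
    arithmetic_mean = np.mean(band)
    return geometric_mean / arithmetic_mean
```

`band_bin_range(0, 1000)` gives bins 0 to 31, matching the example on the slide. A flat band gives an SFM close to 1, while a band holding a single strong peak gives a value close to 0.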

17

Spectral flatness based classifiers (cont.)

Energy-normalized variance: the harmonic structure, or deviation from flatness, of the spectrum in any chosen frequency band is reflected in the energy-normalized variance of the spectral values

$$var\_n_i = \frac{\sum_{k=b_i}^{e_i} \left(X[k] - \bar{X}_i\right)^2}{\sum_{k=b_i}^{e_i} \left(X[k]\right)^2}$$

where X̄_i is the mean of X[k] over the band. It takes high values for harmonic regions representing speech, and low values for noise-dominated regions

Entropy: a related measure is "entropy" as used in the VAD of Renevey and Drygajlo (2001), on the assumption that the signal spectrum is more organized during speech segments than during noise segments

$$H_i = -\frac{1}{\log(L)} \sum_{k=b_i}^{e_i} P(|X(k)|^2)\,\log\left(P(|X(k)|^2)\right)$$

where

$$P(|X(k)|^2) = \frac{|X(k)|^2}{\sum_{k'=b_i}^{e_i} |X(k')|^2}$$

H takes its maximum value of '1' when the signal is white noise, and its minimum value of '0' when it is a pure tone (sinusoid). Hence, the entropy-based method is well suited for speech detection in white or quasi-white noise
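Both measures on this slide can be sketched directly from their definitions. The function names are hypothetical, `mag` is a magnitude spectrum as a NumPy array, and [b, e] are the band's bin indices; small constants guard against division by zero.

```python
import numpy as np

def energy_normalized_variance(mag, b, e):
    """Variance of X[k] over the band, normalized by the band energy."""
    band = np.asarray(mag[b:e + 1], dtype=float)
    return np.sum((band - band.mean()) ** 2) / (np.sum(band ** 2) + 1e-12)

def spectral_entropy(mag, b, e):
    """Entropy of the normalized power spectrum, scaled to lie in [0, 1]."""
    power = np.asarray(mag[b:e + 1], dtype=float) ** 2
    p = power / (np.sum(power) + 1e-12)
    nonzero = p > 0  # treat 0 * log(0) as 0
    return -np.sum(p[nonzero] * np.log(p[nonzero])) / np.log(len(power))
```

A perfectly flat band gives an entropy of about 1 and a variance of 0; a band containing a single nonzero bin gives an entropy of about 0 and a variance close to 1, which is the behaviour the slide describes for noise-like and tone-like spectra.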

18

Experimental comparison of classifier

A comparative evaluation of the different classifiers can be achieved by experimental observations in a typical application situation, i.e. by comparing the receiver operating characteristics (ROC), or plots of hit rate versus false-alarm rate

A better classifier would be characterized by a lower false-alarm rate for a given hit rate

The steepness or slope of the ROC curves determines the suitability of the feature in terms of providing an adequate level of discrimination between speech and noise
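To make the hit-rate versus false-alarm-rate comparison concrete, the sketch below sweeps a decision threshold over a feature and reports one ROC point per threshold. The function name, the argument names and the `noise_is_high` switch are assumptions; it expects ground-truth labels marking which bands are truly noise-dominated, with at least one band in each class.

```python
import numpy as np

def roc_points(feature, is_noise, thresholds, noise_is_high=True):
    """Hypothetical helper: (threshold, hit rate, false-alarm rate) triples.

    feature:       one feature value per band (e.g. SFM or entropy)
    is_noise:      True where the band is truly noise-dominated
    noise_is_high: True if large feature values indicate noise
    """
    feature = np.asarray(feature, dtype=float)
    is_noise = np.asarray(is_noise, dtype=bool)
    points = []
    for th in thresholds:
        detected = feature > th if noise_is_high else feature < th
        hit_rate = np.mean(detected[is_noise])           # noisy bands flagged
        false_alarm_rate = np.mean(detected[~is_noise])  # speech bands flagged
        points.append((float(th), float(hit_rate), float(false_alarm_rate)))
    return points
```

For a feature such as the energy-normalized variance, where low values indicate noise, `noise_is_high=False` would be passed instead.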

19

Experimental comparison of classifier (cont.)

ROC plots of the energy-normalized variance, SFM and entropy in the detection of noisy regions for factory noise-corrupted speech at 0 dB SNR

20

Experimental evaluation

The performance is evaluated for three real environmental noises, viz. factory noise, machine gun noise, and train interior noise

All three noises are highly fluctuating, characterized by random energetic bursts

Two standard STSA algorithms are chosen as the front-end STSA algorithms

Berouti spectral subtraction (BSS)

Multiplicatively modified log spectral amplitude estimator (MM-LSA)

In all experiments, a 32 ms Hamming window with 50% overlap is applied to speech sampled at 8 kHz. The spectrum is computed using a 256-point DFT
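The analysis settings above translate to 256-sample frames with a hop of 128 samples. A minimal framing sketch, assuming a NumPy array input at least one frame long and using a hypothetical function name, could look like this:

```python
import numpy as np

def stft_magnitude(signal, fs=8000, frame_ms=32, overlap=0.5, n_fft=256):
    """Hypothetical helper: magnitude spectra of Hamming-windowed frames."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(fs * frame_ms / 1000)   # 256 samples at 8 kHz
    hop = int(frame_len * (1 - overlap))    # 128 samples for 50% overlap
    window = np.hamming(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(num_frames)])
    # One-sided DFT: n_fft // 2 + 1 = 129 bins per frame
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
```

Each row of the result is one frame; with these settings the bin spacing is 31.25 Hz, so a 1 kHz tone peaks at bin 32.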

21

Experimental evaluation (cont.)

Noise properties and post-processing parameter settings

Factory noise: contains randomly occurring events such as hammer blows embedded in a more homogeneous background noise

Machine gun noise: a series of gunshots recorded in a quiet environment; to make it more realistic, a white background noise is added

Train noise: sound recorded in the interior of an Indian electric train with windows open (i.e. the noise arises from the moving mechanical parts of the train)

22

Experimental evaluation (cont.)

Spectrograms of segments of (a) factory, (b) train and (c) machinegun noise

23

Experimental evaluation (cont.) Noise properties and post processing parameter settings

The frequency bandwidth for the variance-based noise detection is selected to provide a high frequency resolution for noisy region detection

The choice of decision threshold for the detection of noise-dominated bands should be based on the desired hit rate or tolerable false-alarm rate. A low false-alarm rate helps to minimize speech distortion

The parameters SNR_low and SNR_high determine the amount of attenuation as a function of the estimated a priori SNR

24

Experimental evaluation (cont.)

Measuring speech quality improvement

Naturalness and intelligibility of the speech output are important attributes of the performance of any speech enhancement system

Since achieving a high degree of noise suppression is often accompanied by speech signal distortion, it is important to evaluate both quality and intelligibility

Subjective listening tests are the best indicators of achieved overall quality

A–B comparison tests of sentences processed by competing processing methods can be used to obtain comparative quality rankings

The chief attribute tested here is the naturalness or overall quality of the processed speech

Speech intelligibility is tested by the SUS (semantically unpredictable sentences) test, originally proposed for evaluating synthetic speech (Benoit et al., 1996)

25

Semantically Unpredictable Sentences (SUS)

Comparative evaluation of sentence intelligibility, minimizing the effect of contextual cues

Short, semantically unpredictable sentences of five different, common syntactic structures, with words randomly selected from lexicons of frequent "mini-syllabic" words (the smallest words available in a given category):

Subject - Verb - Adverbial, e.g., The table walked through the blue truth

Subject - Verb - Direct object, e.g., The strong way drank the day

Adverbial - Transitive verb - Direct object (imperative), e.g., Never draw the house and the fact

Q-word - Transitive verb - Subject - Direct object, e.g., How does the day love the bright word?

Subject - Verb - Complex direct object, e.g., The place closed the fish that lived.

26

Experimental evaluation (cont.)

Overall quality ranking is obtained by A–B comparison involving four listeners and eight distinct sentences from the TIMIT database (Fisher et al., 1986), each from a different speaker (four male and four female)

Each sentence pair presented for listening comparison comprises the processed versions of a single sentence, before and after post-processing

To avoid bias, the order of A and B is interchanged and randomized across sentences and listeners

Speech intelligibility is tested by the SUS test

Thirty SU sentences, six of each of the five syntactic structures, were generated and played in random order to each of four listeners, who were asked to write down the sentences they heard

To avoid listener familiarity with a specific noise sample, segments of the noise file to be added to the sentences were chosen randomly from a larger noise sample and digitally added to the clean speech

27

Experimental evaluation (cont.)

There are a large number of objective measures that quantify the degradation in quality of processed speech with respect to a reference speech sample

However, not all objective measures may be appropriate for specific kinds of distortion

PESQ and WSS are used in the experiments to measure quality gains, if any, achieved due to post-processing

28

Weighted Spectral Slope Measure

The weighted spectral slope (WSS) measure is based on an auditory model in which 36 overlapping filters of progressively larger bandwidth are used to estimate the smoothed short-time speech spectrum

The measure finds a weighted difference between the spectral slopes in each band

The magnitude of each weight reflects whether the band is near a spectral peak or valley, and whether the peak is the largest in the spectrum

$$d = K_s\,(K - \hat{K}) + \sum_{k=1}^{36} w(k)\,\left[S(k) - \hat{S}(k)\right]^2$$

(K − K̂) is the difference between the overall sound pressure level of the original and processed utterances. K_s is a parameter which can be varied to increase the overall performance.
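As a rough illustration, the distance for one frame can be evaluated as below, assuming the commonly quoted form d = K_s (K − K̂) + Σ w(k) [S(k) − Ŝ(k)]² over the 36 bands. The function and argument names are hypothetical, and the band slopes and weights are taken as given inputs, since the auditory filterbank and weighting rules are not spelled out on the slide.

```python
import numpy as np

def wss_distance(slopes_ref, slopes_proc, weights, spl_ref, spl_proc, k_s=1.0):
    """Hypothetical helper: weighted spectral slope distance for one frame.

    slopes_ref, slopes_proc: spectral slopes S(k) of the original and
                             processed speech in each band
    weights:                 per-band weights w(k)
    spl_ref, spl_proc:       overall sound pressure levels K of each signal
    k_s:                     tunable scaling parameter (assumed default)
    """
    slopes_ref = np.asarray(slopes_ref, dtype=float)
    slopes_proc = np.asarray(slopes_proc, dtype=float)
    weights = np.asarray(weights, dtype=float)
    level_term = k_s * (spl_ref - spl_proc)
    slope_term = np.sum(weights * (slopes_ref - slopes_proc) ** 2)
    return float(level_term + slope_term)
```

A smaller distance means the processed speech has spectral slopes closer to those of the reference, which is why a decrease in WSS is read as a quality improvement.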

29

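PESQ MOS

Mean Opinion Score (MOS)

The mean opinion score (MOS) is used to measure clarity. Received speech samples are rated by a group of listeners on a five-level scale according to the perceived call quality: 1 is the worst, 5 is the best, and 4 corresponds to the call quality of the ordinary public telephone network. It is difficult to establish an objective standard with MOS

Perceptual Evaluation of Speech Quality (PESQ)

This technique combines the strengths of two methods, PSQM and PAMS: the perceptual model of PSQM and the time-alignment routine of PAMS. As a result, the PESQ score correlates more highly with MOS

The PSQM algorithm rates clarity on a scale from 0 to 6.5, where a lower number indicates better call quality

PAMS produces two scores, a listening quality score (Ylq) and a listening effort score (Yle), both on a scale from 0 to 15, where a higher number indicates better quality. Like the PSQM clarity score, the listening quality score mainly assesses how similar the speech signal received by the listener is to the original signal. The listening effort score is aimed at signals so severely distorted that they cannot be assessed in terms of sound quality; it therefore assesses how much effort the listener must spend to understand the message carried by the severely distorted speech signal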

30

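The four steps of clarity assessment

The first step is to time-align the reference (original) signal with the received signal

The second step is gain scaling of the reference and received signals so that both have the same power

The third step converts the time-domain signals to the frequency domain and groups the resulting spectra into bands (bins) set according to the nonlinear relationship between human hearing and frequency. Bands set according to the Bark scale reflect the greater sensitivity of human hearing at low frequencies, so the bands are narrower at the low-frequency end and wider at the high-frequency end

The final step is the key analysis. A perceptual model is used to compare and process the contents of the bands to determine their importance and differences for human hearing; the result provides a clarity score for comparing the differences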

31

The four steps of clarity assessment (cont.)

32

Result and discussion

There is a clear listener preference for the post-processed speech over that before post-processing

The percentage word intelligibility scores averaged across the listeners are 60.7, 51.7 and 50.6 at 3 dB SNR for the three configurations of noisy, BSS and BSS + PP, respectively

33

Result and discussion (cont.)

Narrowband spectrograms of (a) clean, (b) noisy, (c) BSS-enhanced speech and (d) after post-processing, for a speech segment in factory noise

34

Result and discussion (cont.)

The WSS distance indicates a consistent decrease (implying an improvement in quality) with post-processing from that obtained with STSA enhancement alone

The PESQ MOS, on the other hand, is consistent with the subjectively perceived trend of an improvement in speech quality with STSA enhancement over that of noisy speech

Both objective measures indicate that post-processing has a greater influence at the lower SNRs than at the higher SNRs

35

Result and discussion (cont.)

The performance gains due to post-processing do not change significantly with changes in the algorithm parameters

36

Conclusion

Traditional STSA speech enhancement algorithms perform inadequately when applied to speech corrupted by highly nonstationary noise

With limited added complexity, the post-processing algorithm is effective in significantly reducing the perceived effects of the noise bursts at low SNRs without further speech distortion

While the onsets of noise bursts are greatly attenuated, bursts of long duration are not suppressed completely, owing to the difficulty of reliably classifying bins as speech-dominated or noise-dominated within an identified noise burst band
