
Page 1: APPROVED BY SUPERVISORY COMMITTEE: Dr. Robert Hunt …Sunil Devdas Kamath, M.S.E.E. The University of Texas at Dallas, 2001 Supervising Professor: Dr. Philipos C. Loizou The corruption

A MULTI-BAND SPECTRAL SUBTRACTION METHOD FOR SPEECH ENHANCEMENT

APPROVED BY SUPERVISORY COMMITTEE:

Dr. Philipos Loizou, Chair.

Dr. Robert Hunt

Dr. Mohammad Saquib


Copyright 2001

Sunil Devdas Kamath

All Rights Reserved


To my parents


A MULTI-BAND SPECTRAL SUBTRACTION METHOD FOR SPEECH ENHANCEMENT

by

SUNIL DEVDAS KAMATH, B.E.

THESIS

Presented to the faculty of

The University of Texas at Dallas

in Partial Fulfillment

of the Requirements

for the Degree of

MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

THE UNIVERSITY OF TEXAS AT DALLAS

December 2001


ACKNOWLEDGEMENTS

I would like to thank my adviser, Dr. Philipos Loizou, for his guidance in my studies and my

work. He has offered me many helpful suggestions on conducting research and writing

technical documents.

I would also like to thank Dr. Robert Hunt and Dr. Mohammad Saquib for their valuable

feedback on this manuscript.

Thanks are also in order to Dr. Emily Tobey for providing me with the opportunity to work

with her wonderful team at the Callier Institute of Communication Disorders / UTD. I would

like to thank Paul Dybala and Amanda Labue for conducting the subject test.

I would like to take this opportunity to express my deepest gratitude to Dr. Neeraj Magotra of

Texas Instruments – Dallas for the invaluable support and guidance he has given me in every

aspect of my student and personal life. I am especially thankful for the wholehearted

confidence that he has shown in my abilities.

And finally to my wife, Sanmati, I would like to acknowledge my deepest appreciation for

her love and caring, for her timely encouragements, for being my driving force and standing

by me through thick and thin.


A MULTI-BAND SPECTRAL SUBTRACTION METHOD FOR SPEECH ENHANCEMENT

Sunil Devdas Kamath, M.S.E.E. The University of Texas at Dallas, 2001

Supervising Professor: Dr. Philipos C. Loizou

The corruption of speech due to the presence of additive background noise causes severe

difficulties in various communication environments. This thesis addresses the problem of

reduction of additive background noise in speech. The proposed approach is a frequency-

dependent speech enhancement method based on the proven spectral subtraction method.

Most implementations and variations of the basic spectral subtraction technique advocate

subtraction of the noise spectrum estimate over the entire speech spectrum. However, real

world noise is mostly colored and does not affect the speech signal uniformly over the entire

spectrum. This thesis explores a Multi-Band Spectral Subtraction (MBSS) approach with

suitable pre-processing of the speech data. Speech is processed into )81( ≤≤ NN

frequency bands and spectral subtraction is performed independently on each band using

band-specific over-subtraction factors. This method provides a greater degree of flexibility

and control over the noise subtraction levels, which reduces artifacts in the enhanced speech,

resulting in improved speech quality. The effect of the number of frequency bands and the


type of filter spacing (linear, logarithmic or mel) was investigated. Results showed that the

proposed MBSS method with four linear-spaced frequency bands outperformed the

conventional spectral subtraction method with respect to speech quality and reduced musical

noise.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS ..................................................................................................v

ABSTRACT ........................................................................................................................vi

LIST OF FIGURES ..............................................................................................................x

LIST OF TABLES..............................................................................................................xii

1. INTRODUCTION............................................................................................................1

2. LITERATURE REVIEW.................................................................................................3

2.1 Fundamentals of speech production ...........................................................................4

2.2 Classification of speech enhancement techniques ......................................................6

2.3 Short-term spectral amplitude techniques...................................................................9

2.4 Principle of the basic spectral subtraction method....................................................11

2.5 Drawbacks of the spectral subtraction method .........................................................13

2.6 Modifications to spectral subtraction .......................................................................16

2.7 Frequency-dependent spectral subtraction methods ...............................................23

3. MULTI-BAND SPECTRAL SUBTRACTION ..............................................................26

3.1 Motivation...............................................................................................................26

3.2 Multi-band spectral subtraction................................................................................30

4. IMPLEMENTATION AND PERFORMANCE EVALUATION....................................37

4.1 Implementation .......................................................................................................37

4.2 Objective measures for performance evaluation.......................................................43

4.3 Effect of pre-processing strategies ...........................................................................46

4.4 Effect of frequency spacing .....................................................................................50

4.5 Performance with speech-silence detector................................................................55

4.6 Subjective evaluation of speech intelligibility ..........................................................57

4.7 Optimal configuration..............................................................................................59

5. SUMMARY AND CONCLUSIONS..............................................................................63


BIBLIOGRAPHY...............................................................................................................66

VITA


LIST OF FIGURES

Figure 2.1: Diagrammatic representation of the short-time spectral magnitude enhancement system ..........10

Figure 2.2: Spectrograms of the sentence “The shop closes for lunch”, clean speech (top), with speech-shaped noise at 5 dB SNR (middle), and speech enhanced using spectral subtraction (bottom) ..........14

Figure 2.3: Over-subtraction factor α as a function of SNR, with α₀ = 4 ..........19

Figure 3.1: (a) PSD of WGN, (b) Segmental SNR of four (linearly spaced) frequency bands of speech corrupted by WGN at 5 dB SNR ..........28

Figure 3.2: (a) PSD of speech-shaped noise, (b) Segmental SNR of four (linearly spaced) frequency bands of speech corrupted by speech-shaped noise at 5 dB SNR ..........29

Figure 3.3: (a) PSD of multi-talker babble, (b) Segmental SNR of four (linearly spaced) frequency bands of speech corrupted by multi-talker babble at 5 dB SNR ..........29

Figure 3.4: (a) PSD of aircraft noise, (b) Segmental SNR of four (linearly spaced) frequency bands of speech corrupted by aircraft noise at 5 dB SNR ..........30

Figure 3.5: Diagrammatic representation of the multi-band spectral subtraction method ..........31

Figure 3.6: (a) Original magnitude spectrum of a speech frame, (b) Magnitude spectrum of the smoothed and averaged version of 3.6(a) ..........33

Figure 4.1: (a) Long-term magnitude spectrum of a speech file from the HINT database, (b) Magnitude spectrum of the speech-shaped noise ..........39

Figure 4.1: Sentence “The shop closes for lunch,” sampled at 8 kHz, (above) time plot and (below) the corresponding spectrogram ..........41

Figure 4.2: Speech-shaped noise sampled at 8 kHz, (above) time plot and (below) the corresponding spectrogram ..........41

Figure 4.3: Sentence “The shop closes for lunch,” at 5 dB SNR, (above) time plot and (below) the corresponding spectrogram ..........42

Figure 4.4: Sentence “The shop closes for lunch,” at 0 dB SNR, (above) time plot and (below) the corresponding spectrogram ..........42

Figure 4.5: Sentence “The shop closes for lunch,” after spectral smoothing and magnitude averaging, (above) time plot and (below) the corresponding spectrogram ..........46

Figure 4.6: Mean IS distance measure of the MBSS approach with linear frequency spacing and without pre-processing, as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0 dB SNR ..........47

Figure 4.7: Mean IS distance measure of the MBSS approach with linear frequency spacing and with pre-processing, as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0 dB SNR ..........48

Figure 4.8: Spectrograms of processed speech of the sentence “The shop closes for lunch,” at 5 dB SNR, using MBSS with four linearly spaced frequency bands, (above) without pre-processing and (below) with smoothing and weighted magnitude averaging ..........49


Figure 4.9: Spectrograms of processed speech of the sentence “The shop closes for lunch,” at 0 dB SNR, using MBSS with four linearly spaced frequency bands, (above) without pre-processing and (below) with smoothing and weighted magnitude averaging ..........49

Figure 4.10: Mean IS distance measure of the MBSS approach with logarithmic frequency spacing as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0 dB SNR ..........53

Figure 4.11: Mean IS distance measure of the MBSS approach with mel frequency spacing as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0 dB SNR ..........53

Figure 4.12: Comparison of spectrograms of enhanced speech at 5 dB SNR processed with the MBSS algorithm using four bands with (top) linear spacing, (middle) logarithmic spacing and (bottom) mel spacing ..........54

Figure 4.13: Comparison of spectrograms of enhanced speech at 0 dB SNR processed with the MBSS algorithm using four bands with (top) linear spacing, (middle) logarithmic spacing and (bottom) mel spacing ..........54

Figure 4.14: Mean IS distance measure of the MBSS approach with linear frequency spacing and speech-silence detector, as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0 dB SNR ..........56

Figure 4.13: Spectrograms of speech enhanced with the MBSS algorithm using four linearly spaced frequency bands with a speech-silence detector, at (top) 5 dB SNR and (bottom) 0 dB SNR ..........57

Figure 4.14: Intelligibility test results for seven subjects scored on percentage words correct ..........59

Figure 4.15: Comparison of the performance, in terms of mean IS distance measure, of the power spectral subtraction (indicated with 'PSS') with the multi-band spectral subtraction approach as a function of the number of bands for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0 dB SNR ..........60

Figure 4.16: Spectrogram of the sentence “The shop closes for lunch” at 5 dB SNR. The top spectrogram is the corrupted signal, the middle spectrogram is the enhanced signal obtained by the multi-band spectral subtraction method using 4 linearly spaced frequency bands, and the bottom spectrogram is the enhanced signal obtained by the power spectral subtraction method ..........61

Figure 4.17: Spectrogram of the sentence “The shop closes for lunch” at 5 dB SNR. The top spectrogram is the corrupted signal, the middle spectrogram is the enhanced signal obtained by the multi-band spectral subtraction method using 4 linearly spaced frequency bands, and the bottom spectrogram is the enhanced signal obtained by the power spectral subtraction method ..........62


LIST OF TABLES

Table 2.1: Phonemes in American English ..........5

Table 2.2: Speech enhancement processing strategies (adapted from [30]) ..........6

Table 4.1: List of sentences used from the HINT database for objective performance evaluation ..........40

Table 4.2: Center frequency values for linear, logarithmic and mel spacing of frequency bands ..........52

Table 4.3: Mean global and segmental SNR calculated over ten sentences at 5 dB SNR ..........58


CHAPTER ONE

INTRODUCTION

A major part of the interaction between humans takes place via speech communication.

Hence, research in speech and hearing sciences has been going on for centuries to understand

the dynamics and processes involved in the production and perception of speech. The field of

speech processing is essentially an application of signal processing techniques to acoustic

signals using the knowledge offered by researchers in the field of hearing sciences. The

explosive advances in recent years in the field of digital computing have provided a

tremendous boost to the field of speech processing. Digital signal processing techniques are

more sophisticated and capable than their analog counterparts. The ease and speed of

representing, storing, retrieving and processing speech data have contributed to the

development of efficient and effective speech processing techniques to address the issues

related to speech.

The presence of background noise in speech significantly reduces the intelligibility of

speech. Degradation of speech severely affects the ability of a person, whether hearing-impaired or normal-hearing, to understand what the speaker is saying. Noise reduction or speech

enhancement algorithms are used to suppress such background noise and improve the

perceptual quality and intelligibility of speech. Even though speech is perceptible in a

moderately noisy environment, many applications like mobile communications, speech

recognition and aids for the hearing handicapped, to name a few, drive the effort to build


more effective noise reduction algorithms for better performance. Over the years engineers

have developed a variety of theoretical and relatively effective techniques to combat this

issue. However, the problem of cleaning noisy speech still poses a challenge to the area of

signal processing. Removing various types of noise is difficult due to the random nature of

the noise and the inherent complexities of speech. Noise reduction techniques usually have a

trade-off between the amount of noise removed and the speech distortion introduced by the

processing of the speech signal. Complexity and ease of implementation of the noise

reduction algorithms are also of concern, especially in applications related to portable

devices such as mobile communications and digital hearing aids.

The spectral subtraction method is a well-known noise reduction technique [2] [3]

[18]. Most implementations and variations of the basic technique advocate subtraction of the

noise spectrum estimate over the entire speech spectrum. However, real world noise is

mostly colored and does not affect the speech signal uniformly over the entire spectrum. In

this thesis, we propose a multi-band spectral subtraction approach that takes into account the

fact that colored noise affects the speech spectrum differently at various frequencies. This

method outperforms the standard power spectral subtraction method, resulting in superior

speech quality and largely reduced musical noise.
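The multi-band idea can be sketched per frame as follows. This is only an illustrative sketch: the linear band edges, the SNR-driven over-subtraction rule, and the spectral floor value below are assumptions for illustration, not the exact settings developed in Chapter 3.

```python
import numpy as np

def mbss_frame(noisy_power, noise_power, n_bands=4, floor=0.002):
    """Multi-band spectral subtraction applied to one frame's power spectrum.

    Each linearly spaced band gets its own over-subtraction factor, here
    driven by the band's segmental SNR (an illustrative rule)."""
    n_bins = len(noisy_power)
    edges = np.linspace(0, n_bins, n_bands + 1, dtype=int)
    out = np.empty(n_bins)
    for b in range(n_bands):
        lo, hi = edges[b], edges[b + 1]
        snr_db = 10 * np.log10(np.sum(noisy_power[lo:hi]) /
                               (np.sum(noise_power[lo:hi]) + 1e-12))
        # Subtract more aggressively in low-SNR bands, gently in high-SNR ones.
        alpha = np.clip(4.0 - 0.15 * snr_db, 1.0, 5.0)
        sub = noisy_power[lo:hi] - alpha * noise_power[lo:hi]
        # A spectral floor keeps residual noise from turning into musical tones.
        out[lo:hi] = np.maximum(sub, floor * noisy_power[lo:hi])
    return out
```

Because each band is subtracted independently, a strongly corrupted low-frequency band no longer forces over-subtraction (and hence distortion) in cleaner high-frequency bands.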

This thesis is organized as follows: Chapter 2 gives a review of the different noise

reduction strategies that have been developed. Chapter 3 discusses the Multi-Band Spectral

Subtraction (MBSS) method. Results and a quantitative performance comparison are discussed in

Chapter 4. Chapter 5 gives the conclusions and presents a summary of the work done and

future work.


CHAPTER TWO

LITERATURE REVIEW

In the past decades, research in the field of speech enhancement has focused on the

suppression of additive background noise [5] [16] [17]. From the point of view of signal

processing, additive noise is easier to deal with than convolutive noise or nonlinear

disturbances. The ultimate goal of speech enhancement is to eliminate the additive noise

present in speech signal and restore the speech signal to its original form. Several methods

have been developed as a result of these research efforts. Most of these methods have been

developed with one or more auditory, perceptual or statistical constraints placed on the

speech and noise signals. However, in real world situations, it is very difficult to reliably

predict the characteristics of the interfering noise signal or the exact characteristics of the

speech waveform. Hence, in effect, the speech enhancement methods are sub-optimal and

can only reduce the amount of noise in the signal to some extent. Due to the sub-optimal

nature of these methods, some of the speech signal can be distorted during the process.

Hence, there is a trade-off between distortions in the processed speech and the amount of

noise suppressed. The effectiveness of the speech enhancement system can therefore be

measured based on how well it performs in light of this trade-off.

This chapter presents a review of speech production in humans and a literature

review of the different speech enhancement methods used to date. The family of subtractive-

type enhancement methods is discussed in more detail.


2.1 Fundamentals of speech production

Speech, a dynamic, information-bearing signal, is also called the acoustic waveform. These

waves are produced due to the sound pressure generated in the mouth of the speaker as a result

of some sequence of coordinated movements of a series of structures in the human vocal

system. The branch of science that deals with the dynamics and production of human speech sounds is called phonetics. The process of speech communication involves the production of

the acoustic wave by the speaker and the perception of the signal by the listener. Though the

process of speech perception still largely remains a mystery to the scientific world, the

process of speech production has been well researched and understood. A sound knowledge

of the processes involved in the production and perception of speech is necessary for

engineers to develop suitable methods to represent and transform the acoustic signals to

achieve the desired results.

The human speech production mechanism consists of the lungs, trachea (windpipe),

larynx, pharyngeal cavity (throat), buccal cavity (mouth), nasal cavity, velum (soft palate),

tongue, jaw, teeth and lips. The lungs and trachea make up the respiratory subsystem of the

mechanism. These provide the source of energy for speech when air is expelled from the

lungs into the trachea. The resulting airflow passes through the larynx, which provides

periodic excitation to the system to produce the voiced sounds. The three cavities of the

system can collectively be termed the main acoustic filter that shapes the sound that is

generated. The velum, tongue, jaw, teeth and lips are known as the articulators. These

provide the finer adjustments to generate speech. The excitation used to generate speech can

be classified into voiced, unvoiced, mixed, plosive, whisper and silence. Any combination of

one or more can be blended to produce a particular type of sound. A phoneme describes the


linguistic meaning conveyed by a particular speech sound. The American English language

consists of about 42 phonemes, which can be classified as vowels, semivowels, diphthongs

and consonants (fricatives, nasals, affricates and whisper) as shown in Table 2.1.

Table 2.1: Phonemes in American English.

Vowels are produced due to the periodic vibrations of the vocal cords in the larynx. The frequency at which the vocal cords vibrate is called the fundamental frequency or pitch

of the speech. The fricatives are caused by the turbulence of the air passing through narrow

constrictions in the vocal tract, causing a random noise-like sound. Nasals are caused by


acoustically coupling the nasal cavity to the pharyngeal cavity by lowering the velum.

Building up pressure in front of the vocal tract and abruptly releasing it produces plosives.

The resonant frequencies generated by the vocal tract are called the formant frequencies or

the formants. The formants depend on the length and shape of the vocal tract.

2.2 Classification of speech enhancement techniques

Speech enhancement systems can be classified in a number of ways [18] [30] based on the

criteria used or application of the enhancement system. (See Table 2.2).

Domain Possible Strategies

Number of input channels One / Two / Multiple

Domain of processing Time / Frequency

Type of algorithm Non-adaptive / Adaptive

Additional constraints Speech production / Perception

Table 2.2: Speech enhancement processing strategies (adapted from [30]).

The speech signal can be acquired from single or multiple channel sensors. As

discussed in Chapter 1, additive noise can make speech enhancement particularly difficult.

Non-stationarity of the noise process can further complicate the enhancement effort. One

microphone input (single channel) could make speech enhancement difficult, as speech and

noise are present in the same channel. Separation of the two signals would require relatively

good knowledge of the speech and noise models or require that the interfering signal be

present exclusively in a different frequency band than that of the speech signal. A costly

solution to this problem is to use a dual microphone approach. Spatial analysis can however


help immensely in speech enhancement as this gives useful information regarding the signal.

In such analysis, the noise source is assumed to be statistically independent and additive.

This assumption is based on the fact that most environmental noise is typically additive in

nature. The discussion in this chapter will be limited to single channel enhancement

techniques, as these are the most common types of enhancement systems found in many

applications.

• Suppression of noise using periodicity of speech

These methods exploit the quasi-periodic nature of voiced speech. As discussed in Section 2.1, voiced speech is periodic in nature, characterized by a fundamental frequency, which varies from person to person. This technique, however, depends heavily on the

accurate estimation of the pitch period (inverse of the pitch) of the speaker’s voice.

A simple method based on this criterion is the adaptive comb filter [18]. In this

method, a series of notch filters are used so as to filter out any spectral content between

the fundamental frequency and its harmonics. Another method is the single channel

adaptive noise cancellation technique [25]. In this method, a delayed version of the

speech signal is used as the input to an adaptive LMS filter while the input is used as the

reference signal. The delay decorrelates the noise in the input signal with that present in

the reference. And when the delay is equal to an estimate of the pitch period, there is a

correlation in the speech content of the two signals.

A major disadvantage of these methods is that there is no improvement in quality of

the unvoiced speech portions. Moreover, an accurate pitch extraction algorithm is crucial to

achieving good performance.
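A hedged sketch of the delayed-input LMS scheme described above: the filter length, step size, and the assumption of a known, fixed pitch-period delay are all illustrative (real speech would require a pitch tracker and a time-varying delay).

```python
import numpy as np

def pitch_delay_lms(x, delay, taps=32, mu=0.01):
    """Single-channel adaptive noise cancellation via a pitch-period delay.

    The LMS filter predicts x[n] from a copy of x delayed by one pitch
    period. The periodic (speech) component is correlated across the delay
    and appears at the filter output y; the uncorrelated noise remains in
    the prediction error."""
    w = np.zeros(taps)
    y = np.zeros(len(x))
    for n in range(delay + taps, len(x)):
        u = x[n - delay:n - delay - taps:-1]  # delayed input tap vector
        y[n] = w @ u                          # estimate of periodic component
        e = x[n] - y[n]                       # prediction error (noise estimate)
        w += mu * e * u                       # LMS weight update
    return y
```

Since only the quasi-periodic (voiced) component correlates across the delay, the output y carries the enhanced voiced speech, which is why unvoiced segments see no improvement.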


• Model-based speech enhancement

Enhancement systems in this category are also called statistical model-based methods

[6]. These methods are usually used when there is no knowledge of the statistical

properties of the speech or noise signal. Speech production models like autoregressive –

moving average (ARMA), autoregressive (AR) or moving average (MA) are used

instead. This involves the estimation of the speech model parameters and then the

estimation of the enhanced signal by re-synthesis using speech model parameters or by

using a Wiener or Kalman filter.

The Wiener filter is a popular adaptive technique that has been used in many

enhancement methods. The basic principle of the Wiener filter is to estimate an optimal

filter from the noisy input speech by minimizing the Mean Square Error (MSE) between

the desired signal s(k) and the estimated signal ŝ(k). The Wiener filter can be given in the frequency domain by:

H(ω) = P_S(ω) / (P_S(ω) + P_N(ω))        (2.1)

where P_S(ω) is the power spectral density (PSD) of the speech and P_N(ω) is the PSD of the noise, calculated during periods of non-speech activity. From Eq. 2.1 it is

obvious that a priori knowledge of the speech and noise power spectra is necessary. The

speech power spectrum is estimated using the estimated speech model parameters [17].
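For illustration only (the PSDs are assumed to be given here; estimating them is precisely the model-based step described above), Eq. 2.1 reduces to a per-bin gain:

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd, eps=1e-12):
    """Per-bin Wiener gain H(w) = Ps(w) / (Ps(w) + Pn(w)), as in Eq. 2.1.

    Equivalently H = SNR / (1 + SNR) with SNR = Ps / Pn: the gain
    approaches 1 where speech dominates and 0 where noise dominates."""
    speech_psd = np.asarray(speech_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    return speech_psd / (speech_psd + noise_psd + eps)
```

The enhanced spectrum is then obtained by applying this gain to the noisy spectrum frame by frame.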


2.3 Short-term spectral amplitude techniques

The short-term spectral amplitude (STSA) of speech has been exploited successfully in the

development of various speech enhancement algorithms. The basic idea is to use the STSA

of the noisy speech input and recover an estimate of the clean STSA by removing the part

contributed by the additive noise. A general representation of the technique is given in Figure

2.1. The input to the system is the noise-corrupted signal y(n). While there are many

methods for the analysis-synthesis processing, the Short-term Fourier Transform (STFT) of

the signal with overlap-and-add (OLA) [5] is the most commonly used method. The

spectral amplitude |Y(k)| of the noisy input signal y(n) is modified using a correction

factor. Typically, this correction factor is the spectral amplitude of the estimated noise signal d(n), measured during periods of silence/non-activity in the speech signal or obtained

from a reference channel (dual-channel method). The correction is obtained by subtracting

the spectral amplitude of the noise signal from that of the noisy speech input. Hence, these

methods are also referred to as subtractive-type algorithms. If the noise is assumed to be

uncorrelated with the speech signal, then the corrected amplitude can be considered as an

estimate $|\hat{S}(k)|$ of the original clean speech signal $s(n)$. The unprocessed phase of the noisy

input signal is used to synthesize the enhanced speech signal under the assumption that the

human ear is not able to perceive the distortions in the phase of the speech signal.


Figure 2.1: Diagrammatic representation of the short-time spectral magnitude enhancement system.

Spectral subtraction is a well-known noise reduction method based on the STSA

estimation technique. The basic power spectral subtraction technique, as proposed by Boll

[3], is popular due to its simple underlying concept and its effectiveness in enhancing speech

degraded by additive noise. The basic principle of the spectral subtraction method is to

subtract the magnitude spectrum of noise from that of the noisy speech. The noise is assumed

to be uncorrelated and additive to the speech signal. An estimate of the noise signal is

measured during silence or non-speech activity in the signal.

Since Boll [3] first proposed this method, several variations and enhancements have

been made to the techniques to overcome some inherent drawbacks in the method. Section

2.3 presents the basic principle of the technique, Section 2.4 discusses the drawbacks in the

method and Section 2.5 describes the improvements that have been proposed over the years.


2.4 Principle of the basic spectral subtraction method

If we assume that $y(n)$, the discrete noisy input signal, is composed of the clean speech

signal $s(n)$ and the uncorrelated additive noise signal $d(n)$, then we can represent it as:

$$y(n) = s(n) + d(n) \qquad (2.2)$$

Processing is done on a frame-by-frame basis. Analysis of overlapping frames of the

noisy signal is implemented by using the Discrete Fourier Transform (DFT) preceded by a

Hamming window. The power spectrum of the noisy signal can be written as:

$$|Y(k)|^2 \approx |S(k)|^2 + |D(k)|^2 \qquad (2.3)$$

where $Y(k)$, the DFT of $y(n)$, is given by:

$$Y(k) = \sum_{n=0}^{N-1} y(n)\,e^{-j2\pi nk/N} = |Y(k)|\,e^{j\varphi(k)} \qquad (2.4)$$

where $\varphi(k)$ is the phase of the noise-corrupted signal, i.e. the phase of $Y(k)$.

Since the noise spectrum $D(k)$ cannot be directly obtained, a time-average of the

power spectrum, $\hat{D}(k)$, is calculated during a period of silence. Assuming that noise is

uncorrelated with the speech signal, an estimate of the modified speech spectrum can be

given as:

$$|\hat{S}(k)|^2 = |Y(k)|^2 - |\hat{D}(k)|^2 \qquad (2.5)$$


From Eq. (2.5) it can be seen that the subtraction process involves the subtraction of an

averaged estimate of the noise from the instantaneous speech spectrum. Due to the error in

computing the noise spectrum, we may have some negative values in the modified spectrum.

These values are set to zero. This process is called half-wave rectification. With half-wave

rectification the modified spectrum can be written as:

$$|\hat{S}(k)|^2 = \begin{cases} |\hat{S}(k)|^2 & \text{if } |\hat{S}(k)|^2 > 0 \\ 0 & \text{else} \end{cases} \qquad (2.6)$$

The modified spectrum of Eq. 2.6 is combined with the phase information from the noise-

corrupted signal to reconstruct the time signal by using the Inverse Discrete Fourier

Transform (IDFT) in conjunction with the OLA method.

$$\hat{s}(n) = \mathrm{IDFT}\!\left(|\hat{S}(k)|\,e^{j\varphi(k)}\right) \qquad (2.7)$$

The noise suppression can also be implemented as a time-varying filtering process by

rewriting the spectral subtraction method as:

$$\hat{S}(k) = H(k)\,Y(k) \qquad (2.8)$$

where $H(k)$ is a gain function represented by:

$$H(k) = 1 - \frac{|\hat{D}(k)|^2}{|Y(k)|^2} \qquad (2.9)$$

or


$$H(k) = \frac{|Y(k)|^2 - |\hat{D}(k)|^2}{|Y(k)|^2} \qquad (2.10)$$

In this case, the modified spectrum is obtained by applying a time-varying weight $H(k)$ to

each frequency component. From Eq. 2.9, it can be deduced that the frequency dependent

gain is a function of the noisy signal-to-noise ratio (NSNR) of each of the frequency

components. The enhanced time signal is synthesized as given in Eq. 2.7, using the original

noisy phase portion.
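For a single frame, the processing of Eqs. 2.3-2.7 can be sketched as below. This is an illustrative sketch rather than the thesis code: overlap-and-add synthesis across frames is omitted, and the averaged noise power spectrum is assumed to have been estimated beforehand from a noise-only segment.

```python
import numpy as np

def spectral_subtract_frame(noisy_frame, noise_power, n_fft=256):
    """Basic power spectral subtraction for one frame (Eqs. 2.3-2.7).

    noise_power is the time-averaged |D(k)|^2 estimated during silence.
    Overlap-and-add across frames is omitted in this sketch.
    """
    Y = np.fft.rfft(noisy_frame * np.hamming(len(noisy_frame)), n_fft)
    power = np.abs(Y) ** 2 - noise_power               # Eq. 2.5
    power = np.maximum(power, 0.0)                     # half-wave rectification, Eq. 2.6
    S_hat = np.sqrt(power) * np.exp(1j * np.angle(Y))  # keep the noisy phase
    return np.fft.irfft(S_hat, n_fft)                  # Eq. 2.7

# Toy usage: a sinusoid in white noise, noise power from a noise-only segment
rng = np.random.default_rng(0)
n = np.arange(256)
clean = np.sin(2 * np.pi * 0.05 * n)
noise = 0.5 * rng.standard_normal(256)
noise_power = np.abs(np.fft.rfft(noise * np.hamming(256), 256)) ** 2
enhanced = spectral_subtract_frame(clean + noise, noise_power)
```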

The enhanced signal has largely reduced noise levels compared to the original noise-

corrupted signal, resulting in a better SNR and improved speech quality. However, the

subtraction process also introduces an annoying artifact called musical noise. This artifact is

due to the residual noise in the enhanced speech. This and other drawbacks of the method

neutralize the improvement in speech quality achieved due to the reduction in noise levels

and can be more annoying than the original noise itself.

2.5 Drawbacks of the spectral subtraction method

While the spectral subtraction method is easily implemented and effectively reduces the

noise present in the corrupted signal, there exist some glaring shortcomings, which are given

below:

• Residual noise (musical noise)

It is obvious that the effectiveness of the noise removal process is dependent on obtaining

an accurate spectral estimate of the noise signal. The better the noise estimate, the less

the residual noise content in the modified spectrum. However, since the noise spectrum


cannot be directly obtained, we are forced to use an averaged estimate of the noise.

Hence there are some significant variations between the estimated noise spectrum and the

actual noise content present in the instantaneous speech spectrum. The subtraction of

these quantities results in the presence of isolated residual noise levels of large variance.

This residual spectral content manifests itself in the reconstructed time signal as

varying tonal sounds, resulting in a musical disturbance of an unnatural quality. This

musical noise can be even more disturbing and annoying to the listener than the

distortions due to the original noise content.

Figure 2.2: Spectrograms of the sentence “The shop closes for lunch”, clean speech (top), with speech shaped noise at 5 dB SNR (middle), and speech enhanced

using spectral subtraction (bottom)


Figure 2.2 shows spectrograms of the sentence “The shop closes for lunch”

pronounced by a male speaker. The residual noise is clearly evident in the bottom plot.

Several residual noise reduction algorithms have been proposed to combat this

problem. However, due to the limitations of the single-channel enhancement methods, it

is not possible to remove this noise completely, without compromising the quality of the

enhanced speech. Hence there is a trade-off between the amount of noise reduction and

speech distortion due to the underlying processing.

• Distortions due to half / full wave rectification

The modified speech spectrum obtained from Eq. 2.5 may contain some negative values

due to the errors in the estimated noise spectrum. These values are rectified using half-

wave rectification (set to zero) or full-wave rectification (set to its absolute value). This

can lead to further distortions in the resulting time signal.

• Roughening of speech due to the noisy phase

The phase of the noise-corrupted signal is not enhanced before being combined with the

modified spectrum to regenerate the enhanced time signal. This is due to the fact that the

presence of noise in the phase information does not contribute immensely to the

degradation of the speech quality. This is especially true at high SNRs (>5 dB). However,

at lower SNRs (< 0 dB), the noisy phase can lead to a perceivable roughness in the speech

signal contributing to the reduction in speech quality. Experiments conducted by

Schroeder [27] have corroborated this fact. Estimating the phase of the clean speech is a


difficult task and will greatly increase the complexity of the method. Moreover, the distortion

due to noisy phase information is not very significant compared to that of the magnitude

spectrum, especially for high SNRs. Hence the use of the noisy phase information is

considered to be an acceptable practice in the reconstruction of the enhanced speech

signal.

Most speech enhancement algorithms, including the spectral subtraction methods, try

to optimize noise removal based on mathematical models of the speech and noise signals.

However, speech is a subtle form of communication and is heavily dependent on the

relationship of one frequency with another. Hence, while conventional speech enhancement

algorithms can increase the speech quality of the noisy speech by increasing the SNR, there

is no significant increase in speech intelligibility. Algorithms should take into account the

subtleties of speech and incorporate methods based on the perceptual properties of the speech

signal. The spectral subtraction methods, as well as most other methods, suffer from this

drawback. Studies [5] [30] have shown that there is no improvement in the intelligibility in

the speech signals enhanced by the spectral subtraction method.

2.6 Modifications to spectral subtraction

Several variants of the spectral subtraction method originally proposed by Boll [3] have been

developed to address the problems of the basic technique, especially the presence of musical

noise. Still other methods derived from it have been developed that perform noise

suppression in the autocorrelation, cepstral, logarithmic and sub-space domains. A variety of

pre- and post-processing methods have also proved to help reduce the presence of musical


noise while minimizing speech distortion. This section deals with the different techniques

and enhancements that have been proposed over the years.

• Magnitude averaging

Magnitude averaging of the input spectrum reduces spectral error by averaging across

neighboring frames. This has the effect of lowering the noise variance while reinforcing

the speech spectral content and thus preventing destructive subtraction. The magnitude

averaging is viable only for stationary time waveforms. Due to the short-term stationarity

of speech, the number of neighboring frames over which the averaging is done is limited.

If this constraint is ignored, a certain slurring of speech can be detected due to the

smearing of different speech phonemes into each other. A generalized representation of

the averaging operation can be expressed as:

$$\bar{Y}_i(k) = \frac{1}{2M+1}\sum_{j=-M}^{M} W_j\,Y_{i+j}(k) \qquad (2.11)$$

where $i$ is the frame index. The weights $W_j$ can be used to weight the frames. When

$W_j = 1 \;\forall j$, the equation reduces to the basic magnitude averaging operation. In the

case where the frames are weighted by different values of $W_j$, the operation is referred

to as weighted magnitude averaging. Goh et al. [9] proposed multi-blade median filtering

over several frames of speech to identify spectral content contributing to residual noise

and smoothing them out.
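A minimal sketch of the averaging operation of Eq. 2.11 (the frame layout and function name are illustrative, not from the source):

```python
import numpy as np

def weighted_magnitude_average(frames_mag, i, M=1, weights=None):
    """Average magnitude spectra over neighbouring frames (Eq. 2.11).

    frames_mag has shape (n_frames, n_bins) and holds |Y_i(k)|.
    With all weights equal to 1 this is plain magnitude averaging.
    """
    if weights is None:
        weights = np.ones(2 * M + 1)
    idx = np.arange(i - M, i + M + 1)
    return (weights[:, None] * frames_mag[idx]).sum(axis=0) / (2 * M + 1)

# Averaging smooths the outlier magnitude in the middle frame
mags = np.array([[1.0, 1.0],
                 [4.0, 1.0],
                 [1.0, 1.0]])
avg = weighted_magnitude_average(mags, i=1, M=1)
# avg = [2.0, 1.0]
```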


• Generalized spectral subtraction

A generalized form of the basic spectral subtraction Eq. 2.5 is given by Berouti et al. [2]

as:

$$|\hat{S}(k)|^a = |Y(k)|^a - |\hat{D}(k)|^a \qquad (2.12)$$

where the power exponent $a$ can be chosen to obtain optimum performance. In the case

where $a = 2$, the subtraction is carried out on the Short-term Power Density Spectra

(STPDS) and is referred to as power spectral subtraction. When $a = 1$, the equation

reduces to the basic spectral subtraction method proposed by Boll [3] where the

subtraction is carried out by subtracting the magnitude spectra.

• Spectral Subtraction using over-subtraction and spectral floor

An important variation of spectral subtraction was proposed by Berouti et al. [2] for

reduction of residual musical noise. This proposed technique could be expressed as:

$$|\hat{S}(k)|^2 = |Y(k)|^2 - \alpha\,|\hat{D}(k)|^2 \qquad (2.13)$$

$$|\hat{S}(k)|^2 = \begin{cases} |\hat{S}(k)|^2 & \text{if } |\hat{S}(k)|^2 > \beta\,|\hat{D}(k)|^2 \\ \beta\,|\hat{D}(k)|^2 & \text{else} \end{cases} \qquad (2.14)$$

where the over-subtraction factor α is a function of the noisy signal-to-noise ratio and

calculated as:

$$\alpha = \alpha_0 - \frac{3}{20}\,SNR, \qquad -5\ \mathrm{dB} \le SNR \le 20\ \mathrm{dB} \qquad (2.15)$$


where $\alpha_0$ is the desired value of $\alpha$ at 0 dB SNR. Figure 2.3 plots $\alpha$ for $\alpha_0 = 4$

over a range of SNR values. The over-subtraction factor $\alpha$ subtracts an overestimate of

the noise power spectrum from the speech power spectrum. This operation minimizes the

presence of residual noise by decreasing the spectral excursions in the enhanced

spectrum. The over-subtraction factor can be seen as a time-varying factor, which

provides a degree of control over the noise removal process between periods of noise

update.

Figure 2.3: Over-subtraction factor $\alpha$ as a function of SNR with $\alpha_0 = 4$

In Eq. 2.14, the spectral floor β prevents the spectral components of the enhanced

spectrum from falling below the lower bound $\beta\,|\hat{D}(k)|^2$. This operation fills out the

valleys between spectral peaks and the reinsertion of broadband noise into the spectrum

helps to ‘mask’ the neighboring residual noise components. While the proposed


technique has proved to be successful in suppressing the residual noise to a large extent,

over-subtraction of the noise estimate also causes heavy speech distortions.
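Eqs. 2.13-2.15 can be sketched per frame as follows. The frame-level SNR definition and the clamping of the SNR to the stated -5 to 20 dB range are assumptions of this sketch:

```python
import numpy as np

def berouti_subtract(noisy_power, noise_power, alpha0=4.0, beta=0.01):
    """Over-subtraction with a spectral floor (Eqs. 2.13-2.15) for one frame.

    The frame SNR definition and the clipping to [-5, 20] dB are assumptions.
    """
    snr = 10.0 * np.log10(noisy_power.sum() / noise_power.sum())
    alpha = alpha0 - (3.0 / 20.0) * np.clip(snr, -5.0, 20.0)   # Eq. 2.15
    sub = noisy_power - alpha * noise_power                    # Eq. 2.13
    floor = beta * noise_power
    return np.where(sub > floor, sub, floor)                   # Eq. 2.14

noisy = np.array([10.0, 2.0, 1.5])
noise = np.ones(3)
out = berouti_subtract(noisy, noise)
# Strong bins survive the over-subtraction; weak bins are clamped to the floor.
```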

• Spectral subtraction with an MMSE STSA estimator

In 1983, Ephraim and Malah [7] proposed an optimum (in the minimum mean-square

error sense - MMSE) STSA estimator. The method calculated a gain function based on

the a priori and a posteriori SNRs. The following equations describe the method:

$$\hat{S}(k) = H(k)\,Y(k) \qquad (2.16)$$

$$H(k) = \frac{\sqrt{\pi}}{2}\,\frac{1}{\sqrt{\gamma_N}}\,\sqrt{\frac{\gamma_A}{1+\gamma_A}}\;F\!\left(\gamma_N\,\frac{\gamma_A}{1+\gamma_A}\right) \qquad (2.17)$$

where $\gamma_A$ is the a priori SNR, which is calculated as:

$$\gamma_{A,i}(k) = 0.98\,\frac{|\hat{S}_{i-1}(k)|^2}{|\hat{D}_i(k)|^2} + (1-0.98)\,P(\gamma_{N,i}-1) \qquad (2.18)$$

here, $i$ is the frame index, with $P(x) = x$ if $x \ge 0$ and $P(x) = 0$ otherwise. $\gamma_N$ is the a

posteriori SNR and $F$ is the function:

$$F(x) = e^{-x/2}\left[(1+x)\,I_0\!\left(\frac{x}{2}\right) + x\,I_1\!\left(\frac{x}{2}\right)\right] \qquad (2.19)$$

where $I_0(\cdot)$ and $I_1(\cdot)$ are the zero- and first-order modified Bessel functions, respectively.

Unlike magnitude averaging where averaging is performed irrespective of whether

the frame contains speech or noise, the proposed MMSE estimator performs non-linear


smoothing only when the SNR is low, i.e. when the frame predominantly contains noise.

The residual noise present due to this technique has been observed to be colorless. The

method reduces the distortions in the speech parts due to averaging.
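The gain of Eq. 2.17, with $F$ as in Eq. 2.19, can be evaluated numerically. The sketch below uses scipy's exponentially scaled Bessel functions `i0e` and `i1e`, so the $e^{-x/2}$ factor in Eq. 2.19 cancels analytically and large arguments do not overflow; it is an illustration, not the code of [7].

```python
import numpy as np
from scipy.special import i0e, i1e

def F(x):
    """F(x) = exp(-x/2) [ (1 + x) I0(x/2) + x I1(x/2) ], Eq. 2.19.

    i0e(v) = exp(-v) I0(v) and i1e(v) = exp(-v) I1(v), so the exp(-x/2)
    factor cancels analytically and large x does not overflow.
    """
    return (1.0 + x) * i0e(x / 2.0) + x * i1e(x / 2.0)

def mmse_stsa_gain(gamma_a, gamma_n):
    """Gain of Eq. 2.17 from the a priori (gamma_a) and a posteriori (gamma_n) SNRs."""
    v = gamma_n * gamma_a / (1.0 + gamma_a)
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(v) / gamma_n) * F(v)

gain = mmse_stsa_gain(10.0, 10.0)   # speech-dominated bin: gain close to 1
```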

• Spectral subtraction based on perceptual properties

As mentioned earlier, while conventional speech enhancement algorithms improve the

speech quality of the noisy speech by increasing the SNR, there is no significant increase

in speech intelligibility due to the quasi-stationary and other subtle properties of speech.

To tackle this problem, researchers have been trying to incorporate the knowledge of

human perceptual properties in the enhancement processing. Methods based on the

perceptual loudness (Petersen and Boll [22]) and lateral inhibition (Cheng and

O’Shaughnessy [4]) have shown that this approach is somewhat successful in preserving

speech content.

Virag [29] proposed a technique based on the masking properties of the human

auditory system, i.e. the property that weak sounds are masked by simultaneously

occurring stronger sounds. A masking threshold is calculated by modeling the frequency

selectivity of the human ear and the masking property. Using the implementation of

spectral subtraction given in Eq. 2.8, the gain function is calculated as:

$$G(\omega) = \begin{cases} \left[1 - \alpha\left(\dfrac{|\hat{D}(\omega)|}{|Y(\omega)|}\right)^{\gamma_1}\right]^{\gamma_2} & \text{if } \left(\dfrac{|\hat{D}(\omega)|}{|Y(\omega)|}\right)^{\gamma_1} < \dfrac{1}{\alpha+\beta} \\[2ex] \left[\beta\left(\dfrac{|\hat{D}(\omega)|}{|Y(\omega)|}\right)^{\gamma_1}\right]^{\gamma_2} & \text{else} \end{cases} \qquad (2.20)$$


where the over-subtraction factor $\alpha$ and the spectral floor parameter $\beta$ are functions of

the masking threshold $T(\omega)$. The exponents $\gamma_1$ and $\gamma_2$ determine the sharpness of the

transition of $G(\omega)$. The masking threshold $T(\omega)$ is calculated by applying a spreading

function across the critical bands of the speech spectrum.
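The parametric gain of Eq. 2.20 can be sketched as below. In Virag's method $\alpha$ and $\beta$ are adapted per frequency from the masking threshold $T(\omega)$; here they are fixed scalars purely for illustration.

```python
import numpy as np

def virag_gain(noisy_mag, noise_mag, alpha=2.0, beta=0.05, g1=2.0, g2=0.5):
    """Parametric gain of Eq. 2.20 with exponents gamma1 = g1, gamma2 = g2.

    alpha and beta are fixed here; in Virag's method they would be adapted
    per bin from the masking threshold T(w).
    """
    r = (noise_mag / noisy_mag) ** g1
    selected = np.where(r < 1.0 / (alpha + beta),
                        1.0 - alpha * r,   # subtraction branch
                        beta * r)          # spectral-floor branch
    return selected ** g2

# One high-SNR bin and one low-SNR bin
g = virag_gain(np.array([10.0, 1.0]), np.array([1.0, 0.9]))
```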

Kim et al. [15] proposed a similar method based on masking properties and phonetic

dependency of speech. The method employs a state-dependent subtraction of speech and

residual noise reduction using the masking threshold. While these methods have proved

to improve speech quality as compared to using purely mathematical models of speech

and noise signals, the increase in complexity in implementation is also substantial.

• Other methods

Other methods based on the spectral subtraction method have been developed that

operate in the autocorrelation, cepstral, logarithmic and signal subspace domains. In the

basic spectral subtraction method and most of its variations, the short-term spectral

estimations are done in the frequency domain. This estimation can also be done in the

logarithmic domain. A major drawback of this method is that the resulting

implementation becomes very complicated and computationally expensive. However, this

drawback can be avoided by using lookup tables.

The signal subspace principles have also been incorporated successfully within the

spectral subtraction framework. The decomposition of the noisy signal into a subspace of

the desired signal and noise is done using the Karhunen-Loeve Transform (KLT) as

described by [8] [24]. However, for non-white noise sources, pre-whitening may be

necessary.


In recent years, researchers have proposed a frequency adaptive subtraction factor

based on the segmental noisy SNR. Most implementations and variations of the basic

technique advocate subtraction of the noise spectrum estimate over the entire speech

spectrum. However, real world noise is mostly colored and does not affect the speech

signal uniformly over the entire spectrum. The frequency-dependent spectral subtraction

approach takes into account the fact that colored noise affects the speech spectrum

differently at various frequencies. The next section discusses in detail the methods based on

non-linear spectral subtraction.

2.7 Frequency-dependent spectral subtraction methods

Lockwood and Boudy [19] proposed the non-linear spectral subtraction (NSS) method,

which is based on the linear spectral subtraction method proposed by Berouti et al. [2]. In this

method, the over-subtraction factor is frequency dependent within every frame of speech

input. Hence, the subtraction is non-linear over the range of frequencies in the spectrum. The

enhanced speech spectrum can be expressed as:

$$|\hat{S}_i(k)| = |\bar{Y}_i(k)| - \frac{\alpha_i(k)}{1 + \gamma\,\rho_i(k)} \qquad (2.21)$$

where $i$ is the frame index, $|\bar{Y}_i(k)|$ is the smoothed noisy speech spectrum of the $i$th frame,

and $\alpha_i(k)$ is the frequency-dependent overestimation factor calculated as:

$$\alpha_i(k) = \max_{i-40 \le j \le i}\,|\hat{D}_j(k)| \qquad (2.22)$$


or, is estimated as:

$$\alpha_i(k) = 1.5\cdot|\hat{D}_i(k)| \qquad (2.23)$$

The scaling factor $\gamma$ is dependent on the variation of the frequency-dependent SNR $\rho_i(k)$,

which is given by:

$$\rho_i(k) = \frac{|Y_i(k)|}{|\hat{D}_i(k)|} \qquad (2.24)$$

The subtracted term in Eq. 2.21 is manually limited by the bounds given in Eq. 2.25 to

reduce large variations in the modified spectrum:

$$\hat{D}_i(k) \le \frac{\alpha_i(k)}{1 + \gamma\,\rho_i(k)} \le 3\,\hat{D}_i(k) \qquad (2.25)$$

To prevent negative values in the enhanced spectrum, a spectral floor is employed as:

$$|\hat{S}_i(k)| = \begin{cases} |\hat{S}_i(k)| & \text{if } |\hat{S}_i(k)| \ge \beta\,|\hat{D}_i(k)| \\ \beta\,|Y_i(k)| & \text{else} \end{cases} \qquad (2.26)$$

where a typical value is $\beta = 0.1$.

The proposed algorithm computes an optimum over-subtraction value for each

frequency in the frame depending on the SNR. Though the algorithm is successful in

suppressing the musical noise to a large extent, there may exist large variations between

neighboring frequency components due to errors in the noise estimate. However, the


algorithm demonstrates that frequency-dependent processing can be used to suppress musical

noise and achieve better speech quality.
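The NSS computations of Eqs. 2.21-2.26 can be sketched per frame as follows. The data layout (a matrix of recent noise-magnitude estimates, with the most recent row taken as $|\hat{D}_i(k)|$) and the literal reading of the floor rule of Eq. 2.26 are assumptions of this sketch.

```python
import numpy as np

def nss_subtract(noisy_mag, noise_history, gamma=0.1, beta=0.1):
    """Non-linear spectral subtraction sketch (Eqs. 2.21-2.26).

    noisy_mag:     smoothed |Y_i(k)| of the current frame
    noise_history: rows of recent |D_j(k)| estimates (most recent last)
    """
    noise_mag = noise_history[-1]                   # taken as the current |D_i(k)|
    alpha = noise_history.max(axis=0)               # Eq. 2.22: per-bin max over past frames
    rho = noisy_mag / noise_mag                     # Eq. 2.24: per-bin SNR
    sub = alpha / (1.0 + gamma * rho)               # non-linear subtraction term
    sub = np.clip(sub, noise_mag, 3.0 * noise_mag)  # Eq. 2.25 bounds
    s_hat = noisy_mag - sub                         # Eq. 2.21
    return np.where(s_hat >= beta * noise_mag, s_hat, beta * noisy_mag)  # Eq. 2.26 floor

hist = np.array([[1.0, 1.2],
                 [0.9, 1.5]])
out = nss_subtract(np.array([10.0, 2.0]), hist)
# High-SNR bin keeps most of its magnitude; low-SNR bin is heavily attenuated.
```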

Other approaches based on frequency-dependent subtraction have also been proposed.

He and Zweig [11] have proposed a two-band spectral subtraction method using the Berouti

et al. [2] method for the lower frequency band and weighted magnitude averaging for the

higher frequency band, which is considered to be stochastic in nature. The cut-off frequency

between the two bands was determined adaptively for each frame as the highest frequency

below which the separation between adjacent peaks was approximately equal to the

fundamental frequency. Another method (similar to the work presented in this thesis)

proposed by Wu and Chen [31] uses the Berouti et al. [2] spectral subtraction method on

each critical band over the speech spectrum.

The success met by these methods has shown that frequency-dependent processing of

the subtraction procedure is indeed a valid line of research. Chapter 3 presents the proposed

multi-band method for frequency-dependent subtraction.


CHAPTER THREE

MULTI-BAND SPECTRAL SUBTRACTION

In Chapter 2, we have seen that while the conventional power spectral subtraction method

substantially reduces the noise levels in the noisy speech, it also introduces an annoying

distortion in the speech signal called musical noise. This distortion is caused by

inaccuracies in the short-time noise spectrum estimate, resulting in large spectral variations in

the enhanced spectrum. This chapter describes a frequency-dependent spectral subtraction

method that offers better speech quality of the resulting enhanced speech with reduced

residual noise. Section 3.1 discusses the motivation behind the development of the proposed

method. Section 3.2 explains the different processing techniques used in the proposed multi-

band spectral subtraction algorithm.

3.1 Motivation

Most speech enhancement algorithms have been observed to work well only under certain

conditions. As such, the problem of enhancing speech corrupted by a noise source has not yet

been fully resolved. While methods based on mathematical and statistical models of

speech/noise signals have been shown to be effective, they have a key drawback due to the fact

that they incorporate some very crucial assumptions about the speech and noise

characteristics. However, real-world noise is highly random in nature. Moreover, the spectral

content of speech can vary significantly from speaker to speaker and with the emotional state


of the speaker. Hence it becomes imperative to exploit as many of the observable properties of

the speech and noise signals as possible. For example, it is possible to observe the noise by

itself during speech pauses due to the bursty nature of speech.

Recent research in spectral subtraction methods has focused on a non-linear approach

to the subtraction procedure [11] [19] [28] [31]. This approach has been justified due to the

variation of signal-to-noise ratio across the speech spectrum. Unlike white Gaussian noise

(WGN), which has a flat spectrum, the spectrum of real-world noise is not flat. Thus, the

noise signal does not affect the speech signal uniformly over the whole spectrum. Some

frequencies are affected more adversely than others. In multi-talker babble, for instance,

the low frequencies, where most of the speech energy resides, are affected more than the high

frequencies. Hence it becomes imperative to estimate a suitable factor that will subtract just

the necessary amount of the noise spectrum from each frequency bin (ideally), to prevent

destructive subtraction of the speech while removing most of the residual noise.

Another factor that leads to variation in SNR in different frequency bands of speech

corrupted with noise is the fact that noise has a non-uniform effect on different vowels

consonants. Past research [20] has shown this effect to be present in consonants. Preliminary

results of continuing research at our lab at the University of Texas at Dallas have shown that

various types of noise also affect vowels non-uniformly.

These effects are best illustrated in the plots of the power spectral density (PSD) of

different noise signals and the corresponding variation of segmental SNR of a portion of

speech corrupted with the particular noise. Calculation of the segmental SNR is given by Eq.

3.10 in Section 3.2. Figure 3.1(a) depicts the PSD of computer-generated WGN, which is flat

over the whole spectrum. Figure 3.1(b) shows the segmental SNR estimated for 4 (linearly


spaced) frequency bands of speech corrupted by the noise. The segmental SNR was plotted

for a portion of the sentence "The shop closes for lunch" produced by a male speaker.

Figures 3.2-3.4 illustrate similar plots for three real-world noise signals, i.e. speech-shaped

noise, babble noise and aircraft noise. The SNR plots indicate that the segmental SNRs of the

high frequency bands (e.g. band 4) are significantly lower than the SNR of the low frequency

bands (e.g. band 2), by as much as 15 dB in some cases.
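A per-band SNR comparison of this kind can be reproduced in miniature. Since Eq. 3.10 (the exact segmental SNR definition) appears later in Section 3.2, the sketch below simply uses the ratio of clean-signal to noise energy in each of four linearly spaced bands of one frame:

```python
import numpy as np

def band_snrs_db(clean, noise, n_bands=4, n_fft=256):
    """SNR in dB over linearly spaced frequency bands of one frame."""
    S = np.abs(np.fft.rfft(clean, n_fft)) ** 2
    D = np.abs(np.fft.rfft(noise, n_fft)) ** 2
    edges = np.linspace(0, len(S), n_bands + 1, dtype=int)
    return np.array([10.0 * np.log10(S[a:b].sum() / D[a:b].sum())
                     for a, b in zip(edges[:-1], edges[1:])])

# A low-frequency tone in white noise: the lowest band has a much higher SNR
rng = np.random.default_rng(1)
n = np.arange(256)
snrs = band_snrs_db(np.sin(2 * np.pi * 0.03 * n), 0.3 * rng.standard_normal(256))
```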

The non-linear spectral subtraction [19] is a frequency-dependent spectral subtraction

method, which exploits the non-uniformity of the effects of noise on speech. Here, a

frequency-dependent subtraction factor is calculated for each frequency component of the

spectra.

(a) (b)

Figure 3.1: (a) PSD of WGN, (b) Segmental SNR of four (linearly-spaced) frequency bands of speech corrupted by WGN at 5dB SNR.


(a) (b)

Figure 3.2: (a) PSD of speech-shaped noise, (b) Segmental SNR of four (linearly-spaced) frequency bands of speech corrupted by speech-shaped noise at 5dB SNR.

(a) (b)

Figure 3.3: (a) PSD of multi-talker babble, (b) Segmental SNR of four (linearly spaced) frequency bands of speech corrupted by multi-talker babble at 5 dB SNR.


(a) (b)

Figure 3.4: (a) PSD of aircraft noise, (b) Segmental SNR of four (linearly spaced) frequency bands of speech corrupted by aircraft noise at 5 dB SNR.

3.2 Multi-band spectral subtraction

This section describes the proposed method for speech enhancement with reduced residual

noise. A block diagram of the proposed method is shown in Figure 3.5. It consists of four

stages. In the first stage, the signal is windowed and the magnitude spectrum is estimated

using the FFT. In the second stage, we split the noise and speech spectra into different

frequency bands and calculate the over-subtraction factor for each band. The third stage

includes processing the individual frequency bands by subtracting the corresponding noise

spectrum from the noisy speech spectrum. Lastly, the modified frequency bands are

recombined and the time signal is obtained by using the noisy phase information and taking

the IFFT in the fourth stage. The effect of pre-processing operations is to neutralize the


Figure 3.5: Diagrammatic representation of the multi-band spectral subtraction method


distortion in the spectral content of the input data due to the analysis window and to pre-condition the input data to surmount the distortion due to errors in the subtraction process.

• Pre-processing techniques

Along with the actual noise suppression operation, some pre-processing methods are also

crucial to achieving good speech quality. It was mentioned in Chapter 2 (Eqs. 2.8-2.10) that the spectral subtraction process can be viewed as a time-varying filter. To

reduce the perception of residual noise in the enhanced speech, it is necessary to reduce

the variance of the frequency content of the signal. Hence, instead of directly using the

power spectra of the signal, a smoothed version of the power spectra can be used. A

smoothing window of size 10 ms was found to work well. However, it was seen that

smoothing of the estimated noise spectrum was not helpful in reducing residual noise.

Local or magnitude averaging has also proved to help improve speech quality of the

processed speech [3][4][5]. The operation is described as:

\bar{Y}_j(k) = \sum_{l=-M}^{M} W_l \, Y_{j-l}(k) \qquad (3.1)

where j is the frame index and 0 < W_l < 1. The averaging is done over M preceding and

succeeding frames of speech. Since the residual noise is the difference between the

estimated noise spectrum and its mean, local averaging of the magnitude spectrum

essentially means that noise content in the averaged frame will approach the mean noise

spectrum, i.e., the estimated noise spectrum, N̂(k). If the error is written as:


E(k) = \hat{S}(k) - S(k) \qquad (3.2)

Then, from Eqs. 2.3 and 2.5,

E(k) = N(k) - \hat{N}(k) \qquad (3.3)

Substituting N(k) with its local average N̄(k):

E(k) = \bar{N}(k) - \hat{N}(k) \qquad (3.4)

where N̄(k) is obtained by the averaging operation described in Eq. 2.11. The error,
i.e., the residual noise, is minimized if N̄(k) ≈ N̂(k). Figure 3.6 (b) shows the effect of
smoothing and local averaging on the original spectrum shown in Figure 3.6 (a).


Figure 3.6: (a) Original magnitude spectrum of a speech frame, (b) Magnitude spectrum of the smoothed and averaged version of 3.6(a).
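The local averaging of Eq. 3.1 can be sketched as below. This is an illustrative Python version (the original work used Matlab); the default weights are the values reported in section 4.1 for M = 2, and the clamping of frame indices at the signal edges is an assumption of this sketch.

```python
def magnitude_average(frames, j, weights=(0.09, 0.25, 0.32, 0.25, 0.09)):
    """Weighted average of magnitude spectra over M preceding and M
    succeeding frames (Eq. 3.1); `frames` is a list of magnitude spectra
    and `j` the index of the current frame."""
    M = len(weights) // 2
    n_bins = len(frames[j])
    out = [0.0] * n_bins
    for l in range(-M, M + 1):
        idx = min(max(j - l, 0), len(frames) - 1)  # clamp at the edges
        w = weights[l + M]
        for k in range(n_bins):
            out[k] += w * frames[idx][k]
    return out
```

Because the weights sum to one, a region of stationary noise is driven toward its mean spectrum, which is the effect the derivation of Eqs. 3.2–3.4 relies on.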


• Proposed subtraction method

A block diagram of the proposed method was given in Figure 3.5. Assuming the additive

noise to be stationary and uncorrelated with the clean speech signal, the resulting input

corrupted speech can be expressed as:

y(n) = s(n) + d(n) \qquad (3.5)

where y(n), s(n) and d(n) are the corrupted speech signal, the clean speech signal and the
noise, respectively. For a zero-mean uncorrelated noise signal, the power spectrum of the

corrupted speech can be approximately estimated as:

|Y(k)|^2 \approx |S(k)|^2 + |D(k)|^2 \qquad (3.6)

where S(k) and D(k) are the magnitude spectra of the clean speech and the noise
respectively. Since the noise spectrum cannot be directly obtained, an estimate D̂(k) is

calculated during periods of silence or non-speech activity. In the proposed

implementation by Berouti et al. [2], the estimate of the clean speech spectrum is

obtained as:

|\hat{S}(k)|^2 = |Y(k)|^2 - \alpha \, |\hat{D}(k)|^2 \qquad (3.7)

|\hat{S}(k)|^2 =
\begin{cases}
|\hat{S}(k)|^2 & \text{if } |\hat{S}(k)|^2 > \beta \, |\hat{D}(k)|^2 \\
\beta \, |\hat{D}(k)|^2 & \text{otherwise}
\end{cases} \qquad (3.8)

where α is the over-subtraction factor [2], a function of the segmental SNR, and β is the spectral floor parameter.

This implementation assumes that the noise affects the speech spectrum uniformly and the


over-subtraction factor α subtracts an over-estimate of the noise over the whole

spectrum. That is not the case, however, with real-world noise (e.g., car noise, cafeteria

noise, etc.).
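A minimal sketch of this single-band scheme (Eqs. 3.7 and 3.8), assuming the power spectra are already available as plain Python lists; the function name is hypothetical, and the default β = 0.002 is the floor value adopted later in Eq. 3.12.

```python
def spectral_subtract(noisy_power, noise_power, alpha, beta=0.002):
    """Power spectral subtraction with over-subtraction (Eq. 3.7) and a
    spectral floor (Eq. 3.8): bins that fall at or below beta times the
    noise estimate are replaced by that floor."""
    out = []
    for y2, d2 in zip(noisy_power, noise_power):
        s2 = y2 - alpha * d2          # over-subtract the noise estimate
        if s2 <= beta * d2:           # floor negative / very small values
            s2 = beta * d2
        out.append(s2)
    return out
```

The floor keeps a small residue of noise in every bin, which is what masks the isolated spectral peaks that would otherwise be heard as musical noise.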

To take into account the fact that colored noise affects the speech spectrum

differently at various frequencies, we propose a multi-band approach to spectral

subtraction. The speech spectrum is divided into N non-overlapping bands, and spectral

subtraction is performed independently in each band. The estimate of the clean speech

spectrum in the i-th band is obtained by:

|\hat{S}_i(k)|^2 = |\bar{Y}_i(k)|^2 - \alpha_i \, \delta_i \, |\hat{D}_i(k)|^2, \qquad b_i \le k \le e_i \qquad (3.9)

where b_i and e_i are the beginning and ending frequency bins of the i-th frequency band,
α_i is the over-subtraction factor of the i-th band, and δ_i is a band-subtraction factor that
can be individually set for each frequency band to customize the noise removal
properties. Ȳ_i(k) is the i-th frequency band of the smoothed and averaged noisy speech
spectrum, as given by Eq. 3.1. The band-specific over-subtraction factor α_i
is a function of the segmental SNR_i of the i-th frequency band, which is calculated as:

SNR_i \,(\mathrm{dB}) = 10 \log_{10} \left( \frac{\sum_{k=b_i}^{e_i} |\bar{Y}_i(k)|^2}{\sum_{k=b_i}^{e_i} |\hat{D}_i(k)|^2} \right) \qquad (3.10)


Using the SNR_i value calculated in Eq. 3.10, α_i can be determined as:

\alpha_i =
\begin{cases}
4.75 & SNR_i < -5 \\
4 - \frac{3}{20} \, SNR_i & -5 \le SNR_i \le 20 \\
1 & SNR_i > 20
\end{cases} \qquad (3.11)

While the use of the over-subtraction factor α_i provides a degree of control over the
noise subtraction level in each band, the use of multiple frequency bands and of the
δ_i weights provides an additional degree of control within each band.
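The band SNR of Eq. 3.10 and the resulting over-subtraction factor of Eq. 3.11 can be sketched as two small helpers; the function names are illustrative, and the inputs are assumed to be lists of power-spectrum values for the bins of one band.

```python
import math

def band_snr_db(noisy_power_band, noise_power_band):
    """Segmental SNR of one band in dB (Eq. 3.10): ratio of summed noisy
    speech power to summed estimated noise power within the band."""
    return 10.0 * math.log10(sum(noisy_power_band) / sum(noise_power_band))

def oversubtraction_factor(snr_db):
    """Band over-subtraction factor alpha_i (Eq. 3.11): aggressive
    subtraction at low SNR, tapering linearly to 1 above 20 dB."""
    if snr_db < -5.0:
        return 4.75
    if snr_db <= 20.0:
        return 4.0 - (3.0 / 20.0) * snr_db
    return 1.0
```

Note that the piecewise function is continuous at both breakpoints (4 + 15/20 = 4.75 at −5 dB and 4 − 3 = 1 at 20 dB), so small SNR changes never cause jumps in the subtraction level.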

The negative values in the modified spectrum of Eq. 3.9 are floored to the noisy
spectrum as:

|\hat{S}_i(k)|^2 =
\begin{cases}
|\hat{S}_i(k)|^2 & \text{if } |\hat{S}_i(k)|^2 > 0 \\
\beta \, |\bar{Y}_i(k)|^2 & \text{otherwise}
\end{cases} \qquad (3.12)

where the spectral floor parameter was set to β = 0.002.

The modified spectra of each frequency band are recombined to obtain the enhanced
speech spectrum, |Ŝ(k)|. The IFFT of the enhanced speech spectrum is computed with
the original noisy phase information. Since overlapping frames of speech were used in
the analysis stage, the enhanced time signal ŝ(n) is obtained by adding the overlapping
portions of the temporal speech frames, i.e., by the overlap-and-add (OLA) method.
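Putting Eqs. 3.9 and 3.12 together, the per-band subtraction and recombination might be sketched as follows. This is an illustrative simplification, not the thesis's implementation: the band list, the per-band factors, and the pass-through of any bins outside every band are all assumptions of the sketch.

```python
def mbss_bands(noisy_power, noise_power, bands, deltas, alphas, beta=0.002):
    """Per-band subtraction (Eq. 3.9) with flooring (Eq. 3.12), then
    recombination of the bands into one enhanced power spectrum.

    bands  -- list of (b_i, e_i) inclusive bin ranges
    deltas -- band-subtraction weights delta_i
    alphas -- band over-subtraction factors alpha_i
    """
    enhanced = list(noisy_power)  # bins outside every band pass through
    for (b, e), delta, alpha in zip(bands, deltas, alphas):
        for k in range(b, e + 1):
            s2 = noisy_power[k] - alpha * delta * noise_power[k]
            # floor negative values to a fraction of the noisy spectrum
            enhanced[k] = s2 if s2 > 0.0 else beta * noisy_power[k]
    return enhanced
```

Because the bands are non-overlapping, recombination is simply writing each band's result into its own bin range of the output spectrum.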

In the next chapter, we present the results obtained with the proposed multi-band

spectral subtraction approach.

CHAPTER FOUR

IMPLEMENTATION AND PERFORMANCE EVALUATION

This chapter describes the implementation details and performance evaluation of the

proposed algorithm. Evaluation of a speech enhancement algorithm is not simple. While

objective quality assessment methods can indicate an improvement or degradation in speech

quality based on mathematical measures, the human listener does not judge speech by a simple
mathematical error criterion. Therefore, subjective measurements of intelligibility and quality

are also required. Section 4.1 describes the implementation of the proposed algorithm and the

speech material used to test the algorithm. Section 4.2 explains the objective measures that

were used to evaluate the algorithm. Later sections deal with the results obtained by off-line

simulations of different versions of the proposed algorithm. In section 4.3, the effects of
pre-processing techniques are discussed. In section 4.4, objective results obtained using
different frequency spacing methods are given. Section 4.5 evaluates the proposed
algorithm incorporating a speech activity detector. Subjective results are given in section 4.6.

Section 4.7 summarizes the best configuration for the proposed algorithm that is indicated by

the objective measures.

4.1 Implementation

It is necessary to conduct off-line simulations to check the validity and feasibility of an

algorithm before it can be implemented on a real-time system. Implementation on a


workstation permits modifications and changes to the algorithm without constraints of time,

memory or computational power. The simulations were carried out on an IBM PC using

Matlab, a technical computing software.

The speech signal is first Hamming windowed using a 20-ms window and a 10-ms

overlap between frames. The windowed speech frame is then analyzed using the Fast Fourier

Transform (FFT). Smoothing of the magnitude spectrum as per [1] was found to reduce the

variance of the speech spectrum and contribute to the enhancement in speech quality. A

weighted spectral average is taken over preceding and succeeding frames of data as given by

Eq. 3.1 in section 3.2. The number of frames M is limited to 2 to prevent smearing of the

speech spectral content. The weights W_l were empirically determined and set to
W_l = [0.09, 0.25, 0.32, 0.25, 0.09] for −2 ≤ l ≤ 2. The resulting smoothed and averaged
spectrum and the estimated noise spectrum are divided into N (1 ≤ N ≤ 8) frequency bands
using either linear, logarithmic or mel spacing, as described in section 4.4. The
over-subtraction factor α_i is calculated for each band as described by Eq. 3.11. The values for
δ_i in Eq. 3.9 were empirically determined and set to:

\delta_i =
\begin{cases}
1 & f_i \le 1 \ \mathrm{kHz} \\
2.5 & 1 \ \mathrm{kHz} < f_i \le \frac{Fs}{2} - 2 \ \mathrm{kHz} \\
1.5 & f_i > \frac{Fs}{2} - 2 \ \mathrm{kHz}
\end{cases} \qquad (4.1)

where f_i is the upper frequency of the i-th band, and Fs is the sampling frequency in kHz.
The motivation for using smaller δ_i values for the low frequency bands is to minimize


speech distortion, since most of the speech energy is present in the lower frequencies.

Relaxed subtraction is also used for the high frequency bands. Subtraction is performed over

each band as indicated in Eq. 3.9 and the negative values are rectified using the spectral floor

as given in Eq. 3.12. A small amount of the original noisy spectrum can be introduced back

into the enhanced spectrum to mask any remaining musical noise. In this implementation, 5%

of the original noisy spectrum was added to the enhanced spectrum. The enhanced spectrum

within each band is combined, and the enhanced signal is obtained by taking the IFFT of the

enhanced spectrum using the phase of the original noisy spectrum. Finally, the standard

overlap-and-add method is used to obtain the enhanced signal.
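The band weights of Eq. 4.1 can be sketched as a small helper. Frequencies are expressed in kHz here for readability; the function name and the default sampling rate are assumptions of this sketch.

```python
def band_delta(f_upper_khz, fs_khz=8.0):
    """Band-subtraction weight delta_i (Eq. 4.1), chosen from the band's
    upper frequency f_i: gentle in the speech-dominated low band, strong
    in the mid band, relaxed again at the top of the spectrum."""
    if f_upper_khz <= 1.0:
        return 1.0
    if f_upper_khz <= fs_khz / 2.0 - 2.0:
        return 2.5
    return 1.5
```

With Fs = 8 kHz the mid-band region is 1–2 kHz, so for the four linearly spaced bands only the first receives the gentle weight of 1.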


Figure 4.1: (a) Long-term magnitude spectrum of a speech file from the HINT database, (b) Magnitude spectrum of the speech-shaped noise.

Ten sentences (see Table 4.1) of list number 2 (L02) from the HINT (Hearing In

Noise Test) database [21] uttered by a male speaker were used to evaluate the proposed


multi-band spectral subtraction approach. Speech-shaped noise at 5 dB and 0 dB SNR was

added to the sentences after downsampling them to 8 kHz. This noise was generated from the

long-term spectrum of all the sentences in the database and resembles the spectral

characteristics of the male speaker, as illustrated in Figure 4.1.

File Sentence Utterance

L02_1 “A boy ran down the path.”

L02_2 “Flowers grow in the garden.”

L02_3 “Strawberry jam is sweet.”

L02_4 “The shop closes for lunch.”

L02_5 “The police helped the driver.”

L02_6 “She looked in her mirror.”

L02_7 “The match fell on the floor.”

L02_8 “The fruit came in a box.”

L02_9 “He really scared his sister.”

L02_10 “The tub faucet is leaking.”

Table 4.1: List of sentences used from the HINT database for

objective performance evaluation


Figure 4.1: Sentence “The shop closes for lunch,” sampled at 8kHz, (above) time plot and (below) the corresponding spectrogram.

Figure 4.2: Speech shaped noise sampled at 8kHz, (above) time plot and (below) the corresponding spectrogram.


Figure 4.3: Sentence “The shop closes for lunch,” at 5 dB SNR, (above) time plot and (below) the corresponding spectrogram.

Figure 4.4: Sentence “The shop closes for lunch,” at 0 dB SNR, (above) time plot and (below) the corresponding spectrogram.


4.2 Objective measures for performance evaluation

In the evaluation of speech enhancement algorithms, it is necessary to identify the similarities

and differences in perceived quality and subjectively measured intelligibility. Speech quality

is an indicator of the “naturalness” of the processed speech signal. Intelligibility of speech

signals is a measure of the amount of speech information present in the signal that is

responsible for conveying what the speaker is saying. The interrelationship between

perceived speech and intelligibility is not clearly understood. While unintelligible speech

may not be considered to be of high quality, the converse may not be true. For human

listeners, it is important for the speech signal to be intelligible, even at the expense of some

degradation in speech quality. For example, human end-users could actually prefer a less

aggressive enhancement method that may not completely remove all of the interfering noise,

to a more aggressive algorithm that may completely remove the noise component but also

reduce the speech intelligibility. Performance evaluation tests can be done by subjective

quality measures or objective quality measures. Subjective measures are discussed in section

4.6. Subjective measures provide only a broad measure of performance, since a large
difference in quality is necessary for a listener to distinguish between conditions. Hence, it
is difficult to get a reliable measure of changes due to algorithm parameters. Objective

measures, on the other hand, provide a measure that can be easily implemented and reliably

reproduced.

Objective measures are based on a mathematical comparison of the original and

processed speech signals. The majority of objective quality measures quantify speech quality

in terms of a numerical distance measure or a model of the perception of speech quality by

the human auditory system. It is desired that the objective measures be consistent with the


judgment of the human perception of speech. However, it has been observed that the results
obtained by objective measures are not highly correlated with those obtained by
subjective measures. The signal-to-noise ratio (SNR) and the Itakura-Saito (IS)

measure are two of the most widely used objective measures.

• Signal-to-noise ratio (SNR)

The SNR is a popular method to measure speech quality. As the name suggests, it is
calculated as the ratio of signal power to noise power in decibels:

SNR_{dB} = 10 \log_{10} \left( \frac{\sum_n s(n)^2}{\sum_n \left[ s(n) - \hat{s}(n) \right]^2} \right) \qquad (4.2)

where s(n) is the clean speech signal, d(n) is the noise signal and ŝ(n) is the processed

speech signal. If the summation is performed over the whole signal length, the operation

is called global SNR. However, this measure suffers from a very low correlation to

subjective results [23]. A better measure can be achieved by performing the summation

over smaller periods or frames of the speech signal. This method is referred to as

segmental SNR. An average of the segmental SNRs over the whole speech length can be

performed. This method has proved to have a higher correlation to subjective results as

compared to the global SNR method [23].
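A sketch of the segmental SNR computation described above, assuming the clean and processed signals are sample lists of equal length. The 160-sample frame (20 ms at 8 kHz) and the skipping of degenerate frames are illustrative choices, not the thesis's exact settings.

```python
import math

def segmental_snr_db(clean, processed, frame_len=160):
    """Mean segmental SNR: Eq. 4.2 applied per frame, then averaged over
    all frames of the signal."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        p = processed[start:start + frame_len]
        num = sum(si * si for si in s)
        den = sum((si - pi) ** 2 for si, pi in zip(s, p))
        if num > 0.0 and den > 0.0:   # skip silent or error-free frames
            snrs.append(10.0 * math.log10(num / den))
    return sum(snrs) / len(snrs) if snrs else float("inf")
```

Averaging frame-level SNRs weights quiet and loud passages more evenly than the global ratio, which is the intuition behind its higher correlation with listening tests.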


• Itakura-Saito (IS) distance measure

The IS measure is based on the similarity or difference between the all-pole model of the

clean signal and the corrupted or processed speech signal. This measure penalizes any

mismatch in formant locations while errors in the locations of spectral valleys do not

contribute heavily to the distance. It is computed as shown in the following equation:

d_{IS}(a_d, a_\phi) = \frac{\sigma_\phi^2}{\sigma_d^2} \, \frac{a_d^T R_\phi \, a_d}{a_\phi^T R_\phi \, a_\phi} + \log \left( \frac{\sigma_d^2}{\sigma_\phi^2} \right) - 1 \qquad (4.3)

where σ_d² and σ_φ² are the all-pole gains for the enhanced and clean speech segments
respectively, a_d and a_φ are the linear prediction coefficient vectors for the enhanced and
clean speech segments respectively, and R_d and R_φ are the autocorrelation matrices of the
enhanced and clean speech segments respectively. This method has a correlation of 0.59

with subjective measures [23]. A typical range of results for the IS measure is 1 to 10,

with lower values indicating lesser distance and better speech quality [23].

The IS distance method was used as the objective measure to evaluate the

performance of the proposed algorithm. The highest 5% of the IS distance values were

discarded, as suggested in [10], to exclude unrealistically high spectral distance values. This

method ensured a reasonable overall measure of performance.
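The outlier-trimming step can be sketched as below; the function name is hypothetical, and the sketch simply keeps all but the highest 5% of frame-level IS values before averaging.

```python
def trimmed_mean_is(distances, discard_frac=0.05):
    """Mean IS distance after discarding the highest fraction of values,
    to exclude unrealistically large spectral-distance outliers."""
    kept = sorted(distances)
    n_keep = len(kept) - int(len(kept) * discard_frac)
    return sum(kept[:n_keep]) / n_keep
```

Sorting first makes the discarded set exactly the largest values, so a handful of badly modeled frames cannot dominate the reported mean.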

To determine the optimal (in terms of speech quality) number of bands, we varied the

number of bands from 1 to 8 and examined speech performance using objective measures.


4.3 Effect of pre-processing strategies

Development of the proposed multi-band spectral subtraction (MBSS) algorithm was carried

out in different stages. The performance of the multi-band subtraction process as defined in
Eqs. 3.9 – 3.12 was evaluated with different pre- and post-processing techniques. Smoothing
the magnitude spectrum and taking a weighted spectral average was shown to help preserve

speech information and improve speech quality by reducing the variance of the noisy input

spectrum. Averaging also strengthens the speech components of the transitional regions. This

results in reduced amounts of destructive subtraction of the speech signal components by the

imperfect noise estimate.

Figure 4.5: Sentence “The shop closes for lunch,” after spectral smoothing and magnitude averaging, (above) time plot and (below) the corresponding spectrogram.

The spectrogram in Figure 4.5 shows the effects of smoothing and weighted

magnitude averaging on speech. When compared to the spectrogram of the original signal in

Page 59: APPROVED BY SUPERVISORY COMMITTEE: Dr. Robert Hunt …Sunil Devdas Kamath, M.S.E.E. The University of Texas at Dallas, 2001 Supervising Professor: Dr. Philipos C. Loizou The corruption

47

Figure 4.1, it can be observed that the speech spectral components are darker, indicating

higher values of speech magnitude and hence higher speech spectral concentration in those

regions. However, smoothing the estimated magnitude spectrum of the noise and the
over-subtraction factors α_i did not result in any significant improvement in signal quality.

Figures 4.6 and 4.7 plot the mean IS distance values for 10 sentences at 5 dB and 0

dB SNR for the MBSS algorithm without and with pre-processing respectively, as a function

of the number of bands used. The IS distance shows a marked improvement as the number
of bands is increased from 1 to 4. The error bars indicate standard deviations. The improvement

in speech quality is also marked.


Figure 4.6: Mean IS distance measure of the MBSS approach with linear frequency spacing and without pre-processing, as a function of the number of bands for 10 sentences

embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0dB SNR.



Figure 4.7: Mean IS distance measure of the MBSS approach with linear frequency spacing and with pre-processing, as a function of the number of bands for 10 sentences embedded

in speech-shaped noise at (a) 5 dB SNR and (b) 0dB SNR.

However, the enhanced speech pre-processed using the above-mentioned techniques
has significantly reduced musical noise. This is also evident in Figures 4.8 and 4.9, which
illustrate the effect of the pre-processing strategies. The top spectrograms in Figures
4.8 and 4.9 represent the enhanced speech for the sentence “The shop closes for lunch,” at 5
dB and 0 dB SNR respectively, after being processed through the MBSS algorithm using four

linearly spaced frequency bands without spectral smoothing and weighted magnitude

averaging. The bottom spectrograms are obtained after processing the noisy speech through

the same configuration of MBSS, but with spectral smoothing and weighted magnitude

averaging as a pre-processing strategy. The top spectrograms exhibit high levels of residual

noise in the enhanced spectrum, whereas the bottom spectrograms have relatively lesser

residual noise levels.


Figure 4.8: Spectrograms of processed speech for the sentence “The shop closes for lunch,” at 5 dB SNR, using the MBSS algorithm with four linearly spaced frequency bands, (above) without pre-processing and (below) with smoothing and weighted magnitude averaging.

Figure 4.9: Spectrograms of processed speech for the sentence “The shop closes for lunch,” at 0 dB SNR, using the MBSS algorithm with four linearly spaced frequency bands, (above) without pre-processing and (below) with smoothing and weighted magnitude averaging.


In addition to reduced residual noise, informal listening tests showed that the

enhanced speech obtained by the use of the above-mentioned pre-processing techniques has a

significant improvement in speech quality. Hence, all later versions of the MBSS algorithm

referred to in section 4.4 and 4.5 incorporate the spectral smoothing and the weighted

spectral average as the pre-processing strategy.

4.4 Effect of frequency spacing

The central idea behind the development of the proposed algorithm is that the enhancement

process is more effective and accurate when carried out over different frequency bands rather

than over the whole spectrum taken as a single band. The process of splitting the speech

signal into different bands can be performed in the time domain by using band-pass filters or

in the frequency domain by using appropriate windows. The latter method was adopted

because it is computationally more economical and more natural to implement, given the
frequency-domain implementation of the subtraction stage in the proposed method.

Three frequency spacing techniques, viz. linear, logarithmic and mel spacing were

evaluated for the MBSS method. In the linear spacing of frequency bands, the speech

bandwidth is divided into N linearly spaced frequency bands. The logarithmic and mel

spacing are non-linear frequency scales that approximate the sensitivity of the human ear. In

logarithmic frequency spacing, the center frequencies are distributed logarithmically over the

speech bandwidth. The bands are non-overlapping. The mel is a
psychoacoustic unit of measure for pitch as perceived by the human ear. The mapping
between the mel scale and real frequency is non-linear, reflecting the non-linearity of the


human ear. The center frequencies for the corresponding frequency spacing methods are

given in Table 4.2.
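As an illustration of mel spacing, the sketch below derives non-overlapping band edges by spacing bands uniformly on the mel scale and mapping back to hertz. The mapping m = 2595·log10(1 + f/700) is the common convention and is an assumption of this sketch, so the resulting edges need not match the allocation in Table 4.2 exactly.

```python
import math

def mel_band_edges(n_bands, fs_hz=8000.0):
    """Upper band edges (Hz) for mel-spaced bands covering [0, Fs/2]."""
    def to_mel(f):
        return 2595.0 * math.log10(1.0 + f / 700.0)

    def from_mel(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    top_mel = to_mel(fs_hz / 2.0)
    # equal mel-width bands; return the upper edge of each band in Hz
    return [from_mel(top_mel * (i + 1) / n_bands) for i in range(n_bands)]
```

Because the mel scale compresses high frequencies, equal mel widths produce narrow low-frequency bands and progressively wider high-frequency bands, a milder warping than the logarithmic spacing that over-concentrates bands below a few hundred hertz.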

The performance of the MBSS algorithm was evaluated, as described in section 4.2,

for each frequency spacing allocation. The plots for the IS distance values obtained for

linear, logarithmic and mel spacing are given in Figures 4.7, 4.10 and 4.11 respectively. The

IS distance shows a consistent improvement as the number of bands is increased. However,
the improvement in speech quality is not very noticeable when more than four frequency
bands are employed. All three spacing methods exhibited comparable performance results

bands are employed. All three spacing methods exhibited comparable performance results

and speech quality. However, logarithmic spacing caused some distortion in the lower

frequency ranges. This can be established from Figures 4.12 and 4.13, which show the

spectrograms for the sentence “The shop closes for lunch,” at 5 dB and 0 dB respectively,

processed by the MBSS algorithm using the three spacing methods with four frequency

bands. The top spectrograms show the processed speech using linear spacing. The middle

spectrograms illustrate the processed speech for logarithmic spacing. It can be observed that

there is some removal of speech in the lower frequencies. This can be explained by

observing the center frequencies listed for logarithmic spacing in Table 4.2. There is a higher

concentration of bands in the lower frequency region as the number of bands is increased,

resulting in disproportionate subtraction of lower frequencies. The bottom spectrograms

represent speech processed by mel spacing. This is comparable to the spectrograms on the

top parts of the figures.


Center Frequencies (kHz)

Number of bands | Linear Spacing | Logarithmic Spacing | Mel Spacing
1 | 2 | 2.0005 | 2.5798
2 | 1, 3 | 0.0321, 2.0316 | 1.2476, 2.9208
3 | 0.6667, 2.0, 3.3334 | 0.0084, 0.1339, 2.1260 | 0.8058, 1.7133, 3.1335
4 | 0.5, 1.5, 2.5, 3.5 | 0.0045, 0.0356, 0.2831, 2.2515 | 0.5915, 1.1911, 2.0492, 3.2772
5 | 0.4, 1.2, 2.0, 2.8, 3.6 | 0.0031, 0.0164, 0.0863, 0.4532, 2.3807 | 0.4661, 0.9066, 1.5006, 2.3012, 3.3804
6 | 0.3333, 1.0, 1.6667, 2.3333, 3.0, 3.6667 | 0.0025, 0.0099, 0.0396, 0.1576, 0.6280, 2.5020 | 0.3841, 0.7295, 1.1757, 1.7520, 2.4964, 3.4580
7 | 0.2857, 0.8571, 1.4286, 2.0, 2.5714, 3.1429, 3.7143 | 0.0021, 0.0070, 0.0228, 0.0747, 0.2442, 0.7986, 2.6116 | 0.3264, 0.6092, 0.9630, 1.4056, 1.9592, 2.6519, 3.5184
8 | 0.25, 0.75, 1.25, 1.75, 2.25, 2.75, 3.25, 3.75 | 0.0019, 0.0054, 0.0152, 0.0428, 0.1208, 0.3407, 0.9607, 2.7092 | 0.2838, 0.5225, 0.8138, 1.1693, 1.6031, 2.1325, 2.7785, 3.5668

Table 4.2: Center frequency values for linear, logarithmic and mel spacing of frequency bands.



Figure 4.10: Mean IS distance measure of the MBSS approach with logarithmic frequency spacing as a function of the number of bands for 10 sentences embedded in speech-

shaped noise at (a) 5 dB SNR and (b) 0dB SNR.


Figure 4.11: Mean IS distance measure of the MBSS approach with mel frequency spacing as a function of the number of bands for 10 sentences embedded in speech-shaped noise at

(a) 5 dB SNR and (b) 0dB SNR.


Figure 4.12: Comparison of spectrograms of enhanced speech at 5 dB SNR processed with the MBSS algorithm using four bands with (top) linear spacing, (middle) logarithmic

spacing and (bottom) mel spacing.

Figure 4.13: Comparison of spectrograms of enhanced speech at 0 dB SNR processed with the MBSS algorithm using four bands with (top) linear spacing, (middle) logarithmic

spacing and (bottom) mel spacing.


4.5 Performance with speech-silence detector

The error between the processed signal and the clean speech signal is minimized if the

estimate of the noise spectrum is accurate. Hence, it is desirable to estimate the noise signal

at every available instant to get a more accurate estimate of the noise spectrum. This is not a

problem in dual channel implementations since the noise signal is exclusively made available

in a second channel. However, in single channel methods, the noise signal has to be

estimated from the noisy speech signal itself due to the non-availability of a second noise

channel. Hence a voice activity detector (VAD) is required that will identify those frames of

the input signal in which the speaker is silent. These frames are assumed to contain only the
interfering noise, and the noise spectrum is updated from them. The VAD has to

accurately identify such periods of silence to prevent calculating an erroneous update of the

noise spectrum with parts of the speech signal, since subsequent subtraction will remove the

speech signal from the succeeding input frames. Therefore, the VAD is only required to

detect pauses between words or sentences and not transitions between phonemes or words.

Hence, the method is also called the speech-silence detector.

The MBSS algorithm with linear frequency spacing was evaluated with a speech-silence
detector as per [12], which uses a statistical model-based voice activity detection method
to identify non-speech frames. The method computes the likelihood ratio of
speech being present or absent in the input frame as:

\frac{1}{N} \sum_{k=0}^{N-1} \left[ \frac{|Y(k)|^2}{|\hat{D}(k)|^2} - \log_{10} \frac{|Y(k)|^2}{|\hat{D}(k)|^2} - 1 \right] \; \mathop{\gtrless}^{\text{speech present}}_{\text{speech absent}} \; \eta \qquad (4.4)


where η is a preset threshold. When speech was absent in the j-th frame, the noise spectrum
was updated as:

|\hat{D}_j(k)|^2 = \lambda_d \, |\hat{D}_{j-1}(k)|^2 + (1 - \lambda_d) \, |Y_j(k)|^2 \qquad (4.5)

where λ_d = 0.9.
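The decision statistic of Eq. 4.4 and the recursive noise update of Eq. 4.5 can be sketched as follows. The function names are illustrative, and the comparison of the statistic against the threshold η is left to the caller.

```python
import math

def likelihood_ratio(noisy_power, noise_est_power):
    """Frame-level decision statistic of Eq. 4.4: the mean over bins of
    gamma - log10(gamma) - 1, where gamma is the ratio of noisy speech
    power to estimated noise power in a bin."""
    total = 0.0
    for y2, d2 in zip(noisy_power, noise_est_power):
        gamma = y2 / d2
        total += gamma - math.log10(gamma) - 1.0
    return total / len(noisy_power)

def update_noise(noise_est_power, noisy_power, lam=0.9):
    """Recursive noise-spectrum update of Eq. 4.5, applied when the
    statistic stays below the threshold (speech absent)."""
    return [lam * d2 + (1.0 - lam) * y2
            for d2, y2 in zip(noise_est_power, noisy_power)]
```

When the noisy power matches the noise estimate exactly (gamma = 1 in every bin), the statistic is zero, so any positive threshold correctly labels such frames as speech-absent.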

The IS distance values obtained for the above configuration of the MBSS method are
shown in Figure 4.14 for 5 dB and 0 dB SNRs. As seen earlier, the IS distance values

consistently decrease as the number of bands is increased.


Figure 4.14: Mean IS distance measure of the MBSS approach with linear frequency spacing and speech-silence detector, as a function of the number of bands for 10 sentences

embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0dB SNR.


Figure 4.15: Spectrograms of speech enhanced with the MBSS algorithm using four linearly spaced frequency bands with a speech-silence detector, at (top) 5 dB SNR and (bottom) 0

dB SNR.

Global and segmental SNR values were calculated for both noise conditions for all
four versions of the MBSS algorithm using four frequency bands. The gain in the mean
SNR values calculated over the ten sentences was seen to be consistent and comparable for
all four versions (see Table 4.3).

4.6 Subjective evaluation of speech intelligibility

Subjective tests are conducted by having human subjects listen to prepared test
speech files and evaluate them according to specified criteria. Intelligibility tests were
carried out at the Callier Center for Communication Disorders / UTD on seven
hearing-impaired subjects with severe to profound hearing loss.


Table 4.3: Mean global and segmental SNR calculated over 10 sentences at 5 and 0 dB SNR.

Speech enhanced by the MBSS algorithm with four linearly spaced frequency bands was evaluated against the noisy speech. Twenty different sentences, sampled at 20 kHz, were used for each condition. The sentences were corrupted with speech-shaped noise at 0 dB SNR. The noise-corrupted sentences were played in random order through speakers in a sound-insulated booth. The sound pressure level was maintained at an average of 67 dB SPL with a variance of 2 dB. The subjects were asked to repeat the sentence they heard. Intelligibility was measured as the percentage of words correct.
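Scoring a response as percentage of words correct can be sketched as below. This is a simple proxy for the manual scoring used in the test; the function name and the decision to ignore case and word order are assumptions for illustration.

```python
def percent_words_correct(presented, response):
    """Score a sentence as the percentage of presented words that the
    listener repeated, ignoring case and word order."""
    target = [w.lower() for w in presented.split()]
    heard = [w.lower() for w in response.split()]
    correct = 0
    for w in target:
        if w in heard:
            heard.remove(w)   # each response word can credit only one target word
            correct += 1
    return 100.0 * correct / len(target)
```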

Figure 4.16 shows bar plots of the intelligibility scores achieved by each subject, along with the average score, for both test conditions. The tests showed no increase in the intelligibility of the corrupted speech after processing by the speech enhancement algorithm. This result is in accordance with those obtained with other spectral subtraction methods [5]. On average, the subjects' scores decreased by 22% for the enhanced speech. However, one subject (S6) actually scored better on the test with the speech enhanced by the MBSS algorithm.


Figure 4.16: Intelligibility test results for seven subjects, scored as percentage of words correct.

4.7 Optimal configuration

The results obtained by objective evaluation indicate the best speech quality that can be obtained with the different configurations of the algorithm. From the performance plots (Figures 4.6, 4.7, 4.10, 4.11 and 4.12) of the mean IS distance for the five configurations described above, it is evident that the MBSS algorithm performs best when the subtraction process is preceded by spectral smoothing and magnitude spectral averaging, with linear spacing of the frequency bands. A speech-silence detector is necessary for a practical implementation, as it provides a better estimate of the noise by updating the noise spectrum during speech pauses.
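The two pre-processing steps named above, smoothing across frequency followed by weighted magnitude averaging across neighbouring frames, can be sketched as follows. The specific window and frame weights below are illustrative assumptions, not the values used in the thesis.

```python
import numpy as np

def preprocess_magnitude(frames_mag, freq_win=(0.25, 0.5, 0.25),
                         time_weights=(0.09, 0.25, 0.32, 0.25, 0.09)):
    """Smooth each frame's magnitude spectrum across frequency, then
    average magnitudes across adjacent frames (weights illustrative).

    frames_mag : 2-D array, shape (n_frames, n_bins), of |Y_j(k)|
    """
    # spectral smoothing: short moving average along each spectrum (axis 1)
    smoothed = np.apply_along_axis(
        lambda m: np.convolve(m, freq_win, mode="same"), 1, frames_mag)
    # weighted magnitude averaging across neighbouring frames (axis 0)
    w = np.asarray(time_weights) / np.sum(time_weights)
    averaged = np.apply_along_axis(
        lambda t: np.convolve(t, w, mode="same"), 0, smoothed)
    return averaged
```

Both steps reduce the variance of the spectral estimates before subtraction, which is what suppresses the isolated spectral peaks that would otherwise become musical noise.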

For comparative purposes, the performance of the traditional power spectral subtraction (PSS) method, as implemented by Berouti et al. [2], is given in Figure 4.17 along with IS



measures for the proposed method. The proposed multi-band spectral subtraction approach (with more than three bands) performed better than the PSS approach at both SNRs.
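For reference, the single-band PSS baseline applies one over-subtraction factor to the entire spectrum. The sketch below uses typical Berouti-style parameter values as assumptions; they are not necessarily those used in the comparison.

```python
import numpy as np

def pss_frame(noisy_psd, noise_psd, alpha=4.0, beta=0.002):
    """One frame of conventional power spectral subtraction: a single
    over-subtraction factor alpha for the whole spectrum, with a
    spectral floor beta to limit negative values (parameter values
    illustrative)."""
    sub = noisy_psd - alpha * noise_psd
    # spectral flooring prevents negative power estimates
    return np.maximum(sub, beta * noisy_psd)
```

Because a single alpha must serve the whole spectrum, PSS either under-subtracts in bands where the colored noise is strong or over-subtracts where it is weak, which is the shortcoming the multi-band approach addresses.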


Figure 4.17: Comparison of the performance, in terms of the mean IS distance measure, of power spectral subtraction (indicated with 'PSS') with the multi-band spectral subtraction approach, as a function of the number of bands, for 10 sentences embedded in speech-shaped noise at (a) 5 dB SNR and (b) 0 dB SNR.

While the IS distance shows a slight improvement for a higher number of bands, there was no perceivable improvement in speech quality. Informal listening tests indicated that the multi-band approach yielded very good speech quality, with very little trace of musical noise and minimal, if any, speech distortion.

The lack of musical noise can also be seen in Figures 4.18 and 4.19, which show the spectrograms of speech enhanced with multi-band spectral subtraction (4 bands) and speech enhanced with power spectral subtraction.


Figure 4.18: Spectrograms of the sentence "The shop closes for lunch." at 5 dB SNR. The top spectrogram is the corrupted signal, the middle spectrogram is the enhanced signal obtained by the multi-band spectral subtraction method using 4 linearly spaced frequency bands, and the bottom spectrogram is the enhanced signal obtained by the power spectral subtraction method.


Figure 4.19: Spectrograms of the sentence "The shop closes for lunch." at 0 dB SNR. The top spectrogram is the corrupted signal, the middle spectrogram is the enhanced signal obtained by the multi-band spectral subtraction method using 4 linearly spaced frequency bands, and the bottom spectrogram is the enhanced signal obtained by the power spectral subtraction method.


CHAPTER FIVE

SUMMARY AND CONCLUSIONS

The work in this thesis addressed the problem of enhancing speech in noisy conditions. A

multi-band spectral subtraction method, based on the direct estimation of the short-term

spectral amplitude of speech and the non-uniform effect of noise on speech, was proposed.

The results establish the superiority of the proposed method over the conventional spectral

subtraction method with respect to speech quality of the enhanced signal and reduced

residual noise.

The major contributions of this thesis are:

(a) Development of a multi-band speech enhancement strategy based on the spectral

subtraction method. Speech processed by the new algorithm shows reduced levels of

residual noise and good speech quality.

(b) Proposal of a band subtraction factor δi that provides greater control over the subtraction process in each band and can be tuned to minimize speech distortion.

(c) Evaluation of various pre-processing strategies for improving the output speech quality.

It was shown that spectral smoothing and weighted spectral averaging of the input

speech spectrum helped preserve the speech content and improved speech quality.

(d) Assessment of linear and non-linear frequency spacing techniques. Linear and mel

frequency spacing methods provide consistently good results.


(e) Determination of the optimal number of frequency bands for the MBSS algorithm. Results showed that the speech quality obtained using four frequency bands is comparable to that obtained with a higher number of bands. This significantly reduces computation compared to using critical bands (23 bands) or performing subtraction for every frequency component of the FFT (e.g., 256, 512 or 1024 bands).

The multi-band spectral subtraction method provides a definite improvement over the

conventional power spectral subtraction method and does not suffer from musical noise. The

improvement can be attributed to the fact that the multi-band approach takes into account the

non-uniform effect of colored noise on the spectrum of speech. The added computational

complexity of the algorithm is minimal. Four linearly spaced frequency bands were found to

be adequate in obtaining good speech quality.
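The per-band subtraction that these conclusions describe can be sketched as follows. The Berouti-style rule for the over-subtraction factor and the band-weight values δi are illustrative assumptions; the thesis derives its own values empirically.

```python
import numpy as np

def mbss_frame(noisy_psd, noise_psd, n_bands=4, beta=0.002):
    """One frame of multi-band spectral subtraction (sketch).

    Splits the spectrum into linearly spaced bands, computes a band
    SNR, derives an over-subtraction factor alpha_i from it, and
    applies a band weight delta_i; beta is the spectral floor.
    Alpha rule and delta values are illustrative.
    """
    n_bins = len(noisy_psd)
    edges = np.linspace(0, n_bins, n_bands + 1, dtype=int)
    out = np.empty_like(noisy_psd)
    for i in range(n_bands):
        b = slice(edges[i], edges[i + 1])
        snr_db = 10.0 * np.log10(np.sum(noisy_psd[b]) / np.sum(noise_psd[b]))
        # over-subtraction: larger at low band SNR (Berouti-style rule)
        alpha = np.clip(4.0 - 0.15 * snr_db, 1.0, 5.0)
        delta = 1.0 if i == 0 else 2.5          # illustrative band weights
        sub = noisy_psd[b] - alpha * delta * noise_psd[b]
        out[b] = np.maximum(sub, beta * noisy_psd[b])  # spectral floor
    return out
```

Computing alpha per band rather than globally is what lets the method track the non-uniform effect of colored noise across the spectrum.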

Further research can be conducted to adaptively calculate the value of the band subtraction factor δi, in place of the empirically derived value proposed in this thesis.

The algorithm can be implemented in real time on a fixed-point digital signal processor (DSP) platform (e.g., the Texas Instruments TMS320C54x/55x) for evaluation in real-world

conditions. This would require a detailed quantization analysis of the algorithm. Fixed-point

DSPs are becoming increasingly popular in applications such as cellular phones, personal entertainment devices, digital hearing aids and headsets due to their low power consumption

and high processing rates. Speech enhancement algorithms are a major component of these

applications for operation in adverse environments. The proposed method can eventually be

incorporated into such systems. However, these applications also demand low MIPS (Million

Instructions Per Second), i.e., a low number of operations, to conserve battery life. Hence a

study can be made to optimize the processes involved. For instance, an alternative method to


calculate the over-subtraction factor αi can be researched, because the use of the log function

is computationally expensive in real-time systems. Also, methods can be developed to

preserve the transitional regions and unvoiced regions, which contain low speech levels.
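One way to avoid the run-time logarithm is to precompute the mapping from band power ratio to αi as a lookup table, since αi is a clipped monotone function of the SNR. This is an illustrative optimization sketch, not the method proposed in the thesis, and the alpha rule below is an assumed Berouti-style mapping.

```python
import numpy as np

# Table built once, off-line: quantized power ratios and their alpha
# values.  alpha = clip(4 - 0.15 * SNR_dB, 1, 5) with
# SNR_dB = 10*log10(ratio), i.e. alpha = clip(4 - 1.5*log10(ratio), 1, 5).
_RATIOS = np.logspace(-0.5, 2.0, 64)                          # 10^-0.5 .. 10^2
_ALPHAS = np.clip(4.0 - 1.5 * np.log10(_RATIOS), 1.0, 5.0)

def alpha_from_ratio(power_ratio):
    """Look up the over-subtraction factor for a band power ratio
    without evaluating a logarithm at run time."""
    idx = np.searchsorted(_RATIOS, power_ratio)
    return _ALPHAS[min(idx, len(_ALPHAS) - 1)]
```

On a fixed-point DSP the binary search reduces to a handful of compares, trading a small table of memory for the cost of the log evaluation.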


BIBLIOGRAPHY

[1] L. Arslan, A. McCree and V. Viswanathan, "New methods for adaptive noise suppression," Proc. ICASSP, vol. 1, pp. 812-815, May 1995.

[2] M. Berouti, R. Schwartz and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 208-211, Apr. 1979.

[3] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. 27, pp. 113-120, Apr. 1979.

[4] Y. Cheng and D. O'Shaughnessy, "Speech enhancement based conceptually on auditory evidence," Proc. ICASSP, vol. 2, pp. 961-964, Apr. 1991.

[5] J. Deller Jr., J. Hansen and J. Proakis, Discrete-Time Processing of Speech Signals, NY: IEEE Press, 2000.

[6] Y. Ephraim, "Statistical-model-based speech enhancement systems," Proc. IEEE, vol. 80, no. 10, pp. 1526-1555, Oct. 1992.

[7] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984.

[8] Y. Ephraim and H. Van Trees, "A signal subspace approach for speech enhancement," IEEE Trans. Speech Audio Process., vol. 3, pp. 251-266, Jul. 1995.


[9] Z. Goh, K. Tan and T. Tan, "Postprocessing method for suppressing musical noise generated by spectral subtraction," IEEE Trans. Speech Audio Process., vol. 6, pp. 287-292, May 1998.

[10] J. Hansen and B. Pellom, "An effective quality evaluation protocol for speech enhancement algorithms," Proc. Int. Conf. on Spoken Language Processing, vol. 7, pp. 2819-2822, Sydney, Australia, Dec. 1998.

[11] C. He and G. Zweig, "Adaptive two-band spectral subtraction with multi-window spectral estimation," Proc. ICASSP, vol. 2, pp. 793-796, 1999.

[12] Y. Hu, M. Bhatnagar and P. Loizou, "A cross-correlation technique for enhancing speech corrupted with correlated noise," Proc. ICASSP, vol. 1, pp. 673-676, 2001.

[13] S. Kamath and P. Loizou, "A multi-band spectral subtraction method for enhancing speech corrupted by colored noise," submitted to ICASSP 2002.

[14] P. Kasthuri, "Multichannel speech enhancement for a digital programmable hearing aid," Master's thesis, University of New Mexico, 1999.

[15] W. Kim, S. Kang and H. Ko, "Spectral subtraction based on phonetic dependency and masking effects," IEE Proc. Vis. Image Signal Process., vol. 147, no. 5, Oct. 2000.

[16] H. Levitt, "Noise reduction in hearing aids: An overview," Journal of Rehabilitation Research and Development, vol. 38, no. 1, Jan./Feb. 2001.

[17] J. Lim and A. Oppenheim, "All-pole modeling of degraded speech," IEEE Trans. Acoust., Speech, Signal Process., vol. 26, no. 3, pp. 197-210, June 1978.


[18] J. Lim and A. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proc. IEEE, vol. 67, no. 12, pp. 1586-1604, Dec. 1979.

[19] P. Lockwood and J. Boudy, "Experiments with a nonlinear spectral subtractor (NSS), hidden Markov models and the projection, for robust speech recognition in cars," Speech Communication, vol. 11, nos. 2-3, pp. 215-228, 1992.

[20] G. Miller and P. Nicely, "An analysis of perceptual confusions among some English consonants," J. Acoust. Soc. Am., vol. 27, no. 2, pp. 338-352, Mar. 1955.

[21] M. Nilsson, S. Soli and J. Sullivan, "Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise," J. Acoust. Soc. Am., vol. 95, pp. 1085-1099, 1994.

[22] T. Peterson and S. Boll, "Acoustic noise suppression in the context of a perceptual model," Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., pp. 1086-1088, 1981.

[23] S. Quackenbush, T. Barnwell and M. Clements, Objective Measures of Speech Quality, Prentice-Hall, 1988.

[24] A. Rezayee and S. Gazor, "An adaptive KLT approach for speech enhancement," IEEE Trans. Speech Audio Process., vol. 9, pp. 87-95, Feb. 2001.

[25] M. Sambur, "Adaptive noise canceling for speech signals," IEEE Trans. Acoust., Speech, Signal Process., vol. 26, pp. 419-423, 1978.

[26] S. Savadatti, "Real-time, fixed-point implementation of multi-channel speech amplitude compression," Master's thesis, University of New Mexico, 2000.


[27] M. Schroeder, "Models of hearing," Proc. IEEE, vol. 63, no. 9, pp. 1332-1350, Sept. 1975.

[28] I. Soon, S. Koh and C. Yeo, "Selective magnitude subtraction for speech enhancement," Proc. Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, vol. 2, pp. 692-695, 2000.

[29] N. Virag, "Single channel speech enhancement based on masking properties of the human auditory system," IEEE Trans. Speech Audio Process., vol. 7, pp. 126-137, Mar. 1999.

[30] N. Virag, "Speech enhancement based on masking properties of the human auditory system," Master's thesis, Swiss Federal Institute of Technology, 1996.

[31] K. Wu and P. Chen, "Efficient speech enhancement using spectral subtraction for car hands-free application," Int. Conf. on Consumer Electronics, vol. 2, pp. 220-221, 2001.


VITA

Sunil Kamath was born in Mumbai, India on December 27, 1974, the son of the late Shri Mangalore Devdas Pandurang Kamath and Shrimati Geetha Kamath. After completing his pre-university education at Canara Pre-University College, Mangalore, he joined Karnatak University, Dharwad, India, where he received the Bachelor's degree in Electrical and Electronic Engineering in 1996. He worked as a network engineer at Microland Ltd., India until 1999.

He was admitted to the Master's program in the Department of Electrical Engineering at the University of New Mexico, Albuquerque in August 1999. He transferred to the Master's program in the Electrical Engineering Department at the University of Texas at Dallas in August 2000. He has been working with the Callier Center for Communication Disorders on speech enhancement in hearing aids since August 2000.