Audio Processing Material

Abstract

Many interactive applications, such as video games, require processing a large number of sound signals in real time. This project proposes a novel perceptually-based and scalable approach for efficiently filtering and mixing a large number of audio signals and for producing echoes in an audio signal. Key to its efficiency is a pre-computed Fourier frequency-domain representation augmented with additional descriptors. The descriptors can be used during the real-time processing to estimate which signals are not going to contribute to the final mixture. In addition, we propose an importance sampling strategy that allows the processing load to be tuned against the quality of the output. We demonstrate our approach for a variety of applications including equalization and mixing, reverberation processing and spatialization. It can also be used to optimize audio data streaming or decompression. By reducing the number of operations and limiting bus traffic, our approach yields a 3 to 15-fold improvement in overall processing rate compared to brute-force techniques, with minimal degradation of the output. The input is a speech signal, which we process to produce single and multiple echoes by delaying it by a chosen time period, and the result is rendered as sound.


Chapter 1 INTRODUCTION

Many interactive applications such as video games, simulators and visualization/sonification interfaces require processing a large number of input sound signals in real time (e.g., for spatialization). Typical processing includes sound equalization, filtering and mixing, and is usually performed for each of the inputs individually. In modern video games, for instance, hundreds of audio samples and streams might have to be combined to re-create the various spatialized sound effects and background ambiance. This results in both a large number of arithmetic operations and heavy bus traffic. Although consumer-grade audio hardware can be used to accelerate some pre-defined effects, the limited number of simultaneous hardware voices calls for more sophisticated voice-management techniques. Besides, contrary to their modern graphics counterparts, consumer audio hardware accelerators still implement fixed-function pipelines, which might eventually limit the creativity of audio designers and programmers. Hence, designing efficient software solutions is still of major interest. While perceptual issues have been a key aspect in the field of audio compression (e.g., mp3), most software audio processing pipelines still use brute-force approaches which are completely independent of the signal content. As a result, the number of audio streams they can process is quickly limited, since the amount of processing cannot be adapted on demand to satisfy a predefined time vs. quality tradeoff. This is especially true for multimedia or multi-modal applications where only a small fraction of the CPU time can be devoted to audio processing. In recent years, several contributions have been introduced that aim to bridge the gap between perceptual audio coding and audio processing in order to make audio signal processing pipelines more efficient. A family of approaches proposed to directly process perceptually-coded audio signals, yielding faster implementations than a full decode-process-recode cycle. Although they are well suited to distributed applications involving streaming over low-bandwidth channels, they require specific coding of the filters and processing. Moreover, they can guarantee neither efficient processing for a mixture of several signals, nor that they would produce an "optimal" processing for the mixture. Others, inspired by psycho-acoustic research and audio coding work, tried to use perceptual knowledge to optimize various applications. For instance, a recent paper by


Tsingos et al. proposed a real-time voice management technique for 3D audio applications which evaluates audible sound sources at each frame of the simulation and groups them into clusters that can be directly mapped to hardware voices. The necessary sub-mixing of all sources in each cluster is done in software at fixed cost. Dynamic auditory masking estimation has also been successfully used to accelerate modal synthesis. In the context of long FIR filtering for reverberation processing, recent work also shows that significant improvement can be obtained by estimating whether the result of the convolution is below the hearing threshold, hence reducing the processing cost. In this work, we build on these approaches and propose a scalable, perceptually-based audio processing strategy that can be applied to a frequency-domain processing pipeline performing filtering and mix-down operations on the input audio signals used to produce echoes. Key to our approach is the choice of a signal representation that allows its progressive encoding and reconstruction. Here, we use speech signals as a convenient and widely used input which satisfies these constraints. In this context, we present a set of techniques to:

• dynamically maintain features of the input audio signals to process (for instance, based on pre-computed information on the input audio samples),

• dynamically evaluate auditory masking between the audio frames (including generated echoes) that have to be processed and mixed down to produce a frame of audio output,

• implement a scalable processing pipeline by fitting a predefined budget of operations to the overall task based on the importance of each input audio signal.

Audio products must sound good. That is a given. However, the determination of what constitutes "good sound" has been controversial. Some assert that it is a matter of personal taste, that our opinions of sound quality are as variable as our tastes in "wine, persons or song". This would place audio manufacturers in the category of artists, trying to appeal to a varying public "taste". Others, like the author, take a more pragmatic view, namely that artistry is the domain of the instrument makers and musicians, and that it is the role of audio devices to capture, store and reproduce their art with as much accuracy as technology allows. The audio industry then becomes the messenger of the art. Interestingly, though, this process has created new "artists", the recording engineers, who are free to editorialize on the impressions of direction, space, timbre and dynamics of the original performance, as perceived by listeners through their audio systems. Other creative opportunities exist at the point of reproduction, as audiophiles tailor the fundamental form of the sound field in listening rooms by selecting loudspeakers of differing timbral signatures and directivities, and by adjusting the acoustics of the listening space with furnishings or special acoustical devices. To design audio products, engineers need technical measurements. Historically, measurements have been viewed with varying degrees of trust.


However, in recent years the value of measurements has increased dramatically, as we have found better ways to collect data and as we have learned how to interpret the data in ways that relate more directly to what we hear. Measurements inevitably involve objectives, telling us when we are successful. Some of these design objectives are very clear, and others still need better definition. All of them need to be moderated by what is audible. Imperfections in performance need not be unmeasurably small, but they should be inaudible. Achieving this requires knowledge of psychoacoustics, the relationship between what we measure and what we hear. This is a work in progress, but considerable gains have been made. Research in this area aimed to determine the extent to which listeners agree on their preferences in sound quality and, beyond that, to identify relationships between listener preferences and measurable performance parameters of loudspeakers. Given that loudspeakers, listeners and rooms form a complex acoustical system, some of the effort was necessarily devoted to identifying the aspects of performance that maximize the performance of the entire system in real-world circumstances.

The sound signal. We normally perceive sound as acoustical pressure variations in air. These pressure variations are transmitted through the air as compression waves. A sound object stores these acoustical pressure variations of a sound in air.

1.1. The sound editor

In figure 1.1 we show the sound editor: the window that appears when you click the Edit command of a selected sound. The sound editor shows the sound amplitude vertically as a function of time, which is displayed horizontally. The most important parts of the editor have been numbered in the figure:


Figure 1.1.: The basic sound editor. The numbered parts are explained in the text.

1. The title bar. It shows, from left to right, the id of the sound, the type of object being edited, i.e. "Sound", and the name of the sound being edited. This is typical for the title bar of all editors: they always show the id, the type of object and the name of the object, respectively.

2. The menu bar. Besides options for displaying the sound amplitudes, we can display various other aspects of the sound, like the spectrogram, pitch, intensity, formants and pulses. In figure 1.1 we have explicitly chosen to display none of these; in figure 1.2, on the other hand, all these aspects of the sound are displayed. The spectrogram is shown as grey values in the drawing area just below the sound amplitude area. The horizontal dimension of a spectrogram represents time and the vertical dimension represents frequency in hertz. The time-frequency area is divided into cells; the strength of a frequency in a certain cell is indicated by its blackness. Black cells have a strong frequency presence while white cells have a very weak presence. The maximum and minimum frequencies represented in the spectrogram are displayed on the left; here they are 0 Hz and 6000 Hz, respectively. The characteristics of the spectrogram can be modified with options from the Spectrum menu. More information on the spectrogram is given later. The pitch is drawn as a blue dotted line in the spectrogram area. The minimum and maximum pitch are drawn on the right side in blue; here they


are 70 Hz and 300 Hz, respectively. The specifics of the pitch analysis can be varied with options from the Pitch menu. The intensity is drawn as a solid yellow line in the spectrogram area as well. The peakedness of the intensity curve is influenced by the "Pitch settings..." from the Pitch menu, however; the Intensity menu settings only influence the display of the intensity, not its measurement. The minimum and maximum values of the scale are in dB and are shown in green on the right side inside the display area (the exact location depends on whether the pitch or the spectrogram are present too). More on intensity follows later. Formant frequency values are displayed with red dots in the spectrogram area. The Formant menu has options about how to measure the formants and about how many formants to display. Finally, the Pulses menu enables the glottal pulse moments to be displayed.

3. The sound display area. It shows the sample "amplitudes" as a function of time.

4. Rows of rectangles that represent parts of the sound with their durations. In the figure only five rectangles are displayed, and each one, when clicked, plays a different part of the sound. For example, the rectangle at the top left with the number "0.607803" displayed in it plays the first 0.607803 s of the sound, left from the cursor. The rectangle in the second row on the right plays the 4.545266 s of the sound that are not displayed in the window. A maximum of eight rectangles are displayed, when the part of the sound displayed in the window lies somewhere between the start and end and a selection is present at the same time. A selection is indicated by a pink part in the display area; in that case a rectangle also appears above the sound display area, in which the duration and its inverse are displayed. This is shown in figure 2.2, where the duration "0.1774555" with its inverse "5.729 / s" is displayed. The times of the left and right cursor of the selection are also shown on top. The selection starts at time 0.982446, which is of course equal to the sum of the durations in the two rectangles below the display, with the values 0.111519 and 0.870928.

5. Global selection buttons. After clicking the "all" button the complete signal is displayed in the display area. The "in", "out" and "back" buttons zoom in, zoom out or go back. By successively clicking "in" you will see ever more detail of the sound; this goes on until the duration of the display area has become smaller than the sample time, in which case no signal is drawn anymore. By successively clicking "out" you will see less detail; this goes on until the complete sound can be drawn within the display area, and further zooming out is not possible.

6. A cursor. The cursor position is indicated by a vertical red dashed line; its time position is written above the display area.

7. A horizontal scroll bar. We can either drag the scroll bar or click in the part not covered by it to change the part of the sound that is displayed.

8. A grouping feature. Grouping only applies if you have more than one time-based editor open on the same sound. If


grouping is on, as is shown in both figures, all time actions act synchronously in all grouped windows. For example, if you make a selection in one of the grouped editors, the selection is replicated in all the editors, and if you scroll in one of them all other editors scroll along.

9. A number displaying the average value of the sound part in this window. For this speech fragment the value happens to be "0.007012". In general this number has to be close to zero, as it is here; otherwise there is an offset in your signal. An offset can be removed with the "Modify > Subtract mean" command.

Figure 1.2.: The sound editor with display of spectrogram, pitch, intensity, formants and pulses.

1.2. How do we analyse a sound?

First we have to deal with a fundamental limitation of our analysis methods: most of them are not designed to analyse sounds whose characteristics change over time, although some special methods have been developed that can analyse signals that change in a predictable way. Speech, however, is not a predictable signal. This is a dilemma. The practical solution is to model the speech signal as a slowly varying function of time. By slowly varying we mean that during intervals of, say, 5 to 25 ms the speech


characteristics don’t change too much and can be considered as almost constant. The parts of the speech signal where this model holds best are the vowels, because the articulators do not move as fast there as in other parts of speech (for example during the very fast release of a plosive). Therefore, to analyse a speech sound we cut up the sound into small segments, analyse each interval separately and pretend it has constant characteristics. The analysis of a whole sound is thus split up into a series of analyses on successive intervals. The length of the analysis interval depends on what kind of information we want to extract from the speech signal, and some analysis methods are more sensitive to changes during the analysis interval than others. As we may notice by looking carefully at a speech signal, intervals where the sound signal characteristics do not vary at all probably don’t exist in real speech. The analysis results for each analysis interval therefore always represent some kind of average value(s) for that interval. In the analysis we do not really cut up the signal into separate consecutive pieces but use overlapping segments, i.e. successive analysis segments have signal parts in common: the last part of a segment overlaps the beginning of the next segment. In figure 1.3 a scheme of a generic analysis is given.


Figure 1.3.: General analysis scheme.

In the upper left part the figure shows (the first part of) a sound signal and in the upper right part the first part of the analysis results. The next lines visualize the cutting up of the sound into successive segments. Each segment is analysed in the rectangular block labeled "Analysis", the results of the analysis are stored in an analysis frame, and the analysis frame is stored in the output object. What happens in the "Analysis" block depends of course on the particular type of analysis, and the contents of the analysis frames also depend on the analysis: a pitch analysis will store pitch candidates, a formant analysis will store formants. Before the analysis can start, at least the following three parameters have to be specified:

1. The "window length". As was said before, the signal is cut up into small segments that will be individually analysed. The window length is the duration of such a segment, and this duration holds for all the segments in the sound. In many analyses "Window length" is one of the parameters of the form that appears if you click on the specific analysis. For example, if you have selected a sound and click on the "To Formant (burg)" action, a form appears where you can choose the window length. Sometimes the analysis width is derived from other information: for pitch measurements the window length is derived from the lowest pitch you are interested in. There is no single optimal window length that fits all circumstances, as it depends on the type of analysis and the type of signal being analysed. For example, to make spectrograms one often


chooses either 5 ms for a wideband spectrogram or 40 ms for a narrowband spectrogram. For pitch analysis one often chooses a lowest frequency of 75 Hz, and this implies a window length of 40 ms; if you want to measure lower pitches the window length increases.

2. The "time step". This parameter determines the amount of overlap between successive segments. If the time step is much smaller than the window length we have much overlap; if the time step is larger than the window length we have no overlap at all. In general we like to choose the time step smaller than half the window length.

3. The "window shape". This parameter determines how a segment is cut from the sound. In general we want the sound segment's amplitudes to start and end smoothly; a windowing function can do this for us. Translated to the analog domain, a windowing function is like a fade-in followed immediately by a fade-out. In the digital domain things are much simpler: we copy a segment and multiply all the sample values by the corresponding value of the window function. The window values are normally near zero at the borders and near one in the middle. In figure 1.3 the window function is drawn in the sound; the form of the function stays the same during the analysis. The coloured sound segments are the result of windowing the sound, i.e. multiplying the sound by the window of the same colour. A lot of different window shapes have been popular in speech analysis; we name the square (or rectangular) window, the Hamming window, the Hanning window and the Bartlett window. In Praat the default windowing function is the Gaussian window. When we discuss the spectrogram we will learn more about windowing functions.

Now the analysis algorithm scheme can be summarized:

1. Determine the values of the three parameters window length, time step and window shape from the values specified by the user. The duration of the sound and the values of these three parameters together determine how many times we have to perform the analysis on separate segments. This number therefore equals the number of times we have to store an analysis frame; we call it the number of frames.

2. Calculate the midpoint t0 of the first window.

3. Copy the windowed sound segment, centered at t0 and with duration windowLength, from the sound and analyse it.

4. Store the analysis frame in the output object.

5. If this is the last frame then stop. Else calculate a new t0 = t0 + timeStep and start again at step 3.

Once the analysis has finished, a new object will appear in the list of objects, for example a Pitch, a Formant or a Spectrogram.
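As an illustration only, the following NumPy sketch implements this generic analysis loop under some assumed choices: a Hanning window (not Praat's default Gaussian window) and the RMS level of each windowed segment as a stand-in for a real pitch or formant analysis. It is not Praat's own implementation.

    import numpy as np

    def frame_analysis(x, fs, window_length=0.025, time_step=0.010):
        """Cut a signal into overlapping windowed frames and analyse each one.

        x             : 1-D array of sound samples
        fs            : sampling frequency in Hz
        window_length : duration of one analysis segment in seconds
        time_step     : hop between successive segment mid-points in seconds
        Returns a list of (midpoint_time, analysis_value) pairs, one per frame.
        """
        n_win = int(round(window_length * fs))       # samples per segment
        n_hop = int(round(time_step * fs))           # samples between segments
        window = np.hanning(n_win)                   # window shape (illustrative choice)
        frames = []
        for start in range(0, len(x) - n_win + 1, n_hop):
            segment = x[start:start + n_win] * window    # windowed copy of the segment
            t0 = (start + n_win / 2) / fs                # mid-point of this window
            value = np.sqrt(np.mean(segment ** 2))       # stand-in "analysis": RMS level
            frames.append((t0, value))                   # store one analysis frame
        return frames

    # usage: analyse one second of a 100 Hz tone sampled at 44.1 kHz
    fs = 44100
    t = np.arange(fs) / fs
    tone = 0.5 * np.sin(2 * np.pi * 100 * t)
    print(len(frame_analysis(tone, fs)))   # number of analysis frames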


1.3. How to make sure a sound is played correctly?

Whenever you want to play a sound there are a number of things you need to know. First of all, for the sound to play correctly it is mandatory that the amplitudes of the sound always stay within the limits -1 and +1. This guarantees that the sound will not be corrupted during the transformation from its sampled representation to an analog sound signal. The "Scale peak..." command multiplies all amplitudes by the same factor to ensure that the maximum amplitude does not exceed 0.99. In the second place, fast and large amplitude variations in the sampled sound have to be avoided, because these will be audible as clicks during playing and therefore create all kinds of perceptual artefacts. Sometimes it may happen that, due to errors in the recording equipment, a sound signal shows a non-negligible offset. By this we mean that the average value of the signal is not close to zero. The average value of this signal is clearly not around zero but around 0.4. What we want is that, on average, the parts of a signal above and below zero are approximately equal. This can be arranged by subtracting the average value, or mean, from the signal. The signal drawn with the dotted line is the result of applying the "Subtract mean" command. For a mono signal this amounts to subtracting the mean in a one-line script; for a stereo sound the means are determined for each channel separately.
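A minimal NumPy sketch of the two safeguards described above, peak scaling to at most 0.99 and mean subtraction. It mirrors the effect of the commands mentioned in the text but is not Praat's own code; the function names are our own.

    import numpy as np

    def scale_peak(x, maximum=0.99):
        """Multiply all amplitudes by one factor so the largest magnitude is `maximum`."""
        peak = np.max(np.abs(x))
        return x if peak == 0 else x * (maximum / peak)

    def subtract_mean(x):
        """Remove a DC offset so the parts above and below zero balance on average."""
        return x - np.mean(x)        # for a stereo sound, apply this per channel

    # usage: a signal with a 0.4 offset, as in the example above
    x = 0.4 + 0.3 * np.sin(np.linspace(0, 2 * np.pi * 50, 22050))
    y = scale_peak(subtract_mean(x))
    print(round(np.mean(y), 6), round(np.max(np.abs(y)), 2))   # ~0.0 and 0.99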


1.4. Special sound signals

The "Create Sound from formula..." command offers the possibility to create all kinds of fancy signals by varying the formula field.

1.4.1. Creating tones

A mono tone of 1000 Hz with a sampling frequency of 44100 Hz and a duration of 0.5 s can be created with a single formula line.

1.4.2. Creating noise signals

In many experiments noise sounds are required. Noise sounds can be made by generating a sequence of random amplitude values. Different kinds of noise exist; the two most important ones are white noise and pink noise. In white noise the power spectrum of the noise is flat: all frequencies have approximately the same strength. In pink noise the power spectrum depends on the frequency as 1/f. Both types of noise can be made easily. For white noise we can use functions that generate random numbers from a random number distribution, such as randomGauss(mu, sigma).
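A NumPy sketch of the same ideas, for illustration: a 1000 Hz mono tone at a 44100 Hz sampling frequency and 0.5 s duration, plus Gaussian white noise, with a crude 1/f shaping step to approximate pink noise. The helper random_gauss only mirrors the randomGauss(mu, sigma) formula function mentioned above; the amplitudes and the pink-noise shaping are assumptions, not a reproduction of any particular tool's output.

    import numpy as np

    fs = 44100                        # sampling frequency in Hz
    duration = 0.5                    # seconds
    t = np.arange(int(fs * duration)) / fs

    # a mono tone of 1000 Hz
    tone = 0.5 * np.sin(2 * np.pi * 1000 * t)

    # white noise: independent Gaussian samples, flat power spectrum on average
    def random_gauss(mu, sigma, n):
        return np.random.normal(mu, sigma, n)

    white = random_gauss(0.0, 0.1, len(t))

    # rough pink noise: scale the white-noise spectrum by 1/sqrt(f) so power falls as 1/f
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(len(white), 1 / fs)
    spectrum[1:] /= np.sqrt(freqs[1:])             # leave the DC bin untouched
    pink = np.fft.irfft(spectrum, n=len(white))
    pink *= np.std(white) / np.std(pink)           # match the white-noise level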


Chapter 2 Background Theory

2.1 Background Theory

In order to understand the content presented in this thesis it is first necessary to provide some background information regarding digital signal theory.

2.1.1 Discrete Time Signal

A discrete-time signal is a sequence or a series of signal values defined at discrete points of time, see Figure 1.1. These discrete points of time can be denoted t_k.

Fig 1.2.1 Discrete Time Signal


Here k is an integer time index. The distance in time between successive points of time is the time step, which can be denoted h. Thus

h = t_k - t_(k-1)

The time series can be written in various ways:

{x(t_k)} = {x(kh)} = {x(k)} = {x(0), x(1), x(2), ...}    (1.1)

To make the notation simple, we can write the signal as x(t_k) or x(k). Examples of discrete-time signals are logged measurements, and the input signal to and the output signal from a digital filter.

2.1.2 Random Signal

A signal is a physical quantity that conveys information, but in many real-world applications the values of the input vector of an echo cancellation system are unknown before they arrive. Since it is also difficult to predict these values, they appear to behave randomly, so a brief discussion of random signal theory is given in this section.

A random signal, expressed by the random variable function x(t), does not have a precise description of its waveform. It may, however, be possible to describe such random processes by statistical or probabilistic models (Diniz 1997, p. 17). A single occurrence of a random variable appears to behave unpredictably. But if we take several occurrences of the variable, each denoted by n, then the random signal is expressed by two variables, x(t, n).


2.1.3 Convolution Function

Convolution is a formal mathematical operation, just as multiplication, addition, and integration. Addition takes two numbers and produces a third number, while convolution takes two signals and produces a third signal. Convolution is used in the mathematics of many fields, such as probability and statistics. In linear systems, convolution is used to describe the relationship between three signals of interest: the input signal, the impulse response, and the output signal.

Figure 1.2 shows the notation when convolution is used with linear systems. An input signal x[n] enters a linear system with an impulse response h[n], resulting in an output signal y[n]. In equation form: y[n] = x[n] * h[n]. Expressed in words, the input signal convolved with the impulse response is equal to the output signal. Just as addition is represented by the plus sign, +, convolution is represented by the star. It is unfortunate that most programming languages also use the star to indicate multiplication: a star in a computer program means multiplication, while a star in an equation means convolution. The convolution of two signals x(n) and h(n) is written

Fig 1.3 Convolutions of Two Signals

y(n) = x(n) * h(n)
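A short NumPy sketch of discrete convolution, which also shows how the echoes discussed in this project fit the same picture: an impulse response consisting of a unit pulse plus a delayed, attenuated copy produces a single echo, and adding further delayed taps gives multiple echoes. The sampling rate, delay and gains below are illustrative values, not taken from the project.

    import numpy as np

    fs = 8000                                    # sampling frequency in Hz (illustrative)
    x = np.random.randn(fs)                      # one second of an arbitrary input signal

    # impulse response of a single echo: direct sound plus a copy 0.25 s later at 60% level
    delay = int(0.25 * fs)
    h_single = np.zeros(delay + 1)
    h_single[0] = 1.0
    h_single[delay] = 0.6

    # y(n) = x(n) * h(n): the input convolved with the impulse response
    y_single = np.convolve(x, h_single)

    # multiple echoes: several delayed, progressively attenuated taps
    h_multi = np.zeros(3 * delay + 1)
    h_multi[0] = 1.0
    h_multi[[delay, 2 * delay, 3 * delay]] = [0.6, 0.36, 0.216]
    y_multi = np.convolve(x, h_multi)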


2.1.4 Correlation Function

The correlation function is a measure of how statistically similar two functions are. The autocorrelation function of a random signal is defined as the expectation of the signal's value at time n multiplied by its complex conjugate value at a different time m. This is shown in equation 1.2 for arbitrary time instants n and m:

r_xx(n, m) = E[x(n) x*(m)]    (1.2)

As this thesis deals only with real signals, the above equation becomes

r_xx(n, m) = E[x(n) x(m)]

The derivations of adaptive filtering algorithms utilize the autocorrelation matrix R. For real signals this is defined as the matrix of expectations of the product of a vector x(k) and its transpose:

R = E[x(k) x^T(k)]    (1.3)

The autocorrelation matrix has the additional property that its trace, i.e. the sum of its diagonal elements, is equal to the sum of the powers of the values in the input vector. Correlation matrices are based on either cross-correlation or autocorrelation functions; this simply refers to the signals being used in the function. If it is cross-correlation the signals are different; if it is autocorrelation, the two signals used in the function are the same.
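As a sketch, the matrix R of equation 1.3 can be estimated from data by averaging outer products of length-N input vectors, and the trace property mentioned above can be checked numerically. This is a straightforward sample estimate for illustration, not an algorithm taken from the thesis.

    import numpy as np

    def autocorrelation_matrix(x, order):
        """Estimate R = E[x(k) x(k)^T] from overlapping length-`order` input vectors."""
        vectors = np.array([x[k:k + order] for k in range(len(x) - order + 1)])
        return vectors.T @ vectors / len(vectors)     # average of outer products

    x = np.random.randn(10000)                        # stationary unit-variance noise
    R = autocorrelation_matrix(x, order=8)

    # the trace equals the sum of the powers of the entries of the input vector
    print(np.trace(R))                  # close to 8 for 8 taps of unit-variance noise
    print(np.allclose(R, R.T))          # autocorrelation matrices are symmetric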


2.1.5 Stationary Signal.

A signal is called stationary if it satisfies the following two properties:

1. The mean value, or expectation, of the signal is constant for any shift in time:

m_x(n) = m_x(n+k)    (1.4)

2. The autocorrelation function is also constant over an arbitrary time shift:

r_xx(n, m) = r_xx(n+k, m+k)    (1.5)

These two statistical properties of a stationary signal are thus constant over time. In the analysis of adaptive filter algorithms it is assumed that the input signals to the algorithm are stationary.

2.1.6 Speech Signal

A speech signal consists of three classes of sounds: voiced, fricative and plosive sounds. Voiced sounds are caused by excitation of the vocal tract with quasi-periodic pulses of airflow. Fricative sounds are formed by constricting the vocal tract and passing air through it. Plosive sounds are created by closing up the vocal tract, building up air behind it and then suddenly releasing it; this is heard in the sound made by the letter 'p' (Oppenheim & Schafer 1989, p. 724). The theory behind the derivations of many adaptive filtering algorithms usually requires the input signal to be stationary. Although speech is non-stationary for all time, it is an assumption of this thesis that the short-term stationary behaviour outlined above will prove adequate for the adaptive filters to function as desired.


Fig 1.2.6 Speech signal


Chapter 3 Speech Signal Enhancement

A microphone often picks up acoustical disturbances together with a speaker's voice (which is the signal of interest). In this work, algorithms are developed that allow these disturbances to be removed from the speech signal before it is processed further.

3.1 Overview

In general, more than one type of disturbance will be present in a microphone signal, each of them requiring a specific enhancement approach. We will mainly focus on two classes of speech enhancement techniques, namely acoustic echo cancellation (AEC) and acoustic noise cancellation (ANC). For AEC, a whole range of algorithms exists, from computationally cheap to expensive, with of course a corresponding performance. We will focus on one of the 'intermediate' types of algorithms, of which the performance and complexity can be tuned depending on the available computational power. We will describe some methods to increase noise robustness, we will show how existing fast implementations fail


when their assumptions are violated, and we will derive a fast implementation which does not require any assumptions. For ANC, a class of promising state-of-the-art techniques exists whose characteristics could be complementary to the features of computationally cheaper (and commercially available) techniques. Existing algorithms for these techniques have a high numerical complexity and hence are not suited for real-time implementation. This observation motivates our work in the field of acoustic noise cancellation, and we describe a number of algorithms that are (several orders of magnitude) cheaper than existing implementations and hence allow for real-time implementation. Finally we will show that treating the combined problem of acoustic echo and noise cancellation as a global optimization problem leads to better results than using traditional cascaded schemes. The techniques which we use for ANC can easily be modified to incorporate AEC. The outline of this first chapter is as follows. After a problem statement, we describe a number of applications in which acoustic echo and noise cancelling techniques prove useful, and an overview of commercially available applications in this field is given. In section 1.5 our own contributions are summarized.

3.2 Problem of echo

3.2.1 Nature of acoustical disturbances

In many applications involving speech communication, it is difficult (expensive) to place microphones close to the speakers. The microphone amplification then has to be large due to the large distance to the speech source. As a result, more environmental noise is picked up than in the case where the microphones would be close to the speech source. For some of these disturbances, a reference signal may be available. For example, a radio may be playing in the background while someone is making a telephone call. The electrical signal that is fed to the radio's loudspeaker can be used as a reference signal for the radio sound reaching the telephone's microphone. We will call the techniques that rely on the presence of a reference signal 'acoustic echo cancellation techniques' (AEC); the reason for this name will become clear below. For other types of disturbances, no reference signal is available. Examples of such disturbances are the noise of a computer fan, people who are babbling in the room where someone is using a telephone, car engine noise, ... Techniques that perform disturbance reduction where no reference signal is available will be called 'acoustic noise cancellation techniques' (ANC) in this text. In some situations the above two noise reduction techniques should be combined with a third enhancement technique, namely dereverberation. Each


acoustical environment has an impulse response, which results in a spectral coloration or reverberation of sounds that are recorded in that room. This reverberation is due to reflections of the sound against walls and objects, and hence has specific spatial characteristics, different from those of the original signal. The human auditory system deals with this effectively because it has the ability to concentrate on sounds coming from a certain direction, using information from both ears. If, for example, one hears a signal recorded by only one microphone in a reverberant room, speech may easily become unintelligible. Of course, voice recognition systems that are trained on non-reverberated speech will also have difficulties handling signals that have been filtered by the room impulse response, and hence dereverberation is necessary. In this thesis, we will concentrate on algorithms for both classes of noise reduction (noise reduction with (AEC) and without (ANC) a reference signal). Dereverberation will not be treated here; we refer to the literature for dereverberation techniques.

3.2.2 AEC, reference-based noise reduction

The most typical application of noise reduction in case a reference signal is available is acoustic echo cancellation (AEC). As mentioned before, we will use the term AEC to refer to the technique itself, even though the disturbance which is reduced is not always strictly an 'echo'. Single channel techniques. A teleconferencing setup consists of two conference rooms (see Figure 3.1), in both of which microphones and loudspeakers are installed.

Figure 3.1: Acoustic echo cancellation.


Without an echo canceller, the loudspeaker signal in the near end room is picked up by the microphone and sent back to the far end room, where the far end speaker would hear his own voice again (delayed by the communication setup). Sound picked up by the microphones in one room (the 'far end room', yielding the 'far end speech') is reproduced by the loudspeakers in the other (near end) room. The task of an 'echo canceller' is to avoid that the portion of the far-end speech signal which is picked up by the microphones in the near end room is sent back to the far end: hearing his own delayed voice is very annoying to the far end speaker. A similar example is voice control of a CD player; the music itself can then be considered a disturbance (echo) to the voice control system. The loudspeaker signal in both cases is 'filtered' by the room impulse response. This impulse response is the result of the sound being reflected and attenuated (in a frequency dependent way) by the walls and by objects in the room. Due to the nature of this process, the room acoustics can be modeled by a finite impulse response (FIR) filter. Nonlinear effects (mostly due to loudspeaker imperfections) are not considered here. In an acoustic echo cancellation algorithm, a model of the room impulse response is identified. Since the conditions in the room may vary continuously (people moving around being an obvious example), the model needs to be updated continuously. This is done by means of adaptive filtering techniques. In the situation in Figure 3.2 the far end signal x(k) is filtered by the room impulse response and then picked up by a microphone, together with the desired speech signal of the near end speaker. We consider digital signal processing techniques, hence A/D converted signals, i.e. discrete-time signals and systems.

At the same time, the loudspeaker signal x(k) is filtered by a model w(k) of the room impulse response, and the result is subtracted from the microphone signal d(k). During periods where the near end speaker is silent, the error (residual) signal e(k) may be used to update w(k); when the near end speaker is talking, this signal would disturb the adaptation process. We assume that the room characteristics do not change too much during the periods in which near end speech is present, and the adaptation is frozen in these periods by a control algorithm in order to solve this problem.
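A sketch of one possible adaptive filter for the scheme in Figure 3.2, using the normalized LMS (NLMS) update as a stand-in for whatever algorithm the thesis develops: w(k) models the room impulse response, e(k) = d(k) - w^T(k) x(k) is the residual, and adaptation can be frozen during near-end speech exactly as described above. The filter length, step size and freeze flag are illustrative assumptions.

    import numpy as np

    def nlms_echo_canceller(x, d, filter_len=256, mu=0.5, eps=1e-6, adapt=None):
        """Cancel the echo of far-end signal x from microphone signal d.

        adapt: optional boolean array; False freezes adaptation (e.g. near-end speech).
        Returns the error/residual signal e(k), i.e. the echo-cancelled output.
        """
        w = np.zeros(filter_len)                 # model of the room impulse response
        x_buf = np.zeros(filter_len)             # most recent far-end samples
        e = np.zeros(len(d))
        for k in range(len(d)):
            x_buf = np.roll(x_buf, 1)
            x_buf[0] = x[k]
            y = w @ x_buf                        # echo estimate
            e[k] = d[k] - y                      # residual = microphone minus estimate
            if adapt is None or adapt[k]:        # skip the update during near-end speech
                w += mu * e[k] * x_buf / (x_buf @ x_buf + eps)   # NLMS update
        return e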


Figure 3.2: Echo canceller

Multi–channel acoustic echo canceller


Figure 3.3: Multi-channel acoustic echo canceller. The fundamental problem of stereophonic AEC tends to occur in this case, and decorrelation of the loudspeaker signals is necessary to achieve good performance.

Multi-channel techniques. In multi-channel acoustic noise cancellation, a microphone array is used instead of a single microphone to pick up the signal. Apart from the spectral information, the spatial information can also be taken into account. Different techniques that exploit this spatial information exist. In filter-and-sum beamforming, a static beam is formed into the (assumed known) direction of the (speech) source of interest (also called the direction of arrival). While filter-and-sum beamforming is about the cheapest multi-channel noise suppression method, deviations in microphone characteristics or microphone placement have a large influence on its performance. Since signals coming from other directions than the direction of arrival are attenuated, beamforming also provides a form of dereverberation of the signal. Generalized sidelobe cancellers (Griffiths-Jim beamforming) aim at reducing the response in the directions of noise sources, with as a constraint a distortionless response towards the direction of arrival. The direction of arrival is required prior knowledge, and a voice activity detector is needed in order to discriminate between noise-only and speech-plus-noise periods, such that the response towards the noise sources can be adapted during noise-only periods. Griffiths-Jim beamforming effectively is a form of constrained optimal filtering. A third method is unconstrained optimal filtering. Here an MMSE-optimal estimate of the signal of interest can be obtained, while no prior knowledge about the geometry is required. A voice activity detector again is necessary and crucial to proper operation. The distortionless constraint towards the direction of arrival is not imposed here; a parameter can be used to trade off signal distortion against noise reduction. The contributions of this thesis in the field of acoustic noise reduction will be focused on this last method (chapters 5 and 6). Existing algorithms for unconstrained optimal filtering for acoustic noise reduction are highly complex compared to both other (beamforming-based) methods, which implies that they are not suited for real-time implementation. On the other hand, they are quite promising for certain applications, since they have different features than the beamforming-based methods: filter-and-sum beamformers are well suited (and even optimal) for enhancing a localized speech source in a diffuse noise field, and generalized sidelobe cancellers are able to adaptively eliminate directional noise sources, but both of them rely upon a priori information about the geometry of the sensor array, the sensor characteristics, and the direction of arrival of the signal of interest. This means that the unconstrained optimal filtering technique is more robust against microphone placement and microphone characteristics, and that the direction of


arrival is not required to be known a priori. Another advantage is that they can easily be used for combined AEC/ANC.
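For concreteness, here is a sketch of the simplest member of the beamforming family mentioned above, a delay-and-sum beamformer for a uniform linear array steered towards a known direction of arrival. The array geometry, spacing and steering angle are assumed values; the GSC and unconstrained optimal filtering methods discussed in the thesis are considerably more involved than this.

    import numpy as np

    def delay_and_sum(mics, fs, spacing=0.05, angle_deg=0.0, c=343.0):
        """Steer a uniform linear array towards `angle_deg` (0 = broadside).

        mics: 2-D array, one row per microphone signal, all rows the same length.
        The steering delays are applied in the frequency domain, so fractional
        sample delays are handled without interpolation.
        """
        n_mics, n = mics.shape
        # relative arrival-time compensation for a plane wave from angle_deg
        delays = np.arange(n_mics) * spacing * np.sin(np.radians(angle_deg)) / c
        freqs = np.fft.rfftfreq(n, 1 / fs)
        out = np.zeros(n)
        for m in range(n_mics):
            spectrum = np.fft.rfft(mics[m]) * np.exp(2j * np.pi * freqs * delays[m])
            out += np.fft.irfft(spectrum, n=n)
        return out / n_mics    # coherent sum: the steered direction adds up, others do not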

Figure 3.4: Two methods to combine echo– and noise cancellation.


Chapter 4 IMPLEMENTATION DETAILS

4.1 OVERVIEW OF OUR APPROACH

Our approach can be decomposed into four main stages (see Figure 1). The first stage builds a frequency-domain representation of the audio signals based on a short-time Fourier transform (STFT). This representation is augmented by a set of audio descriptors such as the root-mean-square (RMS) level of the signal in several frequency bands and the tonality of the signal. This kind of augmented description of audio signals is similar in spirit to prior work in indexing and retrieval of audio. This first stage is usually performed off-line when the signals to process are known in advance. The three remaining stages, masking evaluation, importance sampling and the actual processing, are performed on-line during the interactive application. Audio signals are processed using small frames of audio data (typically using windows of 20 to 40 ms) and, as a consequence, all three later steps are performed for each frame of the computed output stream. The masking step determines which subset of the input audio frames will be audible in the final mixture. It is not mandatory but usually makes the importance sampling step more efficient. It can also be used to limit bus traffic, since all inaudible signals can be directly discarded after the masking evaluation and do not have to go through the actual processing pipeline. The importance sampling step determines the amount of data to select and process in each input signal in order to fit the predefined operation budget and minimize audible degradations. Finally, the actual processing step performs a variety of operations on the audio data prior to the final mix-down, in the context of equalization/mixing, reverberation processing and audio rendering.

4.1.1 PRE-PROCESSING AUDIO SIGNALS

The first stage of our approach aims at pre-computing a signal representation from which the later real-time operations can be performed efficiently. We chose a representation based on an STFT of the input signals, augmented with additional information.

4.1.2. Constructing the representation

For each frame of the input audio signal, we first compute the STFT of the audio data. For 44.1 kHz signals, we use 1024-sample Hanning-windowed frames with


50% overlap, resulting in 512 complex values in the frequency domain. From the complex STFT, we then compute a number of additional descriptors:

• RMS level RMS_i for a predefined set of frequency bands (e.g., octave or Bark bands),

• Tonality T, calculated as a spectral flatness measure; tonality is a descriptor in [0, 1] encoding the tonal (when close to 1) or noisy (when close to 0) nature of the signal,

• Reconstruction error indicator Err; this descriptor indicates how well the signal can be reconstructed from a small number of bins.

To compute the indicator Err, we first sort the FFT bins by decreasing modulus value. Then, several reconstructions (i.e., inverse Fourier transforms) are performed using an increasing number of FFT bins. The reconstruction error, calculated as the RMS level of the (time-domain) difference between the original and reconstructed frame, is then computed. For an N-bin FFT, we perform k reconstructions using 1 to N FFT bins, in N/k increments. Err is calculated as the average of the k corresponding error values. This indicator will later be used during the on-line importance sampling step. The descriptors, together with the pre-sorted FFT bins, are computed for each frame of each input signal and pre-stored in a custom file format. If required, the descriptors can be stored separately from the FFT data used for the processing. They can be viewed as a compact representation of the signal, typically requiring a few additional kilobytes of data per second of audio (e.g., 3 kBytes/sec at 44.1 kHz for 1024-sample frames with 50% overlap and 8 frequency bands). Hence, for a set of short audio signals, they could easily fit into memory for fast random access over all signals.
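A sketch of the per-frame descriptor computation described above: band RMS levels, a tonality value derived from spectral flatness, and the FFT bins sorted by decreasing modulus for progressive reconstruction. The band edges are assumed values, the mapping from flatness to tonality is a simplification, and the Err indicator from the text is omitted for brevity.

    import numpy as np

    def frame_descriptors(frame, fs,
                          band_edges=(0, 500, 1000, 2000, 4000, 8000, 16000, 22050)):
        """Compute the augmented representation of one Hanning-windowed frame."""
        n = len(frame)                                           # e.g. 1024 samples
        spectrum = np.fft.rfft(frame * np.hanning(n))[:n // 2]   # 512 complex values
        freqs = np.fft.rfftfreq(n, 1 / fs)[:n // 2]
        power = np.abs(spectrum) ** 2

        # RMS level per frequency band
        rms = np.array([np.sqrt(power[(freqs >= lo) & (freqs < hi)].mean() + 1e-20)
                        for lo, hi in zip(band_edges[:-1], band_edges[1:])])

        # tonality from a spectral flatness measure (geometric / arithmetic mean of power)
        flatness = np.exp(np.mean(np.log(power + 1e-20))) / (np.mean(power) + 1e-20)
        tonality = 1.0 - flatness          # near 1 = tonal, near 0 = noisy

        # FFT bins pre-sorted by decreasing modulus, for progressive reconstruction
        order = np.argsort(np.abs(spectrum))[::-1]
        return rms, tonality, spectrum[order], order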

4.2 Optimizing the representation

This representation can be further optimized if necessary during the pre-processing step. Frames whose energy is below the audible threshold can be stored with a minimal amount of data. Basic masking calculations can also be performed while computing Err, by examining the signal-to-noise ratio between the energy in the selected FFT bins and the resulting reconstruction error. The number of stored bins can then be limited as soon as the signal-to-noise ratio exceeds a specified threshold, which can further depend on the tonality of the signal. Of course, any optimization made at this stage implies that the signals are not going to be drastically modified during the processing step. However, this restriction applies to any approach applied to audio data encoded using a lossy audio compression strategy. Although compression is not the primary goal of this work, we also experimented with various strategies to optimize storage space. By quantizing the complex FFT data with a non-uniform 16-bit dynamic range and compressing all the data for each frame using standard compression techniques (e.g., zip), the size of the obtained


sound files typically varies between 1.5 times the size of the original 16-bit PCM audio data (for wideband sounds) and 0.25 times that size (e.g., for speech). If more dynamic range is necessary, it is also possible to quantize the first n FFT bins, which contain most of the energy, over a 24-bit dynamic range and to represent the rest of the data with a more limited range, with minimal impact on the size of the representation and the quality of the reconstruction.

4.2.1 REAL-TIME MASKING EVALUATION

Once the input sound signals have been pre-processed, we can use the resulting information to optimize a real-time pipeline running during an interactive application. The first step of our pipeline aims at evaluating which of the input signals are going to contribute significantly to a given frame of the output, which amounts to evaluating which input signals are going to be audible in the final mixture at a given time. Signals that have been identified as inaudible can be safely removed from the pipeline, reducing both the arithmetic operations to perform and the bus traffic. Since this calculation must be carried out at each processing frame, it must be very efficient so that it does not introduce significant overhead. The masking algorithm is similar to earlier masking-estimation approaches but leverages the pre-computed descriptors for maximum efficiency. First, all input frames are sorted according to some importance metric. Previously a loudness metric was used, but some of our recent experiments indicate that, for lack of an "ultimate" loudness metric, the RMS level performs equally well, if not better on average. If the signals must undergo filtering or equalization operations, we dynamically weight the RMS level values pre-computed for several frequency bands to account for the influence of the filtering operations in each band. We can then compute the importance as the sum of all weighted RMS values. Second, all signals are considered in decreasing importance order for addition to the final mixture. This process adds the level RMS_k of each source to an estimate of the level of the final result in each band, P_mix (initially set to zero). Accordingly, it subtracts it from an estimate of the remaining level in each band, P_toGo (initially set to the sum of the RMS levels of all signals). The process stops when the estimated remaining level in each band is more than a threshold M_mix below the estimated level of the final result. The process also stops if the remaining level is below the absolute threshold of hearing (ATH). The threshold M_mix is adjusted according to the estimated tonality of the final result, T_mix, following rules similar to the ones used in perceptual audio coding. In our applications, a simple constant threshold of -27 dB also gave satisfying results, indicating that pre-computing and estimating tonality values is not mandatory. Note that all operations must be


performed for each frequency band, although we have simplified the description above for the sake of clarity (accordingly, all quantities should be interpreted as vectors whose dimension is the number of frequency bands used, and all arithmetic operations as vector arithmetic). In particular, the process stops only when the masking threshold is reached for all frequency bands.
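A sketch of this masking loop under the simplifications just described: the per-band levels RMS_k are assumed to be already filter-weighted, a constant -27 dB offset stands in for the tonality-dependent threshold M_mix, and the absolute threshold of hearing is a single assumed constant rather than a frequency-dependent curve.

    import numpy as np

    def select_audible(rms_levels, masking_db=-27.0, ath=1e-6):
        """Return indices of the input frames estimated to be audible in the mix.

        rms_levels: array of shape (n_signals, n_bands), filter-weighted band RMS levels.
        """
        importance = rms_levels.sum(axis=1)              # sum of weighted band levels
        order = np.argsort(importance)[::-1]             # consider loudest signals first
        p_mix = np.zeros(rms_levels.shape[1])            # estimated level of the mix so far
        p_togo = rms_levels.sum(axis=0)                  # estimated level still to be added
        masking = 10.0 ** (masking_db / 20.0)            # -27 dB as a linear factor
        audible = []
        for k in order:
            audible.append(k)
            p_mix += rms_levels[k]
            p_togo -= rms_levels[k]
            # stop when, in every band, what remains is masked by the mix or inaudible
            if np.all((p_togo < masking * p_mix) | (p_togo < ath)):
                break
        return audible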

4.3 IMPORTANCE SAMPLING AND PROCESSING

The second step of our pipeline aims at processing the subset of audible input signals in a scalable manner while preserving the perceived audio quality. In our case this is achieved by performing the required signal processing with a target number of operations over a limited subset of the original signal data. Note that it is not mandatory to perform the masking calculations (as described in the previous section) in order to implement the following importance sampling scheme. However, the masking step limits the number of samples going through the rest of the pipeline and ensures that no samples are wasted, since our sampling strategy itself does not ensure that masked signals will receive a zero-sample budget.

4.3 Analog to Digital Conversion

Figure 4.1.: The analog to digital conversion process.

4.3.1 Aliasing

The low-pass filtering step 2 in figure 4.1 is essential so that the digitized signal is a faithful representation of the original. Shannon and Nyquist proved in the 1930s that, for the digital signal to be a faithful representation of the analog signal, a relation between the sampling frequency and the bandwidth of the signal has to be maintained. For speech and audio signals, bandwidth translates to the highest frequency that is present in the signal. We know that the highest frequencies we


can hear are around 20 kHz. To faithfully represent frequencies that high we have to use a sampling frequency that is at least twice as high; hence the 44100 Hz sampling frequency used in CD audio. All ADCs have a fixed highest sampling frequency, and to guarantee that the input contains no frequencies higher than half this frequency we have to filter them out. If we don't filter out these frequencies, they get aliased and contribute to the digitized representation. A famous non-audio example of aliasing occurs in westerns, where the wheels of the stage coach sometimes seem to turn backwards. In figure 4.2 we see an example of aliasing. The figure shows with black solid poles the result of sampling a sine of 100 Hz with a sampling frequency of 1000 Hz.

Figure 4.2.: Aliasing example. The red dotted analog 900 Hz tone gets aliased to the black dotted 100 Hz tone after analog to digital conversion with a 1000 Hz sampling frequency.

This can easily be checked: we have 10 sample values in 0.01 s, which makes 1000 sample values in one second. As a reference, the analog sine signal is also drawn with a black dotted line. The black dotted line therefore represents the analog signal before it is converted to a digital signal, and the black poles are the output of the ADC. The red dotted line shows nine periods of an analog sine in this same 0.01 s interval and accordingly has a frequency of 900 Hz. The figure makes clear that if the red dotted 900 Hz signal were offered to the ADC instead of the black dotted 100 Hz signal, the analog to digital conversion process would have resulted in the same black poles. This means that from the output of the ADC we cannot reconstruct anymore whether a 900 Hz or a 100 Hz sine was digitized: if we have a signal that contains, besides a sine of 100 Hz, also a sine of 900 Hz, then after the analog to digital conversion only one frequency is left, namely 100 Hz.
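A quick numerical check of the aliasing example in Figure 4.2: a 900 Hz sine and a 100 Hz sine sampled at 1000 Hz produce, up to a sign flip, the same sample values, so the two tones cannot be distinguished after conversion.

    import numpy as np

    fs = 1000                               # sampling frequency in Hz
    t = np.arange(10) / fs                  # ten samples, i.e. 0.01 s as in the figure
    s100 = np.sin(2 * np.pi * 100 * t)      # 100 Hz sine
    s900 = np.sin(2 * np.pi * 900 * t)      # 900 Hz sine

    # sin(2*pi*900*k/1000) = sin(2*pi*(1000-100)*k/1000) = -sin(2*pi*100*k/1000)
    print(np.allclose(s900, -s100))         # True: the 900 Hz tone aliases onto 100 Hz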


Digital to Analog Conversion

In figure 4.3 the digital to analog conversion process is shown. This process is almost the reverse of the analog to digital conversion process. We start at 1 with a series of numbers as input to the digital to analog converter 2. At each clock tick, the DAC converts a number to an analog voltage and maintains that voltage on its output until the next clock tick. Then a new number is processed. This results in a not so smooth, step-like signal 3. If made audible this signal would sound harsh. In step 4, this step-like signal is low-pass filtered to remove frequencies above the Nyquist frequency. In fact, the simulation of the analog to digital conversion is much better than any hardware device now and in the foreseeable future can deliver. The quantization in Praat is only limited by the precision of the floating point arithmetic units. Sounds are represented with double precision numbers, which roughly corresponds to 52-bit precision; the best hardware nowadays quantizes with 24 bits of precision.

Fig 4.3

5. Pitch analysis

Pitch, in the context of speech processing, will most of the time refer to the periodicity in the speech sound. This periodicity is due to the periodic opening and closing cycle of the vocal cords. The standard pitch algorithm in Praat tries to detect and to measure this periodicity; the algorithm is described in Boersma [1993], and we will also deal with it in this section. The concept of pitch, however, is not as simple as stated above, because pitch is a subjective psycho-physical property of a sound. The ANSI definition of pitch is as follows: pitch is that auditory attribute of sound according to which sounds can be ordered on a scale from low to high. Pitch is a sensation, and the fact that pitch is formed in our brain already hints that it will not always be simple to calculate. The definition implies that essentially the calculation of pitch has to boil down to one number, since numbers can be ordered from low to high. In fact, the only simple case for pitch measurement is the pitch associated with a pure tone: a pure tone always evokes the same pitch sensation in a normal-hearing listener. This was experimentally verified by letting subjects adjust the frequency of a tone to make its pitch equal to the pitch of a test tone. After many repetitions of the experiment, and after averaging over many listeners, the distribution of the subjects' settings shows only one peak, centered at the test frequency. For more complex sounds, distributions with more than one peak may occur. Various theories about pitch and pitch


perception exist; a nice introduction is supplied by Terhardt's website. The topic of this chapter, however, is not pitch perception but pitch measurement. A large number of pitch measurement algorithms exist and new ones are still being developed every year. We will describe the pitch detector implemented in Praat because it is one of the best around.
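The following is not Boersma's algorithm, only a bare-bones illustration of the underlying idea: the periodicity of a voiced frame shows up as a peak in its autocorrelation, and the lag of that peak gives a pitch estimate. The search range of 75-600 Hz and the use of a single frame are illustrative simplifications.

    import numpy as np

    def naive_pitch(frame, fs, fmin=75.0, fmax=600.0):
        """Estimate the pitch of one (roughly stationary) voiced frame in Hz."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
        lag_min = int(fs / fmax)                 # smallest lag = highest candidate pitch
        lag_max = int(fs / fmin)                 # largest lag = lowest candidate pitch
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        return fs / lag

    # usage: a 200 Hz periodic signal should give an estimate of roughly 200 Hz
    fs = 44100
    t = np.arange(int(0.04 * fs)) / fs
    frame = np.sign(np.sin(2 * np.pi * 200 * t)) + 0.1 * np.sin(2 * np.pi * 400 * t)
    print(round(naive_pitch(frame, fs)))         # ~200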

6. The Spectrogram

In the spectrum we have a perfect overview of all the frequencies in a sound. However, all information with respect to time has been lost. The spectrum is ideal for sounds that don't change much during their lifetime, like a vowel. For sounds that change in the course of time, like real speech, the spectrum does not provide us with the information we want; we would like to have an overview of spectral change, i.e. how the frequency content changes as a function of time. The spectrogram represents an acoustical time-frequency representation of a sound: the power spectral density. It is expressed in units of Pa²/Hz. Because the notion of frequency doesn't make sense at too small a time scale, spectro-temporal representations always involve some averaging over a time interval. When we assume that the speech signal is reasonably constant during time intervals of some 10 to 30 ms, we may take spectra from these short slices of the sound and display these slices as a spectrogram. We have then obtained a spectro-temporal representation of the speech sound. The horizontal dimension of a spectrogram represents time and the vertical dimension represents frequency in hertz. The time-frequency strip is divided into cells. The strength of a frequency in a certain cell is indicated by its blackness: black cells have a strong frequency presence while white cells have a very weak presence.

6.1. How to get a spectrogram from a sound

The easiest way is to open the sound in the sound editor. If you don't see a greyish image you click Spectrum > Show Spectrogram. A number of parameters determine how the spectrogram will be calculated from the sound, and other parameters determine how the spectrogram will be displayed.

4.4 Principle of Multi-Echo EPI

The basic timing scheme of Multi-Echo EPI is shown in Figure 4.4. It is based on the original EPI sequence proposed by Mansfield (1). After a single RF excitation, n complete EPI images are acquired in a single shot. The phase gradient is rewound to the original starting position after each echo-image to ensure identical k-space trajectories for all echo-images (2).

Figure 4.4 – Basic timing of the Multi-Echo EPI sequence. The shaded gradient in the phase-encoding direction ensures identical k-space trajectories for all echo-images. The resulting echo-images have different, increasing echo times and thus different T2*-weighting. The single echo-images correspond to standard gradient-echo EPI images with the same echo time.

Multichannel audio has been established in the consumer environment through the success of DVD-Video players for home theater systems. Moreover, streaming technology over IP, used as a broadcast service, requires multichannel audio at low data rates. Therefore, multichannel audio coding and processing methods have been investigated by many researchers during the last decade. The first method is multichannel audio coding. Matrix surround coding schemes and parametric audio coding schemes are the two main multichannel audio coding techniques currently used. A matrix surround coding scheme such as Dolby Pro Logic consists in matrixing the channels of the original multichannel signal in order to reduce the number of signals to be transmitted. However, this multichannel audio coding method cannot deliver high quality (close to transparency) at low data rates. That is made possible by low-bit-rate parametric audio coding, mainly based on Binaural Cue Coding (BCC). This coding scheme represents multichannel audio signals by one or several downmixed audio channels plus spatial cues extracted from the original channels.

The spatial cues refer to the auditory localization cues, the interaural time and level differences (ITD and ILD), which are extracted from input channel pairs in a subband domain and then denoted inter-channel level and time differences (ICLD and ICTD). BCC uses filter banks with subbands of bandwidths equal to two times the equivalent rectangular bandwidth (ERB). Moreover, the inter-channel coherence (ICC) is also extracted in order to recreate the diffuseness of the original multichannel input. Indeed, the multichannel audio synthesis at the decoder side is based on the ICC parameter, which yields a coherence synthesis relying on late reverberation. The downmixed audio channel (in the case of a mono downmix) is decoded and then filtered by late-reverberation filters which deliver several decorrelated audio channels. These signals are then combined according to the spatial cues (ICTD, ICLD and ICC) such that the ICC cues between the output subbands approximate those of the original audio signal. The BCC scheme thus achieves a drastic data-rate reduction by transmitting a perceptually encoded downmixed signal plus quantized spatial cues. Moreover, the BCC scheme achieves better audio quality and a better perceived spatial image than matrix surround coding. From a spatial-attribute point of view, BCC synthesis restricted to ICLD and ICTD achieves the desired source positions and the coloration effects caused by early reflections, but suffers from a reduction of the auditory spatial image width. Indeed, spatial impression is related to the nature of the reflections that follow the direct sound. BCC synthesis based on late reverberation therefore mimics different reverberation times and achieves a spatial impression closer to the original multichannel input. A second multichannel audio processing method, called upmix, classically converts existing stereo audio content into five-channel audio compatible with home theater systems. The decoding process of BCC thus shares a common intention with the upmix method, namely to deliver a multichannel audio signal – considering the upmix stereo input and the BCC stereo downmix. A priori, more information is available to the BCC scheme (i.e., the spatial parameters) than to a blind upmix method. However, the upmix method uses the spatial characteristics and the coherence of the stereo signal to synthesize a multichannel audio signal, with rear channels considered as ambience channels – defined as diffuse surround sounds – and a center front channel corresponding to the sources panned across the original stereo channels. More precisely, we focus here on an existing PCA-based upmix method, whose first step consists in a Principal Component Analysis (PCA) of the stereo signal.
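To make the cue-extraction step concrete, the following MATLAB sketch computes one BCC-style spatial cue, the inter-channel level difference (ICLD), per frequency band for a single synthetic stereo frame. The uniform bands, the frame length and the test signal are simplifications and assumptions: BCC uses ERB-spaced subbands and also extracts ICTD and ICC, which would be obtained from the phase and the normalized cross-correlation of the band signals.

% per-band inter-channel level difference for one stereo frame (illustrative)
fs = 44100;  t = (0:round(0.02*fs)-1)'/fs;     % one 20 ms frame
xL = sin(2*pi*440*t);                          % left channel
xR = 0.5*sin(2*pi*440*t);                      % right channel: attenuated copy (about -6 dB)
N  = length(xL);
XL = fft(xL);  XR = fft(xR);
nBins = floor(N/2) + 1;                        % non-negative frequency bins
edges = round(linspace(1, nBins, 21));         % 20 uniform bands (BCC would use ERB bands)
ICLD  = zeros(1, 20);
for b = 1:20
    k  = edges(b):edges(b+1);                  % FFT bins belonging to band b
    PL = sum(abs(XL(k)).^2);                   % left-channel power in the band
    PR = sum(abs(XR(k)).^2);                   % right-channel power in the band
    ICLD(b) = 10*log10((PL + eps)/(PR + eps)); % level difference in dB
end
disp(ICLD)                                     % every band shows about +6 dB for this test signal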

4.4.1 Image contrast of Multi-Echo EPI

The signal intensity of the single echo-images of the Multi-Echo EPI sequence is simply given by S(TE) = S0 * exp(-TE/T2*), where S0 is the signal intensity at echo time 0 (which corresponds to the proton density for long repetition times TR) and TE is the echo time. Therefore, the echo times of the different echo-images should be on the order of, or less than, the tissue T2*. Parametric images of the proton density and T2* can be obtained by fitting a model function to the signal intensities at the different echo times (3). Please see the processing options described below.
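One simple way to obtain such parametric maps is a per-voxel log-linear least-squares fit of the mono-exponential model S(TE) = S0*exp(-TE/T2*). The MATLAB sketch below fits a single synthetic voxel; the echo times and tissue values are assumptions, and practical pipelines typically use more robust weighted or non-linear fits.

% log-linear fit of S(TE) = S0*exp(-TE/T2*) for one synthetic voxel (illustrative)
TE   = [10 25 40 55]*1e-3;               % echo times of the echo-images, in s (assumed)
S0t  = 1000;  T2s = 45e-3;               % "true" values used to generate test data
S    = S0t * exp(-TE/T2s);               % signal intensities at the different echo times
A    = [ones(numel(TE),1), -TE(:)];      % model in log domain: log(S) = log(S0) - TE/T2*
p    = A \ log(S(:));                    % least-squares solution [log(S0); 1/T2*]
S0fit = exp(p(1));
T2fit = 1/p(2);
fprintf('fitted S0 = %.1f, T2* = %.1f ms\n', S0fit, 1000*T2fit);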

Chapter 5
Result and Discussion

5.1 RESULTS

We implemented and tested our signal processing algorithm for three applications: simple mixing and echo generation, a reverberation pipeline, and a massive spatial-audio application which can render hundreds of simultaneous sound sources in real time. Our first application performs simple equalization and mix-down operations on a number of input sound signals. In this case, all data was streamed and decompressed from the disk in real time. Even for a relatively small number of signals to process (eight in our test example), our approach gives roughly a three-fold reduction in processing time compared to processing the entire data set; the overall processing speed, including streaming from disk and final time-domain reconstruction, is doubled. We also examined the compute-time breakdown for the various stages of the pipeline when processing the STFT data at several "resolutions", i.e. with a decreasing number of target FFT bins; these results correspond to the rate at which a full frame of output can be calculated. The output is presented in a window whose menu offers three choices for the program: the first plays the original signal, the second the single echo, and the third the multiple echo.

5.2 DISCUSSION

One limitation of our importance sampling scheme is that it requires a fine-grained, scalable model of the sounds to be applicable. However, we also experimented with coarser-grained time-domain representations, with promising results. As with all frequency-domain processing approaches, our method might require many inverse FFTs per frame to reconstruct multiple channels of output. This might be a limiting factor for applications requiring multi-channel output; in most cases, however, the number of output channels is small. Another limiting factor is that we use pre-computed information to limit the overhead of our frequency-domain processing and final reconstruction. In the case where the input signals are not known in advance (e.g., voice over IP or real-time synthesis), an equivalent representation would have to be constructed on the fly prior to processing. We believe that if only a small number of such streams are present, our approach would still improve the overall performance. Pre-sorting the FFT data also implies that the processing should not drastically affect the frequency spectrum of the input signals, which might not be the case; a similar issue arises for any approach using perceptually encoded signals, since perceptual (e.g., masking) effects would be encoded a priori. A solution to this problem would be to store sorted FFT data for a number of subbands and re-order them in real time according to how the filtering operations affect the level in each subband. Sorting the output frequency-domain data would also be necessary if several effects have to be chained together. Another way to better account for filtering effects would be to extend our importance sampling strategy to account explicitly for frequency content (currently, importance is implied by the pre-computed ordering of the STFT data, so that only the number of processed bins has to be determined). This would bring the approach closer in spirit to methods that directly process coded signals, although we would still benefit from scalability and masking estimation (and would not require a specific filter representation). Finally, although it can be optimized, our STFT data is not a compact representation (at least not as compact as standard perceptually coded representations such as mp3 or AAC), which could be a limitation for streaming or bus transfers.
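The following MATLAB sketch illustrates the bin-budget idea in isolation for a single STFT frame: the non-negative-frequency bins are sorted once by magnitude (the pre-computed ordering), and at run time only the K most important bins are copied (or, in general, filtered and mixed) before the inverse FFT. This is not the full pipeline described in this report – the masking descriptors and the per-signal budget allocation are omitted – and the frame length, test signal and budget K are assumptions.

% reconstruct one frame from only its K most important FFT bins (illustrative)
N  = 1024;  fs = 44100;
x  = randn(N,1) + sin(2*pi*1000*(0:N-1)'/fs);  % one frame of test signal
X  = fft(x);
[~, order] = sort(abs(X(1:N/2+1)), 'descend'); % off-line: sort non-negative bins by magnitude
K  = 64;                                       % run-time budget: bins actually processed
keep = order(1:K);                             % the K most important bins
Xk = zeros(N,1);
Xk(keep) = X(keep);                            % copy (or filter/mix) only those bins
mirror = keep(keep > 1 & keep < N/2+1);        % restore the conjugate-symmetric partners
Xk(N - mirror + 2) = conj(X(mirror));
xk  = real(ifft(Xk));                          % approximate frame reconstruction
err = 10*log10(sum((x-xk).^2)/sum(x.^2));      % relative approximation error in dB
fprintf('K = %d of %d bins, error = %.1f dB\n', K, N/2+1, err);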

Chapter 6
Conclusion and Future Work

Conclusion

We presented a simple approach to efficiently filter and mix down a large number of echo signals in real time. By pre-computing a Fourier frequency-domain representation of our input audio data, augmented with a set of audio descriptors, we are able to concentrate the processing effort on the most important components of the signals. In particular, we show that we can identify signals which will not be audible in the output at each processing time frame; such signals can be discarded, thus reducing computational load and bus traffic. The remaining audible signals are sampled based on an importance metric, so that only a subset of their representation is processed to produce a frame of output. Our approach yields a 3- to 15-fold improvement in overall processing rate compared to brute-force techniques, with minimal degradation of the output. As future extensions, we plan to conduct perceptual validation studies to assess the auditory transparency of our approach at several "processing bit-rates", and to further improve our masking calculation and importance sampling metrics.

A side note on random-number generation: randomGauss(mu, sigma) generates random numbers from a Gaussian (normal) distribution with mean mu and standard deviation sigma, and randomUniform(lower, upper) generates random numbers between lower and upper. Compared with randomGauss, randomUniform has the advantage that all amplitudes are always limited to the predefined interval.
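Assuming these are the usual Gaussian and uniform generators, possible MATLAB equivalents are (mu, sigma, lower, upper and N are illustrative values):

mu = 0; sigma = 1; lower = -0.5; upper = 0.5; N = 1000;
g = mu + sigma*randn(N,1);               % like randomGauss(mu,sigma): unbounded amplitudes
u = lower + (upper-lower)*rand(N,1);     % like randomUniform(lower,upper): always inside [lower,upper]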

Future Work

In future work, this algorithm could be developed in the following directions.

1. The project could be extended to the mixing of multiple signals.
2. The project could be extended to mix in external noise when producing the echo effect.
3. In the field of acoustic echo cancellation, no 'perfect' solution exists yet for multichannel decorrelation. For speech signals, non-linearities such as half-wave rectifiers provide sufficiently good results, but in applications where multichannel audio is involved (e.g., voice-command applications for audio devices) these solutions introduce intolerable distortion; this subject clearly requires more research. The adaptive filtering techniques which form the core of acoustic echo cancellers are well explored: for cheap consumer products, NLMS and frequency-domain adaptive filters can be used (a minimal NLMS sketch is given after this list), while a whole range of better (and more expensive) algorithms exists if one can afford the extra complexity. The same holds for the class of noise-cancellation algorithms.
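For reference, a minimal NLMS echo-canceller sketch in MATLAB is given below, using a synthetic far-end signal and an assumed 16-tap echo path; real acoustic echo paths are much longer, and practical cancellers add step-size control and double-talk detection.

% minimal NLMS echo canceller (illustrative, single channel)
fs = 8000;  n = 4000;
x  = randn(n,1);                          % far-end (loudspeaker) signal
h  = [0; 0; 0.5; 0; 0.3; zeros(11,1)];    % unknown echo path, 16 taps (assumed)
d  = filter(h,1,x) + 0.01*randn(n,1);     % microphone signal: echo plus a little noise
L  = 16;  mu = 0.5;  delta = 1e-6;        % filter length, step size, regularization
w  = zeros(L,1);                          % adaptive filter weights
e  = zeros(n,1);                          % error (echo-cancelled) signal
for k = L:n
    xk   = x(k:-1:k-L+1);                 % most recent L far-end samples
    e(k) = d(k) - w'*xk;                  % error = microphone minus estimated echo
    w    = w + (mu/(delta + xk'*xk)) * xk * e(k);  % NLMS weight update
end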

Chapter 7
MATLAB PROGRAM

%ECHO GENERATION

%Program to generate an echo file (single echo).

[x,fs,nbits]=wavread('tomson.wav');%read in wav file

xlen=length(x);%Calc. the number of samples in the file

a=0.6; %echo attenuation factor

R=ceil(fs*100e-3); %echo delay of 100 ms, expressed in samples

y=zeros(size(x));

d=zeros(size(x));

% filter the signal

for i=1:R %the first R samples have no echoed component yet

y(i) = x(i);

end

for i=R+1:xlen %add the attenuated copy delayed by R samples

y(i)= x(i)+ a*x(i-R);

d(i)= a*x(i-R); %delayed component alone (kept for inspection)

end
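%Note: the loop above realizes the feedforward comb filter
%   y(n) = x(n) + a*x(n-R),
%i.e. each output sample is the input plus an attenuated copy delayed by
%R samples. With a 100 ms delay, R = ceil(0.1*fs); at fs = 44100 Hz, for
%example, this gives R = 4410 samples.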

%wavplay(y,fs)

wavwrite(y,fs,'echo.wav');

[z,fs,nbits]=wavread('echo.wav');

%soundsc(z,fs)

wavplay(z,fs); %play back the echoed signal

%IMPLEMENTATION OF ECHO EFFECT IN THE AUDIO SIGNAL

function y = echo1() %single-echo version of tomson.wav; called by the menu program below

[x,fs,nbits]=wavread('tomson.wav');%read in wav file

xlen=length(x);%Calc. the number of samples in the file

a=0.5;

R=ceil(fs*100e-3);

y=zeros(size(x));

d=zeros(size(x));

% filter the signal

for i=1:R %the first R samples have no echoed component yet

y(i) = x(i);

end

for i=R+1:xlen %add the attenuated copy delayed by R samples

y(i)= x(i)+ a*x(i-R);

d(i)= a*x(i-R);

end

%PROJECT: AUDIO SIGNAL PROCESSING

%IMPLEMENTATION OF ECHO IN THE AUDIO SIGNAL

clear all;

prompt={'Enter .wav file path :'};

def={'tomson.wav'};

dlgTitle='Input for wav file';

lineNo=1;

AddOpts.Resize='on';

AddOpts.WindowStyle='normal';

AddOpts.Interpreter='tex';

answer=inputdlg(prompt,dlgTitle,lineNo,def,AddOpts);

out=1; %loop flag; cleared when the menu window is closed

hfile=answer{1}; %file name entered in the dialog (cell indexing returns the string)

[org,fs,nbits]=wavread(hfile); %read the wav file chosen in the dialog

while(out)

b=menu('Press button to play','Original Signal','Echo','Multiple Echo');

disp(b);

switch(b)

case 1,

y=org;

case 2,

y=echo1;

case 3,

y=multi_echo;

otherwise

out=0;

end

if(out)

sound(y,fs);

end

end

%IMPLEMENTATION OF MULTIPLE ECHO EFFECT IN THE AUDIO SIGNAL

function y = multi_echo() %multiple-echo version of tomson.wav; called by the menu program above

[x,fs,nbits]=wavread('tomson.wav');%read in wav file

xlen=length(x);%Calc. the number of samples in the file

% the ratios for attenuation

a0=0.9;

a1=0.8;

a2=0.7;

%calculate no of samples for delay greater than 50 ms

R=ceil(fs*120e-3);

len=xlen+3*R; %output length: input length plus the longest delay (3R)

y=zeros(1,len);

% filter the signal

x=x';

d0=[zeros(1,R) x zeros(1,2*R)]; d0=a0*d0; %copy delayed by R samples, attenuated by a0

d1=[zeros(1,2*R) x zeros(1,R)]; d1=a1*d1; %copy delayed by 2R samples, attenuated by a1

d2=[zeros(1,3*R) x]; d2=a2*d2; %copy delayed by 3R samples, attenuated by a2

% (no per-sample copy loop is needed here; the output is assembled directly below)

y=[x zeros(1,3*R)] + d0 +d1 +d2;
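%Note: the line above builds the multiple-echo output
%   y(n) = x(n) + a0*x(n-R) + a1*x(n-2*R) + a2*x(n-3*R)
%by summing three zero-padded, attenuated and progressively delayed copies
%of the input (delays of R, 2R and 3R samples, i.e. 120, 240 and 360 ms here).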

%MULTIPLE ECHO GENERATION

[x,fs,nbits]=wavread('tomson.wav');%read in wav file

xlen=length(x);%Calc. the number of samples in the file

% the ratios for attenuation

a0=0.9;

a1=0.8;

a2=0.7;

%calculate no of samples for delay greater than 50 ms

R=ceil(fs*120e-3);

len=xlen+3*R; %output length: input length plus the longest delay (3R)

y=zeros(1,len);

% filter the signal

x=x';

d0=[zeros(1,R) x zeros(1,2*R)]; d0=a0*d0;

d1=[zeros(1,2*R) x zeros(1,R)]; d1=a1*d1;

d2=[zeros(1,3*R) x]; d2=a2*d2;

% (no per-sample copy loop is needed here; the output is assembled directly below)

y=[x zeros(1,3*R)] + d0 +d1 +d2;

wavplay(y,fs); %play back the multiple-echo signal


References

(1) Patti Adank, Roeland Van Hout, and Roel Smits. An acoustic description of the vowels of Northern and Southern Standard Dutch. J. Acoust. Soc. Am., 116:1729–1738, 2004.
(2) Paul Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proc. Institute of Phonetic Sciences University of Amsterdam, 17:97–110, 1993.
(3) Jonathan Harrington and Steve Cassidy. Techniques in Speech Acoustics. Kluwer Academic Publishers, 1999.
(4) Keith Johnson. Acoustic and Auditory Phonetics. Blackwell, 1997. ISBN 0-631-20095-9.
(5) Dennis H. Klatt. Software for a cascade/parallel formant synthesizer. J. Acoust. Soc. Am., 67:971–995, 1980.
(6) Dennis H. Klatt and Laura C. Klatt. Analysis, synthesis, and perception of voice quality variations among female and male talkers. J. Acoust. Soc. Am., 87:820–857, 1990.
(7) Donald E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming. Addison-Wesley, third edition, 1998.
(8) L. F. Lamel, R. H. Kassel, and S. Seneff. Speech database development: Design and analysis of the acoustic-phonetic corpus. In Proc. DARPA Speech Recognition Workshop.
(9) Louis C. W. Pols, H. R. C. Tromp, and Reinier Plomp. Frequency analysis of Dutch vowels from 50 male speakers. J. Acoust. Soc. Am., 53:1093–1101, 1973.
(10) Rollin Rachelle. Overtone Singing Study Guide. Cryptic Voices Productions, Amsterdam, 1995.
(11) K. Saberi and D. R. Perrott. Cognitive restoration of reversed speech. Nature, 398:760, 1999.
(12) Kenneth N. Stevens. Acoustic Phonetics. MIT Press, 2nd edition, 2000.
(13) F. E. Toole. Listening tests, turning opinion into fact. J. Audio Eng. Soc., vol. 30, pp. 431–445 (1982 June).
(14) F. E. Toole. Listening tests – identifying and controlling the variables. Proceedings of the 8th International Conference, Audio Eng. Soc. (1990 May).
(15) F. E. Toole. Subjective evaluation. In J. Borwick, ed., Loudspeaker and Headphone Handbook, second edition, chap. 11. Focal Press, London, 1994.
(16) F. E. Toole. Subjective measurements of loudspeaker sound quality and listener performance. J. Audio Eng. Soc., vol. 33, pp. 2–32 (1985 January/February).
(17) S. E. Olive. A method for training of listeners and selecting program material for listening tests. 97th Convention, Audio Eng. Soc., Preprint No. 3893 (1994 November).
(18) F. E. Toole and S. E. Olive. Hearing is believing vs. believing is hearing: Blind vs. sighted listening tests and other interesting things. 97th Convention, Audio Eng. Soc., Preprint No. 3894 (1994 November).
(19) F. E. Toole. Loudspeaker measurements and their relationship to listener preferences. J. Audio Eng. Soc., vol. 34, pt. 1, pp. 227–235 (1986 April); pt. 2, pp. 323–348 (1986 May).
(20) F. E. Toole. Loudspeakers and rooms for stereophonic sound reproduction. Proceedings of the 8th International Conference, Audio Eng. Soc. (1990 May).
(21) F. E. Toole and S. E. Olive. The modification of timbre by resonances: Perception and measurement. J. Audio Eng. Soc., vol. 36, pp. 122–142 (1988 March).
(22) S. E. Olive, P. Schuck, J. Ryan, S. Sally, and M. Bonneville. The variability of loudspeaker sound quality among four domestic-sized rooms.