
Monophonic sound source separation with an unsupervised network of spiking neurones

Ramin Pichevar and Jean Rouat

Département de génie électrique et génie informatique, Université de Sherbrooke,
2500 boul. de l'Université, Sherbrooke, Québec, Canada, J1K 2R1

Abstract

We incorporate auditory-based features into an unconventional pattern classification system, consisting of a network of spiking neurones with dynamical and multiplicative synapses. Although the network does not need any training and is autonomous, the analysis is dynamic and capable of extracting multiple features and maps. The neural network allows computing a binary mask that acts as a dynamic switch on a speech vocoder made of an FIR gammatone analysis/synthesis bank of 256 filters. We report experiments on separation of speech from various intruding sounds (siren, telephone bell, speech, etc.) and compare our approach to other techniques by using the Log Spectral Distortion (LSD) metric.

Key words: amplitude modulation, auditory scene analysis, auditory maps, source separation, speech enhancement, spikes, neurones

1 Introduction

Separation of mixed signals is a major issue in many applications in the context of audio processing. It can be used, for instance, to assist a robot in segregating multiple speakers, to ease the automatic transcription of video via the audio tracks, to separate musical instruments before automatic transcription, and to clean a signal before performing speech recognition, among other applications.

* Ramin Pichevar is now with the Communications Research Centre, Ottawa, Canada.

Email addresses: [email protected], [email protected] (Ramin Pichevar and Jean Rouat).

URLs: http://www-edu.gel.usherb.ca/pichevar, http://www.gel.usherb.ca/rouat (Ramin Pichevar and Jean Rouat).

Preprint submitted to Elsevier Science 14 December 2006


The ideal instrumental setup would use an array of microphones during the recording phase in order to gather many audio channels. Unfortunately, in many situations only one channel is available to the audio engineer who has to solve the separation problem, and the automatic separation of the sources then becomes much more difficult. Most of the monophonic systems described in the scientific literature perform reasonably well on specific signals (generally voiced speech), but fail to efficiently separate a broad range of signals. These relatively unsatisfactory results may be improved by exploiting expertise and knowledge from engineering, psychoacoustics, physiology, and computer science.

In this paper we assume that monophonic source separation systems consist of two main stages: auditory map generation and auditory grouping. Our approach is based on Auditory Scene Analysis. An auditory map is a visual representation of how a sound mixture is perceived in the brain.

2 Auditory Scene Analysis

A remarkable feat of the auditory system is its ability to disentangle the acoustic mixture and group the acoustic energy coming from the same event. This fundamental process of auditory perception is called Auditory Scene Analysis. Of particular importance in auditory scene analysis is the separation of speech from interfering sounds, or speech segregation.

2.1 Auditory Scene Analysis according to Bregman

Bregman [8] defines the concept of “auditory streams”, i.e., the mental percept of a succession of auditory events. Auditory streaming entails two complementary domains of study. The first one tries to determine how sounds cohere to create a sense of continuation; that is the subject of stream fusion. Since more than one source can sound concurrently, the second domain of study examines how concurrent activities retain their independent identities; that is the subject of stream segregation. For instance, auditory streaming is important in assigning consecutive speech elements to the same speaker, or in following a melodic line in a background of other musical sounds.

In this paper, auditory stream segregation and integration are performed by our proposed neural network.


2.2 Computational Auditory Scene Analysis

Computational Auditory Scene Analysis (CASA) is an attempt to realise Auditory Scene Analysis systems with computers. In Bregman's terminology, bottom-up processing corresponds to primitive processing, and top-down processing means schema-based processing. The auditory cues proposed by Bregman for simple tones are not directly applicable to complex sounds. Therefore, one should develop more sophisticated cues based on different auditory maps.

One of the first attempts to perform CASA was made by Weintraub [95]. Ellis [20] uses sinusoidal tracks created by the interpolation of the spectral peaks of the output of a cochlear filterbank. Mellinger's model [55] uses partials. A partial is formed if an activity on the onset maps (the beginning of an energy burst) coincides with a local energy minimum of the spectral maps. Using this assumption, Mellinger proposed a CASA system to separate musical instruments. Cooke [12] introduced “synchrony strands”, the counterpart of Mellinger's cues for speech. The integration and segregation of streams is done using Gestalt and Bregman's heuristics. Berthommier and Meyer use amplitude modulation maps [5] (see also [90, 66, 56]). Gaillard [23] follows a more conventional approach by using the first zero crossings for the detection of pitch and harmonic structures in the frequency-time map. Brown's algorithm [9] is based on the mutual exclusivity Gestalt principle. Hu and Wang use a pitch tracking technique [34]. Wang and Brown [93] use correlograms in combination with bio-inspired neural networks. Grossberg et al. [26] propose a neural architecture that implements Bregman's rules for simple sounds. Sameti et al. [82] use Hidden Markov Models (HMMs), while Factorial HMMs are used in [81, 80] and [72]. Jang and Lee [42] use a technique based on the Maximum A Posteriori (MAP) criterion. For another probability-based CASA system see [15].

Irino and Patterson [39] propose an auditory representation that is synchronous with the glottis and preserves fine temporal information. Their representation makes the synchronous segregation of speech possible. In [28], a multi-resolution model with both high- and low-resolution representations of the audio signal in parallel is used, and the authors propose an implementation for speech recognition. Nix et al. [60] perform a binaural statistical estimation of two speech sources with an approach that integrates temporal and frequency-specific features of speech; it tracks magnitude spectra and direction on a frame-by-frame basis.

Most of the aforementioned systems require training and are supervised 1 . Other works are reported in [76] and [14].

1 This is not the case for the proposed neural network architecture.


3 Segregation and integration with binding

Neurone assemblies (groups) of spiking neurones can be used to implement segregation and fusion (integration) of objects into an auditory image representation. Usually, in signal processing, correlations (or distances) between signals are implemented with delay lines, products and summations. With spiking neurones, comparison (temporal correlation) between signals can be made without the implementation of delay lines. This has been achieved by presenting auditory images to spiking neurones with dynamic synapses. A spontaneous organisation then appears in the network through sets of neurones firing in synchrony: neurones with the same firing phase belong to the same auditory object. Temporal correlation as a mechanism for binding was proposed by Milner [57] and by von der Malsburg [52, 54, 53], who observed that synchrony is a crucial feature to bind neurones associated with similar characteristics. Objects belonging to the same entity are bound together in time. In other words, synchronisation between different neurones and desynchronisation among different regions perform the binding. On the other hand, there are other approaches, such as hierarchical coding or attentional models, that try to find a solution without using the synchronisation concept (for a review see [74]). To a certain extent, this property has been exploited in [6] to perform unsupervised clustering for recognition on images, in [83] for vowel processing with spike synchrony between cochlear channels, in [32] for pattern recognition with spiking neurones, and in [48] to form cell assemblies of spiking neurones using Hebbian learning with depression. In [94], an efficient and robust technique for image segmentation is presented. Wang and Brown [93] studied the potential application of neural synchrony in CASA.

Bio-inspired neural networks are well adapted to signal processing whenever processing time is an important issue. They do not always require a training period and can work in a fully unsupervised mode. Adaptive and unsupervised recognition of sequences is a crucial property of living neurones. Among the many properties listed in this section, this paper implements segregation and integration with sets of synchronous neurones.

4 Proposed system strategy

We propose a bio-inspired bottom-up CASA system. Figure 1 shows the building blocks of this approach. The sound mixture is filtered by an FIR gammatone filterbank, yielding 256 different signals, each belonging to one of the cochlear channels. We propose two different representations: the CAM (Cochleotopic/AMtopic Map) and the CSM (Cochleotopic/Spectrotopic Map), as described in section 6. Depending on the nature of the intruding sound (speech, music, noise, etc.), one of the maps is selected, as explained in section 6. The map is applied to our proposed two-layered spiking neural network (section 7). We propose to generate a binary mask based on the neural synchrony at the output of the neural network (section 8). The binary mask is then multiplied with the output of the FIR gammatone synthesis filterbank and the channels are summed up.

Fig. 1. Block diagram of the proposed bio-inspired sound source separation technique.

Our bio-inspired approach has the following advantages when compared to other non-bio-inspired techniques:

• It does not need knowledge of explicit rules to create the separation mask. In fact, finding a suitable extension of the rules developed by Bregman for simple sounds to real-world sounds is difficult. Therefore, as long as such rules are not derived and well documented [14], expert systems will be difficult to use.

• It does not need any time-consuming training phase prior to the separation phase, contrary to approaches based on statistics such as HMMs [82], Factorial HMMs [72, 81], or MAP [42].

• It is autonomous as it does not use hierarchical classification [53].

The work by Wang and Brown [93] is a breakthrough in the field of bio-inspired neural network CASA and, to our knowledge, is one of the first in the field. Our work uses the same oscillatory neurones as in [93], but with a different neural architecture and a different preprocessing stage. We propose 2-D representation maps that do not require pitch estimation. Moreover, the signal is continuously presented to the system, i.e. no segmentation is required. Our proposed architecture uses a 2-layered neural network, in which the second layer is one-dimensional. We use an FIR gammatone filterbank with much less synthesis distortion [67].

5 Analysis/Synthesis Filterbank

We use an FIR implementation of the well-known gammatone filterbank [62] as the analysis/synthesis filterbank. 256 channels having center frequencies from 100 Hz to 3600 Hz and uniformly spaced on an Equivalent Rectangular Bandwidth (ERB) scale [62] are used. The sampling rate is 8000 samples/second.

The actual time-varying filtering is done by the mask. Our technique is based on generating masks obtained by grouping synchronous oscillators of the neural net (see subsection 7.4). The outputs of the synthesis filterbank are multiplied by the mask (defined in section 8). Thus, auditory channels belonging to interfering sound sources are muted and channels belonging to the sound source of interest remain unaffected 2 .

Before the signals of the masked auditory channels are added to form the synthesised signal, they are passed through the synthesis filters, whose impulse responses are time-reversed versions of the impulse responses of the corresponding analysis filters 3 .

This non-decimated FIR analysis/synthesis filterbank was proposed in [40] and also used in a perceptual speech coder in [47] (in the latter, only 20 channels were used). For more details on the design of the analysis/synthesis filterbank, see [67].
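For illustration, the following Python sketch (not code from the paper) builds FIR gammatone impulse responses with centre frequencies uniformly spaced on an ERB-rate scale and uses their time-reversed versions as synthesis filters, as described above. The ERB formulas are the usual Glasberg-Moore approximation, and the filter length, order and normalisation are illustrative assumptions rather than the exact design of [67].

```python
import numpy as np

FS = 8000            # sampling rate used in the paper
N_CH = 256           # number of cochlear channels
N_TAPS = 1024        # FIR length: an illustrative assumption

def erb_space(f_low=100.0, f_high=3600.0, n=N_CH):
    """Centre frequencies uniformly spaced on the ERB-rate scale."""
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv_erb = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    return inv_erb(np.linspace(erb_rate(f_low), erb_rate(f_high), n))

def gammatone_fir(fc, fs=FS, n_taps=N_TAPS, order=4):
    """FIR gammatone impulse response, normalised to unit energy."""
    t = np.arange(n_taps) / fs
    b = 1.019 * 24.7 * (4.37e-3 * fc + 1.0)        # ERB bandwidth at fc
    h = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return h / np.sqrt(np.sum(h ** 2) + 1e-12)

# Analysis filtering, then synthesis with the time-reversed filters
# (which compensates the per-channel delays).  Only 8 of the 256
# channels are processed here to keep the demo short.
x = np.random.randn(FS)                            # 1 s of test signal
fcs = erb_space()
analysis = np.stack([np.convolve(x, gammatone_fir(fc), mode="same")
                     for fc in fcs[:8]])
synthesis = np.stack([np.convolve(c, gammatone_fir(fc)[::-1], mode="same")
                      for c, fc in zip(analysis, fcs[:8])])
y = synthesis.sum(axis=0)                          # partial resynthesis
```

Summing the synthesis outputs of all (unmasked) channels would give back an approximation of the input, which is the property the binary mask later exploits.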


Fig. 2. Example of a twenty-four-channel CAM for a mixture of /di/ and /da/ pronounced by two speakers; mixture at SNR = 0 dB and frame center at t = 166 ms. Channels 10-18 belong to one of the sources and the other channels belong to the other source.

6 Analysis and auditory image generation

We propose two spectral 2-D representations that are generated simultaneously:

• A modified version of the well-known and documented amplitude modulation map, which we call the Cochleotopic/AMtopic Map (CAM), closely related to modulation spectrograms as defined by other authors [2, 56];

• The Cochleotopic/Spectrotopic Map (CSM), which encodes the averaged spectral energies of the cochlear filterbank outputs.

By proposing the CAM, we aim to somewhat reproduce the AM processing performed by multipolar cells (Chopper-S) of the anteroventral cochlear nucleus [87]. The second representation (CSM) is motivated by the processing performed by the spherical bushy cells of the ventral cochlear nucleus [29].

2 This is, in some way, equivalent to labelling cochlear channels for each time frame: a value of one is associated with the target signal and a value of zero with the interfering signal.
3 For cancelling out cochlear channel delays.

Our CAM/CSM generation algorithm is as follows:

(1) Filter the sound source using a 256-filter ERB-scaled cochlear filterbank ranging from 100 Hz to 3.6 kHz.

(2) • For the CAM: extract the envelope (AM demodulation) for channels 30-256; for the low-frequency channels (1-29), use the raw outputs 4 .
    • For the CSM: do nothing in this step.

(3) Compute the STFT of each cochlear output using a Hamming window 5 .

(4) In order to increase the spectro-temporal resolution of the STFT, find the reassigned spectrum of the STFT [69] (this consists of applying an affine transform to the points in order to relocate the spectrum) 6 .

(5) Compute the logarithm of the magnitude of the STFT. The logarithm enhances the presence of the stronger source in a given 2-D frequency bin of the CAM/CSM 7 .
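A condensed sketch of these steps is shown below, assuming the 256 cochlear channel outputs are already available. The envelope is extracted with the Hilbert transform magnitude, the frame is a non-overlapping 32 ms Hamming window, and the reassignment of step (4) is omitted; these are simplifying assumptions, not the exact processing used in the paper.

```python
import numpy as np
from scipy.signal import hilbert, get_window

WIN = 256                       # 32 ms Hamming window at 8000 samples/s

def cam_csm_frame(channels, start, use_cam=True):
    """One CAM (or CSM) frame from a (256, n_samples) array of channel outputs."""
    seg = channels[:, start:start + WIN].copy()
    if use_cam:
        # Step (2): AM demodulation for channels 30-256; raw outputs are
        # kept for the low-frequency channels 1-29 (resolved harmonics).
        seg[29:] = np.abs(hilbert(seg[29:], axis=1))
    # Step (3): windowed FFT of each channel (the reassignment of step (4)
    # is omitted in this sketch).
    spec = np.abs(np.fft.rfft(seg * get_window("hamming", WIN), axis=1))
    # Step (5): log magnitude; the stronger source dominates each 2-D bin,
    # since log(e1 + e2) ~ max(log e1, log e2) unless both are large and close.
    return np.log(spec + 1e-12)

channels = np.random.randn(256, 8000)                  # stand-in cochlear outputs
cam = cam_csm_frame(channels, start=0)                 # (256, 129) map
csm = cam_csm_frame(channels, start=0, use_cam=False)
```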

It has recently been observed that the efferent loop between the medial olivocochlear system (MOC) and the outer hair cells modifies the cochlear response in such a way that speech is enhanced relative to the background noise [45]. We suppose here that envelope detection and the selection between the CAM and the CSM, in the auditory pathway, could be associated with the change of stiffness of hair cells combined with cochlear nucleus processing [25] [49]. For now, in the present experimental setup, the selection between the two auditory images is done manually. In the near future, we plan to implement an efferent feedback control for signal representation adaptation.

Figure 2 shows an example of a CAM computed through a 24-channel cochlear filterbank for a /di/ and /da/ mixture pronounced by a female and a male speaker. Ellipses outline the auditory objects.

7 Architecture of the Neural Network

In this section, we propose a novel neural architecture based on bio-inspired neurones.

4 Low-frequency channels are said to resolve the harmonics while others do not. This suggests a different strategy for low-frequency channels [77].
5 Non-overlapping adjacent windows with 4 ms or 32 ms lengths have been tested.
6 Note that one can make a compromise by not computing the time-consuming reassigned spectrum, at the cost of slightly worse performance.
7 log(e1 + e2) ≈ max(log e1, log e2) (unless e1 and e2 are both large and almost equal) [80].


7.1 Bio-inspired Neural Networks

Bio-inspired (spiking) neural networks try to mimic the behavior of real neurones in animals and humans. They allow the processing of temporal sequences, contrary to most classical neural networks, which are only suitable for static (time-invariant) data.

In the case of bio-inspired neural networks, temporal sequence processing is done naturally because of the intrinsic dynamic behaviour of the neurones. The pioneering work in the field was done by Hodgkin and Huxley [24], who came up in the 1950s with a mathematical model describing the behaviour of the squid giant axon. Although this model is the most complete so far (it can predict most of the behaviours seen in simple biological neurones), it is very complex and difficult to simulate in an artificial neural network paradigm. Hence, simplified models such as the Wang-Terman model described in the following subsection have been proposed in the literature.

7.2 Building Blocks of the Neural Architecture

The building blocks of our architecture are the well-known Wang-Terman oscillators [94]. There is an active phase, when the neurone spikes, and a relaxation phase, when the neurone is silent. Information in our network is conveyed in the relative phase of each oscillation (spike).

The state-space equations for this dynamical system are as follows:

dx/dt = 3x - x^3 + 2 - y + ρ + p + S    (1)

dy/dt = ε [γ (1 + tanh(x/β)) - y]    (2)

where x is the membrane potential (output) of the neurone and y is the state variable for channel activation or inactivation. ρ denotes the amplitude of a Gaussian noise, p is the external input to the neurone, and S is the coupling from other neurones (connections through synaptic weights). ε, γ, and β are constants. The Euler integration method is used to solve the equations.
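As an illustration, a minimal Euler integration of Eqs. (1)-(2) for a single, uncoupled oscillator could be written as follows. The time step dt is an assumption (it is not specified above), the constants come from Table 1, and the Gaussian noise term is interpreted as ρ times a unit-variance sample.

```python
import numpy as np

# Constants from Table 1 (first layer of the network).
EPS, GAMMA, BETA, RHO = 0.02, 4.0, 0.1, 0.02

def wang_terman(p, S=0.0, n_steps=20000, dt=0.01):
    """Euler integration of Eqs. (1)-(2) for one oscillator.

    p is the external input and S the coupling from other neurones
    (held constant here).  Returns the membrane-potential trajectory x(t).
    """
    x, y = -2.0, 0.0                       # arbitrary initial state
    xs = np.empty(n_steps)
    for n in range(n_steps):
        noise = RHO * np.random.randn()    # rho interpreted as noise amplitude
        dx = 3.0 * x - x ** 3 + 2.0 - y + noise + p + S
        dy = EPS * (GAMMA * (1.0 + np.tanh(x / BETA)) - y)
        x, y = x + dt * dx, y + dt * dy
        xs[n] = x
    return xs

trace = wang_terman(p=0.5)                 # alternates active and silent phases
```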

The network consists of two layers. The first layer is two-dimensional and the second layer is one-dimensional.


7.3 First layer: Auditory image segmentation

The first layer essentially performs a segmentation of the auditory map image. A good reference on image segmentation with oscillatory neurones can be found in [94]. The dynamics of the neurones we use is governed by a modified version of the Van der Pol relaxation oscillator (the Wang-Terman oscillators [93]).

Fig. 3. Architecture of the two-layer bio-inspired neural network. G stands for the global controller (the global controller for the first layer is not shown in the figure). One long-range connection is shown in the figure.

The first layer is a partially connected network of relaxation oscillators [93]. Each neurone is connected to its four neighbors. The CAM (or the CSM) is applied to the input of the neurones. Our observations have shown that the geometric interpretation of pitch (ray distance criterion) is less clear for the first 29 channels (channels where harmonics are usually resolved). For this reason, we have also established long-range connections from clear (high-frequency) zones to confusion (low-frequency) zones. These connections exist only across the cochlear channel number axis of the CAM. This architecture helps the network to better extract harmonic patterns.

The weight at time t between neurone (i, j) and neurone (k, m) of the first layer is computed via the following formula:

w_{i,j,k,m}(t) = (1 / Card{N(i, j)}) · 0.25 / e^{λ |p(i,j;t) - p(k,m;t)|}    (3)

where p(i, j; t) and p(k, m; t) are respectively the external inputs to neurone (i, j) and to neurone (k, m) ∈ N(i, j). Card{N(i, j)} is a normalisation factor equal to the cardinal number (number of elements) of the set N(i, j) containing the neighbors connected to neurone (i, j); it can be equal to 4, 3 or 2 depending on the location of the neurone on the map (center, corner, etc.). The external input values are normalised. The value of λ depends on the dynamic range of the inputs and is set to λ = 1 in our case. The same weight adaptation is used for the long-range clear-to-confusion-zone connections (Eq. 7) in the CAM processing case.
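A small sketch of this weight rule, with λ = 1 and a normalised map p, is given below; the way the four-neighbourhood N(i, j) is enumerated on the rectangular map is an assumption, and (k, m) is assumed to already belong to N(i, j).

```python
import numpy as np

LAMBDA = 1.0

def first_layer_weight(p, i, j, k, m):
    """Eq. (3): weight between neurone (i, j) and a neighbour (k, m).

    p is the normalised CAM/CSM (external inputs), of shape
    (frequencies, channels); (k, m) must belong to N(i, j).
    """
    card = sum(1 for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
               if 0 <= i + di < p.shape[0] and 0 <= j + dj < p.shape[1])
    return (1.0 / card) * 0.25 / np.exp(LAMBDA * abs(p[i, j] - p[k, m]))

p = np.random.rand(64, 24)                        # toy normalised auditory map
w = first_layer_weight(p, i=10, j=5, k=10, m=6)   # large when inputs are similar
```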

The coupling at time t, S_{i,j}(t), which appears in Eq. (1), is:

S_{i,j}(t) = Σ_{k,m ∈ N(i,j)} w_{i,j,k,m}(t) H(x(k, m; t)) - η G(t) + κ L_{i,j}(t)    (4)

H(·) is the Heaviside function and η is a constant. x(k, m; t) is the membrane potential of neurone (k, m). G(t) is a global controller whose dynamics is governed by the following equations:

G(t) = α H(z - θ)    (5)

dz/dt = σ - ξ z    (6)

σ is equal to 1 if the global activity of the network is greater than a predefined threshold ζ and is zero otherwise. α, ξ, and θ are constants, and z is an internal state variable.
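A discrete-time sketch of this global controller (Eqs. (5)-(6)), using the constants of Table 1 and an assumed Euler step, might read:

```python
import numpy as np

ALPHA, XI, THETA, ZETA = -0.1, 0.4, 0.9, 0.2   # Table 1

def global_controller(activity, dt=0.01, z0=0.0):
    """Return G(t) for a sequence of summed network-activity values.

    sigma is 1 whenever the activity exceeds the threshold zeta (Eq. (6));
    G is alpha * H(z - theta) (Eq. (5)).
    """
    z, G = z0, np.empty(len(activity))
    for n, a in enumerate(activity):
        sigma = 1.0 if a > ZETA else 0.0
        z += dt * (sigma - XI * z)
        G[n] = ALPHA * float(z > THETA)
    return G

G = global_controller(np.abs(np.random.randn(1000)))
```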

L_{i,j}(t) is the long-range coupling, defined as follows:

L_{i,j}(t) = 0 for j ≥ 30, and
L_{i,j}(t) = Σ_{k=225...256} w_{i,j,i,k}(t) H(x(i, k; t)) for j < 30.    (7)

κ is a binary variable defined as follows:

κ = 0.2 for the CAM, and κ = 0 for the CSM.    (8)

For a given frequency i, channels j with an index smaller than 30 receive connections from channels k indexed between 225 and 256.

The first layer is designed to handle presentations of auditory maps with continuous sliding and overlapping windows on the signal, with real application perspectives in mind.
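Putting Eqs. (4), (7) and (8) together, the coupling received by neurone (i, j) of the first layer could be sketched as follows. The dictionaries `weights` and `long_weights` are assumed to hold values computed with Eq. (3), `x` holds the first-layer membrane potentials, `G` is the current value of the global controller, and 0-based indices are used loosely in place of the paper's 1-based channel numbers.

```python
import numpy as np
from collections import defaultdict

ETA = 0.05                           # eta, Table 1
KAPPA = {"CAM": 0.2, "CSM": 0.0}     # Eq. (8)

def heaviside(v):
    """H(.) of Eq. (4)."""
    return 1.0 if v > 0.0 else 0.0

def coupling(i, j, x, weights, long_weights, G, image="CAM"):
    """Eq. (4): S_{i,j} = neighbour term - eta*G + kappa*L_{i,j}."""
    F, C = x.shape                   # (frequencies, channels)

    # Neighbour term over the four-neighbourhood N(i, j).
    s = 0.0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        k, m = i + di, j + dj
        if 0 <= k < F and 0 <= m < C:
            s += weights[(i, j, k, m)] * heaviside(x[k, m])

    # Long-range term, Eq. (7): only "confusion" channels (j < 30) listen
    # to the clear high-frequency channels 225-256.
    L = 0.0
    if j < 30:
        for k in range(224, C):
            L += long_weights[(i, j, i, k)] * heaviside(x[i, k])

    return s - ETA * G + KAPPA[image] * L

# Toy usage with stand-in weights.
x = np.random.randn(64, 256)
w = defaultdict(lambda: 0.1)
print(coupling(10, 5, x, w, w, G=0.0))
```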


Constant    Value
λ           1
θ           0.9
α           -0.1
ξ           0.4
ζ           0.2
η           0.05
γ           4.0
ε           0.02
ρ           0.02
β           0.1
κ           0.2

Table 1. Numerical values of the parameters used in the first layer of the network.

7.4 Second layer: temporal correlation and multiplicative synapses

The second layer performs temporal correlation between neurones. Each of its neurones represents a cochlear channel of the analysis/synthesis filterbank. For each presented auditory map, the second layer establishes binding between neurones whose entry is dominated by the same source. The dendrites establish multiplicative synapses with the first layer. The second layer is an array of 256 neurones (one for each channel) similar to those described by equations (1) and (2) from subsection 7.2. Each neurone receives the weighted product of the outputs of the first-layer neurones along the frequency axis of the CAM/CSM. The weights between layer one and layer two are defined as w_ll(i) = α/i, where i can be related to the frequency bins of the STFT and α is a constant, for the CAM case, since we are looking for structured patterns. For the CSM, w_ll(i) = α is constant along the frequency bins, as we are looking for energy bursts. Therefore, the input stimulus to neurone j in the second layer is defined as follows:

θ(j; t) = Π_i w_ll(i) Ξ{x̄(i, j; t)}    (9)
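As a small illustration of these two weight profiles (with α = 1, as in Table 2):

```python
import numpy as np

ALPHA = 1.0                                    # Table 2

def layer1_to_layer2_weights(n_bins, use_cam=True):
    """w_ll(i) = alpha / i for the CAM (structured, harmonic-like patterns);
    a constant alpha for the CSM (energy bursts)."""
    i = np.arange(1, n_bins + 1, dtype=float)
    return ALPHA / i if use_cam else np.full(n_bins, ALPHA)

w_cam = layer1_to_layer2_weights(129)          # e.g. 129 bins for a 256-point FFT
w_csm = layer1_to_layer2_weights(129, use_cam=False)
```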


Constant    Value
α           1
µ           2

Table 2. Numerical values of the parameters used in the second layer of the network.

The operator Ξ is defined as:

Ξ{x̄(i, j; t)} = 1 for x̄(i, j; t) = 0, and Ξ{x̄(i, j; t)} = x̄(i, j; t) elsewhere,    (10)

where the overline denotes averaging over a time window (the duration of the window is on the order of the discharge period). The multiplication with x̄(i, j; t) is thus done only for non-zero x̄(i, j; t), i.e., for first-layer outputs in which a spike is present [22, 63]. This behaviour has been observed in the integration of Interaural Time Difference (ITD) and Interaural Level Difference (ILD) information in the barn owl's auditory system [22], and in neurones of the monkey's posterior parietal lobe that show receptive fields that can be explained by a multiplication of retinal and eye or head position signals [1].

The synaptic weights inside the second layer are adjusted through the following rule:

w′_{ij}(t) = 0.2 / e^{µ |p(i;t) - p(j;t)|}    (11)

µ is chosen to be equal to 2. The binding is done via this second layer. In fact, it is an array of fully connected neurones along with a global controller, defined as in equations (5) and (6). The global controller desynchronises the synchronised neurones belonging to different auditory objects by emitting inhibitory activity whenever there is activity (spiking) in the network [93]. Thus, the network can adapt quickly to input changes.
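A sketch of the second-layer input and of its internal weight rule (Eqs. (9)-(11), with µ = 2 from Table 2) is given below. The short toy column of time-averaged first-layer outputs is made up for the example, and how x̄ is sampled in practice is an assumption.

```python
import numpy as np

MU = 2.0                                       # Table 2

def xi_op(x_bar):
    """Eq. (10): map zero (silent) averaged outputs to 1 so that they do
    not null the multiplicative product."""
    return np.where(x_bar == 0.0, 1.0, x_bar)

def second_layer_input(x_bar_column, w_ll):
    """Eq. (9): multiplicative dendritic input theta(j; t) for channel j,
    where x_bar_column holds the time-averaged first-layer outputs along
    the frequency axis for that channel."""
    return np.prod(w_ll * xi_op(x_bar_column))

def second_layer_weight(p_i, p_j):
    """Eq. (11): weight between second-layer neurones i and j."""
    return 0.2 / np.exp(MU * abs(p_i - p_j))

x_bar = np.array([0.0, 0.8, 0.0, 0.9, 0.85])   # toy column, partly silent
w_ll = 1.0 / np.arange(1, 6)                   # CAM-style weights (alpha = 1)
theta = second_layer_input(x_bar, w_ll)
```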

The selection strategy at the output of the second layer is based on temporal correlation:

• Neurones belonging to the same source synchronise (same spiking phase);
• Neurones belonging to different sources desynchronise (different spiking phases).


8 Masking

Based on the phase synchronisation described in the previous section, a mask is generated by associating zeros and ones with the different channels. The energy is normalised in order to have the same SPL for all frames. Note that two-source mixtures are considered throughout this article, but the technique can potentially be used for more than two sources. In that case, for each time frame n, the labelling of individual channels is equivalent to the use of multiple masks (one for each source).
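For illustration, once a relative spiking phase has been estimated for each second-layer neurone, the binary mask and the masked resynthesis could be sketched as below. The phase-tolerance test and the crude level normalisation are assumptions about details the section leaves open.

```python
import numpy as np

def binary_mask(phases, target_channel=0, tol=0.1):
    """Channels whose spiking phase matches the target channel's phase get a 1."""
    return (np.abs(phases - phases[target_channel]) < tol).astype(float)

def apply_mask(resynth_channels, mask):
    """Multiply the synthesis-filterbank outputs by the mask, sum, and
    apply a crude level normalisation (stand-in for the SPL equalisation)."""
    out = (mask[:, None] * resynth_channels).sum(axis=0)
    return out / (np.max(np.abs(out)) + 1e-12)

phases = np.random.rand(256)                   # stand-in phases, one per channel
resynth = np.random.randn(256, 8000)           # stand-in synthesis outputs
y = apply_mask(resynth, binary_mask(phases))
```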

9 Experiments

9.1 Database and comparison

Martin Cooke's database [13] is used for evaluation purposes. The following intruding signals have been tested: 1 kHz tone, FM siren, white noise, trill telephone noise, and speech. These noises have been added to the target utterance. The results (i.e., audio files) can be found at [65]. Each mixture is applied to our proposed system and the mixed sound sources are separated. The Log Spectral Distortion (LSD) is used as a performance criterion [91, 92] 8 . It is defined as:

LSD = (1/L) Σ_{l=0}^{L-1} sqrt{ (1/K) Σ_{k=0}^{K-1} [ 20 log10( (|I(k, l)| + ε) / (|O(k, l)| + ε) ) ]² }    (12)

where I(k, l) and O(k, l) are the FFTs of I(t) (the ideal source signal) and O(t) (the separated source), respectively. L is the number of frames, K is the number of frequency bins, and ε is meant to prevent extreme values (it is equal to 0.001 in our case).
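A direct implementation of Eq. (12) could look like the following sketch; the non-overlapping 256-sample framing is an assumption, and ε = 0.001 as above.

```python
import numpy as np

def lsd(ideal, separated, frame_len=256, eps=1e-3):
    """Log Spectral Distortion of Eq. (12), averaged over frames."""
    n_frames = min(len(ideal), len(separated)) // frame_len
    total = 0.0
    for l in range(n_frames):
        I = np.abs(np.fft.rfft(ideal[l * frame_len:(l + 1) * frame_len])) + eps
        O = np.abs(np.fft.rfft(separated[l * frame_len:(l + 1) * frame_len])) + eps
        total += np.sqrt(np.mean((20.0 * np.log10(I / O)) ** 2))
    return total / n_frames

x = np.random.randn(8000)
print(lsd(x, x + 0.01 * np.random.randn(8000)))   # small distortion, small LSD
```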

We compare our performance with four different approaches proposed in the literature:

• The system proposed in [93], a neural-network-based CASA system that uses a different neural architecture and a different type of preprocessing.

• The expert-system-based CASA approach proposed in [34], which uses more conventional cues, such as pitch tracking, for grouping different sources.

• An approach based on statistical learning from [41]. Note that in the latter case a training phase is necessary prior to separation.

• A speech enhancement technique based on wavelets by Bahoura and Rouat [3, 4].

8 For other criteria, such as the PESQ (Perceptual Evaluation of Speech Quality), see [78] [68].

9.2 Separation performance

Table 3 gives comparison results based on the LSD for a first set of sound files:

• In all cases, our proposed architecture performs better than [93]. Our system causes less distortion of the target signal at the expense of less noise rejection. This is an advantageous strategy in hearing aid design.

• Our proposed technique outperforms [34] when the intrusion is a tone, a siren, or white noise. For the telephone ring, [34] has better scores.

• Our technique outperforms [4] in all cases except for white noise 9 . However, the technique in [4] is a speech enhancement technique that was never designed for source separation. Speech enhancement techniques are more adapted to background noise removal (including white noise) than to sound source separation tasks. Note that the price to pay for the better performance of our proposed technique is a higher computational complexity when compared with more conventional speech enhancement techniques such as the one presented in [4].

• For the double vowel, the LSD has the highest value, showing that separation is more difficult when the interference is speech.

Table 4 shows comparison results based on the LSD for a second set of sound files:

• The sound files used in [41] 10 are used for comparison purposes. The current work is compared with two other techniques, from [41] and [4]. B-R gives only one output (processed sound), in contrast with P-R and J-L, which give two extracted sounds each (Tables 3 and 4). Therefore, the result of B-R is compared (and the LSD extracted) once with the original music and once with the original female speaker. Results from the three techniques are comparable for the extraction of the female speaker's voice, while our technique outperforms the other two for the extraction of the music from the background. Note that our technique does not require any prior statistical training, in contrast with [41].

In the following subsections, spectrograms for different sounds and different approaches are given for visual comparison purposes.

9 Even if the LSD is lower for B-R on the telephone ring (16.18 instead of 16.43), the perceptive quality for B-R is not as good as that of P-R.
10 http://home.bawi.org/jangbal/research/demos/rbss1/sepres.html


Intrusion       SNR of the initial    P-R     W-B     H-W     B-R
                mixture (dB)          (LSD)   (LSD)   (LSD)   (LSD)
Tone            -2 dB                 5.38    18.87   13.59   9.77
Siren           -5 dB                 8.93    20.64   13.40   18.94
Tel. ring        3 dB                 16.43   18.35   14.05   16.18
White noise     -5 dB                 16.82   35.25   26.51   14.84
Male (da)        0 dB                 14.92   N/A     N/A     17.70
Female (di)      0 dB                 19.70   N/A     N/A     24.04

Table 3. Log spectral distortion (LSD) for four different methods: P-R (our proposed approach), W-B (the method proposed in [93]), H-W (the method proposed in [34]), and B-R (the method proposed in [4]). The intrusion noises are as follows: a) 1 kHz pure tone, b) FM siren, c) telephone ring, d) white noise, e) male-speaker intrusion (/di/) for the French /di//da/ mixture, f) female-speaker intrusion (/da/) for the French /di//da/ mixture. Except for the last two tests, the intrusions are mixed with a sentence taken from Martin Cooke's database.

Mixture                 Separated source    P-R (LSD)    J-L (LSD)    B-R (LSD)
Music & female (AF)     music               8.01         21.25        14.00
                        voice               17.54        16.49        18.16

Table 4. LSD for three different methods: P-R (our proposed approach), J-L ([41]), and B-R ([3] and [4]). The mixture comprises a female voice with a rock music background.

9.3 Separation examples

9.3.1 Separation of speech from telephone trill

Figure 4 shows the mixture of the utterance “Why were you all weary?” with the telephone trill noise (from Martin Cooke's database). The trill telephone noise (ring) is wideband, interrupted, and structured. Figure 5 shows the spectrograms of the separated utterance and of the separated trill telephone, obtained by using our approach. It is interesting to note that the medium- to low-frequency range of the telephone trill has been preserved. Figure 6 shows the utterance extracted by using [93]. As can be seen, our approach performs better in the medium to higher frequencies.


Fig. 4. Mixture of the utterance “Why were you all weary?” with a trill telephone noise.

Fig. 5. Top: the synthesised “Why were you all weary?” after separation by the approach proposed in this article. Bottom: the synthesised trill telephone after separation by the approach proposed in this article.

Fig. 6. The synthesised “Why were you all weary?” obtained with the approach proposed in [93] for the trill telephone and utterance mixture.

9.3.2 Separation of speech from 1 kHz tone

In this experiment, the utterance “I'll willingly marry Marilyn” mixed with a 1 kHz pure tone is used. The tone is narrowband, continuous, and structured. Figure 7 shows the original utterance plus the 1 kHz tone. Figure 8 shows the separation results for our approach and for the approach proposed in [93]. The method proposed in [93] removes speech in the middle and high frequencies, while these frequencies remain unaffected by our approach. When listening to the signal, and according to the LSD (equal to 7.07), the tone has been removed, even though a gray bar is visible on the top panel of Figure 8.

Fig. 7. Mixture of the utterance “I'll willingly marry Marilyn” with a 1 kHz tone.

9.3.3 Separation of speech from an FM signal (Siren)

Figure 9 shows the mixture of the utterance “I'll willingly marry Marilyn” with a siren. The siren is a locally narrowband, continuous, structured signal. The bottom panel of figure 10 shows the separated siren obtained with our proposed technique. The top panel of figure 10 shows the spectrogram of the separated utterance. Figure 11 shows the spectrogram of the utterance separated using the method proposed in [93].


Fig. 8. Top: the separation result for the 1 kHz tone plus utterance mixture using the approach described in this article. The dynamic range between the darkest gray level and the brightest level is 50 dB. Bottom: the synthesised “Why were you all weary?” obtained with the approach proposed in [93]. The high-frequency information is missing.

9.4 Discussions

Other Signal-to-Noise-Ratio (SNR)-like criteria, such as the SNR, the segmental SNR, the PEL (Percentage of Energy Loss), and the PNR (Percentage of Noise Residue), are used in the literature [93, 46, 33, 34, 35, 75, 66] and can be used as performance scores.

Fig. 9. Mixture of a siren and the sentence “I'll willingly marry Marilyn”.

Although criteria like the PEL/PNR, the SNR, the segmental SNR, and the LSD are used in the literature as performance criteria, they do not always reflect the real perceptive performance of a given sound separation system. For example, the SNR, the PEL, and the PNR ignore high-frequency information. The LSD does not take into account temporal aspects such as phase; therefore, the LSD will not detect phase distortions in separation results. It is known that performance evaluation cannot be based only on measures such as the LSD 11 . Other criteria like the PESQ (Perceptual Evaluation of Speech Quality) [78] [68] can be used as well for comparison purposes. More investigation should be made to modify the criteria commonly used in source separation.

At any instant of time, intruding noises such as tones, telephone rings and sirens have narrow bandwidths with strong, localised spectral energy. These noises come out easily in the CSM representation. On the other hand, intruding signals such as music and speech are wideband and would not separate from other speech sources based on the CSM, while they do with the CAM. The CAM and the CSM are complementary representations, at least for the limited database used here. Referring back to the first sections of the paper, we can state that the separation of two interfering unvoiced speech sounds will very likely not be feasible based only on the CSM and CAM representations. We know from physiology that other representations are available to the auditory system and should be implemented for a more robust system. These new representations should not assume the stationarity of the signals, as has been assumed here by using the Fourier transform.

From the siren interfering signal we observe that the neural network is able to follow the dynamic changes in time and frequency, which is a crucial property of the system. However, the limiting factors are the number of cochlear channels and the width of the sliding window. Audio files with the results for three-source sound separation can be found on the web pages at [65] or [79].

11 Sound and demo files are available for listening at: http://www-edu.gel.usherbrooke.ca/pichevar/Demos.htm


Fig. 10. Top: the result of the utterance extraction from the siren plus sentence mixture with our proposed technique. Bottom: the result of the siren extraction from the siren plus sentence mixture with our proposed technique.

10 Conclusion

A new system that comprises a perceptive analysis to extract multiple and simultaneous features to be processed by an unsupervised neural network has been proposed. There is no need to tune the neural network when changing the nature of the signal. Furthermore, there is no training or recognition phase.

The proposed system has been tested on a limited corpus. Many improvements should be made before considering an extensive use of the approach in real situations. Among them, there is a need for creating new feature maps for onsets, offsets, etc. Nevertheless, the experiments have led us to the conclusion that computational neuroscience, in combination with speech processing, offers a strong potential for unsupervised and dynamic speech processing systems.

Fig. 11. The synthesised “Why were you all weary?” obtained with the approach proposed in [93] for the siren plus utterance mixture case.

Even with crude approximations such as binary masking and non-overlapping, independent windows 12 , we obtain relatively good synthesis intelligibility.

12 Binary masks create artifacts by placing zeros in the spectrum where the interfering source was. The absence of energy at these locations might be heard (perception of an absent signal, or musical noise).

Future developments will include the implementation of an efferent feedback control for signal representation adaptation and feature selection, as a form of top-down (schema-driven) processing.

11 Acknowledgments

The authors would like to thank Philippe Gournay, Hossein Najaf-Zadeh, and Richard Boudreau for reading and commenting on the paper; DeLiang Wang and Guoning Hu for their audio files and for fruitful discussions on oscillatory neurones; Christian Feldbauer and Gernot Kubin for their FIR implementation of the gammatone filterbanks and for fruitful discussions on filterbanks; and Jean-Marc Valin for discussions on the LSD performance criterion. Many thanks also to the anonymous reviewers for their constructive feedback.

References

[1] Andersen, R., Snyder, L., Bradley, D., Xing, J., 1997. Multimodal representation of space in the posterior parietal cortex and its use in planning movements. Ann. Rev. Neurosci. 20, 303.
[2] Atlas, L., Shamma, S. A., 2003. Joint acoustic and modulation frequency. EURASIP Journal on Applied Signal Processing (7), 668–675.
[3] Bahoura, M., Rouat, J., 2001. Wavelet speech enhancement based on the Teager Energy Operator. IEEE Signal Processing Letters 8, 10–12.
[4] Bahoura, M., Rouat, J., 2001. New approach for wavelet speech enhancement. In: Eurospeech 2001, Denmark, pp. 1937–1940.
[5] Berthommier, F., Meyer, G., 1997. Improving of amplitude modulation maps for f0-dependent segregation of harmonic sounds. In: Eurospeech'97.
[6] Bohte, S. M., Poutre, H. L., Kok, J. N., March 2002. Unsupervised clustering with spiking neurons by sparse temporal coding and multilayer RBF networks. IEEE Transactions on Neural Networks 13 (2), 426–435.
[7] Borisyuk, R., Denham, M., Hoppensteadt, F., Kazanovich, Y., Vinogradova, O., 2001. Oscillatory model of novelty detection. Network: Computation in Neural Systems 12 (1), 1–20.
[8] Bregman, A., 1990. Auditory Scene Analysis. MIT Press.
[9] Brown, G., Cooke, M., 1994. Computational auditory scene analysis. Computer Speech and Language, 297–336.
[10] Chechik, G., Tishby, N., 2000. Temporally dependent plasticity: An information theoretic account. In: NIPS.
[11] Compernolle, D. V., Ma, W., Xie, F., Diest, M. V., 1990. Speech recognition in noisy environments with the aid of microphone array. Speech Comm. 9 (5–6), 433–442.
[12] Cooke, M., 1991. Modelling auditory processing and organisation. Ph.D. thesis, University of Sheffield (published in the Distinguished Dissertations in Computer Science Series, Cambridge University Press, paperback, 2005).
[13] Cooke, M., 2004. http://www.dcs.shef.ac.uk/~martin/.
[14] Cooke, M., Ellis, D., 2001. The auditory organization of speech and other sources in listeners and computational models. Speech Communication, 141–177.
[15] Cooke, M., Green, P., Josifovski, L., Vizinho, A., 2001. Robust automatic speech recognition with missing and unreliable acoustic data. Speech Comm. 34, 267–285.
[16] Delgutte, B., 1980. Representation of speech-like sounds in the discharge patterns of auditory nerve fibers. JASA 68, 843–857.
[17] DeWeese, M., 1996. Optimization principles for the neural code. Network: Computation in Neural Systems 7 (2), 325–331.
[18] DeWeese, M. R., Zador, A. M., December 2002. Binary coding in auditory cortex. In: NIPS.
[19] E.L.D.A., 2004. http://www.elda.fr.
[20] Ellis, D., 1996. Prediction-driven computational auditory scene analysis. Ph.D. thesis, MIT.
[21] Frisina, R. D., Smith, R. L., Chamberlain, S. C., 1985. Differential encoding of rapid changes in sound amplitude by second-order auditory neurons. Experimental Brain Research 60, 417–422.
[22] Gabbiani, F., Krapp, H., Koch, C., Laurent, G., 2002. Multiplicative computation in a visual neuron sensitive to looming. Nature 420, 320–324.

[23] Gaillard, F., 1999. Analyse de scènes auditives computationnelle (CASA) : un nouvel outil de marquage du plan temps-fréquence par détection d'harmonicité exploitant une statistique de passage par zéro. Ph.D. thesis, INPG.
[24] Gerstner, W., 2002. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press.
[25] Giguere, C., Woodland, P. C., 1994. A computational model of the auditory periphery for speech and hearing research. JASA, 331–349.
[26] Grossberg, S., Govindarajan, K., Wyse, L., Cohen, M., 2003. ARTSTREAM: A neural network model of auditory scene analysis and source segregation. Neural Networks.
[27] Handbook of Neural Computation, 1997. IOP Publishing Ltd and Oxford University Press.
[28] Harding, S., Meyer, G., September 2003. Multi-resolution auditory scene analysis: Robust speech recognition using pattern-matching from a noisy signal. In: EUROSPEECH. pp. 2109–2112.
[29] Henkel, C. K., 1997. The auditory system. In: Haines, D. E. (Ed.), Fundamental Neuroscience. Churchill Livingstone.
[30] Hewitt, M., Meddis, R., April 1994. A computer model of amplitude-modulation sensitivity of single units in the inferior colliculus. Journal of the Acoustical Society of America 95 (4), 2145–2159.
[31] Ho, T. V., Rouat, J., May 1998. Novelty detection based on relaxation time of a network of integrate-and-fire neurons. In: Proc. of the IEEE/INNS Int. Joint Conf. on Neural Networks. Vol. 2. pp. 1524–1529.
[32] Hopfield, J., 1995. Pattern recognition computation using action potential timing for stimulus representation. Nature 376, 33–36.
[33] Hu, G., Wang, D., 2002. Monaural speech segregation based on pitch tracking and amplitude modulation. Tech. rep., Ohio State University.
[34] Hu, G., Wang, D., 2003. Separation of stop consonants. In: ICASSP 2003, Hong Kong.
[35] Hu, G., Wang, D., 2004. Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans. on Neural Networks, to appear.
[36] Hunt, M. J., Lefebvre, C., 1987. Speech recognition using an auditory model with pitch-synchronous analysis. In: ICASSP. IEEE, New York, pp. 813–816.
[37] Hunt, M. J., Lefebvre, C., April 1988. Speaker dependent and independent speech recognition experiments with an auditory model. In: IEEE ICASSP. IEEE, New York, pp. 215–218.
[38] Hyvarinen, A., Karhunen, J., Oja, E., 2001. Independent Component Analysis. Wiley.

[39] Irino, T., Patterson, R., 2003. Speech segregation using event synchronous auditory vocoder. In: ICASSP. Vol. V. pp. 525–528.
[40] Irino, T., Unoki, M., May 1998. A time-varying, analysis/synthesis auditory filterbank using the gammachirp. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 98. Vol. 6. Seattle, Washington, pp. 3653–3656.
[41] Jang, G., Lee, T., 2003. Single-channel signal separation using time-domain basis functions. IEEE Signal Processing Letters, 168–171.
[42] Jang, G., Lee, T., 2003. A maximum likelihood approach to single-channel source separation. Journal of Machine Learning Research 4, 1365–1392.
[43] Kaneda, Y., Ohga, J., 1986. Adaptive microphone-array system for noise reduction. IEEE Trans. on Acoustics, Speech, and Signal Processing 34 (6), 1391–1400.
[44] Karlsen, B. L., Brown, G. J., Cooke, M., Crawford, M., Green, P., Renals, S., 1998. Analysis of a multi-simultaneous-speaker corpus. In: Computational Auditory Scene Analysis. L. Erlbaum.
[45] Kim, S., Frisina, D. R., Frisina, R. D., 2002. Effects of age on contralateral suppression of distortion product otoacoustic emissions in human listeners with normal hearing. Audiology Neuro Otology 7, 348–357.
[46] Kouwe, A. J. W., Wang, D. L., Brown, G. J., 2001. A comparison of auditory and blind separation techniques for speech segregation. IEEE Trans. on Speech and Audio Processing 9, 189–195.
[47] Kubin, G., Kleijn, W. B., March 1999. On speech coding in a perceptual domain. In: ICASSP. Vol. 1. Phoenix, Arizona, pp. 205–208.
[48] Levy, N., Horn, D., Meilijson, I., Ruppin, E., July 2001. Distributed synchrony in a cell assembly of spiking neurons. Neural Networks 14 (6–7), 815–824.
[49] Liberman, M., Puria, S., Guinan, J. J., 1996. The ipsilaterally evoked olivocochlear reflex causes rapid adaptation of the 2f1−f2 distortion product otoacoustic emission. JASA 99, 2572–3584.
[50] Maass, W., 1997. Networks of spiking neurons: The third generation of neural network models. Neural Networks 10 (9), 1659–1671.
[51] Maass, W., Bishop, C. M., 1998. Pulsed Neural Networks. MIT Press.
[52] v. d. Malsburg, C., 1981. The correlation theory of brain function. Tech. Rep. Internal Report 81-2, Max-Planck Institute for Biophysical Chemistry.
[53] v. d. Malsburg, C., 1999. The what and why of binding: The modeler's perspective. Neuron, 95–104.
[54] v. d. Malsburg, C., Schneider, W., 1986. A neural cocktail-party processor. Biol. Cybern., 29–40.
[55] Mellinger, D. K., Mont-Reynaud, B. M., 1996. Scene analysis. In: Hawkins, H., McMullen, T., Popper, A., Fay, R. (Eds.), Auditory Computation. Springer, New York, pp. 271–331.
[56] Meyer, G., Yang, D., Ainsworth, W., 2001. Applying a model of concurrent vowel segregation to real speech. In: Greenberg, S., Slaney, M. (Eds.), Computational Models of Auditory Function, pp. 297–310.
[57] Milner, P., 1974. A model for visual shape recognition. Psychological Review 81, 521–535.
[58] Nakadai, K., Okuno, H. G., Kitano, H., October 2002. Auditory fovea based speech separation and its application to dialog system. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 1314–1319.

[59] Natschlager, T., Maass, W., Zador, A., 2001. Efficient temporal processing with biologically realistic dynamic synapses. Network: Computation in Neural Systems 12 (1), 75–87.
[60] Nix, J., Kleinschmidt, M., Hohmann, V., September 2003. Computational auditory scene analysis by using statistics of high-dimensional speech dynamics and sound source direction. In: EUROSPEECH. pp. 1441–1444.
[61] Panchev, C., Wermter, S., July 2003. Spiking-time-dependent synaptic plasticity: From single spikes to spike trains. In: Computational Neuroscience Meeting. Springer-Verlag, pp. 494–506.
[62] Patterson, R., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C., Allerhand, M., 1992. Complex sounds and auditory images. In: Cazals, Y., Demany, L., Horner, K. (Eds.), Auditory Physiology and Perception. Pergamon Press, Oxford, pp. 429–446.
[63] Pena, J., Konishi, M., 2001. Auditory spatial receptive fields created by multiplication. Science 292, 249–252.
[64] Perrinet, L., 2003. Comment déchiffrer le code impulsionnel de la vision ? Étude du flux parallèle, asynchrone et épars dans le traitement visuel ultra-rapide. Ph.D. thesis, Université Paul Sabatier.
[65] Pichevar, R., 2004. http://www-edu.gel.usherbrooke.ca/pichevar/.
[66] Pichevar, R., Rouat, J., 2003. Cochleotopic/AMtopic (CAM) and Cochleotopic/Spectrotopic (CSM) map based sound source separation using relaxation oscillatory neurons. In: IEEE Neural Networks for Signal Processing Workshop, Toulouse, France.
[67] Pichevar, R., Rouat, J., Feldbauer, C., Kubin, G., 2004. A bio-inspired sound source separation technique in combination with an enhanced FIR gammatone analysis/synthesis filterbank. In: EUSIPCO, Vienna.
[68] Pichevar, R., Rouat, J., 2004. A quantitative evaluation of a bio-inspired sound segregation technique for two- and three-source mixtures. Lecture Notes in Computer Science (Springer-Verlag), vol. 3445, pp. 430–435.
[69] Plante, F., Meyer, G., Ainsworth, W., 1998. Improvement of speech spectrogram accuracy by the method of reassignment. IEEE Trans. on Speech and Audio Processing, 282–287.
[70] Popper, A. N., Fay, R. (Eds.), 1992. The Mammalian Auditory Pathway: Neurophysiology. Springer-Verlag.
[71] Potamitis, I., Tremoulis, G., Fakotakis, N., September 2003. Multi-speaker DOA tracking using interactive multiple models and probabilistic data association. In: EUROSPEECH. pp. 517–520.
[72] Reyes-Gomez, M. J., Raj, B., Ellis, D., 2003. Multi-channel source separation by factorial HMMs. In: ICASSP 2003.

[73] Rieke, F., Warland, D., de Ruyter van Steveninck, R., Bialek, W., 1997. Spikes: Exploring the Neural Code. MIT Press.
[74] Riesenhuber, M., Poggio, T., 1999. Are cortical models really bound by the binding problem? Neuron 24, 87–93.
[75] Roman, N., Wang, D., Brown, G., 2003. Speech segregation based on sound localization. JASA.
[76] Rosenthal, D. F., Okuno, H. G. (Eds.), 1998. Computational Auditory Scene Analysis. L. Erlbaum.
[77] Rouat, J., Liu, Y. C., Morissette, D., 1997. A pitch determination and voiced/unvoiced decision algorithm for noisy speech. Speech Comm. 21, 191–207.
[78] Rouat, J., Pichevar, R., 2005. Source separation with one ear: Proposition for an anthropomorphic approach. EURASIP Journal on Applied Signal Processing, no. 9, pp. 1365–1374.
[79] Rouat, J., 2005. http://www.gel.usherbrooke.ca/rouat/.
[80] Roweis, S., 2003. Factorial models and refiltering for speech separation and denoising. In: Eurospeech 2003.
[81] Roweis, S. T., 2000. One microphone source separation. In: NIPS, Denver, USA.
[82] Sameti, H., Sheikhzadeh, H., Deng, L., Brennan, R., 1998. HMM-based strategies for enhancement of speech signals embedded in nonstationary noise. IEEE Trans. on Speech and Audio Processing, 445–455.
[83] Schwartz, J. L., Escudier, P., 1989. Auditory processing in a post-cochlear neural network: Vowel spectrum processing based on spike synchrony. In: EUROSPEECH, pp. 247–253.
[84] Sejnowski, T. J., 1995. Time for a new neural code? Nature 376, 21–22.
[85] Seltzer, M. L., Raj, B., Stern, R. M., 2002. Speech recognizer-based microphone array processing for robust hands-free speech recognition. In: ICASSP. Vol. I. pp. 897–900.
[86] Speech Annotation and Corpus Tools, 2001. Speech Communication journal, vol. 33.
[87] Tang, P., Rouat, J., October 1996. Modeling neurons in the anteroventral cochlear nucleus for amplitude modulation (AM) processing: Application to speech sound. In: Proc. Int. Conf. on Spoken Language Processing. p. Th.P.2S2.2.
[88] Thorpe, S., Delorme, A., Rullen, R. V., 2001. Spike-based strategies for rapid processing. Neural Networks 14 (6–7), 715–725.
[89] Thorpe, S., Fize, D., Marlot, C., 1996. Speed of processing in the human visual system. Nature 381 (6582), 520–522.
[90] Todd, N., 1996. An auditory cortical theory of auditory stream segregation. Network: Computation in Neural Systems 7, 349–356.
[91] Valin, J.-M., Michaud, F., Rouat, J., Létourneau, D., October 2003. Robust sound source localization using a microphone array on a mobile robot. In: IEEE/RSJ Int. Conf. on Intelligent Robots and Systems.
[92] Valin, J.-M., Rouat, J., Michaud, F., 2004. Microphone array post-filter for separation of simultaneous non-stationary sources. In: IEEE Int. Conf. on Acoustics, Speech and Signal Processing.

[93] Wang, D., Brown, G. J., May 1999. Separation of speech from interfering sounds based on oscillatory correlation. IEEE Transactions on Neural Networks 10 (3), 684–697.
[94] Wang, D., Terman, D., 1997. Image segmentation based on oscillatory correlation. Neural Computation 9, 805–836.
[95] Weintraub, M., 1985. A theory and computational model of auditory monaural sound separation. Ph.D. thesis, Stanford University.
[96] Wermter, S., Austin, J., Willshaw, D., 2001. Emergent Neural Computational Architectures Based on Neuroscience: Towards Neuroscience-Inspired Computing. Springer-Verlag.
[97] Widrow, B., et al., 1975. Adaptive noise cancelling: Principles and applications. Proceedings of the IEEE 63 (12).
[98] Zotkin, D. N., Shamma, S. A., Ru, P., Duraiswami, R., Davis, L. S., 2003. Pitch and timbre manipulations using cortical representation of sound. In: ICASSP. Vol. V. pp. 517–520.