
Decoding speech in the presence of other sources

J.P. Barker a,*,1, M.P. Cooke a,1, D.P.W. Ellis b

a Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK
b Department of Electrical Engineering, Columbia University, 500 W. 120th Street, New York, NY 10027, USA

Received 7 June 2002; received in revised form 21 November 2003; accepted 20 May 2004

Abstract

The statistical theory of speech recognition introduced several decades ago has brought about low word error rates for clean speech. However, it has been less successful in noisy conditions. Since extraneous acoustic sources are present in virtually all everyday speech communication conditions, the failure of the speech recognition model to take noise into account is perhaps the most serious obstacle to the application of ASR technology.

Approaches to noise-robust speech recognition have traditionally taken one of two forms. One set of techniques attempts to estimate the noise and remove its effects from the target speech. While noise estimation can work in low-to-moderate levels of slowly varying noise, it fails completely in louder or more variable conditions. A second approach utilises noise models and attempts to decode speech taking into account their presence. Again, model-based techniques can work for simple noises, but they are computationally complex under realistic conditions and require models for all sources present in the signal.

In this paper, we propose a statistical theory of speech recognition in the presence of other acoustic sources. Unlike earlier model-based approaches, our framework makes no assumptions about the noise background, although it can exploit such information if it is available. It does not require models for background sources, or an estimate of their number. The new approach extends statistical ASR by introducing a segregation model in addition to the conventional acoustic and language models. While the conventional statistical ASR problem is to find the most likely sequence of speech models which generated a given observation sequence, the new approach additionally determines the most likely set of signal fragments which make up the speech signal. Although the framework is completely general, we provide one interpretation of the segregation model based on missing-data theory. We derive an efficient HMM decoder, which searches both across subword state and across alternative segregations of the signal between target and interference. We call this modified system the speech fragment decoder.

The value of the speech fragment decoder approach has been verified through experiments on small-vocabulary tasks in high-noise conditions. For instance, in a noise-corrupted connected digit task, the new approach decreases the word error rate in the condition of factory noise at 5 dB SNR from over 59% for a standard ASR system to less than 22%.
© 2004 Elsevier B.V. All rights reserved.

0167-6393/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2004.05.002

* Corresponding author. Tel.: +44 114 222 1824; fax: +44 114 222 1810. E-mail address: [email protected] (J.P. Barker).
1 Partially supported by EPSRC grant GR/47400/01.

Speech Communication 45 (2005) 5–25

www.elsevier.com/locate/specom

Keywords: Robust speech recognition; Signal separation; Missing data recognition; Computational auditory scene analysis; Acoustic mixtures

1. Introduction

In the real world, the speech signal is frequently accompanied by other sound sources on reaching the auditory system, yet listeners are capable of holding conversations in a wide range of listening conditions. Recognition of speech in such 'adverse' conditions has been a major thrust of research in speech technology in the last decade. Nevertheless, the state of the art remains primitive. Recent international evaluations of noise robustness have demonstrated technologically useful levels of performance for small vocabularies in moderate amounts of quasi-stationary noise (Pearce and Hirsch, 2000). Modest departures from such conditions lead to a rapid drop in recognition accuracy.

A key challenge, then, is to develop algorithms to recognise speech in the presence of arbitrary non-stationary sound sources. There are two broad categories of approaches to dealing with interference for which a stationarity assumption is inadequate. Source-driven techniques exploit evidence of a common origin for subsets of source components, while model-driven approaches utilise prior (or learned) representations of acoustic sources. Source-driven approaches include primitive auditory scene analysis (Brown and Cooke, 1994; Wang and Brown, 1999; see review in Cooke and Ellis, 2001) based on auditory models of pitch and location processing, independent component analysis and blind source separation (Bell and Sejnowski, 1995; Hyvarinen and Oja, 2000), which exploit statistical independence of sources, and mainstream signal processing approaches (Parsons, 1976; Denbigh and Zhao, 1992). The prime examples of model-driven techniques are HMM decomposition (Varga and Moore, 1990) and parallel model combination (PMC) (Gales and Young, 1993), which attempt to find model state sequence combinations which jointly explain the acoustic observations. Ellis's 'prediction-driven' approach (Ellis, 1996) can also be regarded as a technique influenced by prior expectations.

Pure source-driven approaches are typically used to produce a clean signal which is then fed to an unmodified recogniser. In real-world listening conditions, this segregate-then-recognise approach fails (see also the critique in Slaney, 1995), since it places too heavy a demand on the segregation algorithm to produce a signal suitable for recognition. Conventional recognisers are highly sensitive to the kinds of distortion resulting from poor separation. Further, while current algorithms do a reasonable job of separating periodic signals, they are less good both at dealing with the remaining portions and at extrapolating across unvoiced regions, especially when the noise background contains periodic sources. The problem of distortion can be solved using missing data (Cooke et al., 1994, 2001) or multiband (Bourlard and Dupont, 1997) techniques, but the issue of sequential integration across aperiodic intervals remains.

Pure model-driven techniques also fail in practice, due to their reliance on the existence of models for all sources present in a mixture, and the computational complexity of decoding multiple sources for anything other than sounds which possess a simple representation.

There is evidence that listeners too use a combination of source- and model-driven processes (Bregman, 1990). For instance, vowel pairs presented concurrently on the same fundamental can be recognised at levels well above chance, indicating the influence of top-down model-matching behaviour, but even small differences in fundamental (which create a source-level cue) lead to significant improvements in identification, indicating that the model-driven search is able efficiently to exploit the added low-level information (Scheffers, 1983). Similarly, when the first three speech formants are replaced by sinusoids, listeners recognise the resulting sine-wave speech at levels approaching natural speech, generally taken as evidence of a purely top-down speech recognition mechanism, since the tokens bear very little resemblance to speech at the signal level (Bailey et al., 1977; Remez et al., 1981). However, when presented with a sine-wave cocktail party consisting of a pair of simultaneous sine-wave sentences, performance falls far below the equivalent natural speech sentence-pair condition, showing that low-level signal cues are required for this more demanding condition (Barker and Cooke, 1999).

In this paper, we present a framework which attempts to integrate source- and model-driven processes in robust speech recognition. We demonstrate how the decoding problem in ASR can be extended to incorporate decisions about which regions belong to the target signal. Unlike pure source-driven approaches, the integrated decoder does not require a single hard-and-fast prior segregation of the entire target signal, and, in contrast to pure model-based techniques, it does not assume the existence of models for all sources present. Since it is an extension of conventional speech decoders, it maintains all of the advantages of the prevailing stochastic framework for ASR by delaying decisions until all relevant evidence has been observed. Furthermore, it allows a tradeoff between the level of detail derived from source-driven processing and decoding speed.

Fig. 1 motivates the new approach. The upper panel shows an auditory spectrogram of the utterance "two five two eight three" spoken by a male speaker mixed with drum beats at a global SNR of 0 dB. The centre panel segments the time-frequency plane into regions which are dominated (in the sense of possessing a locally-favourable SNR) by one or other source. The correct assignment of regions to the two sources is shown in the lower panel.

In outline, our new formalism defines an admissible search over all combinations of regions (which we call fragments) to generate the most likely word sequence (or, more generally, sequence of source models). This is achieved by decomposing the likelihood calculation into three parts: in addition to the conventional language model term, we introduce a segregation model, which defines how fragments are formed, and a modified acoustic model, which links the observed acoustics to source models acquired during training.

Section 2 develops the new formalism, and shows how the segregation model and partial acoustic model can be implemented in practice. Section 3 demonstrates the performance of the resulting decoder applied to digit strings with added noise. Section 4 discusses issues that have arisen with the current decoder implementation and future research directions.

2. Theoretical development

The simultaneous segregation/recognition approach can be formulated as an extension of the existing speech recognition theory. When formulated in a statistical manner, the goal of the speech recogniser is traditionally stated as to find the word sequence \hat{W} = w_1, w_2, \ldots, w_N with the maximum a posteriori probability given the sequence of acoustic feature vectors observed for the speech, X = x_1, x_2, \ldots, x_T:

\hat{W} = \arg\max_W P(W|X).    (1)

This equation is rearranged using Bayes' rule into

\hat{W} = \arg\max_W \frac{P(X|W) P(W)}{P(X)},    (2)

which separates the prior probability of the word sequence alone, P(W) (the language model), the distribution of the speech features for a particular utterance, P(X|W) (the acoustic model), and the prior probability of those features, P(X) (which is constant over W and thus will not influence the outcome of the argmax). P(W) may be trained from the word sequences in a large text corpus, and P(X|W) is learned by modelling the distribution of actual speech features associated with particular sounds in a speech training corpus.

Following our considerations above, we may restate this goal as finding the word sequence, \hat{W}, along with the speech/background segregation, \hat{S}, which jointly have the maximum posterior probability. Further, because the observed features are no longer purely related to speech but in general include the interfering acoustic sources, we will denote them as Y to differentiate them from the X used in our speech-trained acoustic models P(X|W):^2

(\hat{W}, \hat{S}) = \arg\max_{W,S} P(W, S|Y).    (3)

To reintroduce the speech features X, which are now an unobserved random variable, we integrate the probability over their possible values, and decompose with the chain rule to separate out P(S|Y), the probability of the segregation based on the observations:

P(W, S|Y) = \int P(W, X, S|Y) \, dX    (4)

          = \int P(W|X, S, Y) P(X|S, Y) \, dX \cdot P(S|Y).    (5)

Since W is independent of S and Y given X, the first probability simplifies to P(W|X). As in the standard derivation, we can rearrange it via Bayes' rule to obtain a formulation in terms of our trained distribution models P(W) and P(X|W):

P(W, S|Y) = \int \frac{P(X|W) P(W)}{P(X)} P(X|S, Y) \, dX \cdot P(S|Y)    (6)

          = P(W) \left[ \int P(X|W) \frac{P(X|S, Y)}{P(X)} \, dX \right] P(S|Y).    (7)

Note that because X is no longer constant, we cannot drop P(X) from the integral.

In the case of recognition with hidden Markov models (HMMs), the conventional derivation introduces an unobserved state sequence Q = q_1, q_2, \ldots, q_T along with models for the joint probability of word sequence and state sequence, P(W, Q) = P(Q|W) P(W). The Markovian assumptions include making the feature vector x_i at time i depend only on the corresponding state q_i, so that P(X|Q) = \prod_i P(x_i|q_i). The total likelihood of a particular W over all possible state sequences is normally approximated by the score over the single most-likely state sequence (the Viterbi path). In our case, this gives

(\hat{W}, \hat{S}) = \arg\max_{W,S} \max_{Q \in \mathcal{Q}_W} P(S|Y) P(W) P(Q|W) \int P(X|Q) \frac{P(X|S, Y)}{P(X)} \, dX,    (8)

where \mathcal{Q}_W represents the set of all allowable state sequences corresponding to word sequence W.

2 Note, if we were not interested in the speech/background segregation but only in the most likely word sequence regardless of the actual segregation, then it would be more correct to integrate Eq. (3) over the segregation space, defining W' = \arg\max_W \sum_S P(W, S|Y). However, this integration presents some computational complexity, so in practice, even if we were not directly interested in the segregation, it may be desirable to implement Eq. (3) directly and take \hat{W} as an approximation of W'.

Fig. 1. The top panel shows the auditory spectrogram of the utterance "two five two eight three" spoken by a male speaker mixed with drum beats at 0 dB SNR. The lower panel shows the correct segregation of speech energy (black) and drums energy (grey). The centre panel illustrates the set of fragments generated using knowledge of the speech source and the noise source prior to mixing.

Compare Eq. (8) to the corresponding equation for identifying the word sequence in a conventional speech recogniser:

\hat{W} = \arg\max_W \max_{Q \in \mathcal{Q}_W} P(W) P(Q|W) P(X|Q).    (9)

It can be seen that there are three significant differences:

(1) A new term, P(S|Y), has been introduced. This is the 'segregation model', describing the probability of a particular segregation S given our actual observations Y, but independent of the word hypothesis W: precisely the kind of information we expect to obtain from a model of source-driven, low-level acoustic organisation.

(2) The acoustic model score P(X|Q) is now evaluated over a range of possible values for X, weighted by their relative likelihood given the observed signal Y and the particular choice of segregation mask S. This is closely related to previous work on missing data theory, and is discussed in more detail in Section 2.3.

(3) The maximisation now occurs over both W and S. Whereas conventional speech recognition searches over the space of word sequences, the extended approach has to simultaneously search over the space of all admissible segregations.

In the terms of Bregman's 'Auditory Scene Analysis' account (Bregman, 1990), the segregation model may be identified as embodying the so-called 'primitive grouping process', and the acoustic model plays the part of the 'schema-driven grouping process'. Eq. (8) serves to integrate these two complementary processes within the probabilistic framework of ASR. The maximisation over W and S can be achieved by extending the search techniques employed by traditional ASR. These three key aspects of the work, namely the segregation model, the acoustic model and the search problem, are addressed in greater detail in the sections which follow (Fig. 2).

2.1. The segregation model

Consider the space of potential speech/background segregations. An acoustic observation vector, X, may be constructed as a sequence of frames x_1, x_2, \ldots, x_T where each frame is composed of observations pertaining to a series of, say, F frequency channels. The observation vector is therefore composed of T \times F spectro-temporal features. A speech/background segregation may be conveniently described by a binary mask in which the label '1' is employed to signify that the feature belongs to the speech source, and a '0' to signify that the feature belongs to the background. As this binary mask has T \times F elements, it can be seen that there are 2^{TF} possible speech/background segregations. So, for example, at a typical frame rate of 100 Hz, and with a feature vector employing 32 frequency channels, there would be 2^{3200} possible segregations for a one second audio sample.

Fig. 2. An overview of the speech fragment decoding equation, relating the search algorithm (e.g. a modified decoder), the segregation model (source-level grouping processes), the language model (bigrams, dictionary), the acoustic model (schema-driven processes) and the segregation weighting (the connection to the observations).

Fortunately, most of these segregations can be ruled out immediately as being highly unlikely, and the size of the search space can be drastically reduced. The key to this reduction is to identify spectro-temporal regions for which there is strong evidence that all the spectro-temporal pixels contained are dominated by the same sound source. Such regions constrain the spectro-temporal pixels contained to share the same speech/background label. Hence, for each permissible speech/background segregation, the pixels within any given fragment must either all be labelled as speech (meaning that the fragment is part of the speech source) or must all be labelled as background (meaning that the fragment is part of some other source). Consequently, if the spectro-temporal observation vector can be decomposed into N such fragments, there will be 2^N separate ways of labelling the fragments and hence only 2^N valid segregations. In general each fragment will contain many spectro-temporal pixels, and 2^N will be vastly smaller than the size of the unconstrained segmentation search space, 2^{TF}.

The success of the segregation model depends on being able to identify a reliable set of coherent fragments. The process of dissecting the representation into fragments is similar to the process that occurs in visual scene analysis. The first stage of interpreting a visual scene is to locate regions within the scene that are components of larger objects. For this purpose all manner of primitive processes may be employed: edge detection, continuity, uniformity of colour, uniformity of texture, etc. Analogous processes may be used in the analysis of the auditory 'scene': for example, spectro-temporal elements may be grouped if they form continuous tracks (i.e. akin to visual edge detection), tracks may be grouped if they lie in harmonic relation, and energy regions may be grouped across frequency if they onset or offset at the same time. Fig. 3 illustrates some of the mechanisms that may be used to bind spectro-temporal regions to recover partial descriptions of the individual sound sources. A detailed account of these so-called 'primitive grouping processes' is given in (Bregman, 1990).

In the experiments that follow, each of the 2^N valid segregations is allocated an equal prior probability. This stands as a reasonable first approximation. However, a more detailed segregation model could be constructed in which the segregation priors vary across segregations. Such a model would take into account factors like the relationship between the individual fragments of which they are composed. For example, if there are two fragments which cover spectro-temporal regions in which the acoustic data is periodic and has the same fundamental frequency, then these two fragments are likely to be parts of the same sound source, and hence segregations in which they are labelled as either both speech or both background should be favoured. Section 4.3 discusses further such 'between-fragment grouping' effects and the modifications to the search algorithm that they require.
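To make the combinatorics concrete, the following minimal Python sketch (illustrative only; the fragment representation and function names are our assumptions, not the authors' code) builds a T x F binary speech/background mask from a labelling of N fragments and enumerates the 2^N valid segregations.

import itertools
import numpy as np

def mask_from_labelling(fragments, labels, T, F):
    # Build a T x F binary mask: 1 = speech, 0 = background.
    # `fragments` is a list of N boolean arrays of shape (T, F), one per
    # coherent fragment; `labels` gives a 0/1 label per fragment.  All
    # pixels of a fragment share its label, so only 2**N segregations
    # are possible rather than 2**(T*F).
    mask = np.zeros((T, F), dtype=int)
    for frag, lab in zip(fragments, labels):
        if lab:
            mask[frag] = 1
    return mask

def all_segregations(fragments, T, F):
    # Enumerate every valid speech/background segregation mask.
    for labels in itertools.product([0, 1], repeat=len(fragments)):
        yield labels, mask_from_labelling(fragments, labels, T, F)

With, say, N = 10 fragments this enumerates only 1024 masks rather than the astronomically many unconstrained pixel labellings; Section 2.2 describes how even this enumeration can be avoided.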

2.2. The search problem

The task of the extended decoder is to find the most probable word sequence and segregation given the search space of all possible word sequences and all possible segregations. Given that the acoustic match score,

P(X|Q) \, P(X|S, Y) / P(X),    (10)

is conditioned both on the segregation S and the subword state Q, the (S, Q) search space cannot in general be decomposed into independent searches over S and Q. Since the size of the S space expands the overall search space, it is imperative that the search in the plane of the segregation space is conducted as efficiently as possible.

To illustrate this point, imagine a naive implementation of the search, illustrated in Fig. 4. In this approach, each segregation hypothesis is considered independently, and therefore requires a separate word sequence search. If the segregation model has identified N coherent fragments, then there will be 2^N segregation hypotheses to consider. Hence, the total computation required for the decoding will scale exponentially with the number of fragments. The total number of fragments is likely to be a linear function of the duration of the acoustic mixture being processed, therefore the computation required will be an exponential function of this duration. For sufficiently large vocabularies, the cost of decoding the word sequence typically makes up the greater part of the total computational cost of ASR. It is clear that the naive implementation of the word sequence/segregation search is unacceptable unless the total number of fragments is very small.
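The cost of this naive scheme can be made explicit with a short sketch. The code below is illustrative only: it assumes the mask-building helper from the earlier sketch and a hypothetical function viterbi_decode(Y, mask) that runs a standard masked word-sequence search and returns a score and word string; neither is part of the original system.

from itertools import product

def brute_force_fragment_decode(fragments, Y, viterbi_decode):
    # Naive segregation/word-sequence search (Fig. 4): one full Viterbi
    # pass per segregation hypothesis, so cost grows as 2**len(fragments).
    best_score, best_words, best_labels = float('-inf'), None, None
    T, F = Y.shape
    for labels in product([0, 1], repeat=len(fragments)):
        mask = mask_from_labelling(fragments, labels, T, F)
        score, words = viterbi_decode(Y, mask)
        if score > best_score:
            best_score, best_words, best_labels = score, words, labels
    return best_score, best_words, best_labels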

The key to constructing an efficient implementation of the search is to take advantage of similarities that exist between pairs of segregation hypotheses. Consider the full set of possible segregations. There is a unique segregation for every possible assignment of speech/background labelling to the set of fragments. For any given pair of hypotheses, some fragments will have the same label. In particular, some hypotheses will differ only in the labelling of a single fragment. For such pairs, the speech/background segregation will be identical up to the time frame where the differing fragment onsets, and identical again from the frame where the fragment offsets. The brute-force search performs two independent word sequence searches for two such similar segregation hypotheses (see Fig. 5, column 1). The computational cost of these two independent searches may be reduced by allowing them to share processing up to the time frame where the segregation hypotheses differ, i.e. the onset of the fragment that is labelled differently in each hypothesis, marked as time T1 in column 2 of Fig. 5. This sharing of computation between pairs of segregation hypotheses can be generalised to encompass all segregation hypotheses by arranging them in a graph structure. As we progress through time, new fragment onsets cause all current segregation hypotheses to branch, forming two complementary sets of paths. In one set, the onsetting fragment is considered to be speech while in the other it is considered to be background. However, although this arrangement saves some computation, the number of segregation hypotheses under consideration at any particular frame still grows exponentially with time. This exponential growth may be prevented by noting that segregation hypotheses will become identical again after the offset of the last fragment by which they differ (marked as time T2 in column 3 of Fig. 5). At this point, the two competing segregation hypotheses can be compared and the less likely of the pair can be rejected without affecting the admissibility of the search. Again, this step can be generalised to encompass all segregation hypotheses and effectively brings together the branches of the diverging segregation hypothesis tree.

Fig. 3. An illustration of short-term (above) and long-term (below) primitive grouping cues which may be exploited to recover partial descriptions of individual sound sources. The figure shows a time-frequency representation of two simultaneous speech utterances. Regions where the energy of one source dominates are shown in dark grey, while those of the other source are shown in light grey.

Fig. 6 illustrates the evolution of a set of parallel segregation hypotheses while processing a segment of noisy speech which has been dissected into three fragments (shown schematically by the shaded regions in the figure). When the first fragment (white) commences, two segregation hypotheses are formed. In one hypothesis, the white fragment is labelled as speech, while in the other it is assigned to the background. When the grey fragment starts, all ongoing hypotheses are again split, with each pair covering both possible labellings for the grey fragment. When the white fragment ends, pairs of hypotheses are merged if their labelling only differs with regard to the white fragment. This pattern of splitting and merging continues until the end of the utterance. Note that at any instant there are at most four active segregation hypotheses, not the eight required to consider every possible labelling of each of the three fragments.

Fig. 4. The figure illustrates a naive implementation of the segregation/word-sequence search. From a set of N noise and speech fragments, 2^N speech/noise segregation hypotheses can be generated. It is then possible to search for the best word sequence given each of these 2^N segregation hypotheses. The overall best hypothesis can then be found by comparing the scores of these independent searches.

It is important to understand that the evolution of segregation hypotheses is dependent on the word sequence hypothesis. For each ongoing word sequence being considered by the decoder, a particular corresponding optimal segregation is simultaneously developed.

If the word sequence is modelled using HMMs, then the segregation/word-sequence decoder can be implemented by extending the token-passing Viterbi algorithm employed in conventional ASR:

Fig. 5. The efficient segregation search exploits the fact that competing segregation hypotheses only differ over a limited number of frames. (Column 1: independent decodings for each speech/background hypothesis; column 2: decodings are split at time T1 when the speech/background hypotheses diverge; column 3: decodings are merged at time T2 when the speech/background hypotheses converge.)

Fig. 6. The evolution of a set of segregation hypotheses. Each parallel path represents a separate hypothesis, with the shaded dots indicating which ongoing fragments are being considered as speech (part of the speech source).

• Tokens keep a record of the fragment assignments they have made, i.e. each token stores its labelling of each fragment encountered as either speech or background.

• Splitting: When a new fragment starts, all existing tokens are duplicated. In one copy the new fragment is labelled as speech and in the other it is labelled as background.

• Merging: When a fragment ends, then for each state we compare tokens that differ only in the label of the fragment that is ending. The less likely token or tokens are deleted.

• At each time frame, tokens propagate through the HMM as usual. However, each state can hold as many tokens as there are different labellings of the currently active fragments. When tokens enter a state, only those with the same labelling of the currently active fragments are directly compared. The token with the highest likelihood score survives and the others are deleted.

It should be stressed that the deletion of tokens in the 'merging' step described above does not affect the admissibility of the search (i.e. it is not a form of hypothesis pruning). The efficient algorithm will return an identical result to that of a brute-force approach, which separately considers every word-sequence/segregation hypothesis. This is true as long as the Markov assumption remains valid. In the context of the above algorithm this means that the future of a partial hypothesis must be independent of its past. This places some constraints on the form of the segregation model. For example, the Markov assumption may break down if the segregation model contains between-fragment grouping effects in which the future scoring of a partial hypothesis may depend on which groups it has previously interpreted as part of the speech source. In this case the admissibility of the search can be preserved by imposing extra constraints on the hypothesis merging condition. This point is discussed further in Section 4.3.
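The split/merge bookkeeping described above can be sketched as follows. This is a minimal illustration under our own assumptions (the Token structure, the frozenset key and the helper names are not the authors' implementation); it shows only the segregation-label handling, since tokens would additionally carry the usual Viterbi state history, and the merge is applied within each HMM state.

from dataclasses import dataclass, field

@dataclass
class Token:
    logp: float                                   # accumulated log score
    labels: dict = field(default_factory=dict)    # fragment id -> 'speech' or 'background'

def split_tokens(tokens, frag_id):
    # When fragment frag_id onsets, duplicate every token: one copy
    # labels the new fragment as speech, the other as background.
    out = []
    for t in tokens:
        for lab in ('speech', 'background'):
            out.append(Token(t.logp, {**t.labels, frag_id: lab}))
    return out

def merge_tokens(tokens, frag_id, active_ids):
    # When fragment frag_id offsets, tokens (within one HMM state) that
    # agree on the labels of all other still-active fragments compete,
    # and only the most likely survives.  Provided the Markov assumption
    # holds, this is admissible rather than a pruning heuristic.
    best = {}
    for t in tokens:
        key = frozenset((f, t.labels[f]) for f in active_ids if f != frag_id)
        if key not in best or t.logp > best[key].logp:
            best[key] = t
    return list(best.values())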

2.3. The acoustic model

In Eq. (8), the acoustic model data likelihood P(X|Q) of a conventional speech recogniser is replaced by an integral over the partially-observed speech features X, weighted by a term conditioned on the observed signal features Y and the segregation hypothesis S:

\int P(X|Q) \frac{P(X|S, Y)}{P(X)} \, dX,    (11)

where P(X|Q) is the feature distribution model of a conventional recogniser trained on clean speech, and P(X|S, Y)/P(X) is a likelihood weighting factor introducing the influence of the particular (noisy) observations Y and the assumed segregation S.

The integral over the entire space of X (the full multidimensional feature space at every time step) is clearly impractical. Fortunately, it can be broken down into factors. Firstly, the Markov assumption of independent emissions given the state sequence allows us to express the likelihood of the sequence as the product of the likelihoods at each time step i:^3

\int P(X|Q) \frac{P(X|S, Y)}{P(X)} \, dX = \prod_i \int P(x_i|q_i) \frac{P(x_i|S, Y)}{P(x_i)} \, dx_i.    (12)

Secondly, in a continuous-density (CDHMM) system, P(x|q) is modelled as a mixture of M multivariate Gaussians, usually each with a diagonal covariance matrix:

P(x|q) = \sum_{k=1}^{M} P(k|q) P(x|k, q),    (13)

where the P(k|q) are the mixing coefficients. Since the individual dimensions of a diagonal-covariance Gaussian are independent, we can further factorise the likelihood over the feature vector elements x_j:

P(x|q) = \sum_{k=1}^{M} P(k|q) \prod_j P(x_j|k, q).    (14)

3 This also assumes independence of each time step for the prior P(X) and for the likelihood of X given the segregation hypothesis and observations, P(X|S, Y). Both these assumptions are open to serious question, and we return to them in Section 4.


Assuming a similar decomposition of the prior P(X), we can take the integral of Eq. (12) inside the summation to give

\int P(x|q) \frac{P(x|S, Y)}{P(x)} \, dx = \sum_{k=1}^{M} P(k|q) \prod_j \int P(x_j|k, q) \frac{P(x_j|S, Y)}{P(x_j)} \, dx_j,    (15)

where P(x_j|k, q) is now a simple unidimensional Gaussian.

We can consider the factor

P(x_j|S, Y) / P(x_j)    (16)

as the 'segregation weighting': the factor by which the prior probability of a particular value for the speech feature is modified in light of the segregation mask and the observed signal. Since we are working with models of subband spectral energy, we can use a technique closely related to the missing-data idea of bounded integration (Cooke et al., 2001). For subbands that are judged to be dominated by speech energy (i.e., under the segregation hypothesis S, not one of the 'masked' channels), the corresponding feature value x_j can be calculated directly^4 from the observed signal Y and hence the segregation weighting will be a Dirac delta at the calculated value, x*:

P(x_j|S, Y) = \delta(x_j - x^*),    (17)

\int P(x_j|k, q) \frac{P(x_j|S, Y)}{P(x_j)} \, dx_j = P(x^*|k, q) / P(x^*).    (18)

The other case is that the subband corresponding to x is regarded as masked under the segregation hypothesis. We can still calculate the spectral energy x* for that band, but now we assume that this level describes the masking signal, and the speech feature is at some unknown value smaller than this. In this case, we can model P(x|S, Y) as proportional to the prior P(x) for x ≤ x*, and zero for x > x*. Thus,

P(x_j|S, Y) = \begin{cases} F \cdot P(x_j), & x_j \le x^*, \\ 0, & x_j > x^*, \end{cases}    (19)

\int P(x_j|k, q) \frac{P(x_j|S, Y)}{P(x_j)} \, dx_j = \int_{-\infty}^{x^*} P(x_j|k, q) \cdot F \, dx_j,    (20)

where F is a normalisation constant that keeps the truncated distribution a true pdf, i.e.

F = \frac{1}{\int_{-\infty}^{x^*} P(x_j) \, dx_j}.    (21)

In Eq. (20), the likelihood gets smaller as more of the probability mass associated with a particular state lies in the range precluded by the masking level upper bound; it models the 'counterevidence' (Cunningham and Cooke, 1999) against a particular state. For example, given a low x*, the quieter states will score better than more energetic ones. Since the elemental distributions P(x_j|k, q) are simple Gaussians, each integral is evaluated using the standard error function.

Both the scaling factor F in Eq. (20) and the evaluation of the point likelihood in Eq. (18) require a value for the speech feature prior P(x_j). In the results reported below we have made the very simple assumption of a uniform prior on our cube-root compressed energy values between zero and some fixed maximum x_max, constant across all feature elements and intended to be larger than any actual observed value. This makes the prior likelihood P(x_j) equal a constant 1/x_max and F = x_max/x* \propto 1/x*.

Using Eq. (18) for the unmasked dimensions and Eq. (20) for the masked dimensions, we can evaluate the acoustic data likelihood (or 'acoustic match score') for a single state at a particular time slice with Eq. (15), which becomes:

4 The observed signal Y will in general be a richer representation than simply the subband energies that would have formed x in the noise-free case, since it may include information such as spectral fine structure used to calculate pitch cues used in low-level segregation models, etc. However, the information in x will be completely defined given Y in the case of a segregation hypothesis that rates the whole spectrum as unmasked for that time slice.


\int P(x|q) \frac{P(x|S, Y)}{P(x)} \, dx = \sum_{k=1}^{M} P(k|q) \prod_{j \in S_O} P(x_j^*|k, q) \cdot x_{max} \times \prod_{j \in S_M} \int_{-\infty}^{x_j^*} P(x_j|k, q) \cdot \frac{x_{max}}{x_j^*} \, dx_j,    (22)

where S_O is the set of directly observed (not masked) dimensions of x, S_M are the remaining, masked, dimensions, and x_j^* is the observed spectral energy level for a particular band j. This per-time likelihood can then be combined across all time slices using Eq. (12) to give the data likelihood for an entire sequence.

In practice it has been observed that Eq. (22) exhibits a bias towards favouring hypotheses in which too many fragments have been labelled as background, or alternatively towards hypotheses in which too many fragments have been labelled as speech. The reasons for this bias are presently unclear, but one possibility is that it is introduced by the uniform prior employed for P(x_j). As an approximate solution to this problem, the results of the integrations across the masked dimensions are scaled by a tuning parameter α, shifting the relative likelihood of the missing and present dimensions. Giving α a high value tunes the decoder toward favouring hypotheses in which more fragments are labelled as background, while a low value favours hypotheses in which more fragments are labelled as speech. Experience has shown that the appropriate value of α depends largely on the nature of the fragments (i.e. the segregation model) and little on the noise type or noise level. Hence, it is easy to tune the system empirically using a small development data set.
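As a concrete illustration, the per-frame acoustic match of Eq. (22), including the scaling of the masked-dimension integrals by α, could be computed as in the Python sketch below. This is a sketch under our own assumptions (array-based GMM parameters, the uniform prior on [0, x_max], and a reading of 'scaled by α' as a simple multiplication), not the authors' implementation.

import numpy as np
from scipy.stats import norm

def acoustic_match(x_star, speech_mask, w, mu, sigma, x_max, alpha=0.3):
    # Per-frame acoustic match score (cf. Eq. (22)) for one HMM state.
    #   x_star      observed spectral energies, shape (F,), assumed > 0
    #   speech_mask boolean (F,), True where the band is labelled as speech
    #   w           GMM mixture weights, shape (M,)
    #   mu, sigma   per-component means and std devs, shape (M, F)
    #   x_max       upper limit of the uniform feature prior
    #   alpha       tuning parameter balancing masked and observed dimensions
    total = 0.0
    for k in range(len(w)):
        # Directly observed (speech-dominated) bands: P(x*|k,q) * x_max
        obs = norm.pdf(x_star[speech_mask],
                       mu[k, speech_mask], sigma[k, speech_mask]) * x_max
        # Masked bands: bounded integral up to x*, times F = x_max / x*,
        # then scaled by alpha
        cdf = norm.cdf(x_star[~speech_mask],
                       mu[k, ~speech_mask], sigma[k, ~speech_mask])
        masked = alpha * cdf * x_max / x_star[~speech_mask]
        total += w[k] * np.prod(obs) * np.prod(masked)
    return total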

Finally, it is instructive to compare the speech fragment decoding approach being proposed here with the missing data approach proposed in earlier work (Cooke et al., 1994, 2001). Basic missing data recognition consists of two separate steps performed in sequence: first, a 'present-data' mask is calculated, based, for instance, on estimates of the background noise level. Second, missing data recognition is performed by searching for the most likely speech model sequence consistent with this evidence. By contrast, the speech fragment decoding approach integrates these two steps, so that the search includes building the present-data mask to find the subset of features most likely to correspond to a single voice, while simultaneously building the corresponding most likely word sequence (Fig. 7).

2.4. Illustrative example

Fig. 8(A) shows the spectrogram of the utterance "seven five", to which a stationary background noise and a series of broadband high-energy noise bursts have been added (panel B). The initial frames of the signal can be employed to estimate and identify the stationary noise component, leaving the unmasked speech energy and the non-stationary noise bursts as candidate 'present data', as shown in panel C. This, however, must be broken up into a set of fragments to permit searching by the speech fragment decoder.

Fig. 7. An overview of the speech fragment decoding system. Bottom-up processes are employed to locate 'coherent fragments' (regions of the representation that are due entirely to one source), and then a top-down search with access to speech models is used to search for the most likely combination of fragment labelling and speech model sequence.

In order to confirm that the top-down process in the decoder is able to identify the valid speech fragments, its performance was tested using a small set of 'ideal' coherent fragments. These can be generated by applying a priori knowledge of the clean speech, i.e. comparing the clean and noisy spectrograms to mark out the exact regions where either the speech or the noise bursts dominate. The ideal fragments are simply the contiguous regions which are formed by this segregation process (see panel D of Fig. 8).

Fig. 8. An example of the speech fragment decoder's operation on a single noisy utterance: Panel A shows a spectrogram of the utterance "seven five". Panel B shows the same signal after adding a two-state noise source. Panel C shows the components of the mixture that are not accounted for by the adaptive background noise model. Panel D displays a test set of perfectly coherent fragments generated using a priori knowledge of the clean signal. Panel E shows the groups that the speech fragment decoder identifies as being speech groups. The correct assignment is shown in panel F. Panel G plots the number of grouping hypotheses being considered at each time frame.

Given these fragments, the decoder is able to correctly recognise the utterance as "seven five", using the fragments in panel E as evidence of the speech. The correct speech/noise fragment labelling is shown in panel F. Comparing E and F, it can be seen that the decoder has accepted all the speech fragments, while correctly rejecting all the larger fragments of noise. (Some small noise regions have been included in the speech, implying their level was consistent with the speech models.)

3. Experiments employing SNR-based fragments

The first set of experiments employs a connected digit recognition task and compares the performance of the speech fragment decoding technique with that of previously reported missing data techniques in which the speech/background segregation is effectively decided before proceeding with recognition (Cooke et al., 2001). The segregation model employed has been kept extremely simple. The coherent fragments are approximated directly from the acoustic mixture by using a simple noise estimation technique. The techniques presented here serve as a useful baseline against which the performance of more sophisticated segregation models can be compared.

3.1. Procedure

3.1.1. Feature vectors

The experiments in this section employ TIDigits utterances (Leonard, 1984) mixed with NOISEX factory noise (Varga et al., 1992) at various SNRs. NOISEX factory noise has a stationary background component but also highly unpredictable components, such as hammer blows, which make it particularly disruptive for recognisers.

To produce the acoustic feature vectors, the noisy mixtures were first processed with a 24-channel auditory filterbank (Cooke, 1991) with centre frequencies spaced linearly in ERB-rate from 50 to 8000 Hz. The instantaneous Hilbert envelope at the output of each filter was smoothed with a first-order filter with an 8 ms time constant, and sampled at a frame rate of 10 ms. Finally, cube-root compression was applied to the energy values.

This forms a spectro-temporal sound energy representation that is suitable for segregation. This representation will henceforth be referred to as an 'auditory spectrogram'.
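The front end just described could be approximated with the following Python sketch. It is not the authors' code: the ERB-spaced gammatone filterbank is assumed to be computed elsewhere (only its band-passed outputs are taken as input), and the envelope smoothing, 10 ms frame sampling and cube-root compression steps shown here are our own reading of the description above.

import numpy as np
from scipy.signal import hilbert, lfilter

def auditory_spectrogram(band_signals, fs, hop_ms=10.0, tau_ms=8.0):
    # Cube-root compressed subband energy features ('auditory spectrogram').
    # band_signals: array (n_channels, n_samples) of filterbank outputs,
    #               e.g. from a 24-channel ERB-spaced gammatone filterbank.
    env = np.abs(hilbert(band_signals, axis=1))        # instantaneous Hilbert envelope
    a = np.exp(-1.0 / (fs * tau_ms / 1000.0))          # first-order smoother, 8 ms constant
    smoothed = lfilter([1.0 - a], [1.0, -a], env, axis=1)
    hop = int(round(fs * hop_ms / 1000.0))             # sample every 10 ms
    frames = smoothed[:, ::hop]
    return np.cbrt(frames).T                           # shape (T, F)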

3.1.2. Fragments

The fragments were generated by the following steps:

(1) For each noisy utterance, the first 10 frames of the auditory spectrogram are averaged to estimate a stationary noise spectrum.^5

(2) The noise spectrum estimate is used to estimate the local SNR for each frame and frequency channel of the noisy utterance.

(3) The spectro-temporal region where the local SNR is above 0 dB is identified. This provides a rough approximation of the speech/background segregation.

If the additive noise source were stationary, then the first three steps would provide the correct speech/background segregation and the speech fragment decoder technique would not be needed. However, if the competing noise source is non-stationary, then some of the regions that are identified as speech will in fact be due to the noise. Hence, we now proceed with the following steps, which allow the speech fragment decoder technique to improve on the recognition result that would have been achieved if we had used the initial approximation to the speech/background segregation.

(4) The initial approximation of the speech segment is dissected by first dividing it into four frequency bands.

(5) Each contiguous region within each of the four subbands is defined to be a separate fragment.

(6) The set of fragments and the noisy speech representation are passed to the speech fragment decoder.

5 This technique assumes that there is a delay before the speech source starts, and hence the first frames provide a reliable measure of the noise background.

The fragmentation process is summarised in Fig. 9.
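A minimal Python sketch of steps (1)-(6) is given below, taking the auditory spectrogram of Section 3.1.1 as input. The equal-width four-band split and the use of scipy.ndimage.label to find contiguous regions are our own choices for illustration, not necessarily those of the original system.

import numpy as np
from scipy.ndimage import label

def snr_based_fragments(spec, n_noise_frames=10, n_bands=4):
    # Generate fragments from a cube-root compressed auditory spectrogram.
    # spec: array (T, F).  Returns a list of boolean (T, F) fragment masks.
    energy = spec ** 3                                          # undo cube-root compression
    noise = energy[:n_noise_frames].mean(axis=0)                # (1) stationary noise estimate
    snr = 10.0 * np.log10(np.maximum(energy, 1e-10)
                          / np.maximum(noise, 1e-10))           # (2) local SNR per pixel
    present = snr > 0.0                                         # (3) rough speech/background mask
    fragments = []
    F = spec.shape[1]
    edges = np.linspace(0, F, n_bands + 1).astype(int)          # (4) four frequency bands
    for lo, hi in zip(edges[:-1], edges[1:]):
        sub = np.zeros_like(present)
        sub[:, lo:hi] = present[:, lo:hi]
        labelled, n = label(sub)                                # (5) contiguous regions
        fragments.extend(labelled == i for i in range(1, n + 1))
    return fragments                                            # (6) handed to the decoder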

3.1.3. Acoustic models

An 8-state HMM was trained for each of the eleven words in the TIDigits corpus vocabulary (digits 'one' to 'nine', plus the two pronunciations of 0, namely 'oh' and 'zero'). The HMM states have two transitions each: a self transition and a transition to the following state. The emission distribution of each state was modelled by a mixture of 10 Gaussian distributions, each with a diagonal covariance matrix. An additional 3-state HMM was used to model the silence occurring before and after each utterance, and the pauses that may occur between digits.

The scaling constant, α, required to balance missing and present data (see Section 2.3), was empirically tuned by maximising recognition performance on a small set of noisy utterances with an SNR of 10 dB. The value α = 0.3 was found to give the best performance. This value was then used for all noise levels during testing.
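For reference, the word-model topology just described (8 emitting states, each with a self-loop and a forward transition) can be written down as a simple left-to-right transition matrix. The sketch below is illustrative; the equal split between staying and advancing is our assumption, since the trained transition probabilities are not given in the paper.

import numpy as np

def left_to_right_transmat(n_states=8, p_stay=0.5):
    # Left-to-right HMM transition matrix: each state has a self-loop and a
    # transition to the next state; the final state's remaining probability
    # mass is taken as the exit probability of the model.
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = p_stay
        if i + 1 < n_states:
            A[i, i + 1] = 1.0 - p_stay
    return A

Each state's emission density would then be a 10-component diagonal-covariance Gaussian mixture over the 24 subband features, trained on clean speech, with a 3-state model of the same form for silence and inter-digit pauses.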

3.2. Artificial examples

As explained above, if the background noise is non-stationary, the local SNR estimates (which are based on the assumption that the noise is stationary) may be grossly inaccurate. A local peak in noise energy can lead to a spectro-temporal region that is mistakenly labelled as having high local SNR. This error then generates a region in the initial estimate of the speech/background segregation that is incorrectly identified as belonging to the speech source. If this segregation is used directly in conjunction with standard missing data techniques, then the error will lead to poor recognition performance.

Fig. 9. A summary of the front-end processing used to generate the fragments employed in the experiments described in Section 3.

Fragmenting the initial speech segregation and applying the speech fragment decoder should allow incorrectly assigned regions to be rejected from the speech source, thereby producing a better recognition hypothesis. This effect is illustrated in Fig. 10, where the spectro-temporal signal representation has been altered to simulate broad-band noise bursts. These unexpected components appear as bands in the present data mask and hence disrupt the standard missing data recognition technique ('1159' is recognised as '81o85898'). The third image in the figure shows how the mask is now dissected before being passed into the speech fragment decoder. The final panel shows a backtrace of the fragments that the speech fragment decoder marks as present in the winning hypothesis. We see that the noise pulse fragments have been dropped (i.e. relabelled as 'background'). Recognition performance is now much improved ('1159' is recognised as '81159').

Fig. 11 shows a further example, with a different pattern of artificial noise (a series of chirps) imposed upon the same utterance. Again, noise-contaminated fragments are mostly placed into the background by the decoder.

3.3. Results with real noise

The examples discussed in the previous section were artificial and the background intrusions in the data mask were very distinct. The experiments in this section test the technique with speech mixed with factory noise taken from the NOISEX corpus (Varga et al., 1992).

Fig. 10. An example of the speech fragment decoder system performance when applied to data corrupted by artificial transients (see text).

Fig. 12 compares the performance of the speech fragment decoding technique with that of a recogniser using the stationary SNR-based speech/background segregation in conjunction with missing data techniques.

It can be seen that speech fragment decoding provides a significant improvement at the lower SNRs; e.g. at 5 dB, recognition accuracy is improved from 70.1% to 78.1%, a word error rate reduction from 29.9% to 21.9%, or 26.7% relative.

Also shown on the graph are results using a traditional MFCC system with 13 cepstral coefficients, deltas and accelerations, and cepstral mean normalisation (labelled MFCC + CMN). This demonstrates that the speech fragment decoding technique is providing an improvement over a missing data system that is already robust by the standards of traditional techniques.

3.4. Discussion

The results in Fig. 12 labelled 'a priori' show the performance achieved using missing data techniques if prior knowledge of the noise is used to create a perfect local SNR mask. Even using the speech fragment decoding technique, results fall far short of this upper limit once the noise level rises and the SNR drops below 10 dB.

One possible cause of this significant performance gap is that the fragments supplied to the speech fragment decoder are not sufficiently coherent. In this work we have used a simple set of fragments generated by aggregating high-energy regions in the SNR mask. If the noise and speech sources occupy adjoining spectro-temporal regions, this technique will not be able to separate them. This is evident in Figs. 10 and 11 where, as a result of both noise and speech being mixed in the same fragment, much clean speech energy has been removed from the masks and some of the noise energy has survived.

Fig. 11. Another example of speech fragment decoding for data corrupted with artificial chirps.

Fig. 12. Recognition results for a baseline MFCC system, a missing data system, and the speech fragment decoder system. The 'a priori' line represents results that are potentially achievable if the speech can be perfectly segregated from the noise. (The plot shows digit recognition accuracy against SNR in dB for factory noise, with curves for the a priori mask, the fragment decoder, missing data and MFCC+CMN.)

The artificial examples highlight that the success of the system is strongly dependent on the quality of the segregation model. By producing incoherent fragments, the segregation model limits the performance of the recogniser, as it has effectively made hard decisions that cannot be undone at a later stage. Of course, the coherence of the fragments can be easily increased by splitting them into smaller and smaller pieces. At the extreme, each fragment may contain a single spectro-temporal pixel, which by definition must be coherent. However, overzealous fragmentation also has undesirable consequences. First, it greatly increases the size of the segregation search space and hence increases the computational cost of the decoding process. Second, it weakens the constraints imposed by the segregation model. If there are a very large number of small fragments, the decoder is more able to construct spurious speech descriptions by piecing together spectro-temporal pieces from the collection of sound sources present.

4. Discussion

In this paper we have laid the foundation for a statistical approach to computational auditory scene analysis. In the sections that follow, we discuss some of the issues that have arisen with our current implementation and suggest some possible future research directions.

4.1. Improvements to fragment generation

The fragments in the current system rely on a very simple and crude model: mainly that energy below an estimated 'noise floor' is to be ignored, and the remainder can be divided up according to some simple heuristics. It is likely that more powerful fragmentation techniques will result in significant performance gains. In general, one can imagine a two-phase process in which cues for auditory grouping (as listed, for example, in Bregman, 1990, and Table 1 of Cooke and Ellis, 2001) are applied to aggregate auditory filter outputs across time and frequency, followed by the application of segregation principles which serve to split the newly-formed regions. In contrast with earlier approaches to grouping and segregation, such a strategy can afford to be conservative in its application of grouping principles, since some of the work of aggregation can be left to the decoder. In fact, since any groups formed at this stage cannot later be split, it is essential that any hard-and-fast decisions are based on reliable cues for grouping. In practice, this can be achieved both by adopting more stringent criteria for incorporation of time-frequency regions into groups and by weakening criteria for the splitting of groups.

For instance, within the regions currently marked as 'voiced', subband periodicity measures could indicate whether frequency channels appear to be excited by a single voice, or whether multiple pitches suggest the division of the spectrum into multiple voices (as in Brown and Cooke, 1994). Sudden increases in energy within a single fragment should also precipitate a division, on the basis that this is strong evidence of a new sound source appearing.

The application of stricter grouping criteria may appear to result in a loss of valuable information about which regions are likely to belong together. However, we show in the following section that such information can be employed during the decoding stage.

4.2. Statistical versus rule-based segregation models

The speech fragment decoding theory is expressed in terms of a statistical segregation model. However, the primitive grouping principles described in the previous section have tended to be modelled by essentially rule-based systems and have previously lacked a clear statistical footing. Psychoacousticians have set out to look for grouping rules using reductionist approaches, essentially by studying the percepts generated by highly simplified acoustic stimuli. Rule-based models that rely on small sets of parameters can be hand-tuned to fit such empirical psychoacoustic data.


An alternative approach is to build a statistical segregation model from labelled noisy data. Labels can be attached to the noisy acoustic data if the contributions of the individual sources are known prior to mixing. A corpus of synthetic sound scenes could be used to achieve this aim.

4.3. Between-fragment grouping

Psychoacoustic experiments provide evidence that weak grouping effects may exist between the tightly bound local spectro-temporal fragments. For example, a sequence of tonal elements is more likely to be perceived as emanating from the same sound source if the elements have similar frequencies (Van Noorden, 1975). These grouping effects may allow a fragment to have an influence on the evolving source interpretation that spans a considerable temporal window. However, such between-fragment grouping effects have a probabilistic nature and their influence can be overcome by learned patterns, such as musical melody (Hartmann and Johnson, 1991) or speech (Culling and Darwin, 1993).

Between-fragment grouping effects may be best modelled as soft biases rather than hard-and-fast rules. One approach would be to estimate prior probabilities of the segregation hypotheses according to various distance measures between the fragments composing the sources that the segregation describes. A suitable distance measure may be based on the similarity of a vector of fragment properties such as mean frequency, spectral shape, spatial location and mean energy. The posterior probability of pairs of fragments belonging to the same source given their properties could then be learnt using training data employing a priori fragments similar to those used in Section 2.4. Such probabilities could be added into the segregation model by appropriately adjusting the scores for each evolving segregation hypothesis as each new fragment is considered by the decoding process.
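The sketch below illustrates one way such a soft bias could enter the search: a small property vector is computed for each fragment, a learned model gives the probability that two fragments share a source, and the log of that probability is added to the score of any hypothesis that assigns both fragments to the target. The property set, the logistic-regression form and all names are assumptions for illustration only.

```python
import numpy as np

def fragment_properties(frag_cells, spec):
    """Summarise a fragment by a small property vector.

    frag_cells : list of (frame, channel) index pairs in the fragment.
    spec       : (frames, channels) log-energy spectrogram.
    """
    frames, channels = zip(*frag_cells)
    energies = spec[list(frames), list(channels)]
    return np.array([
        np.mean(channels),    # mean frequency (channel index)
        np.mean(energies),    # mean energy
        np.std(channels),     # crude spectral spread
    ])

def same_source_log_bias(props_a, props_b, weights, bias):
    """Hypothetical learned model: logistic regression on the absolute
    difference of two property vectors, returning log P(same source)."""
    diff = np.abs(props_a - props_b)
    p_same = 1.0 / (1.0 + np.exp(-(weights @ diff + bias)))
    return np.log(p_same)

# During decoding, a hypothesis that labels both fragments as target would
# have same_source_log_bias(...) added to its running score.
```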

When including long-term between-fragment grouping probabilities in the segregation model, some care has to be taken with the speech fragment decoding algorithm to ensure that the Markov property is preserved and that the segregation/word-sequence search remains admissible. In the version of the algorithm described in Section 2.2, decisions about the best labelling of a fragment are made at the instant at which the fragment offsets. However, allowing for between-fragment effects, it is not possible to know at this time point how the labelling of the present fragment will influence the labelling of fragments occurring in the future. This problem can be overcome by, first, limiting the temporal extent of the between-fragment grouping effects to a fixed number of frames, say T,⁶ and, second, delaying the decision over how to label a given fragment until the decoder has passed the offset of the fragment by T frames.
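A crude way to picture this delayed commitment is a buffer of pending fragments: a fragment's label is only fixed once the decoder is T frames past its offset, so any between-fragment interaction inside that window can still affect the choice. The sketch below is purely illustrative bookkeeping, not the decoder's actual data structures, and it assumes fragments are queued in order of offset.

```python
from collections import deque

def commit_labels(new_offsets, current_frame, delay_T, pending=None):
    """Release fragment-labelling decisions only T frames after each offset.

    new_offsets   : iterable of (fragment_id, offset_frame) pairs whose
                    offsets have just been reached, in offset order.
    current_frame : frame the decoder has just processed.
    delay_T       : number of frames to wait past each fragment's offset.
    pending       : deque of fragments still inside the delay window.
    """
    if pending is None:
        pending = deque()
    pending.extend(new_offsets)

    committed = []
    # Only fragments whose offset lies at least T frames in the past may
    # have their labelling fixed; later fragments remain open hypotheses.
    while pending and pending[0][1] + delay_T <= current_frame:
        committed.append(pending.popleft())
    return committed, pending
```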

Note that the delay in fragment labelling decisions necessitated by between-fragment grouping effects will mean that there are on average more active hypotheses at any instant. The growth in the number of hypotheses will in general be an exponential function of the length of the delay, which, in turn, has to match the duration of the temporal influence between fragments. Consequently, there is a trade-off between the temporal extent of the between-fragment grouping influences and the size of the segregation search space (and hence the computational cost of the decoding procedure).

4.4. Approximating P(X)

In Eq. (12), we factored the ratio of the likelihood of the speech features conditioned on segregation and mask to their prior values by essentially assuming their values were independent at each time step i, i.e. we took:

$$\frac{P(X \mid S, Y)}{P(X)} \approx \prod_i \frac{P(x_i \mid S, Y)}{P(x_i)}. \qquad (23)$$

This independence assumption is certainly incorrect, but difficult to avoid in practical systems. We note, however, that depending on how $P(x_i \mid S, Y)/P(x_i)$ is calculated, the ratio may be reasonable even when the numerator and denominator include systematic error factors, as long as those factors are similar.

⁶ That is to say, between-fragment grouping probabilities are included for interactions between the fragment that is ending and each fragment that overlaps a window extending back T frames before the fragment ended.


A second weak point is our model for the prior distribution of individual speech feature elements at a single time frame, $P(x_j)$, as uniform between zero and some global constant $x_{\max}$. It would be relatively simple to improve this, e.g. by using individual single-Gaussian models of the prior distribution of features in each dimension. Since this applies only to the clean speech features X rather than to the unpredictable noisy observations Y, we already have the training data we need.
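As a concrete illustration of this suggested refinement (a sketch under assumed names, not the implementation used here), the per-dimension uniform prior could be replaced by single Gaussians fitted to clean training features, and the fitted prior used in the denominator of each per-frame ratio in Eq. (23).

```python
from scipy.stats import norm

def fit_feature_priors(clean_features):
    """Fit an independent single-Gaussian prior to each feature dimension.

    clean_features : (num_frames, num_dims) array of clean-speech features.
    Returns per-dimension means and standard deviations.
    """
    return clean_features.mean(axis=0), clean_features.std(axis=0)

def log_prior(x, means, stds):
    """Log P(x) for one frame under the fitted per-dimension Gaussian
    priors, replacing the uniform 1/x_max assumption."""
    return norm.logpdf(x, loc=means, scale=stds).sum()
```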

4.5. Three-way labelling of time-frequency cells

Although the primary purpose of the current system is to decide which time-frequency pixels can be used as evidence for the target voice, we note that there is actually a three-way classification occurring: first between stationary background and foreground (by the initial noise estimation stage), then of the foreground energy into speech and non-speech fragments (by the decoding process). This special status of the stationary background is not strictly necessary (those regions could be included in the search, and would presumably always be labelled as non-speech), but it may reveal something more profound about sound perception in general. Just as it is convenient and efficient to identify and discard the 'background roar' as the first processing stage in this system, perhaps biological auditory systems perform an analogous process of systematically ignoring energy below a slowly varying threshold.

4.6. Computational complexity

In the Aurora experiments, the number of fragments per utterance often exceeded 100. However, as illustrated in Fig. 8(G), the maximum number of simultaneous fragments was never greater than 10, and the average number of hypotheses per frame, computed over the full test set, was below 4. Although the decoder evaluates on average roughly four times as many hypotheses as a standard missing data decoder, much of the probability calculation may be shared between hypotheses, and hence the computational load is increased by a much smaller factor.

4.7. Decoding multiple sources

A natural future extension would be to search for fits across multiple simultaneous models, possibly permitting the recognition of both voices in simultaneous speech. This resembles the ideas of HMM decomposition (Varga and Moore, 1990; Gales and Young, 1993). However, because each 'coherent fragment' is assumed to correspond to only a single source, the likelihood evaluation is greatly simplified. The arguments about the relationship between large, coherent fragments and search efficiency remain unchanged.

5. Conclusion

We have presented a statistical foundation for computational auditory scene analysis, and developed from this framework an approach to recognising speech in the presence of other sound sources that combines (i) a bottom-up processing stage to produce a set of source fragments with (ii) a top-down search which, given models of clean speech, uses missing data recognition techniques to find the most likely combination of speech/background labelling and speech model sequence. Preliminary ASR experiments show that the system can produce recognition performance improvements even with a simplistic implementation of the bottom-up processing. We believe that through the application of more sophisticated CASA-style sound source organisation techniques, we will be able to improve the quality of the fragments fed to the top-down search and further improve the performance of the system.

References

Bailey, P., Dorman, M., Summerfield, A., 1977. Identification of sine-wave analogs of CV syllables in speech and non-speech modes. J. Acoust. Soc. Amer. 61.
Barker, J., Cooke, M., 1999. Is the sine-wave speech cocktail party worth attending? Speech Commun. 27, 159–174.
Bell, A.J., Sejnowski, T.J., 1995. An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 7 (6), 1004–1034.


Bourlard, H., Dupont, S., 1997. Subband-based speech recognition. In: Proc. ICASSP'97, Munich, Germany, April 1997, pp. 1251–1254.
Bregman, A.S., 1990. Auditory Scene Analysis. MIT Press.
Brown, G.J., Cooke, M., 1994. Computational auditory scene analysis. Comput. Speech Lang. 8, 297–336.
Cooke, M.P., 1991. Modelling auditory processing and organisation. Ph.D. Thesis, Department of Computer Science, University of Sheffield.
Cooke, M., Ellis, D., 2001. The auditory organisation of speech and other sound sources in listeners and computational models. Speech Commun. 35, 141–177.
Cooke, M., Green, P., Crawford, M., 1994. Handling missing data in speech recognition. In: Proc. ICSLP'94, Yokohama, Japan, September 1994, pp. 1555–1558.
Cooke, M., Green, P., Josifovski, L., Vizinho, A., 2001. Robust automatic speech recognition with missing and unreliable acoustic data. Speech Commun. 34 (3), 267–285.
Culling, J., Darwin, C., 1993. Perceptual separation of simultaneous vowels: within and across-formant grouping by F0. J. Acoust. Soc. Amer. 93 (6), 3454–3467.
Cunningham, S., Cooke, M., 1999. The role of evidence and counter-evidence in speech perception. In: Proc. ICPhS'99, pp. 215–218.
Denbigh, P., Zhao, J., 1992. Pitch extraction and separation of overlapping speech. Speech Commun. 11, 119–125.
Ellis, D.P.W., 1996. Prediction-driven computational auditory scene analysis. Ph.D. Thesis, Department of Electrical Engineering and Computer Science, MIT.
Gales, M.J.F., Young, S.J., 1993. HMM recognition in noise using parallel model combination. In: Proc. Eurospeech'93, Vol. 2, pp. 837–840.
Hartmann, W., Johnson, D., 1991. Stream segregation and peripheral channeling. Music Percept. 9 (2), 155–184.
Hyvarinen, A., Oja, E., 2000. Independent component analysis: algorithms and applications. Neural Networks 13 (4–5), 411–430.
Leonard, R., 1984. A database for speaker-independent digit recognition. In: Proc. ICASSP'84, pp. 111–114.
Parsons, T., 1976. Separation of speech from interfering speech by means of harmonic selection. J. Acoust. Soc. Amer. 60 (4), 911–918.
Pearce, D., Hirsch, H.-G., 2000. The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: Proc. ICSLP'00, Vol. 4, Beijing, China, October 2000, pp. 29–32.
Remez, R., Rubin, P., Pisoni, D., Carrell, T., 1981. Speech perception without traditional speech cues. Science 212, 947–950.
Scheffers, M., 1983. Sifting vowels: auditory pitch analysis and sound segregation. Ph.D. Thesis, University of Groningen, The Netherlands.
Slaney, M., 1995. A critique of pure audition. In: Proc. 1st Workshop on Computational Auditory Scene Analysis, Internat. Joint Conf. Artificial Intelligence, Montreal, pp. 13–18.
Van Noorden, L., 1975. Temporal coherence in the perception of tone sequences. Ph.D. Thesis, Eindhoven University of Technology.
Varga, A.P., Moore, R.K., 1990. Hidden Markov model decomposition of speech and noise. In: Proc. ICASSP'90, pp. 845–848.
Varga, A., Steeneken, H., Tomlinson, M., Jones, D., 1992. The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Tech. rep., Speech Research Unit, Defence Research Agency, Malvern, UK.
Wang, D., Brown, G., 1999. Separation of speech from interfering sounds based on oscillatory correlation. IEEE Trans. Neural Networks 10 (3), 684–697.
