Sparse Nonnegative Matrix Based on αβ-Divergence for Single Channel Separation in Cochleagram



International Journal of Mathematics and Computer Applications Research (IJMCAR)
ISSN 2249-6955, Vol. 2, Issue 4, Dec 2012, 11-24
TJPRC Pvt. Ltd.

SPARSE NONNEGATIVE MATRIX BASED ON αβ-DIVERGENCE FOR SINGLE CHANNEL SEPARATION IN COCHLEAGRAM

    M. E. ABD EL AZIZ & WAEL KIDER

    Department of Mathematics, Faculty of Science, Zagazig University, Zagazig 44519, Egypt

    ABSTRACT

In this paper, a novel family of αβ-divergence based two-dimensional nonnegative matrix factorization methods for solving SCBSS is proposed. The cochleagram-based separation system and the family of αβ-divergence based factorization algorithms have been developed in a principled manner, coupled with theoretical support for audio signal separability. The proposed method enjoys at least two significant advantages. Firstly, the cochleagram rendered by the gammatone filterbank has non-uniform time-frequency resolution, which makes the mixed signal more separable and improves the efficiency of source tracking. Secondly, the divergence has the desirable property of scale invariance, which lets low-energy components in the cochleagram bear the same relative importance as the high-energy ones. We compare our system with the Factorial SC and SNMF2D models, and the proposed algorithm shows superior performance in terms of signal-to-interference ratio. Finally, the low computational requirements of the algorithm allow close to real-time applications.

KEYWORDS: Blind Signal Separation (BSS), Nonnegative Matrix Factorization (NMF), αβ-Divergence, αβ-NMF, Single Channel Source Separation (SCSS)

    INTRODUCTION

Single channel source separation (SCSS) aims to extract several source signals from a single mixture recording. Because at least two sources interfere and may overlap in time, standard source separation methods such as ICA (Hyvarinen et al 2001) cannot be applied, and the standard NMF or SNMF models (Schmidt et al 2006) are satisfactory for source separation only when the spectral frequencies do not change over time. The recent SNMF2D model (Gao et al 2011) addresses this limitation of SNMF by optimizing the spectral dictionary and the temporal code with the Kullback-Leibler divergence, exploiting the observation that sources rarely interfere in a time-frequency representation. This observation has also been used in computational auditory scene analysis (Wang et al 2006, Brown 1994), inspired by the human ability to organize the perceived time-frequency representation according to likely sources. However, SNMF2D lacks a generalized criterion for controlling sparsity. Roweis (Roweis 2003) introduced the refiltering framework, which uses so-called spectrogram masks to attenuate spectrogram parts that do not belong to the desired source. To estimate these masks, he proposed the factorial-max vector quantizer (VQ) model, which assumes that the log-magnitude source spectrograms are generated by vector quantizers plus a noise term. To train speaker-specific code-books and to estimate the noise variances, he applied k-means to source-specific spectrograms. Hence, max-VQ explicitly models the sources in a training stage. The factorial-max VQ model can be extended by replacing the vector quantizers with sparse coders (Peharz 2010). A sparse coder can be seen as a generalization of a vector quantizer, since it represents data with a linear combination of a limited number of so-called atoms (the number being a parameter to choose), while a vector quantizer uses a single, non-scalable code-word.


To train speaker-specific dictionaries, it uses a non-negative matrix factorization algorithm with sparseness constraints on the coefficient matrix. The sparse coder model suffers from some drawbacks: it is affected by outliers and noise, since it uses the Euclidean distance, and it relies on the STFT, which produces errors especially when complicated transient phenomena, such as the mixing of speech and music, occur in the analysed signal.

The aim of this work is to remedy these drawbacks. We formulate a single channel NMF model that accounts for convolutive mixing and can be seen as a generalization of (Peharz 2010), in which the αβ-NMF algorithm is used because it is robust with respect to noise and/or outliers in the single channel convolutive setting. The source cochleagrams are modeled through NMF, and the mixing filters serve to identify the elementary components pertaining to each source.

The remainder of this paper is organized as follows. The single channel NMF model is introduced in Section 2. Section 3 is devoted to the Factorial Sparse Coder algorithm. Section 4 gives the definition of the αβ-divergence. Section 5 presents the estimation of the spectral basis and temporal code. Section 6 presents the results of our algorithm on source separation in various settings. Conclusions are drawn in Section 7.

    SINGLE CHANNEL NMF MODEL

We consider a sampled signal $x(n)$ generated as a convolutive noisy mixture of two point source signals $s_j(n)$, $j = 1, 2$, such that

$$x(n) = \sum_{j} \sum_{\tau} a_j(\tau)\, s_j(n - \tau) + e(n) \qquad (1)$$

where $e(n)$ is additive noise. The time-domain mixing given by (1) can be approximated in the short-time Fourier transform (STFT) domain as

$$X(f, t) \approx \sum_{j} A_j(f)\, S_j(f, t) + E(f, t) \qquad (2)$$

where $X(f,t)$, $S_j(f,t)$ and $E(f,t)$ are the complex-valued STFTs of the corresponding time signals, $f = 1, \dots, F$ is a frequency bin index and $t = 1, \dots, T$ is a time frame index. Equation (2) can be rewritten in matrix form as

$$\mathbf{X} \approx \sum_{j} \mathrm{diag}(\mathbf{A}_j)\, \mathbf{S}_j + \mathbf{E} \qquad (3)$$

We use NMF to model the power spectrogram $\mathbf{Y}_j = |\mathbf{S}_j|^{.2}$ of source $j$ as a product of two nonnegative matrices $\mathbf{W}_j$ and $\mathbf{H}_j$, such that

$$\mathbf{Y}_j \approx \mathbf{W}_j \mathbf{H}_j \qquad (4)$$

The 3D representation of the matrices $\mathbf{W}_j$ and $\mathbf{H}_j$ is presented in Figure 1.


Figure 1: (A) Frontal Slice 3D Representation (B) Vertical and Horizontal Slice 3D Representation
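As a concrete illustration of the factorization model in (4), the following MATLAB sketch factorizes a given nonnegative power spectrogram V into the product W*H using the standard Lee-Seung multiplicative updates (Lee et al 2001); V, the rank K and the iteration count are placeholder choices for illustration, not the values used in the paper.

% Minimal NMF of a power spectrogram V (F x T), model V ~ W*H (cf. equation (4))
K = 20;  maxIter = 200;  epsSafe = 1e-9;   % illustrative settings
[F, T] = size(V);
W = rand(F, K);                            % random nonnegative initialization
H = rand(K, T);
for it = 1:maxIter
    % Lee-Seung multiplicative updates for the least-squares cost
    H = H .* (W' * V) ./ (W' * (W * H) + epsSafe);
    W = W .* (V * H') ./ (W * (H * H') + epsSafe);
end
Vhat = W * H;                              % nonnegative approximation of the spectrogram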

    FACTORIAL SPARSE CODER ALGORITHM

In this section we describe the Factorial Sparse Coder model (Factorial SC) (Peharz 2010). It uses a method similar to the K-SVD algorithm for training dictionaries for sparse coders, which consists of two stages. For the sparse coding stage it proposes non-negative matching pursuit (NMP), a non-negative variant of OMP. In the dictionary update step it uses several iterations of the nonnegative matrix factorization (NMF) algorithm proposed by Lee and Seung (Lee et al 2001). The Factorial Sparse Coder model reformulates equation (4) frame by frame as

$$\mathbf{y} \approx \sum_{j} \mathbf{W}_j[:, \mathbf{u}_j]\, \mathbf{h}_j$$

where $\mathbf{W}_j$ is a source-specific dictionary, $\mathbf{h}_j$ is the corresponding coefficient vector and $\mathbf{u}_j$ is an index vector indicating the selected atoms. A summary of the Factorial SC algorithm can be found in Algorithm 1.

A solution is defined as a triplet $(\mathbf{u}, \mathbf{h}, \mathbf{r})$, where $\mathbf{u}$ contains the indices of the selected atoms, $\mathbf{h}$ holds the corresponding coefficients and $\mathbf{r}$ is the residual. The set of all solutions is denoted $\mathcal{S}$. Starting with a single trivial solution $(\emptyset, \emptyset, \mathbf{y})$, in every iteration each solution is extended with up to $B$ atoms, selected by the function selectBestAtoms. In selectBestAtoms, the correlations $\mathbf{c} = \mathbf{W}^{T}\mathbf{r}$ of the atoms with the residual are calculated. Atoms with negative values in $\mathbf{c}$, and atoms which would make the prior probability zero, are discarded, where the prior probability of a selection (5) is calculated according to the original dictionaries as a product of a marginal and a conditional factor, both of which can be estimated from the coefficient matrix returned by NMF (Peharz 2010). If $B'$ is the number of remaining atoms, the $\min(B, B')$ atoms with the largest values in $\mathbf{c}$ are selected. The inner products and the indices of the selected atoms are returned in the vectors $\mathbf{c}$ and $\mathbf{u}$. In lines 10-12 of Algorithm 1, NMF updates are performed for the coefficient vector $\mathbf{h}$, which approximate equation (6):

$$\mathbf{h} = \arg\min_{\mathbf{h}} \left\| \mathbf{y} - \mathbf{W}[:, \mathbf{u}]\, \mathbf{h} \right\|^{2} \quad \text{s.t.} \quad \mathbf{h} \ge 0 \qquad (6)$$
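As an illustration of the non-negative least-squares step in (6), the MATLAB sketch below estimates the coefficients for a fixed set of selected atoms; Wu (the sub-dictionary of selected atoms) and y (the current frame) are placeholder variables, and the multiplicative variant is only one possible approximation.

% Coefficients for the selected atoms Wu (F x |u|) and frame y (F x 1), cf. (6):
%   h = argmin_h || y - Wu*h ||^2   subject to  h >= 0
h = lsqnonneg(Wu, y);                      % exact NNLS solver shipped with MATLAB

% Alternatively, a few multiplicative updates give an approximate nonnegative solution:
h = rand(size(Wu, 2), 1);
for it = 1:50
    h = h .* (Wu' * y) ./ (Wu' * (Wu * h) + 1e-9);
end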


Continuing in this manner, the solution set can grow in every iteration. After a fixed number of iterations, the algorithm starts to prune the solution set to the best solutions in every iteration, i.e. it selects the solutions with the highest posterior (7), where the probabilities are evaluated according to the original dictionaries:

$$P(\mathbf{u}, \mathbf{h} \,|\, \mathbf{y}) \propto P(\mathbf{y} \,|\, \mathbf{u}, \mathbf{h})\, P(\mathbf{u}) \qquad (7)$$

The Laplacian form factors of the likelihood can be estimated from the residual error in the training stage. When the algorithm has stopped, it selects the solution with maximal posterior out of the final solution set and builds the coefficient matrix $\mathbf{H}$, which is split according to the original dictionaries into $\mathbf{H}_1$ and $\mathbf{H}_2$. The approximations of the source spectrograms are then given as $\hat{\mathbf{S}}_j = \mathbf{W}_j \mathbf{H}_j$. A mask $\mathbf{M}_j$ is calculated for each source from these approximations, $j = 1, 2$. Finally, approximations of the source signals are given by the inverse short-time Fourier transform (ISTFT) of the masked mixture, $\hat{s}_j = \mathrm{ISTFT}(\mathbf{M}_j \odot \mathbf{X})$, where $\mathbf{X}$ is the original complex mixture spectrogram (a MATLAB sketch of this resynthesis step is given after Algorithm 1).

Algorithm 1: Factorial SC

1.  initialize the solution set with the trivial solution (∅, ∅, y)
2.  for l = 1 : L
3.      …
4.      for each solution in the solution set
5.          select up to B atoms with selectBestAtoms
6.          …
7.          for b = 1 : |selected atoms|
8.              …
9.              …
10.             for j = 1 : J
11.                 NMF update of the coefficient vector (cf. equation (6))
12.             endfor
13.             …
14.             …
15.         endfor
16.     endfor
17.     …
18.     if l > … then
19.         Prune to the best solutions
20.     endif
21. endfor
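The resynthesis step described just before Algorithm 1 can be sketched in MATLAB as follows. The ratio-mask form is an illustrative assumption, and istft (Signal Processing Toolbox, R2019a or later) stands in for whatever inverse transform is paired with the forward STFT; the window and overlap match the experimental setup reported later.

% Shat1, Shat2: estimated source spectrograms; X: complex STFT of the mixture (same size),
% assumed to be produced by stft with the same window and overlap used below
M1 = Shat1 ./ (Shat1 + Shat2 + 1e-9);      % soft mask for source 1 (illustrative choice)
M2 = 1 - M1;                               % soft mask for source 2
win = hamming(1024, 'periodic');           % same analysis window as the forward STFT
s1hat = istft(M1 .* X, 'Window', win, 'OverlapLength', 512);
s2hat = istft(M2 .* X, 'Window', win, 'OverlapLength', 512);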

Since this algorithm works on the STFT, it inherits some drawbacks. The classical spectrogram computed by the STFT has equally spaced frequency channels of equal bandwidth. Speech signals are highly non-stationary and non-periodic, whereas music changes continuously; therefore, application of the Fourier transform produces errors, especially when complicated transient phenomena, such as the mixing of speech and music, occur in the analysed signal. Unlike the spectrogram, the log-frequency spectrogram possesses non-uniform TF resolution. However, it does not exactly match the nonlinear resolution of the cochlea, since its centre frequencies are distributed logarithmically along the frequency axis and all filters have a constant Q factor (Brown 1991).

On the other hand, the gammatone filters used in the cochlear model are approximately logarithmically spaced with constant Q for centre frequencies from $f_s/10$ to $f_s/2$ and approximately linearly spaced for frequencies below $f_s/10$, where $f_s$ is the sampling frequency. Hence, this characteristic results in a selective, non-uniform resolution in the TF representation of the analysed audio signal.


The gammatone filterbank was previously proposed in (Hu et al 2007, Jin et al 2009) as a model of cochlear filtering which decomposes the time-domain input into the frequency domain. The impulse response of a gammatone filter centered at frequency $f_c$ is given by

$$g_{f_c}(t) = \begin{cases} t^{N-1}\, e^{-2\pi b t}\, \cos(2\pi f_c t), & t \ge 0 \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

where $N$ denotes the order of the filter and $b$ represents the equivalent rectangular bandwidth, which increases as the center frequency $f_c$ increases. With regard to a particular filter channel $c$ with center frequency $f_c$, the filter output response $x(c, t)$ can be expressed as

$$x(c, t) = x(t) * g_{f_c}(t) \qquad (9)$$

where $*$ represents convolution. The response is shifted backwards in time to compensate for the filter delay. The output of each filter channel is divided into time frames with 50% overlap between consecutive frames (Hu et al 2005). The resulting outputs form the time-frequency spectra which are then combined to form the cochleagram. The use of the gammatone filter is consistent with the neurobiological modeling perspective. Figure 2 shows an example of the frequency responses of different types of transforms.

By working in the cochleagram, we avoid the problems of the STFT; in the next section the αβ-divergence is introduced to address the problem of outliers/noise arising from the use of the Euclidean distance.

Figure 2: Different Types of Transforms: (A) Original Source (B) Cochleagram (C) Spectrum (D) Log-Spectrum
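A minimal MATLAB sketch of the gammatone impulse response in (8) and the channel filtering in (9); the ERB-based bandwidth formula, the sampling rate and the centre frequency are common illustrative choices rather than the exact values used in the paper, and x is a placeholder time-domain signal.

% Gammatone impulse response g(t) = t^(N-1) exp(-2*pi*b*t) cos(2*pi*fc*t), t >= 0 (cf. (8))
fs = 16000;                                % sampling rate in Hz (illustrative)
fc = 1000;                                 % centre frequency of one channel in Hz
N  = 4;                                    % filter order, as used for the 128-channel filterbank
b  = 1.019 * (24.7 + 0.108 * fc);          % ERB-based bandwidth (a common choice)
t  = (0:round(0.05 * fs) - 1)' / fs;       % 50 ms of impulse response
g  = t.^(N - 1) .* exp(-2 * pi * b * t) .* cos(2 * pi * fc * t);

% Response of this channel to a signal x (cf. (9)); repeating over 128 channels
% and framing the outputs (20 ms, 50% overlap) yields the cochleagram.
xc = conv(x, g, 'same');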

αβ-DIVERGENCE

The αβ-divergence (Cichocki et al 2011) can be defined as

$$D_{AB}^{(\alpha,\beta)}(\mathbf{P} \,\|\, \mathbf{Q}) = -\frac{1}{\alpha\beta} \sum_{f,t} \left( p_{ft}^{\alpha}\, q_{ft}^{\beta} - \frac{\alpha}{\alpha+\beta}\, p_{ft}^{\alpha+\beta} - \frac{\beta}{\alpha+\beta}\, q_{ft}^{\alpha+\beta} \right), \qquad \alpha, \beta, \alpha+\beta \neq 0 \qquad (10)$$

By a suitable choice of the $(\alpha, \beta)$ parameters, this divergence simplifies into several existing divergences, including the well-known Alpha- and Beta-divergences. For example, when $\alpha + \beta = 1$ the αβ-divergence reduces to the Alpha-divergence (Cichocki et al 2009):


$$D_{A}^{(\alpha)}(\mathbf{P} \,\|\, \mathbf{Q}) = \frac{1}{\alpha(\alpha-1)} \sum_{f,t} \left( p_{ft}^{\alpha}\, q_{ft}^{1-\alpha} - \alpha\, p_{ft} + (\alpha - 1)\, q_{ft} \right), \qquad \alpha \neq 0, 1 \qquad (11)$$

On the other hand, when $\alpha = 1$, it reduces to the Beta-divergence (Cichocki et al 2010):

$$D_{B}^{(\beta)}(\mathbf{P} \,\|\, \mathbf{Q}) = -\frac{1}{\beta} \sum_{f,t} \left( p_{ft}\, q_{ft}^{\beta} - \frac{1}{1+\beta}\, p_{ft}^{1+\beta} - \frac{\beta}{1+\beta}\, q_{ft}^{1+\beta} \right), \qquad \beta \neq 0, -1 \qquad (12)$$

The αβ-divergence also reduces to the standard Itakura-Saito divergence for $\alpha = 1$ and $\beta = -1$ (Lee et al 2001):

$$D_{IS}(\mathbf{P} \,\|\, \mathbf{Q}) = \sum_{f,t} \left( \ln\frac{q_{ft}}{p_{ft}} + \frac{p_{ft}}{q_{ft}} - 1 \right) \qquad (13)$$
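The generic case of the αβ-divergence (10) can be evaluated with the following MATLAB sketch (the limiting cases (11)-(13) would require the usual special-case branches); the commented example numerically checks the scaling behaviour stated in (14) below. Names and test values are illustrative.

function d = ab_divergence(P, Q, alpha, beta)
% Alpha-beta divergence of equation (10) for alpha, beta, alpha+beta ~= 0.
    s = alpha + beta;
    d = -1 / (alpha * beta) * sum( P(:).^alpha .* Q(:).^beta ...
                                   - (alpha / s) * P(:).^s ...
                                   - (beta  / s) * Q(:).^s );
end
% Example check of the scaling property, D(cP||cQ) = c^(alpha+beta) * D(P||Q):
%   P = rand(5); Q = rand(5); c = 3; a = 0.5; b = 0.7;
%   abs(ab_divergence(c*P, c*Q, a, b) - c^(a+b) * ab_divergence(P, Q, a, b))   % close to 0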

We use the αβ-divergence for several reasons discussed in (Cichocki et al 2011), which illustrates the role of the hyper-parameters $\alpha$ and $\beta$ in the robustness of the αβ-divergence with respect to errors and noise, and compares the behavior of the αβ-divergence with the standard Kullback-Leibler divergence. Moreover, scaling the arguments of the αβ-divergence by a positive factor $c > 0$ yields the relation

$$D_{AB}^{(\alpha,\beta)}(c\mathbf{P} \,\|\, c\mathbf{Q}) = c^{\alpha+\beta}\, D_{AB}^{(\alpha,\beta)}(\mathbf{P} \,\|\, \mathbf{Q}) \qquad (14)$$

This basic property implies that, whenever $\alpha \neq 0$, we can rewrite the αβ-divergence in terms of a Beta-divergence of order $\beta/\alpha$ combined with an $\alpha$-zoom of its arguments:

$$D_{AB}^{(\alpha,\beta)}(\mathbf{P} \,\|\, \mathbf{Q}) = \frac{1}{\alpha^{2}}\, D_{AB}^{(1,\, \beta/\alpha)}\!\left( \mathbf{P}^{.\alpha} \,\|\, \mathbf{Q}^{.\alpha} \right) \qquad (15)$$

Estimation of the Spectral Basis and Temporal Code

In order to use the αβ-divergence, our objective function is

$$D_{AB}^{(\alpha,\beta)}\!\left( \mathbf{Y} \,\|\, \hat{\mathbf{Y}} \right) \qquad (16)$$

where $\hat{\mathbf{Y}}$ is the structure defined by the model (4),

$$\hat{\mathbf{Y}} = \mathbf{W}\mathbf{H} \qquad (17)$$

Let $\theta$ be a scalar parameter of the set $\{\mathbf{W}, \mathbf{H}\}$. The derivative of (16) w.r.t. $\theta$ is

$$\frac{\partial D_{AB}^{(\alpha,\beta)}(\mathbf{Y} \,\|\, \hat{\mathbf{Y}})}{\partial \theta} = \sum_{f,t} \frac{\partial D_{AB}^{(\alpha,\beta)}(y_{ft} \,\|\, \hat{y}_{ft})}{\partial \hat{y}_{ft}}\, \frac{\partial \hat{y}_{ft}}{\partial \theta} \qquad (18)$$

where the derivative of $D_{AB}^{(\alpha,\beta)}(y_{ft} \,\|\, \hat{y}_{ft})$ w.r.t. $\hat{y}_{ft}$ is given by

$$\frac{\partial D_{AB}^{(\alpha,\beta)}(y_{ft} \,\|\, \hat{y}_{ft})}{\partial \hat{y}_{ft}} = -\frac{1}{\alpha}\, \hat{y}_{ft}^{\,\alpha+\beta-1} \left( \left( \frac{y_{ft}}{\hat{y}_{ft}} \right)^{\alpha} - 1 \right) \qquad (19)$$


The gradient of the αβ-divergence can be expressed in a compact form (for any $\alpha$, $\beta$) in terms of the $1-\alpha$ deformed logarithm,

$$\ln_{1-\alpha}(z) = \begin{cases} \dfrac{z^{\alpha} - 1}{\alpha}, & \alpha \neq 0 \\ \ln z, & \alpha = 0 \end{cases}$$

By using (18), we obtain the following derivatives:

$$\frac{\partial D_{AB}^{(\alpha,\beta)}}{\partial w_{fk}} = -\sum_{t} \hat{y}_{ft}^{\,\alpha+\beta-1}\, \ln_{1-\alpha}\!\left( \frac{y_{ft}}{\hat{y}_{ft}} \right) h_{kt} \qquad (20)$$

$$\frac{\partial D_{AB}^{(\alpha,\beta)}}{\partial h_{kt}} = -\sum_{f} w_{fk}\, \hat{y}_{ft}^{\,\alpha+\beta-1}\, \ln_{1-\alpha}\!\left( \frac{y_{ft}}{\hat{y}_{ft}} \right) \qquad (21)$$

The previous equations can be written in the following matrix form:

$$\nabla_{\mathbf{W}} D_{AB}^{(\alpha,\beta)} = -\left( \hat{\mathbf{Y}}^{.(\alpha+\beta-1)} \odot \ln_{1-\alpha}\!\left( \mathbf{Y} \oslash \hat{\mathbf{Y}} \right) \right) \mathbf{H}^{T} \qquad (22)$$

$$\nabla_{\mathbf{H}} D_{AB}^{(\alpha,\beta)} = -\mathbf{W}^{T} \left( \hat{\mathbf{Y}}^{.(\alpha+\beta-1)} \odot \ln_{1-\alpha}\!\left( \mathbf{Y} \oslash \hat{\mathbf{Y}} \right) \right) \qquad (23)$$

So the multiplicative update rules for $\mathbf{W}$ and $\mathbf{H}$ in matrix form are

$$\mathbf{W} \leftarrow \mathbf{W} \odot \left( \frac{ \left( \hat{\mathbf{Y}}^{.(\alpha+\beta-1)} \odot \left( \mathbf{Y} \oslash \hat{\mathbf{Y}} \right)^{.\alpha} \right) \mathbf{H}^{T} }{ \hat{\mathbf{Y}}^{.(\alpha+\beta-1)}\, \mathbf{H}^{T} } \right)^{.1/\alpha} \qquad (24)$$

$$\mathbf{H} \leftarrow \mathbf{H} \odot \left( \frac{ \mathbf{W}^{T} \left( \hat{\mathbf{Y}}^{.(\alpha+\beta-1)} \odot \left( \mathbf{Y} \oslash \hat{\mathbf{Y}} \right)^{.\alpha} \right) }{ \mathbf{W}^{T}\, \hat{\mathbf{Y}}^{.(\alpha+\beta-1)} } \right)^{.1/\alpha} \qquad (25)$$

where $\odot$ and $\oslash$ denote element-wise multiplication and division, the fraction is element-wise, and $(\cdot)^{.x}$ denotes element-wise exponentiation.
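A compact MATLAB sketch of the multiplicative updates (24)-(25), written directly from the matrix forms above; Y is the nonnegative cochleagram (or spectrogram) being factorized, and the settings and the small safeguarding constant are illustrative. This is a sketch of the generic updates under the stated assumptions, not the authors' exact implementation.

% Multiplicative alpha-beta NMF updates for Y ~ W*H (cf. (24)-(25)); assumes alpha ~= 0
alpha = 0.8;  beta = 0.4;  K = 20;  maxIter = 300;  epsSafe = 1e-12;   % illustrative
[F, T] = size(Y);
W = rand(F, K);  H = rand(K, T);
for it = 1:maxIter
    Yhat = W * H + epsSafe;
    Z = (Yhat.^(alpha + beta - 1)) .* ((Y ./ Yhat).^alpha);            % numerator term of (24)
    W = W .* ( (Z * H') ./ (Yhat.^(alpha + beta - 1) * H' + epsSafe) ).^(1/alpha);
    Yhat = W * H + epsSafe;                                            % refresh the model
    Z = (Yhat.^(alpha + beta - 1)) .* ((Y ./ Yhat).^alpha);            % numerator term of (25)
    H = H .* ( (W' * Z) ./ (W' * Yhat.^(alpha + beta - 1) + epsSafe) ).^(1/alpha);
end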

Finally, we summarize our algorithm, which we call αβ-FSC, as follows.

Algorithm 2: αβ-FSC
Input: mixture signal
Output: estimated sources, j = 1, 2
1. Compute the cochleagram.
2. [W, H] ← αβ-NMF (Algorithm 3).
3. Estimate the prior factors from the coefficient matrix H.
4. Run Factorial SC (Algorithm 1) with the cochleagram, the dictionary W and the estimated priors, replacing line 11 by the αβ multiplicative update (25).


Algorithm 3: αβ-NMF
1: Initialize the dictionary randomly
2: for i = 1 : I do
3:     sparsely code the data with the current dictionary using αβ-NMP
4:     for k = 1 : K do
5:         αβ multiplicative update of the k-th atom (cf. (24))
6:         normalize the atom: w_k ← w_k / ||w_k||, k = 1, …, K
7:         αβ multiplicative update of the corresponding coefficients (cf. (25))
8:     end for
9: end for

Algorithm 4: αβ-NMP
1: initialize the residual with the input frame
2: initialize the index set of selected atoms as empty
3: initialize the coefficient vector
4: for k = 1 : K do
5:     compute the correlations of the atoms with the residual
6:     …
7:     …
8:     if the largest correlation ≤ 0 then
9:         Terminate
10:    end if
11:    append the selected atom index
12:    append the corresponding coefficient
13:    for j = 1 : J do
14:        αβ multiplicative update of the coefficients (cf. (25))
15:    end for
16:    update the residual
17: end for
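Because the listing of Algorithm 4 is only partially legible above, the following MATLAB sketch gives one plausible reading of a non-negative matching pursuit pass: atoms are greedily selected by their correlation with the residual, and the coefficients of the selected atoms are refined with a few multiplicative updates. The stopping rule, the initialization of new coefficients and the number of refinement iterations are assumptions.

% Non-negative matching pursuit for one frame y (F x 1) with dictionary W (F x D)
Kmax = 10;                                 % maximum number of atoms (illustrative)
u = [];  h = [];  r = y;                   % selected indices, coefficients, residual
for k = 1:Kmax
    c = W' * r;                            % correlations of all atoms with the residual
    c(u) = -inf;                           % never reselect an already chosen atom
    [cmax, idx] = max(c);
    if cmax <= 0, break; end               % no positively correlated atom left: terminate
    u = [u, idx];                          % grow the support
    Wu = W(:, u);
    h = [h; cmax];                         % crude initialization of the new coefficient
    for j = 1:20                           % refine all coefficients with multiplicative updates
        h = h .* (Wu' * y) ./ (Wu' * (Wu * h) + 1e-9);
    end
    r = y - Wu * h;                        % update the residual
end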

    RESULTS AND ANALYSIS

    Experiment Setup

The proposed method is tested by separating music and speech sources. Several experimental simulations under different conditions have been designed to investigate the efficacy of the proposed method. MATLAB is used as the programming platform. For mixture generation, two speakers, one male and one female, were selected from the TIMIT speech database (www.ldc.upenn.edu/Catalog/LDC93S1.html) and the music signals were selected from the RWC database (http://staff.aist.go.jp). Some mixtures are sampled at a 16 kHz sampling rate and others at 8 kHz. We compare our algorithm αβ-FSC with the MMSS (Li et al 2009), SNMF2D and Factorial SC algorithms. The TF representation for Factorial SC and MMSS is computed by normalizing the time-domain signal to unit power and computing the STFT using a 1024-point Hamming window FFT with 50% overlap. For SNMF2D, the frequency axis of the obtained spectrogram is logarithmically scaled and grouped into 175 frequency bins in the range of 50 Hz to 8 kHz with 24 bins per octave. For αβ-FSC, the cochleagram is based on a gammatone filterbank of 128 channels (filter order of 4) and the output is divided into 20-ms time frames with 50% overlap between consecutive frames. In all cases, the sources are mixed with equal average power over the duration of the signals. Two types of mixtures are used: a mixture of music and speech, and a mixture of different kinds of music.
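For reproducibility, the STFT front-end described above can be obtained in MATLAB roughly as follows; x and fs are placeholders for a mixture signal and its sampling rate, and spectrogram is the Signal Processing Toolbox routine.

% 1024-point Hamming-window STFT with 50% overlap, after normalizing to unit power
x = x / sqrt(mean(x.^2));                              % unit-power normalization
win = hamming(1024, 'periodic');
[X, f, t] = spectrogram(x, win, 512, 1024, fs);        % complex STFT (F x T)
V = abs(X).^2;                                         % power spectrogram used for factorization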


    Measure of Performance

We have evaluated our separation performance in terms of the signal-to-distortion ratio (SDR), which is one form of perceptual measure. This is a global measure that unifies the source-to-interference ratio (SIR), source-to-artifacts ratio

    (SAR), and source-to-noise ratio (SNR). MATLAB routines for computing these criteria are obtained from the SiSEC08

    webpage (Vincent et al 2008, Vincent et al 2005).
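With the BSS Eval MATLAB routines from the SiSEC page on the path, the criteria can be computed as sketched below; shat and s are placeholder matrices holding the estimated and reference sources, one source per row.

% shat: J x n estimated sources, s: J x n true sources (BSS Eval toolbox, Vincent et al 2005)
[SDR, SIR, SAR, perm] = bss_eval_sources(shat, s);
% perm gives the permutation that best aligns the estimates with the references.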

    Analysis of Results

Figure 3 shows the time-domain waveforms of the original male and female speech and the mixture of the two sources; Figure 4 shows the cochleagrams of the two sources and their mixture. Figure 5 further shows the separation results in the cochleagram. The plots clearly show that the spectral energy of the two audio sources is clustered at different frequencies in the cochleagram due to their different fundamental frequencies. These prominent features have been separated using the proposed αβ-FSC algorithm. Figure 6 shows the final recovered time-domain sources.

To further analyse the performance of all the above matrix factorization methods in separating the mixed signal and capturing the TF patterns of the sources, the cochleagram of each recovered source has been plotted in Figure 5. In Figure 5, panels (a)-(b), (c)-(d), (e)-(f) and (g)-(h) denote the recovered cochleagrams of the female and male speech obtained by the Factorial SC, MMSS, SNMF2D and αβ-FSC algorithms, respectively. In particular, panels (c)-(d) show that the MMSS algorithm cannot obtain a good reconstruction of the sources. SNMF2D gives a better estimation than MMSS. On the other hand, both the Factorial SC and αβ-FSC algorithms exhibit good reconstruction of the female as well as the male speech. However, the Factorial SC algorithm fails to identify several missing components, as indicated in the red boxed area of panels (a)-(b). Hence, less accuracy is obtained in the estimation of the male speech compared with the αβ-FSC algorithm, which has successfully estimated both sources with high accuracy.

Table 1 shows the comparison of the proposed algorithm (αβ-FSC), based on the cochleagram, with the MMSS, SNMF2D and Factorial SC algorithms. It is noted that MMSS gives poor results and SNMF2D is better than MMSS but worse than the other algorithms. Both the Factorial SC and αβ-FSC algorithms exhibit a good reconstruction in terms of SDR, SIR and SAR. However, the resulting factorizations are not equivalent.

The major reason for the discrepancy between them is that the spectrogram fails to infer the dominating source. This leads to a high degree of ambiguity in the TF domain and causes a lack of uniqueness in extracting the spectral-temporal features of the sources. The cochleagram makes the mixed signal more separable and thereby reduces the mixing ambiguity between the source spectrograms |S1| and |S2|. This explains why the performance of separating the mixture of music and a female utterance is the highest among all the mixtures: both sources have very distinguishable TF patterns in the cochleagram.

In summary, all the results in Table 1 and Figure 5 unanimously show the importance of using the αβ-FSC factorization algorithm in order to correctly estimate the spectral and temporal features of each source.


Figure 3: (A) Original Female Speech (B) Original Male Speech (C) Mixture of Sources

Figure 4: (A) Cochleagram of Original Female Speech (B) Cochleagram of Original Male Speech (C) Cochleagram of Mixture

Table 1: Comparison between αβ-FSC, Factorial SC, SNMF2D and MMSS (values in dB)

Mixture                            Algorithm      SDR S1     SDR S2     SAR S1     SAR S2     SIR S1     SIR S2
Female1 speech and male speech     αβ-FSC         12.7711    12.4117    13.9310    13.9303    19.2441    17.8848
                                   Factorial SC   12.6270    12.2991    13.8221    13.8214    18.9913    17.7675
                                   SNMF2D         -17.444    16.783     -17.335    42.0723    16.0476    16.7971
                                   MMSS            3.7309     7.5410     4.9557     8.0659    11.0300    17.6080
Female1 speech and female speech   αβ-FSC         13.7165    13.6159    14.7072    14.7966    21.8654    19.9908
                                   Factorial SC   12.8072    11.9564    13.4305    13.8704    20.7392    16.6114
                                   SNMF2D         -19.962    11.9852     9.2016    12.0191    -19.464    33.3438
                                   MMSS            5.1450     5.3637     6.3621     6.7431    13.1413    11.0951
Music and music                    αβ-FSC         17.4555    17.9902    18.3730    18.9706    24.7368    23.6153
                                   Factorial SC   16.4991    17.3420    17.3343    17.6024    23.1495    21.3892
                                   SNMF2D         -24.925    18.2986    15.5392    18.3553    -24.805    37.2287
                                   MMSS           10.7776    -8.1304    -4.8884    11.0298    23.5920     0.7689
Music and female speech            αβ-FSC         14.1972    15.3901    15.1863    15.9172    21.2375    19.8083
                                   Factorial SC   14.3782    13.6053    14.4900    14.1660    20.9607    18.9371
                                   SNMF2D         -17.5836    8.8261     9.7609     8.8630   -17.1394    30.0781
                                   MMSS            6.5059     9.3063     7.3610     9.3634    14.7163    28.6164


Figure 5: Separation Results: (a)-(b), (c)-(d), (e)-(f) and (g)-(h) Denote the Recovered Female and Male Speech in the Cochleagram Obtained by the Factorial SC, MMSS, SNMF2D and αβ-FSC Algorithms, Respectively

Figure 6: Time-Domain Separation of the Sources: (a)-(b) Factorial SC (c)-(d) MMSS (e)-(f) SNMF2D (g)-(h) αβ-FSC

    CONCLUSIONS

In this paper we proposed a separation framework using the gammatone filterbank, which produces a non-uniform TF domain, termed the cochleagram, in which each TF unit has a different resolution, unlike the classical spectrogram, which has only uniform resolution. It is shown that the mixed signal is significantly more separable in the cochleagram than in the classic spectrogram and the log-frequency spectrogram (constant-Q transform).

A family of novel αβ-divergence based two-dimensional nonnegative matrix factorization algorithms has also been developed to extract the spectral and temporal features of the sources. The proposed factorizations are scale invariant, whereby the lower-energy components in the cochleagram are treated with the same importance as the higher-energy components. Within the context of SCBSS, this property is highly desirable, as it enables the spectral-temporal features of the sources, which are usually characterized by a large dynamic range of energy, to be estimated with significantly higher accuracy. This is to be contrasted with matrix factorization based on the LS distance or the KL divergence, where both methods favor the high-energy components but neglect the low-energy components.

In the comparison of the αβ-FSC and SNMF2D algorithms, the proposed αβ-FSC obtains the best separation performance. The impetus behind this work is that the sparseness achieved by the conventional NMF, SNMF, NMF2D and SNMF2D is not efficient enough; in source separation it is necessary to have explicit control over the degree of sparseness for each temporal code.

    REFERENCES

1. Brown, J. C. (1991). Calculation of a constant Q spectral transform, J. Acoust. Soc. Am., vol. 89, no. 1, pp. 425-434.
2. Brown, G. J. and Cooke, M. (1994). Computational auditory scene analysis, Computer Speech and Language, vol. 8, pp. 297-336.
3. Cichocki, A., Zdunek, R. and Phan, A. H. (2009). Nonnegative Matrix and Tensor Factorizations, John Wiley & Sons Ltd.: Chichester, UK.
4. Cichocki, A., Cruces, S. and Amari, S. (2011). Generalized Alpha-Beta divergences and their application to robust nonnegative matrix factorization, Entropy, 13, pp. 134-170.
5. Cichocki, A. and Amari, S. (2010). Families of Alpha- Beta- and Gamma-divergences: Flexible and robust measures of similarities, Entropy, 12, pp. 1532-1568.
6. Gao, B., Woo, W. L. and Dlay, S. S. (2011). Single channel source separation using EMD-subband variable regularized sparse features, IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 4, pp. 961-976.
7. http://staff.aist.go.jp
8. Hu, G. and Wang, D. L. (2007). Auditory segmentation based on onset and offset analysis, IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 2, pp. 396-405.
9. Hu, G. and Wang, D. L. (2004). Monaural speech segregation based on pitch tracking and amplitude modulation, IEEE Trans. Neural Networks, vol. 15, no. 5, pp. 1135-1150.
10. Hyvarinen, A., Karhunen, J. and Oja, E. (2001). Independent Component Analysis, John Wiley & Sons.
11. Jin, Z. and Wang, D. L. (2009). A supervised learning approach to monaural segregation of reverberant speech, IEEE Trans. on Audio, Speech and Language Processing, vol. 17, pp. 625-638.
12. Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems, vol. 13, pp. 556-562.
13. Li, Y., Woodruff, J. and Wang, D. L. (2009). Monaural musical sound separation based on pitch and common amplitude modulation, IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, pp. 1361-1371.
14. Peharz, R. (2010). Single channel source separation using dictionary design methods for sparse coders, Master's thesis, Graz University of Technology.
15. Roweis, S. (2003). Factorial models and refiltering for speech separation and denoising, in EUROSPEECH, pp. 1009-1012.


16. Schmidt, M. N. and Morup, M. (2006). Nonnegative matrix factor 2-D deconvolution for blind single channel source separation, in Proc. Int. Conf. Ind. Compon. Anal. Blind Signal Separat. (ICABSS'06), Charleston, SC, vol. 3889, pp. 700-707.
17. Vincent, E. and Araki, S. (2008). Signal Separation Evaluation Campaign (SiSEC 2008). [Online]. Available: http://sisec.wiki.irisa.fr.
18. Vincent, E., Gribonval, R. and Fevotte, C. (2005). Performance measurement in blind audio source separation, IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462-1469.
19. www.ldc.upenn.edu/Catalog/LDC93S1.html
20. Wang, D. and Brown, G. J. (2006). Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, IEEE Press / John Wiley and Sons Ltd.
