
Methodology for Speaker Identification and Recognition System

Ade-Bello Abdul-Jelili

Department of Electrical and Computer Engineering University of New Mexico E-mail: [email protected]

ABSTRACT

Speech recognition is an area of research which deals with the recognition of speech by machine under a variety of conditions, while speaker recognition is the computational task of validating a person's identity based on their voice. These systems perform well under restricted conditions (a quiet environment), but their performance degrades in noisy environments. This paper presents a brief survey on automatically separating a speaker's signal from noisy, multiple speech signals. In general, approaches to this problem extract a small set of features from the input signals; these features are carefully chosen to emphasize signal characteristics that differ between individual speakers. The two phases of a speaker recognition system are the enrolment phase, where speech samples from the different speakers are turned into models, and the verification phase, where a sample of speech is tested to determine whether it matches a proposed speaker. The criteria for designing a speech recognition system are the pre-processing filter, feature extraction techniques, speech classifiers, database, and performance evaluation. The objective of this paper is to summarize and explain well-known methods for capturing the characteristics of these signals, as used by the majority of state-of-the-art approaches. Mel-Frequency Cepstral Coefficients (MFCCs) are used to achieve effective and robust extraction of speech features capable of operating in a noisy environment. Hidden Markov Models (HMMs) are intended for the recognition stage, as they give better recognition of the speaker's features and provide a generative model that defines a score-space used as features for the most recent advancement in speech recognition: discriminative classifiers.

Keywords –– Robust Speech Recognition, Speaker Identification, Ambient Noise, Feature Extraction, Generative Kernels, Discriminative Classifiers.

INTRODUCTION

Recent data on mobile phone users all over the world and the number of telephone landlines in operation confirm that voice is the most accessible biometric, as no extra acquisition device or transmission system is needed. This fact gives voice an advantage over other biometrics, especially when remote users or systems are taken into account [1]. Speaker recognition is the process of automatically recognizing who is speaking based on individual information included in the speech waves. Any utterance contains information about the words being spoken and also about the identity of the speaker. In a speech recognition system we wish to select the first type of information and ignore the second; in a speaker recognition system we wish to do just the opposite. The speaker's voice can be used to verify the identity of a person and allow access to services such as banking by telephone, database access services, security control for confidential information areas and remote access to computers [2]. Speaker recognition systems are classified as text-dependent and text-independent. Text-dependent systems require a user to pronounce specified utterances containing the same text as the training data, whereas text-independent systems place no limitation on the text used. The most important parts of a speaker recognition system are the feature extraction and the recognition methods. The feature extraction step converts the properties of the signal which are important for the pattern recognition task into a format that simplifies the distinction of the classes. The recognition step aims to estimate the general extension of the classes within the feature space from a training set [2]. For feature extraction, Linear Predictive Coding (LPC) of speech has proved to be a valid way to compress the spectral envelope in an all-pole model [1]. Most speech recognition systems use Mel-Frequency Cepstral Coefficients (MFCCs) and their first, and sometimes second, derivative in time to better reflect dynamic changes [3]. For the recognition step, the problem of text-dependent speaker recognition is one of comparing a sequence of feature vectors to a model of the user. For this comparison, two methods have been widely used: template-based methods and statistical methods. Dynamic Time Warping (DTW) is one of the most widely used template-based methods [1]. Statistical methods, and in particular Hidden Markov Models (HMMs), tend to be used more often than template-based methods; they provide more flexibility, allow the use of speech units from sub-phoneme units to words, and enable the design of text-prompted systems [4].

In this paper, a feature extraction algorithm for speech signals is described. This algorithm extracts the MFCCs for the feature extraction stage, capturing the characteristics of the individual speakers, and the approximation and detail features are calculated. Based on this mechanism, the multi-resolution features of the speech signal can easily be extracted by calculating the related coefficients and frame energy levels. Using the MFCC features of a signal, classification can be done using various approaches: stochastic approaches, Dynamic Time Warping (DTW), vector quantization (VQ) and pattern recognition approaches using Hidden Markov Models (HMMs). HMMs have been the most widely used approach for the identification stage for decades [9]. An introduction to combining HMMs with the most recent developments in speech recognition [5], using HMM features to build generative models that define score-spaces, will also be given. The vast majority of research on speech recognition has concentrated on improving the performance of systems based on hidden Markov models (HMMs). These are an example of a generative model and are currently used in state-of-the-art speech recognition systems. A large number of approaches have been developed to improve the performance of these systems under changes of speaker and noise. Despite these approaches, systems are not sufficiently robust to allow speech recognition to achieve the level of impact that the naturalness of the interface should allow.
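As an illustration of the dynamic (derivative) MFCC features mentioned earlier in this section, the following is a minimal MATLAB sketch of the standard delta-coefficient regression applied to a matrix of MFCC frames; the half-window length L, the placeholder data and the variable names are illustrative assumptions, not taken from this paper.

% Minimal sketch (assumed window length and placeholder data): delta coefficients
% appended to a matrix of MFCC frames C, one frame per row.
C = randn(100, 13);                  % placeholder MFCC matrix for illustration
L = 2;                               % regression half-window (a common choice)
nFrames = size(C, 1);
D = zeros(size(C));
den = 2 * sum((1:L).^2);             % normalising term of the regression
for t = 1:nFrames
    for k = 1:L
        tp = min(t + k, nFrames);    % clamp frame indices at the utterance edges
        tm = max(t - k, 1);
        D(t, :) = D(t, :) + k * (C(tp, :) - C(tm, :));
    end
end
D = D / den;
features = [C D];                    % static plus delta (dynamic) features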

Figure 1 Flow diagram of the components of a state-of-the-art speech recogniser [5]

Figure 1 gives a schematic overview of the approach; the shaded part of the diagram indicates the generative model of a state-of-the-art speech recogniser. In this project, the generative models are used to define a score-space. These scores then form features for the discriminative classifiers. This approach has a number of advantages. It is possible to use current state-of-the-art adaptation and robustness approaches to compensate the acoustic models for particular speakers and noise conditions. As well as enabling any advances in these approaches to be incorporated into the scheme, it removes the need to develop approaches that adapt the discriminative classifiers to speakers, style and noise. Using generative models also allows the dynamic aspects of speech data to be handled without having to alter the discriminative classifier [5]. The final advantage is the nature of the score-space obtained from the generative model. Generative models such as HMMs have underlying conditional independence assumptions that, whilst enabling them to represent data sequences efficiently, do not accurately capture the dependencies in data sequences such as speech. The score-space associated with a generative model does not have the same conditional independence assumptions as the original generative model, which allows more accurate modelling of the dependencies in the speech data.

The rest of this paper is organized as follows. Section 2 describes the proposed feature extraction technique and provides a detailed description of each constituent part. Section 3 introduces the recognition techniques intended to be used. The experiments and the results obtained are given in Section 4. Concluding remarks are given in Section 5.

APPROACH

Rabiner and Juang [6] have classified speaker recognition into two main areas: verification and identification. Speaker verification is the process of accepting or rejecting the identity claim of a speaker; it authenticates that a person is who he or she claims to be. This technology can be used as a biometric for verifying the identity of a person in applications such as banking by telephone and voice mail, and for identifying speakers in a classroom setting. Speaker identification is the process of determining the identity of the unknown speaker who produced a given utterance. In this project, a simple speaker verification system will be implemented using Mel-Frequency Cepstral Coefficients (MFCCs) as the features used to create feature vectors for the HMMs. The mean components of the HMMs will be concatenated into super-vectors, used as generative models that define score-spaces for discriminative classifiers to verify whether a test utterance matches a proposed speaker. HMMs will not be discussed in this paper, but a detailed step-by-step extraction of information from the speech signal will be described, as shown in Figure 2. A toy illustration of the super-vector and score-space idea is sketched below.
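As a toy illustration only (not the HMM-based system described above), the following MATLAB sketch stacks the mean vectors of two simple per-speaker Gaussian models into super-vectors and uses the log-likelihood difference between the two models as a one-dimensional score feature for a threshold decision; all data, model sizes and variable names are assumptions.

% Toy sketch (assumed data and models); requires MATLAB R2016b+ for implicit expansion.
rng(0);
d = 13;                                   % feature dimension (e.g. 13 MFCCs)
muA = [zeros(1,d); 0.5*ones(1,d)];        % speaker A model: two state means
muB = [0.3*ones(1,d); -0.5*ones(1,d)];    % speaker B model: two state means
svA = muA'; svA = svA(:);                 % super-vector: stacked means of model A
svB = muB'; svB = svB(:);                 % super-vector: stacked means of model B (fixed-length features)

% Log-likelihood (up to a constant) of a frame sequence X (nFrames x d) under a
% model with unit-variance Gaussian states, taking the best state per frame.
frameLL = @(X, mu) max(-0.5*(sum(X.^2,2) + sum(mu.^2,2)' - 2*X*mu'), [], 2);
seqLL   = @(X, mu) sum(frameLL(X, mu));

Xtest  = 0.5 + 0.1*randn(50, d);          % synthetic test utterance (close to model A)
score  = seqLL(Xtest, muA) - seqLL(Xtest, muB);   % score-space feature (log-likelihood ratio)
accept = score > 0;                       % trivial discriminative decision (threshold at 0)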

Figure 2 Flow diagram of MFCCs Feature Extraction

Pre-processing

This is the first step in creating the feature vectors. The objective of pre-processing is to modify the speech signal, x(n), so that it is more suitable for the feature extraction analysis. The pre-processing operations comprise noise cancelling and pre-emphasis. The first thing to consider is whether the speech x(n) is corrupted by some noise d(n), for example

x(n) = s(n) + d(n),

where s(n) is the signal of interest. The two most common methods of noise reduction are spectral subtraction and adaptive noise cancellation.

Pre-emphasis

The pre-emphasizer is used to spectrally flatten the speech signal [6]. This is usually done with a high-pass filter. The most commonly used filter for this step is the first-order FIR filter

y(n) = x(n) − a·x(n − 1),  0.9 ≤ a ≤ 1.0,

where x(n) is the input signal and y(n) is the output of the pre-emphasis. The digitised speech signal is put through this low-order digital system to make it less susceptible to finite-precision effects later in the signal processing.

Frame blocking

In this step the pre-emphasised speech signal is blocked into frames of N samples, with adjacent frames separated by M samples. The first frame consists of the first N speech samples; the second frame begins M samples after the first frame and overlaps it by N − M samples. Similarly, the third frame begins 2M samples after the first frame and overlaps it by N − 2M samples. This process continues until all the speech is accounted for within one or more frames.

Windowing

This step windows each individual frame [6] so as to minimise the signal discontinuities at the beginning and end of the frame. The result of windowing is

y_l(n) = x_l(n)·w(n),  0 ≤ n ≤ N − 1,

where x_l(n) is the l-th frame and y_l(n) is the output of the window. A typical window used here is the Hamming window, defined over the interval 0 ≤ n ≤ N − 1 as

w(n) = 0.54 − 0.46·cos( 2πn / (N − 1) ).

Feature Extraction

The next step is to extract the relevant information from the speech blocks using the mel-cepstrum method [11]. These measures provide a good model of the speech signal and lead to a fairly good representation of the vocal tract characteristics; other measures that can be added to the feature vectors are the energy measures [10]. This step comprises the FFT, mel-filtering, a non-linear transformation and the cepstral coefficients that together result in the MFCCs.

Fast Fourier Transform

The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples as

X(k) = Σ_{n=0..N−1} x(n)·e^(−j2πkn/N),  k = 0, 1, …, N − 1.

In general the X(k) are complex numbers and only their absolute values (frequency magnitudes) are considered. The resulting sequence {X(k)} is interpreted as follows: the values k = 0, 1, …, N/2 − 1 correspond to the positive frequencies 0 ≤ f < Fs/2, while the values k = N/2 + 1, …, N − 1 correspond to the negative frequencies −Fs/2 < f < 0. Here, Fs denotes the sampling frequency. The result after this step is often referred to as the spectrum or periodogram.

Mel-filtering

The low-frequency components of the magnitude spectrum are ignored; the useful frequency band lies between 64 Hz and half of the actual sampling frequency. This band is divided into 23 channels, equidistant in the mel-frequency domain. Each channel has a triangular-shaped frequency window, and consecutive channels are half-overlapping. The choice of 64 Hz as the starting frequency of the filter bank roughly corresponds to the case where the full frequency band is divided into 24 channels and the first channel is discarded, for any of the three possible sampling frequencies. The output of each mel filter is the weighted sum of the FFT magnitude spectrum values in its band; triangular, half-overlapping windowing is used, as sketched below.
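As a rough illustration of the front-end steps described so far (pre-emphasis, framing, windowing, FFT and mel filtering), the following plain-MATLAB sketch can be used; the pre-emphasis coefficient, the placeholder signal and the filter bank construction details are assumptions and differ from the VOICEBOX implementation reproduced in the appendix.

% Minimal front-end sketch (assumed frame sizes and filter design details).
fs  = 8000;                                  % sampling frequency (Hz)
x   = randn(1, fs);                          % placeholder signal; use a real utterance here
a   = 0.97;                                  % pre-emphasis coefficient (assumed value)
y   = filter([1 -a], 1, x);                  % pre-emphasis: y(n) = x(n) - a*x(n-1)

N   = 256; M = 100;                          % frame length and frame shift (samples)
w   = 0.54 - 0.46*cos(2*pi*(0:N-1)/(N-1));   % Hamming window
nF  = floor((length(y) - N)/M) + 1;          % number of complete frames
mag = zeros(N/2+1, nF);                      % magnitude spectra, one column per frame
for l = 1:nF
    frame = y((l-1)*M + (1:N)) .* w;         % frame blocking + windowing
    X = fft(frame, N);                       % N-point FFT
    mag(:, l) = abs(X(1:N/2+1)).';           % keep non-negative frequencies only
end

% 23-channel triangular mel filter bank between 64 Hz and fs/2.
mel   = @(f) 2595*log10(1 + f/700);          % Hz -> mel
imel  = @(m) 700*(10.^(m/2595) - 1);         % mel -> Hz
edges = imel(linspace(mel(64), mel(fs/2), 23+2));   % 25 band edges in Hz
bins  = (0:N/2)*fs/N;                        % FFT bin centre frequencies
H = zeros(23, N/2+1);
for k = 1:23
    lo = edges(k); cf = edges(k+1); hi = edges(k+2);
    H(k,:) = max(0, min((bins - lo)/(cf - lo), (hi - bins)/(hi - cf)));  % triangle
end
fbank = H * mag;                             % mel filter bank outputs, one column per frame

The matrix fbank then feeds the logarithm and DCT steps described next.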

Non-Linear Transformation

The output of the mel filtering is subjected to a logarithm function (the natural logarithm). The same flooring is applied as in the case of the energy calculation; that is, the log filter bank outputs cannot be smaller than −50.

Cepstrum

In this final step, the log mel-spectrum is converted back to time. The result is called the mel-frequency cepstral coefficients (MFCCs). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel-spectrum coefficients (and so their logarithm) are real numbers, they can be converted to the time domain using the Discrete Cosine Transform (DCT). Denoting the log mel filter bank outputs from the last step by f_j, j = 1, …, 23, the cepstral coefficients are

C_i = Σ_{j=1..23} f_j · cos( π·i·(j − 0.5) / 23 ),  i = 0, 1, …, 12.
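Continuing the sketch above (the variable fbank, the 23 channels and the −50 floor are as described in the text; everything else is an illustrative assumption), the log compression and DCT steps can be written as:

% Log compression with the -50 floor, then DCT to 13 cepstral coefficients per frame.
logfb = max(log(fbank), -50);                % natural log of the filter bank outputs
[K, nF] = size(logfb);                       % K = 23 mel channels
i = (0:12)';                                 % cepstral indices C0..C12
j = 1:K;
D = cos(pi * i * (j - 0.5) / K);             % 13 x 23 cosine basis
mfcc = D * logfb;                            % 13 cepstral coefficients per frame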

Energy measure

The logarithmic frame energy measure (logE) is computed after the offset compensation filtering and framing, for each frame:

logE = ln( Σ_{i=1..M} s_of(i)² ),

where M is the frame length and s_of(i) is the offset-free input signal. A floor is used in the energy calculation which makes sure that the result is not less than −50; the floor value (the lower limit for the argument of ln) is approximately 2e-22.

Front-end output

The final feature vector consists of 14 coefficients: the log-energy coefficient and the 13 cepstral coefficients C0 to C12. The C0 coefficient is often redundant when the log-energy coefficient is used; however, the feature extraction algorithm is defined here for both the log energy and C0. Depending on the application, either the C0 coefficient or the log-energy coefficient may be used.

EXPERIMENTS AND RESULTS

Experimental Setup

The database contains the speech data files of 8 speakers: 4 males and 4 females. The speech files consist of the isolated English word "Zero". Each speaker repeats the word twice; one of the utterances is used for training and one for testing. The data were recorded using a microphone, and all samples are stored as Microsoft wave format files with an 8000 Hz sampling rate, 16-bit PCM, mono channel. [ref] Using the MFCC feature extraction technique, the algorithm was applied to the voice signal after mixing four different speakers together to mimic a classroom setting, yielding approximation and detail channels at different levels; the MFCCs are used to extract features from these channels. The number of extracted speech features is proportional to the number of decomposition levels. Although more decomposition processes can obtain more information from the input signals, the computational complexity and the number of useless features increase greatly. The mel filter bank is designed with 23 frequency bands. In the calculation of all the features, the speech signal is partitioned into frames; the frame size of the analysis is 256 samples with 100 samples of overlap. These results were obtained using Matlab. For each unknown person who is to be recognized, features are extracted from his or her voice sample. HMMs were proposed to be used for recognition with the proposed feature extraction technique. In order to evaluate the performance of the proposed method in a noisy environment, the test patterns for four utterances are corrupted by adding white Gaussian noise to the original signal.

Simulations/Results

For the simulations, the VOICEBOX toolbox [ref] for signal processing was used, which contains several tools in MATLAB. Multiple speakers were first combined into a single audio file, the output sound was played, and Gaussian noise was then added [8]. A minimal sketch of this corruption step is given below.
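A minimal sketch of the noise-corruption test, assuming a hypothetical mixed-speaker file, the noise level sigma = 0.02 used in the appendix script, and the VOICEBOX melcepst function (reproduced in the appendix) with its default parameters:

% Sketch of the noise-corruption test (assumed file name and noise level).
[s, fs] = audioread('mixed_speakers.wav');   % hypothetical file (wavread in older MATLAB)
sigma   = 0.02;                              % noise standard deviation, as in the appendix script
noisy   = s + sigma*randn(size(s));          % additive white Gaussian noise
c_clean = melcepst(s, fs);                   % VOICEBOX melcepst, default parameters
c_noisy = melcepst(noisy, fs);               % features of the noise-corrupted mixture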

Figure 3 Time Domain Plots for each speaker

[Figure 3 panels (amplitude vs. time, 0 to 1.4 s): Time domain Plot for speaker-1; Time domain Plot for speaker-2.]

The signals for two speakers and their combination were plotted, and the function melcepst.m from the toolbox was applied to divide the speech signal into the corresponding number of frames, which depends on the sampling rate (fs) of the signal itself, with 110 overlapping samples.

Figure 4 Time domain plots for multiple speakers with added noise

A Hamming window of length 256 was applied, with an FFT of length 256 samples at 8 kHz. The starting frequency of the filter bank was then set to 64 Hz and the full frequency band was divided into 24 channels to obtain the 13 MFCC coefficients after taking the logarithm of the filter outputs. From the database [7], the utterances of speaker one and speaker two were mixed together, as shown in the figure below, and then mixed with noise to mimic a classroom environment before the MFCC algorithm was applied to extract the 13 cepstral coefficients that can be used by a decision-matching algorithm for speaker identification. The corresponding call to the toolbox function is sketched below.
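For reference, a melcepst call matching the parameters described above might look as follows; the frame increment of 146 samples (256 minus the 110-sample overlap mentioned earlier) and the signal variable x are assumptions:

% melcepst call with the parameters described in the text (x and the increment assumed):
% 'M0' - Hamming window and include the 0th cepstral coefficient (13 coefficients in total)
% 12   - cepstral coefficients excluding the 0th;  24 - mel filter bank channels
% 256  - frame length in samples;  146 - frame increment (256 - 110 overlapping samples)
% 64/8000 and 0.5 - filter bank limits of 64 Hz and fs/2, expressed as fractions of fs
c = melcepst(x, 8000, 'M0', 12, 24, 256, 146, 64/8000, 0.5);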

CONCLUSION AND FUTURE WORK

Most state-of-the-art approaches for speech recognition and speaker identification make large-scale use of MFCCs, including the HMM-based systems that are still widely used for most advances in the field. Areas providing opportunities for further work are: adaptation and noise-robustness designed specifically for algorithms of wide interest such as HMMs, whereby systems can be developed to adapt to various kinds of noise, and optimising the actual likelihood used for decoding to increase performance. Another area is using HMMs for score-space generation in the application of discriminative classifiers to speech recognition in adverse environments [5]. Currently, the features that are extracted are log-likelihoods and their derivatives; other features may prove valuable. Also, regularisation techniques may allow higher-order derivatives to be included without harming generalisation and speed [5]. As discussed in the introduction, classification using the most recent methods [ref], where the sophistication of log-linear models is brought together with generative models for large-vocabulary speech recognition, will be an area to look into; this will require a number of optimisations and approximations, both for training and decoding.

[MFCC output figures (axes: Time (s) vs. Mel-cepstrum coefficient): 13 cepstral coefficients of LOG output for speaker 1; 13 cepstral coefficients of LOG output for speaker 2; 13 cepstral coefficients of LOG output for multiple speakers, with SPEAKER 1 REGION and SPEAKER 2 REGION marked. Figure 4 panels (amplitude vs. time, 0 to 1.4 s): Time domain Plot for multiple speaker; Time domain Plot for noisy multiple speaker.]

REFERENCES

[1] A. K. Jain, P. Flynn, and A. A. Ross, "Handbook of Biometrics", Springer, 2008.
[2] A. O. Afolabi, A. Williams, and O. Dotun, "Development of a text dependent speaker identification security system", Research Journal of Applied Sciences, 2(6), pp. 677-684, 2007.
[3] B. Plannerer, "An introduction to speech recognition", Munich, Germany, 2005.
[4] J. P. Campbell, "Testing with the YOHO CD-ROM voice verification corpus", in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 341-344, 1995. http://www.ll.mit.edu/mission/communications/ist/publications/jpc.html
[5] M. J. F. Gales, R. C. van Dalen, J. Yang, A. Ragni, and S. X. Zhang, "Generative Kernels and Score-Spaces for Classification of Speech: Progress Reports I & II", Computer Speech and Language / Technical Report CUED/F-INFENG/TR.689, Jan. 2012 and 2013. http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html#publications
[6] L. Rabiner and B. H. Juang, "Fundamentals of Speech Recognition", Prentice Hall Signal Processing Series, 1993.
[7] L. Rabiner, data acquired for speakers uttering speeches, http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/speech%20course.html
[8] H. Hirsch (Niederrhein University of Applied Sciences) and D. Pearce (Motorola Labs), noise simulation tool, http://dnt.kr.hsnr.de/wwwsim/wwwsim.php
[9] AURORA Project (European Language Resources Association), http://aurora.hsnr.de/
[10] C. Y. Fook, M. Hariharam, S. Yaacob, and A. H. Adom, "A Review: Malay Speech Recognition and Audio Visual Speech Recognition", International Conference on Biomedical Engineering (ICoBE), Penang, 2012.
[11] R. T. Patil and S. S. Kumbhar (T.K.I.E.T. Waranagar), "Robust Speaker Identification using CFCC", International Journal of Engineering Science Invention, Volume 2, Issue 7, July 2013. www.ijesi.org

APPENDIX

function [f,t,w]=enframe(x,win,inc,m)
%ENFRAME split signal up into (overlapping) frames: one per row. [F,T]=(X,WIN,INC)
%
% Usage:  (1) f=enframe(x,n)                          % split into frames of length n
%         (2) f=enframe(x,hamming(n,'periodic'),n/4)  % use a 75% overlapped Hamming window of length n
%         (3) frequency domain frame-based processing:
%             S=...;                              % input signal
%             OV=2;                               % overlap factor of 2 (4 is also often used)
%             INC=20;                             % set frame increment in samples
%             NW=INC*OV;                          % DFT window length
%             W=sqrt(hamming(NW,'periodic'));     % omit sqrt if OV=4
%             W=W/sqrt(sum(W(1:INC:NW).^2));      % normalize window
%             F=rfft(enframe(S,W,INC),NW,2);      % do STFT: one row per time frame, +ve frequencies only
%             ... process frames ...
%             X=overlapadd(irfft(F,NW,2),W,INC);  % reconstitute the time waveform (omit "X=" to plot waveform)
%
% Inputs:   x    input signal
%           win  window or window length in samples
%           inc  frame increment in samples
%           m    mode input:
%                'z'  zero pad to fill up final frame
%                'r'  reflect last few samples for final frame
%                'A'  calculate window times as the centre of mass
%                'E'  calculate window times as the centre of energy
%
% Outputs:  f    enframed data - one frame per row
%           t    fractional time in samples at the centre of each frame
%           w    window function used
%
% By default, the number of frames will be rounded down to the nearest
% integer and the last few samples of x() will be ignored unless its length
% is lw more than a multiple of inc. If the 'z' or 'r' options are given,
% the number of frames will instead be rounded up and no samples will be ignored.
%
% Example of frame-based processing:
%          INC=20                               % set frame increment in samples
%          NW=INC*2                             % oversample by a factor of 2 (4 is also often used)
%          S=cos((0:NW*7)*6*pi/NW);             % example input signal
%          W=sqrt(hamming(NW,'periodic'));      % sqrt hamming window of period NW
%          F=enframe(S,W,INC);                  % split into frames
%          ... process frames ...
%          X=overlapadd(F,W,INC);               % reconstitute the time waveform (omit "X=" to plot waveform)
%
% Bugs/Suggestions:
%  (1) Possible additional mode options:
%        'u'   modify window for first and last few frames to ensure WOLA
%        'a'   normalize window to give a mean of unity after overlaps
%        'e'   normalize window to give an energy of unity after overlaps
%        'wm'  use Hamming window
%        'wn'  use Hanning window
%        'x'   include all frames that include any of the x samples

% Copyright (C) Mike Brookes 1997-2012
% Version: $Id: enframe.m 3274 2013-07-23 10:07:38Z dmb $
%
% VOICEBOX is a MATLAB toolbox for speech processing.
% Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This program is free software; you can redistribute it and/or modify
% it under the terms of the GNU General Public License as published by
% the Free Software Foundation; either version 2 of the License, or
% (at your option) any later version.
%
% This program is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
% GNU General Public License for more details.
%
% You can obtain a copy of the GNU General Public License from
% http://www.gnu.org/copyleft/gpl.html or by writing to
% Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

nx=length(x(:));
if nargin<2 || isempty(win)
    win=nx;
end
if nargin<4 || isempty(m)
    m='';
end
nwin=length(win);
if nwin == 1
    lw = win;
    w = ones(1,lw);
else
    lw = nwin;
    w = win(:)';
end
if (nargin < 3) || isempty(inc)
    inc = lw;
end
nli=nx-lw+inc;
nf = fix((nli)/inc);
na=nli-inc*nf;
f=zeros(nf,lw);
indf= inc*(0:(nf-1)).';
inds = (1:lw);
f(:) = x(indf(:,ones(1,lw))+inds(ones(nf,1),:));
if nargin>3 && (any(m=='z') || any(m=='r')) && na>0
    if any(m=='r')
        ix=1+mod(nx-na:nx-na+lw-1,2*nx);
        f(nf+1,:)=x(ix+(ix>nx).*(2*nx+1-2*ix));
    else
        f(nf+1,1:na)=x(1+nx-na:nx);
    end
    nf=size(f,1);
end
if (nwin > 1)                              % if we have a non-unity window
    f = f .* w(ones(nf,1),:);
end
if nargout>1
    if any(m=='E')
        t0=sum((1:lw).*w.^2)/sum(w.^2);
    elseif any(m=='A')
        t0=sum((1:lw).*w)/sum(w);
    else
        t0=(1+lw)/2;
    end
    t=t0+inc*(0:(nf-1)).';
end

function c=melcepst(s,fs,w,nc,p,n,inc,fl,fh)
%MELCEPST Calculate the mel cepstrum of a signal C=(S,FS,W,NC,P,N,INC,FL,FH)
%
% Simple use: (1) c=melcepst(s,fs)          % calculate mel cepstrum with 12 coefs, 256 sample frames
%             (2) c=melcepst(s,fs,'e0dD')   % include log energy, 0th cepstral coef, delta and delta-delta coefs
%
% Inputs:
%     s    speech signal
%     fs   sample rate in Hz (default 11025)
%     w    mode string (see below)
%     nc   number of cepstral coefficients excluding 0'th coefficient [default 12]
%     p    number of filters in filterbank [default: floor(3*log(fs)) = approx 2.1 per octave]
%     n    length of frame in samples [default power of 2 < (0.03*fs)]
%     inc  frame increment [default n/2]
%     fl   low end of the lowest filter as a fraction of fs [default = 0]
%     fh   high end of highest filter as a fraction of fs [default = 0.5]
%
%     w    any sensible combination of the following:
%            'R'  rectangular window in time domain
%            'N'  Hanning window in time domain
%            'M'  Hamming window in time domain (default)
%
%            't'  triangular shaped filters in mel domain (default)
%            'n'  hanning shaped filters in mel domain
%            'm'  hamming shaped filters in mel domain
%
%            'p'  filters act in the power domain
%            'a'  filters act in the absolute magnitude domain (default)
%
%            '0'  include 0'th order cepstral coefficient
%            'E'  include log energy
%            'd'  include delta coefficients (dc/dt)
%            'D'  include delta-delta coefficients (d^2c/dt^2)
%
%            'z'  highest and lowest filters taper down to zero (default)
%            'y'  lowest filter remains at 1 down to 0 frequency and
%                 highest filter remains at 1 up to nyquist frequency
%
%          If 'ty' or 'ny' is specified, the total power in the fft is preserved.
%
% Outputs:  c    mel cepstrum output: one frame per row. Log energy, if requested, is the
%                first element of each row followed by the delta and then the delta-delta
%                coefficients.
%
% BUGS: (1) should have power limit as 1e-16 rather than 1e-6 (or possibly a better way of choosing this)
%           and put into VOICEBOX
%       (2) get rdct to change the data length (properly) instead of doing it explicitly (wrongly)

% Copyright (C) Mike Brookes 1997
% Version: $Id: melcepst.m 3497 2013-09-26 16:10:51Z dmb $
%
% VOICEBOX is a MATLAB toolbox for speech processing.
% Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html


if nargin<2, fs=11025; end
if nargin<3, w='M'; end
if nargin<4, nc=12; end
if nargin<5, p=floor(3*log(fs)); end
if nargin<6, n=pow2(floor(log2(0.03*fs))); end
if nargin<9
    fh=0.5;
    if nargin<8
        fl=0;
        if nargin<7
            inc=floor(n/2);
        end
    end
end

if isempty(w), w='M'; end
if any(w=='R')
    z=enframe(s,n,inc);
elseif any(w=='N')
    z=enframe(s,hanning(n),inc);
else
    z=enframe(s,hamming(n),inc);
end
f=rfft(z.');
[m,a,b]=melbankm(p,n,fs,fl,fh,w);
pw=f(a:b,:).*conj(f(a:b,:));
pth=max(pw(:))*1E-20;
if any(w=='p')
    y=log(max(m*pw,pth));
else
    ath=sqrt(pth);
    y=log(max(m*abs(f(a:b,:)),ath));
end
c=rdct(y).';
nf=size(c,1);
nc=nc+1;
if p>nc
    c(:,nc+1:end)=[];
elseif p<nc
    c=[c zeros(nf,nc-p)];
end
if ~any(w=='0'), c(:,1)=[]; nc=nc-1; end
if any(w=='E')
    c=[log(max(sum(pw),pth)).' c];
    nc=nc+1;
end

% calculate derivative

if any(w=='D')
    vf=(4:-1:-4)/60;
    af=(1:-1:-1)/2;
    ww=ones(5,1);
    cx=[c(ww,:); c; c(nf*ww,:)];
    vx=reshape(filter(vf,1,cx(:)),nf+10,nc);
    vx(1:8,:)=[];
    ax=reshape(filter(af,1,vx(:)),nf+2,nc);
    ax(1:2,:)=[];
    vx([1 nf+2],:)=[];
    if any(w=='d')
        c=[c vx ax];
    else
        c=[c ax];
    end
elseif any(w=='d')
    vf=(4:-1:-4)/60;
    ww=ones(4,1);
    cx=[c(ww,:); c; c(nf*ww,:)];
    vx=reshape(filter(vf,1,cx(:)),nf+8,nc);
    vx(1:8,:)=[];
    c=[c vx];
end

if nargout<1
    [nf,nc]=size(c);
    t=((0:nf-1)*inc+(n-1)/2)/fs;                 % time axis at the frame centres
    ci=(1:nc)-any(w=='0')-any(w=='E');
    imh = imagesc(t,ci,c.');
    axis('xy');
    xlabel('Time (s)');
    ylabel('Mel-cepstrum coefficient');
    title('13 cepstral coefficients of LOG output for multiple speaker');
    map = (0:63)'/63;
    colormap([map map map]);
    colorbar;
end

%**************************************************************************
% SCRIPT FOR PROJECT
% combine multiple speakers into one audio file
%**************************************************************************
[y1,f1,nbit_1]=wavread('s1.wav');
[y2,f2,nbit_2]=wavread('s2.wav');
[y3,f3,nbit_3]=wavread('s3.wav');
[y4,f4,nbit_4]=wavread('s4.wav');
[y5,f5,nbit_5]=wavread('s5.wav');
[y6,f6,nbit_6]=wavread('s6.wav');
[y7,f7,nbit_7]=wavread('s7.wav');
[y8,f8,nbit_8]=wavread('s8.wav');
[m,n]=size(y1);
Y1=[y1,y2(1:m,:)];
Y2=[y3(1:m,:),y4(1:m,:)]; %,y5(1:m,:),y6(1:m,:),y7(1:m,:),y8(1:m,:)];
sound(Y1,f1);
sound(Y2,f1);
%**************************************************************************
% add noise
t = 0:1/f1:length(Y1)/f1-1/f1;        % generate the correct time vector
% figure(1)
% subplot(311)                        % set up a subplot
% plot(t,Y1)                          % plot the signal in the time domain

% code to generate gaussian noise
sigma = 0.02;
mu = 0;
n = randn(size(Y1))*sigma + mu*ones(size(Y1));
noisy = n + Y1;                       % add the gaussian noise to the original signal
% subplot(312)
% plot(t,noisy)
%**************************************************************************
% divide signals into different frames for both the signal and the noisy signal
n=64;
win=hamming(n,'periodic');
inc=n/4;
m='z';
[frame_1,t_1,w_1]=enframe(Y1,win,inc,m);
[frame_2,t_2,w_2]=enframe(noisy,win,inc,m);
yfft=fft(frame_1);                    % take the FFT of the original signal frames
xfft=fft(frame_2);                    % take the FFT of the frame signal with noise added
f = -length(frame_1)/2:length(frame_1)/2-1;   % generate the appropriate frequency scale
ysfft=fftshift(yfft);                 % calculate the shifted FFT of the original signal
xsfft=fftshift(xfft);                 % same as above but for the signal with noise added
figure(2)
subplot(311)                          % plot the shifted FFT of the original signal
plot(f,abs(ysfft));
subplot(312)                          % plot the shifted FFT of the signal with noise added
plot(f,abs(xsfft));
% extract mel cepstral features from the signal to compare (default melcepst parameters)
c = melcepst(y1,f1);