IMPLEMENTATION OF FORMANT ANALYSIS TO RECOGNIZE VOICE FREQUENCY OF THE SPEAKER


CHAPTER 1

INTRODUCTION

1.1 INTRODUCTION:

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique uses the speaker's voice to verify their identity and provides access control for services such as voice dialing, database access services, information services, voice mail, security control for confidential information areas, remote access to computers, and several other fields where security is the main concern.

The speech signal contains many levels of information. Primarily, a message is conveyed via the spoken words. At other levels, speech conveys information about the language being spoken, the emotion, the gender, and the identity of the speaker. Automatic speaker recognition and speech recognition are very closely related. While speech recognition aims at recognizing the spoken words in speech, the aim of automatic speaker recognition is to identify the speaker by extraction, characterization and recognition of the information contained in the speech signal.

Speech is a complicated signal produced as a result of several transformations occurring at different levels: semantic, linguistic, articulatory, and acoustic. Differences in these transformations are reflected in differences in the acoustic properties of the speech signal. Besides these, there are speaker-related differences resulting from a combination of anatomical differences inherent in the vocal tract and the learned speaking habits of different individuals. In speaker recognition, all these differences are taken into account and used to discriminate between speakers.

1.2 LITERATURE SURVEY:

Speech is a natural means of communication for humans, so it is not surprising that humans can recognize the identity of a person by hearing his or her voice. About 2-3 seconds of speech is sufficient for a human to identify a voice. One review of human speaker recognition states that many studies with 8-10 speakers yield accuracies of more than 97% if a sentence or more of speech is heard. Performance falls if the speech is shorter and the number of speakers is larger. Speaker recognition is one area of artificial intelligence where machine performance can exceed human performance: with short test utterances and a large number of speakers, machine accuracy often exceeds that of humans. Research on speaker identification systems dates back more than fifty years.

1.2.1 EARLY SYSTEMS (1960-1980):

The first reported work on speaker recognition can be credited to Pruzansky at Bell Labs, as early as 1963, who initiated research by using filter banks and correlating two digital spectrograms for a similarity measure.

The system used several utterances of commonly spoken words by ten talkers and converted them to time-frequency-energy patterns. Some of each talker's utterances were used to form reference patterns, and the remaining utterances served as test patterns. The recognition procedure consisted of cross-correlating the test patterns with the reference patterns and selecting the talker corresponding to the reference pattern with the highest correlation as the talker of the test utterance.

1.2.2 INTERMEDIATE SYSTEMS (1980-2000):

In this period there was a lot of development in speaker identification technology. These advances were in both feature extraction and feature matching.

Voice pitch (F0) and formant frequencies (F1, F2, F3) extracted from time-aligned, un-coded and coded speech samples were compared to establish the statistical distribution of error attributed to the coding system. The mel-warped cepstrum became a very popular feature domain. The mel warping transforms the frequency scale to place less emphasis on high frequencies; it is based on the nonlinear human perception of the frequency of sounds. The cepstrum can be considered the spectrum of the log spectrum. Removing its mean reduces the effects of linear time-invariant filtering (e.g., channel distortion). Often, the time derivatives of the mel cepstra (also known as delta cepstra) are used as additional features to model trajectory information.


Studies on automatically extracting the speech periods of each person separately from a dialogue, conversation or meeting involving more than two people have appeared as an extension of speaker recognition technology. Increasingly, speaker segmentation and clustering techniques have been used to aid the adaptation of speech recognizers and to supply metadata for audio indexing and searching.

As an alternative to the template-matching approach for text-dependent speaker recognition, the Hidden Markov Model (HMM) technique was introduced. HMMs have the same advantages for speaker recognition as they do for speech recognition: remarkably robust models of speech events can be obtained with only small amounts of specification or information accompanying training utterances. Speaker recognition systems based on an HMM architecture used speaker models derived from a multi-word sentence, a single word, or a phoneme.

1.2.3 RECENT TRENDS IN SPEAKER IDENTIFICATION (2000 ONWARDS):

We can divide the recent advances in speaker identification into two categories: feature extraction and feature matching.

1.2.3(a) Feature Extraction:

Recently, feature extraction techniques such as MFCC, wavelet decomposition and transform-domain techniques have been explored.

Mel-Frequency Cepstral Coefficients (MFCC):

There has been a shift from LPC parameters to Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency. The MFCC technique makes use of two types of filters: linearly spaced filters and logarithmically spaced filters. To capture the phonetically important characteristics of speech, the signal is expressed on the Mel frequency scale. This scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone 40 dB above the perceptual hearing threshold is defined as 1000 mels.


Fig. 1.1: Block diagram of the MFCC processor (speech input → frame blocking → windowing → FFT → Mel-frequency wrapping → cepstrum → Mel cepstrum)

Fig. 1.1 shows the block diagram of the process that converts the speech signal into MFCCs. The speech signal is first divided into frames and then windowed (e.g., with a Hamming window) to minimize the signal discontinuities at the beginning and end of each frame. The next step is to convert the signal into the frequency domain by applying the DFT to the windowed frames. The following step is Mel-frequency wrapping, where the Mel scale is used; the equation below shows the conversion of a frequency f (in Hz) to Mel frequency, and a filter bank approach is used to implement it. In the final step, the log Mel spectrum is converted back to time using the DCT, and the result is the MFCC.

Mel(f) = 2595 × log10(1 + f / 700)
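As an illustration, a minimal MATLAB sketch of this pipeline is given below. The file name, frame sizes, number of filters and number of retained coefficients are assumptions for illustration, not values taken from this project.

[x, fs] = audioread('voice.wav');          % speech signal (file name assumed)
x = mean(x, 2);                            % force mono
frameLen = round(0.025*fs);                % 25 ms frames (assumed)
hopLen = round(0.010*fs);                  % 10 ms hop (assumed)
numFrames = floor((length(x) - frameLen)/hopLen) + 1;
nfft = 2^nextpow2(frameLen);
win = hamming(frameLen);

numFilt = 26;                              % number of Mel filters (assumed)
mel = @(f) 2595*log10(1 + f/700);          % Hz to Mel (equation above)
imel = @(m) 700*(10.^(m/2595) - 1);        % Mel to Hz
edges = imel(linspace(0, mel(fs/2), numFilt + 2));
bins = floor((nfft + 1)*edges/fs);
fb = zeros(numFilt, nfft/2 + 1);           % triangular Mel filter bank
for m = 1:numFilt
    for k = bins(m):bins(m+1)
        fb(m, k+1) = (k - bins(m)) / (bins(m+1) - bins(m));
    end
    for k = bins(m+1):bins(m+2)
        fb(m, k+1) = (bins(m+2) - k) / (bins(m+2) - bins(m+1));
    end
end

mfcc = zeros(13, numFrames);
for i = 1:numFrames
    frame = x((i-1)*hopLen + (1:frameLen)) .* win;   % frame blocking + windowing
    P = abs(fft(frame, nfft)).^2;                    % power spectrum (DFT)
    melSpec = fb * P(1:nfft/2 + 1);                  % Mel-frequency wrapping
    c = dct(log(melSpec + eps));                     % log Mel spectrum converted by DCT
    mfcc(:, i) = c(1:13);                            % keep the first 13 coefficients
end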


Wavelets:

Another feature extraction technique being explored is wavelet decomposition. Speech signals have a wide variety of characteristics in both the time and frequency domains. To analyze non-stationary signals such as speech, both time and frequency resolution are important. Therefore, while extracting features, it is useful to analyze the signal from a multi-resolution perspective. Wavelets provide both time and frequency resolution. The wavelet analysis procedure adopts a wavelet prototype function, called an analyzing wavelet or mother wavelet. Temporal analysis is performed with a contracted, high-frequency version of the prototype wavelet, while frequency analysis is performed with a dilated, low-frequency version of the same wavelet.

Speaker identification using different levels of decomposition of the speech signal with the discrete wavelet transform (DWT) and Daubechies (mother) wavelets has been demonstrated. Fig. 1.2 shows how the speech signal is decomposed into approximation (a1, ..., a7) and detail (d1, ..., d7) coefficients by using low-pass and high-pass filters at each stage. The speech signal is decomposed up to seven levels using the DWT with different Daubechies mother wavelets. The mean of the approximation and detail coefficients of every level is taken as the feature vector.
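A minimal MATLAB sketch of this feature vector is given below, assuming the Wavelet Toolbox is available; the file name and the choice of 'db4' as the Daubechies wavelet are assumptions.

[x, fs] = audioread('voice.wav');          % speech signal (file name assumed)
x = mean(x, 2);                            % force mono
levels = 7;
[c, l] = wavedec(x, levels, 'db4');        % 7-level DWT with a Daubechies wavelet
features = zeros(1, 2*levels);
for k = 1:levels
    a = appcoef(c, l, 'db4', k);           % approximation coefficients at level k
    d = detcoef(c, l, k);                  % detail coefficients at level k
    features(2*k - 1) = mean(a);           % mean of the approximation coefficients
    features(2*k)     = mean(d);           % mean of the detail coefficients
end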


Fig. 1.2: Seventh-level wavelet decomposition of the speech signal into approximation and detail coefficients

1.2.3(b) Feature Matching:

Artificial Neural Networks (ANN):

The techniques for feature matching have shifted from template matching to statistical modeling (e.g., HMM), and from distance-based to likelihood-based methods, although the non-parametric approach of VQ is still being used. A recent trend is the use of Artificial Neural Networks (ANN). Being widely used in pattern recognition tasks, neural networks have also been applied to speaker recognition.

Dynamic Time Warping (DTW):

The most popular method to compensate for speaking-rate variability in template-based systems is known as DTW. This method accounts for the variation over time (trajectories) of parameters corresponding to the dynamic configuration of the articulators, and it forms the basis of DTW-based text-dependent speaker verification systems. Gaussian mixture models (GMM) have also been investigated by comparing them with preliminary experiments on multilayer perceptron networks (MLP) trained with the back-propagation learning algorithm (BKP) and with dynamic time warping (DTW) techniques.

Although these advances have taken place, there are still many practical limitations which hinder the widespread commercial deployment of applications and services. A sounder understanding of the complex speech signal and its parameters is the route through which this can be achieved.

1.3 PROPOSED SYSTEM:

This system entails the design of speaker recognition code using MATLAB. Signal processing in the time and frequency domains yields a powerful method for analysis, and MATLAB's built-in functions for frequency-domain analysis, as well as its straightforward programming interface, make it an ideal tool for speech analysis projects. Speech editing was performed, as was degradation of signals by the addition of Gaussian noise. Background noise was successfully removed from a signal by the application of a 3rd-order Butterworth filter. A code was then constructed to compare the pitch and formants of a known speech file against a set of unknown speech files and choose the top twelve matches.


CHAPTER 2

APPROACH

2.1 DESCRIPTION:

The physiological component of voice recognition is related to the physical shape of an individual's vocal tract, which consists of an airway and the soft-tissue cavities from which vocal sounds originate. To produce speech, these components work in combination with the physical movement of the jaw, tongue and larynx and with resonances in the nasal passages.

Fig. 2.1: Human vocal system

There are two forms of speaker recognition: "text dependent" and "text independent". In a system using "text dependent" speech, the individual presents either a fixed or a prompted phrase that is programmed into the system; this can improve performance, especially with cooperative users. A "text independent" system has no advance knowledge of the presenter's phrasing and is much more flexible in situations where the individual submitting the sample may be unaware of the collection or unwilling to cooperate, which presents a more difficult challenge.

The speaker recognition system analyzes the frequency content of the speech and compares characteristics such as quality, duration, intensity dynamics and pitch of the signal.

2.2 CLASSIFICATION:

Speaker recognition can be classified into a number of categories. The figure below provides the various classifications of speaker recognition.

Fig. 2.2: Classification of Speaker Recognition


2.2.1 OPEN SET vs CLOSED SET:

Speaker recognition can be classified into open-set and closed-set speaker recognition. This classification is based on the set of trained speakers available in a system. Let us discuss them in detail.

1. Open Set: An open-set system can have any number of trained speakers; the number of speakers can be anything greater than one.

2. Closed Set: A closed-set system has only a specified (fixed) number of users registered to the system.

2.2.2 IDENTIFICATION vs VERIFICATION:

This category of classification is the most important among the lot. Automatic speaker identification and verification are often considered the most natural and economical methods for avoiding unauthorized access to physical locations or computer systems. Let us discuss them in detail:

1. Speaker identification: the process of determining which registered speaker provides a given utterance.

2. Speaker verification: the process of accepting or rejecting the identity claim of a speaker. Fig. 2.3 below illustrates the basic differences between speaker identification and verification systems.


Fig 2.3: Practical examples of Identification and Verification Systems

2.2.3 TEXT-DEPENDENT vs TEXT-INDEPENDENT

This is another category of classification of speaker recognition systems, based upon the text uttered by the speaker during the identification process. Let us discuss each in detail:

1. Text-Dependent: the test utterance is identical to the text used in the training phase; the test speaker has prior knowledge of the system.

2. Text-Independent: the test speaker does not have any prior knowledge of the contents of the training phase and can speak anything.


2.3 DESIGN APPROACH:

This multi-faceted design project can be categorized into different sections: speech editing, speech degradation, speech enhancement, pitch analysis, formant analysis and waveform comparison. The discussion that follows is segmented along these lines.

2.3.1 SPEECH EDITING

The file recorded with my slower speech was found from the ordered list of speakers. The length of the vector representing this speech file was 30,000 samples. The vector was therefore partitioned into two separate vectors of equal length, and the two halves were written to a file in the opposite order. The file was then read and played back.

2.3.2 SPEECH DEGRADATION

The file recorded with my faster speech was found from the ordered list of speakers. Speech degradation was performed by adding Gaussian noise, generated by the MATLAB function randn(), to this file. A comparison was then made between the clean file and the signal with Gaussian noise added.

2.3.3 SPEECH ENHANCEMENT

The file recorded with my slower speech and with noise in the background was found from the ordered list of speakers. This signal was converted to the frequency domain using a shifted FFT and a correctly scaled frequency vector. The higher-frequency noise components were then removed by applying a 3rd-order Butterworth low-pass filter, with the cutoff chosen to remove as much of the noise as possible while still preserving the original signal.


2.3.4 PITCH ANALYSIS

The file recorded with my slower speech was found from the ordered list of speakers. Pitch analysis was conducted and the relevant parameters were extracted. The average pitch of the entire wav file was computed and found to be 154.8595 Hz. A graph of the pitch contour versus time frame was also created to see how the pitch varies over the wav file. The results of pitch analysis can be used in speaker recognition, where differences in average pitch can be used to characterize a speech file.

2.3.5 FORMANT ANALYSIS

Formant analysis was performed on my slow speech file. The vector positions of the peaks in the power spectral density were calculated and can be used to characterize a particular voice file. This technique is used in the waveform comparison.

Formant frequencies have rarely been used as acoustic features for speech recognition, in spite of their phonetic significance. For some speech sounds one or more of the formants may be so badly defined that it is not useful to attempt a frequency measurement. Also, it is often difficult to decide which formant labels to attach to particular spectral peaks.

2.3.6 WAVEFORM COMPARISON

Using the results and information learned from pitch and formant analysis, a waveform comparison code was written. Speech waveform files can be characterized based on various criteria; average pitch and formant peak position vectors are two such criteria that can be used to characterize a speech file. The slow speech file was used as a reference file. Four sorting routines were then written to compare the files. The sorting routines performed the following functions: sort and compare the average pitch of the reference file with all 83 wav files; compare the formant vector of the reference file to all wav files; sort for the top 20 average pitch correlations and then sort these files by formant vectors; and finally sort for the top 20 formant vector correlations and then sort these by average pitch. Sample code for the case of comparing the average pitch and then comparing the top 12 most likely matches by formant peak difference vectors is given below.
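The project's own sample code is not reproduced in this extract; the MATLAB sketch below only illustrates that routine. The helper functions averagePitch() and formantPeaks() are hypothetical placeholders for the pitch and formant routines described in Chapter 5, and the file names are assumptions.

files = dir('*.wav');                               % the set of speech files (location assumed)
[ref, fs] = audioread('reference_slow.wav');        % reference file (name assumed)
refPitch = averagePitch(ref, fs);                   % hypothetical helper
refFormant = formantPeaks(ref, fs);                 % hypothetical helper

n = numel(files);
pitchDiff = zeros(n, 1);
for k = 1:n
    [y, fsk] = audioread(files(k).name);
    pitchDiff(k) = abs(averagePitch(y, fsk) - refPitch);
end

[~, order] = sort(pitchDiff);                       % rank all files by average pitch
top = order(1:min(12, n));                          % keep the 12 closest matches
formantDiff = zeros(numel(top), 1);
for k = 1:numel(top)
    [y, fsk] = audioread(files(top(k)).name);
    formantDiff(k) = norm(formantPeaks(y, fsk) - refFormant);
end
[~, best] = sort(formantDiff);                      % re-rank those by formant peak distance
disp(files(top(best(1))).name)                      % most likely match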

BLOCK DIAGRAM:

Fig 2.4 Block Diagram of Speaker Recognition System

(Fig. 2.4 blocks: voice .wav file → speech editing; speech degradation (Gaussian noise); speech enhancement (FFT, 3rd-order Butterworth filter); pitch analysis; formant analysis; waveform comparison)

CHAPTER 3

SPEECH EDITING

Editing speech data is currently time-consuming and error-prone. Speech editors rely on acoustic waveform representations, which force users to repeatedly sample the underlying speech to identify the words and phrases to edit. Instead, we developed a semantic editor that reduces the need for extensive sampling by providing access to meaning.

3.1 INTRODUCTION:

Speech is an important informational medium. Large amounts of valuable spoken information are exchanged in meetings, voicemail and public debates. Speech also has general benefits over text, being both expressive and easy to produce. Speech archives are now becoming increasingly prevalent, but until recently it was hard to exploit these archives because of a shortage of effective tools for accessing and manipulating speech data.

A set of transformations of F0 contours, energy, duration and spectral content can be used for the manipulation of affect in speech signals. This set includes operations like selective extension and shrinking, and actions like 'cut and paste'.

3.2 AFFECT EDITOR:

A complete system may allow the user either to choose a desired target expression that is automatically translated into operators and contours, or to choose the operators and manipulations manually. The editing tool should offer a variety of editing operators, such as changing the intonation, the speech rate, or the energy in different frequency bands and time frames, or adding special effects.

3.3 IMPLEMENTATION:

The editor requires a preprocessing stage before editing an utterance, and post-processing is also necessary for reproducing a new speech signal. The input signal is preprocessed in a way that allows different features to be processed separately. The time-frequency domain is used because it allows local changes of limited duration and of specific frequency bands. From a human-computer interaction point of view, it allows visualization of the changeable features and gives the user graphical feedback for most operations.

The file recorded with my slower speech was found from the ordered list of speakers. The length of the vector representing this speech file was 30,000 samples. The vector was therefore partitioned into two separate vectors of equal length, and the two halves were written to a file in the opposite order.

Fig 3.1: speech editing waveform

Effective editing can extract and summarize the main points of a speech record, allowing others to access key information without having to listen to all of it. Most current speech editors rely on an acoustic representation. To edit speech, users listen to the underlying speech and then manipulate the acoustic representation. This is a laborious process that involves multiple editing actions, repeatedly sampling the speech to precisely identify the beginning and end of regions of interest.

One important design implication is that we need to move away from general-purpose acoustic tools for processing speech. Acoustic editors are designed to deal with all forms of audio data, but speech editing has specific demands that are not well met by such general tools. By building tools that are specifically tailored to represent meaning, we can provide more effective ways to process speech.


Further design implications arise from user comments about the semantic editor. One challenge is to indicate to users that a transcript is inaccurate. One possibility is to use confidence information from the speech recognizer to signal this: regions of low automatic speech recognition confidence could be grayed in the transcript to alert users to areas of potentially poor quality.

Users also wanted to be able to correct transcripts and comment on their edits. We have therefore extended our semantic editor to: (a) allow users to correct original transcription errors; (b) combine edited transcripts with explanatory user textual comments.

In this technique we record a set of speech signals in '.wav' format and take one speech signal from the set of recorded speech waves, on which we perform speech editing. Here the length of the vector representing this speech file is 30,000 samples. This vector is then divided into two separate vectors of equal length and written in the opposite order. With the help of MATLAB programming and tools, we develop a code by which the given wave file is read and then played back in the reversed order. The general representation of the speech editing waveform in forward mode as well as reverse mode is shown in the figure below.


Fig 3.2: Example for speech editing
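A minimal MATLAB sketch of this editing step is given below; the file names are assumptions, and the recording is assumed to be mono with 30,000 samples.

[x, fs] = audioread('myvoice_slow.wav');    % read the recorded speech (file name assumed)
x = x(1:30000);                             % vector of length 30,000
half = numel(x)/2;
front = x(1:half);                          % first half of the vector
back  = x(half+1:end);                      % second half of the vector
y = [back; front];                          % the two halves in opposite order
audiowrite('myvoice_edited.wav', y, fs);    % write the edited file
[y2, fs2] = audioread('myvoice_edited.wav');
sound(y2, fs2);                             % read the edited file back and play it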

In conclusion, semantic speech editing based on automatic speech recognition (ASR) transcripts is better, faster and more efficient than acoustic editing, even when the transcription is poor. These results are highly promising, suggesting that semantic editing may remove a major barrier to making speech into useful data.

CHAPTER 4

SPEECH DEGRADATION & SPEECH ENHANCEMENT

4.1 INTRODUCTION TO SPEECH DEGRADATION

The human auditory system is faced with the formidable challenge of segregating signals of importance, such as speech, from interference caused by either internal or external noise contributions. Internal noise contributions are inherent in, for example, certain speech impediments affecting the production of speech sounds, whereas external noise contributions are unrelated to the signal of interest (say, a speech message obscured by noise in cocktail-party settings). Noise can also be either transient or continuous; that is, it either coincides with the signal of interest or is continuous background noise contributed by an unrelated source. One might expect that the way the auditory cognitive system extracts the relevant features of speech from noise should also be reflected in brain dynamics, specifically in changes in the amplitude, latency and source location of cortical activation. Previous research, having focused on how particular aspects of noise are reflected in brain dynamics, has unfortunately left us with a fragmentary picture of how the human brain segregates stimulus features of interest from noise. Given this shortcoming, we aim here for a comprehensive look at how internal vs. external noise of either a transient or a continuous nature is reflected in cortical activity in humans.

In natural auditory environments, the intelligibility of speech is often reduced by distortions from external sound sources, which may include informational masking (e.g., concurrent speakers) in addition to energetic (noise) distortion. External distortions, such as stochastic noise, are independent of the speech sounds. Consequently, the detrimental effects of external sources on speech perception can be mitigated by using acoustic features specific to the sound sources, such as pitch, timbre, intensity and spatial cues, to segregate the speech from the distortions. The temporally synchronized spectral features characteristic of speech sounds are integrated into a coherent whole, thus leading to successful perception of speech. In contrast, external distortions typically have a different spectral structure and are temporally unsynchronized with speech, and this facilitates the segregation of speech from noise contributions.


4.2 IMPLEMENTATION

The file recorded with my faster speech was found from the ordered list of speakers. Speech degradation was performed by adding Gaussian noise, generated by the MATLAB function randn(), to this file. A comparison was then made between the clean file and the signal with Gaussian noise added.
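A minimal MATLAB sketch of this degradation step is given below; the file names and the noise level are assumptions.

[x, fs] = audioread('myvoice_fast.wav');    % clean recording (file name assumed)
x = mean(x, 2);                             % force mono
noise = 0.05*randn(size(x));                % zero-mean Gaussian noise, std 0.05 (assumed)
y = x + noise;                              % degraded signal
audiowrite('myvoice_noisy.wav', y, fs);

t = (0:numel(x)-1)/fs;                      % compare the clean and degraded signals
subplot(2,1,1); plot(t, x); title('Clean speech');
subplot(2,1,2); plot(t, y); title('Speech with Gaussian noise added');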

Noise plays a vital role in speech degradation, so noise estimation is one of the major parts of the speech recognition task. If the estimated noise is low it will not affect the speech signal, but if the noise is high the speech will become distorted and lose intelligibility. Moreover, this process not only helps us make a comparison between the clean file and the signal with added Gaussian noise; it can also be viewed in terms of which DSP (digital signal processing) filter, such as a Chebyshev filter or a Butterworth filter, works better to remove this Gaussian noise.

Fig. 4.1: Block diagram for speech degradation (clean speech → unknown linear filtering → unknown additive noise → degraded speech → compensation → compensated speech)

4.2.1 CONTINUOUS DISTORTIONS IN SPEECH

The speech sounds were degraded by presenting either quantization noise or stochastic noise continuously in the background while measuring auditory evoked responses to the undistorted speech sound. For the quantization noise, this was accomplished by first extracting the time-domain distortion waveform created by the reduction of the amplitude resolution from the degraded speech sound. Because the undistorted speech sound was a periodic voiced utterance, the distortion waveform (i.e., the quantization error signal) was also periodic. Therefore, a continuous version of the quantization noise could easily be generated by concatenating one cycle of the periodic distortion waveform. In the case of the continuous stochastic noise, a pseudo-random number generator was first used to generate two copies of a 10-second white noise sequence. Finally, two continuous stochastic noise sequences with a duration of 10 min were obtained by concatenating the corresponding energy-scaled noise sequences.

4.2.3 GAUSSIAN NOISE MODELS

Gaussian white noise models have become increasingly popular as a canonical type of model in which to address certain statistical problems. We briefly review some statistical problems formulated in terms of Gaussian "white noise" and pursue a particular group of problems connected with the estimation of monotone functions. These results are related to the recent development of likelihood ratio tests for monotone functions. We conclude with some open problems connected with multivariate interval censoring.

This is one of the non-parametric methods for speaker identification. When feature vectors are displayed in d-dimensional feature space after clustering, they somewhat resemble a Gaussian distribution. This means each corresponding cluster can be viewed as a Gaussian probability distribution, and features belonging to the clusters can best be represented by their probability values. The only difficulty lies in efficient classification of the feature vectors.

Here, we briefly review a slice of the past and current research work on "white noise models", present some results on the estimation of a monotone function observed "in white noise", and study a canonical version of the problem which arises repeatedly in the asymptotic distribution theory for nonparametric estimators of monotone functions. We also carry through an analogous estimation problem in which some additional knowledge of the monotone function is available, namely its value at one point. This arises naturally when addressing the problem of finding a likelihood ratio test of the hypothesis H: f(t0) = θ0, where f is monotone.

Fig 4.2 Speech file with and without Gaussian noise added to it.

The Gaussian Noise Generator block generates discrete-time white Gaussian noise. You must specify the Initial seed vector in the simulation. The Mean Value and the Variance can be either scalars or vectors. If either of these is a scalar, then the block applies the same value to each element of a sample-based output or each column of a frame-based output. Individual elements or columns, respectively, are uncorrelated with each other.


When the Variance is a vector, its length must be the same as that of the Initial seed vector. In this case, the covariance matrix is a diagonal matrix whose diagonal elements come from the Variance vector. Since the off-diagonal elements are zero, the output Gaussian random variables are uncorrelated. When the Variance is a square matrix, it represents the covariance matrix; its off-diagonal elements are the correlations between pairs of output Gaussian random variables. In this case, the Variance matrix must be positive definite, and it must be N-by-N, where N is the length of the Initial seed.

The Initial seed parameter initializes the random number generator that the Gaussian Noise Generator block uses to add noise to the input signal. For best results, the Initial seed should be a prime number greater than 30. Also, if there are other blocks in a model that have an Initial seed parameter, you should choose different initial seeds for all such blocks. You can choose seeds for the Gaussian Noise Generator block using the Communications Blockset randseed function, which returns a random prime number greater than 30. Entering randseed again produces a different prime number. If you supply an integer argument, randseed always returns the same prime for that integer; for example, randseed(5) always returns the same answer.

4.3 INTRODUCTION TO SPEECH ENHANCEMENT

Speech enhancement aims to improve speech quality by using various algorithms. Although this may sound simple, "quality" can mean several things: clarity and intelligibility, pleasantness, or compatibility with some other method in speech processing. Intelligibility and pleasantness are difficult to measure by any mathematical algorithm, so listening tests are usually employed. However, since arranging listening tests may be expensive, how to predict the results of listening tests has been widely studied. No single philosopher's stone or minimization criterion has been discovered so far, and hardly ever will be. The central methods for enhancing speech are the removal of background noise, echo suppression, and artificially bringing certain frequencies back into the speech signal.


We shall focus on the removal of background noise after briefly discussing what the other methods are all about. First of all, every speech measurement performed in a natural environment contains some amount of echo. Echoless speech, measured in a special anechoic room, sounds dry and dull to the human ear. Echo suppression is needed in big halls to enhance the quality of the speech signal, especially if the distance between the microphone and the speaker is large. In current telephone networks speech is band-limited to 300-3400 Hz. Sooner or later the market will be dominated by third-generation phones in which the frequency band of the speech is, for instance, 50-7500 Hz. The delight of this wideband speech will be tamed unless the entire conversation travels over a wideband network. Artificial bandwidth expansion can be utilized to restore the frequencies that disappear on the route. These methods are also useful in speech compression.

When the background noise is suppressed, it is crucial not to harm or garble the speech signal, or at least not very badly. Another thing to remember is that quiet natural background noise sounds more comfortable than even quieter but unnaturally twisted noise. If the speech signal is not intended to be listened to by humans but is fed, for instance, to a speech recognizer, then comfort is not the issue; it is then crucial to keep the background noise low. Background noise suppression has many applications. Using a telephone in a noisy environment, such as in the street or in a car, is an obvious application. Traditionally, the background noise has been suppressed when sending speech from the cockpit of an airplane to the ground or to the cabin. It is easy to come up with similar examples.

It is also a good idea to enhance speech for coding and recognition purposes. Speech codecs have been optimized for speech, and they usually make the background noise sound weird. Moreover, enhanced speech can be compressed into fewer bits than non-enhanced speech. Speech recognition systems whose operation relies on features extracted from speech will be disturbed by extra noise sounds. Active noise suppression is a method in which the idea is to produce anti-noise into the listener's ear to cancel the noise. The delay must be kept very small to avoid producing more noise instead of cancelling the existing noise; for this reason, most of the methods for active noise suppression are fully analog, since A/D and D/A transforms inevitably produce some amount of delay. The operation of all the speech enhancement methods in the following sections is based on spectra calculated from adjacent frames of speech. In practice, the frames overlap a little and the frame size is a couple of dozen milliseconds. The windowed speech frame is padded with zeros to make its length equal to the nearest power of two.

4.3.1 BACKGROUND NOISE ESTIMATION

All the speech enhancement methods aimed at suppressing the background noise are, naturally, based in one way or another on an estimate of the background noise. If the background noise is evolving more slowly than the speech, i.e., if the noise is more stationary than the speech, it is easy to estimate the noise during the pauses in speech. Finding the pauses in speech is based on checking how close the estimate of the background noise is to the signal in the current window. Voiced sections can be located by estimating the fundamental frequency. Both methods easily fail on unstressed unvoiced or short phonemes, taking them as background noise. On the other hand, this is not very dangerous, because the effect of these faint phonemes on the background noise estimate is not that critical.

With a working VAD (voice activity detection) in hand, giving values of zero and one as indicators of the voice activity in each frame, we can update the estimate of the background noise spectrum during the frames that have zero VAD, using a recursive update of the form

N_i(f) = α·N_(i-1)(f) + (1 − α)·|X_i(f)|

where |X_i(f)| is the spectrum of the noisy speech, α is a decay rate coefficient flattening the spectrum, and the index i refers to the current frame.
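A MATLAB sketch of this recursive update is given below, assuming a matrix `frames` with one windowed speech frame per column and a vector `vad` of zero/one decisions per frame; the value of the decay coefficient is also an assumption.

alpha = 0.95;                               % decay rate coefficient (assumed)
nfft = size(frames, 1);
noiseEst = zeros(nfft, 1);                  % running estimate of the noise spectrum
for i = 1:size(frames, 2)
    X = abs(fft(frames(:, i), nfft));       % magnitude spectrum of the current frame
    if vad(i) == 0                          % update only during pauses in the speech
        noiseEst = alpha*noiseEst + (1 - alpha)*X;
    end
end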


4.4 IMPLEMENTATION

The speech enhancement technique builds on the speech degradation technique, i.e., it removes the Gaussian noise from the original speech wave. In this technique the degraded signal (the original signal mixed with Gaussian noise) is first converted to the frequency domain with the help of the FFT tool in MATLAB. The higher-frequency noise components are then removed with the help of a 3rd-order Butterworth low-pass filter, whose response has the Butterworth form

H(u,v) = 1 / (1 + [D(u,v)/D0]^(2n))

where D(u,v) is the rms value of u and v, D0 determines the cut-off frequency, and n is the filter order. The Butterworth filter is chosen here because it filters the Gaussian noise closely and approximates an ideal low-pass filter as the order n is increased. The resulting filtered signal was then scaled and plotted with the original noisy signal to compare the filtering result; the general representation of the speech enhancement waveform is shown in the figure below.


Fig 4.3 Comparison of natural and LPF filtered signal
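A minimal MATLAB sketch of this enhancement step is given below; the file names and the 1 kHz cutoff frequency are assumptions.

[y, fs] = audioread('myvoice_noisy.wav');   % degraded recording (file name assumed)
y = mean(y, 2);                             % force mono
N = numel(y);
Y = fftshift(fft(y));                       % shifted spectrum
f = (-floor(N/2):ceil(N/2)-1)*(fs/N);       % correctly scaled frequency vector
plot(f, abs(Y)); xlabel('Frequency (Hz)'); ylabel('|Y(f)|');

fc = 1000;                                  % cutoff frequency in Hz (assumed)
[b, a] = butter(3, fc/(fs/2), 'low');       % 3rd-order Butterworth low-pass filter
clean = filter(b, a, y);                    % remove the higher-frequency noise
audiowrite('myvoice_enhanced.wav', clean, fs);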

4.4.1 FAST FOURIER TRANSFORMS

The Fast Fourier Transform (FFT) resolves a time waveform into its sinusoidal components. The FFT takes a block of time-domain data and returns the frequency spectrum of the data. The FFT is a digital implementation of the Fourier transform; thus, the FFT does not yield a continuous spectrum. Instead, the FFT returns a discrete spectrum, in which the frequency content of the waveform is resolved into a finite number of frequency lines, or bins.

Number of Samples

The sampled time waveform input to an FFT determines the computed spectrum. If an arbitrary signal is sampled at a rate equal to fs over an acquisition time T, N samples are acquired. Compute T with the following equation:

T = N / fs

where T is the acquisition time, N is the number of samples acquired and fs is the sampling frequency. Compute N with the following equation:

N = T · fs

Frequency Resolution

For the FFT, the spectrum computed from the sampled signal has a frequency resolution df. Calculate the frequency resolution with the following equation:

df = 1 / T = fs / N

Maximum Resolvable Frequency

The sampling rate of a time waveform determines the maximum resolvable frequency. According to the Shannon sampling theorem, the maximum resolvable frequency must be half the sampling frequency. To calculate the maximum resolvable frequency, use the following equation:

fmax = fNyquist = fs / 2

where fmax is the maximum resolvable frequency, fNyquist is the Nyquist frequency and fs is the sampling frequency.

fftshift:

Y = fftshift(X) rearranges the outputs of fft, fft2, and fftn by moving the zero-frequency component to the center of the array. It is useful for visualizing a Fourier transform with the zero-frequency component in the middle of the spectrum. For vectors, fftshift(X) swaps the left and right halves of X.

Relationship with FFT.

fft transforms an input signal into the frequency domain, while fftshift reorganizes the output of fft by moving the zeroth lag to the center of the spectrum. Thus, fftshift is usually used together with the Fourier transform.

4.4.2 BUTTERWORTH FILTERS

Here we describe the commonly used nth-order Butterworth low-pass filter. First we show how to use known design specifications to determine the filter order and the 3 dB cut-off frequency. Then we show how to determine the filter poles and the filter transfer function. Along the way, we describe the use of common MATLAB Signal Processing Toolbox functions that are useful in designing Butterworth low-pass filters.


Fig 4.4 Frequency response of a Butterworth filter of order n.

A Butterworth filter has the maximally flat response in the pass band. At the cut-off frequency, ωc, the attenuation is -3 dB. Above the -3 dB point the attenuation is relatively steep, with a roll-off of -20 dB/decade/pole. Fig. 4.4 shows the frequency response of such a filter.

The poles of a Butterworth filter are located on a circle of radius ωc and are spaced apart by an angle of 180°/n, where n is the order of the filter (the number of poles). The first pole is located 180°/2n from the jω axis, as shown in Fig. 4.5.


Fig 4.5 Poles of a Butterworth filter.

Syntax

[n,Wn] = buttord(Wp,Ws,Rp,Rs)

[n,Wn] = buttord(Wp,Ws,Rp,Rs,'s')

Description

buttord calculates the minimum order of a digital or analog Butterworth filter required to meet a set of filter design specifications.

Digital Domain

[n,Wn] = buttord(Wp,Ws,Rp,Rs) returns the lowest order, n, of the digital Butterworth filter that loses no more than Rp dB in the pass band and has at least Rs dB of attenuation in the stop band. The scalar (or vector) of corresponding cutoff frequencies, Wn, is also returned. Use the output arguments n and Wn in butter.


Table 4.1: Description of stop band and pass band filter parameters

Wp: Pass band corner (cutoff) frequency. A scalar or a two-element vector with values between 0 and 1, where 1 corresponds to the normalized Nyquist frequency, π radians per sample.

Ws: Stop band corner frequency. A scalar or a two-element vector with values between 0 and 1, where 1 corresponds to the normalized Nyquist frequency.

Rp: Pass band ripple, in decibels. This value is the maximum permissible pass band loss in decibels.

Rs: Stop band attenuation, in decibels. This value is the number of decibels the stop band is down from the pass band.

Analog Domain

[n,Wn] = buttord(Wp,Ws,Rp,Rs,'s') finds the minimum order n and cutoff frequencies Wn for an analog Butterworth filter. You specify the frequencies Wp and Ws similarly to the description in Table 4.1 above, except that in this case the frequencies are given in radians per second, and the pass band or the stop band can be infinite.
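A short usage sketch combining buttord and butter is given below; the sampling frequency and the specification values are assumptions, not design values taken from this project.

fs = 8000;                                  % sampling frequency in Hz (assumed)
Wp = 1000/(fs/2);                           % pass band edge, normalized to the Nyquist frequency
Ws = 1500/(fs/2);                           % stop band edge, normalized to the Nyquist frequency
Rp = 1;                                     % at most 1 dB of pass band ripple
Rs = 30;                                    % at least 30 dB of stop band attenuation
[n, Wn] = buttord(Wp, Ws, Rp, Rs);          % minimum order and cutoff meeting the specification
[b, a] = butter(n, Wn, 'low');              % low-pass filter coefficients
% y = filter(b, a, x);                      % apply the filter to a noisy speech vector x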

CHAPTER 5

PITCH ANALYSIS & FORMANT ANALYSIS

5.1 INTRODUCTION TO PITCH ANALYSIS

At the linguistic level, speech can be viewed as a sequence of basic sound units called phonemes. A phoneme is a sound or group of different sounds perceived to have the same function by the speakers of a language; an example of a phoneme is the /k/ sound in the words kit and skill. The same phoneme may give rise to many different sounds, or allophones, at the acoustic level, depending on the phonemes which surround it. Different speakers producing the same string of phonemes convey the same information yet sound different as a result of differences in dialect and in vocal tract length and shape.

Speech is a means of communication and an exchange of thoughts between individuals. The spoken word comprises vowels and consonants, which are the speech sound units. The speaker characteristics are identified from speech data and analyzed using suitable analysis techniques. The analysis technique aims at selecting a proper frame size, along with some overlap, and extracting the relevant features from speech. A lot of study has been carried out to investigate acoustic indicators for detecting emotions in speech. The characteristics most commonly considered include the fundamental frequency F0, duration, intensity, spectral variation and wavelet-based features. In this work, linear feature extraction techniques and their extraction algorithms are explained. These features are then used to identify whether a person is in a neutral, happy or sad emotional state.

There is a substantial amount of work on the frequency of the voice fundamental (F0) in the speech of speakers who differ in age and sex. The data reported nearly always include an average measure of F0, usually expressed in Hz. Typical values obtained for F0 are 120 Hz for men and 200 Hz for women; the mean values change slightly with age. Many methods and algorithms are in use for pitch detection, divided into two main camps: time-domain analysis and frequency-domain analysis.

Pitch, in terms of speech analysis, can be defined as the quantity which allows the ordering of sounds on a frequency-related scale. Pitch analysis helps us in identifying the state of speech of a person.


The considered states are neutral, happy and sad, so it is very important to understand the concept of pitch analysis. The proposed system describes a technique that involves the extraction of the basic parameters of pitch analysis. The average pitch of each of the .wav-format speech files recorded in our database of different speakers was calculated and found to have a characteristic value, which can be used in voice recognition, where differences in average pitch can be used to characterize a voice file.

5.2 PITCH ANALYSIS

Pitch is defined as the fundamental frequency of the excitation source. Hence an efficient pitch extractor and an accurate pitch estimate can be used in an algorithm for gender identification. A pitch detection algorithm (PDA) is an algorithm designed to estimate the pitch or fundamental frequency of a digital recording of speech or of a musical note or tone. This can be done in the time domain or the frequency domain.

Time-domain signals were converted to the frequency domain for better analysis and noise removal. Noise was removed by the application of a low-pass filter (here, a Butterworth low-pass filter).

Fig 5.1 Time domain plots of female and male voice sample

Pitch analysis was conducted and the relevant parameters were extracted, and the graph of pitch contour versus time frame was plotted to see how the pitch varies over the speech signal. The average pitch of the speech signal was calculated.


All the convolutions computed during this analysis were based on the FFT/IFFT algorithm implemented in MATLAB. Appropriate rectangular windows were designed and used for the analysis.

The file recorded with my slower speech was found from the ordered list of speakers. Pitch analysis was conducted and the relevant parameters were extracted. The average pitch of the entire wav file was computed and found to be 154.8595 Hz. The graph of pitch contour versus time frame was also created to see how the pitch varies over the wav file. The results of pitch analysis can be used in speaker recognition, where differences in average pitch can be used to characterize a speech file.

A large set of methods has been developed in the speech processing area for the estimation of pitch. Among them, the three most widely used methods are autocorrelation of speech, cepstrum pitch determination and simplified inverse filter tracking (SIFT) pitch estimation. The success of these methods is due to the simple steps involved in the estimation of pitch. Even though the autocorrelation method is mainly of theoretical interest, it provides a framework for the SIFT method.

Fig 5.2 Pitch contour plot

The main limitation of pitch estimation by the autocorrelation of speech is that there may be peaks larger than the peak corresponding to the pitch period T0, due to the resonances of the vocal tract. As a result, the wrong peaks may be picked and hence the pitch wrongly estimated. The approach to minimize such errors is to separate the vocal tract and excitation source related information in the speech signal and then use the source information for pitch estimation.

5.3 PITCH AUTOCORRELATION

The correlation between two waveforms is a measure of their similarity. The waveforms are compared at different time intervals, and their "sameness" is calculated at each interval. The result of a correlation is a measure of similarity as a function of the time lag between the beginnings of the two waveforms. The autocorrelation function is the correlation of a waveform with itself. One would expect exact similarity at a time lag of zero, with increasing dissimilarity as the time lag increases.

The pitch extraction process for a speech signal can be based on computing the short-time autocorrelation function of the speech signal. The short-time autocorrelation of a speech signal has the form

Rn(k) = Σm x(m)·w(n − m)·x(m + k)·w(n − m − k)

where Rn(k) is the short-time autocorrelation, x is the speech signal, w is the window, and k is the lag (in samples) at which the autocorrelation is calculated. For voiced segments of speech, the short-time autocorrelation function shows the periodicity of the speech. Rn(k) decreases with k as the computation proceeds.


A commonly used method to estimate pitch (fundamental frequency) is based on detecting the highest value of the autocorrelation function (ACF) in the region of interest. For a given discrete-time signal x(n), defined for all n, the autocorrelation function is generally defined as

Rx(τ) = lim(N→∞) 1/(2N + 1) · Σ(n = −N..N) x(n)·x(n + τ)

If x(n) is assumed to be exactly periodic with period P, i.e., x(n) = x(n + P) for all n, then it is easy to show that the autocorrelation is also periodic with the same period, Rx(τ) = Rx(τ + P). Conversely, periodicity in the autocorrelation function indicates periodicity in the signal. For non-stationary signals, such as speech, the concept of a long-time autocorrelation measurement as given by the above equation is not really suitable. In practice, short speech segments consisting of only N samples are operated on; that is why the short-time autocorrelation function, given by

R(τ) = Σ(n = 0..N − 1 − τ) x(n)·x(n + τ),   τ = 0, 1, ..., T − 1

is used instead, where N is the length of the analyzed frame and T is the number of autocorrelation points to be computed.
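A minimal MATLAB sketch of autocorrelation-based pitch estimation on one voiced frame is given below; the file name, the frame position and the 50-400 Hz search range are assumptions.

[x, fs] = audioread('voice.wav');           % speech signal (file name assumed)
x = mean(x, 2);                             % force mono
frame = x(1:round(0.03*fs));                % a 30 ms analysis frame (assumed voiced)
frame = frame - mean(frame);

[r, lags] = xcorr(frame);                   % short-time autocorrelation
r = r(lags >= 0);                           % keep non-negative lags only
minLag = round(fs/400);                     % search pitch between 400 Hz ...
maxLag = round(fs/50);                      % ... and 50 Hz
[~, idx] = max(r(minLag+1:maxLag+1));       % highest ACF peak in the region of interest
lag0 = minLag + idx - 1;                    % lag of that peak, in samples (pitch period)
f0 = fs/lag0;                               % estimated pitch in Hz
fprintf('Estimated pitch: %.1f Hz\n', f0);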

5.4 INTRODUCTION TO FORMANT ANALYSIS

Formant frequencies have rarely been used as acoustic features for speech recognition, in spite of their phonetic significance. For some speech sounds, one or more of the formants may be so badly defined that it is not useful to attempt a frequency measurement. Also, it is often difficult to decide which formant labels to attach to particular spectral peaks. The proposed system describes a new method of formant analysis which includes techniques to overcome both of the above difficulties. Using the same data and HMM model structure, results are compared between a recognizer using conventional cepstrum features and one using three formant frequencies combined with fewer cepstrum features to represent general spectral trends.

It has been known for many years that formant frequencies are important in determining the phonetic content of speech sounds. Several authors have therefore investigated formant frequencies as speech recognition features, using various methods for the basic analysis, such as linear prediction, analysis-by-synthesis with Fourier spectra, and peak picking on cepstrally smoothed spectra. However, using formants for recognition can sometimes cause problems, and they have not yet been widely adopted. It is obvious, for example, that formant frequencies cannot discriminate between speech sounds for which the main differences are unrelated to formants; thus they are unable to distinguish between speech and silence or between vowels and weak fricatives. Whenever any formants are poorly defined in the signal (e.g., in fricatives), measurements will be unreliable, and it is therefore essential that their estimated frequencies be given little weight in the recognition process.

It is impossible to determine from the spectrum of some speech sounds whether a

particular peak should be associated with one formant or with a pair, and sometimes a

formant may be so weak as a consequence of weak excitation that it causes no peak in the

spectrum. Either of these situations can cause all higher-frequency formants to be

wrongly labeled, with disastrous effects on the recognition. In such cases alternative

labellings must be produced, and any uncertainties that cannot be resolved in other ways

must be resolved within the recognition algorithm.

To be useful as features for automatic speech recognition, formant frequencies

must be supplemented by signal level and general spectral shape information, such as that provided by low-order cepstrum features. However, whenever the speech

spectrum has a peaky structure, the phonetic detail is better described by formant

frequencies than by the more usual higher-order cepstrum features, which have no simple

relationship with formant frequencies.


5.5 IMPLEMENTATION

In common terms, formants can be defined as the spectral peaks of the sound spectrum. Using the pitch and formant analysis discussed above, a waveform comparison code was written in MATLAB. Based on this code, speech waveform files can easily be characterized. In this process a reference .wav file is used, which is then compared with the remaining .wav files. A set of sorting routines is then performed: the average pitch of the reference file is compared with that of the other five .wav files; the formant vector of the reference file is compared with all of the .wav files; the files are sorted for the top three average-pitch correlations and then re-sorted by formant-vector correlation; and finally the files are sorted by formant-vector correlation and then re-sorted by average pitch. In this way the speaker can easily be recognized.

The formant analysis technique is performed on any .wav-format speech file taken from the set of recorded speech signals. With the help of MATLAB programming, a code was prepared for formant analysis. With the help of this code, the first five formants present in a .wav speech file can easily be calculated, the differences between the vector peak positions of these five formants can be computed, and the vector positions of the peaks in the power spectral density can be calculated and used to characterize the speech file. A typical formant analysis waveform is shown in the figure below.

Formant analysis was performed on the slow speech file. The first five peaks in the power spectral density were returned, and the first three can be seen in the plot. Also, the vector positions of the peaks in the power spectral density were calculated and can be used to characterize a particular voice file. This technique is used in the waveform comparison section.


Fig 5.3 Plot of the first few formants of a speech file
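As a minimal sketch of the peak-position calculation just described (the AR order, FFT length, and file name 'speech.wav' are assumptions, and the project's own peak-picking routine appears in the appendix as pickmax), the positions of the first five spectral peaks in a Yule-Walker PSD can be located as follows:

% Minimal sketch: vector positions of the first five peaks in the
% Yule-Walker PSD of a recording (order, nfft and file name assumed).
[y, fs] = wavread('speech.wav');
Pxx = pyulear(y, 12, 1024);           % AR order 12, 1024-point PSD
PdB = 10*log10(Pxx);

d = diff(PdB);                        % rising-to-falling sign changes mark peaks
idx = find(d(1:end-1) > 0 & d(2:end) <= 0) + 1;
peakPos = idx(1:min(5, length(idx)))  % indices of the first five peaks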

5.5.1 EXTRACTION OF FORMANT FREQUENCIES

Formants are defined as the spectral peaks of the sound spectrum of a person's voice. In speech science and phonetics, formant frequencies refer to the acoustic resonances of the human vocal tract. They are often measured as amplitude peaks in the frequency spectrum of the sound wave. We have considered the first three formants f1, f2, and f3 for the analysis of emotions. For different vowels, the range of f1 lies between 270 and 730 Hz, while the ranges of f2 and f3 lie between 840 and 2290 Hz and between 1690 and 3010 Hz respectively. Formant frequencies are very important in the analysis of the emotional state of a person.

The linear predictive coding (LPC) technique has been used for estimation of the formant frequencies. The analog signal is converted into .wav digital format. The signal is transformed to the frequency domain using the FFT and the power spectrum is then calculated. The signal is then passed through a linear predictive (LPC) filter with 11 coefficients and the absolute values are considered. The roots of the LPC polynomial are obtained, which contain both real and imaginary parts.


The phase spectrum is then displayed, which clearly shows the formant frequencies. The first five formant frequencies are displayed in the graph. The figure shows the formant frequency plot along with the original speech signal. The five formant frequencies obtained are 230 Hz, 800 Hz, 1684 Hz, 2552 Hz, and 3159 Hz respectively.

Fig 5.4 Speech signal and its formants.
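A minimal sketch of this LPC-based estimation is given below; the pre-emphasis filter, Hamming window, 30 ms frame, and file name are assumed choices, and the order-11 model is chosen in line with the coefficients mentioned above.

% Minimal sketch: formant frequencies from the roots of an LPC model
% (frame, window, pre-emphasis and file name are assumed choices).
[x, fs] = wavread('speech.wav');
seg = x(1:round(0.03*fs));            % one ~30 ms voiced frame
seg = filter([1 -0.97], 1, seg);      % pre-emphasis
seg = seg .* hamming(length(seg));    % taper the frame

a = lpc(seg, 11);                     % order-11 all-pole model
rts = roots(a);
rts = rts(imag(rts) > 0);             % keep one root from each complex pair
frqs = sort(atan2(imag(rts), real(rts))*fs/(2*pi));
formants = frqs(1:min(5, length(frqs)))  % lowest formant estimates in Hz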

5.5.2 YULE-WALKER POWER SPECTRAL DENSITY

The Yule-Walker Method estimates the power spectral density (PSD) of the input

using the Yule-Walker AR method. This method, also called the autocorrelation method,

fits an autoregressive (AR) model to the windowed input data. It does so by minimizing

the forward prediction error in the least squares sense. This formulation leads to the Yule-

Walker equations, which the Levinson-Durbin recursion solves. Block outputs are always

nonsingular.

The input is a sample-based vector (row, column, or 1-D) or frame-based vector

(column only). This input represents a frame of consecutive time samples from a single-

channel signal. The block outputs a column vector containing the estimate of the power


spectral density of the signal at Nfft equally spaced frequency points. The frequency

points are in the range [0,Fs), where Fs is the sampling frequency of the signal.

Pyulear estimates the power spectral density (PSD) of the signal vector x[n] using

the Yule-Walker AR method. This method, also called the autocorrelation method, fits an

autoregressive (AR) model to the signal by minimizing the forward prediction error in the

least-squares sense. This formulation leads to the Yule-Walker equations, which are

solved by the Levinson-Durbin recursion. The spectral estimate returned by pyulear is the

magnitude squared frequency response of this AR model. The correct choice of the model

order p is important.

Pxx = pyulear(x,p,nfft) returns Pxx, the power spectrum estimate. x is the input

signal, p is the model order for the all-pole filter, and nfft is the FFT length. Pxx has

length (nfft/2+1) for nfft even, (nfft+1)/2 for nfft odd, and nfft if x is complex.

[Pxx,freq] = pyulear(x,p,nfft) returns Pxx, the power spectrum estimate, and freq,

a vector of frequencies at which the PSD was estimated. If the input signal is real-valued,

the range of freq is [0, pi]. If the input signal is complex, the range of freq is [0, 2*pi).

[Pxx,freq] = pyulear(x,p,nfft,Fs) uses the signal's sampling frequency, Fs, to scale

both the PSD vector (Pxx) and the frequency vector (freq). Pxx is scaled by 1/Fs. If the

input signal is real-valued, the range of freq is [0,Fs/2]. If the input signal is complex, the

range of freq is [0,Fs]. Fs defaults to 1 if left empty.

[Pxx,freq] = pyulear(x,p,nfft,Fs,'range') specifies the range of frequency values to

include in freq. range can be:

half, to compute the PSD over the range [0,Fs/2] for real x, and [0,Fs] for

complex x. If Fs is left blank, the range is [0,1/2] for real x, and [0,1] for

complex x. If Fs is omitted entirely, the range is [0,pi] for real x, and [0,2*pi] for

complex x. half is the default range.


whole, to compute the PSD over the range [0,Fs] for all x. If Fs is left blank, the

range is [0,1] for all x. If Fs is omitted entirely, the range is [0,2*pi] for all x.

pyulear(...) plots the power spectral density in the first available figure window. The

frequency range on the plot is the same as the range of output freq for a given set of

parameters.

pyulear(...,'squared') plots the PSD directly, rather than converting the values to dB.

The following table indicates the length of Pxx and the range of the corresponding

normalized frequencies for this syntax.

Table 5.1 PSD Vector Characteristics for an FFT Length of 256 (Default)

Real/Complex Input Data    Length of Pxx    Range of the Corresponding Normalized Frequencies
Real-valued                129              [0, pi]
Complex-valued             256              [0, 2*pi)

[Pxx,w] = pyulear(x,p) also returns w, a vector of frequencies at which the PSD is

estimated. Pxx and w have the same length. The units for frequency are rad/sample.

[Pxx,w] = pyulear(x,p,nfft) uses the Yule-Walker method to estimate the PSD

while specifying the length of the FFT with the integer nfft. If you specify nfft as the

empty vector [], it adopts the default value of 256.

The length of Pxx and the frequency range for w depend on nfft and the values of

the input x. The following table indicates the length of Pxx and the frequency range

for w for this syntax.


Table 5.2 PSD and Frequency Vector Characteristics

Real/Complex Input Data    nfft Even/Odd    Length of Pxx    Range of w
Real-valued                Even             (nfft/2 + 1)     [0, pi]
Real-valued                Odd              (nfft + 1)/2     [0, pi)
Complex-valued             Even or odd      nfft             [0, 2*pi)

Power spectral density estimation here amounts to determining the parameters of an autoregressive model from the Yule-Walker equations, solved by the Levinson-Durbin recursion. The signal to be analyzed is assumed to be generated by a white-noise excitation e(n) driving a linear all-pole process with parameters ak:

x(n) = -a1 x(n-1) - a2 x(n-2) - ... - ap x(n-p) + e(n)

The Yule-Walker method should not be used for autoregressive parameter estimation if the autocovariance matrix is poorly conditioned. In that case even the relatively small bias of the covariance estimate can lead to a large deviation in the estimated parameters, resulting in an invalid model. A poorly conditioned autocovariance matrix also implies pole locations near the unit circle, as a result of which the autoregressive process exhibits an almost non-stationary, pseudo-periodic behavior. The variance of the stochastic process will be large because the innovation process is not identically zero, as it would be for a purely harmonic process.
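The call below is a minimal usage sketch of the Fs-scaled form of pyulear described above; the model order, FFT length, and file name are assumed values.

% Minimal sketch: one-sided Yule-Walker PSD scaled by the sampling
% frequency (order, FFT length and file name are assumptions).
[x, fs] = wavread('speech.wav');
p = 12;                                % AR model order
nfft = 512;                            % FFT length
[Pxx, freq] = pyulear(x, p, nfft, fs); % freq spans [0, fs/2] for real x
plot(freq, 10*log10(Pxx))
xlabel('frequency in Hz');
ylabel('power spectral density (dB/Hz)');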


CHAPTER 6

------------------------------------------------------------------------------------------------------------------------------------------------

WAVEFORM COMPARISON


6.1 WAVEFORM COMPARISON

Using the results and information learned from pitch and formant analysis, speech waveform files can be characterized based on various criteria. Average pitch and formant peak position vectors are two such criteria that can be used to characterize a speech file. The slow speech file was used as a reference file. Four sorting routines were then written to compare the files. The sorting routines performed the following functions: sort and compare the average pitch of the reference file with all 83 wav files; compare the formant vector of the reference file to all wav files; sort for the top 20 average pitch correlations and then sort these files by formant vectors; and finally sort for the top 20 formant vector correlations and then sort these by average pitch.

Fig 6.1 Comparison of PSD wave files


In order to create a speaker recognition algorithm, criteria to compare speech files

must be established. This section of the project compares four different methods of

comparing the data. First, the wav files are compared to a reference file and sorted based

on the average pitch of the file only. The files were then compared and sorted based

entirely on the location of the formants present in the PSD of the signal. A third method

compared the average pitch present and ranked the matches in ascending order and then

compared the top 12 most likely matches by formant location in the PSD. Finally, the

inverse routine was performed where the files were compared and sorted by the location

of the formants present and then the top 12 most likely matches based on this data were

compared and sorted by pitch.
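The sketch below illustrates the third of these strategies (rank by average pitch, then re-rank the closest candidates by formant position). The variables avgP, Fpos, refP, and refF are assumptions: per-file average pitches, per-file formant peak-position vectors, and the reference file's values, in the spirit of the appendix code.

% Minimal sketch: two-stage ranking, pitch first and then formant
% positions (avgP, Fpos, refP and refF are assumed inputs).
pitchDiff = abs(avgP - refP);               % pitch distance of every file
[sortedP, order1] = sort(pitchDiff);        % rank all files by pitch
top = order1(1:12);                         % keep the 12 closest by pitch

formantDiff = zeros(length(top), 1);
for j = 1:length(top)
    formantDiff(j) = norm(Fpos(top(j),:) - refF);  % formant-vector distance
end
[sortedF, order2] = sort(formantDiff);
bestMatches = top(order2)                   % final ranking of candidate files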


APPLICATIONS

The applications of speaker recognition technology are quite varied and continually growing. Below is an outline of some broad areas where speaker recognition technology has been or is currently used.

Access Control: Originally for physical facilities; more recent applications are for controlling access to computer networks and for automated password-reset services.

Transaction Authentication: For telephone banking, in addition to account access control, higher levels of verification can be used for more sensitive transactions. More recent applications are in user verification for remote electronic and mobile purchases.

Law Enforcement: Some applications are home-parole monitoring and prison call monitoring. There has also been discussion of using automatic systems to corroborate spectral inspections of voice samples for forensic analysis.

Speech Data Management: Voice mail browsing or intelligent answering machines use speaker recognition to label incoming voice mail with the speaker's name for browsing and action. For speech skimming or audio mining applications, recorded meetings or video can be annotated with speaker labels for quick indexing and filing.


FUTURE SCOPE

1. Controlling Of Device Through Voice Recognition Using MATLAB.

2. Speech Recognition using Digital Signal Processing.

3. It can be used for automatic recognition of speaker.

4. Gender recognition based on pitch using MATLAB.


APPENDIX


MATLAB FUNCTIONS

INTRODUCTION

The name MATLAB stands for Matrix Laboratory. MATLAB was written originally to provide easy access to matrix software developed by the LINPACK (linear system package) and EISPACK (Eigen system package) projects. MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment. Furthermore, MATLAB is a modern programming language environment: it has sophisticated data structures, contains built-in editing and debugging tools, and supports object-oriented programming. These factors make it a convenient tool compared to conventional computer languages (e.g., C, FORTRAN) for solving technical problems. MATLAB is an interactive system whose basic data element is an array that does not require dimensioning. The software package has been commercially available since 1984 and is now considered a standard tool at most universities and industries worldwide.

It has powerful built-in routines that enable a very wide variety of computations. It also has easy-to-use graphics commands that make the visualization of results immediately available. Specific applications are collected in packages referred to as toolboxes. There are toolboxes for signal processing, symbolic computation, control theory, simulation, optimization, and several other fields of applied science and engineering.

6.2.2 FUNCTIONS USED IN MATLAB

1. Reading a sound file

To process a sound file, we first need to read its samples into a vector. Let ‘y’ be the vector. The command is


[y, fs, bps] = wavread(‘path of the file’);

Description

This command stores the samples of the sound file in the vector y. The term ‘fs’ stores the sampling frequency of the file and ‘bps’ is the bits per sample. These 2 values are needed to reconstruct the .wav file using ‘wavwrite’ function.

2. Playing a sound file

Syntax

sound(y, Fs)

sound(y, Fs, bits)

Description

sound(y, Fs) sends audio signal y to the speaker at sample rate Fs. If you do not

specify a sample rate, sound plays at 8192 Hz. For single-channel (mono) audio, y is an

m-by-1 column vector, where m is the number of audio samples. If your system supports

stereo playback, y can be an m-by-2 matrix, where the first column corresponds to the

left channel, and the second column corresponds to the right channel. The sound function

assumes that y contains floating-point numbers between -1 and 1, and clips values outside

that range.

3. Generation of Gaussian Noise

Randn

Normally distributed pseudorandom numbers

Syntax

r = randn(n)


randn(size(A))

Description

r = randn(n) returns an n-by-n matrix containing pseudorandom values drawn

from the standard normal distribution. randn, with no input arguments, returns a single pseudorandom scalar. randn(size(A)) returns an

array the same size as A.

The sequence of numbers produced by randn is determined by the internal state of

the uniform pseudorandom number generator that underlies rand, randi, and randn. randn

uses one or more uniform values from that default stream to generate each normal value.

Control the default stream using its properties and methods. See RandStream for details

about the default stream.

Resetting the default stream to the same fixed state allows computations to be

repeated. Setting the stream to different states leads to unique computations, however, it

does not improve any statistical properties. Since the random number generator is

initialized to the same state every time MATLAB software starts up, rand, randn, and

randi will generate the same sequence of numbers in each session until the state is

changed.

4. Applying Fast Fourier Transform

Syntax

Y = fft(X)

Y = fft(X,n)

Description

Y = fft(X) returns the discrete Fourier transform (DFT) of vector X, computed

with a fast Fourier transform (FFT) algorithm. If X is a matrix, fft returns the Fourier


transform of each column of the matrix. If X is a multidimensional array, fft operates on

the first nonsingleton dimension. Y = fft(X, n) returns the n-point DFT. If the length of X is less than n, X is padded with trailing zeros to length n. If the length of X is greater than n, the sequence X is truncated. When X is a matrix, the lengths of the columns are adjusted in the same manner. Y = fft(X, dim) and Y = fft(X, n, dim) apply the FFT operation across the dimension dim.

5.Shifting of Fast Fourier Transform

Shift zero-frequency component to center of spectrum

Syntax

Y = fftshift(X)

Y = fftshift(X,dim)

Description

Y = fftshift(X) rearranges the outputs of fft, fft2, and fftn by moving the zero-

frequency component to the center of the array. It is useful for visualizing a Fourier

transform with the zero-frequency component in the middle of the spectrum.

6. Absolute value and complex magnitude

Syntax

abs(X)

Description

abs(X) returns an array Y such that each element of Y is the absolute value of the

corresponding element of X.

If X is complex, abs(X) returns the complex modulus (magnitude), which is the

same as


sqrt(real(X).^2 + imag(X).^2)

7. Cross-correlation

Syntax

c = xcorr(x,y)

c = xcorr(x)

Description

xcorr estimates the cross-correlation sequence of a random process.

Autocorrelation is handled as a special case. xcorr must estimate the sequence because, in

practice, only a finite segment of one realization of the infinite-length random process is

available.

c = xcorr(x,y) returns the cross-correlation sequence in a length 2*N-1 vector,

where x and y are length N vectors (N>1). If x and y are not the same length, the shorter

vector is zero-padded to the length of the longer vector.

8. Floor

Round toward negative infinity

Syntax

B = floor(A)

Description

B = floor(A) rounds the elements of A to the nearest integers less than or equal to

A. For complex A, the imaginary and real parts are rounded independently.

9. PSD using Yule-Walker AR method


Syntax

Pxx = pyulear(x,p)

Pxx = pyulear(x,p,nfft)

Description

Pxx = pyulear(x,p) implements the Yule-Walker algorithm, a parametric spectral

estimation method, and returns Pxx, an estimate of the power spectral density (PSD) of

the vector x. The entries of x represent samples of a discrete-time signal. p is the integer

specifying the order of an autoregressive (AR) prediction model for the signal, used in

estimating the PSD. This estimate is also the maximum entropy estimate. The power spectral density is calculated in units of power per radian per sample. Real-valued

inputs produce full power one-sided (in frequency) PSDs (by default), while complex-

valued inputs produce two-sided PSDs.

APPENDIX A: SPEECH EDITING

clc;

clear all;

close all;

[y, fs, nbits]= wavread('bhanu');

sound(y,fs);

t = 0:1/fs:length(y)/fs-1/fs;

subplot(211)

plot(t,y)


xlabel('time in seconds');

ylabel('amplitude');

yfirst = y(1:77125);               % first half of the recording
ysecond = y(77126:154350);         % second half of the recording
save darren ysecond yfirst -ascii  % write the two halves in swapped order
load darren -ascii                 % reload the edited (swapped) signal
subplot(212)
plot(t,darren,'r')

xlabel('time in seconds');

ylabel('amplitude');

pause(2)

sound(darren,fs);

APPENDIX B: SPEECH DEGRADATION

clc;

clear all;

close all;

[y, fs, nbits] = wavread('prudhvi');

t = 0:1/fs:length(y)/fs-1/fs;

subplot(311)


plot(t,y)

xlabel('time in seconds');

ylabel('amplitude');

sigma = 0.02;

mu = 0;

n = randn(size(y))*sigma + mu*ones(size(y));

signal=n+y;

yfft=fft(y);

xfft=fft(signal);

f = -length(y)/2:length(y)/2-1;

ysfft=fftshift(yfft);

xsfft=fftshift(xfft);

subplot(312)

plot(f,abs(ysfft),'r');

xlabel('frequency in HZ');

ylabel('amplitude');

subplot(313)

plot(f,abs(xsfft),'g');

xlabel('frequency in HZ');

ylabel('amplitude');


APPENDIX C: SPEECH ENHANCEMENT

clc;

clear all;

close all;

[y, fs, nbits] = wavread('surya');

t = 0:1/fs:length(y)/fs-1/fs;

subplot(311)

plot(t,y)

xlabel('time in seconds');

ylabel('amplitude');

sound(y,fs)

yfft=fft(y);

f = -length(y)/2:length(y)/2-1;

ysfft=fftshift(yfft);

subplot(312)

plot(f,abs(ysfft),'r');

xlabel('frequency in Hz');

ylabel('amplitude');


order = 3;

cut = 0.05;

[B, A] = butter(order, cut);

filtersignal = filter(B, A, ysfft);

subplot(313)

plot(f,21*abs(filtersignal));

xlabel('frequency in Hz');

ylabel('amplitude');

APPENDIX D: PITCH ANALYSIS

clc;

clear all;

close all;

[y, fs, nbits] = wavread('bhanu');

[t, f0, avgF0] = pitch(y,fs);

plot(t,f0)

xlabel('time frame');

ylabel('pitch(HZ)');

avgF0

sound(y) ;


PITCH AUTOCORRELATION

function [f0] = pitchacorr(len, fs, xseg)

[bf0, af0] = butter(4, 900/(fs/2));

xseg = filter(bf0, af0, xseg);

i13 = len/3;

maxi1 = max(abs(xseg(1:i13)));

i23 = 2 * len/3;

maxi2 = max(abs(xseg(i23:len)));

if maxi1>maxi2
CL = 0.68*maxi2;   % center-clipping threshold: 68% of the smaller peak level
else
CL = 0.68*maxi1;
end

clip = zeros(len,1);

ind1 = find(xseg>=CL);

clip(ind1) = xseg(ind1) - CL;

ind2 = find(xseg <= -CL);

clip(ind2) = xseg(ind2)+CL;

engy = norm(clip,2)^2;


RR = xcorr(clip);

m = len;

LF = floor(fs/320);   % shortest lag considered (highest pitch, 320 Hz)
HF = floor(fs/60);    % longest lag considered (lowest pitch, 60 Hz)

Rxx = abs(RR(m+LF:m+HF));

[rmax, imax] = max(Rxx);

imax = imax + LF;

f0 = fs/imax;

silence = 0.4*engy;

if (rmax > silence) & (f0 > 60) & (f0 <= 320)

f0 = fs/imax;

else f0 = 0;

end

APPENDIX E: FORMANT ANALYSIS

clc;

clear all;

close all;

[y, fs, nbits] = wavread('surya');

[P,F,I] = formant(y);

sound(y)


plot(F,P,'r')

ylabel('Amplitude(dB)');

xlabel('arbitary frequency scale');

PICKMAX

function [Y, I] = pickmax(y)

Y = zeros(5,1);

I = zeros(5,1);

xd = diff(y);

index = 1;

pos = 0;

for i=1:length(xd)

if xd(i)>0

pos = 1;

else

if pos==1

pos = 0;

Y(index) = xd(i);

I(index) = i-1;

index = index + 1;

if index>5


return

end

end

end

end

PITCH

function [t, f0, avgF0] = pitch(y, fs)

ns = length(y);

mu = mean(y);

y = y - mu;

fRate = floor(120*fs/1000);

updRate = floor(110*fs/1000);

nFrames = floor(ns/updRate)-1;

f0 = zeros(1, nFrames);

f01 = zeros(1, nFrames);

k = 1;

avgF0 = 0;

m = 1;

for i=1:nFrames

xseg = y(k:k+fRate-1);


f01(i) = pitchacorr(fRate, fs, xseg);

if i>2 & nFrames>3
z = f01(i-2:i);        % median-smooth over three consecutive frames
md = median(z);
f0(i-2) = md;

if md > 0

avgF0 = avgF0 + md;

m = m + 1;

end

elseif nFrames<=3
f0(i) = f01(i);          % too few frames for median smoothing; use the raw estimate
avgF0 = avgF0 + f01(i);
m = m + 1;

end

k = k + updRate;

end

t = 1:nFrames;

t = 20 * t;

if m==1

avgF0 = 0;


else

avgF0 = avgF0/(m-1);

end

APPENDIX F: WAVEFORM COMPARISON

results=zeros(12,1);

diff=zeros(82,1);

formantdiff=zeros(12,1);

[y17, fs17, nbits17] = wavread('bhanu');

[t17, f017, avgF017] = pitch(y17,fs17);

[P17,F17,I17] = formant(y17);

plot(t17,f017)

avgF17 = avgF017

sound(y17)

pause(3)

for i=1:83

if i<10
filename = sprintf('prudhvi', i);   % note: the format string has no %d, so the loop index i is ignored
else
filename = sprintf('surya', i);     % and the same two files are read on every pass
end


[y, fs, nbits] = wavread(filename);

[t, f0, avgF0] = pitch(y,fs);

plot(t,f0)

diff(i,1) = norm(avgF0 - avgF17);   % pitch distance of file i from the reference
i                                   % display loop progress
end
[Y,H] = sort(diff)                  % rank the files by pitch distance

for j=1:12

p = H(j);
if p<10

filename = sprintf('prudhvi', p);

else

filename = sprintf('surya', p);

end

filename

[y, fs, nbits] = wavread(filename);

[P,F,I] = formant(y);

sound(y)

plot(F,P)

pause(3)

formantdiff(j,1)=norm(I17-I);


end

[Y1,H1]=sort(formantdiff)

for k=1:12

results(k,1)=H(H1(k));

end

H

H1

results


RESULTS

&

CONCLUSION


RESULTS

A. SPEECH EDITING

[Figure: original speech signal and edited speech signal, amplitude vs. time in seconds]


B. SPEECH DEGRADATION

[Figure: time domain plot (amplitude vs. time in seconds), frequency domain plot, and frequency domain plot with noise added (amplitude vs. frequency in Hz)]


C. SPEECH ENHANCEMENT

[Figure: time domain plot (amplitude vs. time in seconds), frequency domain plot, and filtered frequency domain plot (amplitude vs. frequency in Hz)]


D. PITCH ANALYSIS

[Figure: pitch contour plot, pitch (Hz) vs. time frame]


E. FORMANT ANALYSIS

[Figure: formant plot, amplitude (dB) vs. arbitrary frequency scale]


F. WAVEFORM COMPARISON

[Figure: PSD comparison plot, amplitude (dB) vs. arbitrary frequency scale]


[Figure: PSD comparison plot, amplitude (dB) vs. arbitrary frequency scale]


CONCLUSION

A crude speaker recognition code has been written using the MATLAB

programming language. This code uses comparisons between the average pitch of a recorded wav file as well as the vector differences between formant peaks in the PSD of each file. It was found that comparison based on pitch produced the most accurate results, while comparison based on formant peak location did produce results, but could likely be

improved. Experience was also gained in speech editing as well as basic filtering

techniques. While the methods utilized in the design of the code for this project are a

good foundation for a speaker recognition system, more advanced techniques would have

to be used to produce a successful speaker recognition system.


The proposed system successfully describes the various characteristics and behaviour of speech signals and also addresses setting up communication between human speech and machines. In the proposed system the codes generated in MATLAB require speech signals in .wav format. In order to remove this limitation, the various other formats of speech signals that can be used for communication with machines need to be studied.

