TO RECOGNIZE VOICE FREQUENCY OF THE SPEAKER
CHAPTER 1
________________________________________________________________________________________________
INTRODUCTION
1.1 INTRODUCTION:
Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. The technique uses the speaker's voice to verify their identity and provides access control for services such as voice dialing, database access, information services, voice mail, security control for confidential information areas, remote access to computers, and other fields where security is the main concern.
The speech signal contains many levels of information. Primarily a message is
conveyed via the spoken words. At other levels, speech conveys the information about
the language being spoken, the emotion, gender, and the identity of the speaker. The
automatic recognition of speaker and speech recognition are very closely related. While
speech recognition sets its goals at recognizing the spoken words in speech, the aim of
automatic speaker recognition is to identify the speaker by extraction, characterization
and recognition of the information contained in the speech signal.
Speech is a complicated signal produced as a result of several transformations
occurring at several different levels: semantic, linguistic, articulatory, and acoustic.
Differences in these transformations are reflected in the differences in the acoustic
properties of the speech signal. Besides these, there are speaker-related differences that result from a combination of anatomical differences inherent in the vocal tract and the learned speaking habits of different individuals. In speaker recognition, all these
differences are taken into account and used to discriminate between speakers.
1.2 LITERATURE SURVEY:
Speech is a natural means of communication for humans. It is not surprising that
humans can recognize the identity of a person by hearing his voice. About 2-3 seconds of
speech is sufficient for a human to identify a voice. One review on human speech
recognition states that many studies of 8-10 speakers yield accuracy of more than 97% if
a sentence or more of speech is heard. Performance falls as the utterance becomes
shorter and the number of speakers grows. Speaker recognition is one area of artificial intelligence where machine performance can exceed human performance: with short test utterances and a large number of speakers, machine accuracy often exceeds that of humans. Research on speaker identification systems dates back more than fifty years.
1.2.1 EARLY SYSTEMS (1960-1980):
The first reported work on speaker recognition can be attributed to Pruzansky at Bell Labs who, as early as 1963, initiated research by using filter banks and correlating
two digital spectrograms for a similarity measure.
The system used several utterances of commonly spoken words by ten talkers and converted them to time-frequency-energy patterns. Some of each talker's utterances were
used to form reference patterns and the remaining utterances served as test patterns. The
recognition procedure consisted of cross-correlating the test patterns with the reference
patterns and selecting the talker corresponding to the reference pattern with the highest
correlation as the talker of the test utterance.
1.2.2 INTERMEDIATE SYSTEMS (1980-2000):
In this period there was considerable development in speaker identification technology.
These advances were both in the field of feature extraction and feature matching.
Voice pitch (F0) and formant frequencies (F1, F2, F3) extracted from time
aligned, un-coded and coded speech samples were compared to establish the statistical
distribution of error attributed to the coding system. The mel-warped cepstrum is a very popular feature domain. The mel warping transforms the frequency scale to place less emphasis on high frequencies. It is based on the nonlinear human perception of the frequency of sounds. The cepstrum can be considered as the spectrum of the log
spectrum. Removing its mean reduces the effects of linear time-invariant filtering (e.g.,
channel distortion). Often, the time derivatives of the mel cepstra (also known as delta
cepstra) are used as additional features to model trajectory information.
Studies on automatically extracting the speech periods of each person separately
from a dialogue/conversation/meeting involving more than two people have appeared as
an extension of speaker recognition technology. Increasingly, speaker segmentation and
clustering techniques have been used to aid in the adaptation of speech recognizers and
for supplying metadata for audio indexing and searching.
As an alternative to the template-matching approach for text-dependent speaker recognition, the Hidden Markov Model (HMM) technique was introduced. HMMs have the same advantages for speaker recognition as they do for speech recognition. Remarkably robust models of speech events can be obtained with only small amounts of specification or information accompanying training utterances. Speaker recognition systems based on an HMM architecture used speaker models derived from a multi-word sentence, a single word, or individual phonemes.
1.2.3 RECENT TRENDS IN SPEAKER IDENTIFICATION (2000 ONWARDS):
We can divide the recent advances in speaker identification into two categories: feature extraction and feature matching.
1.2.3.(a) Feature Extraction:
Recently, feature extraction techniques such as MFCC, wavelet decomposition, and transform-domain methods have been explored.
Mel-Frequency Cepstral Coefficients (MFCC):
There has been a shift from LPC parameters to Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency. The MFCC technique makes use of two types of filter, namely, linearly spaced filters and logarithmically spaced filters. To capture the phonetically important characteristics of speech, the signal is expressed on the mel frequency scale. This scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels.
Fig. 1.1: Block diagram of an MFCC processor
Fig. 1.1 shows the block diagram of the process for converting a speech signal into MFCCs. The speech signal is first divided into frames and then windowed (e.g., with a Hamming window) to minimize the signal discontinuities at the beginning and end of each frame. The next step is to convert the signal into the frequency domain by applying the DFT to the windowed frames. Then comes mel-frequency warping, where the mel scale is used; the equation below shows the conversion of frequency f to mel frequency, implemented with a filter-bank approach. In the final step, the log mel spectrum is converted back to the time domain using the DCT, yielding the MFCCs.
Mel(f) = 2595 × log10(1 + f/700)
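As a concrete illustration of the pipeline in Fig. 1.1, the following MATLAB sketch computes a basic set of mel cepstral coefficients. The input file name, frame sizes, and filter-bank parameters are assumptions for illustration, and the hand-rolled triangular filter bank is a simplified stand-in for a production MFCC routine.

% Minimal MFCC sketch: framing, windowing, DFT, mel filter bank, log, DCT.
[x, fs] = audioread('speech.wav');        % hypothetical input file
x = mean(x, 2);                           % collapse to mono

N    = round(0.025 * fs);                 % 25 ms frames (assumed)
hop  = round(0.010 * fs);                 % 10 ms hop (assumed)
nFilt = 26; nCeps = 13; nfft = 512;       % assumed filter-bank and cepstrum sizes

mel  = @(f) 2595 * log10(1 + f / 700);    % the mel mapping given above
imel = @(m) 700 * (10.^(m / 2595) - 1);

% Triangular mel-spaced filter bank between 0 Hz and fs/2.
edges = imel(linspace(mel(0), mel(fs/2), nFilt + 2));
bins  = floor((nfft + 1) * edges / fs) + 1;
H = zeros(nFilt, nfft/2 + 1);
for m = 1:nFilt
    d1 = max(bins(m+1) - bins(m),   1);   % guard against coincident bins
    d2 = max(bins(m+2) - bins(m+1), 1);
    for k = bins(m):bins(m+1),   H(m, k) = (k - bins(m))   / d1; end
    for k = bins(m+1):bins(m+2), H(m, k) = (bins(m+2) - k) / d2; end
end

w = hamming(N);
nFrames = floor((length(x) - N) / hop) + 1;
coeffs = zeros(nCeps, nFrames);
for i = 1:nFrames
    frame = x((i-1)*hop + (1:N)) .* w;    % window one frame
    P = abs(fft(frame, nfft)).^2;         % power spectrum (DFT step)
    E = H * P(1:nfft/2 + 1);              % mel filter-bank energies
    c = dct(log(E + eps));                % log + DCT back to the cepstral domain
    coeffs(:, i) = c(1:nCeps);            % keep the first coefficients
end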
Wavelets:
Another feature extraction technique being explored is wavelet decomposition. Speech signals have a wide variety of characteristics in both the time and frequency domains. To analyze non-stationary signals like speech, both time and frequency resolution are important. Therefore, while extracting features, it is useful to analyze the signal from a multi-resolution perspective. Wavelets provide both time and frequency resolution.
The wavelet analysis procedure is to adopt a wavelet prototype function, called an
analyzing wavelet or mother wavelet. Temporal analysis is performed with a contracted,
high frequency version of the prototype wavelet, while frequency analysis is performed
with a dilated, low frequency version of the same wavelet.
Speaker identification using different levels of decomposition of the speech signal with the discrete wavelet transform (DWT) and Daubechies mother wavelets has been demonstrated. Fig. 1.2 shows how the speech signal is decomposed into approximation (a1, ..., a7) and detail (d1, ..., d7) coefficients using low-pass and high-pass filters at each stage. The speech signal has been decomposed to seven levels using the DWT with different Daubechies mother wavelets. The mean of the approximation and detail coefficients at every level is taken as the feature vector.
Fig. 1.2: Seventh-level wavelet decomposition of the speech signal into approximation and detail coefficients
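A minimal MATLAB sketch of this feature extraction, assuming the Wavelet Toolbox and a hypothetical input file: the signal is decomposed to seven levels with a Daubechies wavelet, and the mean of the coefficients at each level is collected as the feature vector.

% 7-level DWT features: mean of approximation and detail coefficients per level.
[x, fs] = audioread('speech.wav');       % hypothetical input file
x = mean(x, 2);

[C, L] = wavedec(x, 7, 'db4');           % 7-level decomposition, Daubechies-4

a7 = appcoef(C, L, 'db4', 7);            % level-7 approximation coefficients
features = mean(a7);
for lev = 1:7
    d = detcoef(C, L, lev);              % detail coefficients d1..d7
    features(end + 1) = mean(d);         %#ok<SAGROW> collect per-level means
end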
1.2.3(b) Feature matching
Artificial Neural network:
The techniques for feature matching have shifted from template matching to statistical modeling (e.g., HMM), and from distance-based to likelihood-based methods, although the non-parametric approach of VQ is still in use. A recent trend is the use of Artificial Neural Networks (ANN): being widely used in pattern recognition tasks, neural networks have also been applied to speaker recognition.
Dynamic Time Warping (DTW):
The most popular method to compensate for speaking-rate variability in template-based systems is known as DTW. This method accounts for the variation over time (trajectories) of parameters corresponding to the dynamic configuration of the articulators, and it underlies DTW-based text-dependent speaker verification systems. The Gaussian mixture model (GMM) has also been investigated by comparing it, in preliminary experiments, with a multilayer perceptron (MLP) trained with the back-propagation learning algorithm and with DTW techniques.
Although these advances have taken place, many practical limitations still hinder the widespread commercial deployment of applications and services. A sounder understanding of the complex speech signal and its parameters is the route to overcoming them.
1.3 PROPOSED SYSTEM:
This system entails the design of speaker recognition code in MATLAB. Signal processing in the time and frequency domains yields a powerful method of analysis, and MATLAB's built-in functions for frequency-domain analysis, along with its straightforward programming interface, make it an ideal tool for speech analysis projects. Speech editing was performed, as was degradation of signals by the addition of Gaussian noise. Background noise was successfully removed from a signal by the application of a 3rd-order Butterworth filter. Code was then written to compare the pitch and formants of a known speech file with a set of unknown speech files and choose the top twelve matches.
CHAPTER 2
________________________________________________________________________________________________
APPROACH
2.1 DESCRIPTION:
The physiological component of voice recognition is related to the physical shape of an individual's vocal tract, which consists of an airway and the soft-tissue cavities from which vocal sounds originate. To produce speech, these components work in combination with the physical movement of the jaw, tongue, and larynx, and with resonances in the nasal passages.
Fig. 2.1: Human vocal system
There are two forms of speaker recognition” text dependent” and” text
independent”. In a system using “text dependent” speech the individual presents either a
fixed or prompted phrase that is programmed in to the system and can improve
performance especially with cooperative users. A “text independent” system has no
advance knowledge of the presenter's phrasing and is much more flexible in situations where the individual submitting the sample may be unaware of the collection or unwilling to cooperate, which presents a more difficult challenge.
The speaker recognition system analyzes the frequency content of the speech and compares characteristics such as the quality, duration, intensity, dynamics, and pitch of the signal.
2.2 CLASSIFICATION:
Speaker recognition can be classified into a number of categories. The figure below shows the various classifications of speaker recognition.
Fig. 2.2: Classification of Speaker Recognition
2.2.1 OPEN SET vs CLOSED SET:
Speaker recognition can be classified into open-set and closed-set speaker recognition. This classification is based on the set of trained speakers available in a system. Let us discuss them in detail.
1. Open Set: An open-set system can have any number of trained speakers; the number of speakers can be anything greater than one.
2. Closed Set: A closed-set system has only a specified (fixed) number of users registered to the system.
2.2.2 IDENTIFICATION vs VERIFICATION:
This is the most important category of classification. Automatic speaker identification and verification are often considered the most natural and economical methods for preventing unauthorized access to physical locations or computer systems. Let us discuss them in detail:
1. Speaker identification: It is the process of determining which registered speaker
provides a given utterance.
2. Speaker verification: It is the process of accepting or rejecting the identity claim of a
speaker. Figure 2.3 below illustrates the basic differences between speaker identification and verification systems.
Fig 2.3: Practical examples of Identification and Verification Systems
2.2.3 TEXT-DEPENDENT vs TEXT-INDEPENDENT
This is another category of classification of speaker recognition systems. This
category is based upon the text uttered by the speaker during the identification process.
Let us discuss each in detail:
1. Text-Dependent: In this case, the test utterance is identical to the text used in the
training phase. The test speaker has prior knowledge of the system.
2. Text-Independent: In this case, the test speaker doesn’t have any prior knowledge
about the contents of the training phase and can speak anything.
2.3 DESIGN APPROACH:
This multifaceted design project can be divided into several sections: speech editing, speech degradation, speech enhancement, pitch analysis, formant analysis, and waveform comparison. The discussion that follows is segmented along these lines.
2.3.1 SPEECH EDITING
The file recorded with my slower speech was located in the ordered list of speakers. The vector representing this speech file was found to have a length of 30,000 samples. The vector was therefore partitioned into two separate vectors of equal length, and these were written to a file in opposite order. The file was then read and played back.
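A minimal MATLAB sketch of this editing step, with hypothetical input and output file names; it splits the sample vector into two equal halves, writes them back in swapped order, and plays the result.

% Split a speech vector into two halves and write them back in opposite order.
[x, fs] = audioread('slow_speech.wav');   % hypothetical input file
x = mean(x, 2);                           % collapse to mono
x = x(1:2*floor(length(x)/2));            % ensure an even number of samples

half = length(x) / 2;
y = [x(half+1:end); x(1:half)];           % second half first, then first half

audiowrite('edited_speech.wav', y, fs);   % write the re-ordered signal
[y2, fs2] = audioread('edited_speech.wav');
sound(y2, fs2);                           % play back the edited file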
2.3.2 SPEECH DEGRADATION
The file recorded with my faster speech was located in the ordered list of speakers. Speech degradation was performed by adding Gaussian noise, generated by the MATLAB function randn(), to this file. A comparison was then made between the clean file and the signal with the added Gaussian noise.
2.3.3 SPEECH ENHANCEMENT
The file recorded with my slower speech and background noise was located in the ordered list of speakers. This signal was converted to the frequency domain using a shifted FFT and a correctly scaled frequency vector. The higher-frequency noise components were then removed by applying a 3rd-order Butterworth low-pass filter, with the cutoff chosen to remove as much of the noise as possible while still preserving the original signal.
2.3.4 PITCH ANALYSIS
The file recorded with my slower speech was located in the ordered list of speakers. Pitch analysis was conducted and the relevant parameters were extracted. The average pitch of the entire wav file was computed and found to be 154.8595 Hz. A graph of pitch contour versus time frame was also created to see how the pitch varies over the wav file. The results of pitch analysis can be used in speaker recognition, where differences in average pitch can be used to characterize a speech file.
2.3.5 FORMANT ANALYSIS
Formant analysis was performed on my slow speech file. The vector positions of the peaks in the power spectral density were calculated and can be used to characterize a particular voice file. This technique is used in the waveform comparison.
2.3.6 WAVEFORM COMPARISON
Using the results and information learned from pitch and formant analysis, a
waveform comparison code was written. Speech waveform files can be characterized
based on various criteria. Average pitch and formant peak position vectors are two such
criteria that can be used to characterize a speech file. The slow speech file was used as a
reference file. Four sorting routines were then written to compare the files. The sorting
routines performed the following functions: sort and compare the average pitch of the reference file against all 83 wav files; compare the formant vector of the reference file against all wav files; sort for the top 20 average-pitch correlations and then sort those files by formant vectors; and finally, sort for the top 20 formant-vector correlations and then sort those by average pitch. Sample code for the case of comparing the average pitch and then comparing the top 12 most likely matches by formant peak-difference vectors is given below.
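The full sorting routines are not reproduced here; the sketch below shows the two-stage sort, with hypothetical file names and with avgpitch() and formantvec() as placeholder helpers standing in for the pitch and formant routines of Sections 2.3.4 and 2.3.5.

% Two-stage comparison: coarse sort by average pitch, fine sort by formants.
% avgpitch() and formantvec() are hypothetical placeholders for the analysis routines.
files = dir('speakers/*.wav');                    % hypothetical folder of .wav files
refPitch   = avgpitch('reference.wav');
refFormant = formantvec('reference.wav');

n = numel(files);
pitchDiff = zeros(n, 1);
for i = 1:n
    f = fullfile(files(i).folder, files(i).name);
    pitchDiff(i) = abs(avgpitch(f) - refPitch);   % pitch distance to reference
end
[~, order] = sort(pitchDiff);                     % closest average pitch first
top = order(1:min(20, n));                        % keep the top 20 pitch matches

formantDiff = zeros(numel(top), 1);
for j = 1:numel(top)
    f = fullfile(files(top(j)).folder, files(top(j)).name);
    formantDiff(j) = norm(formantvec(f) - refFormant);  % formant-vector distance
end
[~, fo] = sort(formantDiff);
bestMatches = {files(top(fo(1:min(12, numel(fo))))).name} % top 12 candidates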
BLOCK DIAGRAM:
Fig 2.4 Block Diagram of Speaker Recognition System
CHAPTER 3
________________________________________________________________________________________________
SPEECH EDITING
Editing speech data is currently time-consuming and error-prone. Speech editors
rely on acoustic waveform representations, which force users to repeatedly sample the
underlying speech to identify words and phrases to edit. Instead we developed a semantic
editor that reduces the need for extensive sampling by providing access to meaning.
3.1 INTRODUCTION:
Speech is an important informational medium. Large amounts of valuable spoken information are exchanged in meetings, voicemail, and public debates. Speech also has general benefits over text, being both expressive and easy to produce. Speech archives are now becoming increasingly prevalent, but until recently it was hard to exploit them because of a shortage of effective tools for accessing and manipulating speech data.
A set of transformations of F0 contours, energy, duration, and spectral content can be used for the manipulation of affect in speech signals. This set includes operations such as selective extension and shrinking, and actions such as 'cut and paste'.
3.2 AFFECT EDITOR:
A complete system may allow the user to choose either a desired target
expression that will be automatically translated into operators and contours, or choose the
operators and manipulations manually. The editing tool should offer a variety of editing operators, such as changing the intonation, the speech rate, or the energy in different frequency bands and time frames, or adding special effects.
3.3 IMPLEMENTATION:
The editor requires a preprocessing stage before editing an utterance. Post
processing is also necessary for reproducing a new speech signal. The input signal is preprocessed in a way that allows different features to be processed separately. The time-frequency domain is used because it allows local changes of limited duration and of specific frequency bands. From a human-computer interaction point of view, it allows
visualization of the changeable features, and gives the user graphical feedback for most
operations.
The file recorded with my slower speech was found from the ordered list of
speakers. It was determined that the length of the vector representing this speech file had
a magnitude of 30,000. Thus the vector was partitioned into two separate vectors of equal
length and the vectors were written to a file in opposite order.
Fig 3.1: Speech editing waveform
Effective editing can extract and summarize the main points of a speech record,
allowing others to access key information without having to listen to all of it. Most
current speech editors rely on an acoustic representation. To edit speech, users listen to
the underlying speech and then manipulate the acoustic representation. This is a laborious
process that involves multiple editing actions, repeatedly sampling the speech to precisely
identify the beginning and end of regions of interest.
One important design implication is that we need to move away from general-
purpose acoustic tools for processing speech. Acoustic editors are designed to deal with
all forms of audio data, but speech editing has specific demands that are not well met by
such general tools. By building tools that are specifically tailored to represent meaning,
we can provide more effective ways to process speech.
Further design implications arise from user comments about the semantic editor.
One challenge is to indicate to users that a transcript is inaccurate. One possibility is that
we might use confidence information from the speech recognizer to signal this. Regions of low recognition confidence could be grayed out in the transcript to alert users to areas of potentially poor quality.
Users also wanted to be able to correct transcripts and comment on their edits. We
have therefore extended our semantic editor to: (a) allow users to correct original
transcription errors; (b) combine edited transcripts with explanatory user textual
comments.
In this technique, we record a set of speech signals in .wav format and take one speech signal from the set of recorded speech waves, on which we perform speech editing. Here the vector representing this speech file has a length of 30,000 samples. This vector is then divided into two separate vectors of equal length and opposite order. With the help of MATLAB programming and tools, we develop code by which the given wave file is read and the same file is then played in reverse order. The general representation of the speech editing waveform in forward mode as well as reverse mode is shown in the figure below.
Fig 3.2: Example of speech editing
In conclusion, semantic speech editing is better and faster for accurate automatic speaker recognition (ASR) and more efficient than acoustic editing, even when the transcription is poor. These results are highly promising, suggesting that speech editing may remove a major barrier to making speech into useful data.
CHAPTER 4
________________________________________________________________________________________________
SPEECH DEGRADATION
&
SPEECH ENHANCEMENT
4.1 INTRODUCTION TO SPEECH DEGRADATION
The human auditory system is faced with the formidable challenge of segregating
signals of importance, such as speech, from interference caused by either internal or
external noise contributions. Internal noise contributions are inherent in, for example,
certain speech impediments affecting the production of speech sounds, whereas external
noise contributions are unrelated to the signal of interest (say, a speech message obscured
by noise in cocktail-party settings). Noise can also be either transient or continuous, that
is, it coincides with the signal of interest, or is continuous background noise contributed
by an unrelated source. One might expect that the way the auditory cognitive system
extracts the relevant features of speech from noise contribution should also be reflected in
brain dynamics, specifically in changes in the amplitude, latency and source location of
cortical activation. Previous research, having focused on how particular aspects of noise
are reflected in brain dynamics, has unfortunately left us with a fragmentary picture of how the human brain segregates stimulus features of interest from noise. Given this
shortcoming, we aim here for a comprehensive look at how internal vs. external noise of
either transient or continuous nature is reflected in cortical activity in humans.
In natural auditory environments, the intelligibility of speech is often reduced by
distortions from external sound sources, which may include informational masking (e.g., concurrent speakers) in addition to energetic (noise) distortion. External distortions, such as stochastic noise, are independent of the speech sounds. Consequently, the detrimental effects of external sources on speech perception can be mitigated by using acoustic features specific to the sound sources, such as pitch, timbre, intensity, and spatial cues, to segregate the speech from the distortions. These temporally
synchronized, spectral features characteristic to speech sounds are integrated into a
coherent whole, thus leading to successful perception of speech. In contrast, external
distortions typically have a different spectral structure and are temporally unsynchronized
with speech, and this facilitates the segregation of speech from noise contributions.
4.2 IMPLEMENTATION
The file recorded with my faster speech was located in the ordered list of speakers. Speech degradation was performed by adding Gaussian noise, generated by the MATLAB function randn(), to this file. A comparison was then made between the clean file and the signal with the added Gaussian noise.
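A minimal MATLAB sketch of this degradation step, assuming a hypothetical input file and noise level; randn() generates zero-mean Gaussian noise, which is scaled and added to the clean signal for comparison.

% Degrade a clean speech file by adding scaled Gaussian noise from randn().
[x, fs] = audioread('fast_speech.wav');    % hypothetical input file
x = mean(x, 2);

sigma = 0.05;                              % noise standard deviation (assumed)
noisy = x + sigma * randn(size(x));        % additive white Gaussian noise

subplot(2,1,1); plot(x);     title('Clean speech');
subplot(2,1,2); plot(noisy); title('Speech with Gaussian noise');
sound(noisy, fs);                          % listen to the degraded signal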
Noise plays a vital role in speech degradation, so noise estimation is a major part of the speech recognition task. If the estimated noise is low it will not materially affect the speech signal, but if the noise is high the speech will be distorted and lose intelligibility. Moreover, this process not only helps us compare the clean file with the signal containing added Gaussian noise, it also lets us evaluate which DSP (digital signal processing) filter, such as a Chebyshev or Butterworth filter, works better at removing this Gaussian noise.
Fig 4.1 Block diagram for speech degradation
4.2.1 CONTINUOUS DISTORTIONS IN SPEECH
The speech sounds were degraded by presenting either the quantization noise or
the stochastic noise continuously in the background while measuring auditory evoked
responses to the undistorted speech sound. For the quantization noise, this was
accomplished by first extracting the time-domain distortion waveform created by the
reduction of the amplitude resolution from the degraded speech sound. Because the
undistorted speech sound was a periodic voiced utterance, the distortion waveform (i.e.,
the quantization error signal) was also periodic. Therefore, a continuous version of the
quantization noise could be easily generated by concatenating one cycle of the periodic
distortion waveform. In the case of the continuous stochastic noise, a pseudo-random
number generator was first used to generate two copies of a 10-second white noise
sequence. Finally, two continuous stochastic noise sequences with the duration of 10 min
were obtained by concatenating the corresponding energy-scaled noise sequences.
4.2.2 GAUSSIAN NOISE MODELS
Gaussian white noise models have become increasingly popular as a canonical
type of model in which to address certain statistical problems. We briefly review some
statistical problems formulated in terms of Gaussian "white noise", and pursue a
particular group of problems connected with the estimation of monotone functions. These
new results are related to the recent development of likelihood ratio tests for monotone functions. We conclude with some open problems connected with
multivariate interval censoring.
This is one of the non-parametric methods for speaker identification. When
feature vectors are displayed in d-dimensional feature space after clustering, they somewhat resemble a Gaussian distribution. This means each corresponding cluster can be viewed as a Gaussian probability distribution, and features belonging to the clusters can be best represented by their probability values. The only difficulty lies in the efficient classification
of feature vectors.
Here, we briefly review a slice of the past and current research work on "white
noise models" and we present some results on estimation of a monotone function
observed "in white noise", and study a canonical version of the problem which arises
repeatedly in the asymptotic distribution theory for nonparametric estimators of
monotone functions. We carry through an analogous estimation problem in which some additional knowledge of the monotone function is available, namely its value at one point. This arises naturally when addressing the problem of finding a likelihood ratio test of the hypothesis H : f(t0) = θ0, where f is monotone.
Fig 4.2 Speech file with and without Gaussian noise added to it.
The Gaussian Noise Generator block generates discrete-time white Gaussian
noise. You must specify the Initial seed vector in the simulation.
The Mean Value and the Variance can be either scalars or vectors. If either of
these is a scalar, then the block applies the same value to each element of a sample-based
output or each column of a frame-based output. Individual elements or columns,
respectively, are uncorrelated with each other.
When the Variance is a vector, its length must be the same as that of the Initial
seed vector. In this case, the covariance matrix is a diagonal matrix whose diagonal
elements come from the Variance vector. Since the off-diagonal elements are zero, the
output Gaussian random variables are uncorrelated.
When the Variance is a square matrix, it represents the covariance matrix. Its off-
diagonal elements are the correlations between pairs of output Gaussian random
variables. In this case, the Variance matrix must be positive definite, and it must be N-by-
N, where N is the length of the Initial seed.
The Initial seed parameter initializes the random number generator that the
Gaussian Noise Generator block uses to add noise to the input signal. For best results, the
Initial seed should be a prime number greater than 30. Also, if there are other blocks in a
model that have an Initial seed parameter, you should choose different initial seeds for all
such blocks. You can choose seeds for the Gaussian Noise Generator block using the Communications Blockset randseed function, which returns a random prime number greater than 30. Entering randseed again produces a different prime number. If you supply an integer argument, randseed always returns the same prime for that integer. For example, randseed(5) always returns the same answer.
4.3 INTRODUCTION TO SPEECH ENHANCEMENT
Speech enhancement aims to improve speech quality using various algorithms. "Quality" may sound simple, but it can mean at least clarity and intelligibility, pleasantness, or compatibility with some other method in speech processing. Intelligibility and pleasantness are difficult to measure by any mathematical algorithm, so listening tests are usually employed. However, since arranging listening tests can be expensive, how to predict their results has been widely studied. No single philosopher's stone or minimization criterion has been discovered so far, and hardly ever will be. The
central methods for enhancing speech are the removal of background noise, echo
suppression and the process of artificially bringing certain frequencies into the speech
signal.
We shall focus on the removal of background noise, after briefly discussing what the other methods are about. First of all, every speech measurement performed in a natural environment contains some amount of echo. Echoless speech, measured in a special anechoic room, sounds dry and dull to the human ear. Echo suppression is needed in big halls to enhance the quality of the speech signal, especially if the distance between the microphone and the speaker is large. In current telephone networks, speech is band-limited to 300–3400 Hz. Sooner or later the market will be dominated by third-generation phones in which the frequency band of the speech is, for instance, 50–7500 Hz. The benefit of this wideband speech will be lost unless the entire conversation travels over a wideband network. Artificial bandwidth expansion can be utilized to restore the frequencies that disappear en route. These methods are also useful in speech compression.
When the background noise is suppressed, it is crucial not to harm or garble the speech signal, or at least not very badly. Another thing to remember is that quiet natural background noise sounds more comfortable than even quieter but unnatural, distorted noise. If the speech signal is not intended to be listened to by humans but is instead fed, for instance, to a speech recognizer, then comfort is not the issue; it is then crucial to keep the background noise low. Background noise suppression has many applications. Using a telephone in a noisy environment, such as in the street or in a car, is an obvious one. Traditionally, background noise has been suppressed when sending speech from the cockpit of an airplane to the ground or to the cabin. It is easy to come up with similar examples.
It is also a good idea to enhance speech for coding and recognition purposes.
Speech codecs have been optimized for speech and they usually make the background
noise sound weird. Moreover, enhanced speech can be compressed in fewer bits than
non-enhanced speech. Speech recognition systems, whose operation relies on the features extracted from speech, will be disturbed by extra noise sounds. Active noise suppression is a method in which anti-noise is produced in the listener's ear to cancel the noise. The delay must be kept very small to avoid producing more noise instead of cancelling the existing noise. For this reason, most methods for active noise suppression are fully analog: A/D and D/A conversions inevitably produce some amount of delay. The operation of all the speech enhancement methods in the following sections is based on spectra calculated from adjacent frames of speech. In practice, the frames overlap a little and the frame size is a couple of dozen milliseconds. The windowed speech frame is padded with zeros to make its length equal to the nearest power of two.
4.3.1 BACKGROUND NOISE ESTIMATION
All the speech enhancement methods aimed at suppressing the background noise are, naturally, based in one way or another on an estimate of the background noise. If the background noise evolves more slowly than the speech, i.e., if the noise is more stationary than the speech, it is easy to estimate the noise during pauses in the speech.
Finding the pauses in speech is based on checking how close the estimate of the
background noise is to the signal in the current window. Voiced sections can be located
by estimating the fundamental frequency. Both methods easily fail on unstressed
unvoiced or short phonemes, taking them as background noise. On the other hand, this is
not very dangerous because the effect of these faint phonemes on the background noise
estimate is not that critical.
With a working VAD (voice activity detection) in hand, giving values of zero and one as indicators of the voice activity in each frame, we can update the estimate of the background noise spectrum during the frames that have zero VAD, using the formula

    N̂_i(ω) = α · N̂_(i−1)(ω) + (1 − α) · |X_i(ω)|

where X_i(ω) is the spectrum of the noisy speech in frame i, α is a decay-rate coefficient flattening the spectrum, and the index i refers to the current frame.
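A minimal MATLAB sketch of this update rule, under assumed framing parameters and with a crude energy-based VAD standing in for a real voice activity detector:

% Recursive background-noise spectrum estimate, updated only in speech pauses:
%   Nhat_i(k) = alpha * Nhat_{i-1}(k) + (1 - alpha) * |X_i(k)|
[x, fs] = audioread('noisy_speech.wav');    % hypothetical input file
x = mean(x, 2);

N = 512; hop = 256; alpha = 0.9;            % frame size, hop, decay rate (assumed)
w = hamming(N);
nFrames = floor((length(x) - N) / hop) + 1;

% Crude energy-based VAD stand-in: frames below a threshold count as pauses.
energy = zeros(nFrames, 1);
for i = 1:nFrames
    energy(i) = sum(x((i-1)*hop + (1:N)).^2);
end
vad = energy > 2 * median(energy);          % 1 = speech, 0 = pause (heuristic)

Nhat = zeros(N, 1);                         % running noise magnitude spectrum
for i = 1:nFrames
    Xi = abs(fft(x((i-1)*hop + (1:N)) .* w));
    if ~vad(i)
        Nhat = alpha * Nhat + (1 - alpha) * Xi;  % update during pauses only
    end
end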
4.4 IMPLEMENTATION
The speech enhancement technique builds on the speech degradation step above: its goal is the removal of the Gaussian noise from the original speech wave. In this technique, the degraded signal, i.e., the original signal mixed with Gaussian noise, is first converted to the frequency domain with the help of the FFT tools in MATLAB. The higher-frequency noise components are then removed with a 3rd-order Butterworth low-pass filter, according to the equation

    H(u,v) = 1 / (1 + [D(u,v)/D0]^(2n))

where D(u,v) is the rms value of u and v, D0 determines the cutoff frequency, and n is the filter order. The Butterworth filter is chosen here because it filters the Gaussian noise closely and approximates an ideal low-pass filter ever more closely as the order, n, is increased. The resulting filtered signal was then scaled and plotted against the original noisy signal to compare the filtering result; the general form of the speech-enhancement waveform is shown in the figure below.
Fig 4.3 Comparison of natural and LPF filtered signal
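A minimal MATLAB sketch of the enhancement chain described above, with the input file name and cutoff frequency as assumptions; it inspects the shifted spectrum, then applies a 3rd-order Butterworth low-pass filter.

% Inspect the spectrum of a noisy file, then low-pass filter it (3rd-order Butterworth).
[x, fs] = audioread('noisy_slow_speech.wav');   % hypothetical input file
x = mean(x, 2);
n = length(x);

X = fftshift(fft(x));                           % centered spectrum
f = ((0:n-1) - floor(n/2)) * fs / n;            % correctly scaled frequency axis
plot(f, abs(X)); xlabel('Frequency (Hz)');      % locate the noise band by eye

fc = 3000;                                      % cutoff frequency in Hz (assumed)
[b, a] = butter(3, fc/(fs/2));                  % 3rd-order low-pass, normalized cutoff
y = filter(b, a, x);                            % filtered (enhanced) signal

y = y / max(abs(y));                            % rescale before comparison
sound(y, fs);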
4.4.1 FAST FOURIER TRANSFORMS
The Fast Fourier Transform (FFT) resolves a time waveform into its sinusoidal
components. The FFT takes a block of time-domain data and returns the frequency
spectrum of the data. The FFT is a digital implementation of the Fourier transform. Thus,
the FFT does not yield a continuous spectrum. Instead, the FFT returns a discrete
spectrum, in which the frequency content of the waveform is resolved into a finite
number of frequency lines, or bins.
Number of Samples

The sampled time waveform input to an FFT determines the computed spectrum. If an arbitrary signal is sampled at a rate equal to fs over an acquisition time T, N samples are acquired. Compute T with the following equation:

    T = N / fs

where
T is the acquisition time
N is the number of samples acquired
fs is the sampling frequency

Compute N with the following equation:

    N = T · fs
Frequency Resolution

For the FFT, the spectrum computed from the sampled signal has a frequency resolution df. Calculate the frequency resolution with the following equation:

    df = fs / N = 1 / T
Maximum Resolvable Frequency

The sampling rate of a time waveform determines the maximum resolvable frequency. According to the Shannon sampling theorem, the maximum resolvable frequency is half the sampling frequency. To calculate the maximum resolvable frequency, use the following equation:

    fmax = fNyquist = fs / 2

where
fmax is the maximum resolvable frequency
fNyquist is the Nyquist frequency
fs is the sampling frequency
fftshift:
Y = fftshift(X) rearranges the outputs of fft, fft2, and fftn by moving the zero-
frequency component to the center of the array. It is useful for visualizing a Fourier
transform with the zero-frequency component in the middle of the spectrum. For vectors,
fftshift(X) swaps the left and right halves of X.
Relationship with FFT

fft transforms an input signal into the frequency domain, while fftshift reorganizes the output of fft by moving the zero-frequency component to the center of the spectrum. Thus, fftshift is usually used together with the Fourier transform.
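A short example of fft with fftshift and a matching zero-centered frequency vector, using an assumed test tone:

% Visualize a spectrum with the zero-frequency component centered via fftshift.
fs = 8000;                            % sampling rate (assumed)
t  = (0:fs-1) / fs;                   % one second of samples
x  = sin(2*pi*440*t);                 % 440 Hz test tone

n = length(x);
X = fftshift(fft(x));                 % move zero frequency to the center
f = ((0:n-1) - floor(n/2)) * fs / n;  % matching centered frequency axis

plot(f, abs(X)/n); xlabel('Frequency (Hz)'); ylabel('|X(f)|');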
4.4.2 BUTTERWORTH FILTERS
Here we describe the commonly used nth-order Butterworth low-pass filter. First we show how to use known design specifications to determine the filter order and 3 dB cutoff frequency. Then we show how to determine the filter poles and the filter transfer function. Along the way, we describe common MATLAB Signal Processing Toolbox functions that are useful in designing Butterworth low-pass filters.
Fig 4.4 Frequency response of a Butterworth filter of order n.
A Butterworth filter has the maximally flat response in the passband. At the cutoff frequency, ωc, the attenuation is −3 dB. Above the −3 dB point the attenuation is relatively steep, with a roll-off of −20 dB/decade/pole. The figure above shows the frequency response of such a filter.
The poles of a Butterworth filter are located on a circle of radius ωc and are spaced apart by an angle of 180°/n, where n is the order of the filter (the number of poles). The first pole is located 180°/(2n) from the jω axis, as shown in the figure below.
Fig 4.5 Poles of a Butterworth filter.
Syntax
[n,Wn] = buttord(Wp,Ws,Rp,Rs)
[n,Wn] = buttord(Wp,Ws,Rp,Rs,'s')
Description
buttord calculates the minimum order of a digital or analog Butterworth filter
required to meet a set of filter design specifications.
Digital Domain
[n,Wn] = buttord(Wp,Ws,Rp,Rs) returns the lowest order, n, of the digital Butterworth filter that loses no more than Rp dB in the passband and has at least Rs dB of attenuation in the stopband. The scalar (or vector) of corresponding cutoff frequencies, Wn, is also returned. Use the output arguments n and Wn in butter.
SRINIVASA INSTITUTE OF ENGINEERING & TECHNOLOGY Page 35
TO RECOGNIZE VOICE FREQUENCY OF THE SPEAKER
Table 4.1: Description of stopband and passband filter parameters

Wp — Passband corner frequency. Wp, the cutoff frequency, is a scalar or a two-element vector with values between 0 and 1, with 1 corresponding to the normalized Nyquist frequency, π radians per sample.
Ws — Stopband corner frequency. Ws is a scalar or a two-element vector with values between 0 and 1, with 1 corresponding to the normalized Nyquist frequency.
Rp — Passband ripple, in decibels. This value is the maximum permissible passband loss in decibels.
Rs — Stopband attenuation, in decibels. This value is the number of decibels the stopband is down from the passband.
Analog Domain
[n,Wn] = buttord(Wp,Ws,Rp,Rs,'s') finds the minimum order n and cutoff frequencies Wn for an analog Butterworth filter. You specify the frequencies Wp and Ws similarly to those described in Table 4.1 above, except that in this case you specify the frequency in radians per second, and the passband or the stopband can be infinite.
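For example, the following hypothetical specification asks for at most 1 dB of passband loss up to 0.3 times the Nyquist frequency and at least 40 dB of attenuation beyond 0.5 times the Nyquist frequency, then designs the filter with butter:

% Pick the minimum Butterworth order for a given spec, then design the filter.
Wp = 0.3;  Ws = 0.5;        % passband / stopband edges (x Nyquist, assumed spec)
Rp = 1;    Rs = 40;         % passband ripple / stopband attenuation in dB

[n, Wn] = buttord(Wp, Ws, Rp, Rs);   % minimum order and cutoff
[b, a]  = butter(n, Wn);             % digital low-pass coefficients
freqz(b, a);                         % inspect the frequency response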
CHAPTER 5
________________________________________________________________________________________________
PITCH ANALYSIS
&
FORMANT ANALYSIS
5.1 INTRODUCTION TO PITCH ANALYSIS
At a linguistic level, speech can be viewed as a sequence of basic sound units
called phonemes. A phoneme is a sound or group of different sounds perceived to have
the same function by the speakers of a language. An example of a phoneme is the /k/ sound in the words kit and skill. The same phoneme may give rise to many different sounds, or
allophones at the acoustic level, depending on the phonemes which surround it. Different
speakers producing the same string of phonemes convey the same information yet sound
different as a result of differences in dialect and vocal tract length and shape.
Speech is a means of communication and the exchange of thoughts between individuals. The spoken word comprises vowels and consonants, which are the basic speech sound units. The speaker characteristics are identified from speech data and analyzed using suitable analysis techniques. The analysis technique aims at selecting a proper frame size, with some overlap, and extracting the relevant features from the speech. Much study has been carried out to investigate acoustic indicators for detecting emotions in speech. The characteristics most commonly considered include the fundamental frequency F0, duration, intensity, spectral variation, and wavelet-based features. Here, linear feature extraction techniques and their extraction algorithms are explained. These features are then used to identify whether the person is in a neutral, happy, or sad emotional state.
There is a substantial amount of work on the fundamental frequency of the voice (F0) in the speech of speakers who differ in age and sex. The data reported nearly always include an average measure of F0, usually expressed in Hz. Typical values obtained for F0 are 120 Hz for men and 200 Hz for women. The mean values change slightly with age. Many methods and algorithms are in use for pitch detection, divided into two main camps: time-domain analysis and frequency-domain analysis.
Pitch in terms of speech analysis can be defined as a technique which allows the
ordering of sounds on a frequency-related scale. Pitch analysis helps us in identifying the
state of speech of a person.
The considered states are neutral, happy, and sad, so it is very important to understand the concept of pitch analysis. The proposed system describes a technique that involves extracting the basic parameters of pitch analysis. The average pitch of each .wav speech file recorded in our database of different speakers was computed and found to have a characteristic value, which can be used in voice recognition: differences in average pitch can be used to characterize a voice file.
5.2 PITCH ANALYSIS
Pitch is defined as the fundamental frequency of the excitation source. Hence an
efficient pitch extractor and an accurate pitch estimate calculation can be used in an
algorithm for gender identification. A pitch detection algorithm (PDA) is an algorithm
designed to estimate the pitch or fundamental frequency of a digital recording of speech
or a musical note or tone. This can be done in the time domain or the frequency domain.
Time-domain signals were converted to the frequency domain for better analysis and noise removal. Noise was removed by the application of a low-pass filter (here, a Butterworth low-pass filter).
Fig 5.1 Time domain plots of female and male voice sample
Pitch analysis was conducted and relevant parameters were extracted and the
graph of pitch contour versus time frame was also plotted to see how the pitch varies over
the speech signal. The average pitch of the speech signal was calculated.
All the convolutions computed during this analysis were based on FFT/IFFT
algorithm implemented in MATLAB software. Appropriate rectangular windows were
designed and used for the analysis.
A large set of methods has been developed in the speech processing area for the estimation of pitch. Among them, the three most used are autocorrelation of speech, cepstrum pitch determination, and simplified inverse filter tracking (SIFT) pitch estimation. The success of these methods is due to the simple steps involved in the estimation of pitch. Even though the autocorrelation method is primarily of theoretical interest, it provides the framework for the SIFT method.
Fig 5.2 Pitch contour plot
The main limitation of pitch estimation by the autocorrelation of speech is that there may be peaks larger than the peak corresponding to the pitch period T0, caused by the resonances of the vocal tract. As a result, the wrong peaks may be picked and hence the pitch wrongly estimated. The approach to minimizing such errors is to separate the vocal-tract and excitation-source related information in the speech signal and then use the source information for pitch estimation.
5.3 PITCH AUTOCORRELATION
The correlation between two waveforms is a measure of their similarity. The
waveforms are compared at different time intervals, and their “sameness” is calculated at
each interval. The result of a correlation is a measure of similarity as a function of time
lag between the beginnings of the two waveforms. The autocorrelation function is the
correlation of a waveform with itself. One would expect exact similarity at a time lag of
zero, with increasing dissimilarity as the time lag increases.
The pitch extraction process of a speech signal can be based on computing the short-time autocorrelation function of the speech signal. The short-time autocorrelation of a speech signal is given by

    Rn(k) = Σ_m x(m) w(n − m) x(m + k) w(n − m − k)

where
Rn(k) = short-time autocorrelation
x = speech signal
w = window
k = lag (sample offset) at which the autocorrelation is calculated
For voiced segments of speech, the Short-Time Auto-correlation function shows
periodicity of the speech. Rn(k) decreases with k as the computation process continues.
A commonly used method to estimate the pitch (fundamental frequency) is based on detecting the highest value of the autocorrelation function (ACF) in the region of interest. For a given discrete-time signal x(n), defined for all n, the autocorrelation function is generally defined as

    Rx(τ) = lim_(N→∞) [1/(2N + 1)] Σ_(n=−N..N) x(n) x(n + τ)

If x(n) is assumed to be exactly periodic with period P, i.e., x(n) = x(n + P) for all n, then it is easy to show that the autocorrelation is also periodic with the same period: Rx(τ) = Rx(τ + P). Conversely, periodicity in the autocorrelation function indicates periodicity in the signal. For non-stationary signals, such as speech, the concept of a long-time autocorrelation measurement as given by the above equation is not really suitable. In practice, short speech segments consisting of only N samples are operated on; that is why the short-time autocorrelation function, given by the equation below, is used instead:

    R(τ) = Σ_(n=0..N−1−τ) x(n) x(n + τ),   τ = 0, 1, ..., T − 1

where N is the length of the analyzed frame and T is the number of autocorrelation points to be computed.
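A minimal MATLAB sketch of autocorrelation-based pitch estimation on a single frame, with the input file, frame choice, and pitch search range as assumptions; it picks the lag of the strongest autocorrelation peak inside a plausible pitch range.

% Estimate pitch of one frame from the peak of its short-time autocorrelation.
[x, fs] = audioread('slow_speech.wav');     % hypothetical input file
x = mean(x, 2);

N = round(0.04 * fs);                       % 40 ms analysis frame
frame = x(1:N) .* hamming(N);               % windowed segment (assumed voiced)

R = xcorr(frame, 'coeff');                  % normalized autocorrelation
R = R(N:end);                               % keep non-negative lags; R(1) = lag 0

minLag = floor(fs / 400);                   % search a 50-400 Hz pitch range
maxLag = ceil(fs / 50);
[~, k] = max(R(minLag+1:maxLag+1));         % strongest peak in the range
pitchHz = fs / (minLag + k - 1)             % lag of the peak -> F0 estimate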
5.4 INTRODUCTION TO FORMANT ANALYSIS
Formant frequencies have rarely been used as acoustic features for speech
recognition, in spite of their phonetic significance. For some speech sounds one or more
of the formants may be so badly defined that it is not useful to attempt a frequency
measurement. Also, it is often difficult to decide which formant labels to attach to
particular spectral peaks. The proposed system describes a method of formant analysis that includes techniques to overcome both of the above difficulties. Using the same data and HMM model structure, results are compared between a recognizer using conventional cepstrum features and one using three formant frequencies combined with fewer cepstrum features to represent general spectral trends.
It has been known for many years that formant frequencies are important in
determining the phonetic content of speech sounds. Several authors have therefore
investigated formant frequencies as speech recognition features, using various methods for the basic analysis, such as linear prediction, analysis-by-synthesis with Fourier spectra, and peak picking on cepstrally smoothed spectra. However, using formants for recognition
can sometimes cause problems, and they have not yet been widely adopted. It is obvious,
for example, that formant frequencies cannot discriminate between speech sounds for
which the main differences are unrelated to formants. Thus they are unable to distinguish
between speech and silence or between vowels and weak fricatives. Whenever any
formants are poorly defined in the signal (e.g. in fricatives), measurements will be
unreliable, and it is therefore essential that their estimated frequencies should be given
little weight in the recognition process.
It is impossible to determine from the spectrum of some speech sounds whether a
particular peak should be associated with one formant or with a pair, and sometimes a
formant may be so weak as a consequence of weak excitation that it causes no peak in the
spectrum. Either of these situations can cause all higher-frequency formants to be
wrongly labeled, with disastrous effects on the recognition. In such cases alternative
labellings must be produced, and any uncertainties that cannot be resolved in other ways
must be resolved within the recognition algorithm.
To be useful as features for automatic speech recognition, formant frequencies
must be supplemented by signal level and general spectral shape information, such as that provided by low-order cepstrum features. However, whenever the speech
spectrum has a peaky structure, the phonetic detail is better described by formant
frequencies than by the more usual higher-order cepstrum features, which have no simple
relationship with formant frequencies.
5.5 IMPLEMENTATION
Formants, in plain language, can be defined as the spectral peaks of the sound
spectrum. Using the pitch and formant analysis discussed above, a waveform comparison
code was written in MATLAB. Based on this code, speech waveform files can be easily
characterized. In this process a reference .wav file is used, which is then compared with
the remaining .wav files. A sorting routine then sorts and compares the average pitch of
the reference file with the other five .wav files. The technique further compares the
formant vector of the reference file with all the .wav files: it sorts for the top three
average-pitch correlations and then re-sorts those files by formant-vector correlation,
and likewise sorts the top formant-vector correlations by average pitch. In this way the
speaker can be recognized; a sketch of this two-stage comparison follows.
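A minimal sketch of the two-stage comparison, assuming the pitch and formant helpers listed in the appendix (wavread is used as in those listings; newer MATLAB releases use audioread). The file names and the shortlist size of three are illustrative assumptions.

% Rank candidate files by average pitch, then re-rank the shortlist
% by formant-vector distance.
files = {'a.wav', 'b.wav', 'c.wav', 'd.wav', 'e.wav'};   % hypothetical names
[yref, fsref] = wavread('reference.wav');                % hypothetical reference
[~, ~, refPitch] = pitch(yref, fsref);      % average pitch of the reference
[~, ~, refFmt] = formant(yref);             % formant peak positions
pitchDiff = zeros(numel(files), 1);
for i = 1:numel(files)
    [y, fs] = wavread(files{i});
    [~, ~, avgF0] = pitch(y, fs);
    pitchDiff(i) = abs(avgF0 - refPitch);   % pitch distance to the reference
end
[~, order] = sort(pitchDiff);               % ascending: best matches first
top = order(1:3);                           % keep the top three pitch matches
fmtDiff = zeros(numel(top), 1);
for j = 1:numel(top)
    y = wavread(files{top(j)});
    [~, ~, I] = formant(y);
    fmtDiff(j) = norm(refFmt - I);          % formant-vector distance
end
[~, best] = sort(fmtDiff);                  % re-rank the shortlist
bestMatch = files{top(best(1))}             % most likely matching file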
The formant analysis technique is performed on any of the .wav speech files taken
from the set of recorded speech signals. A MATLAB code was prepared for the formant
analysis. With the help of this code the first five formants present in a .wav speech file
can be calculated; the differences between the vector peak positions of these five
formants, and the vector positions of the peaks in the power spectral density, are also
easily calculated and can be used to characterize the speech file. The general waveform
of the formant analysis is shown in the figure below.
Formant analysis was performed on the slow speech file. The first five peaks in
the power spectral density were returned and the first three can be seen. Also, the vector
positions of the peaks in the power spectral density were calculated and can be used to
characterize a particular voice file. This technique is used in the waveform comparison
section, and a peak-picking sketch is given after the figure.
Fig 5.3 Plot of the first few formants of a speech file
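As an illustration, a minimal peak-picking sketch, assuming the Signal Processing Toolbox; the model order 12, FFT length 1024, and file name are assumptions, not values from the project code.

% Locate the first five formant peaks in a Yule-Walker PSD estimate.
[y, fs] = wavread('slow.wav');              % hypothetical file name
[Pxx, f] = pyulear(y, 12, 1024, fs);        % PSD over [0, fs/2]
[pks, locs] = findpeaks(10*log10(Pxx));     % local maxima of the PSD in dB
n = min(5, numel(locs));                    % keep up to five peaks
plot(f, 10*log10(Pxx), f(locs(1:n)), pks(1:n), 'r*');
xlabel('frequency (Hz)');
ylabel('amplitude (dB)');
f(locs(1:n))                                % candidate formant frequencies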
5.5.1 EXTRACTION OF FORMANT FREQUENCIES
Formants are defined as the spectral peaks of the sound spectrum of a person's
voice. In speech science and phonetics, formant frequencies refer to the acoustic
resonances of the human vocal tract. They are often measured as amplitude peaks in the
frequency spectrum of the sound wave. We have considered the first three formants f1,
f2, and f3 for the analysis of emotions. For different vowels, the range of f1 lies between
270 and 730 Hz, while the ranges of f2 and f3 lie between 840 and 2290 Hz and between
1690 and 3010 Hz respectively. Formant frequencies are very important in the analysis
of the emotional state of a person.
The linear predictive coding (LPC) technique has been used for estimation of the
formant frequencies. The analog signal is converted to .wav digital format. The signal is
transformed to the frequency domain using the FFT and the power spectrum is then
calculated. The signal is then passed through a linear predictive filter (LPC) with
11 coefficients and the absolute values are considered. The roots of the LPC polynomial
are obtained, which contain both real and imaginary parts. A sketch of this procedure is
given below.
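A minimal sketch of this procedure, assuming the Signal Processing Toolbox; the file name and the 30 ms frame length are assumptions, while the 11 coefficients follow the text.

% Estimate formant frequencies from the roots of an 11th-order LPC polynomial.
[x, fs] = wavread('speech.wav');            % hypothetical file name
x = x(1:round(0.03*fs));                    % a 30 ms analysis frame (assumption)
x = x .* hamming(length(x));                % window the frame
a = lpc(x, 11);                             % all-pole model with 11 coefficients
rts = roots(a);                             % complex roots: real and imaginary parts
rts = rts(imag(rts) > 0);                   % keep one root of each conjugate pair
frqs = sort(atan2(imag(rts), real(rts)) * fs/(2*pi));   % root angles -> Hz
frqs(1:min(5, end))                         % first five candidate formants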
The phase spectrum is then displayed, which clearly shows the formant
frequencies. The first five formant frequencies are displayed in the graph. The figure
below shows the formant frequency plot along with the original speech signal. The five
formant frequencies obtained are 230 Hz, 800 Hz, 1684 Hz, 2552 Hz, and 3159 Hz respectively.
Fig 5.4 Speech signal and its formants.
5.5.2 YULE-WALKER POWER SPECTRAL DENSITY
The Yule-Walker Method estimates the power spectral density (PSD) of the input
using the Yule-Walker AR method. This method, also called the autocorrelation method,
fits an autoregressive (AR) model to the windowed input data. It does so by minimizing
the forward prediction error in the least squares sense. This formulation leads to the Yule-
Walker equations, which the Levinson-Durbin recursion solves. Block outputs are always
nonsingular.
The input is a sample-based vector (row, column, or 1-D) or frame-based vector
(column only). This input represents a frame of consecutive time samples from a single-
channel signal. The block outputs a column vector containing the estimate of the power
spectral density of the signal at Nfft equally spaced frequency points. The frequency
points are in the range [0,Fs), where Fs is the sampling frequency of the signal.
Pyulear estimates the power spectral density (PSD) of the signal vector x[n] using
the Yule-Walker AR method. This method, also called the autocorrelation method, fits an
autoregressive (AR) model to the signal by minimizing the forward prediction error in the
least-squares sense. This formulation leads to the Yule-Walker equations, which are
solved by the Levinson-Durbin recursion. The spectral estimate returned by pyulear is the
magnitude squared frequency response of this AR model. The correct choice of the model
order p is important.
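For instance, a minimal usage sketch (the model order 12, FFT length 512, and file name are assumptions):

% Estimate and plot the PSD of a speech file with the Yule-Walker method.
[x, fs] = wavread('speech.wav');            % hypothetical file name
[Pxx, f] = pyulear(x, 12, 512, fs);         % PSD at 512 points over [0, fs/2]
plot(f, 10*log10(Pxx));                     % display in dB
xlabel('frequency (Hz)');
ylabel('power/frequency (dB/Hz)');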
Pxx = pyulear(x,p,nfft) returns Pxx, the power spectrum estimate. x is the input
signal, p is the model order for the all-pole filter, and nfft is the FFT length. Pxx has
length (nfft/2 + 1) for nfft even, (nfft + 1)/2 for nfft odd, and nfft if x is complex.
[Pxx,freq] = pyulear(x,p,nfft) returns Pxx, the power spectrum estimate, and freq,
a vector of frequencies at which the PSD was estimated. If the input signal is real-valued,
the range of freq is [0, π]. If the input signal is complex, the range of freq is [0, 2π).
[Pxx,freq] = pyulear(x,p,nfft,Fs) uses the signal's sampling frequency, Fs, to scale
both the PSD vector (Pxx) and the frequency vector (freq). Pxx is scaled by 1/Fs. If the
input signal is real-valued, the range of freq is [0,Fs/2]. If the input signal is complex, the
range of freq is [0,Fs]. Fs defaults to 1 if left empty.
[Pxx,freq] = pyulear(x,p,nfft,Fs,'range') specifies the range of frequency values to
include in freq. range can be:
half, to compute the PSD over the range [0,Fs/2] for real x, and [0,Fs] for
complex x. If Fs is left blank, the range is [0,1/2] for real x, and [0,1] for
complex x. If Fs is omitted entirely, the range is [0,pi] for real x, and [0,2*pi] for
complex x. half is the default range.
whole, to compute the PSD over the range [0,Fs] for all x. If Fs is left blank, the
range is [0,1] for all x. If Fs is omitted entirely, the range is [0,2*pi] for all x.
pyulear(...) plots the power spectral density in the first available figure window. The
frequency range on the plot is the same as the range of output freq for a given set of
parameters.
pyulear(...,'squared') plots the PSD directly, rather than converting the values to dB.
The following table indicates the length of Pxx and the range of the corresponding
normalized frequencies for this syntax.
Table 5.1 PSD Vector Characteristics for an FFT Length of 256 (Default)

Real/Complex Input Data    Length of Pxx    Range of the Corresponding Normalized Frequencies
Real-valued                129              [0, π]
Complex-valued             256              [0, 2π)
[Pxx,w] = pyulear(x,p) also returns w, a vector of frequencies at which the PSD is
estimated. Pxx and w have the same length. The units for frequency are rad/sample.
[Pxx,w] = pyulear(x,p,nfft) uses the Yule-Walker method to estimate the PSD
while specifying the length of the FFT with the integer nfft. If you specify nfft as the
empty vector [], it adopts the default value of 256.
The length of Pxx and the frequency range for w depend on nfft and the values of
the input x. The following table indicates the length of Pxx and the frequency range
for w for this syntax.
Table 5.2 PSD and Frequency Vector Characteristics

Real/Complex Input Data    nfft Even/Odd    Length of Pxx    Range of w
Real-valued                Even             (nfft/2 + 1)     [0, π]
Real-valued                Odd              (nfft + 1)/2     [0, π)
Complex-valued             Even or odd      nfft             [0, 2π)
Power spectral density estimation proceeds by determining the parameters of an
autoregressive model from the Yule-Walker equations, solved by the Levinson-Durbin
recursion.
The signal to be analyzed is assumed to be generated by a white-noise stimulus
e[n] driving an all-pole linear process with parameters ak, so that

x[n] = e[n] - a1*x[n-1] - a2*x[n-2] - ... - ap*x[n-p]
The Yule-Walker method should not be used as a means of autoregressive
parameter estimation if the autocovariance matrix is poorly conditioned. In that case
even a relatively small bias in the covariance estimate can lead to a large deviation in the
estimated parameters, resulting in an invalid model. A poorly conditioned autocovariance
matrix also implies pole locations near the unit circle, as a result of which the
autoregressive process exhibits an almost non-stationary, pseudo-periodic behavior. The
variance of the stochastic process will then be large, since the innovation process is not
identically zero, as it would be for a purely harmonic process.
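As a rough check, the estimated poles can be inspected directly; a minimal sketch, assuming the Signal Processing Toolbox's aryule function and an order of 12 (both assumptions):

% Fit an AR model by the Yule-Walker method and inspect the pole radii;
% poles very close to the unit circle warn of a nearly unstable model.
[x, fs] = wavread('speech.wav');            % hypothetical file name
a = aryule(x, 12);                          % Yule-Walker AR coefficients
r = abs(roots(a));                          % pole radii
max(r)                                      % values near 1 indicate pseudo-periodic behavior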
CHAPTER 6
------------------------------------------------------------------------------------------------------------------------------------------------
WAVEFORM COMPARISON
6.1 WAVEFORM COMPARISON
Using the results and information learned from pitch and formant analysis, speech
waveform files can be characterized based on various criteria. Average pitch and formant
peak position vectors are two such criteria that can be used to characterize a speech file.
The slow speech file was used as a reference file. Four sorting routines were then written
to compare the files. The sorting routines performed the following functions: sort and
compare the average pitch of the reference file with all 83 wav files; compare the formant
vector of the reference file with all wav files; sort for the top 20 average-pitch
correlations and then sort these files by formant vectors; and finally sort for the top 20
formant-vector correlations and then sort these by average pitch.
Fig 6.1 Comparison of PSD wave files
In order to create a speech recognition algorithm, criteria to compare speech files
must be established. This section of the project compares four different methods of
comparing the data. First, the wav files are compared to a reference file and sorted based
on the average pitch of the file only. The files were then compared and sorted based
entirely on the location of the formants present in the PSD of the signal. A third method
compared the average pitch present, ranked the matches in ascending order, and then
compared the top 12 most likely matches by formant location in the PSD. Finally, the
inverse routine was performed, where the files were compared and sorted by the location
of the formants present, and then the top 12 most likely matches based on this data were
compared and sorted by pitch.
APPLICATIONS
The applications of speaker recognition technology are quite varied and continually
growing. Below is an outline of some broad areas where speaker recognition technology
has been or is currently used.
Access Control: Originally for physical facilities; more recent applications control
access to computer networks and provide automated password-reset services.
Transaction Authentication: For telephone banking, in addition to account access
control, higher levels of verification can be used for more sensitive transactions. More
recent applications are in user verification for remote electronic and mobile purchases.
Law Enforcement: Applications include home-parole monitoring and prison call
monitoring. There has also been discussion of using automatic systems to corroborate
spectral inspections of voice samples for forensic analysis.
Speech Data Management: Voice-mail browsing and intelligent answering machines
use speaker recognition to label incoming voice mail with the speaker's name for
browsing and action. Speech-skimming and audio-mining applications annotate recorded
meetings or video with speaker labels for quick indexing and filing.
FUTURE SCOPE
1. Controlling of devices through voice recognition using MATLAB.
2. Speech recognition using digital signal processing.
3. Automatic recognition of the speaker.
4. Gender recognition based on pitch using MATLAB.
APPENDIX
MATLAB FUNCTIONS
INTRODUCTION
The name MATLAB stands for Matrix Laboratory. MATLAB was originally
written to provide easy access to matrix software developed by the LINPACK (linear
system package) and EISPACK (Eigen system package) projects. MATLAB is a high-
performance language for technical computing. It integrates computation, visualization,
and a programming environment. Furthermore, MATLAB is a modern programming
language environment: it has sophisticated data structures, contains built-in editing and
debugging tools, and supports object-oriented programming. These factors make it an
excellent tool compared to conventional computer languages (e.g., C, FORTRAN) for
solving technical problems. MATLAB is an interactive system whose basic data element
is an array that does not require dimensioning. The software package has been
commercially available since 1984 and is now considered a standard tool at most
universities and industries worldwide.
It has powerful built-in routines that enable a very wide variety of computations.
It also has easy-to-use graphics commands that make the visualization of results
immediately available. Specific applications are collected in packages referred to as
toolboxes. There are toolboxes for signal processing, symbolic computation, control
theory, simulation, optimization, and several other fields of applied science and
engineering.
6.2.2 FUNCTIONS USED IN MATLAB
1. Reading a sound file
To process a sound file, we first need to read its samples into a vector. Let 'y' be the vector. The command is
[y, fs, bps] = wavread('path of the file');
Description
This command stores the samples of the sound file in the vector y. The term 'fs' stores the sampling frequency of the file and 'bps' is the bits per sample. These two values are needed to reconstruct the .wav file using the 'wavwrite' function.
2. Playing a sound file
Syntax
sound(y, Fs)
sound(y, Fs, bits)
Description
sound(y, Fs) sends audio signal y to the speaker at sample rate Fs. If you do not
specify a sample rate, sound plays at 8192 Hz. For single-channel (mono) audio, y is an
m-by-1 column vector, where m is the number of audio samples. If your system supports
stereo playback, y can be an m-by-2 matrix, where the first column corresponds to the
left channel, and the second column corresponds to the right channel. The sound function
assumes that y contains floating-point numbers between -1 and 1, and clips values outside
that range.
3. Generation of Gaussian Noise
Randn
Normally distributed pseudorandom numbers
Syntax
r = randn(n)
randn(size(A))
Description
r = randn(n) returns an n-by-n matrix containing pseudorandom values drawn
from the standard normal distribution. randn (with no arguments) returns a scalar.
randn(size(A)) returns an array the same size as A.
The sequence of numbers produced by randn is determined by the internal state of
the uniform pseudorandom number generator that underlies rand, randi, and randn. randn
uses one or more uniform values from that default stream to generate each normal value.
Control the default stream using its properties and methods. See RandStream for details
about the default stream.
Resetting the default stream to the same fixed state allows computations to be
repeated. Setting the stream to different states leads to unique computations, however, it
does not improve any statistical properties. Since the random number generator is
initialized to the same state every time MATLAB software starts up, rand, randn, and
randi will generate the same sequence of numbers in each session until the state is
changed.
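For example, a minimal reproducibility sketch (newer MATLAB releases expose this through rng; the RandStream interface described above also works):

% Reset the generator, draw noise, reset again, and confirm the draws match.
rng('default');
a = randn(1,3);
rng('default');
b = randn(1,3);
isequal(a, b)                               % returns logical 1: identical sequences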
4. Applying Fast Fourier Transform
Syntax
Y = fft(X)
Y = fft(X,n)
Description
Y = fft(X) returns the discrete Fourier transform (DFT) of vector X, computed
with a fast Fourier transform (FFT) algorithm. If X is a matrix, fft returns the Fourier
transform of each column of the matrix. If X is a multidimensional array, fft operates on
the first nonsingleton dimension. Y = fft(X, n) returns the n-point DFT. If the length of
X is less than n, X is padded with trailing zeros to length n. If the length of X is greater
than n, the sequence X is truncated. When X is a matrix, the lengths of the columns are
adjusted in the same manner. Y = fft(X, [], dim) and Y = fft(X, n, dim) apply the FFT
operation across the dimension dim.
5. Shifting of Fast Fourier Transform
Shift zero-frequency component to center of spectrum
Syntax
Y = fftshift(X)
Y = fftshift(X,dim)
Description
Y = fftshift(X) rearranges the outputs of fft, fft2, and fftn by moving the zero-
frequency component to the center of the array. It is useful for visualizing a Fourier
transform with the zero-frequency component in the middle of the spectrum.
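For example, for a small even-length vector:

% The zero-frequency term moves from the first position to the centre.
fftshift([0 1 2 3])                         % returns [2 3 0 1]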
6. Absolute value and complex magnitude
Syntax
abs(X)
Description
abs(X) returns an array Y such that each element of Y is the absolute value of the
corresponding element of X.
If X is complex, abs(X) returns the complex modulus (magnitude), which is the
same as
sqrt(real(X).^2 + imag(X).^2)
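For example:

% The complex magnitude of 3+4i is 5.
abs(3 + 4i)                                 % returns 5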
7. Cross-correlation
Syntax
c = xcorr(x,y)
c = xcorr(x)
Description
xcorr estimates the cross-correlation sequence of a random process.
Autocorrelation is handled as a special case. xcorr must estimate the sequence because, in
practice, only a finite segment of one realization of the infinite-length random process is
available.
c = xcorr(x,y) returns the cross-correlation sequence in a length 2*N-1 vector,
where x and y are length N vectors (N>1). If x and y are not the same length, the shorter
vector is zero-padded to the length of the longer vector.
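For instance, a short sketch of estimating a pitch frequency from the strongest autocorrelation peak; the 60-320 Hz search range mirrors the pitchacorr listing in the appendix, and the file name is an assumption.

% Estimate pitch by locating the strongest autocorrelation peak within
% the expected pitch-lag range.
[x, fs] = wavread('speech.wav');            % hypothetical file name
x = x - mean(x);                            % remove the DC offset
c = xcorr(x);                               % autocorrelation, length 2N-1
mid = length(x);                            % zero-lag index
lo = floor(fs/320);                         % shortest lag (320 Hz)
hi = floor(fs/60);                          % longest lag (60 Hz)
[cmax, imax] = max(c(mid+lo : mid+hi));     % strongest peak in the range
f0 = fs/(imax + lo - 1)                     % estimated pitch in Hz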
8. Floor
Round toward negative infinity
Syntax
B = floor(A)
Description
B = floor(A) rounds the elements of A to the nearest integers less than or equal to
A. For complex A, the imaginary and real parts are rounded independently.
9. PSD using Yule-Walker AR method
Syntax
Pxx = pyulear(x,p)
Pxx = pyulear(x,p,nfft)
Description
Pxx = pyulear(x,p) implements the Yule-Walker algorithm, a parametric spectral
estimation method, and returns Pxx, an estimate of the power spectral density (PSD) of
the vector x. The entries of x represent samples of a discrete-time signal. p is the integer
specifying the order of the autoregressive (AR) prediction model for the signal, used in
estimating the PSD; this estimate is also a maximum-entropy estimate. The power
spectral density is calculated in units of power per radian per sample. Real-valued
inputs produce full-power one-sided (in frequency) PSDs (by default), while complex-
valued inputs produce two-sided PSDs.
APPENDIX A: SPEECH EDITING
clc;
clear all;
close all;
% Read the recording and play it back
[y, fs, nbits] = wavread('bhanu');
sound(y, fs);
% Time axis in seconds
t = 0:1/fs:length(y)/fs-1/fs;
subplot(211)
plot(t, y)
xlabel('time in seconds');
ylabel('amplitude');
% Split the signal into two halves (sample indices specific to this recording)
yfirst = y(1:77125);
ysecond = y(77126:154350);
% Save the halves in swapped order, then reload them as one edited signal
save darren ysecond yfirst -ascii
load darren -ascii
subplot(212)
plot(t, darren, 'r')
xlabel('time in seconds');
ylabel('amplitude');
pause(2)
% Play the edited speech
sound(darren, fs);
APPENDIX B: SPEECH DEGRADATION
clc;
clear all;
close all;
% Read the recording
[y, fs, nbits] = wavread('prudhvi');
t = 0:1/fs:length(y)/fs-1/fs;
subplot(311)
plot(t, y)
xlabel('time in seconds');
ylabel('amplitude');
% Generate zero-mean Gaussian noise and add it to the speech
sigma = 0.02;
mu = 0;
n = randn(size(y))*sigma + mu*ones(size(y));
signal = n + y;
% Spectra of the clean and degraded signals, centred with fftshift
yfft = fft(y);
xfft = fft(signal);
f = -length(y)/2:length(y)/2-1;
ysfft = fftshift(yfft);
xsfft = fftshift(xfft);
subplot(312)
plot(f, abs(ysfft), 'r');
xlabel('frequency in Hz');
ylabel('amplitude');
subplot(313)
plot(f, abs(xsfft), 'g');
xlabel('frequency in Hz');
ylabel('amplitude');
APPENDIX C: SPEECH ENHANCEMENT
clc;
clear all;
close all;
% Read the recording and play it back
[y, fs, nbits] = wavread('surya');
t = 0:1/fs:length(y)/fs-1/fs;
subplot(311)
plot(t, y)
xlabel('time in seconds');
ylabel('amplitude');
sound(y, fs)
% Centred magnitude spectrum of the speech
yfft = fft(y);
f = -length(y)/2:length(y)/2-1;
ysfft = fftshift(yfft);
subplot(312)
plot(f, abs(ysfft), 'r');
xlabel('frequency in Hz');
ylabel('amplitude');
% Low-pass Butterworth filter applied to the spectrum
order = 3;
cut = 0.05;
[B, A] = butter(order, cut);
filtersignal = filter(B, A, ysfft);
subplot(313)
plot(f, 21*abs(filtersignal));   % fixed gain factor for display
xlabel('frequency in Hz');
ylabel('amplitude');
APPENDIX D: PITCH ANALYSIS
clc;
clear all;
close all;
% Read the recording and compute its pitch contour
[y, fs, nbits] = wavread('bhanu');
[t, f0, avgF0] = pitch(y, fs);
plot(t, f0)
xlabel('time frame');
ylabel('pitch (Hz)');
avgF0          % display the average pitch
sound(y, fs);
PITCH AUTOCORRELATION
function [f0] = pitchacorr(len, fs, xseg)
% Estimate the pitch of one frame by centre-clipped autocorrelation.
% Low-pass filter the frame to keep the pitch range (below 900 Hz)
[bf0, af0] = butter(4, 900/(fs/2));
xseg = filter(bf0, af0, xseg);
% Clipping level: 68% of the smaller of the first- and last-third maxima
i13 = floor(len/3);
maxi1 = max(abs(xseg(1:i13)));
i23 = floor(2*len/3);
maxi2 = max(abs(xseg(i23:len)));
if maxi1 > maxi2
    CL = 0.68*maxi2;
else
    CL = 0.68*maxi1;
end
% Centre clipping
clip = zeros(len,1);
ind1 = find(xseg >= CL);
clip(ind1) = xseg(ind1) - CL;
ind2 = find(xseg <= -CL);
clip(ind2) = xseg(ind2) + CL;
engy = norm(clip,2)^2;
% Autocorrelation; search lags corresponding to 60-320 Hz
RR = xcorr(clip);
m = len;                  % zero-lag index
LF = floor(fs/320);
HF = floor(fs/60);
Rxx = abs(RR(m+LF:m+HF));
[rmax, imax] = max(Rxx);
imax = imax + LF;
f0 = fs/imax;
% Silence/voicing decision: reject weak or out-of-range peaks
silence = 0.4*engy;
if (rmax > silence) & (f0 > 60) & (f0 <= 320)
    f0 = fs/imax;
else
    f0 = 0;
end
APPENDIX E: FORMANT ANALYSIS
clc;
clear all;
close all;
% Read the recording and find its formant peaks
% (formant is a helper returning peak powers P, frequencies F,
% and peak index positions I)
[y, fs, nbits] = wavread('surya');
[P, F, I] = formant(y);
sound(y, fs)
plot(F, P, 'r')
ylabel('Amplitude (dB)');
xlabel('arbitrary frequency scale');
PICKMAX
function [Y, I] = pickmax(y)
% Return the values and positions of the first five local maxima of y,
% located where the first difference changes sign from positive to negative.
Y = zeros(5,1);
I = zeros(5,1);
xd = diff(y);
index = 1;
pos = 0;
for i = 1:length(xd)
    if xd(i) > 0
        pos = 1;              % currently on a rising slope
    else
        if pos == 1           % rising slope just ended: a peak at sample i
            pos = 0;
            Y(index) = y(i);  % peak value
            I(index) = i;     % peak position
            index = index + 1;
            if index > 5
                return
            end
        end
    end
end
PITCH
function [t, f0, avgF0] = pitch(y, fs)
% Frame-based pitch contour using pitchacorr, with median smoothing.
ns = length(y);
mu = mean(y);
y = y - mu;                        % remove the DC offset
fRate = floor(120*fs/1000);        % 120 ms analysis frame
updRate = floor(110*fs/1000);      % 110 ms frame update (frames overlap)
nFrames = floor(ns/updRate) - 1;
f0 = zeros(1, nFrames);
f01 = zeros(1, nFrames);
k = 1;
avgF0 = 0;
m = 1;
for i = 1:nFrames
    xseg = y(k:k+fRate-1);
    f01(i) = pitchacorr(fRate, fs, xseg);
    if i > 2 & nFrames > 3
        % Median of the last three raw estimates smooths outliers
        z = f01(i-2:i);
        md = median(z);
        f0(i-2) = md;
        if md > 0
            avgF0 = avgF0 + md;
            m = m + 1;
        end
    elseif nFrames <= 3
        f0(i) = f01(i);
        avgF0 = avgF0 + f01(i);
        m = m + 1;
    end
    k = k + updRate;
end
% Frame-index time axis (scaled for plotting)
t = 1:nFrames;
t = 20 * t;
if m == 1
    avgF0 = 0;
else
    avgF0 = avgF0/(m-1);
end
APPENDIX F: WAVEFORM COMPARISON
% Compare a reference recording against a set of .wav files, first by
% average pitch and then by formant peak positions.
results = zeros(12,1);
pitchdiff = zeros(83,1);        % renamed from "diff" to avoid shadowing the built-in
formantdiff = zeros(12,1);
% Reference file: pitch contour, average pitch, and formant peaks
[y17, fs17, nbits17] = wavread('bhanu');
[t17, f017, avgF017] = pitch(y17, fs17);
[P17, F17, I17] = formant(y17);
plot(t17, f017)
avgF17 = avgF017
sound(y17, fs17)
pause(3)
% First pass: rank all files by the distance of their average pitch
for i = 1:83
    if i < 10
        filename = sprintf('prudhvi', i);   % format string ignores the index here
    else
        filename = sprintf('surya', i);
    end
    [y, fs, nbits] = wavread(filename);
    [t, f0, avgF0] = pitch(y, fs);
    plot(t, f0)
    avgP(i) = avgF0;
    pitchdiff(i,1) = norm(avgP(i) - avgF17);
    i
end
[Y, H] = sort(pitchdiff)
% Second pass: re-rank the twelve best pitch matches by formant positions
for j = 1:12
    p = H(j);
    if p < 10
        filename = sprintf('prudhvi', p);
    else
        filename = sprintf('surya', p);
    end
    filename
    [y, fs, nbits] = wavread(filename);
    [P, F, I] = formant(y);
    sound(y, fs)
    plot(F, P)
    pause(3)
    formantdiff(j,1) = norm(I17 - I);
end
[Y1, H1] = sort(formantdiff)
for k = 1:12
    results(k,1) = H(H1(k));
end
H
H1
results
RESULTS
&
CONCLUSION
RESULTS
A. SPEECH EDITING
[Figure: original speech signal (top) and edited speech signal (bottom); amplitude vs. time in seconds.]
B. SPEECH DEGRADATION
[Figure: time domain plot (amplitude vs. time in seconds), frequency domain plot, and frequency domain plot with noise added (amplitude vs. frequency in Hz).]
C. SPEECH ENHANCEMENT
[Figure: time domain plot (amplitude vs. time in seconds), frequency domain plot, and filtered frequency domain plot (amplitude vs. frequency in Hz).]
D. PITCH ANALYSIS
[Figure: pitch contour plot; pitch (Hz) vs. time frame.]
E. FORMANT ANALYSIS
[Figure: formant plot; amplitude (dB) vs. arbitrary frequency scale.]
F. WAVEFORM COMPARISON
[Figure: PSD comparison of two wave files; amplitude (dB) vs. arbitrary frequency scale.]
CONCLUSION
A crude speaker recognition code has been written in the MATLAB
programming language. This code uses comparisons between the average pitch of a
recorded wav file as well as the vector differences between formant peaks in the PSD of
each file. It was found that comparison based on pitch produced the most accurate
results, while comparison based on formant peak location did produce results but could
likely be improved. Experience was also gained in speech editing as well as basic
filtering techniques. While the methods utilized in the design of the code for this project
are a good foundation for a speaker recognition system, more advanced techniques would
have to be used to produce a successful speaker recognition system.
The proposed system successfully describes various characteristics and behaviours
of speech signals, and also addresses setting up communication between human speech
and machines. In the proposed system the code was generated in MATLAB and requires
speech signals in .wav format. To remove this limitation, other speech-signal formats
that can be used to communicate with machines need to be studied.