

SPEECH RECOGNITION USING NEURAL NETWORK

K.M. Peshan Sampath, P.W.D.C Jayathilake, R. Ramanan, S. Fernando, Suthrjan, Dr. Chatura De Silva

Department of Computer Science and Engineering, University of Moratuwa, Sri Lanka

Abstract

All speech recognition systems currently on the market are based on statistical techniques. The work presented in this paper is an alternative approach that recognizes speech the way the human senses do, using Neural Networks. Since the recognizer Neural Network must have a fixed number of inputs, the paper addresses the problem of reducing the variable-size feature vector of an isolated word to a constant size. The system consists of three distinct blocks: a feature extractor, a constant-trajectory mapping part, and a recognizer. The feature extractor uses a standard LPC Cepstrum coder, which converts the incoming speech signal, captured through the DirectSoundCapture COM interface, into the LPC Cepstrum feature space. An SOM Neural Network maps each variable-length LPC trajectory of an isolated word onto a fixed-length trajectory, thereby producing the fixed-length feature vector to be fed into the recognizer. The recognizer design uses two types of Neural Networks. Different Multi Layer Perceptron structures were tested, with one, two, and three hidden layers, with Tanh Sigmoid and Sigmoid transfer functions, and with both multiple-output and single-output configurations, for the recognition of the feature vectors of isolated words. The performance of a Radial Basis Function Neural Network was also tested for isolated word recognition. The comparison among different Neural Network structures conducted here gives a better understanding of the problem and its possible solutions. The feature vector was normalized and decorrelated, a fast training method was implemented by applying pruning techniques to the Neural Network, and the training process uses momentum to reach the global minimum of the error surface while avoiding oscillation in local minima. The main contribution of this paper is the use of the Multi Layer Perceptron for isolated word recognition, a completely new idea being implemented.

1 Introduction

Speech is produced when air is forced from the lungs through the vocal cords (glottis) and along the vocal tract. Speech splits into a rapidly varying excitation signal and a slowly varying filter. The envelope of the power spectrum contains the vocal tract information.

Analytically, speech consists of the time-domain convolution of two waves: one generated by the formant structure of the vocal tract and one by the excitation of the vocal tract, which determines the pitch of the sound. To recognize the word uttered, we need to focus only on the shape of the vocal tract, which is unique for each word uttered. To recognize the speaker, on the other hand, the focus has to be on the excitation of the vocal tract.


Figure 1: The source-filter model of speech production

2 Speech Analysis

To gain a better understanding of speech production, we analyze different algorithms for determining the formant distribution and the pitch contour, and how the vocal tract excitation and the wave due to the shape of the vocal tract combine in different domains: time, frequency, and quefrency. It is also necessary to know the different coding methods for speech representation.

2.1 Formant analysis

Formant analysis helps to identify the word being uttered, since it is heavily based on the resonances created by the shape of the vocal tract. A peak-picking algorithm on the LP spectrum was implemented in Matlab.

2.2 Pitch analysis

Pitch analysis makes it possible to recognize the speaker and the expressive way in which the speaker is speaking. The SIFT and AMDF algorithms were implemented in Matlab.



Figure 2: Pitch and Formant distribution

2.3 Frequency domain analysis

Frequency domain analysis is done in order to extract information in the frequency domain. The well-known FFT and IFFT algorithms were implemented in C++.

Figure 3: Plots in the frequency domain. (a) Spectrum (b) Spectrogram

2.4 Cepstrum analysis

Cepstrum analysis operates in the quefrency domain, in which the two waves convolved in the time domain combine in a linearly separable manner; liftering was done in order to separate the two waves. All analysis was done in Matlab.

2.5 Linear Predictive coding analysis

Linear predictive coding analysis is another representation of the vocal tract coefficients used to represent the uttered word. A set of routines, autocorrelation and Durbin's recursion, was implemented in C++ to calculate the LPC coefficients.

2.6 Different coding schemes

Coding methods are necessary to represent speech waves in a steady-state manner. Listed here are some coding methods widely used for speech representation.

- LPC Analysis: LPCC coefficients
- Cepstrum analysis: Mel Cepstral Coefficients
- Rasta Processing: Rasta PLP Coefficients


2.7 Framing and windowing

Since speech is dynamic, the coding of the speech cannot be done over the entire duration at once. The tactic for tackling this problem is to calculate the speech coefficients over small frames that overlap by 2/3 of the frame size. Windowing is done in order to smooth the edge effects that arise due to framing.

Figure 4: Overlapping of frames
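The framing-and-windowing step can be sketched as follows; at 16 kHz a 336-sample frame (21 ms) with a 112-sample hop (7 ms) gives the 2/3 overlap described above. This is a minimal illustration, not the paper's own implementation:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Split a signal into overlapping frames and Hamming-window each frame.
// At 16 kHz, frameLen = 336 (21 ms) and hop = 112 (7 ms) give the 2/3
// overlap between consecutive frames described in the text.
std::vector<std::vector<double>> frameAndWindow(const std::vector<double>& x,
                                                std::size_t frameLen,
                                                std::size_t hop) {
    const double kPi = 3.14159265358979323846;
    std::vector<std::vector<double>> frames;
    for (std::size_t start = 0; start + frameLen <= x.size(); start += hop) {
        std::vector<double> f(frameLen);
        for (std::size_t n = 0; n < frameLen; ++n) {
            // Hamming window smooths the edge effects due to framing.
            double w = 0.54 - 0.46 * std::cos(2.0 * kPi * n / (frameLen - 1));
            f[n] = x[start + n] * w;
        }
        frames.push_back(f);
    }
    return frames;
}
```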

3 Speech Recognition and Neural Networks

The use of Neural Networks for speech recognition is less mature than the existing template methods, such as Dynamic Time Warping, and statistical modeling methods, such as the Hidden Markov Model. Considering the history of speech recognition attempts, the Sphinx-3.2 system (1999) uses HMMs for continuous speech recognition.

3.1 Biological Neural Networks and speech recognition

In the human speech recognition process, the ear captures the sound and converts it into an electrical representation. That signal then propagates to the brain via the cochlea, and from there through billions of biological neurons, each stimulating the next along previously created paths, so that if a previous (early-learned) path exists, the word is recognized.



3.2 Artificial Neural Networks and speech recognition

In the computer speech recognition process, the signal is read from the sound card and converted into a discrete representation limited by the sampling rate and the number of bits per sample. In the same fashion as the biological neural system, the network has different paths created from previously trained words, and it is along these that a word is recognized. When a new word is presented, the network responds with the word whose path the signal propagates along, giving a probability.

Figure 5: Artificial Neuron

In an artificial neuron, numerical values are used as inputs to the "dendrites." Each input is multiplied by a value called a weight, which simulates the response of a real dendrite. All the results from the "dendrites" are added and thresholded in the "soma." Finally, the thresholded result is sent to the "dendrites" of other neurons through an "axon." This sequence of events can be expressed in mathematical terms as

y = φ( Σ_{i=1}^{m} w_i x_i + b )   (1)

where x_i are the inputs, w_i the weights, b the bias (threshold), and φ the activation function.
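The neuron model of Eq. (1) can be sketched directly; the logistic sigmoid shown is one common choice of threshold function, used here for illustration:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One artificial neuron, Eq. (1): weighted sum of the "dendrite" inputs
// plus a bias, thresholded in the "soma" by a sigmoid activation.
double neuron(const std::vector<double>& x, const std::vector<double>& w,
              double bias) {
    double v = bias;
    for (std::size_t i = 0; i < x.size(); ++i) v += w[i] * x[i];
    return 1.0 / (1.0 + std::exp(-v));  // activation sent down the "axon"
}
```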

3.3 Multi Layer Perceptron (MLP)

This is how the biological neural system is represented by the artificial one. The network has several layers: an input layer, hidden layers, and an output layer. Each layer consists of several neurons, each an individual processing unit. Two layers are connected together by weights; as in the biological system, there is a loss of signal while propagating from one neuron's dendrite to another. The learning algorithm here is back propagation. The learning strategy of this type of neural network is called supervised learning, since the network is told what to learn; it is up to the network to carry out how to learn.


Figure 6: Multi Layer Perceptron

3.4 Self Organizing Map (SOM)

This neural network mainly transforms an n-dimensional input vector space into a discretized m-dimensional space while preserving the topology of the input data. Its structure is two-layered, i.e. an input space and an output space. The training procedure is unsupervised; it is called competitive learning and is expressed as winner-takes-all. Compared to a biological neural network, this is a totally statistical approach.

Figure 7: Self Organizing Map

3.5 Radial Basis Function (RBF)

This neural network is a most powerful pattern classifier, considered able to separate any patterns by constructing hyperplanes among the different classes. In an RBF network the initialization of the centers takes place in an unsupervised manner by looking at the data pattern; the method used here is a modified k-means algorithm. After spreading the centers with respect to the data set, the network is trained in a supervised manner, mimicking the human brain in the same way as back propagation, using an extended back propagation variation of the LMS algorithm. Thus, while trying to mimic the human brain, it also uses a statistical approach in the initialization process.



4 Algorithms Enhanced

This section describes various algorithms found in earlier attempts via research papers, and the modifications made to them in order to come up with new algorithms that suit the problem at hand. All the algorithms described here were implemented in C++ and heavily tested, leading to the expected results.

4.1 Enhanced Back Propagation algorithm

In the process of training the Multi Layer Perceptron, the well-known back propagation algorithm was used. To avoid oscillation at local minima, I applied a momentum constant so that training proceeds toward the global minimum of the error surface, leading to the best solution. I have also extended the back propagation algorithm to be used with multiple outputs.

1) Initialization
The weights of each layer are initialized to random numbers.

2) Forward computation
The induced local field for neuron j in layer l is

v_j^(l)(n) = Σ_i w_ji^(l)(n) y_i^(l-1)(n)   (2)

The output signal of neuron j in layer l is

y_j^(l)(n) = φ_j(v_j^(l)(n))   (3)

If neuron j is in the first hidden layer,

y_j^(0)(n) = x_j(n)   (4)

If neuron j is in the output layer and L is the depth of the network,

y_j^(L)(n) = o_j(n)   (5)

The error signal is

e_j(n) = d_j(n) - o_j(n)   (6)

where d_j(n) is the desired output for neuron j.

3) Backward computation
Compute the local gradients (δs) of the network as follows:

δ_j^(L)(n) = e_j(n) φ'_j(v_j^(L)(n))   for neuron j in output layer L   (7)

δ_j^(l)(n) = φ'_j(v_j^(l)(n)) Σ_k δ_k^(l+1)(n) w_kj^(l+1)(n)   for neuron j in hidden layer l

The weight update takes place according to the following formula:

w_ji^(l)(n+1) = w_ji^(l)(n) + α [w_ji^(l)(n) - w_ji^(l)(n-1)] + η δ_j^(l)(n) y_i^(l-1)(n)   (8)

where η is the learning rate and α is the momentum constant.
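The momentum update of Eq. (8) can be sketched in isolation; the struct and names below are illustrative, not from the paper:

```cpp
#include <cassert>
#include <cmath>

// Momentum-based weight update, Eq. (8):
//   dw(n) = alpha * dw(n-1) + eta * delta * y
// The alpha term keeps the update moving through shallow local minima
// instead of oscillating inside them.
struct Synapse {
    double w = 0.0;       // current weight
    double dwPrev = 0.0;  // previous weight change, for the momentum term
};

void updateWeight(Synapse& s, double delta, double y, double eta, double alpha) {
    double dw = alpha * s.dwPrev + eta * delta * y;
    s.w += dw;
    s.dwPrev = dw;
}
```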


4.2 Enhanced LMS algorithm

In the process of training the Radial Basis Function network, the LMS algorithm was used. To suit our problem we have to deal with multiple outputs, so I enhanced and extended it for multiple outputs. The algorithm used earlier applies only when there are no hidden layers, so I made the learning procedure a combination of LMS and back propagation: the clusters are located with LMS while the weights are updated using back propagation.

Adaptation algorithm for the linear weights and for the positions and spreads of the centers of the RBF network:

1) Linear weights (output layer)

∂E(n)/∂w_i(n) = Σ_j e_j(n) G(||x_j - t_i(n)||)   (9)

w_i(n+1) = w_i(n) - η1 ∂E(n)/∂w_i(n)   (10)

2) Positions of centers

∂E(n)/∂t_i(n) = 2 w_i(n) Σ_j e_j(n) G'(||x_j - t_i(n)||) Σ_i^{-1} [x_j - t_i(n)]   (11)

t_i(n+1) = t_i(n) - η2 ∂E(n)/∂t_i(n)   (12)

3) Spreads of centers

∂E(n)/∂Σ_i^{-1}(n) = -w_i(n) Σ_j e_j(n) G'(||x_j - t_i(n)||) Q_ji(n)   (13)

Q_ji(n) = [x_j - t_i(n)][x_j - t_i(n)]^T   (14)

Σ_i^{-1}(n+1) = Σ_i^{-1}(n) - η3 ∂E(n)/∂Σ_i^{-1}(n)   (15)

4.3 Clustering algorithms

Clustering determines how related data are categorized into different classes. The Code Book is the term used for the whole data set, i.e. the universal set; it consists of a set of Code Words, each of which represents a different category. Clustering is mostly done in order to reduce the data set by removing repeated data.

4.3.1 K-means clustering algorithm


This is a time-independent data clustering method. The k-means clustering algorithm (Duda and Hart, 1973) was implemented in the RBF neural network to pre-initialize the data sets into the code words of the RBF code book.

1) Initialization: random values were chosen for the centers.

2) Sampling: a sample vector x from the input space was drawn and input to the algorithm at iteration n.

3) Similarity matching: let k(x) denote the index of the best-matching (winning) center for input vector x:

k(x) = arg min_k || x(n) - t_k(n) ||   (16)

where t_k(n) is the center of the k-th radial basis function at iteration n.

4) Updating: adjust the centers of the radial basis functions using the following rule:

t_k(n+1) = t_k(n) + η [x(n) - t_k(n)]   if k = k(x); otherwise t_k(n+1) = t_k(n)   (17)

where η is the learning rate, 0 < η < 1.
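Steps 3) and 4) can be sketched as a single sequential update; a minimal illustration, not the paper's implementation:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Sequential k-means, Eqs. (16)-(17): find the winning (nearest) center
// for sample x, then move only that center toward x by the learning rate.
std::size_t nearestCenter(const std::vector<Vec>& centers, const Vec& x) {
    std::size_t best = 0;
    double bestDist = 1e300;
    for (std::size_t k = 0; k < centers.size(); ++k) {
        double d = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            double diff = x[i] - centers[k][i];
            d += diff * diff;  // squared Euclidean distance, Eq. (16)
        }
        if (d < bestDist) { bestDist = d; best = k; }
    }
    return best;
}

void kmeansStep(std::vector<Vec>& centers, const Vec& x, double eta) {
    std::size_t k = nearestCenter(centers, x);
    for (std::size_t i = 0; i < x.size(); ++i)
        centers[k][i] += eta * (x[i] - centers[k][i]);  // Eq. (17)
}
```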

4.3.2 SOM neural network as clustering algorithm

A modified version of the k-means clustering algorithm was implemented in order to cluster the data without changing the time trajectories. The algorithm is as follows:

1) Initialization: random values were chosen for the initial weight vectors w_j(0).

2) Sampling: a vector x was drawn from the input space with a certain probability; it represents the activation pattern applied to the lattice.

3) Similarity matching: the best-matching (winning) neuron i(x) at step n was found as follows:

i(x) = arg min_j || x(n) - w_j(n) ||   (18)

4) Updating: the weights were updated with the following formula:

w_j(n+1) = w_j(n) + η(n) h_{j,i(x)}(n) [x(n) - w_j(n)]   (19)

where η(n) is the learning rate and h_{j,i(x)}(n) is the neighborhood function around the winning neuron i(x) at iteration n.


5 Implementation of the Neural Speech Recognizer

The whole system has been implemented in C++. The following diagram shows an abstract view of the system modules; the algorithms implemented in each module are described below.

Figure 8: Speech Recognition System

5.1 Design of Feature Extractor

In the speech recognition problem, the feature extraction (FE) block has to process the incoming speech signal such that its output eases the work of the classification stage. It consists of the following modules.

5.1.1 Sound Recording

Capturing the sound to be recognized was implemented via DirectSoundCapture, because DirectSound has major features such as low latency and hardware acceleration compared to other mechanisms like OCX/ActiveX. Basically, DirectSound accesses the hardware through the DirectSound Hardware Abstraction Layer, which is an interface to the sound hardware. The sound was recorded at a sampling rate of 16000 Hz with 16 bits per sample.

The capture buffer had to be carefully designed, because it sometimes has to be dealt with through two pointers.

(Figure 8 modules: Speech Signal → Sound Recording → FIR Filtering → Framing → Windowing → LPC Analysis → Cepstrum Analysis → SOM → Recognizer → Text)


Figure 9: DirectSoundCapture buffer. (a) Dealing with a single pointer (b) Dealing with two pointers

5.1.2 Pre-emphasis Filter

A pre-emphasis filter was applied to the digitized speech to spectrally flatten the signal and diminish the effects of finite numerical precision in further calculations.

The transfer function of the pre-emphasis filter, a first-order FIR filter, is defined as follows:

H(z) = 1 - a z^{-1}   (20)

The value of a was obtained experimentally and is equal to 0.85. The time-domain coefficients of the filter were calculated and applied to the time-domain waveform.
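In the time domain, Eq. (20) amounts to y[n] = x[n] - a·x[n-1]; a minimal sketch:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// First-order FIR pre-emphasis, Eq. (20) applied in the time domain:
//   y[n] = x[n] - a * x[n-1], with a = 0.85 found experimentally.
// High frequencies are boosted, spectrally flattening the signal.
std::vector<double> preEmphasize(const std::vector<double>& x, double a = 0.85) {
    std::vector<double> y(x.size());
    for (std::size_t n = 0; n < x.size(); ++n)
        y[n] = x[n] - (n > 0 ? a * x[n - 1] : 0.0);
    return y;
}
```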

5.1.3 Speech coding

After the captured speech signal was sampled, the utterance isolated, and the spectrum flattened, each signal was divided into a sequence of frames, each 21 ms long and 7 ms apart. Each frame was then multiplied by a Hamming window, in order to remove leakage effects and smooth the edges:

w(n) = 0.54 - 0.46 cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1   (21)

A vector of 12 Linear Predictive Coding Cepstrum coefficients was calculated from each data block using Durbin's method and the recursive expressions developed by Furui. The procedure is as follows:



1. The autocorrelation coefficients were obtained as follows:

r(k) = Σ_{n=k}^{N-1} s(n) s(n-k),  k = 0, 1, …, p   (22)

where s(n) corresponds to the speech sample located at position n of the frame.

2. The LPC values were then obtained using Durbin's recursion:

E^(0) = r(0)
k_i = [ r(i) - Σ_{j=1}^{i-1} a_j^(i-1) r(i-j) ] / E^(i-1)
a_i^(i) = k_i
a_j^(i) = a_j^(i-1) - k_i a_{i-j}^(i-1),  1 ≤ j ≤ i-1
E^(i) = (1 - k_i^2) E^(i-1)   (23)

3. The LPC Cepstrum was then obtained by the following formula:

c_m = a_m + Σ_{k=1}^{m-1} (k/m) c_k a_{m-k},  1 ≤ m ≤ p   (24)

where p = 12 was used in order to achieve 12 LPC Cepstrum coefficients per frame.

Since the sampling rate is 16000 Hz, the frame size is 21 ms, and the frames are 7 ms apart; for each frame, LPC Cepstrum features of dimension 12 were calculated.
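The three-step procedure above can be sketched as follows; these are the standard textbook forms of Eqs. (22)-(24) (sign conventions vary between texts), not the paper's own C++ routines:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Eq. (22): autocorrelation of one frame up to lag p.
std::vector<double> autocorr(const std::vector<double>& s, int p) {
    std::vector<double> r(p + 1, 0.0);
    for (int k = 0; k <= p; ++k)
        for (std::size_t n = k; n < s.size(); ++n)
            r[k] += s[n] * s[n - k];
    return r;
}

// Eq. (23): Durbin's recursion; returns a[0..p] with a[1..p] the LPC values.
std::vector<double> durbin(const std::vector<double>& r, int p) {
    std::vector<double> a(p + 1, 0.0), prev(p + 1, 0.0);
    double e = r[0];
    for (int i = 1; i <= p; ++i) {
        double k = r[i];
        for (int j = 1; j < i; ++j) k -= prev[j] * r[i - j];
        k /= e;                      // reflection coefficient k_i
        a = prev;
        a[i] = k;
        for (int j = 1; j < i; ++j) a[j] = prev[j] - k * prev[i - j];
        e *= (1.0 - k * k);          // prediction error update
        prev = a;
    }
    return a;
}

// Eq. (24): LPC coefficients to LPC cepstrum, c[1..p].
std::vector<double> lpcCepstrum(const std::vector<double>& a, int p) {
    std::vector<double> c(p + 1, 0.0);
    for (int m = 1; m <= p; ++m) {
        c[m] = a[m];
        for (int k = 1; k < m; ++k)
            c[m] += (double(k) / m) * c[k] * a[m - k];
    }
    return c;
}
```

With p = 12 this yields the 12 cepstrum coefficients per frame used in the system.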

5.2 Design of constant trajectory mapping module

Using the Self Organizing Map, each variable-length LPC trajectory is mapped to a constant trajectory of 6 clusters while preserving the input space. The implemented algorithm consists of three parts.

5.2.1 Competitive process

Let x be the m-dimensional input vector:

x = [x_1, x_2, …, x_m]^T

Let w_j be the synaptic weight vector of neuron j:

w_j = [w_j1, w_j2, …, w_jm]^T,  j = 1, 2, …, l

The index of the best-matching neuron is

i(x) = arg min_j || x - w_j ||,  j = 1, 2, …, l   (25)

where l is the total number of neurons in the network.

5.2.2 Cooperative process

The lateral distance between excited neuron j and winning neuron i is

d_{j,i}^2 = || r_j - r_i ||^2   (26)

where r_j is the position of neuron j and r_i is the position of neuron i. The width of the topological neighborhood shrinks with time as follows:

σ(n) = σ_0 exp(-n / τ_1),  n = 0, 1, 2, …   (27)

The variation of the topological neighborhood is

h_{j,i(x)}(n) = exp( -d_{j,i}^2 / (2σ^2(n)) ),  n = 0, 1, 2, …   (28)

5.2.3 Adaptation process

The learning rate changes as follows:

η(n) = η_0 exp(-n / τ_2),  n = 0, 1, 2, …   (29)

The adaptation of the weights is as follows:

w_j(n+1) = w_j(n) + η(n) h_{j,i(x)}(n) [x - w_j(n)]   (30)

where σ_0, τ_1, η_0, and τ_2 are constants chosen experimentally.

The number of frames handled changes dynamically. Since there are 6 cluster centers, all are initialized to random weights, and the variable-length trajectory of each LPC coefficient is allowed to arrange itself into a unique, shape-preserving, time-ordered feature sequence of size six.
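A single update of this trajectory-mapping SOM can be sketched for a 1-D lattice of scalar weights. This is a simplified illustration: in the actual system each of the 6 centers lies on an LPC trajectory, and eta and sigma decay with the iteration number as in Eqs. (27) and (29):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One SOM update on a 1-D lattice of scalar weights, combining the
// competitive (Eq. 25), cooperative (Eqs. 26-28), and adaptive (Eqs. 29-30)
// processes for a single input sample x.
void somStep(std::vector<double>& w, double x, double eta, double sigma) {
    // Competitive process: the winner is the closest weight.
    std::size_t win = 0;
    for (std::size_t j = 1; j < w.size(); ++j)
        if (std::fabs(x - w[j]) < std::fabs(x - w[win])) win = j;
    // Cooperative + adaptive processes: Gaussian neighborhood of the winner.
    for (std::size_t j = 0; j < w.size(); ++j) {
        double d = double(j) - double(win);  // lateral lattice distance
        double h = std::exp(-(d * d) / (2.0 * sigma * sigma));
        w[j] += eta * h * (x - w[j]);        // Eq. (30)
    }
}
```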

5.3 Design of Recognizer

The recognizer was designed to recognize the 10 digits; for each digit, the input to the recognizer is a feature vector of 72 features.

5.3.1 Multi Layer Perceptron Approach


A new layered approach was used in the design of the Neural Network, with the structure of the neural node made independent of both the training and recognition processes. The network consists of 72 input nodes, a variable number of hidden nodes, and 10 output nodes. It has the adaptability to change its transfer function between the Sigmoid φ(v) = A / (1 + e^{-av}) and the Tanh Sigmoid φ(v) = a tanh(bv), with values A = 1, a = 1.1759, b = 0.6667. It was designed with the enhanced back propagation algorithm of Section 4.1, extended to deal with the problem. Sequential training was used, and the network state is stored internally every time the weights are adjusted, in order to avoid inconsistent weights diverging to infinity. The training process was then automated so that both the test set and the training set are presented to the layered network: training stops when the test set satisfies a condition checked at the end of each epoch of the training set. At the end of training, the whole state of the layered network is stored so that it can be retrieved during recognition.

Instead of having multiple outputs, another network was designed, similar to the above except that it has only one output; the outputs for the different digits were set to sub-ranges of the whole output range of the transfer function used.

The above two neural networks were then extended to two and three hidden layers, by extending both the layered structure and the enhanced back propagation algorithm. So finally MLP networks with one, two, and three hidden layers were designed.

For each digit there are 60 training examples, and all of these were presented to the neural network with the ten digits' samples interleaved, i.e. in the sequence 0 1 2 … . The test set consists of 10 samples for each digit.

The heuristic techniques applied to the network are as follows:

1) The learning rate was reduced with each epoch number.   (31)

2) Momentum was used to avoid oscillation at local minima.

3) The input vector was normalized.
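Two of these heuristics can be sketched as follows. The exact schedule of Eq. (31) is not reproduced here, so a common 1/(1 + n/τ) decay is assumed, and mean removal is only the first stage of the normalization shown in Figure 10:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Heuristic 1: learning rate reduced with the epoch number.
// An assumed 1/(1 + n/tau) schedule; the paper's exact Eq. (31) may differ.
double decayedRate(double eta0, int epoch, double tau) {
    return eta0 / (1.0 + epoch / tau);
}

// Heuristic 3 (first stage): remove the mean of the input feature vector.
void removeMean(std::vector<double>& v) {
    double mean = 0.0;
    for (double x : v) mean += x;
    mean /= v.size();
    for (double& x : v) x -= mean;
}
```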


Figure 10: Normalization of Input data

5.3.2 Radial Basis Function Approach

An RBF network was implemented consisting of 72 input nodes, 20 hidden nodes (20 clusters), and 10 output nodes. The input vector was prepared from the SOM trajectory. The training examples were presented in the same manner as for the MLP design. The RBF network training has two processes:

1) Initialization of the centers, implemented in the same way as the k-means clustering algorithm explained in 4.3.1, over the whole set of training examples.

2) Learning and adaptation of the weights, centers, and covariance matrices, implemented with the enhanced LMS algorithm of Section 4.2, extending that idea to multiple outputs. Heuristic techniques such as normalization of the input vectors and adjustment of the learning rate, as above, were applied.

(Figure 10 stages: Mean Removal → Decorrelation → Covariance Equalization)


6 Results of Research

The experiment was done on an IBM digital computer with a 2.5 GHz processor and 250 MB of memory.

6.1 Success

Capturing the speech via the sound card through DirectX succeeded. Shown here is the graph of the recorded sound for "one" pronounced in Sinhala, "eka".

Figure 11: Time domain representation of captured “eka” sound

The feature extractor module also gives 100% correct results, as it implements the well-known Durbin's recursion. The variation of the LPC coefficients with time is as follows:


Figure 12: Variation of the LPC coefficient trajectories (LPC 1-12) with frame number for the sound "eka"


The attempt to make the variable-length feature vector a constant size while preserving the input feature space succeeded: the length of each LPC trajectory was made constant while preserving the input features over time.

Figure 13: The LPC coefficient trajectories after applying the SOM algorithm to obtain constant-length trajectories. For each coefficient, the left panel shows the reduced trajectory of length 6 and the right panel shows the spread of the centers to convergence, for the sound "eka".


One tool we used earlier for pattern recognition is the Matlab Neural Network toolbox. Since it has internal limitations, such as speed-wise performance and a lack of control for more powerful pattern classification, I decided to use our own Neural Network implemented in C++, which achieves better speed.

So I came up with my own design for implementing the neural network, in which the separation of structure, training procedure, and transfer function is very easy. My first design was the MLP with one hidden layer. I then extended it to two and three hidden layers, coming up with my own back propagation algorithm. All three classify separable patterns with 100% accuracy. Here I present some parameters which gave optimum accuracy in recognizing the digit being uttered, along with the results of the various neural network configurations. Experimenting with the internal parameters of the Multi Layer Perceptron, the learning rate and the momentum constant, optimum values were found for both the sigmoid and the tanh sigmoid transfer functions.

Here I present the recognition accuracy of the different types of Multi Layer Perceptron networks. All experiments use 60 training examples for each digit, 600 training examples altogether. Each Neural Network was trained for 10000 epochs of the training set with sequential training.

MLP with one hidden layer (config1): 72 input nodes, 15 hidden nodes, 10 output nodes.
MLP with one hidden layer (config2): 72 input nodes, 120 hidden nodes, 10 output nodes.
MLP with two hidden layers (config3): 72 input nodes, 60 + 25 hidden nodes, 10 output nodes.
MLP with two hidden layers (config4): 72 input nodes, 100 + 50 hidden nodes, 10 output nodes.
MLP with three hidden layers (config5): 72 input nodes, 200 + 100 + 50 hidden nodes, 10 output nodes.
MLP with three hidden layers (config6): 72 input nodes, 200 + 40 + 20 hidden nodes, 10 output nodes.

Digit  Config1  Config2  Config3  Config4  Config5  Config6
0        71       98       71       98       90       89
1        99        8       15       73       13       96
2        96        8       31        6       20       90
3        84       32       92       82       74       75
4        86       17        3        7       28       89
5        92        2        5        6       95       90
6        95       14        4       23       10       91
7         0        3        1       16       78       22
8         1       32       98       85       52       90
9        87       35       29       84       30       90

Table 1: Digit recognition accuracy (%) over 100 utterances of each digit, for each configuration

Figure 14: Learning curve for the 3-layer MLP (config1): aggregate error vs. epoch number


Figure 15: Learning curve for the 5-layer MLP (config6): aggregate error vs. epoch number

Here are the results of identifying zero ("binduwa") with different configurations; the network classifies whether the uttered sound is zero or not zero.

MLP with one hidden layer (config7): 72 input nodes, 5 hidden nodes, 2 output nodes.
MLP with one hidden layer (config8): 72 input nodes, 32 hidden nodes, 2 output nodes.
MLP with two hidden layers (config9): 72 input nodes, 30 + 5 hidden nodes, 2 output nodes.
MLP with two hidden layers (config10): 72 input nodes, 100 + 30 hidden nodes, 2 output nodes.
MLP with three hidden layers (config11): 72 input nodes, 35 + 20 + 5 hidden nodes, 2 output nodes.
MLP with three hidden layers (config12): 72 input nodes, 100 + 50 + 20 hidden nodes, 2 output nodes.


Digit             Config7   Config8   Config9   Config10   Config11   Config12
0 (being 0)          93        99        89        98         97         65
1 (not being 0)      21        92        78        83         83         91
2 (not being 0)      16        61        22        69         80         90
3 (not being 0)      12        55        77        42         79         94
4 (not being 0)      18        21        61        32         96         83
5 (not being 0)      93        71        83        76         95         74
6 (not being 0)      94        68        11        65         10         83
7 (not being 0)      81        65        89        40         94         82
8 (not being 0)      96        87        89        96         96         85
9 (not being 0)      75        65        57        37         12         88

Table 2: Accuracy (%) of classifying being 0 or not being 0, over 100 utterances of each digit, for each configuration.

Below are the learning curves of the network while training with wave data. They show the convergence, that is, the increasingly better separation of the hyperplanes for the relevant data sets. Each curve plots the aggregate error at the end of each epoch over the training set.

Figure 16: Learning curve for MLP with 5 layers (config11)

It is not always possible to obtain a learning curve that minimizes the error; there is a possibility of getting a learning curve that maximizes the error instead. Figure 17 shows a Multi Layer Perceptron for which this occurred.
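Such a rising error curve can be reproduced even on a trivial problem when the learning step is too large. The sketch below is an illustration only (plain gradient descent on a one-dimensional quadratic, not the network discussed here):

```python
# Gradient descent on E(w) = w^2. With a small learning rate the error
# shrinks each epoch; with an overly large rate each step overshoots the
# minimum and the error grows, giving an increasing learning curve.
def train(learning_rate, epochs=10, w=1.0):
    errors = []
    for _ in range(epochs):
        grad = 2 * w              # dE/dw
        w -= learning_rate * grad
        errors.append(w * w)      # aggregate error at end of "epoch"
    return errors

stable = train(learning_rate=0.1)     # error decreases toward 0
diverging = train(learning_rate=1.1)  # error grows without bound
print(stable[-1], diverging[-1])
```

Momentum, as used in the training process described in this work, is one way to damp such oscillation around a minimum.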


[Figure: learning curve plotting Error against Epoch Number (0 to 2000), labelled "Error change with epoch number"]

Figure 17: Learning curve for MLP with 3 layers (72, 5, and 10)

6.2 Failed Attempts

I tried an MLP using a Tanh Sigmoid with a single output, so that the output range is [-1, 1], and divided this range into intervals: -0.9 for 0, -0.7 for 1, -0.5 for 2, -0.3 for 3, -0.1 for 4, 0.1 for 5, 0.3 for 6, 0.5 for 7, 0.7 for 8, and 0.9 for 9. Although I tried many variations of this configuration, changing the number of layers and the number of neurons within each layer, I was never able to train the network: every time, its weights diverged to infinity.

In the case of the RBF neural network, training requires inverting the covariance matrix for every training example. In our problem this matrix is 72×72, since the feature vector has 72 elements. I wrote a module that calculates the inverse of an arbitrary matrix using recursion, but it takes a very long time because of the large amount of floating point calculation involved, so recognition results could not be obtained for this network.
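One plausible reason the recursive inverse was so slow is that cofactor (adjugate) expansion grows factorially with matrix size. The sketch below shows the determinant part of such a routine (an illustration, not the original module):

```python
# Determinant by recursive cofactor expansion along the first row.
# This is O(n!) work: fine for tiny matrices, hopeless at 72x72, which is
# why practical inverses use Gaussian elimination (O(n^3)) instead.
def det(m):
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        # Minor: delete row 0 and column j.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += ((-1) ** j) * m[0][j] * det(minor)
    return total

print(det([[4.0, 7.0], [2.0, 6.0]]))  # 4*6 - 7*2 = 10.0
```

At n = 72 this expansion would require on the order of 72! multiplications, so the observed slowness is inherent to the recursive method, not merely to floating point speed.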

7 Conclusion & Future Work

The attempt to mimic a human being by focusing on the single sense of hearing has succeeded, subject to a number of limitations of today's digital computers as well as some things still undiscovered in science. It was thus possible to use a completely new approach for recognizing isolated words. This paper has presented the results for recognition accuracy, which is high for individual digits.


Apart from the speech technologies, this paper presents different neural network architecture designs for the pattern-recognition problem at hand. It is concluded that neural networks with more hidden layers are able to solve such problems more easily. Comparing the error curves and digit recognition accuracies, the Multi Layer Perceptron with 5 layers is a more generic approach than the Multi Layer Perceptron with 3 layers.

Speech is an analog waveform, and the processing for an artificial human being must be done on a digital computer. The analog-to-digital conversion is limited by the sampling rate and the number of bits each sample carries, so quantization errors can occur in this process. The next generation of computers, quantum computers, may address this problem remarkably well.

The capturing process in the sound card, and the environment itself, introduce a great deal of noise. The problem faced here is that in trying to reduce one type of noise, another type of noise is introduced; this poses a serious problem for recognition accuracy. In the case of the RBF neural network, calculating the matrix inverse was limited by floating point calculation speed.

Although the use of Neural Networks for speech recognition is not a mature technique compared to the Hidden Markov Model, a new approach for isolated word recognition was discovered here, using a combination of a Multi Layer Perceptron and a Self Organizing Map in place of an HMM. The recognition accuracy is good and is acceptable as a basis for moving on to continuous speech recognition. The suggestion for future work on continuous speech recognition is again to use a pattern-separating Neural Network to break the utterance into words, feeding it the energy variation pattern together with the zero-crossing pattern.
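The suggested word-boundary features, short-time energy and zero-crossing count per frame, could be computed roughly as follows (a sketch; the frame length of 160 samples is an assumed value):

```python
# Per-frame short-time energy and zero-crossing count over a sample stream,
# the two features suggested above for locating word boundaries: silence
# gives low energy, voiced speech gives high energy, and unvoiced sounds
# tend to show many zero crossings.
def frame_features(samples, frame_len=160):
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame)
        zero_crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        feats.append((energy, zero_crossings))
    return feats

# One silent frame followed by one alternating-sign "active" frame.
feats = frame_features([0.0] * 160 + [0.5, -0.5] * 80)
print(feats)
```

A word-segmentation network could then be fed these (energy, zero-crossing) trajectories rather than raw samples.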
