speaker independent arabic speech recognition using support vector machine
DESCRIPTION
Speaker Independent Arabic Speech Recognition Using Support Vector Machine. By. Eng. Shady Yehia El-Mashad. Supervised By. Dr. Mohamed Ibrahim Sharawy. Ass. Prof. Dr. Hala Helmy Zayed. Agenda. Introduction Characteristics of Speech Signal History of Speech & Previous Research - PowerPoint PPT PresentationTRANSCRIPT
Speaker Independent Arabic Speech Recognition Using Support Vector
Machine
Ass. Prof. Dr. Hala Helmy Zayed Dr. Mohamed Ibrahim Sharawy
Supervised By
Eng. Shady Yehia El-Mashad
By
Introduction
Characteristics of Speech Signal
History of Speech & Previous Research
The Proposed System
Results and Conclusions
Agenda
Introduction Characteristics of Speech Signal
History of Speech & Previous Research
The Proposed System
Results and Conclusions
Recognition
Is one of the basic memory tasks. It involves identifying objects or events that have been encountered before. It is the easiest of the memory tasks.
It is easier to recognize something, than to come up with it on your own
Speech Recognition System Also Known as Automatic Speech Recognition or Computer Speech
Automatic Speech Recognition (ASR) is the process of converting captured speech signals into the corresponding sequence of words in text.
ASR systems accomplish three basic tasks: 1- Pre-processing 2- Recognition 3-Communication
How do humans do it?
• Articulation produces sound waves which the ear conveys to the brain for processing
Acoustic Signal
How might computers do it?
Acoustic Waveform
Speech Recognition
Introduction
Characteristics of Speech Signal
History of Speech & Previous Research
The Proposed System
Results and Conclusions
There are two main types of speaker models:
(1) Speaker independent Speaker independent models recognize the speech patterns of a
large group of people.
(2) Speaker dependent Speaker dependent models recognize speech patterns from only one person.
Types of Speech Recognition
(1) Isolated Word Recognition Is the simplest speech type because it requires the user to pause between each word.
(2) Connected Word Recognition Is capable of analyzing a string of words spoken together, but not at normal speech rate.
(3) Connected Speech Recognition (Continuous Speech Recognition) Allows for normal conversational speech.
Speech Recognition Usually Concern Three Types of Speech
- Speaker gender - Speaker identity - Speaker language
- Psychological conditions - Speaking style
- Environmental conditions
Factors that affect the speech signal
- Digitization: Converting analogue signal into digital representation
- Signal processing: Separating speech from background noise - Phonetics:
Variability in human speech
- Continuity: Natural speech is continuous.
Some of the difficulties related to speech recognition
Three-state representation is one way to classify events in speech. The events of interest for the three-state representation are:
• Silence (S) - No speech is produced.
• Unvoiced (U) - Vocal cords are not vibrating, resulting in an aperiodic or random speech waveform.
• Voiced (V) - Vocal cords are vibrating periodically, resulting in a speech waveform that is quasi-periodic.
The Three-State Representation
Fig. Three State Speech Representation
(1) Security
(2) Education
(3) Control
(4) Diagnosis
(5) Dictation
Applications of Speech Recognition
Introduction
Characteristics of Speech Signal
History of Speech & Previous Research
The Proposed System
Results and Conclusions
History of Speech
Previous Research(Arabic Speech)
TitleCombined Classifier Based Arabic Speech
Recognition
Comparative Analysis of Arabic
Vowels using Formants and an
Automatic Speech Recognition System
HMM AUTOMATIC SPEECH
RECOGNITION SYSTEM OF
ARABIC ALPHADIGITS
Phonetic Recognition
of Arabic Alphabet letters using Neural
Networks
SourceINFOS2008, March 27-29, 2008 Cairo-Egypt ©
2008 Faculty of Computers &
Information-Cairo University
International Journal of Signal Processing,
Image Processing and Pattern Recognition-
Vol. 3, No. 2; June-2010
the Arabian Journal for Science and
Engineering, Volume 35, Number 2C; December-
2010
International Journal of Electric & Computer
Sciences IJECS-IJENS, Vol: 11, No: 01; February-2011
The technique used
ANN HMM HMM ANN
Type of neural network
combined classifier ----------------- ----------------- PCA
The scope of speech
6 isolated word from Holy Quran
Arabic vowels(10 words)
Alpha Digits “Saudi Accented “
Arabic Alphabet
Performance 93% 91.6% 76% 96%
Previous Research(Arabic Digits)Title
Recognition of Spoken Arabic Digits Using Neural Predictive
Hidden Markov Models
Efficient System for Speech Recognition
using General Regression Neural
Network
Speech Recognition System of Arabic Digits based on A Telephony
Arabic Corpus
Radial Basis Functions With Wavelet Packets
For RecognizingArabic Speech
Source
The International Arab Journal of Information
Technology, Vol. 1; July-2004
International Journal of Intelligent Systems and
Technologies 1; 2 © www.waset.org Spring
2006
Intensive Program on Computer Vision
(IPCV'08), Joensuu, Finland; August-2008
CSECS '10 Proceedings of the 9th WSEAS
international conference on Circuits,
systems, electronics, control and signal
processing; 2010
The technique used
Neural Network and Hidden
Markov ModelANN HMM ANN
Type of neural network MLP
general regression neural network
(GRNN)--------------- RBF
The scope of speech
Arabic Digits Arabic Digits Arabic Digits “Saudi accented “
Arabic Digits
Performance 88% 85- 91% 93.72% 87-93 %
Introduction
Characteristics of Speech Signal
History of Speech & Previous Research
The Proposed System
Results and Conclusions
The Proposed System
The Proposed System 1.Recording System
Creating of a speech database is important for the development researcher.
For English language: we don't need to create a database because there is already more than one have been created to help the researcher on their research like sphinx1,2,3&4 and Australian English For Arabic language, we should try to create a database that help us.
The Proposed System 2. Data Set
The Proposed System 2. Data Set
3. The Segmentation System
The Proposed System
Segmentation process is implemented by two techniques; semi-automatic and fully-automatic.
Semi-automatic technique:
We adopt the segmentation parameters which are window size, minimum amplitude, minimum frequency, maximum frequency, minimum silence, minimum speech, and minimum word manually by trial and error. In this technique, we achieve only 70 percent performance, which is not very high and with this technique we can’t continue in our system because we still have two stages after that which is the feature extraction and the recognition
3. The Segmentation System
The Proposed System
X0 1 6 5 3 2 0 4 0 3
0 1 6 5 3 2 0 4 0 3 -
3. The Segmentation System
The Proposed System
Fully-automatic techniques
These parameters are set automatically to get better performance by using the K-Mean clustering. By this technique we achieve nearly 100 percent in the segmentation of the digits.
The K-Means Algorithm process is as follows:•The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters.•For each data point:
• Calculate the distance from the data point to each cluster.• If the data point is closest to its own cluster, we leave it, and if not
move it into the closest cluster. •Repeating the above step until a complete pass through all the data points resulting in no data point moving from one cluster to another.
The feature vector must contain information that is
- useful to identify and differentiate speech sounds - identify and differentiate between speakers
When the input data to an algorithm is too large to be processed and it is suspected to be notoriously redundant (much data, but not much information) then the input data will be transformed into a reduced representation set of features (also named features vector).
Transforming the input data into the set of features is called Feature extraction.
There are some methods such as FFT, LPC, Real Cepstrum and MFCC.
The Proposed System
4. Feature Extraction
Mel Frequency Cepstrum Coefficients (MFCC):
The Proposed System
Take the Fourier transform of a signal. Map the powers of the spectrum obtained above onto
the mel scale, using triangular overlapping windows.
Take the logs of the powers at each of the mel frequencies. Take the discrete cosine transform of the list of mel log
powers, as if it were a signal. The MFCCs are the amplitudes of the resulting spectrum.
5. Neural Network Classifier
There are many Neural Models, Each model has advantages and disadvantages depending on the application. According to our application we choose
Support Vector Machine (SVM)
The Proposed System
Support Vector Machine (SVM):
A Support Vector Machine (SVM) is implemented using the kernel Adatron algorithm which constructs a hyperplane or set of hyperplanes in a high dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.
The Proposed System
Support Vector Machine (SVM):
The Proposed System
H3 (green) doesn’t separate the two classes; H1 (blue) separates the two classes but with a small margin and H2 (red) separates the two classes with the maximum margin.
Introduction
Characteristics of Speech Signal
History of Speech & Previous Research
The Proposed System
Results and Conclusions
Training and TestingSupport Vector Machine (SVM)
Results and Conclusions
We use the SVM network and adapting its parameter as follows:no hidden layers. The output layer has 10 neurons. And we train with maximum epochs of 1000.We have 10000 samples, we divide them into: Training: 70%
Cross Validation: 15% Testing: 15%
Results and ConclusionsResults
Cross Validation Confusion Matrix of the SVM
0 1 2 3 4 5 6 7 8 9
0 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1 0.00 96.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00
2 0.00 0.00 90.00 0.00 0.00 0.00 0.00 0.00 10.00 0.00
3 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00
4 4.00 0.00 0.00 0.00 94.00 0.00 0.00 0.00 2.00 0.00
5 7.00 0.00 0.00 0.00 0.00 89.00 0.00 0.00 0.00 4.00
6 0.00 0.00 0.00 0.00 0.00 0.00 92.00 0.00 0.00 8.00
7 0.00 0.00 0.00 0.00 0.00 2.00 4.00 90.00 0.00 4.00
8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
9 0.00 0.00 0.00 0.00 0.00 0.00 6.00 0.00 0.00 94.00
Results and ConclusionsResults
The Testing Confusion Matrix of the SVM
Output / Desired 0 1 2 3 4 5 6 7 8 9
0 255 5 0 4 0 0 6 0 0 31 0 313 0 0 1 0 0 0 6 02 2 0 122 0 0 0 0 0 0 03 2 4 0 105 0 0 1 0 2 04 0 0 0 0 135 0 0 6 0 15 0 0 0 0 0 90 3 0 1 16 4 0 0 1 0 0 83 0 0 07 0 6 2 2 1 4 0 119 2 28 0 5 0 5 0 0 6 0 95 09 0 0 0 0 0 0 0 0 0 95
Results and Conclusions
Results
Performance =
(255+313+122+105+135+90+83+119+95+95) / 1500
= 1412 / 1500 = 94.13 %
Results and Conclusions•Conclusion A spoken Arabic digits recognizer is designed to investigate the process of automatic digits recognition.The Segmentation process is implemented by two techniques; semi-automatic and fully-automatic.The feature extracted by using MFCC technique.This system is based on NN and by using Colloquial Egyptian dialect within a noisy environment and carried out by neuro solution tools.The performance of the system is nearly 94% when we use (SVM).
THANK YOU!