
Int J Speech Technol (2006) 9: 133–150, DOI 10.1007/s10772-008-9009-1

Arabic speech recognition using SPHINX engine

Hussein Hyassat · Raed Abu Zitar

Received: 1 October 2008 / Accepted: 9 October 2008 / Published online: 28 October 2008. © Springer Science+Business Media, LLC 2008

Abstract Although the Arab world has an estimated 250 million Arabic speakers, there has been little research on Arabic speech recognition compared to other languages of similar importance (e.g. Mandarin). Due to the lack of diacritized Arabic text and of a Pronunciation Dictionary (PD), most previous work on Arabic Automatic Speech Recognition has concentrated on developing recognizers using Romanized characters, i.e. the system recognizes the Arabic word as if it were an English one and then maps it back to the Arabic word through a lookup table relating each Arabic word to its Romanized pronunciation.

In this work, we introduce the first SPHINX-IV-based Arabic recognizer and propose an automatic toolkit that is capable of producing a PD for both the Holy Qur'an and the standard Arabic language. Three corpora are developed entirely in this work, namely the Holy Qur'an Corpus HQC-1 (about 18.5 hours), the command and control corpus CAC-1 (about 1.5 hours) and the Arabic digits corpus ADC (less than one hour of speech). The building process is

H. Hyassat, Arab Academy of Business and Financial Sciences, Amman, Jordan

R. Abu Zitar (corresponding author), School of Computing and Engineering, New York Institute of Technology, Amman, Jordan. e-mail: [email protected]

completely described. Fully diacritized Arabic transcriptions for all three corpora were developed as well.

The SPHINX-IV engine was customized and trained for both the language model and the lexicon modules shown in the framework architecture block diagram.

Using the three corpora and the PD produced by our automatic tool together with the transcripts, the SPHINX-IV engine is trained and tuned to develop three acoustic models, one per corpus. Training is based on an HMM whose statistics and random-variable distributions are estimated from the training data itself. A new algorithm is proposed to add unlabeled data to the training corpus in order to increase its size. The algorithm uses a neural-network confidence scorer to annotate the decoded speech and decide whether the proposed transcript is accepted and can be added to the seed corpus.

The model parameters were fine-tuned using a simulated annealing algorithm; optimum values were tested and reported. Our major contribution is the use of the open-source SPHINX-IV engine for Arabic speech recognition by building our own language and acoustic models without Romanization of the Arabic speech. The system is fine-tuned and the data are refined for training and validation. Optimum values for the number of Gaussian mixture components and the number of HMM states were found according to specified performance measures. Optimum values for confidence


scores were found for the training data. Although much more work needs to be done, and the corpora are modest in size, we consider them sufficient to validate our approach. SPHINX has never been used before in this manner for Arabic speech recognition. The work is an invitation to all open-source speech recognition developers and groups to take over and capitalize on what we have started.

Keywords SPHINX engine · Pronunciation Dictionary · Diacritic Arabic

1 Introduction

Large Vocabulary Continuous Speech Recognizers (LVCSR) are commercially available from different vendors. Along with this increased availability comes the demand for recognizers in many languages that have often not been a focus of speech recognition research. Arabic is, so far, one of these languages. With the increasing role of computers in our life, there is a desire to communicate with them naturally. Speech processing by computer provides one vehicle for natural communication between man and machine. Interactive networks provide easy access to a wealth of information and services that will fundamentally affect how people work, play and conduct their daily affairs.

The average citizen needs to communicate with these networks using natural communication skills and everyday devices such as telephones, mobile or fixed, and televisions. Without fundamental advances in user-centered interfaces, a large portion of society will be prevented from participating in the information

era, resulting in further stratification of society and a tragic loss of human potential. Automatic Speech Recognition (ASR) is one such interface; it has witnessed enormous progress over the last decade and can now be performed reliably on large vocabularies, on continuous speech, and speaker-independently. The word error rate of these recognizers under special conditions is often below 10 percent (Pallet et al. 1999), while for general-purpose LVCSR the best word error rates for English were as high as 23.9% (Rosti 2004; Hain et al. 2003).

With an estimated 250 million native speakers, Arabic is the sixth most widely spoken language in the world, yet research on ASR for Arabic is very limited compared to other languages of similar importance such as Mandarin (Kirchhoff et al. 2002).

Most previous work on Arabic ASR aims at developing recognizers for either Modern Standard Arabic (MSA) or Egyptian Colloquial Arabic (ECA). Some Word Error Rates (WER) obtained for both MSA and ECA are shown in Table 1.

From Table 1 we see that the performance of Arabic ASR for ECA is very poor compared to ASR for other languages such as English; this result is another motivation for this research.

Most previous work on Arabic ASR trained the system in one of two formats: either a Romanized format or standard Arabic script without a Romanized transcript. Arabic ASR has concentrated on developing recognizers either for Modern Standard Arabic (MSA), a formal linguistic standard used throughout the Arabic-speaking world and employed in the media (e.g. broadcast news), lectures, courtrooms, etc., or for colloquial Arabic.


Table 1 WER (%) obtained for both MSA and ECA

Arabic language type   Year    Word error rate (WER)   Reference
MSA                    1997    15–20%                  Billa et al. (2002a, 2002b)
ECA                    96/97   61–56%                  Kirchhoff et al. (2002), Zavagliakos et al. (1998)
ECA                    2002    55.1–54.9%              Kirchhoff et al. (2002)

The SPHINX-IV engine will be customized in this research. SPHINX-IV is an open-source speech recognition engine built for research purposes by the speech research group at Carnegie Mellon University (CMU) (CMU SPHINX Open Source Speech Recognition Engines 2007; Huang et al. 2003). Many theses around the world have used SPHINX for speech recognition (Rosti 2004; Nedel 2004; Doh 2000; Ohshima 1993; Raj 2000; Huerta 2000; Rozzi 1991; Liu 1994; Gouvêa 1996; Seltzer 2000; Siegler 1999), but not for Arabic. Reasons for selecting this engine will be presented later. The SPHINX-IV architecture consists of a series of processes independent of each other, as will be shown in later sections; each block in the architecture diagram represents one such independent process.

2 Review of speech recognition engines

In this section some of the well-known speech recognition engines are reviewed: SPHINX, the Hidden Markov Model Toolkit (HTK) and the Center for Spoken Language Understanding Toolkit (CSLU).

2.1 SPHINX engine

SPHINX is a large-vocabulary, speaker-independent, Hidden Markov Model (HMM)-based continuous speech recognition system. SPHINX was developed at CMU in 1988 (Russell et al. 1995; Christensen 1996; Rabiner and Juang 1993) and was one of the first systems to demonstrate the feasibility of accurate, speaker-independent, large-vocabulary continuous speech recognition. SPHINX-II (Russell et al. 1995) was one of the first systems to employ semi-continuous HMMs.

SPHINX is a collection of several ASR systems; it was created in collaboration between the SPHINX group at CMU, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts

Institute of Technology (MIT). The current working engines of SPHINX are SPHINX I, II, III, IV and PocketSphinx. In addition to these engines, SPHINX has one trainer, which is capable of producing an acoustic model that can be used in all SPHINX versions except SPHINX-I. Every SPHINX engine has its own characteristics and usage.

2.2 Hidden Markov Model Toolkit (HTK)

S. Young presented a framework for the HTK toolkit (Hermansky 1990). He stated that the Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research, although it has been used for numerous other applications. It was originally developed at the Machine Intelligence Laboratory (formerly known as the Speech Vision and Robotics Group) of the Cambridge University Engineering Department (CUED), where it has been used to build large vocabulary speech recognition systems. It consists of a set of library modules and a set of more than 20 tools. An HTK-based recognizer was included in both the ARPA September 1992 Resource Management Evaluation and the November 1993 Wall Street Journal CSR Evaluation, where in both cases performance was comparable with the systems developed by the main ARPA contractors.

In 1999, the current version of HTK was V2.2 and all rights to HTK rested with Entropic. At that time Entropic's major business focus was voice-enabling the Web, and Microsoft purchased Entropic in November 1999. Microsoft later decided to make the core HTK toolkit available again and licensed the software back for research and academic usage, so that it could be distributed and developed for these purposes.

2.3 Hybrid systems

A. Ganapathiraju et al. described the use of a powerful machine learning scheme, Support Vector Machines


(SVM) (Lee et al. 1990), within the framework of Hidden Markov Model (HMM) based speech recognition. They developed the hybrid SVM/HMM system based on their public-domain toolkit. The hybrid system has been evaluated on the OGI Alpha-digits corpus and performs at 11.6% WER, as compared to 12.7% with a triphone mixture-Gaussian HMM system, while using only a fifth of the training data used by the triphone system. Several important issues that arise out of the nature of SVM classifiers have been addressed.

3 Arabic language speech recognition research

Katrin Kirchhoff et al. worked on a project at the 2002 Johns Hopkins Summer Workshop (Kirchhoff et al. 2002) which focused on the recognition of dialectal Arabic. Three problems were addressed:

1. The lack of short vowels and other pronunciation information in Arabic texts.
2. The morphological complexity of Arabic.
3. The discrepancies between dialectal and formal Arabic.

They used the only standardized corpus of dialectal Arabic available at the time (2002), the LDC Call Home (CH) corpus of ECA. The corpus is accompanied by transcriptions in two formats: standard Arabic script without diacritics and a "Romanized" version, which is close to a phonemic transcription. An example of the Romanized form used in their experiments is shown in Table 2. They stated that Romanized Arabic is unnatural and difficult to read for native speakers; moreover, script-based recognizers (where acoustic models are trained on graphemes rather than phonemes) have performed well on Arabic ASR tasks in the past.

3.1 Automatic Romanizing Tool (ART)

When Katrin Kirchhoff et al. evaluated their system, WERs of 59.9% and 55.8% were obtained (evaluated against the script and Romanized transcriptions, respectively). They concluded that it would be advantageous to have a large amount of Romanized training data for the development of future Arabic ASR systems, and focused on building an ART rather than exploring the reasons behind these results.

Table 2 ECA transliterated and Romanized sentence representations (Kirchhoff et al. 2002)

Transliterated script    AlHmd llh kwlsB w Antl Azlk
Romanized word forms     llHamdulillA kuwayyisaB wi inti izzayik

In our opinion, the real reason for this result is that it is unfair to compare these two systems, as they are two entirely different systems. The first was trained on standard Arabic script without diacritics, while the other was trained using a Romanized transcription that includes vowel information; the former therefore hides important information such as the short vowels, while this information is present in the latter. For this reason, together with the fact that Romanized Arabic is unnatural and difficult to read for native speakers and the failure of using out-of-corpus data that has proved successful in other languages (according to Katrin Kirchhoff et al.), we think that research on Arabic ASR should be done on original, fully or partially diacritized Arabic corpora rather than Romanized ones, and that an automatic pronunciation dictionary tool (APDT) should be developed rather than an ART, in the spirit of Sir Thomas Elliot, who stated: "If physicians be angry, that I have written physics in English, let them remember that the Greeks wrote in Greek, the Romans in Latin, Avicenna and the other in Arabic, which were their own and proper maternal tongues" (CMU SPHINX trainer 2008).

Modular recurrent Elman neural networks (MRENN) for Arabic isolated speech recognition have been implemented (Young 1994). The Elman network is a special kind of recurrent network, originally developed for speech recognition; it is a two-layer network in which the hidden layer is recurrent. The inputs to the hidden layer are the present inputs, and the outputs of the hidden layer from the previous time step are saved in buffers called context units. Their work duplicates, for Arabic, a previous work done for English (Ganapathiraju et al. 2000), which described a novel method of using recurrent neural networks (RNN) for isolated word recognition. Each word in the target vocabulary is modeled by a fully connected recurrent network. To recognize an input utterance, the best matching word is determined based on its temporal output response. The system is trained in two stages. First, the RNN speech models are trained independently to capture the essential static and temporal characteristics of individual words. This is performed


by using an iterative re-segmentation training algorithm which automatically gives the optimal phonetic segmentation for each training utterance. The second stage involves mutually discriminative training among the RNN speech models aiming at minimizing the probability of misclassification. M. M. El Choubassi et al. used a separate Elman network for each word in the vocabulary set. Although they obtained promising results (accuracy ∼87%) for their very small isolated-word system, this approach is only suitable for isolated recognition with a very small vocabulary and not for LVCSR, where memory and performance problems will be faced. In Baugh and Cable (1978), a method for automatically segmenting Holy Qur'anic Arabic is presented; a linguistic segmentation estimate was used when the recognition failed to provide a strong classification. In El Choubassi et al. (2003), the authors presented work on data recording, transcription and speech recognition for Egyptian colloquial Arabic, based on a Romanized transcription of Arabic. They stated that a Romanized grapheme-to-phoneme tool for Standard Arabic (collected in Tunisia and Palestine) had already been developed at CMU, and they used the Egyptian Call Home data and pronunciation dictionaries from the LDC.

4 Pronunciation Dictionaries

A Pronunciation Dictionary (PD) is a human-generated or machine-generated table of possible words and their permissible phonetic sequences, i.e. their pronunciations. Since there are many possible sequences of phonetic units that do not comprise actual words, the lexical model prevents many phonetic sequences from being explored during the recognition process (Lee et al. 1998).

The creation of a PD is often not a trivial task. It can be created manually by a human expert in the modeled language, but especially with large-vocabulary recognizers, in which we deal with tens of thousands of words, this approach can be very expensive and time-consuming and is therefore often not feasible. The process thus has to be at least partially automated. There has been considerable research on developing PDs for ASR; in this section we review some of it.

4.1 Automatic generation of a Pronunciation Dictionary

In Essa (1998), it is stated that an appropriate PD constructed by hand or by a rule-based system has been confirmed to improve recognition performance, but such dictionaries require time and expertise to construct. Since the relation between grapheme and phoneme is not direct, the researchers proposed a method for automatically generating a PD for the Japanese language, based on a pronunciation neural network that is able to predict plausible pronunciations from the canonical pronunciation. They used a multi-layer perceptron to predict alternative pronunciations from the canonical pronunciations based on the maximum output of the network.

In Schultz (2002), the authors use a statistical procedure to determine the phonetic realization of phoneme base forms (the expected pronunciation). Taking account of lexical stress and word boundary information, they generated statistics for phonemes in word base forms from a phonetically labeled speech corpus. The estimates derived from this corpus were then used to generate pronunciation networks from base forms in the DARPA Resource Management (RM) task. A significant improvement in recognition accuracy was obtained on the RM task using the pronunciation networks thus derived, relative to the base-form pronunciations. In Fukada et al. (1999), the authors described a procedure for generating the phoneme sequence automatically using their general-purpose phonetic front end. In order to generate a pronunciation string for each word, a neural network first assigned a score for each of 39 phonemes to each 6 msec frame in the word, and then a Viterbi search found the best-scoring sequence. They found that the difference between systems using networks derived from hand labels and those using machine labels is not significant, and therefore recommend the use of automatically generated PDs.

In Fukada et al. (1999), the authors stated that several approaches have been adopted over the years for grapheme-to-phone conversion for European Portuguese: hand-derived rules, neural networks, classification and regression trees, etc. Their first approach to grapheme-to-phone conversion was a rule-based system with about 200 rules. Later, this rule-based approach was compared with a neural-net approach. Finally, they described the development of a grapheme-to-phone conversion module based on Weighted Finite


State Transducers. They investigated the use of both knowledge-based and data-driven approaches.

4.2 Arabic speech sounds and properties

Arabic is a Semitic language and one of the oldest languages in the world today; it is currently the fifth most widely used language. The Arabic alphabet is used in several languages, such as Persian and Urdu (Hiyassat et al. 2005). Arabic linguistics came into being in the eighth century with the beginning of the expansion of Islam. This early start can be explained in terms of the tremendous need felt by the members of the new community to know the language of the Holy Qur'an, which had become the official language of the young Islamic state (Al-Zabibi 1990).

Arabic linguists exerted huge effort in explaining linguistic rules and Arabic grammar; however, this effort did not continue, especially into the information era (Alghamdi et al. 2004).

The relative regularity of the syntax presents some advantages for its formalization. In addition, the Arabic language has the following characteristic: from one root the derivational and inflectional systems are able to produce a large number of words, or lexical forms, each of which has specific patterns and semantics. In a certain sense, the Arabic language seems better suited for computers than English or French (Hadj-Salah 1983).

Contemporary Standard Arabic, a modernized version of classical Arabic, is the language commonly in use in all Arabic-speaking lands today. It is the language of science and learning, of literature and the theater, and of the press, radio and television. Notwithstanding the unanimous acceptability of Contemporary Standard Arabic and its general adoption as the common medium of communication throughout the Arab world, it is not the everyday speech of the people (Alghamdi et al. 2004).

4.3 Grapheme-based Pronunciation Dictionary for Arabic

Grapheme-to-phoneme conversion is an important prerequisite for many applications involving speech synthesis and recognition (Lee et al. 1998). For ASR this process is important in developing the PD, which, as mentioned earlier, is normally hand-crafted. In this section, a thorough description of grapheme-

based PDs will be presented. First, the importance of the PD for ASR will be described; then the rules of the Arabic phonological system will be presented, followed by a description of orthographic-to-phonetic transcription. The section concludes with a description of the generation of PDs for both MSA and the Holy Qur'an.

Large Vocabulary Continuous Speech Recognizers (LVCSR) are commercially available from different vendors. Along with this increased availability comes the demand for recognizers in many different languages that have so far not been a focus of speech recognition research. It is estimated that as many as four to six thousand different languages exist today (Alghamdi 2001). Therefore, increased thought has recently been given to creating methods for automating the design of speech recognition systems for new languages while making use of the knowledge that has been gathered from already studied languages.

One of the core components of a speech recognition system is the PD. Its main purpose is to map the orthographic representation of a word to its pronunciation; the PD defines the search space of the recognizer (Andersen et al. 1996). The performance of a recognition system depends on the choice of subunits and the accuracy of the PD. An accurate mapping of the orthographic representation of a word onto a subunit sequence is important to ensure recognition quality; otherwise the acoustic models are trained with the wrong data, or, during decoding, the calculation of the scores for a hypothesis is falsified by applying the wrong models (Schultz 2002; Schultz et al. 2004).

The PD lists the most likely pronunciation, or citation form, of all words that are contained in the speech corpus. Producing the pronunciations for the corpus can range from very simple and achievable with automatic procedures to very complex and time-consuming (Fukada et al. 1999).

As mentioned earlier, the creation of a PD is not a trivial task and the process has to be at least partially automated. With sufficient knowledge of the target language, one can try to build a set of rules that map the orthography of a word to its pronunciation. For some languages this might work very well; for others it might be almost impossible. Arabic is an example of a language with a very close grapheme-to-phoneme relation (Hadj-Salah 1983). Thus comparatively few rules suffice to build a PD containing the canonical information.
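To make the close grapheme-to-phoneme relation concrete, here is a minimal, hypothetical sketch of such a rule-based mapping for fully diacritized text. The tiny letter inventory and the phone symbols are illustrative assumptions; the paper's own APDT implements the full rule set described in Sect. 4.6.

```python
# Minimal sketch of a rule-based grapheme-to-phoneme mapping for fully
# diacritized Arabic. The phone symbols and the tiny rule set are
# illustrative only, not the dictionary format used in the paper.

CONSONANTS = {"ب": "B", "ت": "T", "د": "D", "ر": "R", "س": "S", "م": "M", "ن": "N"}
SHORT_VOWELS = {"\u064E": "AE", "\u0650": "IH", "\u064F": "UH"}  # fatha, kasra, damma
SUKUN = "\u0652"    # marks the absence of a vowel
SHADDA = "\u0651"   # marks consonant doubling (gemination)

def word_to_phones(word: str) -> list[str]:
    phones = []
    for ch in word:
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
        elif ch in SHORT_VOWELS:
            phones.append(SHORT_VOWELS[ch])
        elif ch == SHADDA and phones:
            phones.append(phones[-1])       # repeat the previous consonant
        elif ch == SUKUN:
            continue                        # no phone is emitted
        # characters outside this toy inventory are silently skipped
    return phones

# e.g. a diacritized word maps directly to its phone string:
print(" ".join(word_to_phones("دَرَسَ")))   # -> D AE R AE S AE
```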


4.4 Automatic versus hand-crafted Pronunciation Dictionaries

Recognition quality is maintained by maintaining the quality of the PD, which maps the orthography of a word to the way it is pronounced by speakers. The best dictionaries, such as CMUdict (the pronunciation dictionary created by Carnegie Mellon University), are usually hand-crafted (Fukada et al. 1999; Killer et al. 2003). However, manually created dictionaries require an expert in the target language (Killer et al. 2004), which is a time-consuming and costly approach, especially for large-vocabulary speech recognition. If no language-expert knowledge is available or affordable, methods are needed to automate the PD creation process. Several different methods have been introduced over time; most of them are based on the conversion of the orthographic transcription to a phonetic one, using either rule-based (Killer et al. 2003) or statistical approaches (Killer et al. 2004).

In order to reduce both the cost and the time required to develop LVCSR systems, the problem of creating the PD must be solved. In the following sections, the development of an automatic PD tool for the Standard Arabic language is described.

4.5 Segmenting Arabic utterance

The first basic rule that operates in the phonological system of Arabic without exception is that the number of syllables in an utterance is equal to the number of vowels. The issue, then, is not the number of syllables in an utterance, since this is automatic, but rather the boundaries, which are signaled by either zero, one or two consonants (Alghamdi et al. 2004).

The second basic rule of Arabic phonology is that the onset of a syllable is like the beginning of an utterance: both begin with a single consonant followed by a vowel, for example a consonant "n" followed by a short vowel (Alghamdi et al. 2004).

The third rule is that the coda of a syllable is identical with the end of an utterance, coinciding entirely with the codas of the six syllable types previously discussed. Accordingly, syllables in Arabic can be either open or closed, i.e. they can end in a vowel or in one or two consonants, respectively.

Clearly, then, one should use the three rules just stated to begin the process of segmentation in Arabic.

When properly applied, these rules enable one to segment almost any utterance in Arabic correctly and easily, for they make the division between the coda and the onset of nearly all contiguous syllables clear-cut.
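A minimal sketch of how these three rules can drive segmentation is given below: every syllable takes exactly one vowel as its nucleus, a single consonant as its onset, and any remaining consonants form the coda of the preceding syllable. The phone labels and the assumption of a well-formed input are illustrative, not the paper's actual phone set.

```python
# Minimal sketch of utterance segmentation following the three rules above.
# Assumes a well-formed phone string that obeys those rules (starts with a
# consonant-vowel pair and contains one vowel per syllable).

VOWELS = {"AE", "IH", "UH", "AA", "II", "UU"}   # assumed short/long vowel tags

def syllabify(phones):
    syllables, i = [], 0
    while i < len(phones):
        onset = [phones[i]]                 # rule 2: one onset consonant
        i += 1
        nucleus = [phones[i]]               # rule 1: exactly one vowel
        i += 1
        coda = []                           # rule 3: consonants up to, but not
        while i < len(phones) and phones[i] not in VOWELS and \
              not (i + 1 < len(phones) and phones[i + 1] in VOWELS):
            coda.append(phones[i])          # including, the next syllable's onset
            i += 1
        syllables.append(onset + nucleus + coda)
    return syllables

# e.g. D AE R AE S AE  ->  [['D', 'AE'], ['R', 'AE'], ['S', 'AE']]
print(syllabify("D AE R AE S AE".split()))
```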

4.6 Orthographic to phonetic transcription

Converting the Arabic phonetic script into rules is one of the major obstacles facing researchers on Arabic text-to-speech systems and speech recognition. Although Arabic is one of the oldest languages whose sounds and phonological rules were extensively studied and documented (more than 12 centuries ago) (Alghamdi et al. 2004), these valuable studies need to be compiled from scattered literature and formulated in a modern mathematical framework. The aim of this section is to formulate the grapheme-to-phoneme relationship for Arabic.

Arabic is an algorithmic language, at least from the phonology, writing and derivation points of view. For example, no rule can explain the pronunciation of "g" in English in the words "laugh, through, good and geography", while Arabic has a direct grapheme-to-phoneme mapping for most graphemes. In general, Arabic text with diacritics is pronounced as it is written, using certain rules. Contrary to English, Arabic does not have words with different orthographic forms and the same pronunciation.

There are sixteen essential rules in orthographic-to-phonetic transcription (Al-Zabibi 1990; Hadj-Salah 1983).¹ These rules are:

1. The sukun sign is not a symbol of any phoneme; it indicates that the consonant carrying it is followed by another consonant without an intermediate vowel (i.e. whether it is written or not, it does not affect the pronunciation of the consonant itself), so the consonant is pronounced as is, without introducing any vowels.
2. The alef written after the group waw (as at the end of plural verb forms) is not pronounced.
3. Pharyngealization (emphasis): there are pharyngealized consonants in Standard Arabic, where the consonant is stressed when it is pronounced; an example is the word meaning "count", where the sign indicates that the consonant is stressed when pronounced.

¹ URL: http://www.phonetik.uni-muenchen.de/Forschung/BITS/TP1/Cookbook/node145.html (2006).


Table 3 Pronunciation rules for laam

Pronunciation rule
Moon laam: pronounced if it is followed by one of the moon letters
Sun laam: assimilated if it is followed by one of the sun letters

4. The pronunciation of the alef depends entirely on the characters that follow it:
   a. It is not pronounced if it is followed by two consonants, as in the phrase meaning "in the school".
   b. It is pronounced if it is part of the laam of the definite article.
   c. It is pronounced as a short vowel if it is the first letter of a verb whose third character carries that short vowel.
   d. If none of the above rules apply, the alef is pronounced as a short vowel.
5. The alef maqsurah is always preceded by the short vowel fatha and is pronounced as a long aa.
6. The feminine Taa, which is used in Arabic at the end of a noun to change its gender from masculine to feminine: if the word containing the feminine Taa is the last word in the sentence, the Taa is pronounced as Haa; otherwise it is pronounced as Taa.

7. The letter laam of the definite article is prefixed to nouns and added to the structure of the word. There are two types of laam: the moon laam, as in "alqamar" (the moon), where the laam is pronounced ("a l q a m a r"), and the sun laam, as in "alshams" (the sun), pronounced "ashshams", where the laam is not pronounced. The laam of the definite article is thus either pronounced or assimilated, depending on the successor character, as shown in Table 3.

8. The “Hamz”–Glottal-(( ) ):The “Hamza” is pronounced when it comes aftera pause or at the beginning of an utterance. But itis not pronounced in all other cases as shown inTable 4.

9. When two successive words meet such that the last character of the first word and the first character of the second word are both unvowelized, the general rule is that the short vowel /i/ is introduced after the last character of the first word.

Table 4 Pronunciation examples

10. The pause: an utterance in Arabic is never terminated by a short vowel; this means that the short vowel of the last word of the sentence is not pronounced.

11. The three "Tanween" double diacritic signs are pronounced as the corresponding short vowel followed by "N" when the word is not in utterance-final position; otherwise the final "N" is not pronounced.

12. The lengthening alef is pronounced as a long aa.
13. If the predecessor of the vowel waw carries the short vowel damma, then the waw is pronounced as a long uu.
14. If the predecessor of the vowel Yaa carries the short vowel kasra, then the Yaa is pronounced as a long ii.
15. If there are three successive consonants, a short vowel is introduced to break the cluster.

16. The laam is always pronounced with tarqeeq (a light, non-emphatic quality), except in the name of Allah (almighty), where it is pronounced with tafkheem (an emphatic quality) if it comes at the beginning of the utterance or if its predecessor carries one of the two short vowels fatha or damma.
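Although the Arabic examples above were lost in reproduction, the general mechanism is an ordered set of rewrite rules applied to a fully diacritized string. The sketch below illustrates this with two stand-in rules (the silent alef after the group waw, and sun-laam assimilation) over ASCII transliterations; it is not the paper's APDT, and the rule patterns are assumptions.

```python
# Minimal sketch of applying ordered orthographic-to-phonetic rewrite rules
# to a transliterated string. Only two illustrative rules are shown; the
# real tool encodes all sixteen rules described above.
import re

RULES = [
    (re.compile(r"uwA\b"), "uw"),           # rule 2: alef after group waw is silent
    (re.compile(r"\bAl(sh|s|t|d|r|z|n)"),   # rule 7: sun laam assimilates to the
     lambda m: "A" + m.group(1) * 2),       #         following sun letter
]

def apply_rules(translit: str) -> str:
    for pattern, replacement in RULES:
        translit = pattern.sub(replacement, translit)
    return translit

# e.g. "Alshams" (the sun) -> "Ashshams"; "katabuwA" (they wrote) -> "katabuw"
print(apply_rules("Alshams"), apply_rules("katabuwA"))
```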

4.7 Generation of Pronunciation Dictionaries

Grapheme-based pronunciation dictionaries are built by splitting a word into its graphemes, which are used as subunits: the dictionary entry is simply the word followed by its fragmentation into graphemes. This is a very easy and fast approach to producing pronunciation dictionaries. The questions that arise are how well graphemes are suited as subunits, to what extent they are inferior to


phonemes or whether they perform comparably well, how we cluster graphemes into polygraphemes, and how we generate the questions to build up the decision tree.
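A grapheme-based dictionary entry is therefore trivial to generate, as the following minimal sketch shows (the transliterated example words are placeholders):

```python
# Minimal sketch of grapheme-based dictionary entry generation: each word is
# mapped to the sequence of its own characters (graphemes) as subunits.

def grapheme_entry(word: str) -> str:
    # A dictionary line is the word followed by its graphemes.
    return word + "\t" + " ".join(word)

for w in ["ktAb", "qlm"]:            # e.g. transliterations of two Arabic words
    print(grapheme_entry(w))
# ktAb    k t A b
# qlm     q l m
```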

Apart from dialectal Arabic, there are two kinds of pronunciation for the Arabic language: the MSA pronunciation and the Holy Qur'an pronunciation (Baugh and Cable 1978). The standard Arabic pronunciation is governed by the rules mentioned above, while the Holy Qur'an pronunciation is governed by the so-called Tajweed rules, which will be described in Sect. 5.3. The proposed PD deals with both of these pronunciations.

5 Experimental environment

In this section the development of the Arabic corpora and the baseline system used to experiment with the developed PD is presented, namely the Holy Qur'an Corpus (HQC-1), the Command and Control Corpus (CAC-1) and the Arabic Digits Corpus (ADC). The focus of our research is on developing these corpora in order to facilitate testing the PD already developed. Selecting the SPHINX-IV engine, which is built on an open architecture, makes the results we present independent of the specific recognition engine used. The particular aspects of the speech databases are presented to provide the reader with useful context for interpreting our results and to provide other researchers with enough information to repeat and validate our experiments.

5.1 Arabic Corpus and Baseline System

Most of the research done on SPHINX-IV used either the Wall Street Journal corpora (s3-94, s0-94) and/or the Resource Management Corpus (RM) (Rosti 2004; CMU SPHINX Open Source Speech Recognition Engines 2007; Huang et al. 2003; Nedel 2004; Ohshima 1993; CMU SPHINX trainer 2008; Young 1994; Hiyassat et al. 2005; Al-Zabibi 1990). These corpora are also used for other Latin-script languages such as French or Italian, due to similarities between these languages and English from a phoneme point of view.

Unfortunately, there are great differences between English and Arabic from a phoneme point of view, due to the existence of special phonemes such as Dhad, Dha, Tah, aeen, ghaeen, haa, ssad, KHaa and Qaaf. Although some researchers have already used English corpora for Arabic speech recognition, most of these approaches did not offer good performance (according to Katrin Kirchhoff et al. (2002), the WER obtained was 59.9% for Romanized Arabic, which is not comparable to English ASR WER). For this reason, we decided to build a pure formal Arabic corpus to test our algorithm; this corpus may also serve as a benchmark for future research.

In building a corpus for any language, a certain domain should be selected and a domain-dependent transcription obtained. Recordings of this transcription are made with different speakers in a sound-isolated booth and sampled at different sampling rates (Rosti 2004; Huang et al. 2003; Raj 2000; Alghamdi et al. 2004; Killer et al. 2003).

Of course, such tasks are exhausting in both time and cost and beyond an individual's capabilities. Usually such tasks are carried out by bodies such as the Defense Advanced Research Projects Agency (DARPA), Johns Hopkins University (JHU), Carnegie Mellon University (CMU), the Hidden Markov Model Toolkit (HTK) group, and the Network for Euro-Mediterranean Language Resources (NEMLAR) (CMU SPHINX Open Source Speech Recognition Engines 2007; Huang et al. 2003; Fukada et al. 1999; Mimer et al. 2004; Black et al. 1998).

As mentioned earlier, the Arabic alphabet only contains letters for long vowels and consonants. Short vowels and other pronunciation phenomena, such as consonant doubling, can be indicated by diacritics (short strokes placed above or below the preceding consonant). However, Arabic texts are almost never fully diacritized and are thus potentially unsuitable for recognizer training, with the exception of the Holy Qur'an and a few school textbooks. The Holy Qur'an is considered the most important reference for the Arabic language.

5.2 Corpus design criteria

Developing a speech corpus is not a trivial task and needs resources to be allocated; unfortunately, such resources did not exist for this research. In some projects, a figure of hundreds of thousands of dollars is considered a limited budget. Some researchers consider 40 hours of broadcast news to be enough, due to the high cost of the resources (Mimer et al. 2004).


5.3 The Holy Qur'an Corpus HQC-1

The development process of the HQC-1 started by collecting recordings of different reciters. The collected audio files are of different formats, some in MP3 format and others in wav format. All collected audio files were converted using the Sound Forge sound processing software (Black et al. 1998). The audio files have the following characteristics:

Format: mono .wav files
Sampling frequency: 16000 Hz; 16 bit

5.3.1 Filename conventions

Each file has been given a unique file name; filenames of the resources carry three ordered types of information:

a. Reciter name.
b. Sora (chapter) name.
c. Serial number identifying the audio file.

An example file name is Huthaifi-Isra-001.wav. The first part of the file name stands for the name of the reciter, in this case al-Huthaifi (one of the famous reciters of the Holy Qur'an); the second part is the sora name, in this example the Isra sora (sora number seventeen of the Holy Qur'an); and the last part is the serial number of the audio file with respect to both the reciter and the sora. The file extension (.wav) indicates the file format.
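A minimal sketch of parsing this naming convention is shown below; the helper name is ours, not part of the corpus tooling.

```python
# Minimal sketch of parsing the HQC-1 filename convention
# <Reciter>-<Sora>-<Serial>.wav described above.
from pathlib import Path

def parse_hqc_filename(name: str) -> dict:
    stem = Path(name).stem                  # drop the ".wav" extension
    reciter, sora, serial = stem.split("-")
    return {"reciter": reciter, "sora": sora, "serial": int(serial)}

print(parse_hqc_filename("Huthaifi-Isra-001.wav"))
# {'reciter': 'Huthaifi', 'sora': 'Isra', 'serial': 1}
```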

5.3.2 Directory structure

The audio file directory is divided into subdirectories, with each reciter in a separate directory; the second level of subdirectories corresponds to the soras of the Holy Qur'an. Example: HQC/Huthaifi/Isra/filename.wav.

5.4 Languages and character sets encodings

The language used in the transcription files is diacritized Arabic; the encoding used is Unicode (UTF-8), with line-feed-only line endings.

Fig. 1 Example of control file content and format

5.4.1 Tools/software used for marking silences

After choosing a good-quality recording, it is split into sub-recordings of about 10 to 30 seconds each. Sound Forge is used to automatically detect the silences in the original recording using the Auto Cue feature; a single cue is added in the center of each detected silent region. The Threshold value sets the volume level regarded as silence. In most cases, the value should be −40 dB or higher, otherwise any background hiss, pops or clicks will be treated as non-silence and no silence will be marked at all. If no cue points appear, we try increasing this value to −30 dB or higher.

5.4.2 Silence length

The Silence length value determines how much silence is required before it is marked. Some recordings contain brief silences that we usually do not want to mark; this value helps to avoid marking brief pauses within a wave file. Values between 1.0 and 1.5 seconds are used to ignore these brief silences and mark only longer silences between recordings.
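The following sketch mimics this threshold-and-duration silence marking; it is not Sound Forge's Auto Cue implementation, and the RMS level estimate, frame size and NumPy usage are assumptions of the sketch.

```python
# Minimal sketch of energy-threshold silence marking: frames below the dB
# threshold for at least the given minimum duration are marked, and a single
# cue is placed at the center of each such region.
import numpy as np

def silence_cues(samples, rate, threshold_db=-40.0, min_silence_s=1.0,
                 frame_s=0.02):
    frame = int(rate * frame_s)
    n_frames = len(samples) // frame
    cues, run_start = [], None
    for i in range(n_frames + 1):
        if i < n_frames:
            x = samples[i * frame:(i + 1) * frame].astype(float)
            rms = np.sqrt(np.mean(x ** 2)) + 1e-12
            silent = 20.0 * np.log10(rms) < threshold_db   # level re full scale
        else:
            silent = False                   # flush a trailing silent run
        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            if (i - run_start) * frame_s >= min_silence_s:
                cues.append((run_start + i) / 2.0 * frame_s)  # center of silence
            run_start = None
    return cues

# e.g. one second of noise, two seconds of silence, one second of noise
rate = 16000
rng = np.random.default_rng(0)
audio = np.concatenate([rng.normal(0, 0.1, rate), np.zeros(2 * rate),
                        rng.normal(0, 0.1, rate)])
print(silence_cues(audio, rate))             # roughly [2.0] (seconds)
```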

5.4.3 Splitting recordings

Recordings were split into small recordings of 10 to 30 seconds each, according to the cue point locations. Each of these files is given a unique file identification name. A control file containing the list of file names and paths for all recordings is created; the extension of this file is ".fileids", for example "an4_train.fileids", as shown in Fig. 1. The audio file extension is not included in the control file entries; it is defined in the configuration file, since SPHINX-IV accepts different file formats such as wave, raw or NIST.
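A small sketch of generating such a control file from the corpus directory tree is given below; the corpus root and output file name are placeholders.

```python
# Minimal sketch of generating a ".fileids" control file: one line per
# recording, giving its path relative to the corpus root without the audio
# extension (the extension is supplied by the configuration).
from pathlib import Path

def write_fileids(corpus_root: str, out_path: str) -> None:
    root = Path(corpus_root)
    with open(out_path, "w", encoding="utf-8") as out:
        for wav in sorted(root.rglob("*.wav")):
            rel = wav.relative_to(root).with_suffix("")   # strip ".wav"
            out.write(rel.as_posix() + "\n")

# e.g. write_fileids("HQC", "hqc_train.fileids")
# producing lines such as:  Huthaifi/Isra/Huthaifi-Isra-001
```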


5.4.4 Feature extraction

For every recording in the training corpus, a set of feature files is computed from the audio training data. Each recording is transformed into a sequence of feature vectors using the front-end executable provided with the SPHINX-IV training package, as explained earlier.

The process starts with pre-emphasis, applied first as a high-pass filter to attenuate low-frequency noise. Then a Hamming window is applied to slice the data into a number of overlapping windows (usually referred to as "frames" in the speech world). After the Hamming window, the FFT is applied to compute the discrete Fourier transform of the input sequence, analyzing the signal into its frequency components. A Mel filter bank (MFFB) is applied to the output of the FFT; the output is an array of filtered values, typically called the Mel spectrum, each value corresponding to the result of filtering the input spectrum through an individual filter, so the length of the output array equals the number of filters created. To obtain the MFCCs, the DCT is applied; the mean of the MFCCs is then computed and removed to perform Cepstral Mean Normalization (CMN), which reduces distortion caused by the transmission channel. After CMN, the first and second derivatives are computed in order to model the speech signal dynamics (all as explained earlier).
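As a rough illustration of this front end, the sketch below implements the same chain of steps in NumPy: pre-emphasis, Hamming-windowed frames, FFT power spectrum, a triangular Mel filter bank, log and DCT to obtain MFCCs, cepstral mean normalization, and first and second derivatives. The frame sizes, filter counts and constants are illustrative defaults, not SPHINX-IV's actual front-end configuration.

```python
import numpy as np

def mfcc_pipeline(signal, rate=16000, frame_len=400, hop=160,
                  n_fft=512, n_filters=26, n_ceps=13):
    # Pre-emphasis: a simple high-pass filter on the waveform.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # Power spectrum of each frame (discrete Fourier transform).
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Triangular Mel filter bank applied to the power spectrum.
    mel_pts = np.linspace(0.0, 2595.0 * np.log10(1.0 + rate / 2.0 / 700.0),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
                    / rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_spec = np.log(power @ fbank.T + 1e-10)
    # DCT of the log Mel spectrum gives the MFCCs.
    j, k = np.arange(n_filters)[:, None], np.arange(n_ceps)[None, :]
    mfcc = mel_spec @ np.cos(np.pi / n_filters * (j + 0.5) * k)
    # Cepstral mean normalization, then first and second derivatives.
    mfcc -= mfcc.mean(axis=0)
    delta = np.gradient(mfcc, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([mfcc, delta, delta2])

# e.g. one second of random audio -> 98 frames of 39-dimensional features
feats = mfcc_pipeline(np.random.default_rng(0).normal(0.0, 0.1, 16000))
print(feats.shape)    # (98, 39)
```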

5.4.5 Transcription file

Each recording is accurately transcribed; any error in the transcription will mislead the training process later. The transcription is done manually: we listen to the recording and write down exactly what we hear, so that even silence or noise is represented in the transcription. Each transcription is followed by the file name without the path, as shown in Fig. 2; this maps the transcription to its corresponding recording.

The HQC-1 corpus consists of about 7742 recordings. These recordings were processed (down-sampled to 16 kHz and divided into small utterances) and then transcribed, resulting in a total of 59428 words, 25740 unique words and about 18.35 hours of recordings. It took a total of about 732 working hours to build this corpus.

Fig. 2 Transcription file contents

6 Pronunciation Dictionary creation

In order to create the PD, the APDT described earlier is invoked; the APDT needs a transcription file and produces the PD from it. Once the APDT is invoked, two files are created: one is the PD and the other is a file containing the transcription with pronunciation alignment, so that each word in the transcription is mapped to its pronunciation in the PD file. The PD file has all acoustic events and words in the transcripts mapped onto the acoustic units we want to train. Redundancy in the form of extra words is permitted. The dictionary must have all alternate pronunciations marked with parenthesized serial numbers starting from (2) for the second pronunciation; the marker (1) is omitted. Each word in the dictionary is followed by its pronunciation, as shown in Fig. 3.
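A minimal sketch of emitting entries in this format (word, then phones, with alternate pronunciations numbered from (2)) is given below; the example word and phones are hypothetical transliterations.

```python
# Minimal sketch of writing dictionary lines: each word is followed by its
# phone sequence, alternates are marked "(2)", "(3)", ... and "(1)" is omitted.

def dictionary_lines(word: str, pronunciations: list[list[str]]) -> list[str]:
    lines = []
    for k, phones in enumerate(pronunciations, start=1):
        label = word if k == 1 else f"{word}({k})"
        lines.append(f"{label}\t{' '.join(phones)}")
    return lines

for line in dictionary_lines("ktAb", [["K", "I", "T", "AA", "B"],
                                      ["K", "T", "AA", "B"]]):
    print(line)
# ktAb      K I T AA B
# ktAb(2)   K T AA B
```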

6.1 Filler dictionary

The filler dictionary lists the non-speech events as "words" and maps them to user-defined phones. This dictionary must at least have the entries shown in Fig. 4.

Fig. 3 Sample of pronunciation dictionary


Fig. 4 Filler dictionary minimum content


The entries stand for:
<s> : beginning-of-utterance silence
<sil> : within-utterance silence
</s> : end-of-utterance silence
Note that the words <s>, </s> and <sil> are treated

as special words and are required to be present in the filler dictionary. At least one of these must be mapped onto a phone called "SIL". The phone SIL is treated in a special manner and is required to be present. SPHINX-IV expects us to name the acoustic events corresponding to our general background condition SIL. For clean speech these events may actually be silences, but for noisy speech they may be the most general kind of background noise that prevails in the database. Other noises can then be modeled by phones defined by us.
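A minimal filler dictionary consistent with this description could be written as follows (the output file name is illustrative):

```python
# Minimal sketch: writing the smallest filler dictionary consistent with the
# description above, mapping the three special "words" to the SIL phone.
filler_entries = {"<s>": "SIL", "<sil>": "SIL", "</s>": "SIL"}

with open("hqc.filler", "w", encoding="utf-8") as f:   # file name is illustrative
    for word, phone in filler_entries.items():
        f.write(f"{word}\t{phone}\n")
```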

6.2 Phone list

The phone list is a list of all acoustic units that we want to train models for. SPHINX-IV does not permit units other than those in our dictionaries, and all units in the two dictionaries must be listed here. In other words, the phone list must have exactly the same units used in the dictionaries, no more and no less. Each phone must be listed on a separate line in the file, beginning from the left, with no extra spaces after the phone. An example is shown in Fig. 5.

Having created the transcription file, the PD, the filler dictionary, the control file and the features of the audio files, the system can now be trained in order to test the PD accuracy.

6.3 Command and Control Corpus (CAC-1)

The development process of the CAC-1 started by collecting recordings of different speakers. The recorded audio files are mono .wav files, sampled at 16000 Hz with 16 bits.

6.3.1 Filename conventions

Filenames of the resources contain the following information:

Fig. 5 Phones list used in training

a. Speaker name.
b. Serial number identifying the audio file.

An example file name is Mohamad-001.wav: the first part stands for the speaker name and the second part is the serial number of the audio file. The file extension (.wav) indicates the file format.

6.3.2 Directory structure

The audio file directory is divided into subdirectories, one for each speaker. Example: CAC/Mohamad/filename.wav.

6.3.3 Languages and character sets encodings

The language used in the transcription files is diacritized Arabic; the encoding used is Unicode (UTF-8), with line-feed-only line endings.

6.3.4 Recording and processing

The CAC-1 corpus was recorded entirely for this research. CAC-1 consists of two disjoint sets of utterances: 5628 training utterances collected from 103 male and 74 female speakers, and 372 testing utterances from 15 male and 8 female speakers; details are shown in Table 5. The total length of the training utterances is about 4248 seconds. Different speakers are used for the training and testing data. Both the training and testing utterances in the CAC-1 database were recorded using a single unidirectional mono microphone placed on a nearby desktop. The speech was originally recorded at 16000 Hz and manually transcribed, with the acoustic environment, speaker dialect and other conditions annotated.


The CAC-1 corpus is considered a small-vocabulary set (approximately 30 words in the lexicon); the utterances consist of command and control words, as shown in Table 6.

The baseline system is trained using about 2 hours of speech from CAC-1, including all conditions together. About 10 minutes of evaluation data are used for testing in this research.

7 Arabic Digits Corpus (ADC)

The third corpus, the ADC, was also developed entirely in this research. This corpus is built for Arabic digit recognition, covering the digits zero through nine.

The ADC corpus was developed using recordings of 142 speakers; Table 7 shows the details of those speakers. This corpus was developed in exactly the same manner and environment as the CAC-1 corpus. The ADC consists of two disjoint sets of utterances: 1213 training utterances collected from 73 male and 49 female speakers, and 143 testing utterances from 12 male and 8 female speakers; details are shown in Table 7. The total length of the training utterances is about 0.67 hours.

The baseline system is trained using about 35 minutes of speech, including all conditions together. About 7 minutes of evaluation data are used for testing in this research.

Table 5 Details of speakers who participated in the CAC-1 corpus

Gender   Total
Male     118
Female   82

Table 6 Words used in the CAC-1 corpus

Table 7 Details of speakers who participated in the ADC corpus

Gender   Total
Male     85
Female   57

The pronunciation and filler dictionaries are developed in a similar way as for HQC-1. Since the CAC-1 corpus is a planned one, transcription was done easily: each speaker is asked to say exactly the same words in the same order, so only mapping of the recordings to the control file is needed. Of course, making sure that each recording reflects exactly its transcription is essential; otherwise we mislead the system in the training phase.

7.1 Training and evaluation

Once the model definition file is ready, training is started by initializing the model parameters and running the Baum-Welch algorithm described in the next sections.

7.2 Effect of number of Gaussians on the ADC

Table 8 through Table 14 show the different performance measures for the ADC. From these tables it is

Table 8 Number of Gaussians versus word accuracy

Number of Gaussians   Accuracy (%)
1                     80.159
2                     84.127
4                     88.889
8                     89.683
16                    88.889
32                    77.778
64                    69.048
128                   65.079
256                   65.079

Table 9 Number of Gaussians versus number of errors

Number of Gaussians   Sub   Ins   Del
1                     24    0     1
2                     18    0     2
4                     12    0     2
8                     9     0     4
16                    9     0     5
32                    9     0     19
64                    9     0     30
128                   9     0     35
256                   9     0     35


Table 10 Number of Gaussians versus WER

Number of Gaussians   WER (%)
1                     19.841
2                     15.873
4                     11.111
8                     10.317
16                    11.111
32                    22.222
64                    30.952
128                   34.921
256                   34.921

Table 11 Number of Gaussians versus word matches

Number of Gaussians   Word matches
1                     101
2                     106
4                     112
8                     113
16                    112
32                    98
64                    87
128                   82
256                   82

Table 12 Number of Gaussians versus sentence matches

Number of Gaussians   Sentence matches   Sentence accuracy (%)
1                     101                80.159
2                     106                84.127
4                     112                88.889
8                     113                89.683
16                    112                88.889
32                    98                 77.778
64                    87                 69.048
128                   82                 65.079
256                   82                 65.079

easily noticeable that the best performance is obtained when the distributions are split into 8 Gaussians.

Table 13 Number of Gaussians versus speed as a ratio of real-time audio

Number of Gaussians   Speed (× real time)
1                     0.06
2                     0.06
4                     0.05
8                     0.07
16                    0.08
32                    0.09
64                    0.11
128                   0.13
256                   0.15

Table 14 Number of Gaussians versus memory usage

Number of Gaussians   Average memory used (Mb)
1                     7.4
2                     7.53
4                     8.44
8                     9.4
16                    11.39
32                    16.08
64                    25.17
128                   43.13
256                   79.02

7.3 HQC-1 overall likelihood of training

SPHINX-IV provides a tool to calculate the per-frame training likelihood and the overall training likelihood; the overall likelihood is obtained simply by summing the per-frame likelihoods and dividing by the number of frames. It was found that as the number of Gaussian densities increases, the overall likelihood increases too, as shown in Fig. 6.

Fig. 6 Overall likelihood versus number of Gaussian densities with five states per HMM
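The overall likelihood is thus just the average of the per-frame values; a minimal sketch of that computation is given below (the input format is an assumption, not SPHINX-IV's actual log format).

```python
# Minimal sketch of the overall-likelihood computation described above: the
# per-frame log-likelihoods reported by the trainer are summed and divided
# by the total number of frames.

def overall_likelihood(per_frame_loglik):
    """Average log-likelihood per frame over the whole training set."""
    return sum(per_frame_loglik) / len(per_frame_loglik)

# e.g. per-frame values collected from the trainer's output
print(overall_likelihood([-61.2, -58.7, -60.1, -59.4]))   # -> -59.85
```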


8 Summary and conclusions

In this research we have used SPHINX-IV for Arabic speech recognition, built speech recognition resources for Arabic, built new tools suitable for Arabic recognition that did not originally exist in SPHINX-IV, such as the APDT and the linguistic questions, and investigated fine-tuning SPHINX-IV parameters for this purpose. In this section, we present a summary of this research and the relevant observations drawn from our investigation in training, fine-tuning and testing the three acoustic models (HQC-1, CAC-1 and ADC) based on our proposed dictionary. Some comments on future research directions and unresolved questions are presented too. What is most unique about this research is the APDT algorithm we developed and tested. Three Arabic corpora, namely HQC-1, CAC-1 and ADC, were created to provide an acceptable level of training and testing for our system. The recognition performance obtained using these corpora and the dictionary produced by our APDT for Arabic is very successful. To the best of our knowledge, neither this tool nor the HQC-1 corpus existed prior to this research. SPHINX-IV parameters were tuned using a global search algorithm, the training data was extended with the help of a neural network, and a feature-based system (not Romanized) for Arabic speech recognition based on SPHINX-IV technology was finally obtained. Our system could be the basis for future open-source research on Arabic speech recognition, and we intend to keep it open for the research community. An automatic toolkit for generating the PD was fully developed and tested in this work. This toolkit is a rule-based pronunciation tool. The PD (HUSDICT60) is produced for

both formal Arabic and the Holy Qur'an. HUSDICT60 contains 59,424 words; this dictionary will be made freely available. Three corpora were entirely developed by the authors of this work. For the Holy Qur'an Corpus (HQC-1), about 7,742 recordings were processed and then transcribed, resulting in a total of 59,428 words, 25,740 unique words and about 18.35 hours of recording; this process consumed about 432 working hours. Note that one research effort at Carnegie Mellon University (CMU) used about 1,400 hours of speech for training one system (CMU SPHINX Open Source Speech Recognition Engines 2007). Results are shown in Fig. 7 and Fig. 8.

The CAC-1 corpus consists of two disjoint sets of utterances: 5628 training utterances collected from 103 male and 74 female native Arabic speakers, and 372 testing utterances from 15 male and 8 female speakers. The CAC-1 corpus is considered a small-vocabulary set (approximately 30 words in the lexicon); the final results for this corpus are shown in Fig. 9.

The ADC corpus was developed using recordings of 142 Arabic native speakers. This corpus is concerned with developing an Arabic digit recognition model for the digits zero through nine. The ADC consists of two disjoint sets of utterances: 1,213 training utterances collected from 73 male and 49 female speakers, and 143 testing utterances from 12 male and 8 female speakers. The total length of the training utterances is about 2,431 seconds (Fig. 10).

From the results obtained throughout this work, many suggestions for future work can be made, as discussed below. One major weakness of conventional HMMs is that they do not provide an adequate representation of the temporal structure of speech.

Fig. 7  Test summaries for the HQC-1

Holly Qura’an corpus

[java] Accuracy: 70.813%  Errors: 750 (Sub: 467  Ins: 276  Del: 7)
[java] Words: 1624  Matches: 1150  WER: 46.182%
[java] Sentences: 273  Matches: 57  SentenceAcc: 20.879%
[java] This Time Audio: 7.62s  Proc: 5.68s  Speed: 0.75 X real time
[java] Total Time Audio: 2205.66s  Proc: 1638.93s  Speed: 0.74 X real time
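As a brief aside on how the figures above relate to each other, the sketch below recomputes word accuracy and WER from the substitution, insertion and deletion counts; the function is our own illustration, not part of SPHINX-IV.

def score(words, sub, ins, dele):
    """Word accuracy and WER from alignment counts.

    words: reference word count; sub/ins/dele: substitutions, insertions
    and deletions reported by the decoder's aligner.
    """
    matches = words - sub - dele        # correctly recognized reference words
    accuracy = matches / words
    wer = (sub + ins + dele) / words    # insertions explain why accuracy + WER != 100%
    return accuracy, wer

# Reproduces the HQC-1 figures above: ~70.81% accuracy, ~46.18% WER
print(score(1624, 467, 276, 7))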



Fig. 8 Output of the performance test for HQC-1

Command and Control Corpus

Accuracy: 98.182%  Errors: 1 (Sub: 1  Ins: 0  Del: 0)
Words: 55  Matches: 54  WER: 1.818%
Sentences: 55  Matches: 54  Sentence Acc: 98.182%
Total Time Audio: 53.29s  Proc: 13.24s  Speed: 0.25 X real time
Mem Total: 126.62 Mb  Free: 101.30 Mb
Used: This: 25.33 Mb  Avg: 20.14 Mb  Max: 25.49 Mb

Fig. 9 Test summaries for the CAC-1



Performance test of the digits corpus

Accuracy: 99.213%  Errors: 1 (Sub: 0  Ins: 0  Del: 1)
Words: 127  Matches: 126  WER: 0.787%
Sentences: 127  Matches: 126  SentenceAcc: 99.213%
This Time Audio: 1.39s  Proc: 0.09s  Speed: 0.07 X real time
Total Time Audio: 143.24s  Proc: 9.91s  Speed: 0.07 X real time
Mem Total: 126.62 Mb  Free: 114.17 Mb
Used: This: 12.46 Mb  Avg: 12.76 Mb  Max: 18.44 Mb

Fig. 10 Test summaries for the ADC
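The speed figures in Figs. 7, 9 and 10 are real-time factors, i.e. processing time divided by audio duration; the small helper below (our own illustration, not a SPHINX-IV API) reproduces them from the reported times.

def real_time_factor(proc_seconds, audio_seconds):
    """Real-time factor: processing time over audio duration (< 1 means faster than real time)."""
    return proc_seconds / audio_seconds

# Reproduces the totals reported above
print(round(real_time_factor(1638.93, 2205.66), 2))  # HQC-1: 0.74
print(round(real_time_factor(9.91, 143.24), 2))      # ADC:   0.07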

This is because the probability of remaining in a given state decreases exponentially with time (a brief formal note on this point is given at the end of this section). This issue is a promising area to investigate, and many aspects of HMM modeling and temporal structuring can be studied. Another open issue is the training of the HMM: although Ant Colony Optimization is a stochastic, discrete optimization algorithm, we believe it could be promising if adapted for training speech recognition models, or at least for optimizing the training process.

As a final word, it should be noted that most Arabic texts are almost never fully diacritized and are therefore potentially unsuitable for recognizer training; the exceptions are the Holly Qura'an, a few textbooks and some old religious books. In addition, electronic versions of such texts are not always available. There should be a concerted Arabic effort to create diacritized corpora for both speech recognition and text-to-speech research. During this research, about 200,000 unique diacritized words were collected and are now available on our free website corpus, as mentioned earlier.
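As a brief formal note on the duration weakness mentioned above: in a conventional HMM with self-transition probability $a_{ii}$ for state $i$, the probability of occupying that state for exactly $d$ frames is geometric,

$$P(d \mid i) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, \ldots$$

which decays exponentially with $d$ and therefore cannot represent duration distributions whose mode lies away from a single frame, as is typical for speech sounds.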

References

Al-Zabibi, M. (1990). An acoustic-phonetic approach in automatic Arabic speech recognition. The British Library in Association with UMI.

Alghamdi, M. (2001). Arabic phonetics. Riyadh: Altawbah Printing.

Alghamdi, M., Al-Muhtaseb, H., & Elshafei, M. (2004). Arabic phonological rules. Journal of King Saud University: Computer Sciences and Information, 16, 1–25 (in Arabic).

Andersen, O., & Kuhn, R., et al. (1996). Comparison of two tree-structured approaches for grapheme-to-phoneme conversion. In ICSLP '96 (Vol. 3, pp. 1700–1703), Oct. 1996.

Baugh, A. C., & Cable, T. (1978). A history of the English language. Oxon: Redwood Burn Ltd.

Billa, J., et al. (2002a). Arabic speech and text in Tides On Tap. In Proceedings of HLT, 2002.

Billa, J., et al. (2002b). Audio indexing of broadcast news. In Proceedings of ICASSP, 2002.

Black, A., Lenzo, K., & Pagel, V. (1998). Issues in building general letter to sound rules. In Proceedings of the ESCA workshop on speech synthesis, Australia (pp. 77–80), 1998.

Christensen, H. (1996). Speaker adaptation of hidden Markov models using maximum likelihood linear regression. Ph.D. Thesis, Institute of Electronic Systems, Department of Communication Technology, Aalborg University.

CMU SPHINX Open Source Speech Recognition Engines. URL: http://www.speech.cs.cmu.edu/ (2007).

CMU SPHINX trainer Open Source Speech Recognition Engines. URL: http://www.cmusphinx.org/trainer (2008).

Doh, S.-J. (2000). Enhancements to transformation-based speaker adaptation: principal component and inter-class maximum likelihood linear regression. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

El Choubassi, M. M., El Khoury, H. E., Jabra Alagha, C. E., Skaf, J. A., & Al-Alaoui, M. A. (2003). Arabic speech recognition using recurrent neural networks. Electrical and Computer Engineering Department, Faculty of Engineering and Architecture, American University of Beirut.

Essa, O. (1998). Using prosody in automatic segmentation of speech. In Proceedings of the ACM 36th annual southeast conference (pp. 44–49), Apr. 1998.

Fukada, T., Yoshimura, T., & Sagisaka, Y. (1999). Automatic generation of multiple pronunciations based on neural networks. Speech Communication, 27, 63–73.

Ganapathiraju, A., Hamaker, J., & Picone, J. (2000). Hybrid SVM/HMM architectures for speech recognition. In Proceedings of the international conference on spoken language processing (Vol. 4, pp. 504–507), November 2000.

Gouvêa, E. B. (1996). Acoustic-feature-based frequency warping for speaker normalization. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Hadj-Salah, A. (1983). A description of the characteristics of the Arabic language. In Applied Arabic linguistics, signal & information processing, Rabat, Morocco, 26 September–5 October 1983.


Hain, T., et al. (2003). Automatic transcription of conversational telephone speech: development of the CU-HTK 2002 system (Technical Report CUED/F-INFENG/TR. 465). Cambridge University Engineering Department. Available at http://mi.eng.cam.ac.uk/reports/.

Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87, 1738–1752.

Hiyassat, H., Nedhal, Y., & Asem, E. (2005). Automatic speech recognition system requirement using Z notation. In Proceedings of AMSE '05, Roan, France, 2005.

Huang, X., Alleva, F., Wuen, H., Hwang, M.-Y., & Rosenfeld, R. (2003). The SPHINX-II speech recognition system: an overview. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, 2003.

Huerta, J. M. (2000). Robust speech recognition in GSM codec environments. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Killer, M., Stüker, S., & Schultz, T. (2003). Grapheme based speech recognition. In Eurospeech, Geneva, Switzerland, September 2003.

Killer, M., Stüker, S., & Schultz, T. (2004). A grapheme based speech recognition system for Russian. In SPECOM'2004: 9th conference, speech and computer, St. Petersburg, Russia, September 20–22.

Kirchhoff, K., Bilmes, J., Das, S., Duta, N., Egan, M., Ji, G., He, F., Henderson, J., Liu, D., Noamany, M., Schone, P., Schwartz, R., & Vergyri, D. (2007). Novel approaches to Arabic speech recognition. The 2002 Johns-Hopkins summer workshop, 2002.

Lee, K., Hon, H., & Reddy, R. (1990). An overview of the SPHINX speech recognition system. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(1), 35–45.

Lee, T., Ching, P. C., & Chan, L. W. (1998). Isolated word recognition using modular recurrent neural networks. Pattern Recognition, 31(6), 751–760.

Liu, F.-H. (1994). Environmental adaptation for robust speech recognition. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA.

Mimer, B., Stuker, S., & Schultz, T. (2004). Flexible decision trees for grapheme based speech recognition. In Proceedings of the 15th conference elektronische Sprachsignalverarbeitung (ESSV), Cottbus, Germany, 2004.

Nedel, J. P. (2004). Duration normalization for robust recognition of spontaneous speech via missing feature methods. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Ohshima, Y. (1993). Environmental robustness in speech recognition using physiologically-motivated signal processing. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Pallet, D. S., et al. (1999). 1998 Broadcast news benchmark test results. In Proceedings of the DARPA broadcast news workshop, Herndon, Virginia, February 28–March 3, 1999.

Rabiner, L. R., & Juang, B.-H. (1993). Fundamentals of speech recognition. Englewood Cliffs: Prentice-Hall.

Raj, B. (2000). Reconstruction of incomplete spectrograms for robust speech recognition. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Rosti, A.-V. I. (2004). Linear Gaussian models for speech recognition. Ph.D. Thesis, Wolfson College, University of Cambridge.

Rozzi, W. A. (1991). Speaker adaptation in continuous speech recognition via estimation of correlated mean vectors. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Russell, S., Binder, J., Koller, D., & Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. In IJCAI, 1995.

Schultz, T. (2002). Globalphone: a multilingual speech and text database developed at Karlsruhe University. In Proceedings of the ICSLP, Denver, CO, 2002.

Schultz, T., Alexander, D., Black, A., Peterson, K., Suebvisai, S., & Waibel, A. (2004). A Thai speech translation system for medical dialogs. In Proceedings of the human language technologies (HLT), Boston, MA, May 2004.

Seltzer, M. L. (2000). Automatic detection of corrupt spectrographic features for robust speech recognition. Master's Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Siegler, M. A. (1999). Integration of continuous speech recognition and information retrieval for mutually optimal performance. Ph.D. Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University.

Young, S. J. (1994). The HTK hidden Markov model toolkit: design and philosophy (CUED/F-INFENG/TR.152). Engineering Department, University of Cambridge.

Zavagliakos, G., et al. (1998). The BBN Byblos 1997 large vocabulary conversational speech recognition system. In Proceedings of ICASSP, 1998.
