csl.anthropomatik.kit.edu/downloads/SA_EdyGuevaraKomgang.pdf

Hausa Large Vocabulary Continuous Speech Recognition

Student Research Project at Cognitive Systems Lab
Prof. Dr.-Ing. Tanja Schultz
Department of Informatics

Karlsruhe Institute of Technology

by

cand. inform. Edy Guevara Komgang Djomgang

Supervisors:

Prof. Dr.-Ing. Tanja Schultz
Dipl.-Inform. Tim Schlippe
Dipl.-Inform. Thang Vu

Begin: 15.04.2011
End: 15.07.2011

KIT – Universität des Landes Baden-Württemberg und nationales Forschungszentrum in der Helmholtz-Gemeinschaft www.kit.edu


Acknowledgements

My thanks go to my supervisors Dipl.-Inform. Tim Schlippe and Dipl.-Inform. Thang Vu for their great support and valuable feedback. It was a pleasure to work under your supervision.

I would also like to express my thanks to Prof. Dr.-Ing. Tanja Schultz for giving me the opportunity to carry out this research project under her review and for supporting me with the necessary resources as well as valuable feedback.

Furthermore, I would like to thank the Speech Group at the Cognitive Systems Lab (CSL) for constructive discussions, helpful ideas, a comfortable working atmosphere, and great teamwork.

Moreover, I would like to thank the Hausa community in Cameroon for the great support during the speech data collection, in particular Colonel Amadou Bahagobiri, the journalist Babalala Mohaman, and Samira for their valuable support.

I would like to express my thanks to Alphonse and Anderson for the detailed proofreading of this research project.

Finally, I would like to express my gratitude to my family: first of all, my mother Victorine for her unwavering support and her unconditional love. I would also especially like to thank Yannick Jiejip, Bibiane Chimala, Jonas Chimala, Tamar, Vincent Tsamoh, and Lydie Tsamoh for their formidable support and hospitality during the data collection in Cameroon.

Without your help this research project would not have been possible! Thank you all!


Abstract

Africa is the continent with the second-largest number of languages in the world. According to [46], 32.8% of the world's languages are spoken in Asia, while 30.3% are spoken in Africa. However, many African languages do not have a developed writing system, and for those languages speech is the only means of communication. As a consequence, the disappearance of the native speakers indirectly implies the disappearance of such languages. Speech technologies such as Automatic Speech Recognition (ASR) allow communication across language boundaries and also enable the preservation of the authenticity of languages.

On the African continent, ASR systems can on the one hand support the communication between speakers of different languages or the communication with machines (e.g. computers). On the other hand, ASR can preserve the authenticity of African languages. Thus, ASR systems are essential for African languages.

In the last few years, a considerable effort has been made to analyze and develop ASR systems for a few of the huge number of African languages. For instance, the Meraka Institute and South African universities spend much effort investigating speech technologies for languages spoken in the southern part of the continent [39]. In West Africa, the African Languages Technology Initiative (ALT-i) in Nigeria has been investigating speech technology for Yoruba and Igbo [9]. In this part of the African continent, one of the most widely spoken languages is Hausa. To the best of our knowledge, an ASR system for this language has not been investigated so far.

In this research project, we investigate and develop a Large Vocabulary Continuous Speech Recognition (LVCSR) system for the Hausa language. We describe the Hausa language and the speech database recently collected as part of our GlobalPhone corpus. We achieve significant improvements by automatically rejecting inconsistent or flawed pronunciation dictionary entries, including tone and vowel length information, applying state-of-the-art techniques for acoustic modeling, and crawling large quantities of text material from the Internet for language modeling. A system combination of the best grapheme- and phoneme-based 2-pass systems achieves a word error rate of 13.16% on the development set and 16.26% on the test set on read newspaper speech.


Zusammenfassung

Africa is the continent with the second-largest number of languages in the world. According to [46], 32.8% of the world's languages are spoken in Asia, while 30.3% are spoken in Africa. However, many African languages have no developed writing system, and for such languages speech is the only means of communication. As a consequence, the disappearance of the native speakers indirectly implies the disappearance of such languages. Speech technologies such as automatic speech recognition enable communication across language boundaries and also enable the preservation of the authenticity of languages. On the African continent, speech recognition systems can, on the one hand, support the communication between people with different languages or between humans and machines (e.g. computers). On the other hand, such systems can contribute to preserving the authenticity of African languages. Hence, speech recognition systems are necessary for African languages.

In recent years, much progress has been made in analyzing and developing speech recognition systems for some of the large number of African languages. For example, the Meraka Institute and South African universities have investigated speech technologies for the languages of the southern part of the continent [39]. In West Africa, the African Languages Technology Initiative (ALT-i) in Nigeria has investigated speech technologies for the languages Yoruba and Igbo [9]. In this part of the African continent, Hausa is one of the most widely spoken African languages. To our knowledge, no speech recognition system has been developed for this language so far.

In this student research project, we investigated and developed a speech recognition system for the Hausa language. In this thesis, we describe the Hausa language and the speech data that we collected as part of our GlobalPhone corpus. We achieved a significant improvement by automatically filtering inconsistent or flawed pronunciations out of the dictionary entries. In addition, we used information about tones and vowel lengths to improve the Hausa system. We applied state-of-the-art techniques for the acoustic model. With a large quantity of texts from the Internet, the performance of the language model could be improved. With the system combination of the best grapheme- and phoneme-based 2-pass systems, we achieved a word error rate of 13.16% on the development set and 16.26% on the test set, based on read newspaper speech data.


Contents

1 Introduction
  1.1 Motivation
  1.2 Challenge and Goals
  1.3 Structure of this Paper

2 Related Work

3 Hausa Language
  3.1 Classification and Geographic Distribution
  3.2 Hausa in Cameroon
  3.3 Language Peculiarities
    3.3.1 Writing System
    3.3.2 Phoneme System

4 Large Vocabulary Continuous Speech Recognition
  4.1 Components of Large Vocabulary Continuous Speech Recognition
    4.1.1 Signal Preprocessing
    4.1.2 Acoustic Model
    4.1.3 Pronunciation Dictionary
    4.1.4 Language Model
    4.1.5 Decoding
    4.1.6 Evaluation Method
  4.2 Tools
    4.2.1 Rapid Language Adaptation Toolkit
    4.2.2 Janus Recognition Toolkit
    4.2.3 Sequitur G2P

5 Data Corpora
  5.1 GlobalPhone Data
  5.2 Text Collection and Prompts Extraction
  5.3 Speech Data Collection
  5.4 Pronunciation Dictionary
    5.4.1 Bootstrapping
    5.4.2 Manual Checking

6 ASR Experiments and Results
  6.1 Baseline Systems
  6.2 Systems Improvement
    6.2.1 Language Model Improvement
    6.2.2 Dictionary Improvement
      6.2.2.1 Automatic Rejection of Inconsistent or Flawed Entries
      6.2.2.2 Tones and Vowel Lengths
    6.2.3 Speaker Adaptation and System Combination

7 Conclusion and Future Work

Bibliography


1. Introduction

1.1 Motivation

With today's globalization, communication across language boundaries becomes more important. Automatic Speech Recognition (ASR) is one of the technologies dealing with this issue. But speech recognition systems exist for only a small fraction of the world's languages. According to [46], there are about 7,000 known living languages in the world, of which 3.5% are spoken in Europe, 14.5% in the Americas, 18.9% in the Pacific, 32.8% in Asia, and 30.3% in Africa. Africa itself has more than 2,000 languages [25], plus many different accents.

Many African languages do not have a developed writing system, and for those languages speech is the only means of communication. As a consequence, the disappearance of the native speakers indirectly implies the disappearance of the language. ASR systems could therefore, on the one hand, support the speakers of those languages in communicating with speakers of other languages or with machines (e.g. computers). On the other hand, ASR could preserve the authenticity of these languages. Thus, ASR systems are essential for African languages.

Unfortunately, the construction of speech processing systems requires resources such as text data, a pronunciation dictionary, and transcribed audio data. On the African continent, where infrastructure such as computer networks is less developed than on continents such as Europe or North America, the development of such speech corpora is a significant handicap to the development of ASR systems. The lack of sufficient linguistic resources in the African languages is also a major challenge.

Speech processing technologies have so far been analyzed and developed for only a few of Africa's many languages. For instance, some Arabic dialects in North Africa have been explored in several DARPA projects. In the southern parts of the continent, the Meraka Institute and South African universities spend much effort investigating speech technologies for languages spoken in this part of the continent; speech systems have been investigated and developed for the Bantu languages [39] spoken there. In East Africa, the Djibouti Center for Speech Research and Technobyte Speech Technologies in Kenya explore speech technology for Afar (the second language of Djibouti) and Kiswahili [9]. To the best of our knowledge, in West Africa only one organization, the African Languages Technology Initiative (ALT-i) in Nigeria, has been investigating speech technology for Yoruba and Igbo [9]. In this part of the African continent, one of the most widely spoken languages is Hausa.

The contribution of this work is to investigate and develop a Large Vocabulary Continuous Speech Recognition (LVCSR) system for the Hausa language.

1.2 Challenge and Goals

Nowadays, statistical models are used to build modern ASR systems. Such models are trained on corpora of relevant speech data. This speech generally needs to be curated and transcribed prior to the development of ASR systems. In order to achieve acceptable system performance for most applications, it is necessary to collect speech data from a large number of speakers.

However, many African languages still come with little or no speech and text resources. One of the reasons is the lack of necessary infrastructure, such as computers and stable Internet connections, in many parts of the African continent. But the most important reason is the fact that first-language speakers with the relevant training and experience are limited in availability. Thus, the collection and annotation of speech corpora for African languages is a significant obstacle to the development of ASR systems.

Building these resources for each of the 2,000 African languages [25] from scratch is a strenuous and time-consuming task. At the Cognitive Systems Lab (CSL) of the Karlsruhe Institute of Technology (KIT), a web-based toolkit named Rapid Language Adaptation Toolkit (RLAT) [16] has been developed in order to support the rapid development of ASR systems for new languages. RLAT aims to significantly reduce the amount of time and effort involved in building speech processing systems for new languages and domains. This is envisioned to be achieved by providing tools that enable users to develop speech processing models, collect appropriate speech and text data to build these models, and evaluate the results, allowing for iterative improvements.

The purpose of this work is to build a Large Vocabulary Continuous Speech Recognition (LVCSR) system for the Hausa language. In particular, this work aims to:

- Advance the language-dependent modules in RLAT in order to include the peculiarities of the Hausa language,

- Collect a large text corpus using RLAT,

- Build a pronunciation dictionary for the Hausa LVCSR system,

- Collect a large speech corpus for the Hausa LVCSR system (in Cameroon and in Germany),

- Apply RLAT to build a baseline LVCSR system for the Hausa language,


- Apply state-of-the-art techniques for language modeling and acoustic modeling,

- Apply filtering techniques to improve the quality of the Hausa dictionary.

1.3 Structure of this Paper

Chapter 2 discusses work related to Automatic Speech Recognition for under-resourced languages. Chapter 3 gives an overview of the Hausa language. In Chapter 4, the main components of a Large Vocabulary Continuous Speech Recognition (LVCSR) system and some useful tools to build such a system are presented. In Chapter 5, we present the corpora collected for the Hausa LVCSR system. Chapter 6 discusses the experiments and results of the Hausa LVCSR system. In Chapter 7, we conclude the work and give an overview of future research directions.


2. Related Work

In this section, we discuss related work that deals with the rapid bootstrapping of LVCSR systems in new languages and domains, especially for under-resourced languages.

Work in this field has generally focused on languages with large speech resources. Nevertheless, considerable efforts have been made in the context of under-resourced languages in the last few years. For instance, in [51] the authors report on their efforts toward an LVCSR system for the Vietnamese language. As we do in this work, they apply RLAT to collect text resources and bootstrap the speech recognition system using a multilingual phone inventory. They investigate the peculiarities of the Vietnamese language such as its tonal characteristics, its monosyllabic language structure, and dialectal variations. Hausa and Vietnamese have tonal characteristics in common. Therefore, we also investigate the impact of tone modeling on the performance of the Hausa LVCSR system. With the data-driven tone modeling approach successfully applied to the Vietnamese system, the Hausa system also achieved a better performance. For the Vietnamese system, the baseline recognition performance of 28% Word Error Rate (WER) was improved to 12.6% on the development set and 11.7% on the evaluation set.
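The Word Error Rate figures reported above and throughout this work are conventionally computed from the minimum word-level edit distance (substitutions, insertions, deletions) between the reference transcription and the recognizer output, divided by the number of reference words. A minimal sketch (the example sentences are our own toy strings, not data from the corpus):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic programming table for the Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("ina kwana lafiya", "ina kwana"))  # one deletion over three words
```

A WER above 100% is possible when the hypothesis contains many insertions, which is why the metric is a rate rather than an accuracy.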

The technique for bootstrapping acoustic models using RLAT applied in our work had been successfully used for five Eastern European languages in [50]. This technique uses a multilingual phone inventory MM7 which was trained from seven randomly selected GlobalPhone languages (Chinese, Croatian, German, English, Spanish, Japanese, and Turkish) [45]. RLAT is applied in order to build baseline recognizers for the five Eastern European languages. The performance results were 63% WER for Bulgarian, 60% for Croatian, 49% for Czech, 72% for Polish, and 61% for Russian. The work on the five Eastern European languages aims to determine the best strategy for language model optimization on a given domain in a short period of time and with minimal human effort. To achieve this goal, the snapshot function of RLAT is applied for each language to collect a large amount of text data over a period of 20 days. During this period, text data are collected daily and used to build one language model per day. The final language model is built based on a linear interpolation of the collection of 20 daily language models. That significantly improves the performance of the recognizers for the five languages: the results were 16.9% WER for Bulgarian, 32.8% for Croatian, 23.5% for Czech, 20.4% for Polish, and 36.2% for Russian on the evaluation set.

For the optimization of the Hausa baseline language model, we apply a similar linear interpolation technique. However, because of the lack of text resources, the study could not be conducted over a period of 20 days.
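The linear interpolation of several daily language models can be sketched at the level of word probabilities (a sketch: the toy unigram tables and the uniform weights are our illustration; real systems estimate the interpolation weights, e.g. on held-out data, and interpolate full n-gram models):

```python
# Linearly interpolate the word probabilities of several daily LMs:
# P(w) = sum_i lambda_i * P_i(w), with the lambda_i summing to 1.
daily_lms = [
    {"hausa": 0.5, "harshe": 0.3, "labari": 0.2},  # toy LM from day 1
    {"hausa": 0.4, "harshe": 0.2, "labari": 0.4},  # toy LM from day 2
    {"hausa": 0.6, "harshe": 0.3, "labari": 0.1},  # toy LM from day 3
]
weights = [1 / len(daily_lms)] * len(daily_lms)    # uniform, for illustration

def interpolate(lms, lambdas):
    """Combine the LMs' probabilities with a weighted sum per word."""
    vocab = set().union(*lms)
    return {w: sum(l * lm.get(w, 0.0) for lm, l in zip(lms, lambdas))
            for w in vocab}

mixed = interpolate(daily_lms, weights)
print(round(mixed["hausa"], 3))  # (0.5 + 0.4 + 0.6) / 3 = 0.5
```

Because each component distribution sums to 1 and the weights sum to 1, the interpolated model is again a proper probability distribution.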

Many African languages come with little or no text and speech resources. This constitutes a considerable obstacle to investigating and developing LVCSR systems for these languages. Despite this fact, some investigations have been done for some African languages. For instance, speech technologies have been analyzed and developed for the multiple Bantu languages spoken in the southern parts of the continent [39] [11]. Other examples include speech technologies for Afar (the second language of Djibouti), Kiswahili, Yoruba, and Igbo [9].


3. Hausa Language

Hausa is spoken as a first language by about 25 million people and as a second language by about 18 million [2]. Native speakers of Hausa are called the Hausa people and live mainly in West Africa [18]. The Hausa language is also called Haoussa or Hausawa (mostly used in Cameroon). According to the English Wikipedia [2], Hausa is one of Africa's largest spoken languages after Arabic, French, English, Portuguese, and Swahili. There are international radio stations that broadcast in Hausa, such as the British Broadcasting Corporation (BBC), Radio France Internationale (RFI), China Radio International (CRI), Voice of America (VOA), Islamic Republic of Iran Broadcasting (IRIB), and Deutsche Welle. Nowadays, these radio stations offer Hausa newspapers on their international news websites.

3.1 Classification and Geographic Distribution

The Hausa language belongs to the West Chadic subgroup of the Chadic language group, which places it with the Semitic, Berber, Omotic, and Cushitic languages in the Afro-Asiatic language stock (see Figure 3.1). Hausa is widely spoken in northwestern Nigeria and in southern Niger. The cities of this region (Kano, Sokoto, Zaria, and Katsina, to name only a few) are among the largest commercial centers of sub-Saharan Africa. Hausa people also live in other countries of West and Central Africa such as Cameroon, Togo, Chad, Benin, Burkina Faso, Ghana, and Ivory Coast [32]. In large areas of West Africa, Central Africa, and Western Sudan, particularly among Muslims, Hausa is widely used in commerce and trade [27]. The Hausa dialects are categorized into the following groups: Eastern Hausa, Western Hausa, Northern Hausa, and Southern Hausa. More information about Hausa dialects is given in [2]. According to Ethnologue, the dialects Kananci (also called Kano), Katagum, and Hadejiya are the subdialects of Eastern Hausa. Western Hausa includes Sokoto, Katsina, Gobirawa, Adarawa, Kebbawa, and Zamfarawa, while Arewa and Arawa belong to Northern Hausa. Standard Hausa is based on the Kananci dialect, which is originally spoken in the capital of Kano State in northern Nigeria. The Hausa spoken in Cameroon combines the mixed dialects of northern Nigeria and the Republic of Niger, but is significantly influenced by the French language.


Figure 3.1: Afroasiatic Languages (source: http://en.wikipedia.org).

3.2 Hausa in Cameroon

The Republic of Cameroon is a country in the western part of Central Africa. It is bordered by Nigeria to the west, Chad to the northeast, the Central African Republic to the east, and Equatorial Guinea, Gabon, and the Republic of Congo to the south [1]. About 19.1 million people (July 2009 estimate) live in Cameroon [1]. The official languages are English and French. In addition, many national languages are spoken in Cameroon, where the exact number varies depending on the information source. Ethnologue, for example, estimates more than 280 languages in Cameroon [5]. According to the English Wikipedia [3], the number of national languages of Cameroon is about 230 and can be categorized as follows:

1. 55 Afro-Asiatic languages

2. 2 Nilo-Saharan languages

3. 169 Niger-Congo languages with

(a) 1 West Atlantic language (Fulfulde)

(b) 32 Adamawa-Ubangui languages

(c) 142 Benue-Congo languages (130 of those are Bantu languages)

The geographical distribution of language families in Cameroon is shown in Figure 3.2. As mentioned in Section 3.1, Hausa is one of the 55 Afro-Asiatic languages spoken in Cameroon. Hausa is widely spoken by about 23,500 people in Cameroon (1982 SIL) [4]. The Hausa people in Cameroon are Muslims and live mainly in the northern region of Cameroon. For various reasons, most of them also live in Yaounde, Douala, and Bafoussam. While Yaounde is the political capital, Douala is the commercial capital and the largest city of the country. In Cameroon, several newspapers are written in Hausa. Many radio stations also broadcast in Hausa. To the best of our knowledge, apart from the official languages (French and English), Hausa is one of Cameroon's national languages with the largest number of online newspapers. Therefore, we collected Hausa text data on the Internet and recorded Hausa speech data in Cameroon as well as from a few native speakers living in Germany.

Figure 3.2: Language families in Cameroon (source: http://www.ethnologue.com).

3.3 Language Peculiarities

Hausa is genetically related to languages such as ancient hieroglyphic Egyptian, Assyro-Babylonian, Hebrew, and Arabic [27]. About one-fourth of Hausa words come from Arabic. Hausa has been written in ajami, a variant of the Arabic script, since the early 17th century. There is no standard system of using ajami, and different writers may use letters with different values. Hausa's modern official orthography is a Latin-based alphabet named boko, which was imposed in the 1930s by the British colonial administration. Furthermore, Hausa is a so-called tone language.


That means that some Hausa words have different meanings depending on their pronunciation. Figure 3.3 illustrates the impact of tones on the word wuya. The length of vowel phonemes is also of great importance, as shown in Figure 3.3.

Figure 3.3: Tone and length of vowel phonemes in the Hausa language

In the standard Hausa writing system boko, neither the vowel lengths nor the tones are marked. For this reason, the meaning of written words and sentences depends on the context in which they occur. The radio stations introduced in Chapter 3 offer Hausa services on their international news websites using the boko alphabet. For our purpose, we collected Hausa text data from their websites. In the following sections, we introduce the writing system boko and the phoneme system of the Hausa language.

3.3.1 Writing System

The Hausa boko alphabet consists of 22 characters of the English alphabet: A/a, B/b, C/c, D/d, E/e, F/f, G/g, H/h, I/i, J/j, K/k, L/l, M/m, N/n, O/o, R/r, S/s, T/t, U/u, W/w, Y/y, Z/z. In addition to these characters, the boko alphabet includes the following characters: Ɓ/ɓ, Ɗ/ɗ, Ƙ/ƙ, ʼY/ʼy, and ʼ. In many online newspapers, the characters Ɓ/ɓ, Ɗ/ɗ, and Ƙ/ƙ are mapped to B/b, D/d, and K/k, respectively. As mentioned in the previous section, pronunciation characteristics such as tone and vowel length are not marked in Hausa written text.
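The mapping of the hooked boko letters to their plain Latin counterparts, as used by many online newspapers, can be sketched as a small normalization step (a sketch: the function name and the mapping table are ours; only the Ɓ/Ɗ/Ƙ correspondences come from the text above):

```python
# Normalize Hausa boko text the way many online newspapers do:
# hooked letters are replaced by their plain Latin counterparts.
BOKO_TO_ASCII = {
    "\u0181": "B", "\u0253": "b",  # U+0181/U+0253: hooked B -> B/b
    "\u018A": "D", "\u0257": "d",  # U+018A/U+0257: hooked D -> D/d
    "\u0198": "K", "\u0199": "k",  # U+0198/U+0199: hooked K -> K/k
}

def normalize_boko(text: str) -> str:
    """Map hooked boko characters to their ASCII equivalents,
    leaving all other characters untouched."""
    return "".join(BOKO_TO_ASCII.get(ch, ch) for ch in text)

print(normalize_boko("\u0199asa"))  # hooked k + "asa" -> "kasa"
```

Such a mapping loses the phonemic distinction between the plain and the hooked (glottalized) consonants, which is one reason crawled newspaper text and the pronunciation dictionary have to be kept consistent.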

Figure 3.4: Consonant phonemes of Hausa (source: http://en.wikipedia.org).


Figure 3.5: Vowel phonemes of Hausa (source: http://en.wikipedia.org).

Figure 3.6: Consonant phoneme set (source: http://en.wikipedia.org).

3.3.2 Phoneme System

Depending on the speaker, Hausa has between 23 and 25 consonant phonemes (see Figure 3.4) and basically 5 vowel phonemes /a/, /e/, /i/, /o/, and /u/ (see Figure 3.5). There are three lexical tones in Hausa, i.e. each of the five vowel phonemes may carry a low tone, a high tone, or a falling tone [10]. In addition, a distinction is made between short and long vowels, which can also affect the meaning of a given word. Based on the International Phonetic Alphabet (IPA) [10], we defined 33 Hausa phonemes as acoustic model units. The phone set consists of 26 consonants, 5 vowels, and 2 diphthongs, plus a silence model SIL. The tones and vowel lengths were marked with the following tags:

- T1 for high tone

- T2 for low tone

- T3 for falling tone

- S for short vowels

- L for long vowels

In addition to the consonant phonemes shown in Figure 3.4, we added the phoneme /p/ to the phone set. This consonant is specific to Hausa speakers in Cameroon: the consonant f is basically pronounced /p/ by native speakers from Cameroon, but native speakers from Nigeria pronounce it differently. Some of the phonemes listed in Figure 3.4 have very few occurrences in the dictionary. For this reason, we merged them with the most similar phonemes, as illustrated in Figure 3.6.
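Combining the base vowels with the length and tone tags above yields the tagged vowel units of the acoustic model. A sketch of the enumeration (the concrete unit naming scheme, e.g. a_S_T1, is our illustration, not necessarily the exact symbols used in the system):

```python
# Build tagged vowel units from the 5 base vowels, 2 lengths, and 3 tones.
VOWELS = ["a", "e", "i", "o", "u"]
LENGTHS = ["S", "L"]           # S = short, L = long
TONES = ["T1", "T2", "T3"]     # T1 = high, T2 = low, T3 = falling

def tagged_units():
    """Enumerate every vowel/length/tone combination as one acoustic unit."""
    return [f"{v}_{l}_{t}" for v in VOWELS for l in LENGTHS for t in TONES]

units = tagged_units()
print(len(units))   # 5 vowels x 2 lengths x 3 tones = 30 tagged vowel units
print(units[0])     # a_S_T1
```

This combinatorial growth (30 tagged units from 5 base vowels) is exactly why tone and length modeling increases the number of acoustic model units and thus the demand for training data.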


4. Large Vocabulary Continuous Speech Recognition

An LVCSR system aims to transcribe an unknown spoken utterance into its most likely written word sequence. To achieve this goal, several components act together (see Figure 4.1). An overview of these components is given in the following sections. Furthermore, several tools which support the creation of these components are presented. The information provided is mostly based on the lecture material of Multilinguale Mensch-Maschine-Kommunikation¹.

Figure 4.1: Diagram of an Automatic Speech Recognition System

4.1 Components of Large Vocabulary Continuous Speech Recognition

According to [55], a front-end signal processor (signal digitalization and digital signal preprocessing) converts an unknown spoken utterance into a sequence of acoustic vectors Y = y1, y2, ..., yn. The input utterance consists of a sequence of words W = w1, w2, ..., wn. The sequence of acoustic vectors Y is processed by the decoder to find the most likely word sequence Ŵ.

¹MMMK lecture SS 2010, KIT


Ŵ is the word sequence which maximizes the probability P(W|Y). Bayes' rule is used to decompose the probability P(W|Y) as shown in the following equation:

Ŵ = arg max_W P(W|Y) = arg max_W [P(W) · P(Y|W) / P(Y)]   (4.1)

P(W), computed with the help of the language model, is the a priori probability of the word sequence W and does not depend on the observed acoustic vectors Y. P(Y) is the a priori probability of the observed sequence of acoustic vectors Y. P(Y|W), determined by the acoustic model, is the conditional distribution density which models the sound units of a language based on the acoustic vectors Y. The maximization arg max_W is carried out by the decoder (also called search) over all possible word sequences W. P(Y) does not play a role in the maximization of Equation 4.1 and can be omitted, which leads to Equation 4.2.

Ŵ = arg max_W P(W|Y) = arg max_W P(W) · P(Y|W)   (4.2)
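As a toy illustration of Equation 4.2, the sketch below picks the candidate word sequence with the highest combined score. All probabilities and candidate sentences are invented, and log-probabilities are summed, as real decoders do, to avoid floating-point underflow:

```python
import math

# Hypothetical scores for three candidate word sequences W given the same
# acoustics Y: P(W) from a language model, P(Y|W) from an acoustic model.
candidates = {
    "recognize speech":   {"lm": 1e-4, "am": 1e-9},
    "wreck a nice beach": {"lm": 1e-6, "am": 5e-9},
    "recognise peach":    {"lm": 1e-7, "am": 2e-9},
}

def decode(cands):
    """Return the word sequence maximizing P(W) * P(Y|W) (Equation 4.2)."""
    return max(cands,
               key=lambda w: math.log(cands[w]["lm"]) + math.log(cands[w]["am"]))

best = decode(candidates)
```

In a real system the candidate set is not enumerated explicitly but explored by the search algorithms described in Section 4.1.5.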

In the following sections, we discuss the signal preprocessing, the language model (LM), the acoustic model (AM), the pronunciation dictionary (also called lexicon), the decoding, and the evaluation method of LVCSR.

4.1.1 Signal Preprocessing

In the first step, the analog speech signal is converted into an appropriate form that can be processed by the computer. In this process, called signal digitalization (sampling and quantization), the choice of the sampling rate and the quantization depth is of great importance. Typically in ASR, the sampling rate is 16 kHz and the samples are quantized with 16 bits.

In the second step, suitable feature vectors for the recognition process are extracted from the digital speech signal. In this process, a Hamming window with a window overlap is applied to the digital speech signal to divide it into smaller blocks. In this work, a Hamming window of 16 ms length with a window overlap of 10 ms is used. There are different approaches to extract acoustic feature vectors. For our purpose, the approach called Mel-Frequency Cepstral Coefficients (MFCCs) is used. It computes the cepstral coefficients using the Mel scale. Each feature vector has 43 dimensions, containing 13 Mel-scale frequency cepstral coefficients, their first and second derivatives, the zero crossing rate, power, and delta power. A Linear Discriminant Analysis (LDA) transformation is computed to reduce the feature vector size to 32 dimensions.
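The framing step can be sketched as follows, under the assumption of a 16 kHz signal, 16 ms windows, and a 10 ms frame shift (the exact windowing configuration of this work may differ):

```python
import numpy as np

SAMPLE_RATE = 16000
WIN_LEN = int(0.016 * SAMPLE_RATE)   # 256 samples per 16 ms window
SHIFT = int(0.010 * SAMPLE_RATE)     # 160 samples per 10 ms shift

def frame_signal(signal):
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(signal) - WIN_LEN) // SHIFT
    window = np.hamming(WIN_LEN)
    frames = np.stack([signal[i * SHIFT : i * SHIFT + WIN_LEN] * window
                       for i in range(n_frames)])
    return frames

one_second = np.random.randn(SAMPLE_RATE)
frames = frame_signal(one_second)    # one frame per 10 ms of signal
```

The MFCC computation itself (Mel filterbank, log, discrete cosine transform) then operates on the spectrum of each windowed frame.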

Other methods to extract acoustic features from the signal spectrum are Linear Prediction (LP) coefficients and Perceptual Linear Prediction (PLP) coefficients [26]. Each of these methods produces a sequence of feature vectors Y = y1, y2, ..., yn for a given speech signal. These feature vectors are processed by the acoustic model described in the following section.


4.1.2 Acoustic Model

As introduced in Section 4.1, the task of acoustic modeling is to compute the likelihood of observing the feature vectors Y = y1, y2, ..., yn for a given sequence of words W = w1, w2, ..., wn. For this task, we need a representation of W in terms of feature vectors. This representation has two parts: the pronunciation dictionary, which describes W as a concatenation of phonemes (more in Section 4.1.3), and the phoneme model, which explains phonemes in terms of feature vectors. In LVCSR systems, the Hidden Markov Model (HMM) is currently the most widely used phoneme model. Figure 4.2 illustrates how an HMM works. By definition, an HMM is a five-tuple consisting of:

• V is the alphabet of possible emitted feature vectors. The observable feature space can be discrete (V = {x1, x2, ..., xv}) or continuous (V = R^d).

• S is the set of states S = {s1, s2, ..., sn}, where n is the number of states.

• π is the initial probability distribution; π(si) is the probability of si being the first state of a state sequence.

• A is the matrix of state transition probabilities, A = (aij), where aij is the probability of state sj following si.

• B is the set of emission probability distributions/densities, B = {b1, b2, ..., bn}, where bi(x) is the probability of observing x when the system is in state si.

Figure 4.2: HMM generating an observation of feature vectors X = x1, x2, x3 (Source: [46], Chapter 4)
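As a concrete sketch of this five-tuple, the toy discrete HMM below (all probabilities invented) generates a state and observation sequence:

```python
import numpy as np

V = ["x1", "x2"]                      # discrete emission alphabet
S = ["s1", "s2"]                      # states
pi = np.array([1.0, 0.0])             # always start in s1
A = np.array([[0.7, 0.3],             # a_ij = P(next state = s_j | state s_i)
              [0.0, 1.0]])
B = np.array([[0.9, 0.1],             # b_i(x) = P(emit x | state s_i)
              [0.2, 0.8]])

def sample(n_steps, rng):
    """Generate a state/observation sequence from the HMM."""
    states, observations = [], []
    s = rng.choice(len(S), p=pi)
    for _ in range(n_steps):
        states.append(S[s])
        observations.append(V[rng.choice(len(V), p=B[s])])
        s = rng.choice(len(A), p=A[s])
    return states, observations

states, obs = sample(5, np.random.default_rng(0))
```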

There are three main problems in the design of HMMs [40]: the evaluation, the decoding, and the learning (or optimization) problem. The learning problem aims, for a given HMM λ and an observation sequence y1, y2, ..., yn, to optimize the parameters of λ and obtain λ′, so that the probability of observing the vector sequence y1, y2, ..., yn is maximized, i.e. p(y1, y2, ..., yn | λ′) > p(y1, y2, ..., yn | λ). The Baum-Welch method is the common algorithm used to solve this problem.


In the decoding problem, the task is to compute the most likely state sequence sq1, sq2, ..., sqn given an HMM λ and an observation sequence y1, y2, ..., yn. This can be summarized in the equation arg max over sq1, sq2, ..., sqn of P(sq1, sq2, ..., sqn | y1, y2, ..., yn, λ) and is solved with the Viterbi algorithm. The evaluation problem, which computes the probability of the observation sequence y1, y2, ..., yn given an HMM λ, i.e. P(y1, y2, ..., yn | λ), is solved with the Forward algorithm.
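A minimal Viterbi decoder for the discrete case might look as follows; the model parameters are invented toy values, not the models used in this work:

```python
import numpy as np

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.5],     # b_i(x) for observation symbols 0 and 1
               [0.1, 0.9]])

def viterbi(obs):
    """Return the most likely state sequence for a list of symbol indices."""
    n_states, T = len(pi), len(obs)
    delta = np.zeros((T, n_states))           # best path score per state
    psi = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A    # scores[i, j] = delta[i] * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):             # follow backpointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

best_path = viterbi([0, 1, 1])
```

Production decoders work in log space and prune hypotheses (see Section 4.1.5), but the recursion is the same.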

An HMM can be designed using different topologies. In ASR, a simple 3-state Bakis model for each phone is usually applied. In this model, each HMM state represents one section of a phone (begin -b, middle -m, end -e). To form a word, the HMMs can be connected as shown in Figure 4.3 for the word "can". Similarly, words can be connected together to cover complete utterances. However, the same phoneme sounds different depending on the context in which it occurs. For example, the phoneme /l/ sounds different in the two words "kill" and "like". Therefore, different HMMs need to be trained for each context to obtain a good phonetic discrimination. To achieve this, a so-called polyphone model is built. With the triphone model, for example, every phoneme has a distinct HMM for each unique pair of left and right neighbors. If we assume that the notation y(x, z) represents the phoneme y occurring after an x and before a z, the phrase "We can" would be represented by the phone sequence sil w I k ae n sil, and if triphone HMMs were used, the sequence would be modeled as sil w(sil, I) I(w, k) k(I, ae) ae(k, n) n(ae, sil) sil. To estimate which phonemes sound similar in different contexts, we build a decision tree that asks phonetic questions about the context, as illustrated in Figure 4.4. The questions are related to the acoustic properties of the phoneme and are based on the International Phonetic Alphabet (IPA) classification (more about IPA in Section 4.1.3). The phonetic discrimination is done using a clustering algorithm as follows:

1. Initialize one cluster containing all contexts

2. For all clusters, compute distance of subclusters

3. Perform the split that gets the largest distance (information gain)

4. Continue with Step 2 until satisfied (e.g. when the training data for a phone or the information gain becomes too small)
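One split step of this clustering can be sketched as scoring a candidate phonetic question by its gain in Gaussian log-likelihood; the contexts, the question, and the 1-D feature samples below are all invented for illustration:

```python
import math

def gaussian_log_likelihood(samples):
    """Log-likelihood of samples under a single ML-fitted 1-D Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    var = max(sum((x - mean) ** 2 for x in samples) / n, 1e-6)
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in samples)

def split_gain(cluster, question):
    """Gain of splitting a cluster of (context, samples) by a yes/no question."""
    yes = [x for ctx, xs in cluster if question(ctx) for x in xs]
    no = [x for ctx, xs in cluster if not question(ctx) for x in xs]
    pooled = yes + no
    return (gaussian_log_likelihood(yes) + gaussian_log_likelihood(no)
            - gaussian_log_likelihood(pooled))

# Toy contexts of /l/: left neighbor is a vowel vs. a consonant.
cluster = [("l(a,?)", [1.0, 1.2, 0.9]), ("l(i,?)", [1.1, 0.8]),
           ("l(k,?)", [3.0, 3.2]), ("l(t,?)", [2.9, 3.1])]
is_left_vowel = lambda ctx: ctx[2] in "aeiou"
gain = split_gain(cluster, is_left_vowel)
```

The clustering loop would evaluate all candidate questions in this way and perform the split with the largest gain.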

HMMs are called "discrete HMMs" if the emission probabilities bi(x) are discrete, and "continuous HMMs" if the emission probabilities bi(x) are probability density functions. Discrete HMMs are rarely used in ASR. For continuous HMMs, Gaussian Mixture Models (GMMs) are most often used to represent the emission probability bi(x). A variant of this approach is called "fully continuous HMM". In this method, the states that correspond to the same acoustic phenomenon share the same "acoustic model" (Figure 4.5 gives an example to illustrate this methodology). Consequently, the parameters of the emission probabilities can be estimated more robustly and the training data can be exploited better. However, it requires a larger set of training data, since there are many parameters to be estimated. To overcome this problem, semi-continuous HMMs use parameter tying to share more data


Figure 4.3: HMM for the word ”can”, (source: MMMK lecture SS 2010, KIT)

Figure 4.4: Example of a Phonetic Decision Tree Clustering

between the parameters. With this approach, as illustrated in Figure 4.6, there is only one codebook of Gaussians in the system. Every acoustic model has its own set of mixture weights, but shares the same Gaussian codebook. This method significantly reduces the number of parameters to be estimated, but offers a poorer resolution of the feature space.
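A minimal sketch of such a shared-codebook emission, assuming invented 1-D codebook Gaussians and per-state mixture weights:

```python
import numpy as np

codebook_means = np.array([0.0, 2.0, 5.0])     # one shared codebook of Gaussians
codebook_vars  = np.array([1.0, 1.0, 1.0])

state_weights = {                               # per-state mixture weights only
    "s1": np.array([0.8, 0.1, 0.1]),
    "s2": np.array([0.1, 0.2, 0.7]),
}

def emission(state, x):
    """b_state(x) = sum_k w_k * N(x; mu_k, var_k) over the shared codebook."""
    densities = (np.exp(-0.5 * (x - codebook_means) ** 2 / codebook_vars)
                 / np.sqrt(2 * np.pi * codebook_vars))
    return float(state_weights[state] @ densities)

# A frame near 0 scores higher under s1, a frame near 5 higher under s2.
b1, b2 = emission("s1", 0.1), emission("s2", 0.1)
```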

4.1.3 Pronunciation Dictionary

As introduced in Section 4.1.2, the pronunciation dictionary describes words as concatenations of phonemes. These phonemes are represented using a phonetic alphabet. There are different phonetic alphabets, but the most common ones are:

• ASCII-based (e.g. SAMPA): This kind of alphabet has the advantage that it is simple to write.

• IPA (International Phonetic Alphabet): Universally agreed system of notation for the sounds of languages. More details about IPA can be found in [10].

To design a pronunciation dictionary for a particular language, we initially need to define the words of the target language. Then a finite set of words (vocabulary) is


Figure 4.5: A fully continuous HMM (source: MMMK lecture SS 2010, KIT)

Figure 4.6: A semi-continuous HMM (source: MMMK lecture SS 2010, KIT)

selected and the pronunciation of each of these words is determined. The production of the pronunciations can be statistical or rule-based. The rule-based production can be completely manual. In this case, developers (often experts in linguistics or phonetics) type the phone sequence for each lexical entry. This is only viable for relatively small vocabulary tasks and poses problems of pronunciation consistency. When the production is manually supervised, rules are used to infer pronunciations of new entries from an existing dictionary. This method requires a reasonably sized starting dictionary and is mainly useful to provide pronunciations for inflected forms and compound words. One of the problems in the design of a pronunciation dictionary is pronunciation variation: The same word can be pronounced differently in different situations due to context (coarticulation effects), dialects, or several correct pronunciations. This issue can be addressed in the acoustic modeling, as introduced in Section 4.1.2, by building multiple alternative state-sequence HMMs for a word. Additionally, on the dictionary level, multiple entries are added to the pronunciation dictionary. Dictionary entries of some English words, for example, look as follows:

TRAIN T R EY N


TRUE T R UW

WHEN (1) W EH N

WHEN (2) HH W EH N

Depending on the application and the language, the size of the dictionary can vary from a few words to millions of words.

4.1.4 Language Model

Language models (LM) aim to provide a technique for estimating the probability P(W) of a word sequence W in a given utterance. This can be achieved in two ways: In the deterministic approach, where grammar-based language models are used, the probability P(W) of a word sequence W is regarded either as 1.0, if the word sequence W is accepted, or 0.0, if the word sequence W is rejected. This method is inappropriate for LVCSR, since a grammar has no complete coverage and spoken language is often ungrammatical. To overcome this problem, the stochastic approach describes P(W) from the probabilistic viewpoint, i.e. the occurrence of a word sequence W is described by means of a probability distribution. A grammar-based language model is useful when the range of sentences to be recognized is very small and can be captured by a deterministic grammar, for example in a limited-domain dialogue system. More about grammar-based language models can be found in [28]. For large vocabulary applications, it is difficult to write a grammar with sufficient coverage of the language. For such applications, the stochastic method assigns a probability to a sequence of m words, P(W) = P(w1, w2 ... wm), by means of a probability distribution. The probability P(W) can be decomposed as:

P(W) = P(w1, w2 ... wm) = P(w1) · P(w2|w1) · P(w3|w1, w2) ... P(wm|w1, w2 ... wm−1)   (4.3)

For a large vocabulary V, there is a huge number of possible histories when computing P(w|history). To deal with this issue, one proposed method is to replace the history by a limited, feasible number of equivalence classes C such that P′(w|history) = P(w|C(history)). For n-gram models, the classes are simply based on the previous words, as shown in the following equation:

P(W)_N-gram = ∏_m P′(wm|w1, w2 ... wm−1) = ∏_m P(wm|wm−n+1, wm−n+2 ... wm−1)   (4.4)

For N=3, we have the so-called trigram model. The probability of observing, for example, the sentence "I was in the white house" is approximated in Equation 4.5, where <s> is the start-of-sentence marker.


P(I, was, in, the, white, house) = P(I|<s>, <s>) · P(was|<s>, I) · P(in|I, was) · P(the|was, in) · P(white|in, the) · P(house|the, white)   (4.5)

For a trigram model, the probability of a word depends on its two preceding words. It can be estimated from simple frequency counts as follows:

P(wi|wi−2, wi−1) = C(wi−2, wi−1, wi) / C(wi−2, wi−1)   (4.6)
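Equation 4.6 amounts to dividing trigram counts by history counts, as in this sketch on an invented two-sentence corpus:

```python
from collections import Counter

# Toy corpus; <s> <s> pads the start so every word has a two-word history.
corpus = [["<s>", "<s>", "i", "was", "in", "the", "white", "house"],
          ["<s>", "<s>", "i", "was", "in", "the", "garden"]]

trigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    for i in range(2, len(sentence)):
        trigrams[tuple(sentence[i - 2 : i + 1])] += 1   # C(w_{i-2}, w_{i-1}, w_i)
        bigrams[tuple(sentence[i - 2 : i])] += 1        # C(w_{i-2}, w_{i-1})

def p_trigram(w, h1, h2):
    """P(w | h1, h2) = C(h1, h2, w) / C(h1, h2)."""
    history = bigrams[(h1, h2)]
    return trigrams[(h1, h2, w)] / history if history else 0.0

# "in the" is followed once by "white" and once by "garden".
p_white = p_trigram("white", "in", "the")
```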

For a vocabulary of V words, there are V³ possible trigrams, which is a very big number for a large vocabulary size. In general, a huge training corpus with millions of words is necessary to train an n-gram model. Many possible word combinations appear only once or twice, and many others are not observed in the training corpus at all, so that their probability is 0. This leads to the key problem in n-gram modeling called data sparseness. Various smoothing techniques [28] have been developed to assign non-zero probabilities to all word combinations, as this may help to reduce errors in speech recognition. Smoothing adjusts the probabilities of word sequences to get a more robust estimation for unseen trigrams. One smoothing method, called discounting, reduces the counts of the more frequent trigrams and redistributes the resulting excess probability mass among the less frequently occurring trigrams. In backoff smoothing, if a higher-order n-gram has a non-zero count, its distribution is used; if the count of the higher-order n-gram is zero, we back off to the lower-order n-gram. Equation 4.7 illustrates the backoff method for a trigram model.

Pbo(wi|wi−2, wi−1) = P(wi|wi−2, wi−1)   if C(wi−2, wi−1, wi) > 0
Pbo(wi|wi−2, wi−1) = λ(wi−2, wi−1) · Pbo(wi|wi−1)   otherwise   (4.7)
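A deliberately simplified backoff sketch in the spirit of Equation 4.7: this is the "stupid backoff" variant with a constant λ rather than a properly normalized Katz backoff weight, and both the corpus and the λ value are illustrative:

```python
from collections import Counter

LAMBDA = 0.4
tokens = "<s> i was in the white house".split()
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def score(w, h1, h2):
    """Trigram relative frequency if seen, else back off to lower orders."""
    if trigrams[(h1, h2, w)] > 0:
        return trigrams[(h1, h2, w)] / bigrams[(h1, h2)]
    if bigrams[(h2, w)] > 0:
        return LAMBDA * bigrams[(h2, w)] / unigrams[h2]
    return LAMBDA * LAMBDA * unigrams[w] / sum(unigrams.values())

seen = score("white", "in", "the")         # observed trigram
backed_off = score("white", "was", "the")  # unseen trigram, uses the bigram
```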

The performance of a language model is usually evaluated on a text corpus called the test set. For this evaluation, the following metrics are of great importance: the perplexity (PPL) and the out-of-vocabulary (OOV) rate. The perplexity is the average branching factor of the language according to the model. It is computed as in Equation 4.8, where H(W) is the cross-entropy of a word sequence W.

PP = 2^H(W)   (4.8)

The OOV rate is the frequency of word tokens in the test corpus that are not covered by the language model. While a low perplexity is an indicator that the language model is good [41], a high OOV rate may lead to poor speech recognition performance. When different language models or text corpora are available, combining them may improve the performance of speech recognition. The most common and simplest way of language model combination is linear interpolation. Assuming we have two language models or text corpora L1 and L2, the probability Pr(·,·) of a word wa given a history ha for the interpolated language model is estimated as follows:

Pr(wa|ha) = (1− λ)PrL1(wa|ha) + λPrL2(wa|ha), 0 ≤ λ ≤ 1 (4.9)

Note that λ, the interpolation coefficient, can be calculated automatically by minimizing the perplexity on a development set text.
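Such a tuning loop can be sketched as a grid search over λ that minimizes development-set perplexity (Equations 4.8 and 4.9); the two "language models" below are toy unigram tables, invented for illustration:

```python
import math

lm1 = {"a": 0.5, "b": 0.3, "c": 0.2}
lm2 = {"a": 0.1, "b": 0.1, "c": 0.8}
dev_set = ["a", "c", "c", "b", "c"]

def perplexity(lam):
    """PP = 2^H(W) with H(W) = -(1/N) * sum of log2 P_interp(w)."""
    log_sum = sum(math.log2((1 - lam) * lm1[w] + lam * lm2[w]) for w in dev_set)
    return 2 ** (-log_sum / len(dev_set))

# Simple grid search over lambda; EM-based estimation is also common.
best_lambda = min((i / 100 for i in range(101)), key=perplexity)
```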


4.1.5 Decoding

The main question in decoding is how to efficiently evaluate all possible word sequences W using P(W) provided by the language model and P(Y|W) provided by the acoustic model. As introduced in Section 4.1, the decoding component finds the word sequence Ŵ which maximizes Equation 4.1. W is represented as an example pattern or as a sequence of states in an HMM. The search space is the entire set of possible HMM state sequences. For example, with a vocabulary of 64,000 words and an average of 25 words per utterance, the search space may have about 1,000 time frames (10 sec of speech) and 500,000 possible pattern sequences. Therefore, we need an intelligent algorithm that scans the search space and finds the best hypothesis to compute the most likely word sequence by evaluating the scores of all possible sequences. For this purpose, two main methods have been designed [55]: depth-first and breadth-first search. While a depth-first algorithm pursues the most promising hypothesis until the end of the speech is reached, the breadth-first approach pursues all hypotheses in parallel. Examples of depth-first algorithms are the A* search and the stack decoder. The A* search first expands the node which promises best to lead to the best path to the goal, using a heuristic function. According to [55], breadth-first decoding is often referred to as Viterbi decoding. There are several implementations of Viterbi decoding. One of them, the Viterbi beam search, which uses efficient pruning techniques, is preferred most in ASR. The pruning technique is an optimization method which throws away unpromising paths. This method has the advantage of saving a lot of time, but may lead to poor performance in case of wrong beam parameters. For example, if the beam width is too small, the correct hypothesis can be pruned.

4.1.6 Evaluation Method

To evaluate how good a speech recognition system is, a metric named word error rate (WER) is used. The WER is the percentage of words which are recognized incorrectly and is calculated in the following way: WER = (INS + DEL + SUB) / N, where INS is the number of word insertions, DEL the number of word deletions, SUB the number of word substitutions, and N the total number of words in the reference. For the given reference "a speech recognition system can help people" and the speech recognizer output "speech cognition can help pupil", the WER is calculated as WER = (0 + 2 + 2) / 7 = 57.14%.
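The WER computation, including the worked example above, can be reproduced with a word-level Levenshtein alignment:

```python
def wer(reference, hypothesis):
    """Word error rate via word-level minimum edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # match / substitution
    return d[-1][-1] / len(ref)

rate = wer("a speech recognition system can help people",
           "speech cognition can help pupil")   # 4 errors over 7 words
```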

4.2 Tools

4.2.1 Rapid Language Adaptation Toolkit

The Rapid Language Adaptation Toolkit (RLAT) is a web-based platform which allows users to develop a speech recognizer for a given language. It provides an interactive language creation and evaluation toolkit that allows users to develop speech processing models, to collect appropriate data for model building, and to evaluate the results, enabling iterative improvements. Figure 4.7 illustrates the main components of RLAT.

Text and prompt selection: This component allows the user to prepare the text corpus necessary for building an ASR system. For this purpose, the user can either upload a text corpus or collect it from the Internet. In the latter case, RLAT provides a crawl functionality which takes as input the URL of the target webpage and the link depth. Thus, the text data from the given webpage is captured, then all links of that page are followed and the content of the successor webpages is collected. The process is repeated until the specified link depth is reached. After this process, the captured text data is cleaned using language-independent rules, which consist for example of removing all HTML tags, JavaScript code, and non-text parts from the collected data. The cleaned text data is then normalized using language-specific rules. The normalization process depends on the characteristics of the selected language, so this component has to be implemented in RLAT for each new language; we implemented it for Hausa. Finally, the normalized text is used to generate prompts or to build the language model component. Prompts are sentences that are easy to read, which are used for the speech data collection.

Audio collection: This component enables the user to record speech data for the system creation. For this purpose, prompts are normally used. As the Internet connection is not stable in many parts of Cameroon, we applied an offline audio recorder to record the speech data for the Hausa language.

Phoneme selection: The phoneme selection component provides the user with a relatively easy-to-use interface for selecting phonemes in IPA notation. The user can listen to the target phoneme and select it.

Grapheme-to-phoneme rules: The user can define mapping rules between the phoneme set and the grapheme set of the target language using this component. This is useful for the creation of the pronunciation dictionary.

Lexicon pronunciation creation: The pronunciation dictionary, also called lexicon, can be created interactively in RLAT with the user in the loop. According to the defined grapheme-to-phoneme rules, the system infers a pronunciation for a given word and proposes it to the user. The user accepts or corrects it, and the system uses statistical methods to learn the correct pronunciations and generates a dictionary for the target vocabulary. This process is useful when the user does not have a ready-to-use dictionary. If a dictionary is available, the user can simply upload it and continue with the next step.

Build language model: This component allows the creation of the language model using text data collected with the Text and prompt selection component. Like the other components, it also provides an upload function in case a ready-to-use language model is available. With the "snapshot" function of RLAT, the user can specify a time period in which new language models are automatically created using the text data collected in this period. The "snapshot" function provides informative feedback about the quality of the text data collected from the Internet in the specified time interval.

Build acoustic model: After the creation of the language model and the pronunciation dictionary, the selection of the phoneme set, and with a sufficient amount of speech data, the acoustic model can be built. We divided the speech data into three sets: the training set, the development set, and the evaluation set. To bootstrap the system, the multilingual phone inventory MM7, which was trained on seven randomly selected GlobalPhone languages (Chinese, Croatian, German, English, Spanish, Japanese, and Turkish) [45], is used in RLAT. Machine learning algorithms are applied to train the initialized system on the training set. At the end of the training process, the evaluation set is used to test the performance of the trained system. For the training and testing process, RLAT uses components of the Janus Recognition Toolkit, presented in Section 4.2.2.

Test ASR system and Create speech synthesis voice: These two components were not used in this study.

Figure 4.7: Steps to build an ASR system and a synthesis voice with RLAT

Building a complete ASR system using the RLAT components introduced above significantly reduces the necessary time and effort for this task [16, 50].

4.2.2 Janus Recognition Toolkit

The Janus Recognition Toolkit (JRTK) [23] is a speech recognition toolkit that was developed by the Interactive Systems Laboratories at Carnegie Mellon University, USA, and the University of Karlsruhe, Germany. Although it was initially developed for speech recognition, it has been successfully applied to many other recognition tasks such as handwriting and silent speech recognition. According to [23], the toolkit implements an object-oriented approach in which Tcl-script-based environments allow building different recognizers.

Figure 4.8: Alignment for the word "speaking" with graphones (source: MMMK lecture SS 2010, KIT)


JRTK offers a programmable shell with efficient objects which allows the control of speech recognition components such as codebooks, dictionaries, and speech databases. The objects provided by JRTK cover a wide variety of recognition approaches. The Janus toolkit is integrated in RLAT and is used for training and testing purposes.

4.2.3 Sequitur G2P

Sequitur G2P is a trainable, data-driven grapheme-to-phoneme converter developed at RWTH Aachen [15]. It uses a statistical approach to infer pronunciations of new entries from an existing training dictionary. The graphone (grapheme-phoneme joint multigram) method is applied to the alignment problem. For example, the pronunciation of "speaking" may be regarded as a sequence of five graphones, as shown in Figure 4.8. Sequitur has been used in this work to infer the pronunciations of the crawled Hausa vocabulary based on an initial dictionary with few entries.


5. Data Corpora

To develop and evaluate a Hausa large vocabulary continuous speech recognizer, we collected Hausa speech data in GlobalPhone style (see Section 5.1). First, we used RLAT to crawl text data from online Hausa newspaper articles and generated prompted sentences from this text data. Second, we asked Hausa native speakers in Cameroon and in Germany to read the prompted sentences, which we recorded. Furthermore, an initial pronunciation dictionary was designed using 200 rules to infer the pronunciations of about 6,500 words. This dictionary was manually checked by native speakers in Cameroon. The following sections present the GlobalPhone database and give more information about the collected Hausa corpora.
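Rule-based pronunciation generation of this kind can be sketched as greedy longest-match grapheme-to-phoneme mapping; the handful of rules and phoneme labels below are invented for illustration and are far simpler than the 200 rules actually used:

```python
# Hypothetical grapheme-to-phoneme rules; longer graphemes are matched first.
RULES = {
    "sh": "SH",
    "ts": "TS",
    "a": "A",
    "i": "I",
    "u": "U",
    "k": "K",
    "n": "N",
    "f": "F",
    "s": "S",
}

def g2p(word):
    """Greedy longest-match left-to-right grapheme-to-phoneme conversion."""
    phones, i = [], 0
    while i < len(word):
        for length in (2, 1):                 # prefer 2-letter rules
            chunk = word[i : i + length]
            if chunk in RULES:
                phones.append(RULES[chunk])
                i += length
                break
        else:
            i += 1                            # skip unknown graphemes
    return phones

pron = g2p("shinkafa")   # Hausa word for "rice"
```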

5.1 GlobalPhone Data

GlobalPhone is a multilingual read speech and text database developed at Karlsruhe University. Its goal is to provide a database for the task of rapid deployment of LVCSR systems for native speakers of widespread languages [43]. The transcription is the most time- and cost-consuming process of a database collection for LVCSR systems. To overcome this issue in the GlobalPhone database collection, widely read newspapers available on the Internet are used as text resources. The chosen newspaper articles cover national and international political news as well as economic news. The collected transcriptions are read by native speakers, mostly in quiet environments. The speech data is sampled at 16 kHz in mono quality with a resolution of 16 bits, recorded with a close-speaking microphone, and stored in PCM format. GlobalPhone currently covers more than 19 languages, including Arabic, Chinese (Mandarin and Shanghai), Croatian, Czech, French, German, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Tamil, Turkish, Polish, Bulgarian, Vietnamese, and Hausa.

5.2 Text Collection and Prompts Extraction

To build a text corpus of Hausa words, we used the Rapid Language Adaptation Toolkit to collect text from 6 websites as listed in Table 5.1, covering the main Hausa newspaper sources. RLAT enables the user to crawl text from a given webpage with different link depths. The websites were crawled with a link depth of 5 or 10, i.e. we captured the content of the given webpage, then followed all links of that page to crawl the content of the successor pages (link level 2), and so forth, until we reached the specified link depth. After collecting the Hausa text content of all pages, the crawled text was cleaned and normalized with language-independent rules (LI-rules) and language-specific rules (LS-rules) in our Rapid Language Adaptation Toolkit. The LI-rules and LS-rules are itemized in Table 5.2. The websites were used to extract text for the language model and to select prompts for recording speech data.
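The link-depth crawl can be sketched as a breadth-first traversal; to keep the sketch runnable without network access, the "web" below is an invented in-memory link graph with hypothetical page names:

```python
from collections import deque

LINK_GRAPH = {
    "start": ["news1", "news2"],
    "news1": ["news3"],
    "news2": ["news3", "archive"],
    "news3": [],
    "archive": ["old1"],
    "old1": [],
}

def crawl(start, max_depth):
    """Return all pages reachable from start within max_depth link levels."""
    visited = {start}
    queue = deque([(start, 1)])          # (page, link level); start = level 1
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue                     # do not follow links beyond the depth
        for link in LINK_GRAPH[page]:
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return visited

pages = crawl("start", 2)   # level 1 = start page, level 2 = its direct links
```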

5.3 Speech Data Collection

To collect Hausa speech data, we asked native speakers of Hausa in Cameroon and in Germany to read prompted sentences. As our web-based recording tool in RLAT turned out to be difficult to use, since many sites in Cameroon did not provide an Internet connection, we used the offline version. In total, our corpus contains 7,895 utterances spoken by 33 male and 69 female speakers in the age range of 16 to 60 years. Figure 5.1 shows the age distribution in the speech data. All speech data was recorded with a headset microphone in clean environmental conditions. We collected speech data from native speakers in Cameroon living in the following cities: Maroua, Douala, Yaounde, Bafoussam, and Ngaoundere, and from one Hausa native speaker from Nigeria. Therefore, our speech data contains different accents: Maroua, Douala, Yaounde, Bafoussam, Ngaoundere, and Nigeria. Figure 5.2 shows the accent distribution of the speech data. The data is sampled at 16 kHz with a resolution of 16 bits and stored in PCM encoding.

Websites
http://hausa.cri.cn
http://ha1.chinabroadcast.cn
http://www.bbc.co.uk/hausa
http://www.dw-world.de/hausa
http://www.hausa.rfi.fr
http://www.voanews.com/hausa/news

Table 5.1: List of Hausa Websites.

Text normalization with LI-rule:
1. Removal of all HTML tags, JavaScript code, and non-text parts.
2. Removal of empty lines.
3. Removal of sentences containing more than 30% numbers.
4. Removal of sentences longer than 30 tokens.
5. Case normalization based on statistics.

Text normalization with Hausa LS-rule:
1. Identification and removal of pages from other languages based on the coverage of frequent Hausa words.
2. Deletion of lines which appear repeatedly.
3. Filtering of lines from other languages.
4. Removal of special characters and empty lines.
5. Replacement of abbreviations with their long forms.

Table 5.2: Normalization of Hausa text.
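A minimal sketch of the language-independent rules is given below. The thresholds (30% numbers, 30 tokens) follow Table 5.2; the regex heuristics and example lines are illustrative, not the actual RLAT implementation:

```python
import re

def li_normalize(lines, max_tokens=30, max_number_ratio=0.3):
    """Apply the language-independent cleaning rules from Table 5.2
    (HTML stripping, empty-line removal, number/length filters)."""
    kept = []
    for line in lines:
        line = re.sub(r"<[^>]+>", " ", line)      # rule 1: drop HTML tags
        line = re.sub(r"\s+", " ", line).strip()  # collapse whitespace
        if not line:
            continue                               # rule 2: empty lines
        tokens = line.split()
        if len(tokens) > max_tokens:
            continue                               # rule 4: overlong sentences
        numbers = sum(t.isdigit() for t in tokens)
        if numbers / len(tokens) > max_number_ratio:
            continue                               # rule 3: too many numbers
        kept.append(line)
    return kept

print(li_normalize(["<b>Sannu</b> da zuwa", "", "1 2 3 4 gida"]))
```

Case normalization and the language-specific rules would follow as further passes over the surviving lines.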


Figure 5.1: Hausa Speech Data, Age Distribution

Figure 5.2: Hausa Speech Data, Accent Distribution

The recorded speech data was divided into three sets: a training set, a development set, and a test set. The training set is used for the training process of the ASR system. While the development set is applied for tuning the system during development, the test or evaluation set is used at the end to evaluate the final performance of the system. The Hausa portion of the GlobalPhone database is listed in Table 5.3.

Set          Male  Female  #Utterances  #Tokens  Duration
Training       24     58      5,863       40k    6 h 36 min
Development     4      6      1,021        6k    1 h 02 min
Evaluation      5      5      1,011        6k    1 h 06 min
Total          33     69      7,895       52k    8 h 44 min

Table 5.3: Hausa GlobalPhone Speech Corpus.


5.4 Pronunciation Dictionary

5.4.1 Bootstrapping

We used Sequitur G2P, described in Section 4.2.3, for the automatic generation of the initial pronunciation dictionary. 200 word-pronunciation pairs were used as training material for the G2P models, based on word-pronunciation lists from Peter Ladefoged (http://archive.phonetics.ucla.edu/Language/HAU/hau.html). We generated pronunciations for 6,500 words (case-sensitive, derived from all crawled texts). Note that after the split into training set, development set, and test set, pronunciations occurring exclusively in the test and development sets were deleted.

5.4.2 Manual Checking

The automatically generated dictionary was manually cross-checked by 6 groups of linguistics students (native speakers). Each group received 1,000 words to cross-check. They reported that they mostly had to correct wrong tones and missing phonemes. The remaining 500 words, which had not been checked, were given to a Hausa teacher, who additionally cross-checked 4,000 words that had already been checked by the Hausa students. He reported many missing pronunciation variants. Finally, a Hausa linguist, who is a BBC journalist, cross-checked the whole dictionary.


6. ASR Experiments and Results

6.1 Baseline Systems

To build the Hausa baseline recognizer, we defined 33 Hausa phonemes as acoustic model units. As described in Section 3.3.2, the Hausa phone set consists of 26 consonants, 5 vowels, and 2 diphthongs. Our speech corpus described in Section 5.3 is divided into a training set, a development set, and an evaluation set. We used the 6.6 hours of the training set to train the acoustic models (AMs). To rapidly build a baseline recognizer for Hausa, we applied RLAT (more in Section 4.2.1) using a multilingual phone inventory for bootstrapping the system. This phone inventory, MM7, was trained from seven randomly selected GlobalPhone languages (Chinese, Croatian, German, English, Spanish, Japanese, and Turkish) [45]. To bootstrap the system, the Hausa phoneme models were initialized from the closest matches of the MM7 inventory, derived by an IPA-based phone mapping. We adopted the GlobalPhone-style preprocessing and used the selected MM7 models as seed models to produce initial state alignments for the Hausa speech data. The preprocessing consists of feature extraction applying a Hamming window of 16 ms length with a frame shift of 10 ms. Each feature vector has 143 dimensions, obtained by stacking 11 adjacent frames of 13 Mel-frequency cepstral coefficients (MFCC) each. A Linear Discriminant Analysis (LDA) transformation is computed to reduce the feature vector size to 42 dimensions. The AM uses fully-continuous 3-state left-to-right HMMs. The emission probabilities are modeled by Gaussian mixtures with diagonal covariances. For our context-dependent AMs with different context sizes, we stopped the decision tree splitting process at 500 triphones. After context clustering, a merge-and-split training was applied, which selects the number of Gaussians according to the amount of data. For all models, we use one global semi-tied covariance (STC) matrix after LDA.
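The frame-stacking step can be sketched as follows; plain Python lists stand in for real MFCC vectors, and the edge-padding strategy shown here is one common choice, not necessarily the one used in the toolkit:

```python
def stack_frames(frames, context=5):
    """Concatenate each frame with its 5 left and 5 right neighbours
    (11 adjacent frames of 13 MFCCs -> 143-dimensional vectors).
    Edge frames are padded by repeating the first/last frame."""
    padded = [frames[0]] * context + frames + [frames[-1]] * context
    stacked = []
    for i in range(len(frames)):
        window = padded[i:i + 2 * context + 1]
        stacked.append([c for frame in window for c in frame])
    return stacked

# 20 dummy frames of 13 MFCC coefficients each.
mfcc = [[float(t)] * 13 for t in range(20)]
out = stack_frames(mfcc)
print(len(out), len(out[0]))  # 20 stacked vectors, 143 dimensions each
```

An LDA matrix of shape 42x143, estimated on the training data, would then project each stacked vector down to the 42 dimensions mentioned above.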
To model the tones, we apply the "data-driven tone modeling" which had been successfully applied to the tonal language Vietnamese, as described in [51]. In this method, all tonal variants of a phoneme share one base model. However, the information about the tone is added to the dictionary in the form of a tone tag. Our speech recognition toolkit allows these tags to be used as questions in the context decision tree when building context-dependent AMs. This way, the data decides during model clustering whether two tones have a similar impact on the basic phoneme. If so, the two tonal variants of that basic phoneme share one common model. In case the tone is distinctive (of that phoneme and/or its context), the question about the tone may result in a decision tree split, such that different tonal variants of the same basic phoneme end up being represented by different models. For the vowel lengths, we apply the same technique. With the training transcriptions, we built a statistical 3-gram LM (TrainTRL) which contains their whole vocabulary (4k) plus 2k frequency-based selected words (see Table 6.1). It has a perplexity (PPL) of 282 and an out-of-vocabulary (OOV) rate of 4.7% on the dev set. As mentioned in Section 5.4.2, the pronunciations for the 6,500 words were created in a rule-based fashion and were manually revised and cross-checked by Hausa native speakers. The performance of the baseline system is 23.49% WER on the development set.
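The tone-tag idea can be illustrated with a toy pronunciation: tonal variants map to one base phone plus a tag that the decision tree may question during clustering. The `_H`/`_L` notation below is invented for illustration and is not the actual tag set:

```python
def tag_tones(pronunciation):
    """Split tone-marked phones like 'a_H' (high) or 'a_L' (low)
    into a base phone plus a tone tag; untagged phones get 'T0'.
    All variants share the base model; the tag is only a question
    available to the context decision tree."""
    tagged = []
    for phone in pronunciation.split():
        if "_" in phone:
            base, tone = phone.split("_", 1)
        else:
            base, tone = phone, "T0"
        tagged.append((base, tone))
    return tagged

print(tag_tones("g a_H y a_L"))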

6.2 Systems Improvement

6.2.1 Language Model Improvement

We observed that the Hausa text provided on most websites is very limited. To improve the n-gram estimation in the LM and reduce the OOV rate, we crawled additional text corpora from one online newspaper (http://hausa.cri.cn) with longer crawling periods. After our text normalization steps, text with approximately 8 million tokens remained. By interpolating the individual models built from the training transcriptions and the online newspaper, we created a new LM. The interpolation weights were tuned on the development set transcriptions by minimizing the PPL of the model. We increased the vocabulary of the LM by selecting frequent words from the additional text material which are not in the transcriptions. A 3-gram LM with a total of 42k words (TrainTRL+Web) resulted in the lowest word error rates. Table 6.1 demonstrates that we were able to considerably reduce the PPL, OOV rate, and WER using the additional web data.

Language Model          dev/test  PPL    OOV (%)  WER (%)
TrainTRL (6k)           dev       281.7  4.68     22.88
                        test      283.3  4.88     26.98
TrainTRL+Web-1 (42k)    dev       154.7  0.51     14.40
                        test      157.0  0.46     17.83

Table 6.1: LM Improvement (Additional Web Data).
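The interpolation-weight tuning described above can be sketched with a toy unigram example. A real system would interpolate full n-gram models with an LM toolkit; the probabilities and vocabulary here are invented:

```python
import math

def perplexity(lm_a, lm_b, weight, dev_words):
    """PPL of the interpolated model w*P_a + (1-w)*P_b on dev_words."""
    log_sum = 0.0
    for word in dev_words:
        p = weight * lm_a.get(word, 1e-9) + (1 - weight) * lm_b.get(word, 1e-9)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(dev_words))

# Toy unigram models: in-domain transcriptions vs. large web corpus.
train_trl = {"gida": 0.5, "sannu": 0.5}
web = {"gida": 0.1, "sannu": 0.1, "labari": 0.8}
dev = ["gida", "sannu", "labari"]

# Grid-search the interpolation weight that minimizes dev-set PPL.
best = min((w / 100 for w in range(101)),
           key=lambda w: perplexity(train_trl, web, w, dev))
print(round(best, 2))
```

The web model alone covers the OOV word "labari", the transcription model alone fits the in-domain words; the tuned weight balances the two, exactly as the dev-set PPL minimization does for the real 3-gram models.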

6.2.2 Dictionary Improvement

Despite thorough coordination and supervision, we detected remaining inconsistencies in the pronunciations and analyzed methods to automatically reject and substitute inconsistent or flawed entries. Then we analyzed the impact of tone and vowel length modeling for Hausa ASR.

6.2.2.1 Automatic Rejection of Inconsistent or Flawed Entries

We investigated different methods to filter erroneous word-pronunciation pairs and substitute the filtered pronunciations with more reliable ones. The methods to remove incorrect entries fall into four categories:


1. Length Filtering (Len)

(a) Remove a pronunciation if the number of grapheme and phoneme tokens differs by more than a certain threshold.

2. Alignment Filtering (Eps)

(a) Perform a g2p alignment [49][36][17]. The alignment process involves the insertion of graphemic and phonemic nulls (epsilons) into the lexical entries of words.

(b) Remove a pronunciation if the number of graphemic and phonemic nulls is over a certain threshold.

3. g2p Filtering after Length/Alignment Filtering (G2PLen/G2PEps)

(a) Train g2p models with “reliable” word-pronunciation pairs.

(b) Apply the g2p models to convert a grapheme string into its most likely phoneme string.

(c) Remove a pronunciation if the edit distance between the synthesized phoneme string and the pronunciation in question is over a certain threshold.

Dictionary                                        WER (%) on dev
Baseline (with tones and length)                  23.49
Length Filtering (Len)                            23.20
Alignment Filtering (Eps)                         23.30
g2p Filtering after Length Filtering (G2PLen)     22.88
g2p Filtering after Alignment Filtering (G2PEps)  23.15
Grapheme-based                                    22.52

Table 6.2: Automatic rejection of inconsistent or flawed entries.

The threshold for each filtering method depends on the mean and the standard deviation of the measure in focus (computed on all word-pronunciation pairs), i.e. the ratio between the numbers of grapheme and phoneme tokens in Len, the ratio between the numbers of graphemic and phonemic nulls in Eps, and the edit distance between the synthesized phoneme string and the pronunciation in question in G2PLen and G2PEps. Those word-pronunciation pairs whose resulting value is lower than mean − standard deviation or higher than mean + standard deviation are rejected. We then used the remaining word-pronunciation pairs to build new g2p models and applied them to the words with rejected pronunciations. Table 6.2 shows that we were able to reduce the word error rate (WER) with all filtered pronunciation dictionaries. G2PLen performed best and was selected for the tone and vowel length experiments. Additionally, we built a grapheme-based system that was even slightly better than the system with the G2PLen-filtered dictionary.
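The Len filter with the mean ± standard deviation threshold can be sketched as follows; the grapheme/phoneme counting is simplified and the example entries are invented:

```python
import statistics

def length_filter(lexicon):
    """Reject entries whose grapheme/phoneme length ratio falls
    outside mean +/- one standard deviation over the whole lexicon."""
    ratios = {w: len(w) / len(p.split()) for w, p in lexicon.items()}
    mean = statistics.mean(ratios.values())
    std = statistics.stdev(ratios.values())
    return {w: p for w, p in lexicon.items()
            if mean - std <= ratios[w] <= mean + std}

lexicon = {
    "gida":   "g i d a",      # ratio 1.0
    "sannu":  "s a n n u",    # ratio 1.0
    "lafiya": "l a f i y a",  # ratio 1.0
    "kai":    "k",            # ratio 3.0: flawed entry, likely rejected
}
print(sorted(length_filter(lexicon)))
```

The Eps and G2P variants follow the same mean ± standard deviation scheme, only with the null count or the edit distance as the measured value.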

6.2.2.2 Tones and Vowel Lengths

We analyzed the importance of tone and vowel length modeling by including and excluding tone and vowel length information in the pronunciation dictionary. Table 6.3 indicates that the best performance is obtained by modeling both (phoneme-based with tones and vowel length).


Dictionary                                   WER (%) on dev
Phoneme-based (tones, vowel length)          22.88
Phoneme-based (no tones, no vowel length)    24.33
Phoneme-based (tones, no vowel length)       23.06
Phoneme-based (no tones, vowel length)       23.15
Grapheme-based                               22.52

Table 6.3: Experiments with Tones and Vowel Lengths.

6.2.3 Speaker Adaptation and System Combination

We experimented with multi-pass decoding strategies and system combination methods. As described in [47], system combination methods are known to lower the word error rate of speech recognition systems. They require the training of systems that are reasonably close in performance but at the same time produce output that differs in its errors. This provides complementary information which leads to performance improvements. We trained speaker-independent (SI) and speaker-adaptive (SA) systems and obtained the necessary varying systems with grapheme- and phoneme-based systems. Our experiments with a confusion network combination of the different systems resulted in a word error rate of 13.16% on the development set. Figure 6.1 gives an overview of our final system combination with the results of each system. On the test set we obtained a WER of 16.26%.

Figure 6.1: System Combination Results (%) on dev (test) set
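The core idea of combining systems that err differently can be shown with a toy word-level vote. Real confusion network combination also aligns the hypotheses and weights words by posterior probability; this sketch assumes the hypotheses are already aligned word-by-word:

```python
from collections import Counter

def combine(hypotheses):
    """Majority vote per word slot over pre-aligned hypotheses.
    The alignment and posterior weighting of a real confusion
    network combination are deliberately omitted."""
    return [Counter(words).most_common(1)[0][0]
            for words in zip(*hypotheses)]

# Three systems each get a different single word wrong;
# voting recovers the correct sentence.
hyps = [
    ["ina", "son", "gida"],
    ["ina", "so",  "gida"],
    ["ina", "son", "kida"],
]
print(combine(hyps))
```

This is why the combined result (13.16% WER) beats every individual SI/SA, grapheme- or phoneme-based system: each system's isolated errors are outvoted by the others.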


7. Conclusion and Future Work

With an increasing number of applications using speech and language processing technologies, the performance of those applications has significantly improved in the last few years. In most cases, these technologies are built on systems using statistical models which need large corpora of resources to be trained efficiently. However, large resources are available only for a small number of the world's languages. For under-resourced languages such as African languages, collecting the necessary resources is a major obstacle to building a speech processing system. Besides that, the rapid portability of speech processing systems to new languages involves significant human effort and high cost. To address this issue, a web-based platform called the Rapid Language Adaptation Toolkit (RLAT) was developed at the Cognitive Systems Lab (CSL) of the Karlsruhe Institute of Technology. RLAT significantly reduces the amount of time and effort involved in building speech processing systems for new languages [16][50]. We extended the language-specific components of RLAT to the Hausa language. Then we investigated and developed an LVCSR system for the Hausa language. Hausa is a lingua franca in West Africa spoken by over 25 million speakers. We collected almost 9 hours of speech from 102 Hausa speakers reading newspaper articles. For language modeling, we collected a text corpus of roughly 8M words. After a rapid bootstrapping based on a multilingual phone inventory using RLAT, we improved the performance by carefully investigating the peculiarities of Hausa. Modeling tones and vowel lengths performs better than omitting tone or vowel length information. We were able to improve the pronunciation dictionary quality with methods to filter erroneous word-pronunciation pairs. The initial recognition performance of 23.49% WER was improved to 13.16% on the development set and 16.26% on the test set.
A grapheme-based system, which avoids building a phonetic pronunciation dictionary, was also investigated in this work. The grapheme-based system achieved a WER of 22.52%; notably, it even performs slightly better than the system using the pronunciation dictionary with tone and vowel length information. Future work may concentrate on improving our pronunciation filtering methods and enhancing the LM with online newspapers in Ajami.



Bibliography

[1] English Wikipedia - Cameroon.

[2] English Wikipedia - Hausa language.

[3] English Wikipedia - Languages of Cameroon.

[4] Ethnologue - Hausa.

[5] Ethnologue - Languages of Cameroon.

[6] GlobalPhone website.

[7] The UCLA Phonetics Lab Archive. Los Angeles, CA: UCLA Department of Linguistics, 2007.

[8] Abate, Solomon Teferra and Wolfgang Menzel: Automatic speech recognition for an under-resourced language - Amharic. In Interspeech, pages 1541–1544. ISCA, 2007.

[9] Adegbola, Tunde: Building Capacities in Human Language Technology for African Languages. 2009.

[10] International Phonetic Association: Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge University Press, 1999.

[11] Badenhorst, Jaco, Charl Heerden, Marelie Davel and Etienne Barnard: Collecting and evaluating speech recognition corpora for 11 South African languages. Lang. Resour. Eval., 45(3):289–309, 2011.

[12] Bellegarda, Jerome R.: Statistical language model adaptation: review and perspectives. Speech Communication, 42:93–108, 2004.

[13] Besling, S.: Heuristical and Statistical Methods for Grapheme-to-Phoneme Conversion. In Konvens, 1994.

[14] Bisani, M. and H. Ney: Multigram-based Grapheme-to-Phoneme Conversion for LVCSR. In Proc. Eurospeech, pages 933–936, 2003.

[15] Bisani, M. and H. Ney: Joint-Sequence Models for Grapheme-to-Phoneme Conversion. Speech Communication, 2008.

[16] Black, A. W. and T. Schultz: Rapid Language Adaptation Tools and Technologies for Multilingual Speech Processing. In ICASSP, Las Vegas, USA, 2008.


[17] Black, Alan W., Kevin Lenzo and Vincent Pagel: Issues in Building General Letter to Sound Rules. pages 77–80, 1998.

[18] Burquest, Donald A.: An Introduction to the Use of Aspect in Hausa Narrative. In Language in context: Essays for Robert E. Longacre, Shin Ja J. Hwang and William R. Merrifield (eds.), pages 393–417, 1992.

[19] Byrne, W., V. Venkataramani, T. Kamm, F. Zheng, Z. Song, P. Fung, Y. Liu and U. Ruhi: Automatic generation of pronunciation lexicons for Mandarin spontaneous speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2001.

[20] Davel, M. and E. Barnard: The Efficient Generation of Pronunciation Dictionaries: Human Factors during Bootstrapping. In ICSLP, 2004.

[21] Davel, M. and E. Barnard: Developing Consistent Pronunciation Models for Phonemic Variants. In Interspeech, 2006.

[22] Davel, M. and O. Martirosian: Pronunciation Dictionary Development in Resource-Scarce Environments. In Interspeech, 2009.

[23] Finke, Michael, Petra Geutner, Hermann Hild, Thomas Kemp, Klaus Ries and Martin Westphal: The Karlsruhe-Verbmobil Speech Recognition Engine, 1997.

[24] Ghoshal, A., M. Jansche, S. Khudanpur, M. Riley and M. Ulinski: Web-derived Pronunciations. In ICASSP, 2009.

[25] Heine, B. and D. Nurse: African languages: an introduction. Cambridge University Press, 2000.

[26] Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738–1752, 1990.

[27] Hodge, Carleton T. and Ibrahim Umaru: Hausa basic course. 1963.

[28] Huang, Xuedong, Alex Acero and Hsiao-Wuen Hon: Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 2001.

[29] Kaplan, R. M. and M. Kay: Regular Models of Phonological Rule Systems. In Computational Linguistics, 1994.

[30] Kominek, J.: TTS From Zero - Building Synthetic Voices for New Languages. Doctoral Thesis, 2009.

[31] Kominek, J. and A. W. Black: Learning Pronunciation Dictionaries: Language Complexity and Word Selection Strategies. In HLT Conference of the NAACL, pages 232–239, 2006.

[32] Koslow, P.: Hausaland: The Fortress Kingdoms. The Kingdoms of Africa. Chelsea House Publishers, 1995.


[33] Le, Viet Bac and Laurent Besacier: First steps in fast acoustic modeling for a new target language: application to Vietnamese, 2005.

[34] Le, Viet-Bac and Laurent Besacier: Automatic speech recognition for under-resourced languages: application to Vietnamese language. Trans. Audio, Speech and Lang. Proc., 17(8):1471–1482, November 2009.

[35] Llitjós, A. F. and A. W. Black: Evaluation and Collection of Proper Name Pronunciations Online. In LREC, 2002.

[36] Martirosian, O. and M. Davel: Error Analysis of a Public Domain Pronunciation Dictionary. In PRASA, pages 13–18, 2007.

[37] Meyer, Charles F.: Introducing English Linguistics. Cambridge University Press, 2009.

[38] Nimaan, Abdillahi, Pascal Nocera and Jean-François Bonastre: Automatic transcription of Somali language. In INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21, 2006. ISCA, 2006.

[39] Pauw, Guy, Gilles-Maurice Schryver, Laurette Pretorius and Lori Levin: Introduction to the special issue on African Language Technology. Lang. Resour. Eval., 45:263–269, 2011.

[40] Rabiner, Lawrence R.: A tutorial on hidden Markov models and selected applications in speech recognition. In Readings in speech recognition, pages 267–296. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.

[41] Rosenfeld, Ronald: Two Decades of Statistical Language Modeling: Where Do We Go from Here? Proceedings of the IEEE, 88(8), 2000.

[42] Schlippe, T., S. Ochs and T. Schultz: Wiktionary as a Source for Automatic Pronunciation Extraction. In Interspeech, 2010.

[43] Schultz, T.: GlobalPhone: A Multilingual Speech and Text Database Developed at Karlsruhe University. In ICSLP, 2002.

[44] Schultz, T., A. W. Black, S. Badaskar, M. Hornyak and J. Kominek: SPICE: Web-based Tools for Rapid Language Adaptation in Speech Processing Systems. In Interspeech, 2007.

[45] Schultz, T. and A. Waibel: Language Independent and Language Adaptive Acoustic Modeling for Speech Recognition. Speech Commun., 35, 2001.

[46] Schultz, T. and K. Kirchhoff: Multilingual Speech Processing. Academic Press, 2006.

[47] Stueker, Sebastian, Christian Fuegen, Susanne Burger and Matthias Woelfel: Cross-System Adaptation and Combination for Continuous Speech Recognition: The Influence of Phoneme Set and Acoustic Front-End. In Interspeech 2006 - ICSLP.


[48] Tomokiyo, Laura Mayfield and Susanne Burger: Eliciting Natural Speech from Non-Native Users: Collecting Speech Data for LVCSR. In Proceedings of the ACL-IALL Joint Workshop on Computer Mediated Language Assessment and Evaluation in NLP, 1999.

[49] Viterbi, A. J.: Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Transactions on Information Theory, pages 260–269, 1967.

[50] Vu, Ngoc Thang, Franziska Kraus and Tanja Schultz: Rapid Bootstrapping of five Eastern European Languages using the Rapid Language Adaptation Toolkit, 2010.

[51] Vu, Ngoc Thang and Tanja Schultz: Vietnamese Large Vocabulary Continuous Speech Recognition. In Automatic Speech Recognition and Understanding (ASRU 2009), December 2009.

[52] Wells, J.C.: SAMPA computer readable phonetic alphabet. In Gibbon, D., Moore, R. and Winski, R. (eds.): Handbook of Standards and Resources for Spoken Language Systems, Part IV, Section B. Mouton de Gruyter, Berlin and New York, 1997.

[53] Wolfel, M.: Channel selection by class separability measures for automatic transcriptions on distant microphones. In Interspeech, 2007.

[54] Wolff, M., M. Eichner and R. Hoffmann: Measuring the Quality of Pronunciation Dictionaries. In PMLA, 2002.

[55] Young, S.: Large Vocabulary Continuous Speech Recognition: a Review. Cambridge University Engineering Department, Cambridge, 1996.

[56] Zhu, X. and R. Rosenfeld: Improving Trigram Language Modeling with the World Wide Web. In ICASSP, 2001.