
SPEECH CORPORA OF UNDER RESOURCED LANGUAGES OF NORTH-EAST INDIA

Barsha Deka ∗, Joyshree Chakraborty ∗, Abhishek Dey ∗, Shikhamoni Nath ∗∗, Priyankoo Sarmah ∗∗, S.R. Nirmala ∗, and Samudra Vijaya ∗∗

∗ Department of Electronics and Communication Engineering, GUIST, Gauhati University-781014, India

∗∗ Center for Linguistic Science and Technology, Indian Institute of Technology Guwahati, Guwahati-781039, India

ABSTRACT

In this paper, we present an account of an ongoing effort to create speech corpora of under-resourced languages of North-East India, namely Assamese, Bengali and Nepali. The speech corpora are being created for the development of an Automatic Speech Recognition system in Assamese as well as a Language Identification system. The Assamese text corpus comprises 1000 sentences collected from different sources such as story books, novels and proverbs. Speech data are recorded over a telephone channel using an interactive voice response system. Speakers were asked to read one or more sets of sentences, each set containing 20 sentences. Speech was simultaneously recorded using a hand-held audio recorder. While a significant amount of speech data has been collected for Assamese, collection has only begun for Bengali, Nepali and English spoken by native speakers of these 3 languages. Currently, the Assamese speech database contains more than 5000 utterances by 27 native speakers. Information about the speakers, such as dialect, gender and age group, was also collected. We discuss the methodology used in collecting speech samples and present descriptive statistics of the speech corpora.

Index Terms: Speech Corpora, Assamese, Bengali, Nepali, English.

1. INTRODUCTION

Automatic speech recognition (ASR) is the process of deriving the transcription of a spoken utterance by a machine. During the past decade, ASR for under-resourced languages has received much attention in the speech research community [1, 2]. The term under-resourced languages, as introduced by Berment [3], refers to a language that has some (if not all) of the following characteristics: lack of a unique writing system, lack of linguistic expertise, limited resources on the web, and limited electronic resources for speech and language technologies. There are over 7,000 languages in the world [4], and only a small fraction offer the resources required for implementing Human Language Technologies.

Fig. 1. Geographical locations of endangered languages in India [5].

Fig. 1 shows the geographical locations of the languages of India that have been marked as endangered by UNESCO [5]. The majority of these endangered languages are spoken in the North-Eastern part of India. In this paper, we report the creation of speech corpora in Assamese, Bengali, Nepali and English. Assamese, Bengali and Nepali are Indo-Aryan languages spoken primarily in the states of North-East India. Assamese, the official language of Assam, is also spoken in parts of Arunachal Pradesh and other North-Eastern states of India, and has over 15 million native speakers. Bengali is spoken in various parts of the Indian subcontinent. It is the official language of West Bengal and is also widely spoken in Tripura and Assam. Nepali is the official language of Nepal and is widely spoken in North-East India, especially in Sikkim. English is the medium of instruction in most colleges in India. It is also an official language of the Government of India and of a few Indian states. English is therefore spoken by a significant number of Indians and is sometimes used for communication between persons whose mother tongues differ. So, in addition to the three languages of North-East India, we recorded spoken English from literate Indians. Thus, written and spoken language resources are collected for four languages: Assamese, Bengali, Nepali and English.

Oriental COCOSDA 2018 7-8 May 2018, Miyazaki, Japan

The remainder of the paper is organized as follows. Section 2 discusses the methodology used for collecting the speech database. Section 3 presents a statistical analysis of the reported database. Finally, Section 4 summarizes the work.

2. SPEECH DATA COLLECTION

This section discusses the methodology used in the creation of the Assamese, Bengali, Nepali and English speech corpora. Firstly, the details of the text corpora of the four languages are discussed. Secondly, the procedure of speech data recording is presented. Finally, the convention for naming speech files is described.

2.1. Text corpora

The text corpora reported in this paper comprise text in four languages, namely Assamese, Bengali, Nepali and English. In this preliminary effort, sentences were collected from different sources such as articles, story books and online newspapers. In an ongoing effort, sentences are being selected to enhance phonetic richness, i.e., to increase the frequency of rare triphones. In addition to the sentences, proverbs and digit sequences were added to the text corpora. The sentences are segmented such that their length lies between 5 and 10 words. The details of the text corpora used to create speech data in the four languages are given below.
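The two constraints mentioned above (a length of 5 to 10 words, and attention to rare triphones) can be illustrated with a minimal sketch. It is not the authors' tooling: the function names are hypothetical, words are assumed to be separated by whitespace, and each sentence is assumed to be already available as a sequence of phone labels.

```python
from collections import Counter


def within_length(sentence, min_words=5, max_words=10):
    """Return True if the sentence has between min_words and max_words words.

    Assumption: words are delimited by whitespace.
    """
    return min_words <= len(sentence.split()) <= max_words


def unique_sentences_in_range(sentences):
    """Keep the first occurrence of each sentence that satisfies the length rule."""
    seen = set()
    kept = []
    for sentence in sentences:
        normalized = " ".join(sentence.split())  # collapse repeated whitespace
        if normalized and normalized not in seen and within_length(normalized):
            seen.add(normalized)
            kept.append(normalized)
    return kept


def triphone_counts(phone_sequences):
    """Count every run of three consecutive phones across all sequences."""
    counts = Counter()
    for phones in phone_sequences:
        counts.update(zip(phones, phones[1:], phones[2:]))
    return counts
```

Triphones with low counts in such a tally would be the candidates to boost when choosing additional sentences.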

2.1.1. Assamese

The Assamese text corpus contains 1000 unique sentences with a vocabulary of 2777 unique words. The distribution of the sources of the sentences is shown in Fig. 2. The 1000 sentences are arranged in 50 different sets of 20 sentences each. Each set comprises 1 digit sequence, 4 proverbs and sentences randomly selected from various sources.

2.1.2. Bengali

The Bengali text corpus comprises 400 unique sentences. The sentences are arranged in 20 different sets. Each set consists of 20 sentences, of which 4 are sourced from poems, 5 from news, 5 from travel (tourism) blogs and 2 from miscellaneous topics; each set also contains 2 Bengali digit sequences and 2 proverbs.

2.1.3. Nepali

The Nepali text corpus comprises 400 unique sentences. The sentences are selected to make 20 different sets. Each set consists of 20 sentences, of which 2 are Nepali digit sequences, 2 are proverbs and the remaining 16 are from poems, news and other miscellaneous topics.

Fig. 2. Distribution of the Assamese sentences from different sources.

2.1.4. English

The English text corpus consists of 400 unique sentences grouped into 20 sets. The sentences in each set are selected randomly from online news, poems and story books, together with digit sequences.
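As a quick consistency check on the figures in Section 2.1, the sketch below recomputes the set size for each language and confirms that the stated per-set breakdowns add up to 20. The dictionary layout and names are illustrative only; the numbers are those reported above.

```python
# (total sentences, number of sets) per language, as reported in Section 2.1
CORPUS_SIZE = {
    "Assamese": (1000, 50),
    "Bengali": (400, 20),
    "Nepali": (400, 20),
    "English": (400, 20),
}

# Per-set breakdown where the text states one explicitly
SET_BREAKDOWN = {
    "Bengali": {
        "poems": 4,
        "news": 5,
        "travel blogs": 5,
        "miscellaneous": 2,
        "digit sequences": 2,
        "proverbs": 2,
    },
    "Nepali": {
        "digit sequences": 2,
        "proverbs": 2,
        "poems, news and miscellaneous": 16,
    },
}


def sentences_per_set(language):
    """Return the number of sentences in each set for the given language."""
    total, n_sets = CORPUS_SIZE[language]
    if total % n_sets != 0:
        raise ValueError(f"{language}: {total} sentences do not divide into {n_sets} sets")
    return total // n_sets


# Assamese: 1 digit sequence and 4 proverbs per set; the rest come from other sources
assamese_other = sentences_per_set("Assamese") - (1 + 4)

for language, parts in SET_BREAKDOWN.items():
    assert sum(parts.values()) == sentences_per_set(language)
```

Under these figures every language uses sets of 20 sentences, and each Assamese set holds 15 sentences beyond its digit sequence and proverbs.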

2.2. Data Recording Methodology

In this work, the speech data was recorded simultaneously over two channels: narrowband and wideband. An interactive voice response (IVR) system was specifically designed for recording speech over the mobile telephony channel. A hand-held recorder was used to record the wideband speech while the speakers were talking on their mobile phones. The speakers were provided with data sheets that contained the sentences to be read. They were asked to call a centralized voice server from their own mobile phones and respond to the predefined prompts. The voice server runs Asterisk software [6] and provides callers with a toll-free number for recording the speech data.

Fig. 3 shows the flow diagram of the IVR-based data recording system (IVRS). Information about the speaker is collected in DTMF mode before the speech data is recorded. When a person calls the IVRS, (s)he is greeted with a welcome message. First, the system prompts the caller to enter a single-digit gender identification number: 0 for male and 1 for female. Secondly, the system prompts the caller to enter a digit denoting his/her age group (0/1/2). During adolescence, voice quality changes significantly. It is easier to obtain voice samples from adults below the age of 30 years. Based on these observations, three age groups were predefined as:

• 0 (Junior) : Speakers below 15 years of age

• 1 (Adult) : Speakers whose age is greater than 15 years but less than 30 years

• 2 (Senior) : Speakers older than 30 years

Fig. 3. Flow diagram of the data-recording IVRS.

The labels "Junior/Adult/Senior" were assigned to facilitate the construction of a self-explanatory file name that contains a single letter (J/A/S) indicating the age group of the speaker. Thirdly, the system prompts the caller to type his/her four-digit speaker identification number, followed by the four-digit Id of the first sentence to be read. Then, the speaker is prompted to utter his/her mother tongue; a volunteer later listens to this recording and notes down the caller's mother tongue as part of the metadata. The speaker is then instructed to read the first sentence after a beep sound. After the recording is over, the caller is asked to read the next sentence. This process is repeated until the speaker has read out 20 sentences in one session. The recorded speech files are stored in speaker-dependent directories. The recording time for capturing the response to each sentence is set to 8 seconds. A log file is generated that captures all the metadata entered by the caller. In addition, another log file is generated that keeps track of the mapping between the speaker identification number and the mobile number of the caller.
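The DTMF prompts described above can be summarized as a small validation step. The following is a sketch under stated assumptions, not the authors' Asterisk configuration: the class and function names are hypothetical, and the keyed-in digits are assumed to arrive as strings. Only the code values (0/1 for gender, 0/1/2 for age group, four-digit identifiers) come from the text.

```python
from dataclasses import dataclass

# DTMF digit -> letter later used in the file name
GENDER_CODES = {"0": "M", "1": "F"}
AGE_GROUP_CODES = {"0": "J", "1": "A", "2": "S"}  # Junior / Adult / Senior


@dataclass(frozen=True)
class CallerMetadata:
    gender: str
    age_group: str
    speaker_id: str
    first_sentence_id: str


def _four_digits(value, field_name):
    """Return value unchanged if it is exactly four ASCII digits, else raise."""
    if len(value) != 4 or not (value.isascii() and value.isdigit()):
        raise ValueError(f"{field_name} must be exactly four digits, got {value!r}")
    return value


def parse_caller_input(gender_key, age_key, speaker_id, first_sentence_id):
    """Validate the four keyed-in responses and map them to metadata."""
    if gender_key not in GENDER_CODES:
        raise ValueError(f"gender must be 0 or 1, got {gender_key!r}")
    if age_key not in AGE_GROUP_CODES:
        raise ValueError(f"age group must be 0, 1 or 2, got {age_key!r}")
    return CallerMetadata(
        gender=GENDER_CODES[gender_key],
        age_group=AGE_GROUP_CODES[age_key],
        speaker_id=_four_digits(speaker_id, "speaker id"),
        first_sentence_id=_four_digits(first_sentence_id, "sentence id"),
    )
```

Keeping the identifiers as strings preserves leading zeros, which matter when the values are later embedded in a file name.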

2.3. File Naming Convention

The long-term goal of this effort is to enable automatic retrieval of linguistic, para-linguistic and non-linguistic information from speech data. Supervised training of such systems using scripts is easier if the file name is self-explanatory, i.e., if it contains as much metadata as possible. Accordingly, we followed a file naming convention used by a consortium of Indian institutes that developed speech recognition systems in Indian languages [7]. The file name is constructed as a sequence of codes, encoded alternately as Roman letters and numerals so that it can be parsed unambiguously by a computer program.

The names of speech files follow a pattern that captures eight different types of information, as described below.

(1) Single letter denoting the gender of the speaker (M for Male / F for Female)

(2) Two-digit language code (01 for Assamese, 02 for Bengali, 11 for Nepali, 00 for English)

(3) Two-letter state code (AS for Assam)

(4) Two-digit district code

(5) Single letter (J/A/S) indicating the age group

(6) Four-digit speaker id (0000-9999)

(7) Single letter indicating the type of speech recorded (C for Continuous speech)

(8) Four-digit serial number of the sentence read by the speaker (0000-9999).

An exemplar file name is "F01AS01A0001C0001.wav".
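Because letters and digits alternate, the eight fields listed above can be recovered with a single regular expression. The sketch below is illustrative rather than the authors' script: the function name and the returned dictionary keys are hypothetical, and accepting any uppercase letter for the state code and the speech type is an assumption (the text lists only AS and C).

```python
import re

# Language codes as listed in item (2)
LANGUAGES = {"01": "Assamese", "02": "Bengali", "11": "Nepali", "00": "English"}

FILE_NAME = re.compile(
    r"(?P<gender>[MF])"
    r"(?P<language>[0-9]{2})"
    r"(?P<state>[A-Z]{2})"        # assumption: any two uppercase letters
    r"(?P<district>[0-9]{2})"
    r"(?P<age_group>[JAS])"
    r"(?P<speaker_id>[0-9]{4})"
    r"(?P<speech_type>[A-Z])"     # assumption: other types besides C may exist
    r"(?P<sentence_id>[0-9]{4})"
    r"\.wav"
)


def parse_file_name(name):
    """Split a speech file name into its eight metadata fields."""
    match = FILE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {name!r}")
    fields = match.groupdict()
    code = fields["language"]
    if code not in LANGUAGES:
        raise ValueError(f"unknown language code {code!r} in {name!r}")
    fields["language"] = LANGUAGES[code]
    return fields


info = parse_file_name("F01AS01A0001C0001.wav")
print(info["gender"], info["language"], info["age_group"], info["speaker_id"])
```

For the exemplar name, the fields decode to a female adult speaker of Assamese from Assam, district 01, speaker 0001, reading continuous-speech sentence 0001.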

3. STATISTICAL ANALYSIS OF THE SPEECH CORPORA

This section gives the statistics of the speech corpora collected so far.

3.1. Phonetic composition

This subsection briefly discusses the phonetic composition of the Assamese database as an exemplar. A similar analysis was done for the other languages as well.

The Assamese speech database consists of 35 unique phonetic units, of which 8 are vowels (including diphthongs) and 27 are consonants. The representation of the phonetic units in 3 different forms is shown in Fig. 4. The first column shows the Assamese character(s) of the 35 units. The second column lists the corresponding International Phonetic Alphabet (IPA) symbols [8]. The third column shows the labels designed for computer processing of phonetic transcriptions of speech files. These labels are called Indian Language Sound Labels (ILSL12) [9]. Pronunciation dictionaries for all the languages are being created following the ILSL12 convention. The Assamese speech database has 67899 occurrences of vowels and 107783 occurrences of consonants. The frequency distributions of vowels and consonants in the Assamese database are shown in Fig. 5 and Fig. 6, respectively.
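The counts above imply a vowel-to-consonant balance that is easy to derive. The following short calculation uses only the figures reported in this subsection; the variable names are illustrative.

```python
# Figures reported for the Assamese database
N_VOWEL_UNITS = 8
N_CONSONANT_UNITS = 27
VOWEL_OCCURRENCES = 67899
CONSONANT_OCCURRENCES = 107783

total_units = N_VOWEL_UNITS + N_CONSONANT_UNITS
total_occurrences = VOWEL_OCCURRENCES + CONSONANT_OCCURRENCES

# Fraction of all phone occurrences that are vowels
vowel_share = VOWEL_OCCURRENCES / total_occurrences

# Average number of occurrences per phonetic unit in each class
mean_per_vowel = VOWEL_OCCURRENCES / N_VOWEL_UNITS
mean_per_consonant = CONSONANT_OCCURRENCES / N_CONSONANT_UNITS

print(f"{total_units} units, {total_occurrences} occurrences")
print(f"vowel share: {vowel_share:.1%}")
```

Vowels account for a little under two fifths of all phone occurrences, and because there are far fewer vowel units than consonant units, each vowel occurs on average about twice as often as each consonant.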


Fig. 4. Representation of the Assamese phonetic units in 3 notations: Assamese script, IPA [8] and ILSL12 [9].

3.2. Speaker Information

In this subsection, we present statistics on the persons whose speech samples were used to create the speech corpora. The speech data for all four languages (Assamese, Bengali, Nepali and English) were recorded from speakers residing in the city of Guwahati. Since Guwahati is the largest city of Assam and of the North-Eastern region of India, people from different parts of the region, speaking various languages, reside there. Table 1 contains information about the gender and mother tongue of the speakers associated with all 4 databases.

3.2.1. Assamese speech database

The Assamese speech data was collected from 27 speakers. The mother tongue of 3 of these 27 speakers is Bengali; however, these speakers speak fluent Assamese. Considering the influence of dialectal variation from one region to another, the dialect of each speaker was also noted while recording the speech data. The 27 speakers hail from 3 broad dialectal regions of Assam, namely upper Assam, lower Assam and central Assam. The dialect-wise distribution of the speakers is shown in Fig. 7. It can be seen that the majority of the speakers belong to the central Assam dialectal region, followed by lower Assam and upper Assam. The acoustic characteristics of speech sounds depend on the age of the speaker. The breakup of speakers across the 3 age groups (Junior, Adult and Senior) is shown in Fig. 8.

Fig. 5. Frequency of occurrence of vowels in the Assamese database.

Fig. 6. Frequency of occurrence of consonants in the Assamese database.

3.2.2. Other language speech databases

Speech data collection has begun for the other 3 languages: Bengali, Nepali and English. The Bengali speech data was recorded from 21 Bengali speakers residing in the city of Guwahati. Of the 21 speakers, 17 belong to the senior age group while 4 belong to the adult group. Similarly, Nepali speech data was recorded from native speakers of Nepali. It was collected from 6 speakers, 5 of whom were more than 30 years old, while the remaining speaker was less than 30 years old. Speech data for English was collected from 6 Assamese and 10 Bengali speakers who could read and speak English. Of these 16 speakers, 3 belonged to the senior age group, 11 to the adult group and 2 to the junior group.

Table 1. Gender-wise distribution of speakers of the four language databases.

Language   Mother tongue   Male   Female   Total
Assamese   Assamese        14     10       24
Assamese   Bengali         0      3        3
Bengali    Bengali         8      13       21
Nepali     Nepali          4      2        6
English    Assamese        2      4        6
English    Bengali         2      8        10

Fig. 7. Pie chart depicting the distribution of the dialectal regions of speakers associated with the Assamese speech corpus.

3.3. Speech Database Composition

In this section, we discuss the composition of the recorded speech files of the speech corpora.

3.3.1. Assamese

The Assamese text corpus comprises 50 sentence sets. Each speaker was asked to read one or more sentence sets selected randomly from the 50 sets. Speakers took part in this exercise voluntarily; no honorarium was given to any speaker. Consequently, different speakers read different numbers of sets, according to their convenience. For instance, speaker 1 read 10 sets comprising 200 utterances (20 sentences in each set), while speaker 6 read 500 sentences from 25 different sets. Currently, the Assamese speech database contains a total of 5658 speech data files spoken by the 27 speakers. There is a plan to offer an honorarium to speakers from the funds of a sponsored project. Speech data will then be collected from a larger number of speakers such that each sentence of the phonetically rich text corpus is covered nearly equally in the speech corpus.

Fig. 8. Age-group-wise breakup of the speakers associated with the Assamese speech corpus.

3.3.2. Bengali, Nepali and English

In addition to the Assamese database, the speech corpora cover 3 more languages: Bengali, Nepali and English. The Bengali database consists of 2500 speech data files collected from the 21 speakers. The current Nepali and English speech databases comprise 660 and 2500 speech files collected from 6 and 16 speakers, respectively.
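The file and speaker counts reported in Section 3.3 can be combined into a few derived figures. The sketch below is a back-of-the-envelope calculation, not a reported result: it assumes one file per utterance per channel, and it treats the 8-second recording window from Section 2.2 as an upper bound on the length of each telephone-channel file.

```python
# (speech files, speakers) per database, as reported in Sections 3.2 and 3.3
DATABASES = {
    "Assamese": (5658, 27),
    "Bengali": (2500, 21),
    "Nepali": (660, 6),
    "English": (2500, 16),
}

MAX_SECONDS_PER_FILE = 8  # recording window per sentence (Section 2.2)

files_per_speaker = {
    language: files / speakers for language, (files, speakers) in DATABASES.items()
}
total_files = sum(files for files, _ in DATABASES.values())

# Upper bound only: actual utterances may be shorter than the window
max_hours = total_files * MAX_SECONDS_PER_FILE / 3600

for language, mean in files_per_speaker.items():
    print(f"{language}: {mean:.1f} files per speaker on average")
print(f"total files: {total_files}, at most {max_hours:.1f} hours")
```

The averages make the imbalance visible: Assamese speakers contributed roughly ten sets of 20 sentences each on average, while speakers of the other three languages contributed fewer.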

4. CONCLUSION

In this paper, we presented an account of an ongoing effort to create speech corpora in the state of Assam, in the North-Eastern part of India. In this work, speech data was collected for three under-resourced languages, namely Assamese, Bengali and Nepali, along with English spoken by Indians of this region. These speech corpora will be used for the development of Automatic Speech Recognition and Automatic Language Identification systems for these languages.

5. ACKNOWLEDGEMENT

The authors would like to thank all the speakers who spared their valuable time to contribute speech data for the creation of the speech corpora.


6. REFERENCES

[1] Laurent Besacier, Etienne Barnard, Alexey Karpov, and Tanja Schultz, “Automatic speech recognition for under-resourced languages: A survey,” Speech Commun., vol. 56, pp. 85–100, Jan. 2014.

[2] Mark J. F. Gales, Kate M. Knill, Anton Ragni, and Shakti P. Rath, “Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED,” in 4th Workshop on Spoken Language Technologies for Under-resourced Languages, SLTU 2014, St. Petersburg, Russia, May 14-16, 2014, pp. 16–23.

[3] Vincent Berment, Methods to computerize “little equipped” languages and groups of languages, Theses, Universite Joseph-Fourier - Grenoble I, May 2004, https://tel.archives-ouvertes.fr/tel-00006313.

[4] “Ethnologue, Languages of the World”, https://www.ethnologue.com.

[5] Christopher Moseley, “UNESCO Atlas of the World’s Languages in Danger”, Online version: http://www.unesco.org/languages-atlas/index.php.

[6] Asterisk, Open Source Communication Software, http://www.asterisk.org/.

[7] Speech-Based Automated Commodity Price Helpline in Six Indian Languages, http://asrmandi.wixsite.com/asrmandi.

[8] International Phonetic Association, Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet, Cambridge University Press, 1999.

[9] Indian Language Speech sound Label set (ILSL12), https://www.iitm.ac.in/donlab/tts/downloads/cls/cls v2.1.6.pdf.
